TokSuite

AI & ML interests

Tokenization, Robustness, LLMs

TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers. Alongside the models, a set of benchmarks evaluates performance under real-world perturbations that affect tokenization.
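To illustrate what a "perturbation that affects tokenization" can look like, here is a minimal sketch in Python. It is not TokSuite's code and does not use any of its tokenizers: the greedy longest-match function, the `TOY_VOCAB` vocabulary, and the example words are all hypothetical and chosen only for illustration. It shows how a single dropped character can turn a two-token word into a run of single-character tokens.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation over a fixed vocabulary.

    Characters that start no vocabulary entry fall back to
    single-character tokens. This is a toy stand-in, not a real
    BPE or unigram tokenizer.
    """
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary (an assumption, not taken from TokSuite).
TOY_VOCAB = {"token", "ization", "izer", "the", "s", " "}

clean_tokens = greedy_tokenize("tokenization", TOY_VOCAB)
noisy_tokens = greedy_tokenize("tokenizaton", TOY_VOCAB)  # one dropped "i"

print(clean_tokens)  # ['token', 'ization']
print(noisy_tokens)  # ['token', 'i', 'z', 'a', 't', 'o', 'n']
```

In this toy setting the typo more than triples the token count for the same word. A different vocabulary would break the word up differently, which is the kind of tokenizer-dependent effect that a controlled comparison across otherwise identical models can help isolate.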

Our code is available at https://github.com/r-three/Tokenizers.