---
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
---

# UnitronX-Tokenizer-32k-v1

**UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual text (ru/ar/de/es/fr/it/cs/hr/sr), and code.
It enforces safe merge boundaries (script changes, ZWJ, letter↔digit transitions), preserves code identifiers, and uses
placeholder tokens for URLs/emails/paths/hashes/UUIDs/handles/hashtags.

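The letter↔digit boundary rule above means a BPE merge is never allowed to cross the seam between a run of letters and a run of digits. A minimal stdlib sketch of that splitting step (the real tokenizer enforces this inside its pre-tokenizer; the function name here is illustrative):

```python
import re

# Zero-width boundaries: a letter/underscore followed by a digit, or a digit
# followed by a letter/underscore. [^\W\d] matches word characters that are
# not digits, so it covers non-ASCII letters too.
_LETTER_DIGIT = re.compile(r"(?<=[^\W\d])(?=\d)|(?<=\d)(?=[^\W\d])")

def split_letter_digit(text: str) -> list[str]:
    """Return the chunks that a merge must never cross."""
    return _LETTER_DIGIT.split(text)

print(split_letter_digit("fooBar123_id"))  # ['fooBar', '123', '_id']
```

Merges then happen only inside each chunk, so `fooBar` and `123` can never fuse into a single token.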
## Files

- `tokenizer.json`, `merges.txt`, `vocab.json`
- `tokenizer_config.json`, `special_tokens_map.json`
- `meta.json`
- *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)
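The placeholder substitution described above can be sketched as a pre-normalization pass. This is an illustrative stdlib sketch only: the actual placeholder token strings in the UnitronX vocabulary may differ from the `<URL>`/`<EMAIL>`/`<UUID>` names used here, and the real patterns are more thorough.

```python
import re

# Illustrative (pattern, placeholder) pairs; applied in order, URL first so
# that an email embedded in a URL is not matched twice.
PLACEHOLDERS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
]

def normalize_placeholders(text: str) -> str:
    """Replace volatile spans with stable placeholder tokens before BPE."""
    for pattern, token in PLACEHOLDERS:
        text = pattern.sub(token, text)
    return text

print(normalize_placeholders("mail me@example.com or see https://example.org/x"))
# mail <EMAIL> or see <URL>
```

Replacing such spans with single placeholder tokens keeps high-entropy strings from fragmenting into long, wasteful token sequences.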

## Load with Transformers

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))
```