---
library_name: transformers
tags:
- tokenizers
- sglang
license: other
license_name: grok-2
license_link: https://huggingface.co/xai-org/grok-2/blob/main/LICENSE
---

# Grok-2 Tokenizer

A 🤗-compatible version of the **Grok-2 tokenizer** (adapted from [xai-org/grok-2](https://github.com/xai-org/grok-2)).

This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers),
[Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).

## Motivation

Grok 2.5, a.k.a. [xai-org/grok-2](https://github.com/xai-org/grok-2), was recently released on the 🤗 Hub with native
[SGLang](https://github.com/sgl-project/sglang) support, but the checkpoints on the Hub don't ship with a Hugging Face compatible
tokenizer; instead they come with a `tiktoken`-based JSON export, which is
[internally read and patched in SGLang](https://github.com/sgl-project/sglang/blob/fd71b11b1d96d385b09cb79c91a36f1f01293639/python/sglang/srt/tokenizer/tiktoken_tokenizer.py#L29-L108).

This repository contains the Hugging Face compatible export, so that users can easily inspect and experiment with the Grok-2 tokenizer.
It also makes the tokenizer usable from SGLang without having to manually pull the repository from the Hub, mount it, and point
directly to the local tokenizer path, so that Grok-2 can be deployed as:

```bash
python3 -m sglang.launch_server --model xai-org/grok-2 --tokenizer-path alvarobartt/grok-2-tokenizer --tp 8 --quantization fp8 --attention-backend triton
```

Rather than the former two-step process:

```bash
hf download xai-org/grok-2 --local-dir /local/grok-2

python3 -m sglang.launch_server --model /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton
```

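Once the server is up, it can be queried through SGLang's OpenAI-compatible API. The sketch below is an assumption based on SGLang's defaults (port `30000`, `/v1/chat/completions` endpoint) and is not part of this repository; adjust the URL and model name to your deployment:

```python
import json
import urllib.request

# Hypothetical request against a locally deployed SGLang server.
# SGLang exposes an OpenAI-compatible API, by default on port 30000;
# the URL and model name below are assumptions, not verified values.
URL = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "xai-org/grok-2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}


def query_server(url: str = URL) -> str:
    """Send the chat completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(query_server())
```
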
## Example

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")

assert tokenizer.encode("Human: What is Deep Learning?<|separator|>\n\n") == [
    35406,
    186,
    2171,
    458,
    17454,
    14803,
    191,
    1,
    417,
]

assert (
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "What is the capital of France?"}], tokenize=False
    )
    == "Human: What is the capital of France?<|separator|>\n\n"
)
```

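Conceptually, the chat template applied above just prefixes user turns with `Human: ` and terminates them with the `<|separator|>` token plus two newlines. That behavior can be sketched in plain Python (a simplified illustration only; the real template ships inside the tokenizer config):

```python
def format_user_turns(messages: list[dict]) -> str:
    """Simplified sketch of how the Grok-2 chat template renders user
    turns: prefix with "Human: ", terminate with <|separator|> and two
    newlines. Not the actual template shipped with the tokenizer."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"Human: {message['content']}<|separator|>\n\n")
    return "".join(parts)


print(format_user_turns([{"role": "user", "content": "What is the capital of France?"}]))
```

This produces the same string as the `apply_chat_template` call in the example above.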
> [!NOTE]
> This repository was inspired by earlier similar work by [Xenova](https://huggingface.co/Xenova) in [`Xenova/grok-1-tokenizer`](https://huggingface.co/Xenova/grok-1-tokenizer).