---
library_name: transformers
tags:
- tokenizers
- sglang
license: other
license_name: grok-2
license_link: https://huggingface.co/xai-org/grok-2/blob/main/LICENSE
---

# Grok-2 Tokenizer

A 🤗-compatible version of the **Grok-2 tokenizer** (adapted from [xai-org/grok-2](https://github.com/xai-org/grok-2)).

This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers),
[Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).

## Motivation

Grok 2.5, a.k.a. [xai-org/grok-2](https://github.com/xai-org/grok-2), was recently released on the 🤗 Hub with native
[SGLang](https://github.com/sgl-project/sglang) support, but the checkpoints on the Hub don't ship with a Hugging Face compatible
tokenizer; instead they come with a `tiktoken`-based JSON export, which is
[internally read and patched in SGLang](https://github.com/sgl-project/sglang/blob/fd71b11b1d96d385b09cb79c91a36f1f01293639/python/sglang/srt/tokenizer/tiktoken_tokenizer.py#L29-L108).

This repository contains the Hugging Face compatible export, so that users can easily inspect and experiment with the Grok-2 tokenizer.
It also makes the tokenizer usable from SGLang without having to manually pull the repository from the Hub, mount it, and point
directly to the local tokenizer path, so that Grok-2 can be deployed as:

```bash
python3 -m sglang.launch_server --model xai-org/grok-2 --tokenizer-path alvarobartt/grok-2-tokenizer --tp 8 --quantization fp8 --attention-backend triton
```

Rather than the former two-step process:

```bash
hf download xai-org/grok-2 --local-dir /local/grok-2

python3 -m sglang.launch_server --model /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton
```

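Once the server is up, it can be queried through SGLang's OpenAI-compatible API. The sketch below is an assumption based on SGLang's defaults (port `30000`, `/v1/chat/completions` endpoint) and is not part of this repository; adjust the URL and model name to your deployment:

```python
import json
import urllib.request

# Hypothetical request against a locally deployed SGLang server.
# SGLang exposes an OpenAI-compatible API, by default on port 30000;
# the URL and model name below are assumptions, not verified values.
URL = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "xai-org/grok-2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}


def query_server(url: str = URL) -> str:
    """Send the chat completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(query_server())
```
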
## Example

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")

assert tokenizer.encode("Human: What is Deep Learning?<|separator|>\n\n") == [
    35406,
    186,
    2171,
    458,
    17454,
    14803,
    191,
    1,
    417,
]

assert (
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "What is the capital of France?"}], tokenize=False
    )
    == "Human: What is the capital of France?<|separator|>\n\n"
)
```

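Conceptually, the chat template applied above just prefixes user turns with `Human: ` and terminates them with the `<|separator|>` token plus two newlines. That behavior can be sketched in plain Python (a simplified illustration only; the real template ships inside the tokenizer config):

```python
def format_user_turns(messages: list[dict]) -> str:
    """Simplified sketch of how the Grok-2 chat template renders user
    turns: prefix with "Human: ", terminate with <|separator|> and two
    newlines. Not the actual template shipped with the tokenizer."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"Human: {message['content']}<|separator|>\n\n")
    return "".join(parts)


print(format_user_turns([{"role": "user", "content": "What is the capital of France?"}]))
```

This produces the same string as the `apply_chat_template` call in the example above.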
> [!NOTE]
> This repository was inspired by earlier similar work by [Xenova](https://huggingface.co/Xenova) in [`Xenova/grok-1-tokenizer`](https://huggingface.co/Xenova/grok-1-tokenizer).