alvarobartt (HF Staff) committed
Commit 495701d · verified · 1 Parent(s): 798e0f3

Update README.md

Files changed (1)
  1. README.md +68 -5
README.md CHANGED
---
library_name: transformers
tags:
- tokenizers
- sglang
license: other
license_name: grok-2
license_link: https://huggingface.co/xai-org/grok-2/blob/main/LICENSE
---

# Grok-2 Tokenizer

A 🤗-compatible version of the **Grok-2 tokenizer** (adapted from [xai-org/grok-2](https://github.com/xai-org/grok-2)).

This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers),
[Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).

## Motivation

Grok 2.5, a.k.a. [xai-org/grok-2](https://github.com/xai-org/grok-2), was recently released on the 🤗 Hub with native [SGLang](https://github.com/sgl-project/sglang)
support. However, the checkpoints on the Hub do not come with a Hugging Face-compatible tokenizer, but rather with a `tiktoken`-based
JSON export, which is [internally read and patched in SGLang](https://github.com/sgl-project/sglang/blob/fd71b11b1d96d385b09cb79c91a36f1f01293639/python/sglang/srt/tokenizer/tiktoken_tokenizer.py#L29-L108).

This repository contains a Hugging Face-compatible export so that users can easily inspect and experiment with the Grok-2 tokenizer.
It also lets SGLang pull the tokenizer directly from the Hub, without first downloading the model repository and pointing to a local
tokenizer path, so that Grok-2 can be deployed with a single command:

```bash
python3 -m sglang.launch_server --model xai-org/grok-2 --tokenizer-path alvarobartt/grok-2-tokenizer --tp 8 --quantization fp8 --attention-backend triton
```

rather than with the former two-step process:

```bash
hf download xai-org/grok-2 --local-dir /local/grok-2

python3 -m sglang.launch_server --model /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton
```

## Example

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")

assert tokenizer.encode("Human: What is Deep Learning?<|separator|>\n\n") == [
    35406,
    186,
    2171,
    458,
    17454,
    14803,
    191,
    1,
    417,
]

assert (
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "What is the capital of France?"}], tokenize=False
    )
    == "Human: What is the capital of France?<|separator|>\n\n"
)
```
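
For a single user turn, the chat template behaves like plain string formatting. The sketch below is an illustrative reconstruction based only on the assertion above; the `Assistant` role name and the multi-turn concatenation are assumptions, not taken from the actual template shipped in this repository:

```python
# Hypothetical sketch of the single-turn rendering shown above: each user
# message becomes "Human: <content><|separator|>\n\n". The "Assistant" role
# name and the multi-turn behavior are assumptions for illustration only.
def render_chat(messages):
    rendered = ""
    for message in messages:
        role = "Human" if message["role"] == "user" else "Assistant"
        rendered += f"{role}: {message['content']}<|separator|>\n\n"
    return rendered

print(render_chat([{"role": "user", "content": "What is the capital of France?"}]))
# Human: What is the capital of France?<|separator|>
```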

> [!NOTE]
> This repository was inspired by [Xenova](https://huggingface.co/Xenova)'s earlier work on [`Xenova/grok-1-tokenizer`](https://huggingface.co/Xenova/grok-1-tokenizer).