0xDing committed ba3bf20 (verified) · 1 parent: 3c4d717

Update README.md

Files changed (1)
  1. README.md +2 -7
README.md CHANGED
@@ -15,23 +15,18 @@ A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK
  - **Character-Level**
  - Each CJK character is treated as its own token
  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
- - **Hugging Face Integration**
- - Fully compatible with the `transformers` library
- - Supports `from_pretrained`, `save_pretrained`, and standard tokenizer methods
 
  ## Installation
 
-
+ ```python
  from transformers import AutoTokenizer
 
- # Load the tokenizer by repository name or local path
  tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
 
- # Tokenize some CJK text
  text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
  tokens = tokenizer(text)
  print(tokens.tokens)
-
+ ```
 
 
  Tokenizer Details
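
The character-level rule described in the README (each CJK character is its own token, while other scripts follow the original LLaMA subword scheme) can be illustrated with a small, self-contained sketch. This is not the tokenizer's actual implementation. `CJK_RANGES` covers only a few well-known Unicode blocks as an assumption, and `is_cjk` and `split_cjk` are hypothetical helpers that merely isolate CJK characters and leave every other run untouched, where a subword model would take over.

```python
from typing import List

# Assumption: a small subset of CJK Unicode blocks, for illustration only.
# The real tokenizer's vocabulary coverage may be broader.
CJK_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def is_cjk(ch: str) -> bool:
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def split_cjk(text: str) -> List[str]:
    """Emit each CJK character as its own piece; keep other runs intact."""
    pieces: List[str] = []
    run: List[str] = []  # buffer for the current non-CJK run
    for ch in text:
        if is_cjk(ch):
            if run:
                pieces.append("".join(run))
                run = []
            pieces.append(ch)
        else:
            run.append(ch)
    if run:
        pieces.append("".join(run))
    return pieces


print(split_cjk("天地玄黃, hello"))
# ['天', '地', '玄', '黃', ', hello']
```

In this sketch the trailing `', hello'` run is left as one piece. In the tokenizer described above, such non-CJK text would instead be broken up by the LLaMA subword scheme.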