AgentBull/CJK-Tokenizer

0xDing committed on Apr 20, 2025

Commit ba3bf20 · verified · 1 Parent(s): 3c4d717

Update README.md

Files changed (1)

README.md +2 -7

README.md CHANGED

@@ -15,23 +15,18 @@ A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK
 - **Character-Level**
   - Each CJK character is treated as its own token
   - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
-- **Hugging Face Integration**
-  - Fully compatible with the `transformers` library
-  - Supports `from_pretrained`, `save_pretrained`, and standard tokenizer methods
 ## Installation
 from transformers import AutoTokenizer
-# Load the tokenizer by repository name or local path
 tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
-# Tokenize some CJK text
 text = "天地玄黃， 宇宙洪荒。 日月盈昃， 辰宿列張。"
 tokens = tokenizer(text)
 print(tokens.tokens)
 Tokenizer Details

 - **Character-Level**
   - Each CJK character is treated as its own token
   - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
 ## Installation
+```python
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
 text = "天地玄黃， 宇宙洪荒。 日月盈昃， 辰宿列張。"
 tokens = tokenizer(text)
 print(tokens.tokens)
+```
 Tokenizer Details
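
The "Character-Level" bullets in the README say that each CJK character becomes its own token while other scripts keep the original LLaMA subword scheme. The sketch below illustrates only that splitting idea in plain Python. It is not the tokenizer's actual implementation: the function name `split_cjk` and the chosen Unicode ranges are assumptions made for illustration, and the ranges cover only a subset of the CJK ideograph blocks.

```python
import re

# Assumption: an illustrative subset of CJK ideograph blocks, not the
# tokenizer's real vocabulary definition.
CJK_CHAR = re.compile(
    "["
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002a6df"  # CJK Unified Ideographs Extension B
    "]"
)


def split_cjk(text):
    """Return pieces where every CJK ideograph stands alone.

    Runs of non-CJK characters (Latin, punctuation, spaces) are kept
    together; a real tokenizer would pass those runs to its subword model.
    """
    pieces = []
    run = []  # accumulates consecutive non-CJK characters
    for ch in text:
        if CJK_CHAR.match(ch):
            if run:
                pieces.append("".join(run))
                run = []
            pieces.append(ch)
        else:
            run.append(ch)
    if run:
        pieces.append("".join(run))
    return pieces


if __name__ == "__main__":
    print(split_cjk("天地玄黃， 宇宙洪荒。"))
```

On the README snippet itself: in `transformers`, `BatchEncoding.tokens` is a method rather than an attribute, so `print(tokens.tokens())` is probably what is intended, assuming the repository ships a fast tokenizer.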