A newer version of this model is available: DatarrX/myX-Tokenizer

DatarrX / myX-Tokenizer-BPE ⚙️

myX-Tokenizer-BPE is a Byte Pair Encoding (BPE) based tokenizer specifically trained for the Burmese language. Developed by Khant Sint Heinn (Kalix Louis) under DatarrX (Myanmar Open Source NGO), this model serves as a baseline for Burmese NLP tasks using the BPE algorithm.

🎯 Objectives & Characteristics

BPE Baseline: Designed to provide a standard BPE-based segmentation for Burmese text.
Burmese Focus: This model was trained exclusively on Burmese text, so it is specialized for the Burmese script.
Memory Efficiency: Trained with a RAM-efficient approach on a corpus of 1.5 million sentences.

🛠️ Technical Specifications

Algorithm: Byte Pair Encoding (BPE).
Vocabulary Size: 64,000.
Normalization: NFKC.
Features: Byte-fallback, Split Digits, and Dummy Prefix.
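To make the listed features concrete, the following is a small standard-library illustration of what each one does to the text. It only mimics the behaviour conceptually and is not SentencePiece's actual implementation; all helper names are hypothetical.

```python
import re
import unicodedata
from typing import List


def nfkc(text: str) -> str:
    """NFKC folds compatibility characters (e.g. full-width forms) together."""
    return unicodedata.normalize("NFKC", text)


def byte_fallback_pieces(ch: str) -> List[str]:
    """A character missing from the vocabulary is emitted as its UTF-8 bytes,
    written in the <0xXX> form used for byte pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def split_digits(text: str) -> List[str]:
    """Every digit becomes its own unit; non-digit runs stay together here."""
    return re.findall(r"\d|\D+", text)


def add_dummy_prefix(text: str) -> str:
    """A whitespace marker (U+2581) is prepended and replaces each space."""
    return "\u2581" + text.replace(" ", "\u2581")


print(nfkc("ＡＢＣ１２３"))
print(byte_fallback_pieces("é"))
print(split_digits("in2026"))
print(add_dummy_prefix("a b"))
```

Byte-fallback is what keeps unseen characters from collapsing into a single unknown token, at the cost of several pieces per character.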

Training Data

Trained on 1.5 million cleaned, Burmese-only sentences from kalixlouiis/raw-data.
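A comparable tokenizer could plausibly be trained with the SentencePiece trainer, as in the minimal sketch below. The specification values come from this card; the corpus path, the model prefix, and the use of sentence sampling as the RAM-saving measure are assumptions, since the card does not describe the actual training script.

```python
from typing import Any, Dict


def build_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect SentencePiece trainer options matching the card's specifications."""
    return dict(
        input=corpus_path,                 # hypothetical: one sentence per line
        model_prefix=model_prefix,         # hypothetical output name
        model_type="bpe",
        vocab_size=64000,
        normalization_rule_name="nfkc",
        byte_fallback=True,
        split_digits=True,
        add_dummy_prefix=True,
        # Assumption: cap and shuffle the sentences loaded into memory.
        input_sentence_size=1_500_000,
        shuffle_input_sentence=True,
    )


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; sentencepiece is imported lazily so the args can be inspected without it."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))


args = build_training_args("corpus.txt", "myx_bpe")
print(args["model_type"], args["vocab_size"])
```

Calling `train("corpus.txt", "myx_bpe")` would write `myx_bpe.model` and `myx_bpe.vocab` to the working directory.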

⚠️ Important Considerations (Limitations)

English Language Weakness: Since this model was trained purely on Burmese data, it is notably weak in processing English text, often leading to excessive character-level fragmentation for Latin scripts.
BPE Nature: Compared to our Unigram models, this BPE version segments text differently, which may affect some downstream NLP tasks. Choose between them according to your task.
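One rough way to quantify the fragmentation described above is to count how many characters each piece covers on average. The sketch below is an assumed metric, not one reported by the authors; `chars_per_piece` is a hypothetical helper that accepts any function mapping text to a list of pieces.

```python
from typing import Callable, List


def chars_per_piece(encode: Callable[[str], List[str]], text: str) -> float:
    """Average number of non-space characters covered by each piece.

    Values near 1.0 indicate character-level fragmentation; larger is better.
    """
    pieces = encode(text)
    if not pieces:
        return 0.0
    n_chars = sum(1 for ch in text if not ch.isspace())
    return n_chars / len(pieces)


# Stand-in encoders so the sketch runs without downloading the model.
char_level = lambda t: [ch for ch in t if not ch.isspace()]
word_level = lambda t: t.split()

print(chars_per_piece(char_level, "hello world"))
print(chars_per_piece(word_level, "hello world"))
```

With the real model, `sp.encode_as_pieces` can be passed as `encode` to compare Burmese and English inputs directly.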

Citation

If you use this tokenizer in your research or project, please cite it as follows:

APA 7th Edition

Khant Sint Heinn. (2026). myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0) [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE

BibTeX

@software{khantsintheinn2026bpe,
  author = {Khant Sint Heinn},
  title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
  version = {1.0},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
  note = {BPE algorithm based on Burmese raw data}
}

DatarrX - myX-Tokenizer-BPE (Burmese) ⚙️

myX-Tokenizer-BPE is a tokenizer built specifically for the Burmese language using the Byte Pair Encoding (BPE) algorithm. The model is published by DatarrX (Myanmar Open Source NGO, https://huggingface.co/DatarrX) and was created primarily by Khant Sint Heinn (Kalix Louis).

🎯 Objectives & Characteristics

BPE Baseline: To serve as a standard for segmenting Burmese text with BPE.
Burmese Only: This model was trained solely on Burmese text, so it is specialized for Burmese (Myanmar) writing.
High-Quality Training: Built from 1.5 million sentences using a RAM-efficient method.

🛠️ Technical Specifications

Algorithm: Byte Pair Encoding (BPE).
Vocab Size: 64,000.
Normalization: NFKC.
Features: Includes Byte-fallback, Split Digits, and Dummy Prefix.

Dataset Used

Uses 1.5 million cleaned Burmese sentences from kalixlouiis/raw-data.

⚠️ Limitations to Be Aware Of

English Weakness: Because this model was trained only on Burmese text, it is very weak at segmenting English words, which tend to break apart into individual characters.
Nature of BPE: Compared with our Unigram models, the segmentation may differ, so choose according to the task you intend to use it for.

💻 How to Use

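import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the SentencePiece model file from the Hugging Face Hub (cached locally).
model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer-BPE", filename="myX-Tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

# Sample sentence: "Trying out segmenting Burmese text with the BPE algorithm."
text = "မြန်မာစာကို BPE algorithm နဲ့ ဖြတ်တောက်ကြည့်ခြင်း။"
# Print the subword pieces produced for the sample sentence.
print(sp.encode_as_pieces(text))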
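Beyond pieces, most pipelines need integer IDs and a way back to text. The sketch below wraps that round trip in a hypothetical `roundtrip` helper; it assumes the processor exposes `encode(text, out_type=int)` and `decode(ids)`, as `SentencePieceProcessor` does, and uses a stand-in object so it runs without the model file.

```python
from typing import Any, List, Tuple


def roundtrip(processor: Any, text: str) -> Tuple[List[int], str]:
    """Encode text to token IDs, then decode the IDs back to a string."""
    ids = processor.encode(text, out_type=int)
    return ids, processor.decode(ids)


class CodePointProcessor:
    """Stand-in for a real processor: one ID per Unicode code point."""

    def encode(self, text: str, out_type: type = int) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)


ids, decoded = roundtrip(CodePointProcessor(), "abc")
print(ids, decoded)
```

With the real tokenizer, pass `sp` as the processor. Because the model applies NFKC normalization, the decoded string may differ from the input when the input contains compatibility characters.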

✍️ Project Authors

Developer: Khant Sint Heinn (Kalix Louis)
Organization: DatarrX (Myanmar Open Source NGO)


License 📜

This project is licensed under the Apache License 2.0.

What does this mean?

The Apache License 2.0 is a permissive license that allows you to:

Commercial Use: You can use this tokenizer for commercial purposes.
Modification: You can modify the model or the code for your specific needs.
Distribution: You can share and distribute the original or modified versions.
Sublicensing: You can grant sublicenses to others.

Conditions:

Attribution: You must give appropriate credit to the author (Khant Sint Heinn) and the organization (DatarrX).
License Notice: You must include a copy of the license and any original copyright notice in your distribution.

For more details, you can read the full license text at http://www.apache.org/licenses/LICENSE-2.0.
