# 🇺🇿 NeuronAI-Uzbek

**The Most Advanced Open-Source Language Model for Uzbek**

🏆 4th Place Globally | 🥇 1st Place in Uzbekistan on the UzLiB Benchmark

Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks.
## 📊 Key Results
| Achievement | Value |
|---|---|
| UzLiB Overall Score | 0.662 |
| Global Ranking | #4 |
| Regional Ranking | #1 in Uzbekistan |
| Tokenizer Efficiency Improvement | +22.5% vs Qwen3-4B |
## 📈 UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the UzLiB Benchmark, a comprehensive evaluation suite for Uzbek language understanding.
### Leaderboard Position

> **Note:** NeuronAI-Uzbek is the smallest model in the top 10, with only 4B parameters, competing against models with 100B+ parameters.
### Performance Comparison vs. Original Qwen3-4B
| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|---|---|---|---|
| Overall (All) | 0.345 | 0.662 | +91.9% |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |
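The Improvement column is the plain relative change between the two scores. For the overall row, for example:

```python
original, ours = 0.345, 0.662
print(f"{(ours - original) / original:+.1%}")  # +91.9%
```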
## 🔤 Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency: a lower fertility rate means fewer tokens per word, which translates to faster inference and lower costs.
### Fertility Rate Comparison
| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|---|---|---|---|---|
| NeuronAI-Uzbek (Ours) 🏆 | 2.67 | 0.15 | 180,000 | +22.5% |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |
*Fertility rate: the average number of tokens per word. Lower is better for efficiency.*
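For a rough sense of how fertility is measured, the sketch below tokenizes sample text and divides token count by whitespace word count. The exact corpus and word-segmentation rules behind the table are not specified here, and the sample sentence is only illustrative:

```python
from transformers import AutoTokenizer

def fertility_rate(model_name: str, texts: list[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    n_tokens = sum(len(tokenizer.tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

# "Uzbekistan is a state located in Central Asia."
sample = ["O'zbekiston Markaziy Osiyoda joylashgan davlat."]
print(fertility_rate("NeuronUz/NeuronAI-Uzbek", sample))
print(fertility_rate("Qwen/Qwen3-4B", sample))
```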
### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization**: fit more content in the same context window (a back-of-the-envelope sketch follows this list)
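Illustrating the last point with the fertility rates from the table above and the model's 32,768-token context window:

```python
CONTEXT_TOKENS = 32_768
print(CONTEXT_TOKENS / 3.44)  # ≈ 9,526 words with the original Qwen3-4B tokenizer
print(CONTEXT_TOKENS / 2.67)  # ≈ 12,273 words with NeuronAI-Uzbek
```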
## 🛠️ Model Details

### Architecture
| Property | Value |
|---|---|
| Base Model | Qwen3-4B |
| Parameters | 4 Billion |
| Vocabulary Size | 180,000 tokens |
| Context Length | 32,768 tokens |
| Architecture | Transformer (Decoder-only) |
| Precision | BFloat16 |
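These figures can be sanity-checked against the Hub config. A small sketch, assuming the standard Qwen3-style attribute names:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(config.vocab_size)               # expected: 180000
print(config.max_position_embeddings)  # expected: 32768

tokenizer = AutoTokenizer.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(len(tokenizer))                  # tokenizer vocabulary size
```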
### Training Methodology

- **Tokenizer Surgery**: Extended the vocabulary with 40,000 Uzbek-optimized tokens
- **Embedding Initialization**: Semantic initialization of the new tokens' embeddings using subword composition (see the sketch after this list)
- **Continual Pretraining**: Trained on a 22GB Uzbek text corpus
- **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
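The card does not spell out the embedding-initialization procedure. A common recipe consistent with "subword composition", sketched below purely as an assumption, is to set each new token's embedding to the mean of the embeddings of the subwords the original tokenizer splits its surface form into:

```python
import torch

def init_new_token_embeddings(model, new_tokenizer, old_tokenizer, new_tokens):
    # Assumes model.resize_token_embeddings(len(new_tokenizer)) has already been
    # called, so rows for the new tokens exist but hold random values.
    embeddings = model.get_input_embeddings().weight
    with torch.no_grad():
        for token in new_tokens:
            new_id = new_tokenizer.convert_tokens_to_ids(token)
            # How the original tokenizer would split this token's surface form
            piece_ids = old_tokenizer(token, add_special_tokens=False)["input_ids"]
            if piece_ids:
                embeddings[new_id] = embeddings[piece_ids].mean(dim=0)
```

Whether NeuronAI-Uzbek also initializes the output (LM-head) embeddings this way, or uses a more elaborate semantic scheme, is not stated in the card.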
### Training Data
| Dataset | Type | Purpose |
|---|---|---|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |
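The SFT datasets are public Hugging Face datasets and can be inspected directly; a minimal sketch (the `train` split name is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("behbudiy/alpaca-cleaned-uz", split="train")
print(ds[0])  # one instruction/response pair in Uzbek
```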
## 🚀 Quick Start
### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### With Thinking Mode (Chain-of-Thought)

```python
# "Find 5 natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # enable step-by-step reasoning before the final answer
)
# Tokenize `text` and call model.generate() exactly as in Basic Usage above
```
## 📌 Use Cases

NeuronAI-Uzbek excels at:

- 📝 **Text Generation**: Creative writing and content creation in Uzbek
- ❓ **Question Answering**: Questions about Uzbek culture, history, and general knowledge
- 📖 **Reading Comprehension**: Understanding and analyzing Uzbek texts
- 🔤 **Grammar & Spelling**: Uzbek language correctness tasks
- 🌍 **Translation Assistance**: Uzbek-English language tasks
- 💬 **Conversational AI**: Building Uzbek chatbots and assistants
## ⚠️ Limitations

- **Knowledge Cutoff**: The model is unaware of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in the training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight
## 📄 License

This model is released under the Apache 2.0 License.
## 🙏 Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources
## 📚 Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```
*Built with ❤️ in Uzbekistan by NeuronUz*
