# 🇺🇿 NeuronAI-Uzbek

**The Most Advanced Open-Source Language Model for Uzbek**

🏆 4th Place Globally | 🥇 1st Place in Uzbekistan on the UzLiB Benchmark

Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks.
## 📊 Key Results
| Achievement | Value |
|---|---|
| UzLiB Overall Score | 0.662 |
| Global Ranking | #4 |
| Regional Ranking | #1 in Uzbekistan |
| Tokenizer Efficiency Improvement | +22.5% vs Qwen3-4B |
## 📈 UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the UzLiB Benchmark, a comprehensive evaluation suite for Uzbek language understanding.
### Leaderboard Position

> **Note:** NeuronAI-Uzbek is the smallest model in the top 10, with only 4B parameters, competing against models with 100B+ parameters.
### Performance Comparison vs. Original Qwen3-4B
| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|---|---|---|---|
| Overall (All) | 0.345 | 0.662 | +91.9% |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |
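The Improvement column is the plain relative change between the two scores. For the overall row, for example:

```python
original, ours = 0.345, 0.662
print(f"{(ours - original) / original:+.1%}")  # +91.9%
```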
## 🔤 Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency: a lower fertility rate means fewer tokens per word, which translates to faster inference and lower costs.
### Fertility Rate Comparison
| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|---|---|---|---|---|
| NeuronAI-Uzbek (Ours) 🏆 | 2.67 | 0.15 | 180,000 | +22.5% |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |
*Fertility rate: the average number of tokens per word. Lower is better for efficiency.*
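For a rough sense of how fertility is measured, the sketch below tokenizes sample text and divides token count by whitespace word count. The exact corpus and word-segmentation rules behind the table are not specified here, and the sample sentence is only illustrative:

```python
from transformers import AutoTokenizer

def fertility_rate(model_name: str, texts: list[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    n_tokens = sum(len(tokenizer.tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

# "Uzbekistan is a state located in Central Asia."
sample = ["O'zbekiston Markaziy Osiyoda joylashgan davlat."]
print(fertility_rate("NeuronUz/NeuronAI-Uzbek", sample))
print(fertility_rate("Qwen/Qwen3-4B", sample))
```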
### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization**: fit more content in the same context window (a back-of-the-envelope sketch follows this list)
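Illustrating the last point with the fertility rates from the table above and the model's 32,768-token context window:

```python
CONTEXT_TOKENS = 32_768
print(CONTEXT_TOKENS / 3.44)  # ≈ 9,526 words with the original Qwen3-4B tokenizer
print(CONTEXT_TOKENS / 2.67)  # ≈ 12,273 words with NeuronAI-Uzbek
```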
## 🛠️ Model Details

### Architecture
| Property | Value |
|---|---|
| Base Model | Qwen3-4B |
| Parameters | 4 Billion |
| Vocabulary Size | 180,000 tokens |
| Context Length | 32,768 tokens |
| Architecture | Transformer (Decoder-only) |
| Precision | BFloat16 |
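These figures can be sanity-checked against the Hub config. A small sketch, assuming the standard Qwen3-style attribute names:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(config.vocab_size)               # expected: 180000
print(config.max_position_embeddings)  # expected: 32768

tokenizer = AutoTokenizer.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(len(tokenizer))                  # tokenizer vocabulary size
```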
### Training Methodology

- **Tokenizer Surgery**: Extended the vocabulary with 40,000 Uzbek-optimized tokens
- **Embedding Initialization**: Semantic initialization of the new tokens' embeddings using subword composition (see the sketch after this list)
- **Continual Pretraining**: Trained on a 22GB Uzbek text corpus
- **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
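The card does not spell out the embedding-initialization procedure. A common recipe consistent with "subword composition", sketched below purely as an assumption, is to set each new token's embedding to the mean of the embeddings of the subwords the original tokenizer splits its surface form into:

```python
import torch

def init_new_token_embeddings(model, new_tokenizer, old_tokenizer, new_tokens):
    # Assumes model.resize_token_embeddings(len(new_tokenizer)) has already been
    # called, so rows for the new tokens exist but hold random values.
    embeddings = model.get_input_embeddings().weight
    with torch.no_grad():
        for token in new_tokens:
            new_id = new_tokenizer.convert_tokens_to_ids(token)
            # How the original tokenizer would split this token's surface form
            piece_ids = old_tokenizer(token, add_special_tokens=False)["input_ids"]
            if piece_ids:
                embeddings[new_id] = embeddings[piece_ids].mean(dim=0)
```

Whether NeuronAI-Uzbek also initializes the output (LM-head) embeddings this way, or uses a more elaborate semantic scheme, is not stated in the card.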
### Training Data
| Dataset | Type | Purpose |
|---|---|---|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |
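The SFT datasets are public Hugging Face datasets and can be inspected directly; a minimal sketch (the `train` split name is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("behbudiy/alpaca-cleaned-uz", split="train")
print(ds[0])  # one instruction/response pair in Uzbek
```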
## 🚀 Quick Start
### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### With Thinking Mode (Chain-of-Thought)

```python
# "Find 5 natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # enable step-by-step reasoning before the final answer
)
# Tokenize `text` and call model.generate() exactly as in Basic Usage above
```
## 📌 Use Cases

NeuronAI-Uzbek excels at:

- 📝 **Text Generation**: Creative writing and content creation in Uzbek
- ❓ **Question Answering**: Questions about Uzbek culture, history, and general knowledge
- 📖 **Reading Comprehension**: Understanding and analyzing Uzbek texts
- 🔤 **Grammar & Spelling**: Uzbek language correctness tasks
- 🌍 **Translation Assistance**: Uzbek-English language tasks
- 💬 **Conversational AI**: Building Uzbek chatbots and assistants
## ⚠️ Limitations

- **Knowledge Cutoff**: The model is unaware of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in the training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight
## 📄 License

This model is released under the Apache 2.0 License.
## 🙏 Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources
## 📚 Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```
*Built with ❤️ in Uzbekistan by NeuronUz*
