πŸ‡ΊπŸ‡Ώ NeuronAI-Uzbek

The Most Advanced Open-Source Language Model for Uzbek


πŸ† 4th Place Globally | πŸ₯‡ 1st Place in Uzbekistan on UzLiB Benchmark

Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks


πŸ“Š Key Results

| Achievement | Value |
|---|---|
| UzLiB Overall Score | 0.662 |
| Global Ranking | #4 |
| Regional Ranking | #1 in Uzbekistan |
| Tokenizer Efficiency Improvement | +22.5% vs Qwen3-4B |

πŸ† UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the UzLiB Benchmark, the comprehensive evaluation suite for Uzbek language understanding.

Leaderboard Position


Note: NeuronAI-Uzbek is the smallest model in the top 10, with only 4B parameters, while competing against models with 100B+ parameters.

Performance Comparison vs Original Qwen3-4B

| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|---|---|---|---|
| Overall (All) | 0.345 | 0.662 | +91.9% |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |

πŸ”€ Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency (lower fertility rate = fewer tokens per word = faster inference and lower costs).

Fertility Rate Comparison

| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|---|---|---|---|---|
| NeuronAI-Uzbek (Ours) πŸ† | 2.67 | 0.15 | 180,000 | +22.5% |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |

Fertility Rate: Average number of tokens per word. Lower is better for efficiency.
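
For a quick reproduction, fertility can be estimated with a few lines of `transformers` code. The sketch below is illustrative (not the evaluation script behind the table), and the exact numbers depend on the text sample you pass in:

```python
# Illustrative fertility estimate: average tokens per whitespace-separated word.
# Not the authors' evaluation script; results depend on the sample corpus.
from transformers import AutoTokenizer

def fertility_rate(model_name: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    n_tokens = sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

sample = ["O'zbekiston Markaziy Osiyoda joylashgan davlat."]
print(fertility_rate("NeuronUz/NeuronAI-Uzbek", sample))  # ~2.7 on large corpora
print(fertility_rate("Qwen/Qwen3-4B", sample))            # ~3.4 on large corpora
```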


What This Means

  • 22.5% fewer tokens needed to represent Uzbek text
  • Faster inference due to shorter sequences
  • Lower API costs when deployed
  • Better context utilization - fit more content in the same context window

πŸ› οΈ Model Details

Architecture

| Property | Value |
|---|---|
| Base Model | Qwen3-4B |
| Parameters | 4 billion |
| Vocabulary Size | 180,000 tokens |
| Context Length | 32,768 tokens |
| Architecture | Transformer (decoder-only) |
| Precision | BFloat16 |
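
The advertised vocabulary and context sizes can be sanity-checked straight from the published model config; a quick sketch using standard `transformers` config fields:

```python
# Sanity-check the table above against the published config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(cfg.vocab_size)               # expected: 180,000
print(cfg.max_position_embeddings)  # expected: 32,768
print(cfg.torch_dtype)              # expected: bfloat16
```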

Training Methodology

  1. Tokenizer Surgery: Extended vocabulary with 40,000 Uzbek-optimized tokens
  2. Embedding Initialization: Semantic initialization using subword composition (see the sketch after this list)
  3. Continual Pretraining: Trained on 22GB Uzbek text corpus
  4. Instruction Fine-tuning: Aligned using Uzbek and English instruction datasets
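
A minimal sketch of step 2: each token added during tokenizer surgery is initialized to the mean of the embeddings of its decomposition under the original tokenizer, so training starts from a semantically sensible point rather than random noise. This illustrates subword-composition initialization under the assumption that new tokens are appended after the base vocabulary; it is not the team's exact code:

```python
# Sketch of subword-composition embedding initialization (step 2).
# Assumes new Uzbek tokens are appended after the base Qwen3 vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
old_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
new_tok = AutoTokenizer.from_pretrained("NeuronUz/NeuronAI-Uzbek")  # extended vocab

old_emb = base.get_input_embeddings().weight.data.clone()
base.resize_token_embeddings(len(new_tok))  # new rows are randomly initialized
new_emb = base.get_input_embeddings().weight.data

with torch.no_grad():
    for token_id in range(old_emb.shape[0], len(new_tok)):
        # Decompose the new token's surface form into base-vocabulary subwords.
        sub_ids = old_tok.encode(new_tok.decode([token_id]), add_special_tokens=False)
        if sub_ids:
            new_emb[token_id] = old_emb[sub_ids].mean(dim=0)
# If the output head is untied from the input embeddings, initialize it the same way.
```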

Training Data

| Dataset | Type | Purpose |
|---|---|---|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |

πŸš€ Quick Start

Installation

```bash
pip install transformers torch
```

Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."

messages = [
    {"role": "user", "content": prompt}
]

# Format the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

With Thinking Mode (Chain-of-Thought)

```python
# "Find five natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable step-by-step reasoning
)
```
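
The snippet above only builds the prompt; a minimal continuation for generating and separating the reasoning trace is sketched below, assuming the fine-tune retains Qwen3's `<think>...</think>` tags (an assumption, not confirmed by this card):

```python
# Continuation sketch: generate, then split the reasoning trace from the answer.
# Assumes the model still emits Qwen3-style <think>...</think> tags.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, top_p=0.9, do_sample=True)
full = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

thinking, _, answer = full.partition("</think>")
print(answer.strip() if answer else full)
```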

πŸ“ˆ Use Cases

NeuronAI-Uzbek excels at:

  • πŸ“ Text Generation: Creative writing, content creation in Uzbek
  • ❓ Question Answering: Answering questions about Uzbek culture, history, and general knowledge
  • πŸ“š Reading Comprehension: Understanding and analyzing Uzbek texts
  • πŸ”€ Grammar & Spelling: Uzbek language correctness tasks
  • 🌐 Translation Assistance: Uzbek-English language tasks
  • πŸ’¬ Conversational AI: Building Uzbek chatbots and assistants

⚠️ Limitations

  • Knowledge Cutoff: The model only knows what appears in its training data and will not reflect events after the corpus was collected
  • Hallucinations: May generate plausible-sounding but incorrect information
  • Bias: May reflect biases present in training data
  • Not for Critical Applications: Should not be used for medical, legal, or safety-critical applications without human oversight

πŸ“œ License

This model is released under the Apache 2.0 License.


πŸ™ Acknowledgments

  • Qwen Team at Alibaba for the excellent Qwen3-4B base model
  • UzLiB Benchmark creators for the comprehensive evaluation framework
  • Uzbek NLP Community for datasets and linguistic resources

πŸ“– Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```

Built with ❀️ in Uzbekistan by NeuronUz
