Qwen3-8B-ODA-Math-460k

[Figure: Leaderboard Performance]

Qwen3-8B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on Qwen3-8B-Base and trained on the ODA-Math-460k dataset.

ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the OpenDataArena leaderboard) and refined through deduplication, benchmark decontamination, LLM-based filtering, and verifier-backed response distillation.
It targets a “learnable but challenging” difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

🧠 Model Summary

Base Model: Qwen/Qwen3-8B-Base
Training Data: OpenDataArena/ODA-Math-460k
Domain Coverage: Mathematics (strictly filtered)
Scale (selected training set): ~460K problems (after selection and verification pipeline)
Goal: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

⚙️ Training Data Curation Pipeline

ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.

1️⃣ Data Collection

We prioritize source datasets based on their empirical impact on downstream model performance. Using the OpenDataArena leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the Qwen and Llama model families. These sources form the initial pool for ODA-Math.

2️⃣ Deduplication & Decontamination

We first perform exact deduplication over all questions to remove identical items, and then run benchmark decontamination to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
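
The two steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the whitespace/case normalization, the word-level n-gram overlap test, the window size `n=8`, and the function names are all assumptions chosen for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions):
    """Keep the first occurrence of each normalized question."""
    seen, unique = set(), []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int):
    """Set of word-level n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions, benchmark_questions, n: int = 8):
    """Drop questions that share at least one n-gram with any benchmark item."""
    bench = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]


pool = [
    "What is 2 + 2?",
    "what is  2 + 2?",  # duplicate after normalization
    "Find the smallest positive integer n such that n squared exceeds one hundred",
    "Compute 3 * 7.",
]
benchmark = [
    "Find the smallest positive integer n such that n squared exceeds one hundred and one",
]
clean = decontaminate(exact_dedup(pool), benchmark)
print(clean)
```

Questions shorter than `n` words produce no n-grams and are therefore never flagged by this overlap test; a production pipeline would likely need a separate rule for very short items.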

3️⃣ Question Filtering (Quality & Suitability)

A multi-stage filtering pipeline refines domain specificity and usability in three steps:

LLM-based domain classifier: removes out-of-domain items such as coding and general instruction tasks.
LLM-based validity validator: removes ill-formed questions with missing premises or undefined notation.
Problem-type filtering (via the Big Math toolkit): excludes proof questions and guessing-prone formats such as multiple-choice and true/false.

The remaining pool consists predominantly of free-form problems with objectively verifiable answers.

📊 Filtration Statistics

Pipeline Stage	Count	Percentage
Raw Collection	11.4M	100%
Dedup & Decontamination	4.3M	37.7%
Question Filtering	3.3M	28.9%
Stage-1 Filtering	815.3K	7.2%
Stage-2 Filtering	459.6K	4.0%
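
The percentages in the table are cumulative, measured against the raw collection. The short sketch below recomputes them from the listed counts and also derives a per-stage retention rate, which is not reported in the table. Because the counts are rounded, the derived figures are approximate; the names `STAGES` and `retention` are illustrative.

```python
# Counts as listed in the table (rounded values).
STAGES = [
    ("Raw Collection", 11_400_000),
    ("Dedup & Decontamination", 4_300_000),
    ("Question Filtering", 3_300_000),
    ("Stage-1 Filtering", 815_300),
    ("Stage-2 Filtering", 459_600),
]


def retention(stages):
    """Return (name, % of raw pool, % of previous stage) for each stage."""
    raw = stages[0][1]
    prev = raw
    rows = []
    for name, count in stages:
        rows.append((name, round(100 * count / raw, 1), round(100 * count / prev, 1)))
        prev = count
    return rows


for name, of_raw, of_prev in retention(STAGES):
    print(f"{name:<26} {of_raw:>6}% of raw   {of_prev:>6}% of previous stage")
```

Read this way, Stage-1 is the most selective single step: only about a quarter of the filtered questions survive it.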

🎯 Data Selection

Given the large curated pool, ODA-Math-460k retains problems that are hard for small models but solvable for stronger reasoning models.

Stage-1: Lower-Bound Filtering

Stage-1 removes trivial problems using Qwen3-8B in non-thinking mode: for each problem we sample k=4 responses, compute Pass@4 by matching each predicted final answer to y_gt, and keep the problem only if Pass@4(x) = 0 (i.e., none of four attempts is correct).

Stage-2: Upper-Bound Filtering

Stage-2 removes unsolvable or ambiguous problems using Qwen3-30B-A3B in thinking mode: we generate k=5 reasoning traces per problem, compute Pass@5, and keep the problem only if Pass@5(x) > 0 (i.e., at least one attempt solves it).
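
Taken together, Stage-1 and Stage-2 form a band-pass rule: keep a problem only if the weaker model fails all 4 attempts and the stronger model succeeds on at least 1 of 5. The sketch below illustrates that rule under simplifying assumptions: the samplers are hypothetical callables standing in for the two models, and answers are compared by exact string match, whereas a real pipeline would need a mathematical equivalence check.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (problem, k) -> k predicted final answers.
Sampler = Callable[[str, int], List[str]]


def pass_at_k(predictions: List[str], y_gt: str) -> int:
    """1 if any prediction matches the ground truth, else 0 (exact string match)."""
    return int(any(p.strip() == y_gt.strip() for p in predictions))


def select(problems: List[Tuple[str, str]],
           weak_sampler: Sampler,
           strong_sampler: Sampler) -> List[Tuple[str, str]]:
    """Keep (x, y_gt) pairs with Pass@4 = 0 for the weak model and Pass@5 > 0 for the strong one."""
    kept = []
    for x, y_gt in problems:
        if pass_at_k(weak_sampler(x, 4), y_gt) != 0:
            continue  # Stage-1: solved by the weaker model, so treated as trivial
        if pass_at_k(strong_sampler(x, 5), y_gt) == 0:
            continue  # Stage-2: unsolved by the stronger model, possibly ambiguous
        kept.append((x, y_gt))
    return kept


# Toy stand-ins for the two models (canned answers, for illustration only).
weak_answers = {"easy": ["4", "4", "5", "4"], "mid": ["1", "2", "3", "5"], "hard": ["0"] * 4}
strong_answers = {"easy": ["4"] * 5, "mid": ["7", "7", "6", "7", "7"], "hard": ["1"] * 5}

selected = select(
    [("easy", "4"), ("mid", "7"), ("hard", "42")],
    weak_sampler=lambda x, k: weak_answers[x][:k],
    strong_sampler=lambda x, k: strong_answers[x][:k],
)
print(selected)
```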

✅ Distillation & Verification

🧪 Response Synthesis

We distill solutions using AM-Thinking-v1 as the teacher, generating k=5 candidate reasoning traces (step-by-step solution + final answer) for each selected problem.

🔍 Response Verification

We verify generated responses with Compass-Verifier-7B, which takes (problem x, generated response y_gen, ground-truth answer y_gt) and outputs a binary correctness decision (correct / incorrect). We keep only the (problem, response) pairs judged correct, and discard the rest—so the released dataset contains verified solutions only.
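
The synthesis-then-verification step amounts to a filter over candidate responses. The sketch below shows its shape only: `verifier` is a hypothetical callable with the same (x, y_gen, y_gt) inputs and binary output described above, and the toy verifier used in the example is a string check, not Compass-Verifier-7B.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (problem x, generated response y_gen, ground truth y_gt) -> correct?
Verifier = Callable[[str, str, str], bool]


def keep_verified(problem: str, y_gt: str, candidates: List[str],
                  verifier: Verifier) -> List[Tuple[str, str]]:
    """Return only the (problem, response) pairs the verifier judges correct."""
    return [(problem, y_gen) for y_gen in candidates if verifier(problem, y_gen, y_gt)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    """Illustrative stand-in: accept a response whose text ends with the ground truth."""
    return y_gen.strip().endswith(y_gt)


candidates = [
    "f(3) = 3^2 + 1, so the answer is 10",
    "f(3) = 3*2 + 1, so the answer is 7",
    "10",
]
pairs = keep_verified("If f(x)=x^2+1, what is f(3)?", "10", candidates, toy_verifier)
print(len(pairs))
```

A problem whose candidates are all rejected contributes no pairs at all under this rule.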

📚 Training Data Source Composition

ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:

Source	Count	Percentage
ScaleQuest-Math	87,755	19.09%
NuminaMath-CoT	75,971	16.53%
OpenMathInstruct-2	65,688	14.29%
MegaScience (math)	54,904	11.94%
OpenMathReasoning	49,463	10.76%
AM-Thinking-Distilled	38,375	8.35%
MiroMind-M1-SFT-719K	23,417	5.09%
SCP-116K	16,066	3.50%
DeepMath-309K	11,956	2.60%
math-gpt-4o-200k	8,355	1.82%
OpenR1-Math-220k	7,999	1.74%
MathFusionQA	6,510	1.42%
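
As a quick consistency check on the table, the sketch below sums the listed counts and percentages. The twelve listed sources account for roughly 97% of the dataset, so the remainder comes from sources not shown here. The variable names are illustrative.

```python
# (count, percentage) per source, copied from the table above.
SOURCES = {
    "ScaleQuest-Math": (87_755, 19.09),
    "NuminaMath-CoT": (75_971, 16.53),
    "OpenMathInstruct-2": (65_688, 14.29),
    "MegaScience (math)": (54_904, 11.94),
    "OpenMathReasoning": (49_463, 10.76),
    "AM-Thinking-Distilled": (38_375, 8.35),
    "MiroMind-M1-SFT-719K": (23_417, 5.09),
    "SCP-116K": (16_066, 3.50),
    "DeepMath-309K": (11_956, 2.60),
    "math-gpt-4o-200k": (8_355, 1.82),
    "OpenR1-Math-220k": (7_999, 1.74),
    "MathFusionQA": (6_510, 1.42),
}

listed_count = sum(count for count, _ in SOURCES.values())
listed_share = round(sum(pct for _, pct in SOURCES.values()), 2)

print(f"Listed sources: {listed_count:,} problems, {listed_share}% of the dataset")
```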

🔬 Content Characteristics

📘 Subject Distribution

[Figure: Subject Distribution]

ODA-Math-460k maintains a more balanced subject composition than several peers:

Algebra remains the largest subject (~44.8%).
Geometry accounts for roughly 20–22%.
Calculus, Discrete Math & Probability, and Number Theory each account for about 11%.

This mitigates subject bias and reduces performance drops on underrepresented topics.

📉 Difficulty Distribution

In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a 1–10 scale aligned with the AoPS difficulty ratings.

Level	Equivalent Competition Tier	Description
1	Elementary / Middle School	MOEMS, AMC 8 (Early Qs). Standard word problems.
2	Junior High	AMC 8 (Hard), AMC 10 (Early). Complex word problems.
3	High School Beginner	AMC 10 (Mid), AMC 12 (Early). Requires creative thinking.
4	High School Intermediate	AMC 12 (Mid), AIME (Early). Intermediate complexity.
5	Advanced High School	AIME (Mid), JBMO. Simple proof-based Olympiad style.
6	Pre-Olympiad	AIME (Hard), USAJMO. Introductory Olympiad level.
7	Olympiad (Entry)	IMO (Easy/Medium), USAMO. Requires technical knowledge.
8	Olympiad (Medium)	IMO (Medium/Hard). High-level competition problems.
9	Olympiad (Expert)	IMO (Hard). Expert-level constructions/proofs.
10	Historically Hard	Outliers. Exceedingly tedious or difficult even for Olympians.

[Figure: Difficulty Distribution]

ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
Secondary Mode: Difficulty 6 (~72k samples), a significant concentration of Pre-Olympiad problems (AIME hard, USAJMO in the scale above).
Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

📈 Performance

ODA-Math-460k is evaluated as an SFT corpus for Qwen3-8B-Base.

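ODA-Math improves on the base checkpoint across every benchmark and reaches the highest average score in the comparison (68.8, against 32.0 for Qwen3-8B-Base and 65.9 for the next-best dataset, AM-Thinking (math)). The gains are largest on competition-style benchmarks such as AIME'25, CMIMC'25, HMMT'25, and BRUMO'25.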

Performance Comparison. Best scores in **bold**, second-best underlined.
Dataset	Size	GSM8K	Math500	Omni-Math	Olympiad	AIME'24	AIME'25	CMIMC'25	HMMT'25	BRUMO'25	AVG
Base model: Qwen3-8B-Base
Qwen3-8B-Base	-	92.0	79.6	30.6	47.2	6.7	10.8	4.7	0.0	16.7	32.0
LIMO	817	83.9	69.0	21.8	31.3	12.5	8.8	2.2	1.7	13.8	27.2
MegaScience (math)	414k	93.4	84.8	35.8	57.6	25.4	17.9	11.3	12.1	33.8	41.3
Fast-Math-R1-SFT	8k	92.8	86.6	39.6	61.0	28.8	25.8	14.1	13.3	34.2	44.0
Light-R1-SFT	79k	93.8	92.6	48.5	69.7	54.6	31.3	22.8	25.0	48.8	54.1
SYNTHETIC-2 (math)	50k	93.9	93.8	58.8	71.5	58.8	45.8	28.4	32.9	54.2	59.8
MiroMind-M1-SFT	719k	94.8	96.8	54.5	77.0	62.9	47.5	25.6	27.5	60.4	60.8
OmniThought-0528	365k	94.2	95.4	59.0	74.9	67.9	45.4	31.3	35.8	52.5	61.8
AM-Thinking (math)	558k	95.2	95.6	64.5	77.5	65.8	54.6	36.3	41.3	62.5	65.9
ODA-Math	460k	94.3	96.0	66.9	76.3	67.9	63.3	41.6	45.4	67.5	68.8
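
To make the comparison easier to read in plain text, the sketch below recomputes the top-scoring dataset for each benchmark directly from the rows above. The scores are copied from the table; the names `COLUMNS`, `SCORES`, and `best_per_column` are illustrative.

```python
COLUMNS = ["GSM8K", "Math500", "Omni-Math", "Olympiad", "AIME'24",
           "AIME'25", "CMIMC'25", "HMMT'25", "BRUMO'25", "AVG"]

# One row per dataset, scores in the same order as COLUMNS.
SCORES = {
    "Qwen3-8B-Base":      [92.0, 79.6, 30.6, 47.2, 6.7, 10.8, 4.7, 0.0, 16.7, 32.0],
    "LIMO":               [83.9, 69.0, 21.8, 31.3, 12.5, 8.8, 2.2, 1.7, 13.8, 27.2],
    "MegaScience (math)": [93.4, 84.8, 35.8, 57.6, 25.4, 17.9, 11.3, 12.1, 33.8, 41.3],
    "Fast-Math-R1-SFT":   [92.8, 86.6, 39.6, 61.0, 28.8, 25.8, 14.1, 13.3, 34.2, 44.0],
    "Light-R1-SFT":       [93.8, 92.6, 48.5, 69.7, 54.6, 31.3, 22.8, 25.0, 48.8, 54.1],
    "SYNTHETIC-2 (math)": [93.9, 93.8, 58.8, 71.5, 58.8, 45.8, 28.4, 32.9, 54.2, 59.8],
    "MiroMind-M1-SFT":    [94.8, 96.8, 54.5, 77.0, 62.9, 47.5, 25.6, 27.5, 60.4, 60.8],
    "OmniThought-0528":   [94.2, 95.4, 59.0, 74.9, 67.9, 45.4, 31.3, 35.8, 52.5, 61.8],
    "AM-Thinking (math)": [95.2, 95.6, 64.5, 77.5, 65.8, 54.6, 36.3, 41.3, 62.5, 65.9],
    "ODA-Math":           [94.3, 96.0, 66.9, 76.3, 67.9, 63.3, 41.6, 45.4, 67.5, 68.8],
}


def best_per_column(scores, columns):
    """Map each benchmark to the dataset(s) with the highest score (ties kept)."""
    best = {}
    for i, col in enumerate(columns):
        top = max(row[i] for row in scores.values())
        best[col] = sorted(name for name, row in scores.items() if row[i] == top)
    return best


best = best_per_column(SCORES, COLUMNS)
for col in COLUMNS:
    print(f"{col:<10} {', '.join(best[col])}")
```

On these numbers ODA-Math is the top entry on seven of the ten columns (one of them, AIME'24, tied with OmniThought-0528), while GSM8K, Math500, and Olympiad are led by other datasets.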

🌐 About OpenDataArena

OpenDataArena is an open research platform dedicated to discovering, evaluating, and advancing high-quality datasets for AI post-training. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

Key Features:

🏆 Dataset Leaderboard — helps researchers identify the most valuable and high-quality datasets across different domains.
📊 Detailed Evaluation Scores — provides comprehensive metrics to assess data quality, complexity, difficulty, etc.
🧰 Data Processing Toolkit — OpenDataArena-Tool offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider ⭐ starring and subscribing to support our research.

🚀 Usage

Model repo: OpenDataArena/Qwen3-8B-ODA-Math-460k. Below is a minimal runnable example for loading and inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen3-8B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

# Format the conversation with the model's chat template.
messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response with nucleus sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

📚 Citation

@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
