Qwen3-8B-ODA-Math-460k
Qwen3-8B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of Qwen3-8B-Base, trained with ODA-Math-460k.
ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the OpenDataArena leaderboard) and refined through deduplication, benchmark decontamination, LLM-based filtering, and verifier-backed response distillation.
It targets a "learnable but challenging" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.
Model Summary
- Base Model: Qwen/Qwen3-8B-Base
- Training Data: OpenDataArena/ODA-Math-460k
- Domain Coverage: Mathematics (strictly filtered)
- Scale (selected training set): ~460K problems (after selection and verification pipeline)
- Goal: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.
Training Data Curation Pipeline
ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.
1. Data Collection
We prioritize source datasets based on their empirical impact on downstream model performance. Using the OpenDataArena leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the Qwen and Llama model families. These sources form the initial pool for ODA-Math.
2. Deduplication & Decontamination
We first perform exact deduplication over all questions to remove identical items, and then run benchmark decontamination to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
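A minimal sketch of this step, assuming questions are plain strings. The whitespace/case normalization, the word n-gram overlap criterion, and the names `dedup_exact` and `decontaminate` are illustrative assumptions; the card does not specify how overlaps with benchmarks are detected.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized question."""
    seen: set[str] = set()
    kept = []
    for q in questions:
        digest = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(q)
    return kept


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Word n-grams of the normalized text (empty if fewer than n words)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(questions: list[str], benchmark: list[str], n: int = 8) -> list[str]:
    """Drop questions sharing at least one word n-gram with any benchmark item.

    The n-gram criterion and n=8 are assumptions made for illustration.
    """
    bench_grams: set[tuple[str, ...]] = set()
    for item in benchmark:
        bench_grams |= ngrams(item, n)
    return [q for q in questions if not (ngrams(q, n) & bench_grams)]


pool = [
    "Find x if 2x = 4.",
    "Find  x if 2x = 4.",  # duplicate up to whitespace
    "What is the sum of the first ten positive integers in base ten?",
]
benchmark = ["What is the sum of the first ten positive integers in base ten?"]
clean = decontaminate(dedup_exact(pool), benchmark)
```

Note that with this criterion a question shorter than `n` words is never flagged, so a real pipeline would likely pair it with an exact-match check.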
3. Question Filtering (Quality & Suitability)
A multi-stage filtering pipeline refines domain specificity and usability in three steps: an LLM-based domain classifier removes out-of-domain items such as coding and general instruction tasks; an LLM-based validity validator removes ill-formed questions with missing premises or undefined notation; and problem-type filtering (via the Big Math toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false. What remains is predominantly free-form problems with objectively verifiable answers.
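The three filters can be viewed as an ordered chain of keep/drop predicates. The sketch below shows only that control flow: `run_filters` is a hypothetical helper, and the lambdas are toy stand-ins for the LLM-based judges and the Big Math problem-type filter, not their real behavior.

```python
from typing import Callable, Iterable

Predicate = Callable[[str], bool]  # returns True when the question should be kept


def run_filters(
    questions: Iterable[str],
    stages: list[tuple[str, Predicate]],
) -> tuple[list[str], dict[str, int]]:
    """Apply keep-predicates in order and count how many items each stage drops."""
    dropped = {name: 0 for name, _ in stages}
    kept = []
    for q in questions:
        for name, keep in stages:
            if not keep(q):
                dropped[name] += 1
                break  # later stages never see a dropped question
        else:
            kept.append(q)
    return kept, dropped


# Toy stand-ins (assumptions) for the three filters described above.
stages = [
    ("domain", lambda q: "python" not in q.lower()),
    ("validity", lambda q: q.strip().endswith(("?", "."))),
    ("problem_type", lambda q: not q.lower().startswith(("prove", "true or false"))),
]

kept, dropped = run_filters(
    [
        "Compute 3 + 4.",
        "Write a Python loop.",
        "Prove that 2 is prime.",
        "Find x such that",
    ],
    stages,
)
```

Running cheaper filters first means the more expensive judges see fewer items, which is one plausible reason to order the stages this way.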
Filtration Statistics
| Pipeline Stage | Count | Percentage |
|---|---|---|
| Raw Collection | 11.4M | 100% |
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |
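The percentages in the table are relative to the raw pool. As a quick check, the sketch below recomputes them from the rounded counts and also derives the step-to-step retention, which shows how aggressive each stage is. Because the inputs are the rounded figures from the table, the results are approximations.

```python
# Rounded counts copied from the table above.
stage_counts = {
    "Raw Collection": 11_400_000,
    "Dedup & Decontamination": 4_300_000,
    "Question Filtering": 3_300_000,
    "Stage-1 Filtering": 815_300,
    "Stage-2 Filtering": 459_600,
}


def retention_vs_raw(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of the raw pool that survives up to each stage."""
    raw = next(iter(counts.values()))
    return {stage: round(100 * n / raw, 1) for stage, n in counts.items()}


def retention_vs_previous(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of the previous stage's output that each stage keeps."""
    names = list(counts)
    values = list(counts.values())
    return {
        names[i]: round(100 * values[i] / values[i - 1], 1)
        for i in range(1, len(values))
    }


vs_raw = retention_vs_raw(stage_counts)
vs_prev = retention_vs_previous(stage_counts)
for stage in stage_counts:
    print(f"{stage:26s} {vs_raw[stage]:6.1f}% of raw", vs_prev.get(stage, ""))
```

On these figures, Stage-1 keeps roughly a quarter of the filtered questions and Stage-2 keeps a little over half of what Stage-1 passes.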
Data Selection
Given the large curated pool, ODA-Math-460k retains problems that are hard for small models but solvable for stronger reasoning models.
Stage-1: Lower-Bound Filtering
Stage-1 removes trivial problems using Qwen3-8B in non-thinking mode: for each problem we sample k=4 responses, compute Pass@4 by matching each predicted final answer to y_gt, and keep the problem only if Pass@4(x) = 0 (i.e., none of four attempts is correct).
Stage-2: Upper-Bound Filtering
Stage-2 removes unsolvable or ambiguous problems using Qwen3-30B-A3B in thinking mode: we generate k=5 reasoning traces per problem, compute Pass@5, and keep the problem only if Pass@5(x) > 0 (i.e., at least one attempt solves it).
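The two stages combine into a single keep rule: Pass@4 = 0 for the weaker model and Pass@5 > 0 for the stronger one. The sketch below illustrates that rule under stated assumptions: `weak` and `strong` are hypothetical callables standing in for Qwen3-8B (non-thinking) and Qwen3-30B-A3B (thinking), and answers are compared by exact string match, whereas the actual answer-matching procedure is not described here.

```python
from typing import Callable

# A sampler takes (problem, k) and returns k predicted final answers.
Sampler = Callable[[str, int], list[str]]


def pass_at_k(predictions: list[str], y_gt: str) -> int:
    """1 if any prediction matches the ground truth, else 0 (exact match assumed)."""
    return int(any(p.strip() == y_gt.strip() for p in predictions))


def select(
    problems: list[tuple[str, str]],
    weak: Sampler,
    strong: Sampler,
) -> list[tuple[str, str]]:
    """Keep problems the weak model never solves and the strong model solves at least once."""
    selected = []
    for x, y_gt in problems:
        if pass_at_k(weak(x, 4), y_gt) != 0:    # Stage-1: drop trivial problems
            continue
        if pass_at_k(strong(x, 5), y_gt) > 0:   # Stage-2: drop unsolved problems
            selected.append((x, y_gt))
    return selected


# Toy samplers (assumptions) that mimic an easy, a medium, and a broken problem.
def weak(x: str, k: int) -> list[str]:
    return ["2"] * k if x == "easy" else ["0"] * k


def strong(x: str, k: int) -> list[str]:
    return ["7"] + ["0"] * (k - 1) if x == "medium" else ["0"] * k


problems = [("easy", "2"), ("medium", "7"), ("broken", "?")]
selected = select(problems, weak, strong)
```

Only the "medium" problem survives: the easy one is removed by Stage-1 and the broken one by Stage-2.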
Distillation & Verification
Response Synthesis
We distill solutions using AM-Thinking-v1 as the teacher, generating k=5 candidate reasoning traces (step-by-step solution + final answer) for each selected problem.
Response Verification
We verify generated responses with Compass-Verifier-7B, which takes (problem x, generated response y_gen, ground-truth answer y_gt) and outputs a binary correctness decision (correct / incorrect). We keep only the (problem, response) pairs judged correct and discard the rest, so the released dataset contains verified solutions only.
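Synthesis and verification together form a generate-then-filter loop. The sketch below assumes two hypothetical callables, `teacher` (standing in for AM-Thinking-v1) and `verifier` (standing in for Compass-Verifier-7B), and keeps every pair judged correct, as the text above describes; the toy implementations are for illustration only.

```python
from typing import Callable

Teacher = Callable[[str, int], list[str]]    # (problem, k) -> k candidate responses
Verifier = Callable[[str, str, str], bool]   # (x, y_gen, y_gt) -> judged correct?


def distill(
    problems: list[tuple[str, str]],
    teacher: Teacher,
    verifier: Verifier,
    k: int = 5,
) -> list[tuple[str, str]]:
    """Return the (problem, response) pairs that the verifier judges correct."""
    verified = []
    for x, y_gt in problems:
        for y_gen in teacher(x, k):
            if verifier(x, y_gen, y_gt):
                verified.append((x, y_gen))
    return verified


# Toy stand-ins (assumptions): the teacher guesses 0..k-1, and the verifier
# checks the final-answer suffix instead of calling a model.
def toy_teacher(x: str, k: int) -> list[str]:
    return [f"Reasoning for {x}. Answer: {i}" for i in range(k)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    return y_gen.endswith(f"Answer: {y_gt}")


pairs = distill([("p1", "3"), ("p2", "9")], toy_teacher, toy_verifier)
```

In this toy run, `p1` contributes one verified pair and `p2` contributes none, so a problem with no correct candidate drops out of the final set entirely.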
Training Data Source Composition
ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:
| Source | Count | Percentage |
|---|---|---|
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |
Content Characteristics
Subject Distribution
ODA-Math-460k maintains a more balanced subject composition than several peers:
- Algebra remains substantial (~44.8%).
- Geometry accounts for roughly 20-22%.
- Calculus, Discrete Math & Probability, and Number Theory each account for about 11%.
This mitigates subject bias and reduces performance drops on underrepresented topics.
Difficulty Distribution
In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a 1-10 scale mapped to AoPS ratings.
| Level | Equivalent Competition Tier | Description |
|---|---|---|
| 1 | Elementary / Middle School | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| 2 | Junior High | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| 3 | High School Beginner | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| 4 | High School Intermediate | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| 5 | Advanced High School | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| 6 | Pre-Olympiad | AIME (Hard), USAJMO. Introductory Olympiad level. |
| 7 | Olympiad (Entry) | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| 8 | Olympiad (Medium) | IMO (Medium/Hard). High-level competition problems. |
| 9 | Olympiad (Expert) | IMO (Hard). Expert-level constructions/proofs. |
| 10 | Historically Hard | Outliers. Exceedingly tedious or difficult even for Olympians. |
ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:
- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.
Performance
ODA-Math-460k is evaluated as an SFT corpus for Qwen3-8B-Base.
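Fine-tuning on ODA-Math-460k raises the nine-benchmark average from 32.0 (base checkpoint) to 68.8, the highest average among the SFT corpora listed below. The largest gains are on competition-style benchmarks, for example AIME'24 (6.7 to 67.9) and AIME'25 (10.8 to 63.3).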
| Dataset | Size | GSM8K | Math500 | Omni-Math | Olympiad | AIME'24 | AIME'25 | CMIMC'25 | HMMT'25 | BRUMO'25 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base model: Qwen3-8B-Base | | | | | | | | | | | |
| Qwen3-8B-Base | - | 92.0 | 79.6 | 30.6 | 47.2 | 6.7 | 10.8 | 4.7 | 0.0 | 16.7 | 32.0 |
| LIMO | 817 | 83.9 | 69.0 | 21.8 | 31.3 | 12.5 | 8.8 | 2.2 | 1.7 | 13.8 | 27.2 |
| MegaScience (math) | 414k | 93.4 | 84.8 | 35.8 | 57.6 | 25.4 | 17.9 | 11.3 | 12.1 | 33.8 | 41.3 |
| Fast-Math-R1-SFT | 8k | 92.8 | 86.6 | 39.6 | 61.0 | 28.8 | 25.8 | 14.1 | 13.3 | 34.2 | 44.0 |
| Light-R1-SFT | 79k | 93.8 | 92.6 | 48.5 | 69.7 | 54.6 | 31.3 | 22.8 | 25.0 | 48.8 | 54.1 |
| SYNTHETIC-2 (math) | 50k | 93.9 | 93.8 | 58.8 | 71.5 | 58.8 | 45.8 | 28.4 | 32.9 | 54.2 | 59.8 |
| MiroMind-M1-SFT | 719k | 94.8 | 96.8 | 54.5 | 77.0 | 62.9 | 47.5 | 25.6 | 27.5 | 60.4 | 60.8 |
| OmniThought-0528 | 365k | 94.2 | 95.4 | 59.0 | 74.9 | 67.9 | 45.4 | 31.3 | 35.8 | 52.5 | 61.8 |
| AM-Thinking (math) | 558k | 95.2 | 95.6 | 64.5 | 77.5 | 65.8 | 54.6 | 36.3 | 41.3 | 62.5 | 65.9 |
| ODA-Math | 460k | 94.3 | 96.0 | 66.9 | 76.3 | 67.9 | 63.3 | 41.6 | 45.4 | 67.5 | 68.8 |
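The AVG column is consistent with an unweighted mean over the nine benchmark columns. That reading is an assumption, checked below against three rows copied from the table; the same snippet also derives the per-benchmark gain of ODA-Math over the base checkpoint.

```python
from statistics import mean

BENCHMARKS = [
    "GSM8K", "Math500", "Omni-Math", "Olympiad", "AIME'24",
    "AIME'25", "CMIMC'25", "HMMT'25", "BRUMO'25",
]

# Scores copied from the table above, in BENCHMARKS order.
scores = {
    "Qwen3-8B-Base": [92.0, 79.6, 30.6, 47.2, 6.7, 10.8, 4.7, 0.0, 16.7],
    "AM-Thinking (math)": [95.2, 95.6, 64.5, 77.5, 65.8, 54.6, 36.3, 41.3, 62.5],
    "ODA-Math": [94.3, 96.0, 66.9, 76.3, 67.9, 63.3, 41.6, 45.4, 67.5],
}

# Unweighted mean per row, rounded to one decimal like the table.
avg = {name: round(mean(row), 1) for name, row in scores.items()}

# Absolute gain of ODA-Math over the base checkpoint on each benchmark.
gain = {
    bench: round(tuned - base, 1)
    for bench, tuned, base in zip(
        BENCHMARKS, scores["ODA-Math"], scores["Qwen3-8B-Base"]
    )
}

print(avg)
print(max(gain, key=gain.get), min(gain, key=gain.get))
```

The smallest gain is on GSM8K, where the base checkpoint is already at 92.0, and the largest is on AIME'24.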
About OpenDataArena
OpenDataArena is an open research platform dedicated to discovering, evaluating, and advancing high-quality datasets for AI post-training. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
Key Features:
- Dataset Leaderboard: helps researchers identify the most valuable and high-quality datasets across different domains.
- Detailed Evaluation Scores: provides comprehensive metrics to assess data quality, complexity, difficulty, etc.
- Data Processing Toolkit: OpenDataArena-Tool offers an open-source pipeline for dataset curation and scoring.
If you find our work helpful, please consider starring and subscribing to support our research.
Usage
Model repo: OpenDataArena/Qwen3-8B-ODA-Math-460k. Below is a minimal runnable example for loading and inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen3-8B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]

# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response with nucleus sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# The decoded string contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Citation
@article{cai2025opendataarena,
title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
journal={arXiv preprint arXiv:2512.14051},
year={2025}
}