Qwen2.5-7B-ODA-Math-460k


Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on Qwen2.5-7B-Base and trained on ODA-Math-460k.

ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the OpenDataArena leaderboard) and refined through deduplication, benchmark decontamination, LLM-based filtering, and verifier-backed response distillation.
It targets a "learnable but challenging" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.


🧠 Model Summary

  • Base Model: Qwen/Qwen2.5-7B-Base
  • Training Data: OpenDataArena/ODA-Math-460k
  • Domain Coverage: Mathematics (strictly filtered)
  • Scale (selected training set): ~460K problems (after selection and verification pipeline)
  • Goal: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

⚙️ Training Data Curation Pipeline

ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.

1️⃣ Data Collection

We prioritize source datasets based on their empirical impact on downstream model performance. Using the OpenDataArena leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the Qwen and Llama model families. These sources form the initial pool for ODA-Math.

2️⃣ Deduplication & Decontamination

We first perform exact deduplication over all questions to remove identical items, and then run benchmark decontamination to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
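
As a rough illustration of this step: the sketch below hashes normalized question text for exact deduplication and drops training questions that share an n-gram with any benchmark question. The normalization and the 13-gram window are assumptions (a common convention); the card does not specify the actual matching scheme.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    return " ".join(text.lower().split())

def exact_dedup(questions: list[str]) -> list[str]:
    seen, kept = set(), []
    for q in questions:
        h = hashlib.sha256(normalize(q).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(q)
    return kept

def ngrams(text: str, n: int = 13) -> set:
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(questions: list[str], benchmark_questions: list[str], n: int = 13) -> list[str]:
    # Drop any training question whose n-grams overlap a benchmark item.
    bench = set().union(*(ngrams(b, n) for b in benchmark_questions))
    return [q for q in questions if not (ngrams(q, n) & bench)]
```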

3️⃣ Question Filtering (Quality & Suitability)

A multi-stage filtering pipeline refines domain specificity and usability (a sketch of the LLM-based checks follows this list):

  • an LLM-based domain classifier removes out-of-domain items such as coding and general-instruction tasks;
  • an LLM-based validity checker removes ill-formed questions with missing premises or undefined notation;
  • problem-type filtering (via the Big Math toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false.

What remains is predominantly free-form problems with objectively verifiable answers.
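
For illustration only, the snippet below shows one common way to wire up such LLM-based checks. The prompt wording, the KEEP/DROP protocol, and the `judge_fn` endpoint are all assumptions; the actual prompts and judge models used for ODA-Math are not published in this card.

```python
DOMAIN_PROMPT = (
    "Is the following item a mathematics problem (not coding, not a general "
    "instruction task)? Answer KEEP or DROP.\n\nItem:\n{question}"
)
VALIDITY_PROMPT = (
    "Does the following math problem state all premises and define all notation, "
    "so that it is answerable as written? Answer KEEP or DROP.\n\nProblem:\n{question}"
)

def llm_filter(questions, judge_fn):
    """Keep a question only if every LLM check answers KEEP.

    judge_fn(prompt) -> str is a placeholder for whichever LLM endpoint
    serves as the classifier/validator.
    """
    kept = []
    for q in questions:
        verdicts = (
            judge_fn(DOMAIN_PROMPT.format(question=q)),
            judge_fn(VALIDITY_PROMPT.format(question=q)),
        )
        if all(v.strip().upper().startswith("KEEP") for v in verdicts):
            kept.append(q)
    return kept
```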

📊 Filtration Statistics

| Pipeline Stage | Count | Percentage |
| --- | --- | --- |
| Raw Collection | 11.4M | 100% |
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |

🎯 Data Selection

From this large curated pool, ODA-Math-460k retains problems that are hard for small models but solvable by stronger reasoning models.

Stage-1: Lower-Bound Filtering

Stage-1 removes trivial problems using Qwen3-8B in non-thinking mode: for each problem we sample k=4 responses, compute Pass@4 by matching each predicted final answer against y_gt, and keep the problem only if Pass@4(x) = 0 (i.e., none of the four attempts is correct).
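
Written out explicitly (our notation, restating the rule above), the empirical pass rate over $k$ sampled responses $\hat{y}_1, \dots, \hat{y}_k$ is

$$
\mathrm{Pass@}k(x) = \max_{1 \le i \le k} \mathbb{1}\left[\operatorname{ans}(\hat{y}_i) = y_{\mathrm{gt}}\right],
$$

so Stage-1 keeps exactly the problems with Pass@4(x) = 0, and Stage-2 below keeps those with Pass@5(x) > 0.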

Stage-2: Upper-Bound Filtering

Stage-2 removes unsolvable or ambiguous problems using Qwen3-30B-A3B in thinking mode: we generate k=5 reasoning traces per problem, compute Pass@5, and keep the problem only if Pass@5(x) > 0 (i.e., at least one attempt solves it).
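
Together, the two stages carve out the "learnable but challenging" band. Below is a minimal sketch of both filters, assuming helper callables that are not specified in this card (`sample_fn(problem, k)` queries a probe model and returns k extracted final answers; `match_fn` is an answer-equivalence check):

```python
def pass_at_k(problem, y_gt, k, sample_fn, match_fn) -> int:
    """Empirical Pass@k: 1 if any of k sampled final answers matches y_gt, else 0."""
    answers = sample_fn(problem, k)
    return int(any(match_fn(a, y_gt) for a in answers))

def select(problems, sample_small, sample_large, match_fn):
    """Two-stage selection: too hard for the small probe, solvable by the large one.

    sample_small / sample_large stand in for Qwen3-8B (non-thinking) and
    Qwen3-30B-A3B (thinking); the probe models come from the card, but these
    function signatures are assumptions.
    """
    kept = []
    for x, y_gt in problems:
        if pass_at_k(x, y_gt, 4, sample_small, match_fn):
            continue  # Stage-1: drop anything the small model already solves
        if pass_at_k(x, y_gt, 5, sample_large, match_fn):
            kept.append((x, y_gt))  # Stage-2: keep only if the strong model solves it
    return kept
```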


✅ Distillation & Verification

🧪 Response Synthesis

We distill solutions using AM-Thinking-v1 as the teacher, generating k=5 candidate reasoning traces (step-by-step solution + final answer) for each selected problem.

πŸ” Response Verification

We verify generated responses with Compass-Verifier-7B, which takes (problem x, generated response y_gen, ground-truth answer y_gt) and outputs a binary correctness decision (correct / incorrect). We keep only the (problem, response) pairs judged correct and discard the rest, so the released dataset contains verified solutions only.
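
In sketch form, synthesis and verification reduce to a generate-then-check loop. The teacher (AM-Thinking-v1) and verifier (Compass-Verifier-7B) are named in this card, but the callable signatures below are placeholders rather than the models' actual APIs:

```python
def distill_and_verify(problems, teacher_fn, verifier_fn, k: int = 5):
    """Sample k teacher traces per problem and keep only verified pairs.

    teacher_fn(problem, k) -> list of candidate step-by-step solutions;
    verifier_fn(problem, response, y_gt) -> bool correctness decision.
    """
    dataset = []
    for x, y_gt in problems:
        for y_gen in teacher_fn(x, k):
            if verifier_fn(x, y_gen, y_gt):  # binary correct/incorrect judgment
                dataset.append({"problem": x, "response": y_gen})
    return dataset
```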


📚 Training Data Source Composition

ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:

| Source | Count | Percentage |
| --- | --- | --- |
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |

🔬 Content Characteristics

📘 Subject Distribution


ODA-Math-460k maintains a more balanced subject composition than several peers:

  • Algebra remains substantial (~44.8%),
  • Geometry roughly 20–22%,
  • Calculus, Discrete Math & Probability, and Number Theory each around ~11%.

This mitigates subject bias and reduces performance drops on underrepresented topics.

📉 Difficulty Distribution

In addition to model-based pass rates, we adopt LLM-as-Judge difficulty estimation on a 1–10 scale, mapped to AoPS competition ratings as follows.

| Level | Equivalent Competition Tier | Description |
| --- | --- | --- |
| 1 | Elementary / Middle School | MOEMS, AMC 8 (early questions). Standard word problems. |
| 2 | Junior High | AMC 8 (hard), AMC 10 (early). Complex word problems. |
| 3 | High School Beginner | AMC 10 (mid), AMC 12 (early). Requires creative thinking. |
| 4 | High School Intermediate | AMC 12 (mid), AIME (early). Intermediate complexity. |
| 5 | Advanced High School | AIME (mid), JBMO. Simple proof-based Olympiad style. |
| 6 | Pre-Olympiad | AIME (hard), USAJMO. Introductory Olympiad level. |
| 7 | Olympiad (Entry) | IMO (easy/medium), USAMO. Requires technical knowledge. |
| 8 | Olympiad (Medium) | IMO (medium/hard). High-level competition problems. |
| 9 | Olympiad (Expert) | IMO (hard). Expert-level constructions/proofs. |
| 10 | Historically Hard | Outliers. Exceedingly tedious or difficult even for Olympians. |
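
For illustration, a difficulty-judge call might look like the sketch below; the prompt wording and the parsing are assumptions, not the actual protocol behind these statistics.

```python
import re

DIFFICULTY_PROMPT = (
    "Rate the difficulty of this math problem on a 1-10 scale aligned with AoPS "
    "competition tiers (1 = elementary word problem, 10 = historically hard "
    "Olympiad problem). Reply with a single integer.\n\nProblem:\n{question}"
)

def estimate_difficulty(question, judge_fn):
    """Return the first integer 1-10 in the judge's reply, or None if unparsable."""
    reply = judge_fn(DIFFICULTY_PROMPT.format(question=question))
    m = re.search(r"\b(10|[1-9])\b", reply)
    return int(m.group(1)) if m else None
```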

ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

  • Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
  • Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
  • Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

📈 Performance

ODA-Math-460k is evaluated as an SFT corpus for Qwen2.5-7B-Base.

Results show consistent gains over the base checkpoint, with particularly strong improvements on competition-style benchmarks.

Performance Comparison. Best scores in **bold**, second-best <u>underlined</u>.

| Dataset | Size | GSM8K | Math500 | Omni-Math | Olympiad | AIME'24 | AIME'25 | CMIMC'25 | HMMT'25 | BRUMO'25 | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Base | - | 80.0 | 50.2 | 26.0 | 35.9 | 6.7 | 6.7 | 10.0 | 0.0 | 20.0 | 26.2 |
| LIMO | 817 | 92.1 | 66.8 | 21.6 | 34.9 | 4.6 | 1.7 | 0.0 | 0.0 | 5.4 | 25.2 |
| OpenMathInstruct-2 | 1M | 91.6 | 65.9 | 22.5 | 30.7 | 6.7 | 5.0 | 5.0 | 0.0 | 13.6 | 26.8 |
| MegaScience (math) | 414k | 90.1 | 77.8 | 28.7 | 44.5 | 16.7 | 15.0 | 8.1 | 0.0 | 26.7 | 34.2 |
| Fast-Math-R1-SFT | 8k | 90.6 | 80.0 | 35.8 | 50.3 | 23.3 | 26.7 | 7.5 | 8.3 | 31.7 | 39.4 |
| DeepMath-103K | 103k | 92.1 | 92.0 | 45.4 | 60.2 | 34.2 | 31.7 | 10.0 | 11.7 | 15.0 | 43.6 |
| Light-R1-SFT | 79k | 92.0 | 88.0 | 43.3 | 60.2 | 38.3 | 26.7 | 22.5 | 13.3 | 38.3 | 47.0 |
| SYNTHETIC-2 (math) | 50k | 92.1 | 90.0 | 54.5 | 67.4 | 45.0 | 35.0 | 19.7 | 20.0 | 36.7 | 51.2 |
| MiroMind-M1-SFT | 719k | <u>93.9</u> | 91.6 | 48.1 | 66.3 | 55.0 | 30.0 | 27.5 | 18.3 | 50.0 | 53.4 |
| OmniThought-0528 | 365k | 93.2 | 89.8 | 54.3 | 68.1 | 50.4 | 40.0 | 25.0 | 28.3 | 45.0 | 54.9 |
| OpenThoughts3 | 1.2M | 91.7 | 93.8 | 44.8 | 68.8 | <u>60.0</u> | 45.0 | 27.5 | 31.7 | 50.0 | 57.0 |
| AM-Thinking (math) | 558k | 92.9 | **96.2** | <u>60.6</u> | **74.2** | **63.3** | <u>50.0</u> | <u>27.8</u> | <u>36.7</u> | **63.3** | <u>62.8</u> |
| ODA-Math | 460k | **94.3** | <u>95.4</u> | **62.6** | <u>70.9</u> | 56.7 | **56.7** | **35.0** | **45.0** | <u>60.0</u> | **64.1** |

🌐 About OpenDataArena

OpenDataArena is an open research platform dedicated to discovering, evaluating, and advancing high-quality datasets for AI post-training. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

Key Features:

  • πŸ† Dataset Leaderboard β€” helps researchers identify the most valuable and high-quality datasets across different domains.
  • πŸ“Š Detailed Evaluation Scores β€” provides comprehensive metrics to assess data quality, complexity, difficulty etc.
  • 🧰 Data Processing Toolkit β€” OpenDataArena-Tool offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider ⭐ starring and subscribing to support our research.


🚀 Usage

Model repo: OpenDataArena/Qwen2.5-7B-ODA-Math-460k. Below is a minimal runnable example for loading and inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" spreads weights across available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

# Format the conversation with the model's chat template before tokenizing.
messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sampled decoding; lowering temperature (or setting do_sample=False) gives more deterministic answers.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

📚 Citation

@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}