What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Abstract
Data curation for multimodal reasoning shows that difficulty-based example selection on aligned datasets drives performance gains, while increasing dataset size mainly reduces variance and synthetic augmentation heuristics often degrade performance.
We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
Community
Some of the key observations from the post-competition ablations are:
i. Difficulty-based example selection is the dominant driver of performance:
Selecting challenging but learnable examples yields the largest gains in multimodal reasoning accuracy, outperforming other curation strategies (a minimal sketch of this kind of filter follows the list).
ii. Increasing dataset size does not reliably improve mean accuracy:
Once a well-aligned base dataset is chosen, larger datasets mainly reduce run-to-run variance rather than boosting average performance.
iii. Data curation operates in a saturation regime:
Most of the performance improvement comes from a relatively small number of carefully curated examples, with diminishing returns from adding more data.
iv. Common diversity heuristics provide little or no benefit:
Techniques such as clustering-based diversity, category balancing, and synthetic augmentation often fail to improve performance and can even degrade accuracy (the clustering heuristic is sketched after the list).
v. Alignment between dataset, benchmark, and base model is crucial:
Strong alignment amplifies the effectiveness of difficulty filtering and explains why compact, well-aligned datasets can outperform larger but less aligned ones.
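No code is included here, so the following is only a minimal sketch of what a difficulty-based filter of this kind could look like: sample several answers per example from the base model, estimate a pass rate, and keep examples the model solves only sometimes. The function name, the `model_answers` structure, and the 0.1/0.9 thresholds are illustrative assumptions, not the authors' implementation.

```python
def difficulty_filter(examples, model_answers, low=0.1, high=0.9):
    """Keep examples the base model solves sometimes, but not always.

    examples: list of dicts with an "id" and a ground-truth "answer".
    model_answers: dict mapping example id -> list of sampled model answers
    (this structure is assumed for illustration).
    """
    selected = []
    for ex in examples:
        answers = model_answers[ex["id"]]
        pass_rate = sum(a == ex["answer"] for a in answers) / len(answers)
        # "Challenging but learnable": drop examples the model always solves
        # (too easy) and those it never solves (too hard or mislabeled).
        if low <= pass_rate < high:
            selected.append({**ex, "pass_rate": pass_rate})
    # Hardest first, so a fixed curation budget keeps the most difficult ones.
    selected.sort(key=lambda e: e["pass_rate"])
    return selected
```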
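For contrast, one common form of the clustering-based diversity heuristic mentioned in (iv) is to k-means-cluster example embeddings and keep the example nearest each centroid. This scikit-learn sketch shows the typical shape of such a heuristic; the embedding source and budget are assumptions, and it is not claimed to be the exact baseline the paper evaluated.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_select(embeddings, budget, seed=0):
    """Pick up to `budget` examples spread across embedding space.

    embeddings: (n, d) array, e.g., from a frozen vision-language encoder
    (the encoder choice is an assumption here).
    """
    kmeans = KMeans(n_clusters=budget, random_state=seed, n_init="auto")
    kmeans.fit(embeddings)
    picked = set()
    for center in kmeans.cluster_centers_:
        # Keep the real example closest to each centroid; two centroids can
        # share a nearest point, so the result may fall slightly under budget.
        picked.add(int(np.argmin(np.linalg.norm(embeddings - center, axis=1))))
    return sorted(picked)
```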
The following similar papers were recommended by the Semantic Scholar API:
- AIR: Post-training Data Selection for Reasoning via Attention Head Influence (2025)
- Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning (2025)
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026)
- Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation (2026)
- Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling (2026)
- Self-Improving VLM Judges Without Human Annotations (2025)
- SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models (2026)