GAIR/daVinci-Dev-72B

@@ -19,7 +19,7 @@ library_name: transformers
 <div align="center">
 [![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
-[![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon-b31b1b.svg)](https://arxiv.org/pdf/)
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/GAIR/daVinci-Dev-72B)
@@ -53,8 +53,8 @@ This work presents a systematic study of **agentic mid-training** and introduces
 Our training uses two complementary trajectory types (details in the paper):
-- **Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
-- **Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
 Resources (open-source / open-release):
@@ -95,11 +95,11 @@ We will open-source our datasets through Hugging Face:
 ## Pipeline
-The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.
 | Pipeline | Description | Link |
 |----------|---------|-------------|
-| daVinci-Dev Pipeline | a high-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |
 ## Quick Start
@@ -204,4 +204,16 @@ Users are responsible for ensuring their downstream usage complies with the lice
 ## Citation
-ArXiv link and the official citation block are coming soon (the manuscript is under review at the time of release).

 <div align="center">
 [![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
+[![arXiv](https://img.shields.io/badge/arXiv-2601.18418-b31b1b.svg)](https://arxiv.org/pdf/2601.18418)
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/GAIR/daVinci-Dev-72B)
 Our training uses two complementary trajectory types (details in the paper):
+- **Contextually-native trajectories \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
+- **Environmentally-native trajectories \\(\mathcal{D}^{\text{env}}_{\text{pass}}\\) (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
 Resources (open-source / open-release):
 ## Pipeline
+The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\).
 | Pipeline | Description | Link |
 |----------|---------|-------------|
+| daVinci-Dev Pipeline | a high-performance pipeline used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |
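The README only says that the pipeline calls the GitHub API and builds a structured PR representation; it does not spell out the schema. As a rough illustration, the sketch below shows one way such a record could be assembled in Python from JSON that has already been fetched. The `FileChange` and `PullRequestRecord` classes and the `build_record` helper are hypothetical names introduced here, not the repository's actual code. The input shapes are assumed to follow the GitHub REST responses for a pull request and for its file list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FileChange:
    """One changed file in a pull request (hypothetical schema)."""
    path: str
    status: str  # e.g. "added", "modified", "removed"
    patch: str   # unified diff text; empty when the API omits it


@dataclass
class PullRequestRecord:
    """Illustrative PR record; the real pipeline's fields may differ."""
    number: int
    title: str
    description: str
    base_sha: str
    changes: list[FileChange] = field(default_factory=list)


def build_record(pr: dict[str, Any], files: list[dict[str, Any]]) -> PullRequestRecord:
    """Combine a pull-request object and its file list into one record.

    Assumes `pr` resembles GET /repos/{owner}/{repo}/pulls/{number} and
    `files` resembles GET /repos/{owner}/{repo}/pulls/{number}/files.
    """
    changes = [
        FileChange(path=f["filename"], status=f["status"], patch=f.get("patch", ""))
        for f in files
    ]
    return PullRequestRecord(
        number=pr["number"],
        title=pr["title"],
        description=pr.get("body") or "",  # the body can be null
        base_sha=pr["base"]["sha"],
        changes=changes,
    )


# Made-up sample data standing in for API responses.
sample_pr = {"number": 7, "title": "Fix off-by-one", "body": None, "base": {"sha": "abc123"}}
sample_files = [{"filename": "src/util.py", "status": "modified", "patch": "@@ -1 +1 @@\n-x\n+y"}]
record = build_record(sample_pr, sample_files)
print(record.changes[0].path)
```

Keeping the fetch step separate from `build_record` means the structuring logic can be exercised without network access or an API token.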
 ## Quick Start
 ## Citation
+If you use this work, please cite the daVinci-Dev paper.
+```
+@misc{zeng2026davincidevagentnativemidtrainingsoftware,
+      title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
+      author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
+      year={2026},
+      eprint={2601.18418},
+      archivePrefix={arXiv},
+      primaryClass={cs.SE},
+      url={https://arxiv.org/abs/2601.18418},
+}
+```
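
Besides the arXiv badge and the citation block, the change in this diff is the switch of inline math delimiters from `$...$` to `\\(...\\)`. The following is a minimal Python sketch of that rewrite, assuming each math span sits on a single line and the text contains no `$$` display math or escaped dollar signs. `dollars_to_parens` is a hypothetical helper written for illustration; it is not part of the repository.

```python
import re

# Matches a single-dollar inline math span on one line: $ ... $
# Assumption: no escaped dollars and no $$ display math in the input.
_INLINE_MATH = re.compile(r"\$([^$\n]+)\$")


def dollars_to_parens(text: str) -> str:
    r"""Rewrite $...$ spans as \\(...\\), matching the + lines of the diff."""
    # A function replacement avoids backslash-escape processing in re.sub.
    return _INLINE_MATH.sub(lambda m: r"\\(" + m.group(1) + r"\\)", text)


old_line = r"used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$."
new_line = dollars_to_parens(old_line)
print(new_line)
```

Applied to the removed Pipeline sentence, this produces the same delimiters as the added line.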