Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -19,7 +19,7 @@ library_name: transformers
|
|
| 19 |
<div align="center">
|
| 20 |
|
| 21 |
[](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
|
| 22 |
-
[](https://github.com/GAIR-NLP/daVinci-Dev)
|
| 24 |
[](https://huggingface.co/datasets/GAIR/daVinci-Dev)
|
| 25 |
[](https://huggingface.co/GAIR/daVinci-Dev-72B)
|
|
@@ -53,8 +53,8 @@ This work presents a systematic study of **agentic mid-training** and introduces
|
|
| 53 |
|
| 54 |
Our training uses two complementary trajectory types (details in the paper):
|
| 55 |
|
| 56 |
-
- **Contextually-native trajectories
|
| 57 |
-
- **Environmentally-native trajectories
|
| 58 |
|
| 59 |
Resources (open-source / open-release):
|
| 60 |
|
|
@@ -95,11 +95,11 @@ We will open-source our datasets through Hugging Face:
|
|
| 95 |
|
| 96 |
## Pipeline
|
| 97 |
|
| 98 |
-
The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build
|
| 99 |
|
| 100 |
| Pipeline | Description | Link |
|
| 101 |
|----------|---------|-------------|
|
| 102 |
-
| daVinci-Dev Pipeline | a high-performance pipeline used to build
|
| 103 |
|
| 104 |
## Quick Start
|
| 105 |
|
|
@@ -204,4 +204,16 @@ Users are responsible for ensuring their downstream usage complies with the lice
|
|
| 204 |
|
| 205 |
## Citation
|
| 206 |
|
| 207 |
-
|
|
| 19 |
<div align="center">
|
| 20 |
|
| 21 |
[](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
|
| 22 |
+
[](https://arxiv.org/pdf/2601.18418)
|
| 23 |
[](https://github.com/GAIR-NLP/daVinci-Dev)
|
| 24 |
[](https://huggingface.co/datasets/GAIR/daVinci-Dev)
|
| 25 |
[](https://huggingface.co/GAIR/daVinci-Dev-72B)
|
|
|
|
| 53 |
|
| 54 |
Our training uses two complementary trajectory types (details in the paper):
|
| 55 |
|
| 56 |
+
- **Contextually-native trajectories \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
|
| 57 |
+
- **Environmentally-native trajectories \\(\mathcal{D}^{\text{env}}_{\text{pass}}\\) (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
|
| 58 |
|
| 59 |
Resources (open-source / open-release):
|
| 60 |
|
|
|
|
| 95 |
|
| 96 |
## Pipeline
|
| 97 |
|
| 98 |
+
The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\).
|
| 99 |
|
| 100 |
| Pipeline | Description | Link |
|
| 101 |
|----------|---------|-------------|
|
| 102 |
+
| daVinci-Dev Pipeline | a high-performance pipeline used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |
|
| 103 |
|
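As a rough illustration of what a "structured PR representation" could look like, the sketch below assembles one record from JSON that has already been fetched from the GitHub REST API. It is not the repository's actual implementation: the `PullRequestRecord` and `FileChange` classes and the `build_pr_record` helper are hypothetical names, and the only assumption about the API is that the pull request payload carries `number`, `title`, `body`, and `base.sha`, while each entry of the pull request files listing carries `filename`, `status`, and an optional `patch`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FileChange:
    """One changed file in a pull request (hypothetical schema)."""
    path: str
    status: str   # e.g. "added", "modified", "removed"
    patch: str    # unified diff text; empty when the API omits it


@dataclass
class PullRequestRecord:
    """Structured view of a pull request (hypothetical schema)."""
    number: int
    title: str
    description: str
    base_sha: str
    files: list[FileChange] = field(default_factory=list)


def build_pr_record(pr: dict[str, Any], files: list[dict[str, Any]]) -> PullRequestRecord:
    """Combine a PR payload and its file listing into one record.

    Both arguments are plain dicts as decoded from the API's JSON, so this
    function performs no network access itself.
    """
    changes = [
        FileChange(
            path=f["filename"],
            status=f["status"],
            # Binary or very large files may come back without a patch.
            patch=f.get("patch") or "",
        )
        for f in files
    ]
    return PullRequestRecord(
        number=pr["number"],
        title=pr["title"],
        # The body is null when the PR has no description.
        description=pr.get("body") or "",
        base_sha=pr["base"]["sha"],
        files=changes,
    )
```

Keeping the record builder separate from the HTTP layer means it can be exercised on saved JSON fixtures without hitting API rate limits.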
| 104 |
## Quick Start
|
| 105 |
|
|
|
|
| 204 |
|
| 205 |
## Citation
|
| 206 |
|
| 207 |
+
If you use this work, please cite the daVinci-Dev paper.
|
| 208 |
+
|
| 209 |
+
```
|
| 210 |
+
@misc{zeng2026davincidevagentnativemidtrainingsoftware,
|
| 211 |
+
title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
|
| 212 |
+
author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
|
| 213 |
+
year={2026},
|
| 214 |
+
eprint={2601.18418},
|
| 215 |
+
archivePrefix={arXiv},
|
| 216 |
+
primaryClass={cs.SE},
|
| 217 |
+
url={https://arxiv.org/abs/2601.18418},
|
| 218 |
+
}
|
| 219 |
+
```
|