---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# PasoDoble: Better LLM Reasoning via Dual-Play

This repository hosts models developed within the **PasoDoble** framework, a dual-play approach for LLMs presented in the paper [Better LLM Reasoning via Dual-Play](https://huggingface.co/papers/2511.11881).

PasoDoble improves the reasoning performance of Large Language Models (LLMs) by adversarially training two models: a "Proposer", which generates challenging questions with ground-truth answers, and a "Solver", which attempts to solve them. The two models iteratively learn from each other, sustaining competition and mutual evolution while reducing reliance on external supervision.
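
The adversarial incentives described above can be sketched as a toy example. The function names and the exact reward shaping below are illustrative assumptions for intuition only, not the paper's implementation:

```python
def proposer_reward(is_valid: bool, solver_correct: bool) -> float:
    """Toy reward for the Proposer: valid questions that the Solver
    fails on are most valuable; invalid questions are penalized."""
    if not is_valid:
        return -1.0  # malformed or unanswerable question
    return 1.0 if not solver_correct else 0.0  # reward difficulty

def solver_reward(solver_correct: bool) -> float:
    """Toy reward for the Solver: correctness only."""
    return 1.0 if solver_correct else 0.0

# A valid question the Solver misses rewards the Proposer...
print(proposer_reward(is_valid=True, solver_correct=False))  # 1.0
# ...while answering correctly rewards the Solver instead.
print(solver_reward(solver_correct=True))  # 1.0
```

Because each model's reward depends on the other's behavior, both must keep improving; the paper additionally updates them jointly to prevent reward hacking.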

**Project Page**: https://hcy123902.github.io/PasoDoble
**Code Repository**: https://github.com/HCY123902/PasoDoble

## Abstract Summary

PasoDoble addresses LLMs' reliance on external supervision by introducing a dual-play adversarial learning framework: a Proposer is trained to generate challenging questions with ground-truth answers, and a Solver is trained to solve them. The Proposer is rewarded for producing valid, difficult questions, while the Solver is rewarded for correct answers; both are updated jointly to prevent reward hacking. An optional offline paradigm further enhances training stability. This dual-play approach improves LLM reasoning performance without external supervision during training.

## Setup

To explore PasoDoble's core implementation and reproduce the experiments, follow these setup steps:

```bash
conda create -n pasodoble python=3.10.16
conda activate pasodoble

git clone https://github.com/PasoDoble-Cornell/PasoDoble.git
cd PasoDoble
pip install -r requirements.txt

# Install flash-attention separately
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# (Optional) If your binutils version is lower than 2.38, upgrade with:
conda install -c conda-forge binutils=2.40

mkdir history_record
```
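
After setup, a quick sanity check confirms the key packages are importable. This small helper is a hypothetical convenience, not part of the repository:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# flash_attn is the easiest to miss, since it is installed separately
for pkg in ("torch", "transformers", "flash_attn"):
    print(f"{pkg}: {'ok' if has_package(pkg) else 'MISSING'}")
```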

## Sample Usage

PasoDoble models can be loaded with the `transformers` library for text generation. The example below uses the `PasoDoble-Cornell/Qwen2.5-3b-solver-online` model; replace `model_id` with the specific checkpoint you intend to use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PasoDoble-Cornell/Qwen2.5-3b-solver-online"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens, dropping the prompt
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

For more details on training and advanced usage, please refer to the [official GitHub repository](https://github.com/HCY123902/PasoDoble).

## Trained Checkpoints

The following PasoDoble Solver checkpoints are available:

| **Model** | **Training** | **Download** |
| :------------: | :------------: | :------------: |
| PasoDoble Qwen2.5-0.5B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-0.5b-solver-online-new) |
| PasoDoble Qwen2.5-0.5B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-0.5b-solver-offline) |
| PasoDoble Qwen2.5-1.5B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-1.5b-solver-online) |
| PasoDoble Qwen2.5-1.5B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-1.5b-solver-offline) |
| PasoDoble Qwen2.5-3B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-3b-solver-online) |
| PasoDoble Qwen2.5-3B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-3b-solver-offline) |
| PasoDoble Qwen3-0.6B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-0.6b-solver-online) |
| PasoDoble Qwen3-0.6B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-0.6b-solver-offline) |
| PasoDoble Qwen3-1.7B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-1.7b-solver-online) |
| PasoDoble Qwen3-1.7B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-1.7b-solver-offline) |
| PasoDoble Qwen3-4B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-4b-solver-online) |
| PasoDoble Qwen3-4B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-4b-solver-offline) |
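
The repository ids above follow a regular naming pattern, which makes it easy to pick a checkpoint programmatically. The helper below is a hypothetical convenience inferred from the table, not part of the repository:

```python
def checkpoint_id(base: str, size: str, mode: str) -> str:
    """Build a Hub repo id such as 'PasoDoble-Cornell/Qwen2.5-3b-solver-online'.

    `base` is the model family (e.g. "Qwen2.5" or "Qwen3"), `size` the
    lowercase parameter count (e.g. "3b"), and `mode` "online" or "offline".
    The online Qwen2.5-0.5B checkpoint is the one exception in the table,
    published with an extra "-new" suffix.
    """
    if mode not in ("online", "offline"):
        raise ValueError(f"unknown training mode: {mode!r}")
    repo = f"PasoDoble-Cornell/{base}-{size}-solver-{mode}"
    if (base, size, mode) == ("Qwen2.5", "0.5b", "online"):
        repo += "-new"
    return repo

print(checkpoint_id("Qwen3", "4b", "offline"))
# PasoDoble-Cornell/Qwen3-4b-solver-offline
```

The returned id can be passed directly as `model_id` in the usage example above.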

## Citation

If you find PasoDoble useful for your research, please cite our paper:

```bibtex
@article{zhang2025pasodoble,
  title={Better LLM Reasoning via Dual-Play},
  author={Zhengxin Zhang and Chengyu Huang and Aochong Oliver Li and Claire Cardie},
  eprint={2511.11881},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2511.11881}
}
```