---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# PasoDoble: Better LLM Reasoning via Dual-Play

This repository hosts models developed within the **PasoDoble** framework, a dual-play approach for LLMs presented in the paper [Better LLM Reasoning via Dual-Play](https://huggingface.co/papers/2511.11881).

PasoDoble improves the reasoning performance of Large Language Models (LLMs) by adversarially training two models: a "Proposer", which generates challenging questions with ground-truth answers, and a "Solver", which attempts to solve them. The two models iteratively learn from each other, sustaining competition and mutual evolution while reducing reliance on external supervision.
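
The adversarial incentives described above can be sketched as a toy example. The function names and the exact reward shaping below are illustrative assumptions for intuition only, not the paper's implementation:

```python
def proposer_reward(is_valid: bool, solver_correct: bool) -> float:
    """Toy reward for the Proposer: valid questions that the Solver
    fails on are most valuable; invalid questions are penalized."""
    if not is_valid:
        return -1.0  # malformed or unanswerable question
    return 1.0 if not solver_correct else 0.0  # reward difficulty

def solver_reward(solver_correct: bool) -> float:
    """Toy reward for the Solver: correctness only."""
    return 1.0 if solver_correct else 0.0

# A valid question the Solver misses rewards the Proposer...
print(proposer_reward(is_valid=True, solver_correct=False))  # 1.0
# ...while answering correctly rewards the Solver instead.
print(solver_reward(solver_correct=True))  # 1.0
```

Because each model's reward depends on the other's behavior, both must keep improving; the paper additionally updates them jointly to prevent reward hacking.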

**Project Page**: https://hcy123902.github.io/PasoDoble
**Code Repository**: https://github.com/HCY123902/PasoDoble

## Abstract Summary

PasoDoble addresses LLMs' reliance on external supervision by introducing a dual-play adversarial learning framework: a Proposer is trained to generate challenging questions with ground-truth answers, and a Solver is trained to solve them. The Proposer is rewarded for producing valid, difficult questions, while the Solver is rewarded for correct answers; both are updated jointly to prevent reward hacking. An optional offline paradigm further enhances training stability. This dual-play approach improves LLM reasoning performance without external supervision during training.

## Setup

To explore PasoDoble's core implementation and reproduce the experiments, follow these setup steps:

```bash
conda create -n pasodoble python=3.10.16
conda activate pasodoble

git clone https://github.com/PasoDoble-Cornell/PasoDoble.git
cd PasoDoble
pip install -r requirements.txt

# Install flash-attention separately
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# (Optional) If your binutils version is lower than 2.38, upgrade with:
conda install -c conda-forge binutils=2.40

mkdir history_record
```
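
After setup, a quick sanity check confirms the key packages are importable. This small helper is a hypothetical convenience, not part of the repository:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# flash_attn is the easiest to miss, since it is installed separately
for pkg in ("torch", "transformers", "flash_attn"):
    print(f"{pkg}: {'ok' if has_package(pkg) else 'MISSING'}")
```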

## Sample Usage

PasoDoble models can be loaded with the `transformers` library for text generation. The example below uses the `PasoDoble-Cornell/Qwen2.5-3b-solver-online` model; replace `model_id` with the specific checkpoint you intend to use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PasoDoble-Cornell/Qwen2.5-3b-solver-online"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens, dropping the prompt
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

For more details on training and advanced usage, please refer to the [official GitHub repository](https://github.com/HCY123902/PasoDoble).

## Trained Checkpoints

The following PasoDoble Solver checkpoints are available:

| **Model** | **Training** | **Download** |
| :------------: | :------------: | :------------: |
| PasoDoble Qwen2.5-0.5B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-0.5b-solver-online-new) |
| PasoDoble Qwen2.5-0.5B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-0.5b-solver-offline) |
| PasoDoble Qwen2.5-1.5B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-1.5b-solver-online) |
| PasoDoble Qwen2.5-1.5B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-1.5b-solver-offline) |
| PasoDoble Qwen2.5-3B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-3b-solver-online) |
| PasoDoble Qwen2.5-3B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen2.5-3b-solver-offline) |
| PasoDoble Qwen3-0.6B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-0.6b-solver-online) |
| PasoDoble Qwen3-0.6B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-0.6b-solver-offline) |
| PasoDoble Qwen3-1.7B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-1.7b-solver-online) |
| PasoDoble Qwen3-1.7B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-1.7b-solver-offline) |
| PasoDoble Qwen3-4B | online | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-4b-solver-online) |
| PasoDoble Qwen3-4B | offline | [πŸ€— HuggingFace](https://huggingface.co/PasoDoble-Cornell/Qwen3-4b-solver-offline) |
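
The repository ids above follow a regular naming pattern, which makes it easy to pick a checkpoint programmatically. The helper below is a hypothetical convenience inferred from the table, not part of the repository:

```python
def checkpoint_id(base: str, size: str, mode: str) -> str:
    """Build a Hub repo id such as 'PasoDoble-Cornell/Qwen2.5-3b-solver-online'.

    `base` is the model family (e.g. "Qwen2.5" or "Qwen3"), `size` the
    lowercase parameter count (e.g. "3b"), and `mode` "online" or "offline".
    The online Qwen2.5-0.5B checkpoint is the one exception in the table,
    published with an extra "-new" suffix.
    """
    if mode not in ("online", "offline"):
        raise ValueError(f"unknown training mode: {mode!r}")
    repo = f"PasoDoble-Cornell/{base}-{size}-solver-{mode}"
    if (base, size, mode) == ("Qwen2.5", "0.5b", "online"):
        repo += "-new"
    return repo

print(checkpoint_id("Qwen3", "4b", "offline"))
# PasoDoble-Cornell/Qwen3-4b-solver-offline
```

The returned id can be passed directly as `model_id` in the usage example above.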

## Citation

If you find PasoDoble useful for your research, please cite our paper:

```bibtex
@article{zhang2025pasodoble,
  title={Better LLM Reasoning via Dual-Play},
  author={Zhengxin Zhang and Chengyu Huang and Aochong Oliver Li and Claire Cardie},
  eprint={2511.11881},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2511.11881}
}
```