---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation

This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:

- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while keeping inference fast enough for interactive play.

---

## Game Configuration

- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents

---

## Policy Architecture

### Graph-Based State Encoder

- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder (see the sketch after this list):
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
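
The snippet below is a minimal sketch of this encoder, assuming a PyTorch Geometric `GATConv` stack; the class and argument names are illustrative, not the repository's actual implementation.

```python
import torch
from torch_geometric.nn import GATConv  # assumes PyTorch Geometric is installed

class GraphEncoder(torch.nn.Module):
    """Sketch: 3 GAT layers, 6 attention heads, hidden size 192, 35-d inputs."""

    def __init__(self, in_dim: int = 35, hidden: int = 192,
                 heads: int = 6, num_layers: int = 3):
        super().__init__()
        self.layers = torch.nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            # concat=False averages the heads so the width stays at `hidden`
            self.layers.append(GATConv(dim, hidden, heads=heads, concat=False))
            dim = hidden

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, 35] features for word, clue, and summary nodes
        # edge_index: [2, num_edges] adjacency of the heterogeneous graph
        for layer in self.layers:
            x = torch.relu(layer(x, edge_index))
        return x  # [num_nodes, 192] contextual node embeddings
```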

### Role Conditioning

- Shared policy trunk
- Role-conditioned action decoding (see the sketch after this list):
  - Clue generation and constraint handling for the spymaster
  - Guess selection and stopping decisions for the operative
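
A minimal sketch of the role-conditioned heads, under assumed action spaces (a fixed clue vocabulary plus a count for the spymaster; 25 word logits plus an explicit stop action for the operative). The actual decoder may differ.

```python
import torch
import torch.nn as nn

class RoleConditionedHeads(nn.Module):
    """Sketch of role-conditioned decoding on top of a shared 192-d trunk."""

    def __init__(self, hidden: int = 192, clue_vocab: int = 2000,
                 board_size: int = 25, max_count: int = 9):
        super().__init__()
        self.clue_head = nn.Linear(hidden, clue_vocab)       # spymaster: clue word
        self.count_head = nn.Linear(hidden, max_count)       # spymaster: clue number 1..9
        self.guess_head = nn.Linear(hidden, board_size + 1)  # operative: 25 words + stop

    def forward(self, summary: torch.Tensor, role: str):
        # summary: [batch, 192] pooled embedding (e.g. the global summary node)
        if role == "spymaster":
            return self.clue_head(summary), self.count_head(summary)
        return self.guess_head(summary)  # last logit = "stop guessing"
```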

### Model Size

- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints
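
Assuming a policy loaded as in the Usage section below, the parameter count is easy to verify:

```python
# Sanity check on model size (~6.8M parameters expected)
num_params = sum(p.numel() for p in policy.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```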

---

## Training Pipeline

Training follows a multi-stage curriculum (a sketch of the preference-labeling step follows this list):

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents

2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with multiple stochastic rollouts
   - Higher-return actions labeled as preferred

3. **Teacher Alignment**
   - Supervised fine-tuning (SFT) on chosen actions
   - Direct Preference Optimization (DPO) against a frozen reference model

4. **Policy Distillation**
   - The aligned teacher generates (state, role) → action labels
   - The graph policy imitates these labels via a cross-entropy loss

5. **PPO Refinement**
   - PPO resumes on environment rewards
   - Stabilizes the policy after distillation
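
The preference-generation step (2) can be sketched as a Monte Carlo labeling loop. Here `propose_actions` and `rollout_return` are hypothetical helpers and the rollout count is an assumption; the sketch only illustrates the idea of grounding LLM proposals in rollout returns.

```python
import statistics

def label_preferences(states, proposal_llms, num_rollouts=8):
    """Label LLM-proposed actions as chosen/rejected by mean rollout return."""
    pairs = []
    for state in states:
        # Gather candidate actions from every proposal model (e.g. Llama, Qwen)
        candidates = [a for llm in proposal_llms for a in propose_actions(llm, state)]

        # Monte Carlo value estimate: mean return over stochastic rollouts
        scores = {
            a: statistics.mean(rollout_return(state, a) for _ in range(num_rollouts))
            for a in candidates
        }

        # Highest-return action is preferred over the lowest-return one
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) >= 2:
            pairs.append({"state": state, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs
```

The resulting (chosen, rejected) pairs feed the SFT and DPO stages that align the teacher.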

---

## Evaluation Results

Evaluation uses **600 full games** against scripted opponents.

| Agent | Win Rate | Assassin Rate |
|-------|----------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

- Distillation yields an **8.1-point** absolute win-rate improvement
- Assassin-triggered losses drop by **45%** relative (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness

---

## Repository Contents

### Policy Checkpoints

- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models

- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs

- `master_config.json`
- `evaluation_results.json`
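
To inspect the shipped configuration and evaluation summary (their JSON schema is not documented here, so this simply pretty-prints them):

```python
import json

with open("master_config.json") as f:
    config = json.load(f)
with open("evaluation_results.json") as f:
    results = json.load(f)

print(json.dumps(config, indent=2))
print(json.dumps(results, indent=2))
```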

---

## Usage

### Load Policy

```python
import torch

from policy import GraphPolicy

# Instantiate with the architecture hyperparameters recorded in master_config.json
policy = GraphPolicy(...)
policy.load_state_dict(
    torch.load("policy_models/policy_after_distill.pt", map_location="cpu")
)
policy.eval()
```

### Load Fine-Tuned Teacher

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (use "./dpo_model" for the preference-aligned teacher)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# `prompt` must hold a serialized game state and role,
# in the same format used during fine-tuning
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and supports three claims:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: structured encoding of Codenames board states, clue history, and a global summary node
2. **Rollout-Grounded Preference Labels**: LLM action proposals scored by stochastic rollout returns rather than taken on faith
3. **Teacher Alignment via SFT and DPO**: preference-aligned LLM teachers trained on rollout-labeled data
4. **LLM-to-RL Distillation**: transferring strategic reasoning into an efficient, compact graph policy

## License

MIT License. See the LICENSE file for details.

## Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, Hugging Face Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU