---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---
# Codenames: Graph-Based RL with LLM-Guided Preference Distillation



This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---
## Overview
The approach integrates:
- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy
The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.

---
## Game Configuration
- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents
---
## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding:
  - Clue generation and constraint handling for spymaster
  - Guess selection and stopping decisions for operative
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints (see the architecture sketch below)
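
The repository's actual `policy.py` interface is not reproduced in this README, so the following is only a minimal sketch of the described architecture (three GAT layers, six heads, hidden size 192, role-conditioned decoding), assuming PyTorch Geometric is available; class, argument, and head names are illustrative rather than the real API.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GraphPolicySketch(nn.Module):
    def __init__(self, node_dim=35, hidden=192, heads=6, num_words=25, num_roles=2):
        super().__init__()
        # Three graph attention layers; concatenated heads give `hidden` features.
        self.gat1 = GATConv(node_dim, hidden // heads, heads=heads)
        self.gat2 = GATConv(hidden, hidden // heads, heads=heads)
        self.gat3 = GATConv(hidden, hidden // heads, heads=heads)
        self.act = nn.ELU()
        # Role embedding conditions the shared trunk output.
        self.role_emb = nn.Embedding(num_roles, hidden)
        # Operative head: logits over the 25 word slots plus a "stop" action.
        # (The spymaster's clue head is omitted from this sketch.)
        self.operative_head = nn.Linear(2 * hidden, num_words + 1)
        self.value_head = nn.Linear(2 * hidden, 1)

    def forward(self, x, edge_index, batch, role):
        h = self.act(self.gat1(x, edge_index))
        h = self.act(self.gat2(h, edge_index))
        h = self.act(self.gat3(h, edge_index))
        g = global_mean_pool(h, batch)               # graph-level summary
        z = torch.cat([g, self.role_emb(role)], -1)  # role conditioning
        return self.operative_head(z), self.value_head(z)
```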
---
## Training Pipeline
Training follows a multi-stage curriculum:
1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated using multiple stochastic rollouts
   - Higher-return actions labeled as preferred (sketched below)
3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model (loss sketched below)
4. **Policy Distillation**
   - The aligned teacher generates action labels conditioned on state and role
   - The graph policy is trained via cross-entropy imitation
5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation
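
A minimal sketch of the rollout-grounded labeling loop in step 2, as described above. The environment interface (`clone`, `apply`, `play_out`), the LLM proposer interface (`propose`), and `to_prompt` are hypothetical stand-ins, not the repository's actual APIs:

```python
from statistics import mean

def estimate_return(state, action, rollout_policy, n_rollouts=8):
    """Average return of `action` from `state` under stochastic rollouts."""
    returns = []
    for _ in range(n_rollouts):
        sim = state.clone()                 # hypothetical copyable game state
        sim.apply(action)
        returns.append(sim.play_out(rollout_policy))  # hypothetical simulator
    return mean(returns)

def label_preferences(sampled_states, llm_proposers, rollout_policy):
    """Build (prompt, chosen, rejected) triples for SFT/DPO of the teacher."""
    pairs = []
    for state in sampled_states:
        # Each LLM proposes a candidate action for the current role.
        candidates = [llm.propose(state) for llm in llm_proposers]
        scored = sorted(((estimate_return(state, a, rollout_policy), a)
                         for a in candidates), key=lambda t: t[0], reverse=True)
        (best_r, best), (worst_r, worst) = scored[0], scored[-1]
        if best_r > worst_r:                # keep only informative pairs
            pairs.append({"prompt": state.to_prompt(),
                          "chosen": best, "rejected": worst})
    return pairs
```

Step 3 then applies the standard DPO objective to these pairs. A compact restatement in PyTorch, assuming per-sequence log-probabilities of the chosen and rejected actions under the teacher being trained and under the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs (inputs are 1-D log-prob tensors)."""
    # beta is not specified in this README; 0.1 is a common default.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```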
---
## Evaluation Results
Evaluation uses **600 full games** against scripted opponents.
| Agent | Win Rate | Assassin Rate |
|------|---------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |
- Distillation yields an **8.1 percentage point** absolute win-rate improvement
- Assassin-triggered losses are reduced by roughly **45%** (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not from increased guessing aggressiveness
---
## Repository Contents
### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher
### Configuration and Logs
- `master_config.json`
- `evaluation_results.json`
---
## Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

# Instantiate with the same architecture hyperparameters used in training
# (e.g., as recorded in master_config.json), then load the distilled weights.
policy = GraphPolicy(...)
state_dict = torch.load("policy_models/policy_after_distill.pt", map_location="cpu")
policy.load_state_dict(state_dict)
policy.eval()  # inference mode
```
### Load Fine-Tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; swap in "./dpo_model" for the preference-aligned teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Illustrative prompt only; match the prompt format used during fine-tuning
prompt = "You are the Codenames spymaster. Board: ... Give one clue word and a count."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** and rests on three observations:
- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
### Key Innovations
1. **Heterogeneous Graph Representation**: Structured encoding of the Codenames board, clue history, and game state
2. **Rollout-Grounded Preference Learning**: LLM action proposals labeled as chosen or rejected by their estimated rollout returns
3. **Role-Conditioned Decoding**: A shared graph trunk with separate spymaster and operative heads
4. **LLM-to-RL Distillation**: Transferring strategic reasoning into efficient, deployable policies
## 📄 License
MIT License - See LICENSE file for details
## 🙏 Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
---
**Uploaded from**: Notebook Environment