---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.  
The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:

- **Graph Neural Networks** for structured board and history representation  
- **Proximal Policy Optimization (PPO)** for policy learning  
- **Role-conditioned decoding** for spymaster and operative behaviors  
- **Rollout-grounded preference learning** using large language models  
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment  
- **Knowledge distillation** from the aligned teacher back into a compact policy  

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.

---

## Game Configuration

- **Game**: Codenames  
- **Board size**: 25 words  
- **Roles**: Spymaster and Operative  
- **Evaluation games**: 600 full episodes  
- **Opponents**: Scripted baseline agents  

---

## Policy Architecture

### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder (sketched in code at the end of this section):
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192

### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding:
  - Clue generation and constraint handling for spymaster
  - Guess selection and stopping decisions for operative

### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints
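
A minimal sketch, assuming PyTorch Geometric, of how a policy with these dimensions could be wired: 35-dimensional node features, three GAT layers with 6 heads and hidden size 192, a shared trunk, and role-conditioned heads. It treats the graph as homogeneous for brevity, and all class and argument names (`GraphPolicySketch`, `clue_vocab`, `role_id`) are illustrative rather than the repository's actual `GraphPolicy` API.

```python
# Illustrative sketch only -- the repository's actual GraphPolicy class may differ.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool


class GraphPolicySketch(nn.Module):
    def __init__(self, in_dim=35, hidden=192, heads=6, clue_vocab=1000, board_size=25):
        super().__init__()
        # Three GAT layers; with the default concat=True each layer emits (hidden // heads) * heads = hidden dims.
        self.gat1 = GATConv(in_dim, hidden // heads, heads=heads)
        self.gat2 = GATConv(hidden, hidden // heads, heads=heads)
        self.gat3 = GATConv(hidden, hidden // heads, heads=heads)
        self.act = nn.ELU()
        # Role embedding conditions the shared trunk (0 = spymaster, 1 = operative).
        self.role_emb = nn.Embedding(2, hidden)
        # Spymaster scores a clue vocabulary; operative scores the 25 words plus a "stop guessing" action.
        self.clue_head = nn.Linear(2 * hidden, clue_vocab)
        self.guess_head = nn.Linear(2 * hidden, board_size + 1)
        self.value_head = nn.Linear(2 * hidden, 1)

    def forward(self, x, edge_index, batch, role_id):
        # x: [num_nodes, 35] features, edge_index: [2, num_edges], batch: graph ids, role_id: [num_graphs]
        h = self.act(self.gat1(x, edge_index))
        h = self.act(self.gat2(h, edge_index))
        h = self.act(self.gat3(h, edge_index))
        g = global_mean_pool(h, batch)                      # graph-level summary per board
        z = torch.cat([g, self.role_emb(role_id)], dim=-1)  # role-conditioned trunk output
        # Assumes every graph in the batch shares one role.
        logits = self.clue_head(z) if int(role_id[0]) == 0 else self.guess_head(z)
        return logits, self.value_head(z)
```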

---

## Training Pipeline

Training follows a multi-stage curriculum; minimal code sketches of the key stages appear after the list:

1. **Graph PPO Pretraining**  
   - PPO with clip ratio 0.2  
   - Discount factor γ = 0.99  
   - GAE λ = 0.95  
   - Trained against scripted Codenames agents  

2. **Preference Generation via Rollouts**  
   - ~800 intermediate states sampled  
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct  
   - Each proposal evaluated using multiple stochastic rollouts  
   - Higher-return actions labeled preferred  

3. **Teacher Alignment**  
   - Supervised fine-tuning on the chosen (preferred) actions  
   - Direct Preference Optimization using frozen reference model  

4. **Policy Distillation**  
   - Aligned teacher labels each (state, role) pair with a target action  
   - Graph policy trained via cross-entropy imitation  

5. **PPO Refinement**  
   - PPO resumes using environment rewards  
   - Stabilizes policy after distillation  
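
A minimal sketch of the advantage estimation and clipped surrogate used in stages 1 and 5, wired to the hyperparameters quoted above (clip ratio 0.2, γ = 0.99, GAE λ = 0.95). The tensor layout and the zero bootstrap at episode end are assumptions, not the repository's training loop.

```python
# Illustrative sketch of GAE and the PPO clipped objective (stages 1 and 5).
import torch


def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory of 1-D tensors."""
    advantages = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0 after the final step; mask future terms at episode boundaries.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        running = delta + gamma * lam * (1 - dones[t]) * running
        advantages[t] = running
    return advantages


def ppo_clip_loss(new_logp, old_logp, advantages, clip_ratio=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()
```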

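The heart of stage 2 is scoring LLM proposals with rollouts rather than trusting them directly: each candidate action is evaluated by the mean return of several stochastic continuations, and the higher-scoring action becomes the chosen side of a preference pair. A minimal sketch, where `propose_actions` and `simulate_rollout` are hypothetical stand-ins for the actual proposal and simulation code:

```python
# Illustrative sketch of rollout-grounded preference labeling (stage 2).
import random


def rollout_score(state, action, simulate_rollout, n_rollouts=8):
    """Mean return over stochastic rollouts that start by playing `action` in `state`."""
    return sum(simulate_rollout(state, action) for _ in range(n_rollouts)) / n_rollouts


def build_preference_pair(state, propose_actions, simulate_rollout):
    """Compare two LLM-proposed candidates and keep the higher-return one as 'chosen'."""
    candidates = propose_actions(state)  # e.g. proposals from Llama 3.1 / Qwen 2.5 Instruct
    a, b = random.sample(candidates, 2)  # assumes at least two distinct proposals
    score_a = rollout_score(state, a, simulate_rollout)
    score_b = rollout_score(state, b, simulate_rollout)
    chosen, rejected = (a, b) if score_a >= score_b else (b, a)
    return {"state": state, "chosen": chosen, "rejected": rejected}
```

Stage 3 then optimizes the standard DPO objective on these pairs against a frozen reference model. A compact version of the pairwise loss on summed sequence log-probabilities (β = 0.1 is a typical value, not one documented here):

```python
# Illustrative DPO pairwise loss (stage 3); inputs are per-sequence log-probabilities.
import torch.nn.functional as F


def dpo_loss(pi_logp_chosen, pi_logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = pi_logp_chosen - ref_logp_chosen        # policy vs. reference on the chosen action
    rejected_margin = pi_logp_rejected - ref_logp_rejected  # policy vs. reference on the rejected action
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```
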
---

## Evaluation Results

Evaluation uses **600 full games** against scripted opponents.

| Agent | Win Rate | Assassin Rate |
|------|---------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

- Distillation yields an **8.1 point** absolute win-rate improvement  
- Assassin-triggered losses are reduced by roughly **45%** relative (12.6% → 6.9%)  
- Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness  

---

## Repository Contents

### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs
- `master_config.json`
- `evaluation_results.json`

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# Construct with the same architecture hyperparameters used in training
# (constructor arguments are omitted here).
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
policy.eval()
```

### Load Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference -- the prompt below is a placeholder; the real prompt format
# follows whatever template the teacher was fine-tuned on
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
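
### Inspect Configuration and Results

The JSON artifacts listed under *Repository Contents* can be inspected directly. Their internal keys are not documented here, so this snippet simply pretty-prints whatever they contain:

```python
import json

# Pretty-print the training configuration and the 600-game evaluation summary.
for path in ("master_config.json", "evaluation_results.json"):
    with open(path) as f:
        print(f"== {path} ==")
        print(json.dumps(json.load(f), indent=2))
```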

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and makes the case that:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: Word, clue, and global summary nodes encode Codenames board state and history
2. **Rollout-Grounded Preference Learning**: LLM-proposed actions are labeled by the returns of stochastic rollouts
3. **Role-Conditioned Decoding**: A shared graph trunk with separate spymaster and operative action heads
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies


## 📄 License

MIT License - See LICENSE file for details

## 🙏 Acknowledgments

- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
