---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---
# Codenames: Graph-Based RL with LLM-Guided Preference Distillation



This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---
## Overview
The approach integrates:
- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy
The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.

---
## Game Configuration
- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents
---
## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding:
  - Clue generation and constraint handling for spymaster
  - Guess selection and stopping decisions for operative
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints (see the architecture sketch below)
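
The repository's actual `policy.py` interface is not reproduced in this README, so the following is only a minimal sketch of the described architecture (three GAT layers, six heads, hidden size 192, role-conditioned decoding), assuming PyTorch Geometric is available; class, argument, and head names are illustrative rather than the real API.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GraphPolicySketch(nn.Module):
    def __init__(self, node_dim=35, hidden=192, heads=6, num_words=25, num_roles=2):
        super().__init__()
        # Three graph attention layers; concatenated heads give `hidden` features.
        self.gat1 = GATConv(node_dim, hidden // heads, heads=heads)
        self.gat2 = GATConv(hidden, hidden // heads, heads=heads)
        self.gat3 = GATConv(hidden, hidden // heads, heads=heads)
        self.act = nn.ELU()
        # Role embedding conditions the shared trunk output.
        self.role_emb = nn.Embedding(num_roles, hidden)
        # Operative head: logits over the 25 word slots plus a "stop" action.
        # (The spymaster's clue head is omitted from this sketch.)
        self.operative_head = nn.Linear(2 * hidden, num_words + 1)
        self.value_head = nn.Linear(2 * hidden, 1)

    def forward(self, x, edge_index, batch, role):
        h = self.act(self.gat1(x, edge_index))
        h = self.act(self.gat2(h, edge_index))
        h = self.act(self.gat3(h, edge_index))
        g = global_mean_pool(h, batch)               # graph-level summary
        z = torch.cat([g, self.role_emb(role)], -1)  # role conditioning
        return self.operative_head(z), self.value_head(z)
```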
---
## Training Pipeline
Training follows a multi-stage curriculum:
1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated using multiple stochastic rollouts
   - Higher-return actions labeled as preferred (sketched below)
3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model (loss sketched below)
4. **Policy Distillation**
   - The aligned teacher generates action labels conditioned on state and role
   - The graph policy is trained via cross-entropy imitation
5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation
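
A minimal sketch of the rollout-grounded labeling loop in step 2, as described above. The environment interface (`clone`, `apply`, `play_out`), the LLM proposer interface (`propose`), and `to_prompt` are hypothetical stand-ins, not the repository's actual APIs:

```python
from statistics import mean

def estimate_return(state, action, rollout_policy, n_rollouts=8):
    """Average return of `action` from `state` under stochastic rollouts."""
    returns = []
    for _ in range(n_rollouts):
        sim = state.clone()                 # hypothetical copyable game state
        sim.apply(action)
        returns.append(sim.play_out(rollout_policy))  # hypothetical simulator
    return mean(returns)

def label_preferences(sampled_states, llm_proposers, rollout_policy):
    """Build (prompt, chosen, rejected) triples for SFT/DPO of the teacher."""
    pairs = []
    for state in sampled_states:
        # Each LLM proposes a candidate action for the current role.
        candidates = [llm.propose(state) for llm in llm_proposers]
        scored = sorted(((estimate_return(state, a, rollout_policy), a)
                         for a in candidates), key=lambda t: t[0], reverse=True)
        (best_r, best), (worst_r, worst) = scored[0], scored[-1]
        if best_r > worst_r:                # keep only informative pairs
            pairs.append({"prompt": state.to_prompt(),
                          "chosen": best, "rejected": worst})
    return pairs
```

Step 3 then applies the standard DPO objective to these pairs. A compact restatement in PyTorch, assuming per-sequence log-probabilities of the chosen and rejected actions under the teacher being trained and under the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs (inputs are 1-D log-prob tensors)."""
    # beta is not specified in this README; 0.1 is a common default.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```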
---
## Evaluation Results
Evaluation uses **600 full games** against scripted opponents.
| Agent | Win Rate | Assassin Rate |
|------|---------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |
- Distillation yields an **8.1 percentage point** absolute win-rate improvement
- Assassin-triggered losses are reduced by roughly **45%** (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not from increased guessing aggressiveness
---
## Repository Contents
### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher
### Configuration and Logs
- `master_config.json`
- `evaluation_results.json`
---
## Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

# Instantiate with the same architecture hyperparameters used in training
# (e.g., as recorded in master_config.json), then load the distilled weights.
policy = GraphPolicy(...)
state_dict = torch.load("policy_models/policy_after_distill.pt", map_location="cpu")
policy.load_state_dict(state_dict)
policy.eval()  # inference mode
```
### Load Fine-Tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; swap in "./dpo_model" for the preference-aligned teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Illustrative prompt only; match the prompt format used during fine-tuning
prompt = "You are the Codenames spymaster. Board: ... Give one clue word and a count."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** and rests on three observations:
- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
### Key Innovations
1. **Heterogeneous Graph Representation**: Structured encoding of the Codenames board, clue history, and game state
2. **Rollout-Grounded Preference Learning**: LLM action proposals labeled as chosen or rejected by their estimated rollout returns
3. **Role-Conditioned Decoding**: A shared graph trunk with separate spymaster and operative heads
4. **LLM-to-RL Distillation**: Transferring strategic reasoning into efficient, deployable policies
## 📄 License
MIT License - See LICENSE file for details
## 🙏 Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
---
**Uploaded from**: Notebook Environment