---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---
# Codenames: Graph-Based RL with LLM-Guided Preference Distillation
![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)
This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.
---
## Overview
The approach integrates:
- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy
The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.
---
## Game Configuration
- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents
---
## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
- Word nodes with semantic and state features
- Historical clue nodes
- Global summary node
- Node feature dimension: **35**
- Encoder (sketched below):
- 3 Graph Attention layers
- 6 attention heads
- Hidden size 192
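
A minimal sketch of an encoder with these hyperparameters, using `torch_geometric`; the class name and layer layout are illustrative assumptions, not necessarily the repository's implementation:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphStateEncoder(nn.Module):
    """Illustrative 3-layer GAT encoder: 35-dim node features -> 192-dim embeddings."""
    def __init__(self, node_dim: int = 35, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 dims per head; concatenated heads give 192
        self.layers = nn.ModuleList([
            GATConv(node_dim, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
        ])

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, 35] node features; edge_index: [2, num_edges] connectivity
        for layer in self.layers:
            x = torch.relu(layer(x, edge_index))
        return x  # [num_nodes, 192] per-node embeddings
```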
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding (sketched below):
- Clue generation and constraint handling for spymaster
- Guess selection and stopping decisions for operative
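
A sketch of role-conditioned decoding on top of the shared trunk; the head names, the clue-vocabulary size, and the use of a single summary embedding are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RoleConditionedHeads(nn.Module):
    """Map a shared trunk embedding to role-specific action logits."""
    def __init__(self, hidden: int = 192, clue_vocab: int = 1000, board_size: int = 25):
        super().__init__()
        # Spymaster: score candidate clue words (the clue count could be decoded similarly).
        self.clue_head = nn.Linear(hidden, clue_vocab)
        # Operative: score the 25 board words plus one extra "stop guessing" action.
        self.guess_head = nn.Linear(hidden, board_size + 1)

    def forward(self, summary_embedding: torch.Tensor, role: str) -> torch.Tensor:
        if role == "spymaster":
            return self.clue_head(summary_embedding)
        return self.guess_head(summary_embedding)
```

In practice the spymaster head also needs to mask out clues that appear on the board (the constraint handling mentioned above); that detail is omitted here.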
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints (a quick parameter-count check is sketched below)
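
A quick way to confirm the reported parameter count once a checkpoint is loaded (`policy` as constructed in the Usage section below):

```python
n_params = sum(p.numel() for p in policy.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to print roughly 6.8M
```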
---
## Training Pipeline
Training follows a multi-stage curriculum; illustrative sketches of the core steps appear after the list:
1. **Graph PPO Pretraining**
- PPO with clip ratio 0.2
- Discount factor γ = 0.99
- GAE λ = 0.95
- Trained against scripted Codenames agents
2. **Preference Generation via Rollouts**
- ~800 intermediate states sampled
- Candidate actions proposed by:
- Llama 3.1 Instruct
- Qwen 2.5 Instruct
- Each proposal evaluated using multiple stochastic rollouts
- Higher-return actions labeled preferred
3. **Teacher Alignment**
- Supervised Fine-Tuning on the chosen actions
- Direct Preference Optimization against a frozen reference model
4. **Policy Distillation**
- The aligned teacher generates action labels for sampled state and role inputs
- Graph policy trained via cross-entropy imitation
5. **PPO Refinement**
- PPO resumes using environment rewards
- Stabilizes policy after distillation
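
The sketches below illustrate the core computations behind steps 1–4. All function and variable names are illustrative, not the repository's actual code.

Step 1 optimizes the standard clipped PPO surrogate with the listed hyperparameters (advantages are assumed to come from GAE with γ = 0.99, λ = 0.95):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Clipped PPO surrogate: penalize updates that move the action-probability
    ratio outside [1 - clip_ratio, 1 + clip_ratio]."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Step 2 turns LLM proposals into preference pairs by grounding them in simulated play. `simulate_rollout` is a hypothetical helper that plays a game to completion from the given state after taking the candidate action and returns the final score; the number of rollouts is an assumed default:

```python
def label_preference(state, candidate_a, candidate_b, simulate_rollout, n_rollouts=8):
    """Rollout-grounded labeling: the candidate (e.g. a Llama or Qwen proposal)
    with the higher mean return over stochastic rollouts is marked 'chosen'."""
    def mean_return(action):
        return sum(simulate_rollout(state, action) for _ in range(n_rollouts)) / n_rollouts

    if mean_return(candidate_a) >= mean_return(candidate_b):
        return {"chosen": candidate_a, "rejected": candidate_b}
    return {"chosen": candidate_b, "rejected": candidate_a}
```

Steps 3 and 4 reduce to two familiar losses: the DPO objective over sequence log-probabilities (β = 0.1 is an assumed default) and plain cross-entropy imitation of teacher-labeled actions:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: prefer 'chosen' over 'rejected' relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

def distillation_loss(policy_logits, teacher_action_idx):
    """Distillation: cross-entropy between graph-policy logits and teacher action labels."""
    return F.cross_entropy(policy_logits, teacher_action_idx)
```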
---
## Evaluation Results
Evaluation uses **600 full games** against scripted opponents.
| Agent | Win Rate | Assassin Rate |
|------|---------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |
- Distillation yields an **8.1 point** absolute win-rate improvement
- Assassin-triggered losses are reduced by **45%**
- Improvements arise primarily from **better risk calibration**, not from increased guessing aggressiveness (the headline deltas are recomputed in the snippet below)
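
The headline numbers above follow directly from the table:

```python
graph_ppo   = {"win_rate": 0.448, "assassin_rate": 0.126}
ppo_distill = {"win_rate": 0.529, "assassin_rate": 0.069}

win_gain_pts = (ppo_distill["win_rate"] - graph_ppo["win_rate"]) * 100        # ~8.1 points
assassin_cut = 1 - ppo_distill["assassin_rate"] / graph_ppo["assassin_rate"]  # ~0.45 (45% fewer)
print(f"win-rate gain: {win_gain_pts:.1f} pts, assassin reduction: {assassin_cut:.0%}")
```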
---
## Repository Contents
### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher
### Configuration and Logs
- `master_config.json`
- `evaluation_results.json`
---
## Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

# Instantiate with the same constructor arguments used during training
# (see master_config.json), then load the distilled checkpoint.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt", map_location="cpu"))
policy.eval()
```
### Load Fine-Tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; point at "./dpo_model" for the preference-aligned teacher.
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Example inference; the prompt below is illustrative only.
prompt = "You are the spymaster. Give a one-word clue and a number for your team's words."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
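
If `sft_model/` and `dpo_model/` are stored as PEFT/LoRA adapters rather than merged weights (PEFT is listed among the dependencies), the adapter would instead be attached to its base model. This is a hedged sketch; `base_model_id` must be the actual base checkpoint used for the teacher:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

def load_teacher_adapter(base_model_id: str, adapter_dir: str = "./sft_model"):
    """Attach a PEFT adapter to its base model (only needed for adapter-style checkpoints)."""
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    return PeftModel.from_pretrained(base, adapter_dir)
```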
## 🎓 Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** and is built around three observations:
- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
### Key Innovations
1. **Heterogeneous Graph Representation**: Word, clue-history, and global-summary nodes for Codenames states
2. **Rollout-Grounded Preference Learning**: Candidate actions labeled by simulated returns rather than unverified LLM judgments
3. **Multi-scale Representation**: Word-level, clue-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License
MIT License - See LICENSE file for details
## 🙏 Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU