---
tags:
  - reinforcement-learning
  - game-theory
  - colonel-blotto
  - neurips-2025
  - graph-neural-networks
  - meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines reinforcement learning with large language model fine-tuning.

## 🎯 Model Overview

The system achieves strong performance on Colonel Blotto through:

- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies

### Game Configuration

- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: 1000

## 📊 Performance Results

### Against Scripted Opponents

**Overall Win Rate**: N/A

### Against LLMs

| Matchup | Win Rate |
|---------|----------|
| Policy vs Base Llama | 93.00% |
| Policy vs Qwen | 22.00% |

## 🏗️ Architecture

### Policy Network

The core policy network combines the following components (a minimal sketch of the FiLM conditioning and portfolio mixing appears after the training pipeline below):

1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing
2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns
3. **FiLM Layers**: Feature-wise Linear Modulation
   - Fast adaptation to opponent behavior
   - Conditioned on the opponent encoding
4. **Portfolio Head**: Multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing

### Training Pipeline

The models were trained through a seven-phase pipeline:

1. **Phase A**: Environment setup and action space generation (see the action-space sketch below)
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of the base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy
7. **Phase G**: PPO refinement after distillation
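The repository does not bundle the `PolicyNet` source, so the following is only a minimal sketch, assuming standard PyTorch modules, of how the FiLM conditioning and soft-attention portfolio mixing described above can be wired together. The class names (`FiLMLayer`, `PortfolioHead`) and the 128/64-dimensional embeddings are illustrative assumptions, not the shipped implementation.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scale and shift state features
    with parameters predicted from the opponent encoding."""
    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.gamma = nn.Linear(cond_dim, feat_dim)  # per-feature scale
        self.beta = nn.Linear(cond_dim, feat_dim)   # per-feature shift

    def forward(self, h: torch.Tensor, opp_z: torch.Tensor) -> torch.Tensor:
        return self.gamma(opp_z) * h + self.beta(opp_z)

class PortfolioHead(nn.Module):
    """Mix several specialist strategy heads with soft attention weights."""
    def __init__(self, feat_dim: int, n_actions: int, n_strat: int = 6):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_actions) for _ in range(n_strat)]
        )
        self.mixer = nn.Linear(feat_dim, n_strat)  # attention over strategies

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        logits = torch.stack([head(h) for head in self.heads], dim=1)  # (B, n_strat, A)
        weights = torch.softmax(self.mixer(h), dim=-1).unsqueeze(-1)   # (B, n_strat, 1)
        return (weights * logits).sum(dim=1)                           # (B, A)

# Condition a game-state embedding on an opponent embedding, then
# produce mixed action logits over the 231 possible allocations.
state_h = torch.randn(4, 128)  # graph-encoder output (batch of 4)
opp_z = torch.randn(4, 64)     # opponent-encoder output
film = FiLMLayer(128, 64)
head = PortfolioHead(128, n_actions=231)
action_logits = head(film(state_h, opp_z))
print(action_logits.shape)     # torch.Size([4, 231])
```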
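The `n_actions=231` value used in the loading example further down follows directly from the game configuration: an allocation is a non-negative split of 20 units across 3 fields, and there are C(22, 2) = 231 such splits. The snippet below is a small sanity check (not the repository's Phase A code) that enumerates them.

```python
from itertools import product
from math import comb

F, U = 3, 20  # fields, units per round (from the game configuration above)

# Enumerate every allocation of U units across F = 3 fields.
actions = [
    (a, b, U - a - b)
    for a, b in product(range(U + 1), repeat=2)
    if a + b <= U
]

print(len(actions))            # 231
print(comb(U + F - 1, F - 1))  # stars-and-bars check: C(22, 2) = 231
```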
## 📦 Repository Contents

### Policy Models

- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint

### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)

### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation

## 🚀 Usage

### Loading Policy Model

```python
import json

import torch
from your_policy_module import PolicyNet  # placeholder: the PolicyNet source is not bundled here

# Load the training configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize the policy with the same hyperparameters used in training
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # number of allocations for F=3, U=20
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```

### Loading Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned SFT model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "You are playing Colonel Blotto with 3 fields and 20 units. Allocate your units."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation

### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

## 📝 Citation

If you use this work, please cite:

```bibtex
@misc{colonelblotto2025neurips,
  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---

**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment