|
|
--- |
|
|
tags: |
|
|
- reinforcement-learning |
|
|
- game-theory |
|
|
- codenames |
|
|
- neurips-2025 |
|
|
- graph-neural-networks |
|
|
- meta-learning |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# CodeNames: Advanced RL + LLM System for NeurIPS 2025 |
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning. |
|
|
|
|
|
## Model Overview
|
|
|
|
|
This is an advanced system that achieves strong performance on Colonel Blotto through: |
|
|
|
|
|
- **Graph Neural Networks** for game state representation |
|
|
- **FiLM layers** for fast opponent adaptation |
|
|
- **Meta-learning** for strategy portfolios |
|
|
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning |
|
|
- **Distillation** from LLMs back to efficient RL policies |
|
|
|
|
|
### Game Configuration |
|
|
|
|
|
- **Fields**: 3 |
|
|
- **Units per round**: 20 |
|
|
- **Rounds per game**: 5 |
|
|
- **Training episodes**: 1000 |
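
Because the per-round allocation space is small, it can be enumerated exactly: splitting 20 units across 3 fields yields C(20+3-1, 3-1) = C(22, 2) = 231 allocations, which matches the `n_actions=231` value used in the loading example below. A minimal sketch of such an enumeration (the ordering here is an assumption; the action-space file generated in Phase A is the authoritative source):

```python
def enumerate_allocations(fields: int, units: int) -> list[tuple[int, ...]]:
    """All ways to split `units` indivisible units across `fields` ordered fields."""
    if fields == 1:
        return [(units,)]
    allocations = []
    for first in range(units + 1):
        for rest in enumerate_allocations(fields - 1, units - first):
            allocations.append((first,) + rest)
    return allocations

actions = enumerate_allocations(fields=3, units=20)
print(len(actions))  # 231, i.e. C(22, 2) by stars and bars
```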
|
|
|
|
|
## Performance Results
|
|
|
|
|
### Against Scripted Opponents |
|
|
|
|
|
**Overall Win Rate**: N/A |
|
|
|
|
|
### Against LLMs |
|
|
|
|
|
| Matchup | Win Rate | |
|
|
|---------|----------| |
|
|
| Policy vs Base Llama | 93.00% | |
|
|
| Policy vs Qwen | 22.00% | |
|
|
|
|
|
|
|
|
## Architecture
|
|
|
|
|
### Policy Network |
|
|
|
|
|
The policy network combines four components:
|
|
|
|
|
1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT) |
|
|
- Heterogeneous nodes: field nodes, round nodes, summary node |
|
|
- Multi-head attention with 6 heads |
|
|
- 3 layers of message passing |
|
|
|
|
|
2. **Opponent Encoder**: MLP-based encoder for opponent modeling |
|
|
- Processes opponent history |
|
|
- Learns behavioral patterns |
|
|
|
|
|
3. **FiLM Layers**: Feature-wise Linear Modulation |
|
|
- Fast adaptation to opponent behavior |
|
|
- Conditioned on the opponent encoding (a minimal sketch follows this list)
|
|
|
|
|
4. **Portfolio Head**: Multi-strategy selection |
|
|
- 6 specialist strategy heads |
|
|
- Soft attention-based mixing |
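
To make the FiLM conditioning in component 3 concrete, here is a minimal sketch, assuming the opponent encoder produces a fixed-size embedding that is mapped to a per-feature scale and shift; the dimensions and the placement inside the GAT stack are illustrative, not the repository's actual module:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift state features
    conditioned on an opponent embedding."""

    def __init__(self, opponent_dim: int, feature_dim: int):
        super().__init__()
        # A single linear map produces both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(opponent_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, opponent_emb: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(opponent_emb).chunk(2, dim=-1)
        # (1 + gamma) keeps the identity mapping when the conditioner outputs zeros.
        return (1 + gamma) * features + beta

# Example: modulate 128-d node features with a 64-d opponent encoding.
film = FiLM(opponent_dim=64, feature_dim=128)
modulated = film(torch.randn(8, 128), torch.randn(8, 64))
```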
|
|
|
|
|
### Training Pipeline |
|
|
|
|
|
The models were trained through a comprehensive 7-phase pipeline: |
|
|
|
|
|
1. **Phase A**: Environment setup and action space generation |
|
|
2. **Phase B**: PPO training against diverse scripted opponents |
|
|
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts) |
|
|
4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM |
|
|
5. **Phase E**: Direct Preference Optimization (DPO) |
|
|
6. **Phase F**: Knowledge distillation from LLM to policy (a loss sketch follows this list)
|
|
7. **Phase G**: PPO refinement after distillation |
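
As a concrete illustration of Phase F, the sketch below treats the allocation chosen by the fine-tuned LLM for a given state as the target for the policy's action distribution and minimizes a cross-entropy term toward it. The function name, batching, and target construction are assumptions for illustration, not the repository's actual training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(policy_logits: torch.Tensor,
                      llm_action_idx: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Cross-entropy between the policy's action distribution and the
    action index the fine-tuned LLM selected for the same game state."""
    log_probs = F.log_softmax(policy_logits / temperature, dim=-1)
    return F.nll_loss(log_probs, llm_action_idx)

# Example: a batch of 32 states with 231 possible allocations (F=3, U=20).
logits = torch.randn(32, 231)
targets = torch.randint(0, 231, (32,))
loss = distillation_loss(logits, targets)
```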
|
|
|
|
|
## Repository Contents
|
|
|
|
|
### Policy Models |
|
|
|
|
|
- `policy_models/policy_final.pt`: final policy checkpoint

- `policy_models/policy_after_distill.pt`: policy checkpoint saved after the distillation phase

- `policy_models/policy_after_ppo.pt`: policy checkpoint saved after the PPO phase
|
|
|
|
|
### Fine-tuned LLM Models |
|
|
|
|
|
- `sft_model/`: SFT model (HuggingFace Transformers compatible) |
|
|
|
|
|
|
|
|
### Configuration & Results |
|
|
|
|
|
- `master_config.json`: Complete training configuration |
|
|
- `battleground_eval.json`: Comprehensive evaluation results |
|
|
- `eval_scripted_after_ppo.json`: Post-PPO evaluation |
|
|
|
|
|
## Usage
|
|
|
|
|
### Loading Policy Model |
|
|
|
|
|
```python |
|
|
import json

import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # allocations of U=20 units over F=3 fields: C(22, 2) = 231
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
|
|
``` |
|
|
|
|
|
### Loading Fine-tuned LLM |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT (or DPO) model directory
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference. The prompt below is a placeholder; adapt it to the
# game-state format the model was fine-tuned on.
prompt = "Allocate 20 units across 3 fields. Respond with three numbers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
|
``` |
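
If the DPO checkpoint was saved as a PEFT/LoRA adapter rather than as merged weights (PEFT is listed among the dependencies below), it could be attached on top of the SFT base model as in the sketch that follows; the `./dpo_adapter` directory name is a hypothetical placeholder, not a file shipped in this repository:

```python
from peft import PeftModel

# Hypothetical: attach a LoRA/PEFT adapter produced by the DPO phase on top
# of the SFT base model loaded above. "./dpo_adapter" is an assumed path.
model = PeftModel.from_pretrained(model, "./dpo_adapter")
```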
|
|
|
|
|
## Research Context
|
|
|
|
|
This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on: |
|
|
|
|
|
- **Strategic game AI** beyond traditional game-theoretic approaches |
|
|
- **Hybrid systems** combining neural RL and LLM reasoning |
|
|
- **Fast adaptation** to diverse opponents through meta-learning |
|
|
- **Efficient deployment** via distillation |
|
|
|
|
|
### Key Innovations |
|
|
|
|
|
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states |
|
|
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism (see the sketch after this list)
|
|
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings |
|
|
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies |
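
Innovation 2 rests on the fact that a Blotto round is deterministic once both allocations are revealed, so the payoff of every alternative allocation can be computed exactly after the fact instead of being estimated. A minimal sketch, assuming a one-point-per-field scoring rule with ties split (the actual reward definition may differ):

```python
def round_score(mine: tuple, theirs: tuple) -> float:
    """Fields won by `mine` against `theirs`; a tied field counts as half."""
    score = 0.0
    for m, t in zip(mine, theirs):
        if m > t:
            score += 1.0
        elif m == t:
            score += 0.5
    return score

def counterfactual_scores(candidates: list, opponent_allocation: tuple) -> list:
    """Exact payoff each candidate allocation would have earned against the
    opponent's revealed allocation; no value estimation is needed."""
    return [round_score(a, opponent_allocation) for a in candidates]

# Example: two candidate allocations against an observed opponent move (10, 5, 5).
print(counterfactual_scores([(7, 7, 6), (0, 10, 10)], (10, 5, 5)))  # [2.0, 2.0]
```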
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this work, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{colonelblotto2025neurips,
  title={Advanced Reinforcement Learning System for Colonel Blotto Games},
  author={NeurIPS 2025 MindGames Submission},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}}
}
|
|
``` |
|
|
|
|
|
## License
|
|
|
|
|
MIT License - See LICENSE file for details |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
- Built for **NeurIPS 2025 MindGames Workshop** |
|
|
- Uses PyTorch, HuggingFace Transformers, and PEFT |
|
|
- Training infrastructure: NVIDIA H200 GPU |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
**Uploaded from**: Notebook Environment |
|
|
|