---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---
# CodeNames: Advanced RL + LLM System for NeurIPS 2025
![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)
This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
## 🎯 Model Overview
The system plays Colonel Blotto by combining:
- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies
### Game Configuration
- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: 1000
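With these settings, a per-round action is one way to split the 20 units across the 3 fields, giving C(22, 2) = 231 possible allocations (this is where the `n_actions=231` in the loading example below comes from). The sketch below shows one way such an action table could be enumerated; the actual ordering used in training is defined by the repo's Phase A code.

```python
from itertools import product

def enumerate_allocations(fields: int = 3, units: int = 20):
    """Enumerate every allocation of `units` across `fields`.
    Hypothetical helper, not the repo's own action-space generator."""
    actions = []
    for first_fields in product(range(units + 1), repeat=fields - 1):
        last = units - sum(first_fields)
        if last >= 0:
            actions.append((*first_fields, last))
    return actions

actions = enumerate_allocations()
assert len(actions) == 231  # C(20 + 3 - 1, 3 - 1) = C(22, 2)
```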
## πŸ“Š Performance Results
### Against Scripted Opponents
**Overall Win Rate**: N/A
### Against LLMs
| Matchup | Win Rate |
|---------|----------|
| Policy vs Base Llama | 93.00% |
| Policy vs Qwen | 22.00% |
## πŸ—οΈ Architecture
### Policy Network
The core policy network combines four components:
1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
- Heterogeneous nodes: field nodes, round nodes, summary node
- Multi-head attention with 6 heads
- 3 layers of message passing
2. **Opponent Encoder**: MLP-based encoder for opponent modeling
- Processes opponent history
- Learns behavioral patterns
3. **FiLM Layers**: Feature-wise Linear Modulation
- Fast adaptation to opponent behavior
- Conditioned on opponent encoding
4. **Portfolio Head**: Multi-strategy selection
- 6 specialist strategy heads
- Soft attention-based mixing
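The exact layer definitions live in the repo's policy code; the snippet below is only a minimal PyTorch sketch, with placeholder dimensions, of how FiLM conditioning on an opponent embedding and soft mixing over six specialist heads can fit together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMPortfolioHead(nn.Module):
    """Hypothetical sketch; dimensions are placeholders, not the trained model's."""

    def __init__(self, feat_dim=128, opp_dim=64, n_actions=231, n_strat=6):
        super().__init__()
        # FiLM: per-feature scale and shift predicted from the opponent embedding
        self.film = nn.Linear(opp_dim, 2 * feat_dim)
        # One logit head per specialist strategy
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n_actions) for _ in range(n_strat)])
        # Soft attention over the specialist heads, also conditioned on the opponent
        self.mixer = nn.Linear(opp_dim, n_strat)

    def forward(self, game_feat, opp_emb):
        gamma, beta = self.film(opp_emb).chunk(2, dim=-1)
        h = gamma * game_feat + beta                                   # feature-wise modulation
        logits = torch.stack([head(h) for head in self.heads], dim=1)  # (B, n_strat, n_actions)
        mix = F.softmax(self.mixer(opp_emb), dim=-1).unsqueeze(-1)     # (B, n_strat, 1)
        return (mix * logits).sum(dim=1)                               # (B, n_actions)
```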
### Training Pipeline
The models were trained through a comprehensive 7-phase pipeline:
1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy
7. **Phase G**: PPO refinement after distillation
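The losses used in each phase are defined by the training code itself; as one illustration of Phase F, the following is a minimal sketch that assumes the LLM's preferences have already been converted into a target distribution over the 231 actions, and that the policy is pulled toward it with a KL objective.

```python
import torch.nn.functional as F

def distill_step(policy_logits, teacher_probs, optimizer):
    """Hypothetical Phase F update: match the policy's action distribution
    to an LLM-derived target distribution via KL divergence."""
    log_probs = F.log_softmax(policy_logits, dim=-1)                   # (batch, n_actions)
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```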
## πŸ“¦ Repository Contents
### Policy Models
- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint
### Fine-tuned LLM Models
- `sft_model/`: SFT model (HuggingFace Transformers compatible)
### Configuration & Results
- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation
## πŸš€ Usage
### Loading Policy Model
```python
import json
import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # For F=3, U=20
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"]
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "You are playing Colonel Blotto with 3 fields and 20 units. Propose an allocation."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
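The snippet above assumes `sft_model/` holds a full Transformers checkpoint, as the repository contents suggest. If instead only a LoRA adapter was exported (the training stack uses PEFT, see Acknowledgments), it would be attached to its base model roughly as follows; the base model name here is a placeholder, not something stated in this repo.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "base-model-name-used-for-sft"  # placeholder: substitute the actual base LLM
base = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(base, "./sft_model")  # attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
```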
## πŸŽ“ Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation
### Key Innovations
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
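Innovation 2 relies on the fact that a Blotto round's outcome is fully determined once both allocations are known, so after the opponent's move is revealed, the return of every alternative action can be computed exactly rather than estimated. A minimal sketch, assuming the standard scoring where the larger allocation wins a field:

```python
def round_score(mine, theirs):
    """Fields won minus fields lost for one Blotto round (ties count for neither side)."""
    wins = sum(m > t for m, t in zip(mine, theirs))
    losses = sum(m < t for m, t in zip(mine, theirs))
    return wins - losses

def counterfactual_returns(all_actions, revealed_opponent_allocation):
    """Exact ground-truth return of every alternative action against the revealed move."""
    return [round_score(a, revealed_opponent_allocation) for a in all_actions]
```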
## πŸ“ Citation
If you use this work, please cite:
```bibtex
@misc{colonelblotto2025neurips,
  title={Advanced Reinforcement Learning System for Colonel Blotto Games},
  author={NeurIPS 2025 MindGames Submission},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```
## πŸ“„ License
MIT License - See LICENSE file for details
## πŸ™ Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
---
**Uploaded from**: Notebook Environment