---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---
# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025



This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines reinforcement learning with large language model fine-tuning.
## Model Overview
The system plays Colonel Blotto using:
- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies
### Game Configuration
- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: N/A
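With 3 fields and 20 units, every legal action is an ordered split of all 20 units across the 3 fields, which gives C(22, 2) = 231 discrete actions (matching `n_actions=231` in the usage example below). A minimal enumeration sketch, independent of the repository's own action-space code:

```python
def enumerate_allocations(units: int, fields: int):
    """Yield every ordered split of `units` across `fields` battlefields."""
    if fields == 1:
        yield (units,)
        return
    for first in range(units + 1):
        for rest in enumerate_allocations(units - first, fields - 1):
            yield (first,) + rest

actions = list(enumerate_allocations(20, 3))
print(len(actions))             # 231 = C(20 + 3 - 1, 3 - 1)
print(actions[0], actions[-1])  # (0, 0, 20) (20, 0, 0)
```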
## Performance Results
### Against Scripted Opponents
**Overall Win Rate**: 0.00%
### LLM Elo Ratings
| Model | Elo Rating |
|-------|------------|
## Architecture
### Policy Network
The core policy network is built from four components (a minimal sketch follows the list):
1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
- Heterogeneous nodes: field nodes, round nodes, summary node
- Multi-head attention with 6 heads
- 3 layers of message passing
2. **Opponent Encoder**: MLP-based encoder for opponent modeling
- Processes opponent history
- Learns behavioral patterns
3. **FiLM Layers**: Feature-wise Linear Modulation
- Fast adaptation to opponent behavior
- Conditioned on opponent encoding
4. **Portfolio Head**: Multi-strategy selection
- 6 specialist strategy heads
- Soft attention-based mixing
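The exact module code ships with the training repository; the sketch below only illustrates how the FiLM conditioning and the portfolio head fit together, with the GAT encoder replaced by an MLP stand-in and all class and argument names (`FiLM`, `PortfolioPolicy`, `obs_dim`, `opp_dim`) chosen for illustration rather than taken from the actual implementation:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features
    using parameters predicted from the opponent encoding."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma) * feats + beta

class PortfolioPolicy(nn.Module):
    """Opponent-conditioned policy with a portfolio of strategy heads."""
    def __init__(self, obs_dim=32, opp_dim=16, hidden=128, n_strat=6, n_actions=231):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # stand-in for the GAT encoder
        self.opp_enc = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())    # opponent-history encoder
        self.film = FiLM(cond_dim=hidden, feat_dim=hidden)
        self.strategy_heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_strat)]             # one specialist head per strategy
        )
        self.mixer = nn.Linear(hidden, n_strat)                                # soft attention over heads

    def forward(self, obs, opp_history):
        h = self.state_enc(obs)
        z = self.opp_enc(opp_history)
        h = self.film(h, z)                             # fast adaptation to the current opponent
        weights = torch.softmax(self.mixer(h), dim=-1)  # (batch, n_strat)
        logits = torch.stack([head(h) for head in self.strategy_heads], dim=1)  # (batch, n_strat, n_actions)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)                      # mixed action logits

policy = PortfolioPolicy()
print(policy(torch.randn(4, 32), torch.randn(4, 16)).shape)  # torch.Size([4, 231])
```

In the real model, the state encoder is the heterogeneous GAT over field, round, and summary nodes, and the opponent encoder consumes the opponent's allocation history.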
### Training Pipeline
The models were trained through a seven-phase pipeline (a sketch of the Phase F distillation loss follows the list):
1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy
7. **Phase G**: PPO refinement after distillation
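Phase F compresses the LLM's strategic behavior into the small policy network. A minimal sketch of that step, assuming the teacher's choices have already been turned into per-state action distributions; the loss form, temperature, and data shapes here are assumptions, not the repository's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) over the 231-way action distribution."""
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")

# Toy stand-ins: a linear student and random teacher targets.
student = nn.Linear(32, 231)
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

states = torch.randn(64, 32)                                 # placeholder state features
teacher_probs = torch.softmax(torch.randn(64, 231), dim=-1)  # placeholder LLM-derived action distributions

loss = distillation_loss(student(states), teacher_probs)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```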
## Repository Contents
### Policy Models
- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint
### Fine-tuned LLM Models
- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)
### Configuration & Results
- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation
## Usage
### Loading Policy Model
```python
import json

import torch

from your_policy_module import PolicyNet  # placeholder: the module defining the policy architecture

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # all allocations of U=20 units across F=3 fields
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference (the prompt below is illustrative only)
prompt = "Fields: 3, Units: 20. Allocate your units across the fields."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
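The model's output format is determined by the prompts used during SFT/DPO, which are not documented here. Purely as an illustration, assuming the model emits its allocation as plain integers, a defensive parse into a legal allocation could look like:

```python
import re

def parse_allocation(text: str, fields: int = 3, units: int = 20):
    """Extract the first `fields` integers from the model output and
    rescale them so they sum to exactly `units`."""
    numbers = [int(n) for n in re.findall(r"\d+", text)][:fields]
    if len(numbers) < fields:
        numbers += [0] * (fields - len(numbers))
    total = sum(numbers)
    if total == 0:
        return [units] + [0] * (fields - 1)  # fallback: everything on one field
    scaled = [n * units // total for n in numbers]
    scaled[0] += units - sum(scaled)         # give the rounding remainder to the first field
    return scaled

print(parse_allocation("I allocate 10, 5, 5."))  # [10, 5, 5]
```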
## Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation
### Key Innovations
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting the game's determinism to score alternative allocations exactly (see the sketch after this list)
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
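Because a Blotto round is deterministic given both allocations, the payoff of any alternative allocation against an observed opponent move can be computed exactly rather than estimated, which is what the counterfactual-learning signal exploits. A minimal sketch, assuming a simple net-fields-won scoring convention:

```python
def round_payoff(mine, theirs):
    """Net fields won in one round: +1 per field won, -1 per field lost, ties count 0."""
    return sum((m > t) - (m < t) for m, t in zip(mine, theirs))

def counterfactual_payoffs(candidate_actions, observed_opponent):
    """Exact payoff of every candidate allocation against the opponent's observed move."""
    return {a: round_payoff(a, observed_opponent) for a in candidate_actions}

candidates = [(10, 5, 5), (0, 10, 10), (7, 7, 6)]
print(counterfactual_payoffs(candidates, observed_opponent=(6, 6, 8)))
# {(10, 5, 5): -1, (0, 10, 10): 1, (7, 7, 6): 1}
```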
## Citation
If you use this work, please cite:
```bibtex
@misc{colonelblotto2025neurips,
  title={Advanced Reinforcement Learning System for Colonel Blotto Games},
  author={NeurIPS 2025 MindGames Submission},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
---
**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment