---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.

## 🎯 Model Overview

The system plays Colonel Blotto by combining:

- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation  
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies

### Game Configuration

- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: N/A
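
With 3 fields and 20 units per round, an action is any split of the 20 units across the 3 fields. A minimal sketch of enumerating that action space (pure combinatorics from the configuration above; the count matches the `n_actions=231` used in the Usage section):

```python
# Enumerate every way to split U=20 units across F=3 fields.
# For F=3, U=20 this yields C(22, 2) = 231 allocations.
U = 20
actions = [(a, b, U - a - b)
           for a in range(U + 1)
           for b in range(U + 1 - a)]
assert len(actions) == 231
```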

## 📊 Performance Results

### Against Scripted Opponents

**Overall Win Rate**: 0.00%

### LLM Elo Ratings

| Model | Elo Rating |
|-------|------------|


## πŸ—οΈ Architecture

### Policy Network

The core policy network is composed of four modules:

1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing

2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns

3. **FiLM Layers**: Feature-wise Linear Modulation (sketched after this list)
   - Fast adaptation to opponent behavior
   - Conditioned on opponent encoding

4. **Portfolio Head**: Multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing
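
A minimal sketch of the FiLM conditioning referenced in item 3 (the class and dimension names here are illustrative assumptions, not the exact modules in this repository):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features,
    conditioned on an opponent embedding."""
    def __init__(self, feature_dim: int, cond_dim: int):
        super().__init__()
        # One linear layer produces both gamma (scale) and beta (shift)
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feature_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma * h + beta

# Modulate 128-dim state embeddings with a 64-dim opponent encoding
film = FiLM(feature_dim=128, cond_dim=64)
h = torch.randn(4, 128)      # e.g. graph-encoder output for 4 states
z_opp = torch.randn(4, 64)   # opponent-encoder output
h_adapted = film(h, z_opp)
```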

### Training Pipeline

The models were trained through a comprehensive 7-phase pipeline:

1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy (loss sketched after this list)
7. **Phase G**: PPO refinement after distillation
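
For Phase F, a generic way to distill the LLM's behaviour into the policy is to match the policy's action distribution to the LLM-induced one with a KL objective. A minimal sketch under that assumption (the repository's exact distillation loss may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(policy_logits: torch.Tensor,
                      llm_action_probs: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(LLM || policy) over the 231-way allocation distribution."""
    log_p = F.log_softmax(policy_logits / temperature, dim=-1)
    return F.kl_div(log_p, llm_action_probs, reduction="batchmean")

# Example with a batch of 8 states and 231 candidate allocations
loss = distillation_loss(torch.randn(8, 231),
                         torch.softmax(torch.randn(8, 231), dim=-1))
```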

## 📦 Repository Contents

### Policy Models

- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint

### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)


### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation

## 🚀 Usage

### Loading Policy Model

```python
import json

import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # number of allocations for F=3, U=20
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"]
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```

### Loading Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model (swap "./sft_model" for "./dpo_model" as needed)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference. The placeholder prompt below is illustrative only;
# the exact prompt format is defined by the repository's training scripts.
prompt = "You have 20 units to allocate across 3 fields. Allocation:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation

### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism (illustrated after this list)
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
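
To make item 2 concrete: a Blotto round is deterministic given both allocations, so the payoff of every alternative allocation against the observed opponent move can be computed exactly and used as a dense learning signal. A minimal sketch, assuming a simple fields-won-minus-fields-lost payoff (the repository's reward definition may differ):

```python
def round_payoff(my_alloc, opp_alloc):
    # +1 for each field won, -1 for each field lost, 0 for ties
    return sum((m > o) - (m < o) for m, o in zip(my_alloc, opp_alloc))

U = 20
actions = [(a, b, U - a - b) for a in range(U + 1) for b in range(U + 1 - a)]

# Exact counterfactual payoffs of all 231 allocations against the
# observed opponent allocation -- no simulation or estimation needed.
observed_opponent = (10, 5, 5)
counterfactuals = {a: round_payoff(a, observed_opponent) for a in actions}
```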

## πŸ“ Citation

If you use this work, please cite:

```bibtex
@misc{colonelblotto2025neurips,
  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```

## 📄 License

MIT License - See LICENSE file for details

## 🙏 Acknowledgments

- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---

**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment