GOVINDFROM committed
Commit 5c80055 · verified · 1 Parent(s): e6db4f9

Update README.md

Files changed (1)
  1. README.md +102 -97
README.md CHANGED
@@ -2,130 +2,149 @@
  tags:
  - reinforcement-learning
  - game-theory
- - colonel-blotto
  - neurips-2025
  - graph-neural-networks
- - meta-learning
  license: mit
  ---

- # Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025

  ![Status](https://img.shields.io/badge/status-trained-success)
  ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
  ![License](https://img.shields.io/badge/license-MIT-blue)

- This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.

- ## 🎯 Model Overview

- This is an advanced system that achieves strong performance on Colonel Blotto through:

- - **Graph Neural Networks** for game state representation
- - **FiLM layers** for fast opponent adaptation
- - **Meta-learning** for strategy portfolios
- - **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- - **Distillation** from LLMs back to efficient RL policies

- ### Game Configuration

- - **Fields**: 3
- - **Units per round**: 20
- - **Rounds per game**: 5
- - **Training episodes**: 1000

- ## 📊 Performance Results

- ### Against Scripted Opponents

- **Overall Win Rate**: N/A

- ### Against LLMs

- | Matchup | Win Rate |
- |---------|----------|
- | Policy vs Base Llama | 93.00% |
- | Policy vs Qwen | 76.00% |
- ## 🏗️ Architecture

- ### Policy Network

- The core policy network uses a sophisticated architecture:

- 1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
-    - Heterogeneous nodes: field nodes, round nodes, summary node
-    - Multi-head attention with 6 heads
-    - 3 layers of message passing

- 2. **Opponent Encoder**: MLP-based encoder for opponent modeling
-    - Processes opponent history
-    - Learns behavioral patterns

- 3. **FiLM Layers**: Feature-wise Linear Modulation
-    - Fast adaptation to opponent behavior
-    - Conditioned on opponent encoding

- 4. **Portfolio Head**: Multi-strategy selection
-    - 6 specialist strategy heads
-    - Soft attention-based mixing

- ### Training Pipeline

- The models were trained through a comprehensive 7-phase pipeline:

- 1. **Phase A**: Environment setup and action space generation
- 2. **Phase B**: PPO training against diverse scripted opponents
- 3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
- 4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
- 5. **Phase E**: Direct Preference Optimization (DPO)
- 6. **Phase F**: Knowledge distillation from LLM to policy
- 7. **Phase G**: PPO refinement after distillation

- ## 📦 Repository Contents

- ### Policy Models

- - `policy_models/policy_final.pt`: PyTorch checkpoint
- - `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- - `policy_models/policy_after_ppo.pt`: PyTorch checkpoint

- ### Fine-tuned LLM Models

- - `sft_model/`: SFT model (HuggingFace Transformers compatible)

- ### Configuration & Results

- - `master_config.json`: Complete training configuration
- - `battleground_eval.json`: Comprehensive evaluation results
- - `eval_scripted_after_ppo.json`: Post-PPO evaluation

- ## 🚀 Usage

- ### Loading Policy Model

  ```python
  import torch
- from your_policy_module import PolicyNet
-
- # Load configuration
- with open("master_config.json", "r") as f:
-     config = json.load(f)
-
- # Initialize policy
- policy = PolicyNet(
-     Ff=config["F"],
-     n_actions=231,  # For F=3, U=20
-     hidden=config["hidden"],
-     gnn_layers=config["gnn_layers"],
-     gnn_heads=config["gnn_heads"],
-     n_strat=config["n_strat"]
- )
-
- # Load trained weights
- policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
  policy.eval()
  ```
 
@@ -147,10 +166,9 @@ outputs = model.generate(**inputs, max_new_tokens=32)

  This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- - **Strategic game AI** beyond traditional game-theoretic approaches
- - **Hybrid systems** combining neural RL and LLM reasoning
- - **Fast adaptation** to diverse opponents through meta-learning
- - **Efficient deployment** via distillation

  ### Key Innovations

@@ -159,19 +177,6 @@ This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
  3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
  4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

- ## 📝 Citation
-
- If you use this work, please cite:
-
- ```bibtex
- @misc{colonelblotto2025neurips,
-   title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
-   author={{NeurIPS 2025 MindGames Submission}},
-   year={2025},
-   publisher={HuggingFace Hub},
-   howpublished={{\url{{https://huggingface.co/{repo_id}}}}},
- }
- ```

  ## 📄 License
 
  tags:
  - reinforcement-learning
  - game-theory
+ - codenames
  - neurips-2025
  - graph-neural-networks
+ - preference-learning
+ - llm-distillation
  license: mit
  ---

+ # Codenames: Graph-Based RL with LLM-Guided Preference Distillation

  ![Status](https://img.shields.io/badge/status-trained-success)
  ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
  ![License](https://img.shields.io/badge/license-MIT-blue)

+ This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
+ The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

+ ---

+ ## Overview

+ The approach integrates:

+ - **Graph Neural Networks** for structured board and history representation
+ - **Proximal Policy Optimization (PPO)** for policy learning
+ - **Role-conditioned decoding** for spymaster and operative behaviors
+ - **Rollout-grounded preference learning** using large language models
+ - **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
+ - **Knowledge distillation** from the aligned teacher back into a compact policy

+ The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.

+ ---

+ ## Game Configuration

+ - **Game**: Codenames
+ - **Board size**: 25 words
+ - **Roles**: Spymaster and Operative
+ - **Evaluation games**: 600 full episodes
+ - **Opponents**: Scripted baseline agents

+ ---

+ ## Policy Architecture
+
+ ### Graph-Based State Encoder
+ - Heterogeneous graph with **30–40 nodes**
+ - Node types include:
+   - Word nodes with semantic and state features
+   - Historical clue nodes
+   - Global summary node
+ - Node feature dimension: **35**
+ - Encoder:
+   - 3 Graph Attention layers
+   - 6 attention heads
+   - Hidden size 192
+
+ ### Role Conditioning
+ - Shared policy trunk
+ - Role-conditioned action decoding:
+   - Clue generation and constraint handling for spymaster
+   - Guess selection and stopping decisions for operative
+
+ ### Model Size
+ - Total parameters: **~6.8M**
+ - Enables fast inference under competitive constraints
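For illustration, a minimal sketch of an encoder with the dimensions listed above (35-dimensional node features, 3 GAT layers, 6 heads, hidden size 192), assuming PyTorch Geometric and treating all node types with shared features for simplicity; the class name and toy graph are placeholders, not the repository's actual `GraphPolicy` internals.

```python
# Sketch only: dimensions taken from the architecture description above,
# class name and graph construction are illustrative assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv


class CodenamesGraphEncoder(nn.Module):
    def __init__(self, in_dim: int = 35, hidden: int = 192, heads: int = 6, layers: int = 3):
        super().__init__()
        self.convs = nn.ModuleList()
        dim = in_dim
        for _ in range(layers):
            # concat=False averages the attention heads, keeping width at `hidden`
            self.convs.append(GATConv(dim, hidden, heads=heads, concat=False))
            dim = hidden

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, 35] features for word / clue / summary nodes
        # edge_index: [2, num_edges] graph connectivity
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return x  # [num_nodes, 192] node embeddings


if __name__ == "__main__":
    # Toy board graph: 25 word nodes + 1 clue node + 1 summary node
    x = torch.randn(27, 35)
    edge_index = torch.randint(0, 27, (2, 120))
    print(CodenamesGraphEncoder()(x, edge_index).shape)  # torch.Size([27, 192])
```

With `concat=False` the six heads are averaged, so the hidden width stays at 192 through all three layers, matching the stated encoder size.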
 
+ ---

+ ## Training Pipeline

+ Training follows a multi-stage curriculum:

+ 1. **Graph PPO Pretraining**
+    - PPO with clip ratio 0.2
+    - Discount factor γ = 0.99
+    - GAE λ = 0.95
+    - Trained against scripted Codenames agents

+ 2. **Preference Generation via Rollouts**
+    - ~800 intermediate states sampled
+    - Candidate actions proposed by:
+      - Llama 3.1 Instruct
+      - Qwen 2.5 Instruct
+    - Each proposal evaluated using multiple stochastic rollouts
+    - Higher-return actions labeled preferred (see the sketch after this list)

+ 3. **Teacher Alignment**
+    - Supervised fine-tuning on chosen actions
+    - Direct Preference Optimization using a frozen reference model

+ 4. **Policy Distillation**
+    - Aligned teacher generates action labels for sampled state–role inputs
+    - Graph policy trained via cross-entropy imitation

+ 5. **PPO Refinement**
+    - PPO resumes using environment rewards
+    - Stabilizes the policy after distillation
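A minimal sketch of the rollout-grounded preference labeling in stage 2. The `env.clone()`, `env.step()`, and `sample_action()` interface is a hypothetical stand-in for the repository's environment, and the candidate actions are assumed to come from the LLM proposals described above.

```python
# Sketch only: environment interface and candidate sources are assumed, not the repo's API.
def rollout_return(env, action, n_rollouts: int = 8) -> float:
    """Mean return of `action` from the current state over stochastic rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        sim = env.clone()                    # independent copy of the current game state
        reward, done = sim.step(action)      # play the candidate action
        while not done:                      # finish the game with a default policy
            r, done = sim.step(sim.sample_action())
            reward += r
        total += reward
    return total / n_rollouts


def label_preference(env, candidate_a, candidate_b):
    """Label the higher-return LLM proposal as chosen, the other as rejected."""
    ra, rb = rollout_return(env, candidate_a), rollout_return(env, candidate_b)
    return (candidate_a, candidate_b) if ra >= rb else (candidate_b, candidate_a)
```

Pairs produced this way can then feed stage 3: the chosen actions for supervised fine-tuning, and the chosen/rejected pairs for DPO against the frozen reference model.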
 
+ ---

+ ## Evaluation Results

+ Evaluation uses **600 full games** against scripted opponents.

+ | Agent | Win Rate | Assassin Rate |
+ |-------|----------|---------------|
+ | Graph PPO | 44.8% | 12.6% |
+ | PPO + Distillation | 52.9% | 6.9% |

+ - Distillation yields an **8.1-point** absolute win-rate improvement
+ - Assassin-triggered losses are reduced by **45%**
+ - Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness
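Both improvement figures follow directly from the table; a quick arithmetic check:

```python
# Deltas implied by the table above
win_ppo, win_distill = 44.8, 52.9
assassin_ppo, assassin_distill = 12.6, 6.9

print(round(win_distill - win_ppo, 1))                             # 8.1 absolute points
print(round((assassin_ppo - assassin_distill) / assassin_ppo, 3))  # 0.452, i.e. ~45% relative reduction
```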
 
+ ---

+ ## Repository Contents

+ ### Policy Checkpoints
+ - `policy_models/policy_after_ppo.pt`
+ - `policy_models/policy_after_distill.pt`

+ ### Teacher Models
+ - `sft_model/` – supervised fine-tuned teacher
+ - `dpo_model/` – preference-aligned teacher

+ ### Configuration and Logs
+ - `master_config.json`
+ - `evaluation_results.json`

+ ---

+ ## Usage

+ ### Load Policy

  ```python
  import torch
+ from policy import GraphPolicy
+
+ policy = GraphPolicy(...)
+ policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
  policy.eval()
  ```
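The teacher checkpoints listed under Repository Contents are Transformers-compatible directories; a minimal sketch for querying the preference-aligned teacher, assuming `dpo_model/` is a standard causal-LM checkpoint (the prompt text below is illustrative, not the repository's actual prompting code):

```python
# Sketch only: assumes dpo_model/ is a standard Hugging Face causal-LM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dpo_model")
model = AutoModelForCausalLM.from_pretrained("dpo_model")

# Illustrative spymaster-style prompt
prompt = "You are the spymaster. Board: ...\nGive a clue as 'WORD NUMBER'."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```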

  This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

+ - Rollout-grounded use of language models as a source of strategic priors
+ - Graph-based representations for structured reasoning in semantic games
+ - Distillation of high-level reasoning into efficient, deployable agents

  ### Key Innovations

  3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
  4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

  ## 📄 License