---
tags:
  - reinforcement-learning
  - game-theory
  - colonel-blotto
  - neurips-2025
  - graph-neural-networks
  - meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines reinforcement learning with large language model fine-tuning.

## 🎯 Model Overview

The system achieves strong performance on Colonel Blotto through:

- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies

### Game Configuration

- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: 1000

## 📊 Performance Results

### Against Scripted Opponents

**Overall Win Rate**: N/A

### Against LLMs

| Matchup | Win Rate |
|---------|----------|
| Policy vs Base Llama | 93.00% |
| Policy vs Qwen | 22.00% |

## 🏗️ Architecture

### Policy Network

The core policy network combines the following components (a minimal sketch of the FiLM conditioning and portfolio mixing appears after the training pipeline below):

1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing
2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns
3. **FiLM Layers**: Feature-wise Linear Modulation
   - Fast adaptation to opponent behavior
   - Conditioned on the opponent encoding
4. **Portfolio Head**: Multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing

### Training Pipeline

The models were trained through a seven-phase pipeline:

1. **Phase A**: Environment setup and action space generation (see the action-space sketch below)
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of the base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy
7. **Phase G**: PPO refinement after distillation
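The repository does not bundle the `PolicyNet` source, so the following is only a minimal sketch, assuming standard PyTorch modules, of how the FiLM conditioning and soft-attention portfolio mixing described above can be wired together. The class names (`FiLMLayer`, `PortfolioHead`) and the 128/64-dimensional embeddings are illustrative assumptions, not the shipped implementation.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scale and shift state features
    with parameters predicted from the opponent encoding."""
    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.gamma = nn.Linear(cond_dim, feat_dim)  # per-feature scale
        self.beta = nn.Linear(cond_dim, feat_dim)   # per-feature shift

    def forward(self, h: torch.Tensor, opp_z: torch.Tensor) -> torch.Tensor:
        return self.gamma(opp_z) * h + self.beta(opp_z)

class PortfolioHead(nn.Module):
    """Mix several specialist strategy heads with soft attention weights."""
    def __init__(self, feat_dim: int, n_actions: int, n_strat: int = 6):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_actions) for _ in range(n_strat)]
        )
        self.mixer = nn.Linear(feat_dim, n_strat)  # attention over strategies

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        logits = torch.stack([head(h) for head in self.heads], dim=1)  # (B, n_strat, A)
        weights = torch.softmax(self.mixer(h), dim=-1).unsqueeze(-1)   # (B, n_strat, 1)
        return (weights * logits).sum(dim=1)                           # (B, A)

# Condition a game-state embedding on an opponent embedding, then
# produce mixed action logits over the 231 possible allocations.
state_h = torch.randn(4, 128)  # graph-encoder output (batch of 4)
opp_z = torch.randn(4, 64)     # opponent-encoder output
film = FiLMLayer(128, 64)
head = PortfolioHead(128, n_actions=231)
action_logits = head(film(state_h, opp_z))
print(action_logits.shape)     # torch.Size([4, 231])
```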
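The `n_actions=231` value used in the loading example further down follows directly from the game configuration: an allocation is a non-negative split of 20 units across 3 fields, and there are C(22, 2) = 231 such splits. The snippet below is a small sanity check (not the repository's Phase A code) that enumerates them.

```python
from itertools import product
from math import comb

F, U = 3, 20  # fields, units per round (from the game configuration above)

# Enumerate every allocation of U units across F = 3 fields.
actions = [
    (a, b, U - a - b)
    for a, b in product(range(U + 1), repeat=2)
    if a + b <= U
]

print(len(actions))            # 231
print(comb(U + F - 1, F - 1))  # stars-and-bars check: C(22, 2) = 231
```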
## 📦 Repository Contents

### Policy Models

- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint

### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)

### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation

## 🚀 Usage

### Loading Policy Model

```python
import json

import torch
from your_policy_module import PolicyNet  # placeholder: the PolicyNet source is not bundled here

# Load the training configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize the policy with the same hyperparameters used in training
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # number of allocations for F=3, U=20
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```

### Loading Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned SFT model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "You are playing Colonel Blotto with 3 fields and 20 units. Allocate your units."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation

### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

## 📝 Citation

If you use this work, please cite:

```bibtex
@misc{colonelblotto2025neurips,
  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---

**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment