GOVINDFROM/MindGamesCodeNames

@@ -2,130 +2,149 @@
 tags:
 - reinforcement-learning
 - game-theory
-- colonel-blotto
 - neurips-2025
 - graph-neural-networks
-- meta-learning
 license: mit
 ---
-# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025
 ![Status](https://img.shields.io/badge/status-trained-success)
 ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
 ![License](https://img.shields.io/badge/license-MIT-blue)
-This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
-## 🎯 Model Overview
-This is an advanced system that achieves strong performance on Colonel Blotto through:
-- **Graph Neural Networks** for game state representation
-- **FiLM layers** for fast opponent adaptation
-- **Meta-learning** for strategy portfolios
-- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
-- **Distillation** from LLMs back to efficient RL policies
-### Game Configuration
-- **Fields**: 3
-- **Units per round**: 20
-- **Rounds per game**: 5
-- **Training episodes**: 1000
-## 📊 Performance Results
-### Against Scripted Opponents
-**Overall Win Rate**: N/A
-### Against LLMs
-| Matchup | Win Rate |
-|---------|----------|
-| Policy vs Base Llama | 93.00% |
-| Policy vs Qwen | 76.00% |
-## 🏗️ Architecture
-### Policy Network
-The core policy network uses a sophisticated architecture:
-1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
-   - Heterogeneous nodes: field nodes, round nodes, summary node
-   - Multi-head attention with 6 heads
-   - 3 layers of message passing
-2. **Opponent Encoder**: MLP-based encoder for opponent modeling
-   - Processes opponent history
-   - Learns behavioral patterns
-3. **FiLM Layers**: Feature-wise Linear Modulation
-   - Fast adaptation to opponent behavior
-   - Conditioned on opponent encoding
-4. **Portfolio Head**: Multi-strategy selection
-   - 6 specialist strategy heads
-   - Soft attention-based mixing
-### Training Pipeline
-The models were trained through a comprehensive 7-phase pipeline:
-1. **Phase A**: Environment setup and action space generation
-2. **Phase B**: PPO training against diverse scripted opponents
-3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
-4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
-5. **Phase E**: Direct Preference Optimization (DPO)
-6. **Phase F**: Knowledge distillation from LLM to policy
-7. **Phase G**: PPO refinement after distillation
-## 📦 Repository Contents
-### Policy Models
-- `policy_models/policy_final.pt`: PyTorch checkpoint
-- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
-- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint
-### Fine-tuned LLM Models
-- `sft_model/`: SFT model (HuggingFace Transformers compatible)
-### Configuration & Results
-- `master_config.json`: Complete training configuration
-- `battleground_eval.json`: Comprehensive evaluation results
-- `eval_scripted_after_ppo.json`: Post-PPO evaluation
-## 🚀 Usage
-### Loading Policy Model
 ```python
 import torch
-from your_policy_module import PolicyNet
-# Load configuration
-with open("master_config.json", "r") as f:
-    config = json.load(f)
-# Initialize policy
-policy = PolicyNet(
-    Ff=config["F"],
-    n_actions=231,  # For F=3, U=20
-    hidden=config["hidden"],
-    gnn_layers=config["gnn_layers"],
-    gnn_heads=config["gnn_heads"],
-    n_strat=config["n_strat"]
-)
-# Load trained weights
-policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
 policy.eval()
 ```
@@ -147,10 +166,9 @@ outputs = model.generate(**inputs, max_new_tokens=32)
 This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
-- **Strategic game AI** beyond traditional game-theoretic approaches
-- **Hybrid systems** combining neural RL and LLM reasoning
-- **Fast adaptation** to diverse opponents through meta-learning
-- **Efficient deployment** via distillation
 ### Key Innovations
@@ -159,19 +177,6 @@ This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
 3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
 4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
-## 📝 Citation
-If you use this work, please cite:
-```bibtex
-@misc{colonelblotto2025neurips,
-  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
-  author={{NeurIPS 2025 MindGames Submission}},
-  year={2025},
-  publisher={HuggingFace Hub},
-  howpublished={{\url{{https://huggingface.co/{repo_id}}}}},
-}
-```
 ## 📄 License

 tags:
 - reinforcement-learning
 - game-theory
+- codenames
 - neurips-2025
 - graph-neural-networks
+- preference-learning
+- llm-distillation
 license: mit
 ---
+# Codenames: Graph-Based RL with LLM-Guided Preference Distillation
 ![Status](https://img.shields.io/badge/status-trained-success)
 ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
 ![License](https://img.shields.io/badge/license-MIT-blue)
+This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
+The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.
+---
+## Overview
+The approach integrates:
+- **Graph Neural Networks** for structured board and history representation
+- **Proximal Policy Optimization (PPO)** for policy learning
+- **Role-conditioned decoding** for spymaster and operative behaviors
+- **Rollout-grounded preference learning** using large language models
+- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
+- **Knowledge distillation** from the aligned teacher back into a compact policy
+The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.
+---
+## Game Configuration
+- **Game**: Codenames
+- **Board size**: 25 words
+- **Roles**: Spymaster and Operative
+- **Evaluation games**: 600 full episodes
+- **Opponents**: Scripted baseline agents
+---
+## Policy Architecture
+### Graph-Based State Encoder
+- Heterogeneous graph with **30–40 nodes**
+- Node types include:
+  - Word nodes with semantic and state features
+  - Historical clue nodes
+  - Global summary node
+- Node feature dimension: **35**
+- Encoder:
+  - 3 Graph Attention layers
+  - 6 attention heads
+  - Hidden size 192
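The node layout above can be illustrated with a minimal sketch. It assumes one node per board word, one node per past clue, and a single summary node linked to every other node in both directions. The names `Node` and `build_graph`, the edge layout, and the zero-filled feature vectors are hypothetical placeholders, not the repository's implementation; only the board size (25) and feature dimension (35) come from the card.

```python
from dataclasses import dataclass, field

NODE_FEATURE_DIM = 35  # node feature dimension stated in the model card
BOARD_SIZE = 25        # words on a Codenames board


@dataclass
class Node:
    kind: str   # "word", "clue", or "summary"
    label: str
    # Placeholder features; a real encoder would fill in semantic/state values.
    features: list = field(default_factory=lambda: [0.0] * NODE_FEATURE_DIM)


def build_graph(board_words, clue_history):
    """Return (nodes, edges) for one game state.

    Assumed layout (illustrative only): word nodes first, then clue nodes,
    then one summary node connected to every other node in both directions.
    """
    if len(board_words) != BOARD_SIZE:
        raise ValueError(f"expected {BOARD_SIZE} words, got {len(board_words)}")
    nodes = [Node("word", w) for w in board_words]
    nodes += [Node("clue", c) for c in clue_history]
    nodes.append(Node("summary", "<summary>"))
    summary_idx = len(nodes) - 1
    edges = []
    for i in range(summary_idx):
        edges.append((i, summary_idx))  # node -> summary
        edges.append((summary_idx, i))  # summary -> node
    return nodes, edges


words = [f"word{i}" for i in range(BOARD_SIZE)]
nodes, edges = build_graph(words, clue_history=["ocean", "metal", "royal", "night"])
print(len(nodes), len(edges))  # 30 58
```

Under this assumed layout, four past clues give 30 nodes, the lower end of the stated 30–40 range; each further clue adds one node.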
+### Role Conditioning
+- Shared policy trunk
+- Role-conditioned action decoding:
+  - Clue generation and constraint handling for spymaster
+  - Guess selection and stopping decisions for operative
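One common way to realize role-conditioned decoding is to apply a role-specific legality mask before the softmax. The sketch below is illustrative only: the action indexing (`STOP_ACTION = 25`), the function names, and the rule that the operative may stop only after at least one guess are assumptions, not details confirmed by this card.

```python
import math

STOP_ACTION = 25  # hypothetical indexing: 0..24 guess a board word, 25 stops guessing


def operative_mask(revealed, guesses_made):
    """Legal actions for the operative: any unrevealed word, plus STOP after one guess."""
    mask = [not r for r in revealed]
    mask.append(guesses_made >= 1)
    return mask


def spymaster_mask(clue_vocab, board_words):
    """Legal clues for the spymaster: any vocabulary word that is not on the board."""
    board = {w.lower() for w in board_words}
    return [c.lower() not in board for c in clue_vocab]


def masked_softmax(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(mask):
        raise ValueError("no legal action available")
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


revealed = [False] * 25
revealed[3] = True  # word 3 has already been uncovered
probs = masked_softmax([0.0] * 26, operative_mask(revealed, guesses_made=0))
clue_ok = spymaster_mask(["river", "bank"], board_words=["BANK", "moon"])
```

With uniform logits, the operative's probability mass is spread evenly over the 24 unrevealed words, while the revealed word and the stop action receive zero.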
+### Model Size
+- Total parameters: **~6.8M**
+- Small enough for fast inference during interactive play
+---
+## Training Pipeline
+Training follows a multi-stage curriculum:
+1. **Graph PPO Pretraining**
+   - PPO with clip ratio 0.2
+   - Discount factor γ = 0.99
+   - GAE λ = 0.95
+   - Trained against scripted Codenames agents
+2. **Preference Generation via Rollouts**
+   - ~800 intermediate states sampled
+   - Candidate actions proposed by:
+     - Llama 3.1 Instruct
+     - Qwen 2.5 Instruct
+   - Each proposal evaluated using multiple stochastic rollouts
+   - Higher-return actions labeled preferred
+3. **Teacher Alignment**
+   - Supervised fine-tuning (SFT) on the chosen actions
+   - Direct Preference Optimization (DPO) against a frozen reference model
+4. **Policy Distillation**
+   - The aligned teacher labels each (state, role) pair with an action
+   - Graph policy trained via cross-entropy imitation
+5. **PPO Refinement**
+   - PPO resumes using environment rewards
+   - Stabilizes policy after distillation
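The rollout-grounded labeling in stage 2 can be sketched as follows. This is a toy illustration, not the repository's code: `rollout_return` is a hypothetical stand-in for playing a Codenames episode to completion, the `action_values` field and the noise model are invented for the example, and the choice of 16 rollouts is an assumption.

```python
import random
from statistics import mean


def rollout_return(state, action, rng):
    """Hypothetical stand-in for one stochastic rollout after taking `action`.

    A real version would step the game environment to the end of the episode;
    here each action has a fixed toy value plus Gaussian noise.
    """
    return state["action_values"][action] + rng.gauss(0.0, 0.5)


def label_preference(state, candidates, n_rollouts=16, seed=0):
    """Score each candidate by its mean rollout return and emit a preference pair."""
    rng = random.Random(seed)
    scores = {
        a: mean(rollout_return(state, a, rng) for _ in range(n_rollouts))
        for a in candidates
    }
    ranked = sorted(candidates, key=scores.get, reverse=True)
    if len(ranked) < 2 or scores[ranked[0]] == scores[ranked[-1]]:
        return None  # no usable preference signal for this state
    return {"chosen": ranked[0], "rejected": ranked[-1], "scores": scores}


# Toy state: the two candidate actions stand in for LLM proposals.
state = {"action_values": {"guess:RIVER": 1.0, "guess:BANK": -1.0}}
pair = label_preference(state, ["guess:RIVER", "guess:BANK"])
print(pair["chosen"], pair["rejected"])
```

Pairs of this shape (a state, a chosen action, a rejected action) are the usual input format for SFT on the chosen action and for DPO on the chosen/rejected contrast.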
+---
+## Evaluation Results
+Evaluation uses **600 full games** against scripted opponents.
+| Agent | Win Rate | Assassin Rate |
+|------|---------|---------------|
+| Graph PPO | 44.8% | 12.6% |
+| PPO + Distillation | 52.9% | 6.9% |
+- Distillation yields an **8.1-point** absolute win-rate improvement (44.8% → 52.9%)
+- The assassin rate falls from 12.6% to 6.9%, a relative reduction of about **45%**
+- Improvements arise primarily from **better risk calibration**, not from more aggressive guessing
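The headline figures can be reproduced from the table with a few lines of arithmetic. The standard-error estimate at the end is an addition for context: it assumes the 600 games are independent Bernoulli trials, which the card does not state.

```python
import math

N_GAMES = 600  # evaluation games per agent, from the card

ppo = {"win": 44.8, "assassin": 12.6}        # Graph PPO, in percent
distilled = {"win": 52.9, "assassin": 6.9}   # PPO + Distillation, in percent

# Absolute win-rate gain in percentage points.
abs_win_gain = distilled["win"] - ppo["win"]

# Relative reduction in the assassin rate.
rel_assassin_drop = (ppo["assassin"] - distilled["assassin"]) / ppo["assassin"]

# Binomial standard error of the distilled win rate (independence assumed).
p = distilled["win"] / 100
win_se = math.sqrt(p * (1 - p) / N_GAMES)

print(f"{abs_win_gain:.1f} points")   # 8.1 points
print(f"{rel_assassin_drop:.1%}")     # 45.2%
print(f"{win_se:.3f}")
```

Under the independence assumption, the standard error on each win rate is roughly two percentage points, which is useful context when comparing the two agents.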
+---
+## Repository Contents
+### Policy Checkpoints
+- `policy_models/policy_after_ppo.pt`
+- `policy_models/policy_after_distill.pt`
+### Teacher Models
+- `sft_model/` – supervised fine-tuned teacher
+- `dpo_model/` – preference-aligned teacher
+### Configuration and Logs
+- `master_config.json`
+- `evaluation_results.json`
+---
+## Usage
+### Load Policy
 ```python
 import torch
+from policy import GraphPolicy
+policy = GraphPolicy(...)
+policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
 policy.eval()
 ```
 This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
+- Grounding the strategic priors of language models with rollouts
+- Structured reasoning in semantic games through graph-based representations
+- Transferring high-level reasoning into efficient, deployable agents via distillation
 ### Key Innovations
 3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
 4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
 ## 📄 License