andrej-karpathy-llm-council / DEPLOYMENT_GUIDE.md
Krishna Chaitanya Cheedella
Update deployment guides for OpenAI + HuggingFace setup
f3045de


LLM Council - Comprehensive Guide

πŸ“ Overview

The LLM Council is a sophisticated multi-agent system that uses multiple Large Language Models (LLMs) to collectively answer questions through a 3-stage deliberation process:

  1. Stage 1 - Individual Responses: Each council member independently answers the question
  2. Stage 2 - Peer Review: Council members rank each other's anonymized responses
  3. Stage 3 - Synthesis: A chairman model synthesizes the final answer based on all inputs

Current Implementation: Uses FREE HuggingFace models (60%) + cheap OpenAI models (40%)
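The three stages can be sketched in a few lines of asyncio; `ask_model` here is a hypothetical stand-in for a real HuggingFace/OpenAI API call (it simply echoes), and `run_council` is an illustrative name, not the project's actual function:

```python
import asyncio

# Hypothetical stand-in for a real API call; a real implementation would
# hit the HuggingFace router or OpenAI endpoint instead of echoing.
async def ask_model(model: str, prompt: str) -> str:
    return f"{model}: answer to '{prompt}'"

async def run_council(question: str, council: list[str], chairman: str) -> str:
    # Stage 1: each member answers independently, in parallel
    answers = await asyncio.gather(*(ask_model(m, question) for m in council))

    # Stage 2: members rank the anonymized answers (Response A, B, C, ...)
    labeled = "\n".join(f"Response {chr(65 + i)}: {a}" for i, a in enumerate(answers))
    rankings = await asyncio.gather(
        *(ask_model(m, f"Rank these responses:\n{labeled}") for m in council)
    )

    # Stage 3: the chairman synthesizes the final answer from all inputs
    return await ask_model(
        chairman, f"Question: {question}\nAnswers:\n{labeled}\nRankings: {rankings}"
    )

final = asyncio.run(run_council("What is 2+2?", ["m1", "m2", "m3"], "chair"))
```

The real system streams the chairman's output and handles per-model failures; this sketch only shows the control flow.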

πŸ—οΈ Architecture

Current Implementation

┌──────────────────────────────────────────────────────────────┐
│                        User Question                         │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 1: Parallel Responses from 3-5 Council Models         │
│  • Model 1: Individual answer                                │
│  • Model 2: Individual answer                                │
│  • Model 3: Individual answer                                │
│  • (etc...)                                                  │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 2: Peer Rankings (Anonymized)                         │
│  • Each model ranks all responses (Response A, B, C...)      │
│  • Aggregate rankings calculated                             │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 3: Chairman Synthesis                                 │
│  • Reviews all responses + rankings                          │
│  • Generates final comprehensive answer                      │
└──────────────────────────────────────────────────────────────┘
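Stage 2's "aggregate rankings" step can be illustrated with a simple average-position scheme (the response ranked best on average wins); the function name is hypothetical and the real project may aggregate differently:

```python
from collections import defaultdict

def aggregate_rankings(rankings: list[list[str]]) -> list[str]:
    """Combine ordered rankings of response labels into one overall order."""
    positions = defaultdict(list)
    for ranking in rankings:
        for pos, label in enumerate(ranking):
            positions[label].append(pos)
    # Sort labels by mean position: lower means ranked better on average
    return sorted(positions, key=lambda lbl: sum(positions[lbl]) / len(positions[lbl]))

# Three rankers each order responses A, B, and C
print(aggregate_rankings([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]))
# → ['A', 'B', 'C']
```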

🔧 Current Models (FREE HuggingFace + OpenAI)

Council Members (5 models)

FREE HuggingFace Models (via Inference API):

  • meta-llama/Llama-3.3-70B-Instruct - Meta's latest Llama (FREE)
  • Qwen/Qwen2.5-72B-Instruct - Alibaba's Qwen (FREE)
  • mistralai/Mixtral-8x7B-Instruct-v0.1 - Mistral MoE (FREE)

OpenAI Models (paid but cheap):

  • gpt-4o-mini - Fast, affordable GPT-4o variant
  • gpt-3.5-turbo - Ultra cheap, still capable

Chairman

  • gpt-4o-mini - Final synthesis model

Benefits of Current Setup:

  • 60% of models are completely FREE (HuggingFace)
  • 40% use cheap OpenAI models ($0.001-0.01 per query)
  • 90-99% cost reduction compared to all-paid alternatives
  • No experimental/beta endpoints - all stable APIs
  • Diverse model providers for varied perspectives

✨ Alternative Model Configurations

All-FREE Council (100% HuggingFace)

COUNCIL_MODELS = [
    {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"},
    {"provider": "huggingface", "model": "Qwen/Qwen2.5-72B-Instruct"},
    {"provider": "huggingface", "model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
    {"provider": "huggingface", "model": "meta-llama/Llama-3.1-405B-Instruct"},
    {"provider": "huggingface", "model": "microsoft/Phi-3.5-MoE-instruct"},
]
CHAIRMAN_MODEL = {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"}

Cost: $0.00 per query!

Premium Council (OpenAI + HuggingFace)

COUNCIL_MODELS = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openai", "model": "gpt-4-turbo"},
    {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"},
    {"provider": "huggingface", "model": "Qwen/Qwen2.5-72B-Instruct"},
    {"provider": "openai", "model": "gpt-3.5-turbo"},
]
CHAIRMAN_MODEL = {"provider": "openai", "model": "gpt-4o"}

Cost: ~$0.05-0.15 per query

🚀 Running on Hugging Face Spaces

Prerequisites

  1. OpenAI API Key:

    • Sign up at platform.openai.com
    • Go to API Keys → Create new secret key
    • Copy your key (starts with sk-)
    • Add billing info and credits ($5-10 is plenty)
  2. HuggingFace API Token:

    • Sign up at huggingface.co
    • Go to Settings → Access Tokens → New token
    • Copy your token (starts with hf_)
    • FREE! No billing required
  3. HuggingFace Account: For deploying Spaces

Step-by-Step Deployment

Method 1: Deploy Your Existing Code

  1. Create New Space

    • Go to huggingface.co/new-space
    • Choose "Gradio" as SDK
    • Select SDK version: 6.0.0
    • Choose hardware: CPU (free)
  2. Push Your Code

    # Clone your space
    git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
    cd YOUR_SPACE_NAME
    
    # Copy your LLM Council code
    cp -r /path/to/llm_council/* .
    
    # Commit and push
    git add .
    git commit -m "Initial deployment"
    git push
    
  3. Configure Secrets

    • Go to your space → Settings → Repository secrets
    • Add secret #1:
      • Name: OPENAI_API_KEY
      • Value: (your OpenAI key starting with sk-)
    • Add secret #2:
      • Name: HUGGINGFACE_API_KEY
      • Value: (your HuggingFace token starting with hf_)
  4. Space Auto-Restarts

    • HF Spaces will automatically rebuild and deploy
    • Check the "Logs" tab to verify successful startup

Required Files Structure

your-space/
β”œβ”€β”€ README.md                    # Space configuration
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ app.py                       # Main Gradio app
β”œβ”€β”€ .env.example                 # Environment template
└── backend/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ config.py                # Model configuration
    β”œβ”€β”€ council.py               # 3-stage logic
    β”œβ”€β”€ openrouter.py            # API client
    β”œβ”€β”€ storage.py               # Data storage
    └── main.py                  # FastAPI (optional)

πŸ” Environment Variables

Required Variables

For Local Development (.env file):

OPENAI_API_KEY=sk-proj-your-key-here
HUGGINGFACE_API_KEY=hf_your-token-here

For HuggingFace Spaces (Settings → Repository secrets):

  • Secret 1: OPENAI_API_KEY = sk-proj-...
  • Secret 2: HUGGINGFACE_API_KEY = hf_...
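Either way, the app can fail fast at startup if a key is missing; a small, hypothetical check (on Spaces, Repository secrets show up as ordinary environment variables):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY")

def missing_keys(env=os.environ) -> list[str]:
    # Return the names of required keys that are unset or empty
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: only the OpenAI key is set, so the HF key is reported missing
print(missing_keys({"OPENAI_API_KEY": "sk-proj-example"}))
# → ['HUGGINGFACE_API_KEY']
```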

API Endpoints Used

HuggingFace Inference API:

  • Endpoint: https://router.huggingface.co/v1/chat/completions
  • Format: OpenAI-compatible
  • Cost: FREE for inference API
  • Models: Llama, Qwen, Mixtral, etc.

OpenAI API:

  • Endpoint: https://api.openai.com/v1/chat/completions
  • Format: Native OpenAI
  • Cost: Pay-per-token (very cheap for mini/3.5-turbo)
  • Models: GPT-4o-mini, GPT-3.5-turbo, GPT-4o


📦 Dependencies

gradio>=6.0.0
httpx>=0.27.0
python-dotenv>=1.0.0
openai>=1.0.0             # For OpenAI API

Note: The system uses:

  • httpx for async HTTP requests to HuggingFace API
  • openai SDK for OpenAI API calls
  • python-dotenv to load environment variables from .env

💻 Running Locally

# 1. Clone repository (use your own space URL)
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create .env file with both API keys
echo OPENAI_API_KEY=sk-proj-your-key-here > .env
echo HUGGINGFACE_API_KEY=hf_your-token-here >> .env

# 5. Run the app
python app.py

The app will be available at http://localhost:7860

🔧 Code Architecture

Key Components

1. Dual API Client (backend/api_client.py):

  • Supports both HuggingFace and OpenAI APIs
  • Automatic retry logic with exponential backoff
  • Graceful error handling and fallbacks
  • Parallel model querying for efficiency
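The retry behavior described above can be sketched generically; the attempt count mirrors the "3 attempts" mentioned in this guide, while the helper name and delay values are illustrative:

```python
import asyncio
import random

async def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Run an async callable, retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # base, 2x base, 4x base, ... plus jitter to spread out retries
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))

# Example: a call that fails twice, then succeeds on the third attempt
calls = {"n": 0}
async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01))  # tiny delay for the demo
```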

2. FREE Model Configuration (backend/config_free.py):

  • Mix of FREE HuggingFace + cheap OpenAI models
  • Configurable timeouts and retries
  • Easy to customize and extend

3. Council Orchestration (backend/council_free.py):

  • Stage 1: Parallel response collection
  • Stage 2: Peer ranking system
  • Stage 3: Chairman synthesis with streaming

Error Handling Features

  • Retry logic with exponential backoff (3 attempts)
  • Graceful handling of individual model failures
  • Detailed error logging for debugging
  • Timeout management (60s default)

Benefits of Current Architecture

  • Cost Efficient: 60% FREE models, 40% ultra-cheap
  • Robust: Retry logic handles transient failures
  • Fast: Parallel execution minimizes wait time
  • Flexible: Easy to add/remove models
  • Observable: Detailed logging for debugging

📊 Performance Characteristics

Typical Response Times (Current Setup)

  • Stage 1: 10-30 seconds (5 models in parallel)
  • Stage 2: 15-45 seconds (peer rankings)
  • Stage 3: 15-40 seconds (synthesis with streaming)
  • Total: ~40-115 seconds per question

Cost per Query (Current Setup)

  • FREE HuggingFace portion: $0.00 (3 models)
  • OpenAI portion: $0.001-0.01 (2 models)
  • Total: ~$0.001-0.01 per query
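Those figures can be sanity-checked with a back-of-envelope estimate. The per-1M-token prices below are illustrative only (check OpenAI's pricing page for current rates), and HuggingFace Inference API calls are treated as free:

```python
# Illustrative (input, output) prices in USD per 1M tokens - not authoritative
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def estimate_cost(models, input_tokens=1000, output_tokens=500):
    total = 0.0
    for m in models:
        inp, out = PRICES.get(m, (0.0, 0.0))  # unknown/HF models count as free
        total += (input_tokens * inp + output_tokens * out) / 1_000_000
    return total

# The current 5-model council: 3 free HF models + 2 cheap OpenAI models
cost = estimate_cost(["llama", "qwen", "mixtral", "gpt-4o-mini", "gpt-3.5-turbo"])
# ≈ $0.0017 per query at these example prices - within the $0.001-0.01 range
```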

Comparison to alternatives:

  • 90-99% cheaper than all-paid services
  • Similar quality to premium setups
  • Faster than sequential execution

Costs vary with prompt length and response complexity.

πŸ› Troubleshooting

Common Issues

  1. "401 Unauthorized" errors

    • Check both API keys are set correctly
    • Verify OpenAI key starts with sk-
    • Verify HuggingFace key starts with hf_
    • Ensure OpenAI account has billing/credits enabled
    • Check Space secrets are named exactly: OPENAI_API_KEY and HUGGINGFACE_API_KEY
  2. Timeout errors

    • Increase timeout in backend/config_free.py
    • Check network connectivity
    • Some models may be slow - consider replacing them
  3. Space won't start

    • Verify requirements.txt includes all dependencies
    • Check logs in Space → Logs tab
    • Ensure both secrets are added (not just one)
    • Verify Python version compatibility (3.10+)
  4. Some models fail, others work

    • Normal! System is designed to handle partial failures
    • Check logs to see which models failed
    • HuggingFace API may have rate limits (rare)
    • OpenAI API requires billing setup
  5. HuggingFace 410 error

    • Old endpoint deprecated
    • Ensure you are using router.huggingface.co/v1/chat/completions
    • Update backend/api_client.py if needed

🎯 Best Practices

  1. Model Selection

    • Use 3-5 council members (sweet spot for quality vs speed)
    • Mix FREE HuggingFace + cheap OpenAI for best value
    • Choose diverse models for varied perspectives
    • Match chairman to task complexity
  2. Cost Management

    • Start with current setup ($0.001-0.01 per query)
    • Consider all-FREE HuggingFace config for $0 cost
    • Monitor OpenAI usage at platform.openai.com/usage
    • Set spending limits in OpenAI billing settings
  3. Quality Optimization

    • Use more council members for important queries (5-7)
    • Use better chairman (gpt-4o instead of gpt-4o-mini)
    • Adjust timeouts based on model speed
    • Test different model combinations
  4. Security

    • NEVER commit .env to git (use .gitignore)
    • Use HuggingFace Space secrets for production
    • Rotate API keys periodically
    • Monitor usage for anomalies
    • Set spending limits

🤝 Contributing

Suggestions for improvement:

  1. Add caching for repeated questions
  2. Implement conversation history
  3. Add custom model configurations via UI
  4. Support for different voting mechanisms
  5. Add cost tracking and estimates

πŸ“ License

Check the original repository for license information.