andrej-karpathy-llm-council / DEPLOYMENT_GUIDE.md
Krishna Chaitanya Cheedella
Update deployment guides for OpenAI + HuggingFace setup
f3045de


LLM Council - Comprehensive Guide

πŸ“ Overview

The LLM Council is a sophisticated multi-agent system that uses multiple Large Language Models (LLMs) to collectively answer questions through a 3-stage deliberation process:

  1. Stage 1 - Individual Responses: Each council member independently answers the question
  2. Stage 2 - Peer Review: Council members rank each other's anonymized responses
  3. Stage 3 - Synthesis: A chairman model synthesizes the final answer based on all inputs

Current Implementation: Uses FREE HuggingFace models (60%) + cheap OpenAI models (40%)
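The three stages can be sketched in a few lines of asyncio; `ask_model` here is a hypothetical stand-in for a real HuggingFace/OpenAI API call (it simply echoes), and `run_council` is an illustrative name, not the project's actual function:

```python
import asyncio

# Hypothetical stand-in for a real API call; a real implementation would
# hit the HuggingFace router or OpenAI endpoint instead of echoing.
async def ask_model(model: str, prompt: str) -> str:
    return f"{model}: answer to '{prompt}'"

async def run_council(question: str, council: list[str], chairman: str) -> str:
    # Stage 1: each member answers independently, in parallel
    answers = await asyncio.gather(*(ask_model(m, question) for m in council))

    # Stage 2: members rank the anonymized answers (Response A, B, C, ...)
    labeled = "\n".join(f"Response {chr(65 + i)}: {a}" for i, a in enumerate(answers))
    rankings = await asyncio.gather(
        *(ask_model(m, f"Rank these responses:\n{labeled}") for m in council)
    )

    # Stage 3: the chairman synthesizes the final answer from all inputs
    return await ask_model(
        chairman, f"Question: {question}\nAnswers:\n{labeled}\nRankings: {rankings}"
    )

final = asyncio.run(run_council("What is 2+2?", ["m1", "m2", "m3"], "chair"))
```

The real system streams the chairman's output and handles per-model failures; this sketch only shows the control flow.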

πŸ—οΈ Architecture

Current Implementation

┌──────────────────────────────────────────────────────────────┐
│                        User Question                         │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 1: Parallel Responses from 3-5 Council Models         │
│  • Model 1: Individual answer                                │
│  • Model 2: Individual answer                                │
│  • Model 3: Individual answer                                │
│  • (etc...)                                                  │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 2: Peer Rankings (Anonymized)                         │
│  • Each model ranks all responses (Response A, B, C...)      │
│  • Aggregate rankings calculated                             │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  Stage 3: Chairman Synthesis                                 │
│  • Reviews all responses + rankings                          │
│  • Generates final comprehensive answer                      │
└──────────────────────────────────────────────────────────────┘
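Stage 2's "aggregate rankings" step can be illustrated with a simple average-position scheme (the response ranked best on average wins); the function name is hypothetical and the real project may aggregate differently:

```python
from collections import defaultdict

def aggregate_rankings(rankings: list[list[str]]) -> list[str]:
    """Combine ordered rankings of response labels into one overall order."""
    positions = defaultdict(list)
    for ranking in rankings:
        for pos, label in enumerate(ranking):
            positions[label].append(pos)
    # Sort labels by mean position: lower means ranked better on average
    return sorted(positions, key=lambda lbl: sum(positions[lbl]) / len(positions[lbl]))

# Three rankers each order responses A, B, and C
print(aggregate_rankings([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]))
# → ['A', 'B', 'C']
```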

🔧 Current Models (FREE HuggingFace + OpenAI)

Council Members (5 models)

FREE HuggingFace Models (via Inference API):

  • meta-llama/Llama-3.3-70B-Instruct - Meta's latest Llama (FREE)
  • Qwen/Qwen2.5-72B-Instruct - Alibaba's Qwen (FREE)
  • mistralai/Mixtral-8x7B-Instruct-v0.1 - Mistral MoE (FREE)

OpenAI Models (paid but cheap):

  • gpt-4o-mini - Fast, affordable GPT-4o variant
  • gpt-3.5-turbo - Ultra cheap, still capable

Chairman

  • gpt-4o-mini - Final synthesis model

Benefits of Current Setup:

  • 60% of models are completely FREE (HuggingFace)
  • 40% use cheap OpenAI models ($0.001-0.01 per query)
  • 90-99% cost reduction compared to all-paid alternatives
  • No experimental/beta endpoints - all stable APIs
  • Diverse model providers for varied perspectives

✨ Alternative Model Configurations

All-FREE Council (100% HuggingFace)

COUNCIL_MODELS = [
    {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"},
    {"provider": "huggingface", "model": "Qwen/Qwen2.5-72B-Instruct"},
    {"provider": "huggingface", "model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
    {"provider": "huggingface", "model": "meta-llama/Llama-3.1-405B-Instruct"},
    {"provider": "huggingface", "model": "microsoft/Phi-3.5-MoE-instruct"},
]
CHAIRMAN_MODEL = {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"}

Cost: $0.00 per query!

Premium Council (OpenAI + HuggingFace)

COUNCIL_MODELS = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openai", "model": "gpt-4-turbo"},
    {"provider": "huggingface", "model": "meta-llama/Llama-3.3-70B-Instruct"},
    {"provider": "huggingface", "model": "Qwen/Qwen2.5-72B-Instruct"},
    {"provider": "openai", "model": "gpt-3.5-turbo"},
]
CHAIRMAN_MODEL = {"provider": "openai", "model": "gpt-4o"}

Cost: ~$0.05-0.15 per query

🚀 Running on Hugging Face Spaces

Prerequisites

  1. OpenAI API Key:

    • Sign up at platform.openai.com
    • Go to API Keys → Create new secret key
    • Copy your key (starts with sk-)
    • Add billing info and credits ($5-10 is plenty)
  2. HuggingFace API Token:

    • Sign up at huggingface.co
    • Go to Settings → Access Tokens → New token
    • Copy your token (starts with hf_)
    • FREE! No billing required
  3. HuggingFace Account: For deploying Spaces

Step-by-Step Deployment

Method 1: Deploy Your Existing Code

  1. Create New Space

    • Go to huggingface.co/new-space
    • Choose "Gradio" as SDK
    • Select SDK version: 6.0.0
    • Choose hardware: CPU (free)
  2. Push Your Code

    # Clone your space
    git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
    cd YOUR_SPACE_NAME
    
    # Copy your LLM Council code
    cp -r /path/to/llm_council/* .
    
    # Commit and push
    git add .
    git commit -m "Initial deployment"
    git push
    
  3. Configure Secrets

    • Go to your space → Settings → Repository secrets
    • Add secret #1:
      • Name: OPENAI_API_KEY
      • Value: (your OpenAI key starting with sk-)
    • Add secret #2:
      • Name: HUGGINGFACE_API_KEY
      • Value: (your HuggingFace token starting with hf_)
  4. Space Auto-Restarts

    • HF Spaces will automatically rebuild and deploy
    • Check the "Logs" tab to verify successful startup

Required Files Structure

your-space/
β”œβ”€β”€ README.md                    # Space configuration
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ app.py                       # Main Gradio app
β”œβ”€β”€ .env.example                 # Environment template
└── backend/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ config.py                # Model configuration
    β”œβ”€β”€ council.py               # 3-stage logic
    β”œβ”€β”€ openrouter.py            # API client
    β”œβ”€β”€ storage.py               # Data storage
    └── main.py                  # FastAPI (optional)

πŸ” Environment Variables

Required Variables

For Local Development (.env file):

OPENAI_API_KEY=sk-proj-your-key-here
HUGGINGFACE_API_KEY=hf_your-token-here

For HuggingFace Spaces (Settings → Repository secrets):

  • Secret 1: OPENAI_API_KEY = sk-proj-...
  • Secret 2: HUGGINGFACE_API_KEY = hf_...
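Either way, the app can fail fast at startup if a key is missing; a small, hypothetical check (on Spaces, Repository secrets show up as ordinary environment variables):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY")

def missing_keys(env=os.environ) -> list[str]:
    # Return the names of required keys that are unset or empty
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: only the OpenAI key is set, so the HF key is reported missing
print(missing_keys({"OPENAI_API_KEY": "sk-proj-example"}))
# → ['HUGGINGFACE_API_KEY']
```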

API Endpoints Used

HuggingFace Inference API:

  • Endpoint: https://router.huggingface.co/v1/chat/completions
  • Format: OpenAI-compatible
  • Cost: FREE for inference API
  • Models: Llama, Qwen, Mixtral, etc.

OpenAI API:

  • Endpoint: https://api.openai.com/v1/chat/completions
  • Format: Native OpenAI
  • Cost: Pay-per-token (very cheap for mini/3.5-turbo)
  • Models: GPT-4o-mini, GPT-3.5-turbo, GPT-4o


📦 Dependencies

gradio>=6.0.0
httpx>=0.27.0
python-dotenv>=1.0.0
openai>=1.0.0             # For OpenAI API

Note: The system uses:

  • httpx for async HTTP requests to HuggingFace API
  • openai SDK for OpenAI API calls
  • python-dotenv to load environment variables from .env

💻 Running Locally

# 1. Clone repository (use your own space URL)
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create .env file with both API keys
echo OPENAI_API_KEY=sk-proj-your-key-here > .env
echo HUGGINGFACE_API_KEY=hf_your-token-here >> .env

# 5. Run the app
python app.py

The app will be available at http://localhost:7860

🔧 Code Architecture

Key Components

1. Dual API Client (backend/api_client.py):

  • Supports both HuggingFace and OpenAI APIs
  • Automatic retry logic with exponential backoff
  • Graceful error handling and fallbacks
  • Parallel model querying for efficiency
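The retry behavior described above can be sketched generically; the attempt count mirrors the "3 attempts" mentioned in this guide, while the helper name and delay values are illustrative:

```python
import asyncio
import random

async def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Run an async callable, retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # base, 2x base, 4x base, ... plus jitter to spread out retries
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))

# Example: a call that fails twice, then succeeds on the third attempt
calls = {"n": 0}
async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01))  # tiny delay for the demo
```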

2. FREE Model Configuration (backend/config_free.py):

  • Mix of FREE HuggingFace + cheap OpenAI models
  • Configurable timeouts and retries
  • Easy to customize and extend

3. Council Orchestration (backend/council_free.py):

  • Stage 1: Parallel response collection
  • Stage 2: Peer ranking system
  • Stage 3: Chairman synthesis with streaming

Error Handling Features

  • Retry logic with exponential backoff (3 attempts)
  • Graceful handling of individual model failures
  • Detailed error logging for debugging
  • Timeout management (60s default)

Benefits of Current Architecture

  • Cost Efficient: 60% FREE models, 40% ultra-cheap
  • Robust: Retry logic handles transient failures
  • Fast: Parallel execution minimizes wait time
  • Flexible: Easy to add/remove models
  • Observable: Detailed logging for debugging

📊 Performance Characteristics

Typical Response Times (Current Setup)

  • Stage 1: 10-30 seconds (5 models in parallel)
  • Stage 2: 15-45 seconds (peer rankings)
  • Stage 3: 15-40 seconds (synthesis with streaming)
  • Total: ~40-115 seconds per question

Cost per Query (Current Setup)

  • FREE HuggingFace portion: $0.00 (3 models)
  • OpenAI portion: $0.001-0.01 (2 models)
  • Total: ~$0.001-0.01 per query
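Those figures can be sanity-checked with a back-of-envelope estimate. The per-1M-token prices below are illustrative only (check OpenAI's pricing page for current rates), and HuggingFace Inference API calls are treated as free:

```python
# Illustrative (input, output) prices in USD per 1M tokens - not authoritative
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def estimate_cost(models, input_tokens=1000, output_tokens=500):
    total = 0.0
    for m in models:
        inp, out = PRICES.get(m, (0.0, 0.0))  # unknown/HF models count as free
        total += (input_tokens * inp + output_tokens * out) / 1_000_000
    return total

# The current 5-model council: 3 free HF models + 2 cheap OpenAI models
cost = estimate_cost(["llama", "qwen", "mixtral", "gpt-4o-mini", "gpt-3.5-turbo"])
# ≈ $0.0017 per query at these example prices - within the $0.001-0.01 range
```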

Comparison to alternatives:

  • 90-99% cheaper than all-paid services
  • Similar quality to premium setups
  • Faster than sequential execution

Costs vary with prompt length and response complexity.

πŸ› Troubleshooting

Common Issues

  1. "401 Unauthorized" errors

    • Check both API keys are set correctly
    • Verify OpenAI key starts with sk-
    • Verify HuggingFace key starts with hf_
    • Ensure OpenAI account has billing/credits enabled
    • Check Space secrets are named exactly: OPENAI_API_KEY and HUGGINGFACE_API_KEY
  2. Timeout errors

    • Increase timeout in backend/config_free.py
    • Check network connectivity
    • Some models may be slow - consider replacing them
  3. Space won't start

    • Verify requirements.txt includes all dependencies
    • Check logs in Space → Logs tab
    • Ensure both secrets are added (not just one)
    • Verify Python version compatibility (3.10+)
  4. Some models fail, others work

    • Normal! System is designed to handle partial failures
    • Check logs to see which models failed
    • HuggingFace API may have rate limits (rare)
    • OpenAI API requires billing setup
  5. HuggingFace 410 error

    • Old endpoint deprecated
    • Ensure you are using router.huggingface.co/v1/chat/completions
    • Update backend/api_client.py if needed

🎯 Best Practices

  1. Model Selection

    • Use 3-5 council members (sweet spot for quality vs speed)
    • Mix FREE HuggingFace + cheap OpenAI for best value
    • Choose diverse models for varied perspectives
    • Match chairman to task complexity
  2. Cost Management

    • Start with current setup ($0.001-0.01 per query)
    • Consider all-FREE HuggingFace config for $0 cost
    • Monitor OpenAI usage at platform.openai.com/usage
    • Set spending limits in OpenAI billing settings
  3. Quality Optimization

    • Use more council members for important queries (5-7)
    • Use better chairman (gpt-4o instead of gpt-4o-mini)
    • Adjust timeouts based on model speed
    • Test different model combinations
  4. Security

    • NEVER commit .env to git (use .gitignore)
    • Use HuggingFace Space secrets for production
    • Rotate API keys periodically
    • Monitor usage for anomalies
    • Set spending limits

🤝 Contributing

Suggestions for improvement:

  1. Add caching for repeated questions
  2. Implement conversation history
  3. Add custom model configurations via UI
  4. Support for different voting mechanisms
  5. Add cost tracking and estimates

πŸ“ License

Check the original repository for license information.