# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:

- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ **Per-second billing** with no minimum duration

**Choose based on your needs**:

- **HuggingFace Jobs**: Best if you already have an HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:

- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):

- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for the compute time you actually use.*
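Since rates are quoted per hour but billed per second, a quick back-of-envelope helper makes the per-second figures above concrete. This is a minimal illustrative sketch (hardware names and rates copied from the list above; not a TraceMind API):

```python
# Minimal cost sketch for HF Jobs hardware, using the hourly rates listed above.
# Per-second billing with no minimum: cost = hourly_rate / 3600 * runtime_seconds.
HOURLY_USD = {"t4-small": 0.40, "a10g-large": 1.50, "a100-large": 2.50}

def job_cost(hardware: str, runtime_seconds: float) -> float:
    return HOURLY_USD[hardware] / 3600 * runtime_seconds

# A 25-minute evaluation on a10g-large:
print(f"${job_cost('a10g-large', 25 * 60):.3f}")  # -> $0.625
```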
**Pros**:

- Simple authentication (HuggingFace token)
- Integrated with HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:

- Requires HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)

**When to use**:

- ✅ You already have an HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:

- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)

**Pros**:

- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:

- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with the HF ecosystem

**When to use**:

- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to the latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have an HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing Leaderboard (Free)

**Required**:

- HuggingFace account (free)
- HuggingFace token with **Read** permissions

**How to get**:

1. Go to https://huggingface.co/settings/tokens
2. Create a new token with **Read** permission
3. Copy the token (starts with `hf_...`)
4. Add it to the TraceMind Settings tab

### For Submitting Jobs to HuggingFace Jobs

**Required**:

1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add a credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:

1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add a credit card for compute charges
3. Create a token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy the token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:

1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:

1. Create a Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create an API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy the **Token ID** (starts with `ak-...`)
   - Copy the **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys
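Once configured, these credentials end up in the environment variables listed under [Advanced Configuration](#environment-variables). If you want to double-check a local setup, a minimal sanity-check sketch (variable names taken from that section; TraceMind's Settings tab normally handles this for you):

```python
# Quick check that the credentials described above are present as env vars.
# Names match the Environment Variables list in Advanced Configuration.
import os

required = ["HF_TOKEN", "MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing credentials: {', '.join(missing)}")
print("All Modal + HF credentials configured.")
```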
---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:

- Model size (extracted from the model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)

**Auto-selection logic** (summarized as a code sketch after this section):

**For API Models** (provider = `litellm` or `inference`):

- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:

- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |

### Manual Selection

If you know your model's requirements, you can select hardware manually:

**CPU Jobs** (API models like GPT-4, Claude):

- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):

- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):

- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):

- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):

- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:

- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models
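The auto-selection rules above boil down to a small lookup. The following is a sketch of the heuristic as documented here, not TraceMind's actual implementation:

```python
# Sketch of the auto-selection logic described above (assumed, not the real code).
def pick_hardware(model_params_b: float, provider: str, infrastructure: str) -> str:
    if provider in ("litellm", "inference"):      # API models never need a GPU
        return "cpu-basic" if infrastructure == "hf_jobs" else "cpu"
    if infrastructure == "hf_jobs":               # local models, HF Jobs tiers
        if model_params_b <= 5:
            return "t4-small"
        if model_params_b <= 12:
            return "a10g-large"
        return "a100-large"
    if model_params_b <= 5:                       # local models, Modal tiers
        return "gpu_t4"
    if model_params_b <= 12:
        return "gpu_l40s"
    if model_params_b <= 48:
        return "gpu_a100_80gb"
    return "gpu_h200"

print(pick_hardware(8, "transformers", "modal"))  # -> gpu_l40s
```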
---

## Submitting a Job

### Step 1: Navigate to the New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:

- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:

- Use `auto` (recommended) or select specific hardware
- See the [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:

- Enter a model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:

- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For the HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):

- Leave empty unless using the HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):

- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:

- `tool` - Function calling agents only
- `code` - Code execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:

- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires Serper API key
- `brave` - Requires Brave Search API key

**Enable Optional Tools**:

- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:

- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:

- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:

- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:

- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and API rate-limit pressure

### Step 6: Configure Output & Monitoring

**Output Format**:

- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires an output directory)

**Output Directory**:

- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:

- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization

**Enable GPU Metrics**:

- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs

**Private Datasets**:

- ☐ Make result datasets private on HuggingFace
- Default: Public datasets

**Debug Mode**:

- ☐ Enable verbose logging for troubleshooting
- Default: Off

**Quiet Mode**:

- ☐ Reduce output verbosity
- Default: Off

**Run ID** (optional):

- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:

- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- The job will be terminated if it exceeds the timeout
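Taken together, Steps 2-6 amount to a configuration like the following. This is an illustrative sketch whose field names simply mirror the form labels above; it is not TraceMind's actual schema:

```python
# Illustrative job configuration mirroring the form fields in Steps 2-6.
# Field names are assumptions for readability, not TraceMind's real schema.
job_config = {
    "infrastructure": "modal",                         # or "hf_jobs"
    "hardware": "auto",                                # auto-selection recommended
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "transformers",                        # litellm | inference | transformers
    "agent_type": "both",                              # tool | code | both
    "search_provider": "duckduckgo",
    "dataset_name": "kshitijthakkar/smoltrace-tasks",
    "dataset_split": "train",
    "difficulty": "all",
    "parallel_workers": 1,
    "output_format": "hub",
    "enable_otel": True,                               # OpenTelemetry traces
    "enable_gpu_metrics": True,                        # GPU utilization, CO2, etc.
    "private": False,
    "timeout": "1h",
}
```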
### Step 7: Estimate Cost (Optional but Recommended)

1. Click the **💰 Estimate Cost** button
2. Wait for the AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:

- **Historical Data**: Based on previous runs of the same model in the leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit Job

1. Review all configurations
2. Click the **🚀 Submit Evaluation** button
3. Wait for the confirmation message
4. Copy the job ID for tracking

**Confirmation message includes**:

- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → gpu_l40s
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):

- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays the number of historical runs used

**MCP AI Analysis** (when no historical data):

- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes a detailed breakdown and recommendations

### Cost Factors

**For HuggingFace Jobs**:

1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)

**For Modal**:

1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)

### Cost Optimization Tips

**Use Auto Hardware Selection**:

- Automatically picks the cheapest suitable hardware for your model
- Avoids over-provisioning (e.g., an H200 for a 3B model)

**Choose the Right Infrastructure**:

- **If you have HF Pro**: Use HF Jobs (you're already paying the subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For the latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)

**Optimize Model Selection**:

- Smaller models (3B-7B) are roughly 10x cheaper to run than large models (70B)
- API models (e.g., GPT-4o-mini) are often cheaper than local 70B models

**Reduce Test Count**:

- Use the difficulty filter (`easy` only) for quick validation
- Test with a small dataset first, then scale up

**Parallel Workers**:

- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.74** | $1.56** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)
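If you script around these jobs rather than watching the dashboards, a simple polling loop over the job statuses listed in the next section might look like the sketch below. The `get_job_status` helper is a hypothetical stand-in, not a TraceMind or platform API:

```python
# Hypothetical polling loop over the job statuses described in Monitoring Jobs.
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def get_job_status(job_id: str) -> str:
    """Stub for illustration: replace with a real lookup against the
    HF Jobs dashboard or the Modal app for your job."""
    return "completed"

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    while True:
        status = get_job_status(job_id)
        print(f"{job_id}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)

print(wait_for_job("username/job_abc123"))
```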
---

## Monitoring Jobs

### HuggingFace Jobs

**Via HuggingFace Dashboard**:

1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click a job to view logs

**Job Statuses**:

- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped

### Modal

**Via Modal Dashboard**:

1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check the Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:

```
1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:

```
1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):

- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):

- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical

**Progress Indicators**:

1. ⏳ Job queued (0-2 minutes)
2. 🔄 Downloading model (5-15 minutes for first run)
3. 🧪 Running evaluation (10-30 minutes)
4. 📤 Uploading results to HuggingFace (1-2 minutes)
5. ✅ Complete

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if `output_format` = `hub`):

SMOLTRACE creates 4 datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in the TraceMind Leaderboard tab
   - Public, shared across all users
2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to the traces dataset
3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if `enable_otel` = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in TraceMind Trace Visualization
4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if `enable_gpu_metrics` = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test

**Local JSON Files** (if `output_format` = `json`):

- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
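Because the hub datasets above are ordinary HuggingFace datasets, you can also inspect them outside TraceMind. A minimal sketch using the real `datasets` API (the repo id is a placeholder following the naming pattern above; the column names depend on the SMOLTRACE schema):

```python
# Inspect a results dataset outside TraceMind with the 🤗 datasets library.
# The repo id below is a placeholder following the pattern
# {your_username}/agent-results-{model}-{timestamp}.
from datasets import load_dataset

results = load_dataset("your-username/agent-results-llama-3-1-8b-20250101", split="train")
print(results.column_names)  # per-test fields (schema defined by SMOLTRACE)
print(results[0])            # first test case result
```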
### Viewing Results in TraceMind

**Step 1: Refresh Leaderboard**

1. Go to the **📊 Leaderboard** tab
2. Click the **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**

1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if `enable_otel` = True)

1. From the run details, click on a test case
2. Click the **View Trace** button
3. See the OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if GPU job)

**Step 4: Ask Questions About Results**

1. Go to the **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |

**Common Patterns**:

**High accuracy, low cost**:

- ✅ Excellent model for production
- Examples: GPT-4o-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:

- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:

- ⚠️ May need prompt optimization or a better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:

- ❌ Poor choice; investigate or switch models
- May indicate configuration issues

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**

- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to the Settings tab
  2. Add an HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**

- **Cause**: HF Jobs requires a Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add a credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**

- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create a new token
  3. Copy the Token ID and Token Secret
  4. Add them to the Settings tab
  5. Try again

**Error: "Modal package not installed"**

- **Cause**: Modal SDK missing (should not happen in the hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`

### Job Execution Failures

**Job stuck in "Pending" status**

- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try the other infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**

- **Cause**: Model too large for the selected hardware
- **Fix**:
  - Use a larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1

**Job fails with "Model not found"**

- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check the model ID format: `organization/model-name`
  - For private models, add an HF token with access
  - Verify the model exists on the HuggingFace Hub

**Job fails with "API key not set"**

- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to the Settings tab
  2. Add the API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit the job again

**Job fails with "Rate limit exceeded"**

- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use a different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**

- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create a new token (the old one may be expired)
  3. Update the tokens in the Settings tab
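A quick way to anticipate "Out of Memory" failures before submitting: estimate the FP16 weight footprint (~2GB per 1B params, per the Hardware Selection Guide) and leave headroom for KV cache and overhead. A rough sketch:

```python
# Rough VRAM sanity check behind the "Out of Memory" guidance above.
# Rule of thumb from the Hardware Selection Guide: FP16 weights ≈ 2 GB per 1B params,
# with KV cache and inference overhead on top (~4-5 GB per 1B params total).
def fp16_weights_gb(params_b: float) -> float:
    return params_b * 2.0

for size_b in (3, 8, 70):
    print(f"{size_b}B model: ~{fp16_weights_gb(size_b):.0f} GB for weights alone")
# -> 3B ≈ 6 GB, 8B ≈ 16 GB, 70B ≈ 140 GB (why 70B needs an H200 or quantization)
```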
### Results Not Appearing

**Results not in the leaderboard after the job completes**

- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check the job logs for errors
  - Verify `output_format` was set to `hub`
  - Verify the HF token has "Write" permission
  - Manually refresh the leaderboard (click "Load Leaderboard")

**Traces not appearing**

- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run the evaluation with `enable_otel = True`
  - Check that the traces dataset exists on your HF profile

**GPU metrics not showing**

- **Cause**: GPU metrics not enabled, or it was a CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify the job used GPU hardware (not CPU)
  - Check that the metrics dataset exists

---

## Advanced Configuration

### Custom Test Datasets

**Create your own test dataset**:

1. Use the **🔬 Synthetic Data Generator** tab:
   - Configure domain and tools
   - Generate custom tasks
   - Push to the HuggingFace Hub
2. Use the generated dataset in an evaluation:
   - Set `dataset_name` to your dataset: `{username}/dataset-name`
   - Configure the agent with matching tools

**Dataset Format Requirements**:

```python
{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
```

### Environment Variables

**LLM Provider API Keys** (in Settings):

- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API

**Infrastructure Credentials**:

- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret

### Parallel Execution

**Use `parallel_workers` to speed up evaluation**:

- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)

**Trade-offs**:

- ✅ **Faster**: Near-linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage

**Recommendations**:

- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if the GPU has enough VRAM
- Production runs: Use 1 for reliability

### Private Datasets

**Make results private**:

1. Set `private = True` in the job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view them in the leaderboard (if using a private leaderboard dataset)

**Use cases**:

- Proprietary models
- Confidential evaluation data
- Internal benchmarking
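For example, a handful of tasks in the format above can be built and pushed with the standard `datasets` API (real library calls; the repo name is illustrative, and `private=True` matches the private-datasets option just described):

```python
# Build and push a custom test dataset in the format shown above.
# Real 🤗 datasets API; the repo id is illustrative.
from datasets import Dataset

tasks = [
    {
        "task_id": "task_001",
        "prompt": "What's the weather in Tokyo?",
        "expected_tool": "get_weather",
        "difficulty": "easy",
        "category": "tool_usage",
    },
]

Dataset.from_list(tasks).push_to_hub("your-username/my-smoltrace-tasks", private=True)
```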
---

## Quick Reference

### Job Submission Checklist

Before submitting a job, verify:

- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is `hub` for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size

### Common Model Configurations

**OpenAI GPT-4**:

```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Anthropic Claude-3.5-Sonnet**:

```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Meta Llama-3.1-8B**:

```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```

**Meta Llama-3.1-70B**:

```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```

**Qwen-2.5-Coder-32B**:

```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```

---

## Next Steps

After submitting your first job:

1. **Monitor progress** in the Job Monitoring tab
2. **View results** in the Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using the Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community

For more help:

- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)