# Job Submission Guide
This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.
## Table of Contents
- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
- [HuggingFace Jobs](#huggingface-jobs)
- [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)
---
## Overview
TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:
1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing
Both platforms:
- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ **Per-second billing** with no minimum duration
**Choose based on your needs**:
- **HuggingFace Jobs**: Best if you already have HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions
**Pricing Sources**:
- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)
---
## Infrastructure Options
### HuggingFace Jobs
**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.
**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges
**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):
- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)
*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*
**Pros**:
- Simple authentication (HuggingFace token)
- Integrated with HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure
**Cons**:
- Requires HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)
**When to use**:
- ✅ You already have HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure
### Modal
**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.
**Pricing Model**: Pay-per-second usage (no subscription required)
**Hardware Options**:
- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ~= $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ~= $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ~= $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ~= $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ~= $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ~= $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ~= $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ~= $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ~= $6.25/hr)
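The per-second figures above are just the hourly rates divided by 3600. A quick sketch to verify (rates hardcoded from the tables in this guide, not fetched from either platform):
```python
# Per-second rate = hourly rate / 3600 (rates copied from the tables above).
hourly_rates = {"gpu_t4": 0.59, "gpu_l40s": 1.95, "gpu_h100": 3.95}
for hardware, rate in hourly_rates.items():
    print(f"{hardware}: ${rate / 3600:.6f}/sec")
# gpu_t4: $0.000164/sec, gpu_l40s: $0.000542/sec, gpu_h100: $0.001097/sec
```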
**Pros**:
- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts
**Cons**:
- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with HF ecosystem
**When to use**:
- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have HF Pro subscription
- ✅ You want more GPU options and flexibility
---
## Prerequisites
### For Viewing Leaderboard (Free)
**Required**:
- HuggingFace account (free)
- HuggingFace token with **Read** permissions
**How to get**:
1. Go to https://huggingface.co/settings/tokens
2. Create new token with **Read** permission
3. Copy token (starts with `hf_...`)
4. Add to TraceMind Settings tab
### For Submitting Jobs to HuggingFace Jobs
**Required**:
1. **HuggingFace Pro** subscription ($9/month)
- Sign up at https://huggingface.co/pricing
- **Must add credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models
**How to setup**:
1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add credit card for compute charges
3. Create token with all permissions:
- Go to https://huggingface.co/settings/tokens
- Click "New token"
- Select: **Read**, **Write**, **Run Jobs**
- Copy token
4. Add API keys in TraceMind Settings:
- HuggingFace Token
- OpenAI API Key (if testing OpenAI models)
- Anthropic API Key (if testing Claude models)
- etc.
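Before pasting the token into Settings, you can verify it locally with `huggingface_hub` (a minimal sketch; the token string is a placeholder):
```python
# Sanity-check the token before adding it to TraceMind Settings.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder: paste your real token
print(api.whoami()["name"])  # raises an error if the token is invalid
```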
### For Submitting Jobs to Modal
**Required**:
1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models
**How to setup**:
1. Create Modal account:
- Go to https://modal.com
- Sign up (GitHub or email)
2. Create API token:
- Go to https://modal.com/settings/tokens
- Click "Create token"
- Copy **Token ID** (starts with `ak-...`)
- Copy **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
- Modal Token ID
- Modal Token Secret
- HuggingFace Token (Read + Write)
- LLM provider API keys
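If you run TraceMind locally, the Modal SDK reads the same credentials from environment variables; a quick presence check (the variable names are Modal's standard ones):
```python
import os

# Verify both Modal credentials are exported before submitting locally.
for var in ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```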
---
## Hardware Selection Guide
### Auto-Selection (Recommended)
Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:
- Model size (extracted from model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)
**Auto-selection logic**:
**For API Models** (provider = `litellm` or `inference`):
- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`
**For Local Models** (provider = `transformers`):
*Memory estimation for agentic workloads*:
- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: pushes the running total to ~1.5-2x the weights
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**
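This heuristic is easy to encode; a sketch of the arithmetic above (the function name and the 1.75x/25% midpoints are our assumptions):
```python
def estimate_vram_gb(params_billion: float) -> float:
    """Rough VRAM needed for agentic workloads, per the heuristic above."""
    weights = 2.0 * params_billion  # FP16 weights: ~2GB per 1B params
    with_kv = weights * 1.75        # KV cache pushes total to ~1.5-2x weights
    return with_kv * 1.25           # plus ~20-30% inference overhead

print(estimate_vram_gb(8))  # 35.0 GB -> ~4.4GB per 1B params, as stated above
```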
**HuggingFace Jobs**:
| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |
**Modal**:
| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
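The two tables translate directly into a lookup; an illustrative sketch (thresholds and names mirror this guide, not TraceMind's internal code):
```python
def auto_hardware(params_billion: float, infrastructure: str) -> str:
    """Pick hardware per the tables above. API models always get CPU instead."""
    if infrastructure == "hf-jobs":
        tiers = [(5, "t4-small"), (12, "a10g-large"), (float("inf"), "a100-large")]
    else:  # modal
        tiers = [(5, "gpu_t4"), (12, "gpu_l40s"), (48, "gpu_a100_80gb"),
                 (float("inf"), "gpu_h200")]
    return next(hw for max_b, hw in tiers if params_billion <= max_b)

print(auto_hardware(8, "modal"))     # gpu_l40s
print(auto_hardware(70, "hf-jobs"))  # a100-large
```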
### Manual Selection
If you know your model's requirements, you can manually select hardware:
**CPU Jobs** (API models like GPT-4, Claude):
- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`
**Small Models** (1B-5B params):
- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B
**Medium Models** (6B-12B params):
- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B
**Large Models** (13B-24B params):
- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B
**Very Large Models** (25B+ params):
- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B
**Cost vs Performance Trade-offs**:
- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models
---
## Submitting a Job
### Step 1: Navigate to New Evaluation Screen
1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form
### Step 2: Configure Infrastructure
**Infrastructure Provider**:
- Choose `HuggingFace Jobs` or `Modal`
**Hardware**:
- Use `auto` (recommended) or select specific hardware
- See [Hardware Selection Guide](#hardware-selection-guide)
### Step 3: Configure Model
**Model**:
- Enter model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`
**Provider**:
- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For HuggingFace Inference API
- `transformers` - For local models loaded with transformers
**HF Inference Provider** (optional):
- Leave empty unless using HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models
**HuggingFace Token** (optional):
- Leave empty if already configured in Settings
- Only needed for private models
### Step 4: Configure Agent
**Agent Type**:
- `tool` - Function calling agents only
- `code` - Code execution agents only
- `both` - Hybrid agents (recommended)
**Search Provider**:
- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires Serper API key
- `brave` - Requires Brave Search API key
**Enable Optional Tools**:
- Select additional tools for the agent:
- `google_search` - Google Search (requires API key)
- `duckduckgo_search` - DuckDuckGo Search
- `visit_webpage` - Web page scraping
- `python_interpreter` - Python code execution
- `wikipedia_search` - Wikipedia queries
- `user_input` - User interaction (not recommended for batch eval)
### Step 5: Configure Test Dataset
**Dataset Name**:
- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`
**Dataset Split**:
- Default: `train`
- Other options: `test`, `validation`
**Difficulty Filter**:
- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only
**Parallel Workers**:
- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and API rate limits
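You can preview the dataset and its difficulty mix locally before submitting; a short sketch using the `datasets` library (the `difficulty` field follows the format shown in [Advanced Configuration](#advanced-configuration)):
```python
# Requires: pip install datasets
from datasets import load_dataset

tasks = load_dataset("kshitijthakkar/smoltrace-tasks", split="train")
easy = tasks.filter(lambda row: row["difficulty"] == "easy")
print(f"{len(tasks)} tasks total, {len(easy)} easy")
```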
### Step 6: Configure Output & Monitoring
**Output Format**:
- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires output directory)
**Output Directory**:
- Only for `json` format
- Example: `./evaluation_results`
**Enable OpenTelemetry Tracing**:
- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization
**Enable GPU Metrics**:
- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs
**Private Datasets**:
- ☐ Make result datasets private on HuggingFace
- Default: Public datasets
**Debug Mode**:
- ☐ Enable verbose logging for troubleshooting
- Default: Off
**Quiet Mode**:
- ☐ Reduce output verbosity
- Default: Off
**Run ID** (optional):
- Auto-generated UUID if left empty
- Custom ID for tracking specific runs
**Job Timeout**:
- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- Job will be terminated if it exceeds timeout
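For reference, the timeout strings map to seconds like this (a hypothetical parser; TraceMind handles the parsing for you):
```python
def parse_timeout(timeout: str) -> int:
    """Convert strings like '30m' or '2h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(timeout[:-1]) * units[timeout[-1]]

print(parse_timeout("1h"), parse_timeout("30m"))  # 3600 1800
```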
### Step 7: Estimate Cost (Optional but Recommended)
1. Click **💰 Estimate Cost** button
2. Wait for AI-powered cost analysis
3. Review:
- Estimated total cost
- Estimated duration
- Hardware selection (if auto)
- Historical data (if available)
**Cost Estimation Sources**:
- **Historical Data**: Based on previous runs of the same model in leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)
### Step 8: Submit Job
1. Review all configurations
2. Click **🚀 Submit Evaluation** button
3. Wait for confirmation message
4. Copy job ID for tracking
**Confirmation message includes**:
- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions
### Example: Submit HuggingFace Jobs Evaluation
```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub
Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)
Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```
### Example: Submit Modal Evaluation
```
Infrastructure: Modal
Hardware: auto → gpu_l40s
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub
Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)
Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```
---
## Cost Estimation
### Understanding Cost Estimates
TraceMind provides AI-powered cost estimation before you submit jobs:
**Historical Data** (most accurate):
- Based on actual runs of the same model
- Shows average cost, duration from past evaluations
- Displays number of historical runs used
**MCP AI Analysis** (when no historical data):
- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes detailed breakdown and recommendations
### Cost Factors
**For HuggingFace Jobs**:
1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)
**For Modal**:
1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)
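Putting the factors together, a back-of-the-envelope total (all inputs are your own assumptions, not values fetched from either platform):
```python
def estimate_job_cost(rate_per_hour: float, duration_min: float,
                      llm_api_cost: float = 0.0) -> float:
    """Compute cost (per-second billing, no minimums) plus any LLM API spend."""
    return rate_per_hour * duration_min / 60 + llm_api_cost

# 8B local model on Modal's L40S ($1.95/hr) for ~20 minutes:
print(round(estimate_job_cost(1.95, 20), 2))  # 0.65
```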
### Cost Optimization Tips
**Use Auto Hardware Selection**:
- Automatically picks cheapest hardware for your model
- Avoids over-provisioning (e.g., H200 for 3B model)
**Choose Right Infrastructure**:
- **If you have HF Pro**: Use HF Jobs (already paying subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)
**Optimize Model Selection**:
- Smaller models (3B-7B) are 10x cheaper than large models (70B)
- API models (GPT-4-mini) often cheaper than local 70B models
**Reduce Test Count**:
- Use difficulty filter (`easy` only) for quick validation
- Test with small dataset first, then scale up
**Parallel Workers**:
- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)
**Example Cost Comparison**:
| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.88** | $1.88** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |
\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)
---
## Monitoring Jobs
### HuggingFace Jobs
**Via HuggingFace Dashboard**:
1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs
**Via TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click job to view logs
**Job Statuses**:
- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped
### Modal
**Via Modal Dashboard**:
1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics
**Via TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check Modal dashboard for details)
### Viewing Job Logs
**HuggingFace Jobs**:
```
1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE
```
**Modal**:
```
1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time
```
### Expected Job Duration
**API Models** (litellm provider):
- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits
**Local Models** (transformers provider):
- Model download: 5-15 minutes (one-time per job)
- 3B model: ~6GB download
- 8B model: ~16GB download
- 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical
**Progress Indicators**:
1. ⏳ Job queued (0-2 minutes)
2. 🔄 Downloading model (5-15 minutes for first run)
3. 🧪 Running evaluation (10-30 minutes)
4. 📤 Uploading results to HuggingFace (1-2 minutes)
5. ✅ Complete
---
## Understanding Job Results
### Where Results Are Stored
**HuggingFace Datasets** (if output_format = "hub"):
SMOLTRACE creates 4 datasets for each evaluation:
1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
- Aggregate statistics for the run
- Appears in TraceMind Leaderboard tab
- Public, shared across all users
2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
- Individual test case results
- Success/failure, execution time, tokens, cost
- Links to traces dataset
3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
- OpenTelemetry traces (if enable_otel = True)
- Detailed execution steps, LLM calls, tool usage
- Viewable in TraceMind Trace Visualization
4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
- GPU metrics (if enable_gpu_metrics = True)
- GPU utilization, memory, temperature, CO2 emissions
- Time-series data for each test
**Local JSON Files** (if output_format = "json"):
- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
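Once a run finishes, the hub datasets can be pulled down like any other; a sketch (the repo id is a placeholder following the pattern above, and the `success` column name is an assumption):
```python
from datasets import load_dataset

# Placeholder repo id: substitute your username, model, and timestamp.
results = load_dataset("your-username/agent-results-model-timestamp", split="train")
failures = results.filter(lambda row: not row["success"])  # assumed column name
print(f"{len(failures)} of {len(results)} tests failed")
```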
### Viewing Results in TraceMind
**Step 1: Refresh Leaderboard**
1. Go to **📊 Leaderboard** tab
2. Click **Load Leaderboard** button
3. Your new run appears in the table
**Step 2: View Run Details**
1. Click on your run in the leaderboard
2. See detailed test results:
- Individual test cases
- Success/failure breakdown
- Execution times
- Token usage
- Costs
**Step 3: Visualize Traces** (if enable_otel = True)
1. From run details, click on a test case
2. Click **View Trace** button
3. See OpenTelemetry waterfall diagram
4. Analyze:
- LLM calls and durations
- Tool executions
- Reasoning steps
- GPU metrics overlay (if GPU job)
**Step 4: Ask Questions About Results**
1. Go to **🤖 Agent Chat** tab
2. Ask questions like:
- "Analyze my latest evaluation run"
- "Why did test case 5 fail?"
- "Compare my run with the top model"
- "What was the cost breakdown?"
### Interpreting Results
**Key Metrics**:
| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |
**Common Patterns**:
**High accuracy, low cost**:
- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash
**High accuracy, high cost**:
- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro
**Low accuracy, low cost**:
- ⚠️ May need prompt optimization or better model
- Examples: Small local models (<3B params)
**Low accuracy, high cost**:
- ❌ Poor choice, investigate or switch models
- May indicate configuration issues
---
## Troubleshooting
### Job Submission Failures
**Error: "HuggingFace token not configured"**
- **Cause**: Missing or invalid HF token
- **Fix**:
1. Go to Settings tab
2. Add HF token with "Read + Write + Run Jobs" permissions
3. Click "Save API Keys"
**Error: "HuggingFace Pro subscription required"**
- **Cause**: HF Jobs requires Pro subscription
- **Fix**:
1. Subscribe at https://huggingface.co/pricing ($9/month)
2. Add credit card for GPU charges
3. Try again
**Error: "Modal credentials not configured"**
- **Cause**: Missing Modal API tokens
- **Fix**:
1. Go to https://modal.com/settings/tokens
2. Create new token
3. Copy Token ID and Token Secret
4. Add to Settings tab
5. Try again
**Error: "Modal package not installed"**
- **Cause**: Modal SDK missing (should not happen in hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`
### Job Execution Failures
**Job stuck in "Pending" status**
- **Cause**: High demand for GPU resources
- **Fix**:
- Wait 5-10 minutes
- Try different hardware (e.g., T4 instead of A100)
- Try different infrastructure (Modal vs HF Jobs)
**Job fails with "Out of Memory"**
- **Cause**: Model too large for selected hardware
- **Fix**:
- Use larger GPU (A100-80GB or H200)
- Or use `auto` hardware selection
- Or reduce `parallel_workers` to 1
**Job fails with "Model not found"**
- **Cause**: Invalid model ID or private model
- **Fix**:
- Check model ID format: `organization/model-name`
- For private models, add HF token with access
- Verify model exists on HuggingFace Hub
**Job fails with "API key not set"**
- **Cause**: Missing LLM provider API key
- **Fix**:
1. Go to Settings tab
2. Add API key for your provider (OpenAI, Anthropic, etc.)
3. Submit job again
**Job fails with "Rate limit exceeded"**
- **Cause**: Too many API requests
- **Fix**:
- Reduce `parallel_workers` to 1
- Use different model with higher rate limits
- Wait and retry later
**Modal job fails with "Authentication failed"**
- **Cause**: Invalid Modal tokens
- **Fix**:
1. Go to https://modal.com/settings/tokens
2. Create new token (old one may be expired)
3. Update tokens in Settings tab
### Results Not Appearing
**Results not in leaderboard after job completes**
- **Cause**: Dataset upload failed or not configured
- **Fix**:
- Check job logs for errors
- Verify `output_format` was set to "hub"
- Verify HF token has "Write" permission
- Manually refresh leaderboard (click "Load Leaderboard")
**Traces not appearing**
- **Cause**: OpenTelemetry not enabled
- **Fix**:
- Re-run evaluation with `enable_otel = True`
- Check traces dataset exists on your HF profile
**GPU metrics not showing**
- **Cause**: GPU metrics not enabled or CPU job
- **Fix**:
- Re-run with `enable_gpu_metrics = True`
- Verify job used GPU hardware (not CPU)
- Check metrics dataset exists
---
## Advanced Configuration
### Custom Test Datasets
**Create your own test dataset**:
1. Use **🔬 Synthetic Data Generator** tab:
- Configure domain and tools
- Generate custom tasks
- Push to HuggingFace Hub
2. Use generated dataset in evaluation:
- Set `dataset_name` to your dataset: `{username}/dataset-name`
- Configure agent with matching tools
**Dataset Format Requirements**:
```python
{
"task_id": "task_001",
"prompt": "What's the weather in Tokyo?",
"expected_tool": "get_weather",
"difficulty": "easy",
"category": "tool_usage"
}
```
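One way to build and publish a dataset in this format (a sketch assuming the `datasets` library and an HF token with Write permission; the repo id is a placeholder):
```python
from datasets import Dataset

tasks = [{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage",
}]
Dataset.from_list(tasks).push_to_hub("your-username/my-agent-tasks")
```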
### Environment Variables
**LLM Provider API Keys** (in Settings):
- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API
**Infrastructure Credentials**:
- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret
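A quick way to confirm the variables your run needs are present (names from the lists above; adjust `required` to your provider):
```python
import os

required = ["HF_TOKEN", "OPENAI_API_KEY"]  # adjust for your provider
missing = [name for name in required if not os.environ.get(name)]
print("missing:", missing or "none")
```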
### Parallel Execution
**Use `parallel_workers` to speed up evaluation**:
- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)
**Trade-offs**:
- ✅ **Faster**: Linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage
**Recommendations**:
- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
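Conceptually, `parallel_workers` fans test cases out to a worker pool instead of running them one at a time; an illustrative sketch (not TraceMind's actual implementation):
```python
from concurrent.futures import ThreadPoolExecutor

def run_test(task_id: str) -> dict:
    # placeholder for executing one SMOLTRACE test case
    return {"task_id": task_id, "success": True}

task_ids = ["task_001", "task_002", "task_003", "task_004"]
with ThreadPoolExecutor(max_workers=4) as pool:  # parallel_workers = 4
    results = list(pool.map(run_test, task_ids))
```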
### Private Datasets
**Make results private**:
1. Set `private = True` in job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view in leaderboard (if using private leaderboard dataset)
**Use cases**:
- Proprietary models
- Confidential evaluation data
- Internal benchmarking
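If a result dataset was published public by mistake, its visibility can be flipped afterward with `huggingface_hub` (a sketch; the repo id is a placeholder, and newer library versions may prefer `update_repo_settings`):
```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # needs Write permission
api.update_repo_visibility("your-username/agent-results-model-timestamp",
                           private=True, repo_type="dataset")
```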
---
## Quick Reference
### Job Submission Checklist
Before submitting a job, verify:
- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is "hub" for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size
### Common Model Configurations
**OpenAI GPT-4**:
```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```
**Anthropic Claude-3.5-Sonnet**:
```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```
**Meta Llama-3.1-8B**:
```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```
**Meta Llama-3.1-70B**:
```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```
**Qwen-2.5-Coder-32B**:
```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```
---
## Next Steps
After submitting your first job:
1. **Monitor progress** in Job Monitoring tab
2. **View results** in Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community
For more help:
- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)