# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:

- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ Bill **per-second** with no minimum duration
**Choose based on your needs**:

- **HuggingFace Jobs**: Best if you already have HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:

- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):

- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*

**Pros**:

- Simple authentication (HuggingFace token)
- Integrated with HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:

- Requires HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)

**When to use**:

- ✅ You already have HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:

- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)

**Pros**:

- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:

- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with HF ecosystem

**When to use**:

- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing Leaderboard (Free)

**Required**:

- HuggingFace account (free)
- HuggingFace token with **Read** permissions
**How to get**:

1. Go to https://huggingface.co/settings/tokens
2. Create new token with **Read** permission
3. Copy token (starts with `hf_...`)
4. Add to TraceMind Settings tab (you can verify the token first with the sketch below)
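
If you want to confirm the token works before pasting it into Settings, a quick check with the official `huggingface_hub` client looks like this (the token value is a placeholder):

```python
# Sanity-check that an HF token is valid before adding it to TraceMind.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder - paste your actual token
user = api.whoami()          # raises an error if the token is invalid
print(f"Token OK - authenticated as: {user['name']}")
```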
### For Submitting Jobs to HuggingFace Jobs

**Required**:

1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to setup**:

1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add credit card for compute charges
3. Create token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:

1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to setup**:

1. Create Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy **Token ID** (starts with `ak-...`)
   - Copy **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys

---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:

- Model size (extracted from model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)
**Auto-selection logic** (illustrated in the sketch after the tables below):
**For API Models** (provider = `litellm` or `inference`):

- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:

- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
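
As an illustration, the memory heuristic and the two tables above reduce to a simple lookup. This is a sketch of the documented selection logic, not TraceMind's actual implementation:

```python
# Illustrative sketch of the auto-selection heuristic described above -
# not TraceMind's actual code. Thresholds mirror the two tables.

def estimate_vram_gb(params_billions: float) -> float:
    """~4-5GB per 1B params (weights + KV cache + overhead); midpoint used."""
    return params_billions * 4.5

def select_hardware(params_billions: float, infra: str, provider: str) -> str:
    if provider in ("litellm", "inference"):
        return "cpu-basic" if infra == "hf_jobs" else "cpu"  # API models: CPU only
    if infra == "hf_jobs":
        if params_billions <= 5:
            return "t4-small"    # 16GB VRAM
        if params_billions <= 12:
            return "a10g-large"  # 24GB VRAM
        return "a100-large"      # 80GB VRAM
    # Modal tiers
    if params_billions <= 5:
        return "gpu_t4"          # 16GB VRAM
    if params_billions <= 12:
        return "gpu_l40s"        # 48GB VRAM
    if params_billions <= 48:
        return "gpu_a100_80gb"   # 80GB VRAM
    return "gpu_h200"            # 141GB VRAM

print(estimate_vram_gb(8.0))                          # -> 36.0 GB (fits L40S 48GB)
print(select_hardware(8.0, "modal", "transformers"))  # -> gpu_l40s
```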
### Manual Selection

If you know your model's requirements, you can manually select hardware:

**CPU Jobs** (API models like GPT-4, Claude):

- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):

- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):

- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):

- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):

- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:

- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models

---

## Submitting a Job

### Step 1: Navigate to New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:

- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:

- Use `auto` (recommended) or select specific hardware
- See [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:

- Enter model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:

- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):

- Leave empty unless using HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):

- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:

- `tool` - Function calling agents only
- `code` - Code execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:

- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires Serper API key
- `brave` - Requires Brave Search API key

**Enable Optional Tools**:

- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:

- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:

- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:

- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:

- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Higher values increase memory usage and the risk of hitting API rate limits
### Step 6: Configure Output & Monitoring

**Output Format**:

- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires output directory)

**Output Directory**:

- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:

- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization

**Enable GPU Metrics**:

- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs

**Private Datasets**:

- ☐ Make result datasets private on HuggingFace
- Default: Public datasets

**Debug Mode**:

- ☐ Enable verbose logging for troubleshooting
- Default: Off

**Quiet Mode**:

- ☐ Reduce output verbosity
- Default: Off

**Run ID** (optional):

- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:

- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- Job will be terminated if it exceeds timeout
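
Taken together, Steps 2-6 boil down to a flat set of fields. The dict below is purely illustrative - a one-place summary of the form options, with hypothetical field names rather than TraceMind's actual parameter names:

```python
# Hypothetical summary of the Step 2-6 form fields (illustrative only;
# the real parameter names used by TraceMind/SMOLTRACE may differ).
job_config = {
    "infrastructure": "modal",           # or "hf_jobs"
    "hardware": "auto",                  # or a specific option, e.g. "gpu_l40s"
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "transformers",          # litellm | inference | transformers
    "agent_type": "both",                # tool | code | both
    "search_provider": "duckduckgo",
    "dataset_name": "kshitijthakkar/smoltrace-tasks",
    "split": "train",
    "difficulty": "all",
    "parallel_workers": 1,
    "output_format": "hub",
    "enable_otel": True,
    "enable_gpu_metrics": True,
    "private": False,
    "timeout": "1h",
}
```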
### Step 7: Estimate Cost (Optional but Recommended)

1. Click **💰 Estimate Cost** button
2. Wait for AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:

- **Historical Data**: Based on previous runs of the same model in leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit Job

1. Review all configurations
2. Click **🚀 Submit Evaluation** button
3. Wait for confirmation message
4. Copy job ID for tracking

**Confirmation message includes**:

- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):

- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays number of historical runs used

**MCP AI Analysis** (when no historical data):

- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes detailed breakdown and recommendations

### Cost Factors

**For HuggingFace Jobs**:

1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)

**For Modal**:

1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)
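
Because both platforms bill per second, the hardware portion of a job's cost is simply rate × runtime. A back-of-the-envelope helper using the per-second rates listed in [Infrastructure Options](#infrastructure-options) (LLM API costs are extra and not included):

```python
# Hardware cost = per-second rate x runtime. Rates copied from the
# Infrastructure Options tables; LLM API costs are not included.
RATES_PER_SEC = {
    "t4-small": 0.000111, "a10g-large": 0.000417, "a100-large": 0.000694,  # HF Jobs
    "gpu_t4": 0.000164, "gpu_l40s": 0.000542, "gpu_h200": 0.001261,        # Modal
}

def hardware_cost(hardware: str, minutes: float) -> float:
    return RATES_PER_SEC[hardware] * minutes * 60

print(f"${hardware_cost('a10g-large', 25):.2f}")  # 25 min on a10g-large -> $0.63
print(f"${hardware_cost('gpu_l40s', 20):.2f}")    # 20 min on gpu_l40s  -> $0.65
```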
### Cost Optimization Tips

**Use Auto Hardware Selection**:

- Automatically picks cheapest hardware for your model
- Avoids over-provisioning (e.g., H200 for 3B model)

**Choose Right Infrastructure**:

- **If you have HF Pro**: Use HF Jobs (already paying subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)

**Optimize Model Selection**:

- Smaller models (3B-7B) are 10x cheaper than large models (70B)
- API models (GPT-4-mini) often cheaper than local 70B models

**Reduce Test Count**:

- Use difficulty filter (`easy` only) for quick validation
- Test with small dataset first, then scale up

**Parallel Workers**:

- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.87** | $1.87** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)

---

## Monitoring Jobs

### HuggingFace Jobs

**Via HuggingFace Dashboard**:

1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click job to view logs

**Job Statuses**:

- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped

### Modal

**Via Modal Dashboard**:

1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:

```
1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:

```
1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):

- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):

- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical

**Progress Indicators**:

1. ⏳ Job queued (0-2 minutes)
2. 🔄 Downloading model (5-15 minutes for first run)
3. 🧪 Running evaluation (10-30 minutes)
4. 📤 Uploading results to HuggingFace (1-2 minutes)
5. ✅ Complete

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if output_format = "hub"):

SMOLTRACE creates 4 datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in TraceMind Leaderboard tab
   - Public, shared across all users
2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to traces dataset
3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if enable_otel = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in TraceMind Trace Visualization
4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if enable_gpu_metrics = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test

**Local JSON Files** (if output_format = "json"):

- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
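
You can also inspect these datasets outside TraceMind with the `datasets` library. The repo name below is a placeholder following the `{your_username}/agent-results-{model}-{timestamp}` pattern:

```python
# Load a results dataset for ad-hoc inspection (repo name is a placeholder
# following the {your_username}/agent-results-{model}-{timestamp} pattern).
# Requires: pip install datasets
from datasets import load_dataset

results = load_dataset("your-username/agent-results-llama-3-1-8b-20250101", split="train")
print(results.column_names)  # available fields
print(results[0])            # first test-case record
```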
### Viewing Results in TraceMind

**Step 1: Refresh Leaderboard**

1. Go to **📊 Leaderboard** tab
2. Click **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**

1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if enable_otel = True)

1. From run details, click on a test case
2. Click **View Trace** button
3. See OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if GPU job)

**Step 4: Ask Questions About Results**

1. Go to **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |
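
To make the connection between these aggregate metrics and the per-test records concrete, here is a small sketch computing success rate and average duration from result rows (the field names are illustrative assumptions, not the dataset's guaranteed schema):

```python
# Aggregate key metrics from per-test result rows. Field names here are
# illustrative assumptions, not the guaranteed results-dataset schema.
rows = [
    {"success": True,  "duration_s": 3.2, "cost_usd": 0.004},
    {"success": True,  "duration_s": 4.8, "cost_usd": 0.006},
    {"success": False, "duration_s": 9.1, "cost_usd": 0.011},
]

success_rate = 100 * sum(r["success"] for r in rows) / len(rows)
avg_duration = sum(r["duration_s"] for r in rows) / len(rows)
total_cost = sum(r["cost_usd"] for r in rows)
print(f"Success rate: {success_rate:.0f}%, avg duration: {avg_duration:.1f}s, total cost: ${total_cost:.3f}")
```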
**Common Patterns**:

**High accuracy, low cost**:

- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:

- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:

- ⚠️ May need prompt optimization or better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:

- ❌ Poor choice, investigate or switch models
- May indicate configuration issues

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**

- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to Settings tab
  2. Add HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**

- **Cause**: HF Jobs requires Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**

- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token
  3. Copy Token ID and Token Secret
  4. Add to Settings tab
  5. Try again

**Error: "Modal package not installed"**

- **Cause**: Modal SDK missing (should not happen in hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`

### Job Execution Failures

**Job stuck in "Pending" status**

- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try different infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**

- **Cause**: Model too large for selected hardware
- **Fix**:
  - Use larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1

**Job fails with "Model not found"**

- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check model ID format: `organization/model-name`
  - For private models, add HF token with access
  - Verify model exists on HuggingFace Hub

**Job fails with "API key not set"**

- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to Settings tab
  2. Add API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit job again

**Job fails with "Rate limit exceeded"**

- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**

- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token (old one may be expired)
  3. Update tokens in Settings tab

### Results Not Appearing

**Results not in leaderboard after job completes**

- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check job logs for errors
  - Verify `output_format` was set to "hub"
  - Verify HF token has "Write" permission
  - Manually refresh leaderboard (click "Load Leaderboard")

**Traces not appearing**

- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run evaluation with `enable_otel = True`
  - Check traces dataset exists on your HF profile

**GPU metrics not showing**

- **Cause**: GPU metrics not enabled or CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify job used GPU hardware (not CPU)
  - Check metrics dataset exists

---

## Advanced Configuration

### Custom Test Datasets

**Create your own test dataset**:

1. Use **🔬 Synthetic Data Generator** tab:
   - Configure domain and tools
   - Generate custom tasks
   - Push to HuggingFace Hub
2. Use generated dataset in evaluation:
   - Set `dataset_name` to your dataset: `{username}/dataset-name`
   - Configure agent with matching tools

**Dataset Format Requirements**:

```python
{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
```
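
If you prefer to build a dataset in this format programmatically instead of using the Synthetic Data Generator, something like the following works (the repo name is a placeholder):

```python
# Build a SMOLTRACE-style task dataset and push it to the Hub.
# Requires: pip install datasets. The repo name is a placeholder.
from datasets import Dataset

tasks = [
    {
        "task_id": "task_001",
        "prompt": "What's the weather in Tokyo?",
        "expected_tool": "get_weather",
        "difficulty": "easy",
        "category": "tool_usage",
    },
]

Dataset.from_list(tasks).push_to_hub("your-username/my-agent-tasks")
```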
### Environment Variables

**LLM Provider API Keys** (in Settings):

- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API

**Infrastructure Credentials**:

- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret
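
If you run TraceMind locally rather than in the hosted Space, the same credentials can be supplied as ordinary environment variables before launch, for example (all values are placeholders):

```python
# Supply credentials via environment variables for a local run
# (all values below are placeholders).
import os

os.environ["HF_TOKEN"] = "hf_..."
os.environ["MODAL_TOKEN_ID"] = "ak-..."
os.environ["MODAL_TOKEN_SECRET"] = "as-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
```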
### Parallel Execution

**Use `parallel_workers` to speed up evaluation**:

- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)

**Trade-offs**:

- ✅ **Faster**: Linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage

**Recommendations**:

- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
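
Conceptually, `parallel_workers` behaves like a bounded worker pool over test cases. A minimal sketch of the pattern (not SMOLTRACE's actual implementation):

```python
# Minimal sketch of the parallel_workers pattern - a bounded thread pool
# over test cases. Not SMOLTRACE's actual implementation.
from concurrent.futures import ThreadPoolExecutor

def run_test(task: dict) -> dict:
    # Placeholder: call the agent on one task and score the result.
    return {"task_id": task["task_id"], "success": True}

tasks = [{"task_id": f"task_{i:03d}"} for i in range(100)]
parallel_workers = 4  # 1 = sequential (safest); 2-10 = faster but riskier

with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
    results = list(pool.map(run_test, tasks))

print(f"Completed {len(results)} tests")
```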
### Private Datasets

**Make results private**:

1. Set `private = True` in job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view in leaderboard (if using private leaderboard dataset)

**Use cases**:

- Proprietary models
- Confidential evaluation data
- Internal benchmarking

---

## Quick Reference

### Job Submission Checklist

Before submitting a job, verify:

- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is "hub" for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size

### Common Model Configurations

**OpenAI GPT-4**:

```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Anthropic Claude-3.5-Sonnet**:

```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Meta Llama-3.1-8B**:

```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```

**Meta Llama-3.1-70B**:

```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```

**Qwen-2.5-Coder-32B**:

```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```

---

## Next Steps

After submitting your first job:

1. **Monitor progress** in Job Monitoring tab
2. **View results** in Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community

For more help:

- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)