Job Submission Guide
This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.
Table of Contents
- Overview
- Infrastructure Options
- Prerequisites
- Hardware Selection Guide
- Submitting a Job
- Cost Estimation
- Monitoring Jobs
- Understanding Job Results
- Troubleshooting
- Advanced Configuration
Overview
TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:
- HuggingFace Jobs - Managed compute with GPU/CPU options
- Modal - Serverless compute with pay-per-second billing
Both platforms:
- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ Per-second billing with no minimum duration
Choose based on your needs:
- HuggingFace Jobs: Best if you already have HF Pro subscription ($9/month)
- Modal: Best if you need H200/H100 GPUs or want to avoid subscriptions
Infrastructure Options
HuggingFace Jobs
What it is: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.
Pricing Model: Subscription-based ($9/month HF Pro) + per-second GPU charges
Hardware Options (pricing from HF Spaces GPU pricing):
- cpu-basic - 2 vCPU, 16GB RAM (Free with Pro)
- cpu-upgrade - 8 vCPU, 32GB RAM (Free with Pro)
- t4-small - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- t4-medium - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- l4x1 - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- l4x4 - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- a10g-small - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- a10g-large - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- a10g-largex2 - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- a10g-largex4 - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- a100-large - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- v5e-1x1 - Google Cloud TPU v5e (pricing TBD)
- v5e-2x2 - Google Cloud TPU v5e (pricing TBD)
- v5e-2x4 - Google Cloud TPU v5e (pricing TBD)
Note: Jobs billing is per-second with no minimum. You only pay for actual compute time used.
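As a sanity check on the math, the per-second figures above are just the hourly rate divided by 3600, and a run's cost is that rate times actual runtime. A minimal sketch (rates copied from the tables in this guide):

```python
# Rough job-cost arithmetic under per-second billing.
# Rates ($/hr) are copied from the hardware tables in this guide.
HOURLY_RATES = {
    "t4-small": 0.40,
    "a10g-large": 1.50,
    "a100-large": 2.50,
    "gpu_l40s": 1.95,
    "gpu_h200": 4.54,
}

def job_cost(hardware: str, runtime_seconds: float) -> float:
    """Per-second billing: (hourly rate / 3600) * actual runtime."""
    return HOURLY_RATES[hardware] / 3600 * runtime_seconds

# A 25-minute evaluation on a10g-large:
print(f"${job_cost('a10g-large', 25 * 60):.3f}")  # $0.625
```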
Pros:
- Simple authentication (HuggingFace token)
- Integrated with HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure
Cons:
- Requires HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)
When to use:
- ✅ You already have HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure
Modal
What it is: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.
Pricing Model: Pay-per-second usage (no subscription required)
Hardware Options:
- cpu - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- gpu_t4 - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- gpu_l4 - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- gpu_a10 - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- gpu_l40s - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- gpu_a100 - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- gpu_a100_80gb - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- gpu_h100 - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- gpu_h200 - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- gpu_b200 - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)
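For the curious, here is a rough sketch of how these tiers map onto Modal's Python SDK GPU selector strings. TraceMind submits Modal jobs for you, so you never write this yourself; the app and function names below are made up for illustration:

```python
import modal

app = modal.App("smoltrace-eval-example")  # hypothetical app name

# gpu_l40s above corresponds to Modal's "L40S" GPU selector string.
@app.function(gpu="L40S", timeout=60 * 60)
def run_eval(model_id: str) -> None:
    # ... model loading and the SMOLTRACE evaluation would happen here ...
    print(f"evaluating {model_id}")

@app.local_entrypoint()
def main():
    run_eval.remote("meta-llama/Llama-3.1-8B-Instruct")
```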
Pros:
- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts
Cons:
- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with HF ecosystem
When to use:
- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have HF Pro subscription
- ✅ You want more GPU options and flexibility
Prerequisites
For Viewing Leaderboard (Free)
Required:
- HuggingFace account (free)
- HuggingFace token with Read permissions
How to get:
- Go to https://huggingface.co/settings/tokens
- Create new token with Read permission
- Copy token (starts with hf_...)
- Add to TraceMind Settings tab
For Submitting Jobs to HuggingFace Jobs
Required:
- HuggingFace Pro subscription ($9/month)
- Sign up at https://huggingface.co/pricing
- Must add credit card for GPU compute charges
- HuggingFace token with Read + Write + Run Jobs permissions
- LLM provider API keys (OpenAI, Anthropic, etc.) for API models
How to setup:
- Subscribe to HF Pro: https://huggingface.co/pricing
- Add credit card for compute charges
- Create token with all permissions:
- Go to https://huggingface.co/settings/tokens
- Click "New token"
- Select: Read, Write, Run Jobs
- Copy token
- Add API keys in TraceMind Settings:
- HuggingFace Token
- OpenAI API Key (if testing OpenAI models)
- Anthropic API Key (if testing Claude models)
- etc.
For Submitting Jobs to Modal
Required:
- Modal account (free to create, pay-per-use)
- Modal API token (Token ID + Token Secret)
- HuggingFace token with Read + Write permissions
- LLM provider API keys (OpenAI, Anthropic, etc.) for API models
How to setup:
- Create Modal account:
- Go to https://modal.com
- Sign up (GitHub or email)
- Create API token:
- Go to https://modal.com/settings/tokens
- Click "Create token"
- Copy Token ID (starts with ak-...)
- Copy Token Secret (starts with as-...)
- Add credentials in TraceMind Settings:
- Modal Token ID
- Modal Token Secret
- HuggingFace Token (Read + Write)
- LLM provider API keys
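If you run TraceMind locally rather than in the hosted Space, the same credentials can be supplied as environment variables (the names match the Environment Variables section later in this guide); the values below are placeholders:

```python
import os

# Placeholder values; real tokens come from the pages linked above.
os.environ["HF_TOKEN"] = "hf_..."             # Read + Write
os.environ["MODAL_TOKEN_ID"] = "ak-..."       # from modal.com/settings/tokens
os.environ["MODAL_TOKEN_SECRET"] = "as-..."
os.environ["OPENAI_API_KEY"] = "sk-..."       # only if evaluating API models
```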
Hardware Selection Guide
Auto-Selection (Recommended)
Set hardware to auto to let TraceMind automatically select the optimal hardware based on:
- Model size (extracted from model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)
Auto-selection logic:
For API Models (provider = litellm or inference):
- Always uses CPU (no GPU needed)
- HF Jobs: cpu-basic
- Modal: cpu
For Local Models (provider = transformers):
Memory estimation for agentic workloads:
- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- Total: ~4-5GB per 1B params for safe execution
HuggingFace Jobs:
| Model Size | Hardware | VRAM | Example Models |
|---|---|---|---|
| < 1B | t4-small | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | t4-small | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | a10g-large | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | a100-large | 80GB | Llama-3.1-70B, Qwen-14B |
Modal:
| Model Size | Hardware | VRAM | Example Models |
|---|---|---|---|
| < 1B | gpu_t4 | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | gpu_t4 | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | gpu_l40s | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | gpu_a100_80gb | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | gpu_a100_80gb | 80GB | Gemma-27B, Yi-34B |
| 49B+ | gpu_h200 | 141GB | Llama-3.1-70B, Qwen-72B |
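To make the auto-selection logic concrete, here is a simplified sketch: parse the parameter count from the model name, estimate VRAM at ~4.5GB per 1B params, and map the size onto the Modal table above. This is an approximation for illustration, not TraceMind's exact implementation:

```python
import re

def params_billions(model_id: str) -> float:
    """Pull a parameter count like '8B' or '0.5B' out of a model name."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*[bB]\b", model_id)
    return float(match.group(1)) if match else 0.0

def estimated_vram_gb(model_id: str) -> float:
    """~4-5GB per 1B params (weights + KV cache + overhead),
    per the memory-estimation rule of thumb above."""
    return params_billions(model_id) * 4.5

def pick_modal_gpu(model_id: str) -> str:
    """Thresholds follow the Modal table above."""
    b = params_billions(model_id)
    if b <= 5:
        return "gpu_t4"
    if b <= 12:
        return "gpu_l40s"
    if b <= 48:
        return "gpu_a100_80gb"
    return "gpu_h200"

model = "meta-llama/Llama-3.1-8B-Instruct"
print(estimated_vram_gb(model), pick_modal_gpu(model))  # 36.0 gpu_l40s
```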
Manual Selection
If you know your model's requirements, you can manually select hardware:
CPU Jobs (API models like GPT-4, Claude):
- HF Jobs: cpu-basic or cpu-upgrade
- Modal: cpu
Small Models (1B-5B params):
- HF Jobs: t4-small (16GB VRAM)
- Modal: gpu_t4 (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B
Medium Models (6B-12B params):
- HF Jobs: a10g-small or a10g-large (24GB VRAM)
- Modal: gpu_l40s (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B
Large Models (13B-24B params):
- HF Jobs: a100-large (80GB VRAM)
- Modal: gpu_a100_80gb (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B
Very Large Models (25B+ params):
- HF Jobs: a100-large (80GB VRAM) - may need quantization
- Modal: gpu_h200 (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B
Cost vs Performance Trade-offs:
- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models
Submitting a Job
Step 1: Navigate to New Evaluation Screen
- Open TraceMind-AI
- Click New Evaluation in the sidebar
- You'll see a comprehensive configuration form
Step 2: Configure Infrastructure
Infrastructure Provider:
- Choose HuggingFace Jobs or Modal
Hardware:
- Use auto (recommended) or select specific hardware
- See the Hardware Selection Guide
Step 3: Configure Model
Model:
- Enter model ID (e.g., openai/gpt-4, meta-llama/Llama-3.1-8B-Instruct)
- Use HuggingFace format: organization/model-name
Provider:
- litellm - For API models (OpenAI, Anthropic, etc.)
- inference - For HuggingFace Inference API
- transformers - For local models loaded with transformers
HF Inference Provider (optional):
- Leave empty unless using HF Inference API
- Example: openai-community/gpt2 for HF-hosted models
HuggingFace Token (optional):
- Leave empty if already configured in Settings
- Only needed for private models
Step 4: Configure Agent
Agent Type:
- tool - Function calling agents only
- code - Code execution agents only
- both - Hybrid agents (recommended)
Search Provider:
- duckduckgo - Free, no API key required (recommended)
- serper - Requires Serper API key
- brave - Requires Brave Search API key
Enable Optional Tools:
- Select additional tools for the agent:
- google_search - Google Search (requires API key)
- duckduckgo_search - DuckDuckGo Search
- visit_webpage - Web page scraping
- python_interpreter - Python code execution
- wikipedia_search - Wikipedia queries
- user_input - User interaction (not recommended for batch eval)
Step 5: Configure Test Dataset
Dataset Name:
- Default: kshitijthakkar/smoltrace-tasks
- Or use your own HuggingFace dataset
- Format: username/dataset-name
Dataset Split:
- Default: train
- Other options: test, validation
Difficulty Filter:
- all - All difficulty levels (recommended)
- easy - Easy tasks only
- medium - Medium tasks only
- hard - Hard tasks only
Parallel Workers:
- Default: 1 (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and pressure on API rate limits
Step 6: Configure Output & Monitoring
Output Format:
- hub - Push to HuggingFace datasets (recommended)
- json - Save locally (requires output directory)
Output Directory:
- Only for json format
- Example: ./evaluation_results
Enable OpenTelemetry Tracing:
- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization
Enable GPU Metrics:
- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs
Private Datasets:
- ✅ Make result datasets private on HuggingFace
- Default: Public datasets
Debug Mode:
- ✅ Enable verbose logging for troubleshooting
- Default: Off
Quiet Mode:
- ✅ Reduce output verbosity
- Default: Off
Run ID (optional):
- Auto-generated UUID if left empty
- Custom ID for tracking specific runs
Job Timeout:
- Default: 1h (1 hour)
- Other examples: 30m, 2h, 3h
- The job is terminated if it exceeds the timeout
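The timeout string is a number plus a unit suffix; a tiny parser sketch (format inferred from the examples above):

```python
def timeout_to_seconds(timeout: str) -> int:
    """Parse timeout strings like '30m', '1h', '2h'."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(timeout[:-1]) * units[timeout[-1]]

print(timeout_to_seconds("1h"), timeout_to_seconds("30m"))  # 3600 1800
```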
Step 7: Estimate Cost (Optional but Recommended)
- Click the Estimate Cost button
- Wait for AI-powered cost analysis
- Review:
- Estimated total cost
- Estimated duration
- Hardware selection (if auto)
- Historical data (if available)
Cost Estimation Sources:
- Historical Data: Based on previous runs of the same model in leaderboard
- MCP AI Analysis: AI-powered estimation using Gemini 2.5 Flash (if no historical data)
Step 8: Submit Job
- Review all configurations
- Click the Submit Evaluation button
- Wait for confirmation message
- Copy job ID for tracking
Confirmation message includes:
- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions
Example: Submit HuggingFace Jobs Evaluation
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub
Click "Estimate Cost":
β Estimated Cost: $1.25
β Duration: 25 minutes
β Hardware: a10g-large (auto-selected)
Click "Submit Evaluation":
β β
Job submitted successfully!
β HF Job ID: username/job_abc123
β Monitor at: https://huggingface.co/jobs
Example: Submit Modal Evaluation
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub
Click "Estimate Cost":
β Estimated Cost: $0.95
β Duration: 20 minutes
β Hardware: gpu_l40s (auto-selected)
Click "Submit Evaluation":
β β
Job submitted successfully!
β Modal Call ID: modal-job_xyz789
β Monitor at: https://modal.com/apps
Cost Estimation
Understanding Cost Estimates
TraceMind provides AI-powered cost estimation before you submit jobs:
Historical Data (most accurate):
- Based on actual runs of the same model
- Shows average cost, duration from past evaluations
- Displays number of historical runs used
MCP AI Analysis (when no historical data):
- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes detailed breakdown and recommendations
Cost Factors
For HuggingFace Jobs:
- Hardware per-second rate (see Infrastructure Options)
- Evaluation duration (actual runtime only, billed per-second)
- LLM API costs (if using API models like GPT-4)
- HF Pro subscription ($9/month required)
For Modal:
- Hardware per-second rate (no minimums)
- Evaluation duration (actual runtime only)
- Network egress (data transfer out)
- LLM API costs (if using API models)
Cost Optimization Tips
Use Auto Hardware Selection:
- Automatically picks cheapest hardware for your model
- Avoids over-provisioning (e.g., H200 for 3B model)
Choose Right Infrastructure:
- If you have HF Pro: Use HF Jobs (already paying subscription)
- If you don't have HF Pro: Use Modal (no subscription required)
- For latest GPUs (H200/H100): Use Modal (HF Jobs doesn't offer these)
Optimize Model Selection:
- Smaller models (3B-7B) are 10x cheaper than large models (70B)
- API models (GPT-4-mini) often cheaper than local 70B models
Reduce Test Count:
- Use the difficulty filter (easy only) for quick validation
- Test with a small dataset first, then scale up
Parallel Workers:
- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)
Example Cost Comparison:
| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|---|---|---|---|---|---|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.74** | $1.56** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |
* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
** Per-second billing, actual runtime only (no minimums)
Monitoring Jobs
HuggingFace Jobs
Via HuggingFace Dashboard:
- Go to https://huggingface.co/jobs
- Find your job in the list
- Click to view details and logs
Via TraceMind Job Monitoring Tab:
- Click Job Monitoring in the sidebar
- See all your submitted jobs
- Real-time status updates
- Click job to view logs
Job Statuses:
- pending - Waiting for resources
- running - Currently executing
- completed - Finished successfully
- failed - Error occurred (check logs)
- cancelled - Manually stopped
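These statuses make it easy to script a wait-until-done loop. In the sketch below, get_job_status is a hypothetical placeholder for whatever status lookup you use (e.g., checking the dashboards above or a platform API):

```python
import time

def get_job_status(job_id: str) -> str:
    """Hypothetical placeholder: wire this to the HF Jobs or Modal
    status lookup of your choice."""
    raise NotImplementedError

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    """Poll until the job reaches a terminal status."""
    terminal = {"completed", "failed", "cancelled"}
    status = get_job_status(job_id)
    while status not in terminal:
        time.sleep(poll_seconds)
        status = get_job_status(job_id)
    return status
```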
Modal
Via Modal Dashboard:
- Go to https://modal.com/apps
- Find your app: smoltrace-eval-{job_id}
- Click to view real-time logs and metrics
Via TraceMind Job Monitoring Tab:
- Click Job Monitoring in the sidebar
- See all your submitted jobs
- Modal jobs show as submitted (check the Modal dashboard for details)
Viewing Job Logs
HuggingFace Jobs:
1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE
Modal:
1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time
Expected Job Duration
API Models (litellm provider):
- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits
Local Models (transformers provider):
- Model download: 5-15 minutes (one-time per job)
- 3B model: ~6GB download
- 8B model: ~16GB download
- 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical
Progress Indicators:
- Job queued (0-2 minutes)
- Downloading model (5-15 minutes for first run)
- Running evaluation (10-30 minutes)
- Uploading results to HuggingFace (1-2 minutes)
- ✅ Complete
Understanding Job Results
Where Results Are Stored
HuggingFace Datasets (if output_format = "hub"):
SMOLTRACE creates 4 datasets for each evaluation:
Leaderboard Dataset:
- huggingface/smolagents-leaderboard - Aggregate statistics for the run
- Appears in TraceMind Leaderboard tab
- Public, shared across all users
Results Dataset:
- {your_username}/agent-results-{model}-{timestamp} - Individual test case results
- Success/failure, execution time, tokens, cost
- Links to traces dataset
Traces Dataset:
- {your_username}/agent-traces-{model}-{timestamp} - OpenTelemetry traces (if enable_otel = True)
- Detailed execution steps, LLM calls, tool usage
- Viewable in TraceMind Trace Visualization
Metrics Dataset:
- {your_username}/agent-metrics-{model}-{timestamp} - GPU metrics (if enable_gpu_metrics = True)
- GPU utilization, memory, temperature, CO2 emissions
- Time-series data for each test
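Because results land in ordinary HuggingFace datasets, you can also inspect them outside TraceMind with the datasets library. The repo ID below is illustrative; substitute your own username, model, and timestamp:

```python
from datasets import load_dataset

# Illustrative repo ID following the naming template above.
results = load_dataset(
    "your-username/agent-results-llama-3-1-8b-20250101", split="train"
)
print(results.column_names)
print(results[0])
```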
Local JSON Files (if output_format = "json"):
- Saved to output_dir on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
Viewing Results in TraceMind
Step 1: Refresh Leaderboard
- Go to the Leaderboard tab
- Click Load Leaderboard button
- Your new run appears in the table
Step 2: View Run Details
- Click on your run in the leaderboard
- See detailed test results:
- Individual test cases
- Success/failure breakdown
- Execution times
- Token usage
- Costs
Step 3: Visualize Traces (if enable_otel = True)
- From run details, click on a test case
- Click View Trace button
- See OpenTelemetry waterfall diagram
- Analyze:
- LLM calls and durations
- Tool executions
- Reasoning steps
- GPU metrics overlay (if GPU job)
Step 4: Ask Questions About Results
- Go to the Agent Chat tab
- Ask questions like:
- "Analyze my latest evaluation run"
- "Why did test case 5 fail?"
- "Compare my run with the top model"
- "What was the cost breakdown?"
Interpreting Results
Key Metrics:
| Metric | Description | Good Value |
|---|---|---|
| Success Rate | % of tests passed | >90% excellent, >70% good |
| Avg Duration | Time per test case | <5s good, <10s acceptable |
| Total Cost | Cost for all tests | Varies by model |
| Tokens Used | Total tokens consumed | Lower is better |
| CO2 Emissions | Carbon footprint | Lower is better |
| GPU Utilization | GPU usage % | >60% efficient |
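You can recompute the headline metrics yourself from a results dataset. The column names used below ('success', 'duration_s') are assumptions rather than a documented schema; check column_names on your own dataset first:

```python
from datasets import load_dataset

# Repo ID and column names are illustrative assumptions.
results = load_dataset(
    "your-username/agent-results-llama-3-1-8b-20250101", split="train"
)
rows = list(results)
success_rate = sum(bool(r["success"]) for r in rows) / len(rows)
avg_duration = sum(r["duration_s"] for r in rows) / len(rows)
print(f"success rate: {success_rate:.1%}, avg duration: {avg_duration:.1f}s")
```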
Common Patterns:
High accuracy, low cost:
- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash
High accuracy, high cost:
- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro
Low accuracy, low cost:
- ⚠️ May need prompt optimization or a better model
- Examples: Small local models (<3B params)
Low accuracy, high cost:
- ❌ Poor choice, investigate or switch models
- May indicate configuration issues
Troubleshooting
Job Submission Failures
Error: "HuggingFace token not configured"
- Cause: Missing or invalid HF token
- Fix:
- Go to Settings tab
- Add HF token with "Read + Write + Run Jobs" permissions
- Click "Save API Keys"
Error: "HuggingFace Pro subscription required"
- Cause: HF Jobs requires Pro subscription
- Fix:
- Subscribe at https://huggingface.co/pricing ($9/month)
- Add credit card for GPU charges
- Try again
Error: "Modal credentials not configured"
- Cause: Missing Modal API tokens
- Fix:
- Go to https://modal.com/settings/tokens
- Create new token
- Copy Token ID and Token Secret
- Add to Settings tab
- Try again
Error: "Modal package not installed"
- Cause: Modal SDK missing (should not happen in hosted Space)
- Fix: Contact support or run locally with pip install modal
Job Execution Failures
Job stuck in "Pending" status
- Cause: High demand for GPU resources
- Fix:
- Wait 5-10 minutes
- Try different hardware (e.g., T4 instead of A100)
- Try different infrastructure (Modal vs HF Jobs)
Job fails with "Out of Memory"
- Cause: Model too large for selected hardware
- Fix:
- Use larger GPU (A100-80GB or H200)
- Or use auto hardware selection
- Or reduce parallel_workers to 1
Job fails with "Model not found"
- Cause: Invalid model ID or private model
- Fix:
- Check model ID format: organization/model-name
- For private models, add an HF token with access
- Verify the model exists on HuggingFace Hub
Job fails with "API key not set"
- Cause: Missing LLM provider API key
- Fix:
- Go to Settings tab
- Add API key for your provider (OpenAI, Anthropic, etc.)
- Submit job again
Job fails with "Rate limit exceeded"
- Cause: Too many API requests
- Fix:
- Reduce parallel_workers to 1
- Use a different model with higher rate limits
- Wait and retry later
Modal job fails with "Authentication failed"
- Cause: Invalid Modal tokens
- Fix:
- Go to https://modal.com/settings/tokens
- Create new token (old one may be expired)
- Update tokens in Settings tab
Results Not Appearing
Results not in leaderboard after job completes
- Cause: Dataset upload failed or not configured
- Fix:
- Check job logs for errors
- Verify output_format was set to "hub"
- Verify HF token has "Write" permission
- Manually refresh leaderboard (click "Load Leaderboard")
Traces not appearing
- Cause: OpenTelemetry not enabled
- Fix:
- Re-run evaluation with enable_otel = True
- Check the traces dataset exists on your HF profile
GPU metrics not showing
- Cause: GPU metrics not enabled or CPU job
- Fix:
- Re-run with enable_gpu_metrics = True
- Verify the job used GPU hardware (not CPU)
- Check the metrics dataset exists
Advanced Configuration
Custom Test Datasets
Create your own test dataset:
Use the Synthetic Data Generator tab:
- Configure domain and tools
- Generate custom tasks
- Push to HuggingFace Hub
Use generated dataset in evaluation:
- Set dataset_name to your dataset: {username}/dataset-name
- Configure the agent with matching tools
Dataset Format Requirements:
{
"task_id": "task_001",
"prompt": "What's the weather in Tokyo?",
"expected_tool": "get_weather",
"difficulty": "easy",
"category": "tool_usage"
}
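For example, a minimal way to build and publish a task set in this format with the datasets library (the repo name below is illustrative):

```python
from datasets import Dataset

tasks = [
    {
        "task_id": "task_001",
        "prompt": "What's the weather in Tokyo?",
        "expected_tool": "get_weather",
        "difficulty": "easy",
        "category": "tool_usage",
    },
    # ... more tasks ...
]

# Requires an HF token with Write permission; repo name is illustrative.
Dataset.from_list(tasks).push_to_hub("your-username/my-agent-tasks")
```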
Environment Variables
LLM Provider API Keys (in Settings):
- OPENAI_API_KEY - OpenAI API
- ANTHROPIC_API_KEY - Anthropic API
- GOOGLE_API_KEY or GEMINI_API_KEY - Google Gemini API
- COHERE_API_KEY - Cohere API
- MISTRAL_API_KEY - Mistral API
- TOGETHER_API_KEY - Together AI API
- GROQ_API_KEY - Groq API
- REPLICATE_API_TOKEN - Replicate API
- ANYSCALE_API_KEY - Anyscale API
Infrastructure Credentials:
- HF_TOKEN - HuggingFace token
- MODAL_TOKEN_ID - Modal token ID
- MODAL_TOKEN_SECRET - Modal token secret
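A small pre-flight check that these credentials are actually set (HF Jobs only needs HF_TOKEN; the Modal pair is required only for Modal submissions):

```python
import os

# Modal tokens are only required when submitting to Modal.
required = ["HF_TOKEN", "MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
```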
Parallel Execution
Use parallel_workers to speed up evaluation:
- 1 - Sequential execution (default, safest)
- 2-4 - Moderate parallelism (2-4x faster)
- 5-10 - High parallelism (5-10x faster, risky)
Trade-offs:
- ✅ Faster: Near-linear speedup with workers
- ⚠️ Higher cost: More API calls per minute
- ⚠️ Rate limits: May hit provider rate limits
- ⚠️ Memory: Increases GPU memory usage
Recommendations:
- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
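Conceptually, parallel_workers controls how many test cases are in flight at once, along the lines of this sketch; run_test is a hypothetical stand-in for executing one SMOLTRACE test case:

```python
from concurrent.futures import ThreadPoolExecutor

def run_test(task: dict) -> dict:
    """Hypothetical stand-in for executing one test case."""
    return {"task_id": task["task_id"], "success": True}

def run_all(tasks: list[dict], parallel_workers: int = 1) -> list[dict]:
    # parallel_workers=1 degenerates to sequential execution.
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        return list(pool.map(run_test, tasks))
```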
Private Datasets
Make results private:
- Set private = True in the job configuration
- Results will be private on your HuggingFace profile
- Only you can view in leaderboard (if using private leaderboard dataset)
Use cases:
- Proprietary models
- Confidential evaluation data
- Internal benchmarking
Quick Reference
Job Submission Checklist
Before submitting a job, verify:
- Infrastructure selected (HF Jobs or Modal)
- Hardware configured (auto or manual)
- Model ID is correct
- Provider matches model type
- API keys configured in Settings
- Dataset name is valid
- Output format is "hub" for TraceMind integration
- OpenTelemetry tracing enabled (if you want traces)
- GPU metrics enabled (if using GPU)
- Cost estimate reviewed
- Timeout is sufficient for your model size
Common Model Configurations
OpenAI GPT-4:
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
Anthropic Claude-3.5-Sonnet:
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
Meta Llama-3.1-8B:
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
Meta Llama-3.1-70B:
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
Qwen-2.5-Coder-32B:
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
Next Steps
After submitting your first job:
- Monitor progress in Job Monitoring tab
- View results in Leaderboard when complete
- Analyze traces in Trace Visualization
- Ask questions in Agent Chat about your results
- Compare with other models using Compare feature
- Optimize model selection based on cost/accuracy trade-offs
- Generate custom test datasets for your domain
- Share your results with the community
For more help:
- USER_GUIDE.md - Complete screen-by-screen walkthrough
- MCP_INTEGRATION.md - MCP client architecture details
- ARCHITECTURE.md - Technical architecture overview
- GitHub Issues: TraceMind-AI/issues