Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

  1. HuggingFace Jobs - Managed compute with GPU/CPU options
  2. Modal - Serverless compute with pay-per-second billing

Both platforms:

  • ✅ Run the same SMOLTRACE evaluation engine
  • ✅ Push results automatically to HuggingFace datasets
  • ✅ Appear in the TraceMind leaderboard when complete
  • ✅ Collect OpenTelemetry traces and GPU metrics
  • ✅ Bill per-second with no minimum duration

Choose based on your needs:

  • HuggingFace Jobs: Best if you already have HF Pro subscription ($9/month)
  • Modal: Best if you need H200/H100 GPUs or want to avoid subscriptions

Pricing Sources: HuggingFace Spaces/Jobs GPU pricing (https://huggingface.co/pricing) and Modal pricing (https://modal.com/pricing).

Infrastructure Options

HuggingFace Jobs

What it is: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

Pricing Model: Subscription-based ($9/month HF Pro) + per-second GPU charges

Hardware Options (pricing from HF Spaces GPU pricing):

  • cpu-basic - 2 vCPU, 16GB RAM (Free with Pro)
  • cpu-upgrade - 8 vCPU, 32GB RAM (Free with Pro)
  • t4-small - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
  • t4-medium - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
  • l4x1 - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
  • l4x4 - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
  • a10g-small - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
  • a10g-large - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
  • a10g-largex2 - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
  • a10g-largex4 - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
  • a100-large - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
  • v5e-1x1 - Google Cloud TPU v5e (pricing TBD)
  • v5e-2x2 - Google Cloud TPU v5e (pricing TBD)
  • v5e-2x4 - Google Cloud TPU v5e (pricing TBD)

Note: Jobs billing is per-second with no minimum. You only pay for actual compute time used.

Pros:

  • Simple authentication (HuggingFace token)
  • Integrated with HF ecosystem
  • Job dashboard at https://huggingface.co/jobs
  • Reliable infrastructure

Cons:

  • Requires HF Pro subscription ($9/month)
  • Slightly more expensive than Modal for most GPUs
  • Limited hardware options compared to Modal (no H100/H200)

When to use:

  • ✅ You already have HF Pro subscription
  • ✅ You want simplicity and reliability
  • ✅ You prefer HuggingFace ecosystem integration
  • ✅ You prefer managed infrastructure

Modal

What it is: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

Pricing Model: Pay-per-second usage (no subscription required)

Hardware Options:

  • cpu - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
  • gpu_t4 - NVIDIA T4 16GB ($0.000164/sec ~= $0.59/hr)
  • gpu_l4 - NVIDIA L4 24GB ($0.000222/sec ~= $0.80/hr)
  • gpu_a10 - NVIDIA A10G 24GB ($0.000306/sec ~= $1.10/hr)
  • gpu_l40s - NVIDIA L40S 48GB ($0.000542/sec ~= $1.95/hr)
  • gpu_a100 - NVIDIA A100 40GB ($0.000583/sec ~= $2.10/hr)
  • gpu_a100_80gb - NVIDIA A100 80GB ($0.000694/sec ~= $2.50/hr)
  • gpu_h100 - NVIDIA H100 80GB ($0.001097/sec ~= $3.95/hr)
  • gpu_h200 - NVIDIA H200 141GB ($0.001261/sec ~= $4.54/hr)
  • gpu_b200 - NVIDIA B200 192GB ($0.001736/sec ~= $6.25/hr)

Pros:

  • Pay-per-second (no hourly minimums)
  • Wide range of GPUs (including H200, H100)
  • No subscription required
  • Real-time logs and monitoring
  • Fast cold starts

Cons:

  • Requires Modal account setup
  • Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
  • Network egress charges apply
  • Less integrated with HF ecosystem

When to use:

  • ✅ You want to minimize costs (generally cheaper than HF Jobs)
  • ✅ You need access to latest GPUs (H200, H100, B200)
  • ✅ You prefer serverless architecture
  • ✅ You don't have HF Pro subscription
  • ✅ You want more GPU options and flexibility

Prerequisites

For Viewing Leaderboard (Free)

Required:

  • HuggingFace account (free)
  • HuggingFace token with Read permissions

How to get:

  1. Go to https://huggingface.co/settings/tokens
  2. Create new token with Read permission
  3. Copy token (starts with hf_...)
  4. Add to TraceMind Settings tab
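
If you want to confirm the token works before adding it to Settings, here is a minimal sketch using the official huggingface_hub client (the token value is a placeholder):

```python
# Quick check that a HuggingFace token is valid; the token string is a placeholder.
from huggingface_hub import whoami

token = "hf_..."  # paste your Read token here
info = whoami(token=token)
print(f"Authenticated as: {info['name']}")
```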

For Submitting Jobs to HuggingFace Jobs

Required:

  1. HuggingFace Pro subscription ($9/month)
  2. HuggingFace token with Read + Write + Run Jobs permissions
  3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

How to setup:

  1. Subscribe to HF Pro: https://huggingface.co/pricing
  2. Add credit card for compute charges
  3. Create a token at https://huggingface.co/settings/tokens with Read + Write + Run Jobs permissions
  4. Add API keys in TraceMind Settings:
    • HuggingFace Token
    • OpenAI API Key (if testing OpenAI models)
    • Anthropic API Key (if testing Claude models)
    • etc.

For Submitting Jobs to Modal

Required:

  1. Modal account (free to create, pay-per-use)
  2. Modal API token (Token ID + Token Secret)
  3. HuggingFace token with Read + Write permissions
  4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

How to setup:

  1. Create Modal account: Sign up at https://modal.com (free to create, pay-per-use)
  2. Create API token: Generate a Token ID and Token Secret from the Modal dashboard
  3. Add credentials in TraceMind Settings:
    • Modal Token ID
    • Modal Token Secret
    • HuggingFace Token (Read + Write)
    • LLM provider API keys

Hardware Selection Guide

Auto-Selection (Recommended)

Set hardware to auto to let TraceMind automatically select the optimal hardware based on:

  • Model size (extracted from model name)
  • Provider type (API vs local)
  • Infrastructure (HF Jobs vs Modal)

Auto-selection logic:

For API Models (provider = litellm or inference):

  • Always uses CPU (no GPU needed)
  • HF Jobs: cpu-basic
  • Modal: cpu

For Local Models (provider = transformers):

Memory estimation for agentic workloads (a worked sketch follows the hardware tables below):

  • Model weights (FP16): ~2GB per 1B params
  • KV cache for long contexts: ~1.5-2x model size
  • Inference overhead: ~20-30% additional
  • Total: ~4-5GB per 1B params for safe execution

HuggingFace Jobs:

| Model Size | Hardware | VRAM | Example Models |
|---|---|---|---|
| < 1B | t4-small | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | t4-small | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | a10g-large | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | a100-large | 80GB | Llama-3.1-70B, Qwen-14B |

Modal:

| Model Size | Hardware | VRAM | Example Models |
|---|---|---|---|
| < 1B | gpu_t4 | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | gpu_t4 | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | gpu_l40s | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | gpu_a100_80gb | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | gpu_a100_80gb | 80GB | Gemma-27B, Yi-34B |
| 49B+ | gpu_h200 | 141GB | Llama-3.1-70B, Qwen-72B |
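
Putting the memory rule of thumb and the tables together, here is a minimal illustrative sketch of how a model size could map to Modal hardware. The function names and exact cutoffs are examples only, not TraceMind's actual auto-selection code:

```python
# Illustrative VRAM estimate and Modal hardware picker based on the guidance above.
# Names and thresholds are examples, not TraceMind's internal implementation.

def estimate_vram_gb(params_billion: float) -> float:
    # weights (~2GB per 1B params, FP16) + KV cache + overhead ~= 4-5GB per 1B params
    return params_billion * 4.5

def pick_modal_gpu(params_billion: float) -> str:
    if params_billion <= 5:
        return "gpu_t4"          # 16GB
    if params_billion <= 12:
        return "gpu_l40s"        # 48GB
    if params_billion <= 48:
        return "gpu_a100_80gb"   # 80GB (the largest models may also need quantization)
    return "gpu_h200"            # 141GB

print(estimate_vram_gb(8), pick_modal_gpu(8))  # 36.0 gpu_l40s
```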

Manual Selection

If you know your model's requirements, you can manually select hardware:

CPU Jobs (API models like GPT-4, Claude):

  • HF Jobs: cpu-basic or cpu-upgrade
  • Modal: cpu

Small Models (1B-5B params):

  • HF Jobs: t4-small (16GB VRAM)
  • Modal: gpu_t4 (16GB VRAM)
  • Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

Medium Models (6B-12B params):

  • HF Jobs: a10g-small or a10g-large (24GB VRAM)
  • Modal: gpu_l40s (48GB VRAM)
  • Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

Large Models (13B-24B params):

  • HF Jobs: a100-large (80GB VRAM)
  • Modal: gpu_a100_80gb (80GB VRAM)
  • Examples: Llama-2-13B, Qwen-14B, Mistral-22B

Very Large Models (25B+ params):

  • HF Jobs: a100-large (80GB VRAM) - may need quantization
  • Modal: gpu_h200 (141GB VRAM) - recommended
  • Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

Cost vs Performance Trade-offs:

  • T4: Cheapest GPU, good for small models
  • L4: Newer architecture, better performance than T4
  • A10G: Good balance of cost/performance for medium models
  • L40S: Best for 7B-12B models (Modal only)
  • A100: Industry standard for large models
  • H200: Latest GPU, massive VRAM (141GB), best for 70B+ models

Submitting a Job

Step 1: Navigate to New Evaluation Screen

  1. Open TraceMind-AI
  2. Click ▶️ New Evaluation in the sidebar
  3. You'll see a comprehensive configuration form

Step 2: Configure Infrastructure

Infrastructure Provider:

  • Choose HuggingFace Jobs or Modal

Hardware:

  • Select auto (recommended) or pick a specific option from the Hardware Selection Guide above

Step 3: Configure Model

Model:

  • Enter model ID (e.g., openai/gpt-4, meta-llama/Llama-3.1-8B-Instruct)
  • Use HuggingFace format: organization/model-name

Provider:

  • litellm - For API models (OpenAI, Anthropic, etc.)
  • inference - For HuggingFace Inference API
  • transformers - For local models loaded with transformers

HF Inference Provider (optional):

  • Leave empty unless using HF Inference API
  • Example: openai-community/gpt2 for HF-hosted models

HuggingFace Token (optional):

  • Leave empty if already configured in Settings
  • Only needed for private models

Step 4: Configure Agent

Agent Type:

  • tool - Function calling agents only
  • code - Code execution agents only
  • both - Hybrid agents (recommended)

Search Provider:

  • duckduckgo - Free, no API key required (recommended)
  • serper - Requires Serper API key
  • brave - Requires Brave Search API key

Enable Optional Tools:

  • Select additional tools for the agent:
    • google_search - Google Search (requires API key)
    • duckduckgo_search - DuckDuckGo Search
    • visit_webpage - Web page scraping
    • python_interpreter - Python code execution
    • wikipedia_search - Wikipedia queries
    • user_input - User interaction (not recommended for batch eval)

Step 5: Configure Test Dataset

Dataset Name:

  • Default: kshitijthakkar/smoltrace-tasks
  • Or use your own HuggingFace dataset
  • Format: username/dataset-name

Dataset Split:

  • Default: train
  • Other options: test, validation

Difficulty Filter:

  • all - All difficulty levels (recommended)
  • easy - Easy tasks only
  • medium - Medium tasks only
  • hard - Hard tasks only

Parallel Workers:

  • Default: 1 (sequential execution)
  • Higher values (2-10) for faster execution
  • ⚠️ Increases memory usage and API rate limits
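
These options correspond to a standard datasets load plus a difficulty filter. A minimal sketch of what the job does conceptually (the difficulty column matches the dataset format shown under Advanced Configuration):

```python
# Conceptual sketch of the dataset, split, and difficulty filter options above.
from datasets import load_dataset

tasks = load_dataset("kshitijthakkar/smoltrace-tasks", split="train")

difficulty = "easy"  # or "all" to keep every task
if difficulty != "all":
    tasks = tasks.filter(lambda row: row["difficulty"] == difficulty)

print(f"{len(tasks)} tasks selected")
```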

Step 6: Configure Output & Monitoring

Output Format:

  • hub - Push to HuggingFace datasets (recommended)
  • json - Save locally (requires output directory)

Output Directory:

  • Only for json format
  • Example: ./evaluation_results

Enable OpenTelemetry Tracing:

  • ✅ Recommended - Collects detailed execution traces
  • Traces appear in TraceMind trace visualization

Enable GPU Metrics:

  • ✅ Recommended for GPU jobs
  • Collects GPU utilization, memory, temperature, CO2 emissions
  • No effect on CPU jobs

Private Datasets:

  • ☐ Make result datasets private on HuggingFace
  • Default: Public datasets

Debug Mode:

  • ☐ Enable verbose logging for troubleshooting
  • Default: Off

Quiet Mode:

  • ☐ Reduce output verbosity
  • Default: Off

Run ID (optional):

  • Auto-generated UUID if left empty
  • Custom ID for tracking specific runs

Job Timeout:

  • Default: 1h (1 hour)
  • Other examples: 30m, 2h, 3h
  • Job will be terminated if it exceeds timeout

Step 7: Estimate Cost (Optional but Recommended)

  1. Click 💰 Estimate Cost button
  2. Wait for AI-powered cost analysis
  3. Review:
    • Estimated total cost
    • Estimated duration
    • Hardware selection (if auto)
    • Historical data (if available)

Cost Estimation Sources:

  • Historical Data: Based on previous runs of the same model in leaderboard
  • MCP AI Analysis: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

Step 8: Submit Job

  1. Review all configurations
  2. Click 🚀 Submit Evaluation button
  3. Wait for confirmation message
  4. Copy job ID for tracking

Confirmation message includes:

  • ✅ Job submission status
  • Job ID and platform-specific ID
  • Hardware selected
  • Estimated duration
  • Monitoring instructions

Example: Submit HuggingFace Jobs Evaluation

Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
β†’ Estimated Cost: $1.25
β†’ Duration: 25 minutes
β†’ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
β†’ βœ… Job submitted successfully!
β†’ HF Job ID: username/job_abc123
β†’ Monitor at: https://huggingface.co/jobs

Example: Submit Modal Evaluation

Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
β†’ Estimated Cost: $0.95
β†’ Duration: 20 minutes
β†’ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
β†’ βœ… Job submitted successfully!
β†’ Modal Call ID: modal-job_xyz789
β†’ Monitor at: https://modal.com/apps

Cost Estimation

Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

Historical Data (most accurate):

  • Based on actual runs of the same model
  • Shows average cost, duration from past evaluations
  • Displays number of historical runs used

MCP AI Analysis (when no historical data):

  • Powered by Google Gemini 2.5 Flash
  • Analyzes model size, hardware, provider
  • Estimates cost based on typical usage patterns
  • Includes detailed breakdown and recommendations

Cost Factors

For HuggingFace Jobs:

  1. Hardware per-second rate (see Infrastructure Options)
  2. Evaluation duration (actual runtime only, billed per-second)
  3. LLM API costs (if using API models like GPT-4)
  4. HF Pro subscription ($9/month required)

For Modal:

  1. Hardware per-second rate (no minimums)
  2. Evaluation duration (actual runtime only)
  3. Network egress (data transfer out)
  4. LLM API costs (if using API models)
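
Since both platforms bill per second, hardware cost is simply rate × runtime plus any LLM API spend. A small sketch using the Modal rates listed earlier (the helper function is illustrative, not TraceMind's estimator):

```python
# Per-second billing: hardware cost = rate * actual runtime (no minimums).
# Rates are copied from the Modal hardware table earlier in this guide.
MODAL_RATES_PER_SEC = {"gpu_l40s": 0.000542, "gpu_a100_80gb": 0.000694, "gpu_h200": 0.001261}

def estimate_cost(hardware: str, runtime_minutes: float, llm_api_cost: float = 0.0) -> float:
    hardware_cost = MODAL_RATES_PER_SEC[hardware] * runtime_minutes * 60
    return hardware_cost + llm_api_cost  # egress and any subscription fee not included

print(round(estimate_cost("gpu_l40s", 20), 2))  # ~0.65 for a 20-minute Llama-3.1-8B run
```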

Cost Optimization Tips

Use Auto Hardware Selection:

  • Automatically picks cheapest hardware for your model
  • Avoids over-provisioning (e.g., H200 for 3B model)

Choose Right Infrastructure:

  • If you have HF Pro: Use HF Jobs (already paying subscription)
  • If you don't have HF Pro: Use Modal (no subscription required)
  • For latest GPUs (H200/H100): Use Modal (HF Jobs doesn't offer these)

Optimize Model Selection:

  • Smaller models (3B-7B) are 10x cheaper than large models (70B)
  • API models (GPT-4-mini) often cheaper than local 70B models

Reduce Test Count:

  • Use difficulty filter (easy only) for quick validation
  • Test with small dataset first, then scale up

Parallel Workers:

  • Keep at 1 for sequential execution (cheapest)
  • Increase only if time is critical (increases API costs)

Example Cost Comparison:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|---|---|---|---|---|---|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.74** | $1.56** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

* Plus LLM API costs (OpenAI/Anthropic/etc., not included)
** Per-second billing, actual runtime only (no minimums)


Monitoring Jobs

HuggingFace Jobs

Via HuggingFace Dashboard:

  1. Go to https://huggingface.co/jobs
  2. Find your job in the list
  3. Click to view details and logs

Via TraceMind Job Monitoring Tab:

  1. Click 📈 Job Monitoring in sidebar
  2. See all your submitted jobs
  3. Real-time status updates
  4. Click job to view logs

Job Statuses:

  • pending - Waiting for resources
  • running - Currently executing
  • completed - Finished successfully
  • failed - Error occurred (check logs)
  • cancelled - Manually stopped

Modal

Via Modal Dashboard:

  1. Go to https://modal.com/apps
  2. Find your app: smoltrace-eval-{job_id}
  3. Click to view real-time logs and metrics

Via TraceMind Job Monitoring Tab:

  1. Click 📈 Job Monitoring in sidebar
  2. See all your submitted jobs
  3. Modal jobs show as submitted (check Modal dashboard for details)

Viewing Job Logs

HuggingFace Jobs:

1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE

Modal:

1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time

Expected Job Duration

API Models (litellm provider):

  • CPU job: 2-5 minutes for 100 tests
  • No model download required
  • Depends on API rate limits

Local Models (transformers provider):

  • Model download: 5-15 minutes (one-time per job)
    • 3B model: ~6GB download
    • 8B model: ~16GB download
    • 70B model: ~140GB download
  • Evaluation: 10-30 minutes for 100 tests
  • Total: 15-45 minutes typical

Progress Indicators:

  1. ⏳ Job queued (0-2 minutes)
  2. 🔄 Downloading model (5-15 minutes for first run)
  3. 🧪 Running evaluation (10-30 minutes)
  4. 📤 Uploading results to HuggingFace (1-2 minutes)
  5. ✅ Complete

Understanding Job Results

Where Results Are Stored

HuggingFace Datasets (if output_format = "hub"):

SMOLTRACE creates 4 datasets for each evaluation:

  1. Leaderboard Dataset: huggingface/smolagents-leaderboard

    • Aggregate statistics for the run
    • Appears in TraceMind Leaderboard tab
    • Public, shared across all users
  2. Results Dataset: {your_username}/agent-results-{model}-{timestamp}

    • Individual test case results
    • Success/failure, execution time, tokens, cost
    • Links to traces dataset
  3. Traces Dataset: {your_username}/agent-traces-{model}-{timestamp}

    • OpenTelemetry traces (if enable_otel = True)
    • Detailed execution steps, LLM calls, tool usage
    • Viewable in TraceMind Trace Visualization
  4. Metrics Dataset: {your_username}/agent-metrics-{model}-{timestamp}

    • GPU metrics (if enable_gpu_metrics = True)
    • GPU utilization, memory, temperature, CO2 emissions
    • Time-series data for each test
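
Because these are ordinary Hub datasets, you can also inspect them directly with the datasets library. A minimal sketch (the repository name is a placeholder following the pattern above, and the "success" column name is an assumption about the results schema):

```python
# Sketch: inspect a results dataset pushed by a completed run.
# The repo name follows the {username}/agent-results-{model}-{timestamp} pattern above;
# the "success" column name is an assumption about the results schema.
from datasets import load_dataset

results = load_dataset("your-username/agent-results-llama-3-1-8b-20250101", split="train")
success_rate = sum(row["success"] for row in results) / len(results)
print(f"Success rate: {success_rate:.1%} over {len(results)} test cases")
```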

Local JSON Files (if output_format = "json"):

  • Saved to output_dir on the job machine
  • Not automatically uploaded to HuggingFace
  • Useful for local testing

Viewing Results in TraceMind

Step 1: Refresh Leaderboard

  1. Go to 📊 Leaderboard tab
  2. Click Load Leaderboard button
  3. Your new run appears in the table

Step 2: View Run Details

  1. Click on your run in the leaderboard
  2. See detailed test results:
    • Individual test cases
    • Success/failure breakdown
    • Execution times
    • Token usage
    • Costs

Step 3: Visualize Traces (if enable_otel = True)

  1. From run details, click on a test case
  2. Click View Trace button
  3. See OpenTelemetry waterfall diagram
  4. Analyze:
    • LLM calls and durations
    • Tool executions
    • Reasoning steps
    • GPU metrics overlay (if GPU job)

Step 4: Ask Questions About Results

  1. Go to 🤖 Agent Chat tab
  2. Ask questions like:
    • "Analyze my latest evaluation run"
    • "Why did test case 5 fail?"
    • "Compare my run with the top model"
    • "What was the cost breakdown?"

Interpreting Results

Key Metrics:

| Metric | Description | Good Value |
|---|---|---|
| Success Rate | % of tests passed | >90% excellent, >70% good |
| Avg Duration | Time per test case | <5s good, <10s acceptable |
| Total Cost | Cost for all tests | Varies by model |
| Tokens Used | Total tokens consumed | Lower is better |
| CO2 Emissions | Carbon footprint | Lower is better |
| GPU Utilization | GPU usage % | >60% efficient |

Common Patterns:

High accuracy, low cost:

  • ✅ Excellent model for production
  • Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash

High accuracy, high cost:

  • ✅ Best for quality-critical tasks
  • Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

Low accuracy, low cost:

  • ⚠️ May need prompt optimization or better model
  • Examples: Small local models (<3B params)

Low accuracy, high cost:

  • ❌ Poor choice, investigate or switch models
  • May indicate configuration issues

Troubleshooting

Job Submission Failures

Error: "HuggingFace token not configured"

  • Cause: Missing or invalid HF token
  • Fix:
    1. Go to Settings tab
    2. Add HF token with "Read + Write + Run Jobs" permissions
    3. Click "Save API Keys"

Error: "HuggingFace Pro subscription required"

Error: "Modal credentials not configured"

Error: "Modal package not installed"

  • Cause: Modal SDK missing (should not happen in hosted Space)
  • Fix: Contact support or run locally with pip install modal

Job Execution Failures

Job stuck in "Pending" status

  • Cause: High demand for GPU resources
  • Fix:
    • Wait 5-10 minutes
    • Try different hardware (e.g., T4 instead of A100)
    • Try different infrastructure (Modal vs HF Jobs)

Job fails with "Out of Memory"

  • Cause: Model too large for selected hardware
  • Fix:
    • Use larger GPU (A100-80GB or H200)
    • Or use auto hardware selection
    • Or reduce parallel_workers to 1

Job fails with "Model not found"

  • Cause: Invalid model ID or private model
  • Fix:
    • Check model ID format: organization/model-name
    • For private models, add HF token with access
    • Verify model exists on HuggingFace Hub

Job fails with "API key not set"

  • Cause: Missing LLM provider API key
  • Fix:
    1. Go to Settings tab
    2. Add API key for your provider (OpenAI, Anthropic, etc.)
    3. Submit job again

Job fails with "Rate limit exceeded"

  • Cause: Too many API requests
  • Fix:
    • Reduce parallel_workers to 1
    • Use different model with higher rate limits
    • Wait and retry later

Modal job fails with "Authentication failed"

Results Not Appearing

Results not in leaderboard after job completes

  • Cause: Dataset upload failed or not configured
  • Fix:
    • Check job logs for errors
    • Verify output_format was set to "hub"
    • Verify HF token has "Write" permission
    • Manually refresh leaderboard (click "Load Leaderboard")

Traces not appearing

  • Cause: OpenTelemetry not enabled
  • Fix:
    • Re-run evaluation with enable_otel = True
    • Check traces dataset exists on your HF profile

GPU metrics not showing

  • Cause: GPU metrics not enabled or CPU job
  • Fix:
    • Re-run with enable_gpu_metrics = True
    • Verify job used GPU hardware (not CPU)
    • Check metrics dataset exists

Advanced Configuration

Custom Test Datasets

Create your own test dataset:

  1. Use 🔬 Synthetic Data Generator tab:

    • Configure domain and tools
    • Generate custom tasks
    • Push to HuggingFace Hub
  2. Use generated dataset in evaluation:

    • Set dataset_name to your dataset: {username}/dataset-name
    • Configure agent with matching tools

Dataset Format Requirements:

{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
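
To publish your own dataset in this format, a minimal sketch using the datasets library (the repository name is a placeholder; you can also let the Synthetic Data Generator tab push it for you):

```python
# Sketch: build a custom task dataset in the format above and push it to the Hub.
# Requires a HuggingFace token with Write permission (huggingface-cli login or token=...).
from datasets import Dataset

tasks = [
    {
        "task_id": "task_001",
        "prompt": "What's the weather in Tokyo?",
        "expected_tool": "get_weather",
        "difficulty": "easy",
        "category": "tool_usage",
    },
]

Dataset.from_list(tasks).push_to_hub("your-username/my-agent-tasks")  # placeholder repo name
```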

Environment Variables

LLM Provider API Keys (in Settings):

  • OPENAI_API_KEY - OpenAI API
  • ANTHROPIC_API_KEY - Anthropic API
  • GOOGLE_API_KEY or GEMINI_API_KEY - Google Gemini API
  • COHERE_API_KEY - Cohere API
  • MISTRAL_API_KEY - Mistral API
  • TOGETHER_API_KEY - Together AI API
  • GROQ_API_KEY - Groq API
  • REPLICATE_API_TOKEN - Replicate API
  • ANYSCALE_API_KEY - Anyscale API

Infrastructure Credentials:

  • HF_TOKEN - HuggingFace token
  • MODAL_TOKEN_ID - Modal token ID
  • MODAL_TOKEN_SECRET - Modal token secret
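
If you run evaluations outside the TraceMind Settings tab, the same names are read from your environment. A small sketch for checking that the credentials you need are present (the required list is only an example):

```python
# Sketch: verify the credential environment variables listed above are set.
import os

required = ["HF_TOKEN", "OPENAI_API_KEY"]            # adjust to your LLM provider
optional = ["MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"]  # only needed for Modal jobs

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required credentials are set.")
```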

Parallel Execution

Use parallel_workers to speed up evaluation:

  • 1 - Sequential execution (default, safest)
  • 2-4 - Moderate parallelism (2-4x faster)
  • 5-10 - High parallelism (5-10x faster, risky)

Trade-offs:

  • βœ… Faster: Linear speedup with workers
  • ⚠️ Higher cost: More API calls per minute
  • ⚠️ Rate limits: May hit provider rate limits
  • ⚠️ Memory: Increases GPU memory usage

Recommendations:

  • API models: Keep at 1 (avoid rate limits)
  • Local models: Can use 2-4 if GPU has enough VRAM
  • Production runs: Use 1 for reliability
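
Conceptually, parallel_workers acts like the max_workers of a thread pool running test cases concurrently. A simplified sketch (run_test is a placeholder for evaluating a single task, not SMOLTRACE's actual API):

```python
# Simplified sketch of what parallel_workers controls: concurrent test execution.
# run_test() is a placeholder for evaluating one task, not SMOLTRACE's actual API.
from concurrent.futures import ThreadPoolExecutor

def run_test(task_id):
    # placeholder: call the agent on one task and score the result
    return {"task_id": task_id, "success": True}

tasks = ["task_001", "task_002", "task_003"]
parallel_workers = 1  # increase to 2-4 only if rate limits and GPU memory allow

with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
    results = list(pool.map(run_test, tasks))
```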

Private Datasets

Make results private:

  1. Set private = True in job configuration
  2. Results will be private on your HuggingFace profile
  3. Only you can view in leaderboard (if using private leaderboard dataset)

Use cases:

  • Proprietary models
  • Confidential evaluation data
  • Internal benchmarking

Quick Reference

Job Submission Checklist

Before submitting a job, verify:

  • Infrastructure selected (HF Jobs or Modal)
  • Hardware configured (auto or manual)
  • Model ID is correct
  • Provider matches model type
  • API keys configured in Settings
  • Dataset name is valid
  • Output format is "hub" for TraceMind integration
  • OpenTelemetry tracing enabled (if you want traces)
  • GPU metrics enabled (if using GPU)
  • Cost estimate reviewed
  • Timeout is sufficient for your model size

Common Model Configurations

OpenAI GPT-4:

Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only

Anthropic Claude-3.5-Sonnet:

Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only

Meta Llama-3.1-8B:

Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50

Meta Llama-3.1-70B:

Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00

Qwen-2.5-Coder-32B:

Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00

Next Steps

After submitting your first job:

  1. Monitor progress in Job Monitoring tab
  2. View results in Leaderboard when complete
  3. Analyze traces in Trace Visualization
  4. Ask questions in Agent Chat about your results
  5. Compare with other models using Compare feature
  6. Optimize model selection based on cost/accuracy trade-offs
  7. Generate custom test datasets for your domain
  8. Share your results with the community

For more help: