# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:

- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ Bill **per-second** with no minimum duration
**Choose based on your needs**:

- **HuggingFace Jobs**: Best if you already have HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:

- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):

- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*

**Pros**:

- Simple authentication (HuggingFace token)
- Integrated with HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:

- Requires HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)

**When to use**:

- ✅ You already have HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:

- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)

**Pros**:

- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:

- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with HF ecosystem

**When to use**:

- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing Leaderboard (Free)

**Required**:

- HuggingFace account (free)
- HuggingFace token with **Read** permissions
**How to get**:

1. Go to https://huggingface.co/settings/tokens
2. Create new token with **Read** permission
3. Copy token (starts with `hf_...`)
4. Add to TraceMind Settings tab (you can verify the token first with the sketch below)
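
If you want to confirm the token works before pasting it into Settings, a quick check with the official `huggingface_hub` client looks like this (the token value is a placeholder):

```python
# Sanity-check that an HF token is valid before adding it to TraceMind.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder - paste your actual token
user = api.whoami()          # raises an error if the token is invalid
print(f"Token OK - authenticated as: {user['name']}")
```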
### For Submitting Jobs to HuggingFace Jobs

**Required**:

1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to setup**:

1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add credit card for compute charges
3. Create token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:

1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to setup**:

1. Create Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy **Token ID** (starts with `ak-...`)
   - Copy **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys

---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:

- Model size (extracted from model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)
**Auto-selection logic** (illustrated in the sketch after the tables below):
**For API Models** (provider = `litellm` or `inference`):

- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:

- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
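
As an illustration, the memory heuristic and the two tables above reduce to a simple lookup. This is a sketch of the documented selection logic, not TraceMind's actual implementation:

```python
# Illustrative sketch of the auto-selection heuristic described above -
# not TraceMind's actual code. Thresholds mirror the two tables.

def estimate_vram_gb(params_billions: float) -> float:
    """~4-5GB per 1B params (weights + KV cache + overhead); midpoint used."""
    return params_billions * 4.5

def select_hardware(params_billions: float, infra: str, provider: str) -> str:
    if provider in ("litellm", "inference"):
        return "cpu-basic" if infra == "hf_jobs" else "cpu"  # API models: CPU only
    if infra == "hf_jobs":
        if params_billions <= 5:
            return "t4-small"    # 16GB VRAM
        if params_billions <= 12:
            return "a10g-large"  # 24GB VRAM
        return "a100-large"      # 80GB VRAM
    # Modal tiers
    if params_billions <= 5:
        return "gpu_t4"          # 16GB VRAM
    if params_billions <= 12:
        return "gpu_l40s"        # 48GB VRAM
    if params_billions <= 48:
        return "gpu_a100_80gb"   # 80GB VRAM
    return "gpu_h200"            # 141GB VRAM

print(estimate_vram_gb(8.0))                          # -> 36.0 GB (fits L40S 48GB)
print(select_hardware(8.0, "modal", "transformers"))  # -> gpu_l40s
```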
### Manual Selection

If you know your model's requirements, you can manually select hardware:

**CPU Jobs** (API models like GPT-4, Claude):

- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):

- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):

- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):

- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):

- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:

- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models

---

## Submitting a Job

### Step 1: Navigate to New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:

- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:

- Use `auto` (recommended) or select specific hardware
- See [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:

- Enter model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:

- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):

- Leave empty unless using HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):

- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:

- `tool` - Function calling agents only
- `code` - Code execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:

- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires Serper API key
- `brave` - Requires Brave Search API key

**Enable Optional Tools**:

- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:

- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:

- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:

- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:

- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Higher values increase memory usage and the risk of hitting API rate limits
### Step 6: Configure Output & Monitoring

**Output Format**:

- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires output directory)

**Output Directory**:

- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:

- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization

**Enable GPU Metrics**:

- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs

**Private Datasets**:

- ☐ Make result datasets private on HuggingFace
- Default: Public datasets

**Debug Mode**:

- ☐ Enable verbose logging for troubleshooting
- Default: Off

**Quiet Mode**:

- ☐ Reduce output verbosity
- Default: Off

**Run ID** (optional):

- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:

- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- Job will be terminated if it exceeds timeout
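
Taken together, Steps 2-6 boil down to a flat set of fields. The dict below is purely illustrative - a one-place summary of the form options, with hypothetical field names rather than TraceMind's actual parameter names:

```python
# Hypothetical summary of the Step 2-6 form fields (illustrative only;
# the real parameter names used by TraceMind/SMOLTRACE may differ).
job_config = {
    "infrastructure": "modal",           # or "hf_jobs"
    "hardware": "auto",                  # or a specific option, e.g. "gpu_l40s"
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "transformers",          # litellm | inference | transformers
    "agent_type": "both",                # tool | code | both
    "search_provider": "duckduckgo",
    "dataset_name": "kshitijthakkar/smoltrace-tasks",
    "split": "train",
    "difficulty": "all",
    "parallel_workers": 1,
    "output_format": "hub",
    "enable_otel": True,
    "enable_gpu_metrics": True,
    "private": False,
    "timeout": "1h",
}
```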
### Step 7: Estimate Cost (Optional but Recommended)

1. Click **💰 Estimate Cost** button
2. Wait for AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:

- **Historical Data**: Based on previous runs of the same model in leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit Job

1. Review all configurations
2. Click **🚀 Submit Evaluation** button
3. Wait for confirmation message
4. Copy job ID for tracking

**Confirmation message includes**:

- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):

- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays number of historical runs used

**MCP AI Analysis** (when no historical data):

- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes detailed breakdown and recommendations

### Cost Factors

**For HuggingFace Jobs**:

1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)

**For Modal**:

1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)
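
Because both platforms bill per second, the hardware portion of a job's cost is simply rate × runtime. A back-of-the-envelope helper using the per-second rates listed in [Infrastructure Options](#infrastructure-options) (LLM API costs are extra and not included):

```python
# Hardware cost = per-second rate x runtime. Rates copied from the
# Infrastructure Options tables; LLM API costs are not included.
RATES_PER_SEC = {
    "t4-small": 0.000111, "a10g-large": 0.000417, "a100-large": 0.000694,  # HF Jobs
    "gpu_t4": 0.000164, "gpu_l40s": 0.000542, "gpu_h200": 0.001261,        # Modal
}

def hardware_cost(hardware: str, minutes: float) -> float:
    return RATES_PER_SEC[hardware] * minutes * 60

print(f"${hardware_cost('a10g-large', 25):.2f}")  # 25 min on a10g-large -> $0.63
print(f"${hardware_cost('gpu_l40s', 20):.2f}")    # 20 min on gpu_l40s  -> $0.65
```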
### Cost Optimization Tips

**Use Auto Hardware Selection**:

- Automatically picks cheapest hardware for your model
- Avoids over-provisioning (e.g., H200 for 3B model)

**Choose Right Infrastructure**:

- **If you have HF Pro**: Use HF Jobs (already paying subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)

**Optimize Model Selection**:

- Smaller models (3B-7B) are 10x cheaper than large models (70B)
- API models (GPT-4-mini) often cheaper than local 70B models

**Reduce Test Count**:

- Use difficulty filter (`easy` only) for quick validation
- Test with small dataset first, then scale up

**Parallel Workers**:

- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.87** | $1.87** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)

---

## Monitoring Jobs

### HuggingFace Jobs

**Via HuggingFace Dashboard**:

1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click job to view logs

**Job Statuses**:

- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped

### Modal

**Via Modal Dashboard**:

1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics

**Via TraceMind Job Monitoring Tab**:

1. Click **📈 Job Monitoring** in sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:

```
1. Go to Job Monitoring tab
2. Click on your job
3. Click "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:

```
1. Go to https://modal.com/apps
2. Find your app
3. Click "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):

- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):

- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical

**Progress Indicators**:

1. ⏳ Job queued (0-2 minutes)
2. 🔄 Downloading model (5-15 minutes for first run)
3. 🧪 Running evaluation (10-30 minutes)
4. 📤 Uploading results to HuggingFace (1-2 minutes)
5. ✅ Complete

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if output_format = "hub"):

SMOLTRACE creates 4 datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in TraceMind Leaderboard tab
   - Public, shared across all users
2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to traces dataset
3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if enable_otel = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in TraceMind Trace Visualization
4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if enable_gpu_metrics = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test

**Local JSON Files** (if output_format = "json"):

- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
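
You can also inspect these datasets outside TraceMind with the `datasets` library. The repo name below is a placeholder following the `{your_username}/agent-results-{model}-{timestamp}` pattern:

```python
# Load a results dataset for ad-hoc inspection (repo name is a placeholder
# following the {your_username}/agent-results-{model}-{timestamp} pattern).
# Requires: pip install datasets
from datasets import load_dataset

results = load_dataset("your-username/agent-results-llama-3-1-8b-20250101", split="train")
print(results.column_names)  # available fields
print(results[0])            # first test-case record
```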
### Viewing Results in TraceMind

**Step 1: Refresh Leaderboard**

1. Go to **📊 Leaderboard** tab
2. Click **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**

1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if enable_otel = True)

1. From run details, click on a test case
2. Click **View Trace** button
3. See OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if GPU job)

**Step 4: Ask Questions About Results**

1. Go to **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |
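
To make the connection between these aggregate metrics and the per-test records concrete, here is a small sketch computing success rate and average duration from result rows (the field names are illustrative assumptions, not the dataset's guaranteed schema):

```python
# Aggregate key metrics from per-test result rows. Field names here are
# illustrative assumptions, not the guaranteed results-dataset schema.
rows = [
    {"success": True,  "duration_s": 3.2, "cost_usd": 0.004},
    {"success": True,  "duration_s": 4.8, "cost_usd": 0.006},
    {"success": False, "duration_s": 9.1, "cost_usd": 0.011},
]

success_rate = 100 * sum(r["success"] for r in rows) / len(rows)
avg_duration = sum(r["duration_s"] for r in rows) / len(rows)
total_cost = sum(r["cost_usd"] for r in rows)
print(f"Success rate: {success_rate:.0f}%, avg duration: {avg_duration:.1f}s, total cost: ${total_cost:.3f}")
```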
**Common Patterns**:

**High accuracy, low cost**:

- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:

- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:

- ⚠️ May need prompt optimization or better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:

- ❌ Poor choice, investigate or switch models
- May indicate configuration issues

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**

- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to Settings tab
  2. Add HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**

- **Cause**: HF Jobs requires Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**

- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token
  3. Copy Token ID and Token Secret
  4. Add to Settings tab
  5. Try again

**Error: "Modal package not installed"**

- **Cause**: Modal SDK missing (should not happen in hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`

### Job Execution Failures

**Job stuck in "Pending" status**

- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try different infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**

- **Cause**: Model too large for selected hardware
- **Fix**:
  - Use larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1

**Job fails with "Model not found"**

- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check model ID format: `organization/model-name`
  - For private models, add HF token with access
  - Verify model exists on HuggingFace Hub

**Job fails with "API key not set"**

- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to Settings tab
  2. Add API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit job again

**Job fails with "Rate limit exceeded"**

- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**

- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token (old one may be expired)
  3. Update tokens in Settings tab

### Results Not Appearing

**Results not in leaderboard after job completes**

- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check job logs for errors
  - Verify `output_format` was set to "hub"
  - Verify HF token has "Write" permission
  - Manually refresh leaderboard (click "Load Leaderboard")

**Traces not appearing**

- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run evaluation with `enable_otel = True`
  - Check traces dataset exists on your HF profile

**GPU metrics not showing**

- **Cause**: GPU metrics not enabled or CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify job used GPU hardware (not CPU)
  - Check metrics dataset exists

---

## Advanced Configuration

### Custom Test Datasets

**Create your own test dataset**:

1. Use **🔬 Synthetic Data Generator** tab:
   - Configure domain and tools
   - Generate custom tasks
   - Push to HuggingFace Hub
2. Use generated dataset in evaluation:
   - Set `dataset_name` to your dataset: `{username}/dataset-name`
   - Configure agent with matching tools

**Dataset Format Requirements**:

```python
{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
```
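
If you prefer to build a dataset in this format programmatically instead of using the Synthetic Data Generator, something like the following works (the repo name is a placeholder):

```python
# Build a SMOLTRACE-style task dataset and push it to the Hub.
# Requires: pip install datasets. The repo name is a placeholder.
from datasets import Dataset

tasks = [
    {
        "task_id": "task_001",
        "prompt": "What's the weather in Tokyo?",
        "expected_tool": "get_weather",
        "difficulty": "easy",
        "category": "tool_usage",
    },
]

Dataset.from_list(tasks).push_to_hub("your-username/my-agent-tasks")
```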
### Environment Variables

**LLM Provider API Keys** (in Settings):

- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API

**Infrastructure Credentials**:

- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret
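
If you run TraceMind locally rather than in the hosted Space, the same credentials can be supplied as ordinary environment variables before launch, for example (all values are placeholders):

```python
# Supply credentials via environment variables for a local run
# (all values below are placeholders).
import os

os.environ["HF_TOKEN"] = "hf_..."
os.environ["MODAL_TOKEN_ID"] = "ak-..."
os.environ["MODAL_TOKEN_SECRET"] = "as-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
```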
### Parallel Execution

**Use `parallel_workers` to speed up evaluation**:

- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)

**Trade-offs**:

- ✅ **Faster**: Linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage

**Recommendations**:

- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
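
Conceptually, `parallel_workers` behaves like a bounded worker pool over test cases. A minimal sketch of the pattern (not SMOLTRACE's actual implementation):

```python
# Minimal sketch of the parallel_workers pattern - a bounded thread pool
# over test cases. Not SMOLTRACE's actual implementation.
from concurrent.futures import ThreadPoolExecutor

def run_test(task: dict) -> dict:
    # Placeholder: call the agent on one task and score the result.
    return {"task_id": task["task_id"], "success": True}

tasks = [{"task_id": f"task_{i:03d}"} for i in range(100)]
parallel_workers = 4  # 1 = sequential (safest); 2-10 = faster but riskier

with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
    results = list(pool.map(run_test, tasks))

print(f"Completed {len(results)} tests")
```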
### Private Datasets

**Make results private**:

1. Set `private = True` in job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view in leaderboard (if using private leaderboard dataset)

**Use cases**:

- Proprietary models
- Confidential evaluation data
- Internal benchmarking

---

## Quick Reference

### Job Submission Checklist

Before submitting a job, verify:

- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is "hub" for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size

### Common Model Configurations

**OpenAI GPT-4**:

```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Anthropic Claude-3.5-Sonnet**:

```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Meta Llama-3.1-8B**:

```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```

**Meta Llama-3.1-70B**:

```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```

**Qwen-2.5-Coder-32B**:

```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```

---

## Next Steps

After submitting your first job:

1. **Monitor progress** in Job Monitoring tab
2. **View results** in Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community

For more help:

- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)