# TraceMind-AI - Complete User Guide

This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.

## Table of Contents

- [Getting Started](#getting-started)
- [Screen-by-Screen Guide](#screen-by-screen-guide)
  - [📊 Leaderboard](#-leaderboard)
  - [🤖 Agent Chat](#-agent-chat)
  - [🚀 New Evaluation](#-new-evaluation)
  - [📈 Job Monitoring](#-job-monitoring)
  - [🔍 Trace Visualization](#-trace-visualization)
  - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
  - [⚙️ Settings](#️-settings)
- [Common Workflows](#common-workflows)
- [Troubleshooting](#troubleshooting)

---

## Getting Started

### First-Time Setup

1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. **Sign in** with your HuggingFace account (required for viewing)
3. **Configure API keys** (optional but recommended):
   - Go to **⚙️ Settings** tab
   - Enter Gemini API Key and HuggingFace Token
   - Click **"Save API Keys"**

### Navigation

TraceMind-AI is organized into tabs:

- **📊 Leaderboard**: View evaluation results with AI insights
- **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
- **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
- **📈 Job Monitoring**: Track status of submitted jobs
- **🔍 Trace Visualization**: Deep-dive into agent execution traces
- **🔬 Synthetic Data Generator**: Create custom test datasets with AI
- **⚙️ Settings**: Configure API keys and preferences

---

## Screen-by-Screen Guide

### 📊 Leaderboard

**Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.

#### Features

**Main Table**:
- View all evaluation runs from the SMOLTRACE leaderboard
- Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
- Click any row to see detailed test results

**AI Insights Panel** (top of screen):
- Automatically generated insights from the MCP server
- Powered by Google Gemini 2.5 Flash
- Updates when you click "Load Leaderboard"
- Shows top performers, trends, and recommendations

**Filter & Sort Options**:
- Filter by agent type (tool, code, both)
- Filter by provider (litellm, transformers)
- Sort by any metric (success rate, cost, duration)

#### How to Use

1. **Load Data**:
   ```
   Click "Load Leaderboard" button
   → Fetches latest evaluation runs from HuggingFace
   → AI generates insights automatically
   ```

2. **Read AI Insights**:
   - Located at top of screen
   - Summary of evaluation trends
   - Top performing models
   - Cost/accuracy trade-offs
   - Actionable recommendations

3. **Explore Runs**:
   - Scroll through the table
   - Sort by clicking column headers
   - Click on any run to see details

4. **View Details**:
   ```
   Click a row in the table
   → Opens detail view with:
     - All test cases (success/failure)
     - Execution times
     - Cost breakdown
     - Link to trace visualization
   ```

#### Example Workflow

```
Scenario: Find the most cost-effective model for production

1. Click "Load Leaderboard"
2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
3. Sort table by "Cost" (ascending)
4. Compare top 3 cheapest models
5. Click on Llama-3.1-8B run to see detailed results
6. Review success rate (93.4%) and test case breakdowns
7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
```
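The leaderboard is backed by a public HuggingFace dataset, so you can also pull it into a notebook for your own analysis. A minimal sketch (the dataset ID is the one listed under Troubleshooting; the split and column names are assumptions, so inspect the schema first):

```python
# Sketch: load the SMOLTRACE leaderboard dataset for offline analysis.
# The dataset ID comes from this guide's Troubleshooting section; the split
# name and column names below are assumptions -- check df.columns first.
from datasets import load_dataset

leaderboard = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
df = leaderboard.to_pandas()

print(df.columns.tolist())             # inspect the actual column names
print(df.sort_values("cost").head(3))  # e.g., three cheapest runs (assumed column)
```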
#### Tips

- **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
- **Compare models**: Use the sort function to compare across different metrics
- **Trust the AI**: The insights panel provides strategic recommendations based on all data

---

### 🤖 Agent Chat

**Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.

**🎯 Track 2 Feature**: This demonstrates MCP client usage with the smolagents framework.

#### Features

**Autonomous Agent**:
- Built with the `smolagents` framework
- Has access to all TraceMind MCP Server tools
- Plans and executes multi-step actions
- Provides detailed, data-driven answers

**MCP Tools Available to Agent**:
- `analyze_leaderboard` - Get AI insights about top performers
- `estimate_cost` - Calculate evaluation costs before running
- `debug_trace` - Analyze execution traces
- `compare_runs` - Compare two evaluation runs
- `get_top_performers` - Fetch top N models efficiently
- `get_leaderboard_summary` - Get high-level statistics
- `get_dataset` - Load SMOLTRACE datasets
- `analyze_results` - Analyze detailed test results

**Agent Reasoning Visibility**:
- Toggle **"Show Agent Reasoning"** to see:
  - Planning steps
  - Tool execution logs
  - Intermediate results
  - Final synthesis

**Quick Action Buttons**:
- **"Quick: Top Models"**: Get top 5 models with costs
- **"Quick: Cost Estimate"**: Estimate cost for a model
- **"Quick: Load Leaderboard"**: Fetch leaderboard summary

#### How to Use

1. **Start a Conversation**:
   ```
   Type your question in the chat box
   Example: "What are the top 3 performing models and how much do they cost?"
   Click "Send"
   → Agent plans approach
   → Executes MCP tools
   → Returns synthesized answer
   ```

2. **Watch Agent Work** (optional):
   ```
   Enable "Show Agent Reasoning" checkbox
   → See step-by-step execution:
     Step 1: Calling get_top_performers with metric=success_rate, top_n=3
     Step 2: Analyzing costs from results
     Step 3: Formatting response with recommendations
   ```

3. **Ask Follow-Up Questions**:
   ```
   You: "Which one is most cost-effective?"
   Agent: Uses previous context + MCP tools to answer
   ```

4. **Use Quick Actions**:
   ```
   Click "Quick: Top Models" button
   → Agent automatically executes query
   → Returns top 5 models with metrics
   ```

#### Example Conversations

**Example 1: Cost Comparison**

```
User: "Compare the costs of the top 3 models. Which one offers the best value?"

Agent: Let me analyze the top 3 models by success rate:

[Calls: get_top_performers(metric="success_rate", top_n=3)]

Top 3 Models:
1. GPT-4: 95.8% success, $0.05/run
2. Claude-3: 94.1% success, $0.04/run
3. Llama-3.1-8B: 93.4% success, $0.002/run

[Calls: analyze_leaderboard(metric_focus="cost")]

Cost Analysis:
- GPT-4: Highest accuracy but 25x more expensive than Llama
- Claude-3: Good middle ground at $0.04/run
- Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop

Recommendation for "best value":
→ Llama-3.1-8B offers 93.4% accuracy at $0.002/run
→ For 1M runs: Saves $48,000/month vs GPT-4
→ Only 2.4% accuracy trade-off
```

**Example 2: Evaluation Planning**

```
User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost
and which hardware should I use?"

Agent: Let me estimate costs for DeepSeek-V3:

[Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]

Cost Estimate:
- Model: deepseek-ai/DeepSeek-V3
- Tests: 100
- Recommended hardware: GPU A100 (70B+ parameters)

Breakdown:
- LLM inference cost: ~$0.01/test (self-hosted, no API fees)
- HF Jobs GPU cost: ~$3.00/hr for A100
- Estimated duration: 25 minutes
- Total compute cost: $1.25
- Total cost: $1.25 (100 tests)

Recommendation:
Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
Alternative: Modal H200 for 2x faster inference at $2.50.
```
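For reference, the sketch below shows one way to wire a smolagents agent to an MCP server yourself. It is illustrative only: the placeholder server URL, the SSE transport, the exact `ToolCollection.from_mcp` call shape, and the model choice are assumptions, not TraceMind-AI's actual implementation.

```python
# Illustrative sketch (not the app's exact code): connect a smolagents agent
# to an MCP server and ask it a leaderboard question.
# Assumptions: the MCP server URL, the SSE transport, and the model id.
from smolagents import CodeAgent, LiteLLMModel, ToolCollection

MCP_SERVER_URL = "https://<your-mcp-server>/gradio_api/mcp/sse"  # placeholder

with ToolCollection.from_mcp({"url": MCP_SERVER_URL}, trust_remote_code=True) as tools:
    agent = CodeAgent(
        tools=[*tools.tools],
        model=LiteLLMModel(model_id="gemini/gemini-2.5-flash"),  # assumed model id
    )
    print(agent.run("What are the top 3 performing models and how much do they cost?"))
```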
#### Tips

- **Be specific**: Ask clear, focused questions for better answers
- **Use context**: Agent remembers conversation history
- **Watch reasoning**: Enable to understand how the agent uses MCP tools
- **Try quick actions**: Fast way to get common information

---

### 🚀 New Evaluation

**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.

**⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.

#### Features

**Model Selection**:
- Enter any model name (format: `provider/model-name`)
- Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
- Auto-detects whether it is an API model or a local model

**Infrastructure Choice**:
- **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
- **Modal**: Serverless GPU compute (pay-per-second)

**Hardware Selection**:
- **Auto** (recommended): Automatically selects optimal hardware based on model size
- **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU

**Cost Estimation**:
- Click **"💰 Estimate Cost"** before submitting
- Shows predicted:
  - LLM API costs (for API models)
  - Compute costs (for local models)
  - Duration estimate
  - CO2 emissions

**Agent Type**:
- **tool**: Test tool-calling capabilities
- **code**: Test code generation capabilities
- **both**: Test both (recommended)

#### How to Use

**Step 1: Configure Prerequisites** (one-time setup)

For **HuggingFace Jobs**:
```
1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
2. Add credit card for compute charges
3. Create HF token with "Read + Write + Run Jobs" permissions
4. Go to Settings tab → Enter HF token → Save
```

For **Modal** (alternative):
```
1. Sign up: https://modal.com (free tier available)
2. Generate API token: https://modal.com/settings/tokens
3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
```

For **API Models** (OpenAI, Anthropic, etc.):
```
1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
2. Go to Settings tab → Enter provider API key → Save
```

**Step 2: Create Evaluation**

```
1. Enter model name:
   Example: "meta-llama/Llama-3.1-8B"

2. Select infrastructure:
   - HuggingFace Jobs (default)
   - Modal (alternative)

3. Choose agent type:
   - "both" (recommended)

4. Select hardware:
   - "auto" (recommended - smart selection)
   - Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200

5. Set timeout (optional):
   - Default: 3600s (1 hour)
   - Range: 300s - 7200s

6. Click "💰 Estimate Cost":
   → Shows predicted cost and duration
   → Example: "$2.00, 20 minutes, 0.5g CO2"

7. Review estimate, then click "Submit Evaluation"
```
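The "auto" option in step 4 and the estimate in step 6 roughly follow the Hardware Selection Guide below. A back-of-the-envelope sketch of that logic (the function, thresholds, and duration are illustrative assumptions, not the app's actual algorithm; the hourly rates are the approximate HF Jobs prices from the guide):

```python
# Illustrative sketch of "auto" hardware selection plus a rough compute-cost
# estimate. Function name, thresholds, and duration are assumptions; hourly
# rates are the approximate HF Jobs prices from the Hardware Selection Guide.
HOURLY_RATE_USD = {"cpu-basic": 0.05, "t4-small": 0.60, "a10g-small": 1.10,
                   "a100-large": 3.00, "h200": 5.00}

def choose_hardware(num_params_b: float | None) -> str:
    """Pick a tier from model size; None means an API-hosted model."""
    if num_params_b is None:
        return "cpu-basic"    # API calls need no GPU
    if num_params_b <= 8:
        return "t4-small"
    if num_params_b <= 13:
        return "a10g-small"
    return "a100-large"       # 70B+ models; pick "h200" for fastest inference

tier = choose_hardware(8)     # e.g., meta-llama/Llama-3.1-8B
est_minutes = 20              # assumed duration for a small local model
print(tier, round(HOURLY_RATE_USD[tier] * est_minutes / 60, 2), "USD compute")
```

In the app, the "💰 Estimate Cost" button gives the authoritative numbers; this sketch only makes the guide's figures easy to reproduce.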
**Step 3: Monitor Job**

```
After submission:
→ Job ID displayed
→ Go to "📈 Job Monitoring" tab to track progress
→ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
```

**Step 4: View Results**

```
When job completes:
→ Results automatically uploaded to HuggingFace datasets
→ Appears in Leaderboard within 1-2 minutes
→ Click on your run to see detailed results
```

#### Hardware Selection Guide

**For API Models** (OpenAI, Anthropic, Google):
- Use: `cpu-basic` (HF Jobs) or CPU (Modal)
- Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
- Why: No GPU needed for API calls

**For Small Models** (4B-8B parameters):
- Use: `t4-small` (HF) or A10G (Modal)
- Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
- Examples: Llama-3.1-8B, Mistral-7B

**For Medium Models** (7B-13B parameters):
- Use: `a10g-small` (HF) or A10G (Modal)
- Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
- Examples: Qwen2.5-14B, Mixtral-8x7B

**For Large Models** (70B+ parameters):
- Use: `a100-large` (HF) or A100-80GB (Modal)
- Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
- Examples: Llama-3.1-70B, DeepSeek-V3

**For Fastest Inference**:
- Use: `h200` (HF or Modal)
- Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
- Best for: Time-sensitive evaluations, large batches

#### Example Workflows

**Workflow 1: Evaluate API Model (OpenAI GPT-4)**

```
1. Model: "openai/gpt-4"
2. Infrastructure: HuggingFace Jobs
3. Agent type: both
4. Hardware: auto (selects cpu-basic)
5. Estimate: $50.00 (mostly API costs), 45 min
6. Submit → Monitor → View in leaderboard
```

**Workflow 2: Evaluate Local Model (Llama-3.1-8B)**

```
1. Model: "meta-llama/Llama-3.1-8B"
2. Infrastructure: Modal (for pay-per-second billing)
3. Agent type: both
4. Hardware: auto (selects A10G)
5. Estimate: $0.20, 15 min
6. Submit → Monitor → View in leaderboard
```

#### Tips

- **Always estimate first**: Prevents surprise costs
- **Use "auto" hardware**: Smart selection based on model size
- **Start small**: Test with 10-20 tests before scaling to 100+
- **Monitor jobs**: Check Job Monitoring tab for status
- **Modal for experimentation**: Pay-per-second is cost-effective for testing

---

### 📈 Job Monitoring

**Purpose**: Track status of submitted evaluation jobs.

#### Features

**Job Status Display**:
- Job ID
- Current status (pending, running, completed, failed)
- Start time
- Duration
- Infrastructure (HF Jobs or Modal)

**Real-time Updates**:
- Auto-refreshes every 30 seconds
- Manual refresh button

**Job Actions**:
- View logs
- Cancel job (if still running)
- View results (if completed)

#### How to Use

```
1. Go to "📈 Job Monitoring" tab
2. See list of your submitted jobs
3. Click "Refresh" for latest status
4. When status = "completed":
   → Click "View Results"
   → Opens leaderboard filtered to your run
```

#### Job Statuses

- **Pending**: Job queued, waiting for resources
- **Running**: Evaluation in progress
- **Completed**: Evaluation finished successfully
- **Failed**: Evaluation encountered an error

#### Tips

- **Check logs** if job fails: Helps diagnose issues
- **Expected duration**:
  - API models: 2-5 minutes
  - Local models: 15-30 minutes (includes model download)

---

### 🔍 Trace Visualization

**Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
**Access**: Click on any test case in a run's detail view

#### Features

**Waterfall Diagram**:
- Visual timeline of execution
- Spans show: LLM calls, tool executions, reasoning steps
- Duration bars (wider = slower)
- Parent-child relationships

**Span Details**:
- Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
- Start/end times
- Duration
- Attributes (model, tokens, cost, tool inputs/outputs)
- Status (OK, ERROR)

**GPU Metrics Overlay** (for GPU jobs only):
- GPU utilization %
- Memory usage
- Temperature
- CO2 emissions

**MCP-Powered Q&A**:
- Ask questions about the trace
- Example: "Why was tool X called twice?"
- Agent uses `debug_trace` MCP tool to analyze

#### How to Use

```
1. From leaderboard → Click a run → Click a test case

2. View waterfall diagram:
   → Spans arranged chronologically
   → Parent spans (e.g., "Agent Execution")
   → Child spans (e.g., "LLM Call", "Tool Call")

3. Click any span:
   → See detailed attributes
   → Token counts, costs, inputs/outputs

4. Ask questions (MCP-powered):
   User: "Why did this test fail?"
   → Agent analyzes trace with debug_trace tool
   → Returns explanation with span references

5. Check GPU metrics (if available):
   → Graph shows utilization over time
   → Overlaid on execution timeline
```

#### Example Analysis

**Scenario: Understanding a slow execution**

```
1. Open trace for test_045 (duration: 8.5s)

2. Waterfall shows:
   - Span 1: LLM Call - Reasoning (1.2s) ✓
   - Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
   - Span 3: LLM Call - Final Response (0.8s) ✓

3. Click Span 2 (search_web):
   - Input: {"query": "weather in Tokyo"}
   - Output: 5 results
   - Duration: 6.5s (6x slower than typical)

4. Ask agent: "Why was the search_web call so slow?"
   → Agent analysis:
   "The search_web call took 6.5s due to network latency.
   Span attributes show API response time: 6.2s.
   This is an external dependency issue, not agent code.
   Recommendation: Implement timeout (5s) and fallback strategy."
```

#### Tips

- **Look for patterns**: Similar failures often have common spans
- **Use MCP Q&A**: Faster than manual trace analysis
- **Check GPU metrics**: Identify resource bottlenecks
- **Compare successful vs failed traces**: Spot differences
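If you export traces for offline analysis, a short script can surface slow spans without the UI. A minimal sketch, assuming spans are available as a list of dicts with fields similar to the Span Details above (the real export schema may differ):

```python
# Sketch: flag unusually slow spans in an exported trace.
# The span schema (name, duration_s, status) is an assumption based on the
# Span Details list above; adapt field names to the actual export format.
spans = [
    {"name": "LLM Call - Reasoning", "duration_s": 1.2, "status": "OK"},
    {"name": "Tool Call - search_web", "duration_s": 6.5, "status": "OK"},
    {"name": "LLM Call - Final Response", "duration_s": 0.8, "status": "OK"},
]

total = sum(s["duration_s"] for s in spans)
for span in sorted(spans, key=lambda s: s["duration_s"], reverse=True):
    share = span["duration_s"] / total
    flag = "⚠️ SLOW" if share > 0.5 else ""
    print(f"{span['name']:<30} {span['duration_s']:>5.1f}s  {share:>6.1%}  {flag}")
```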
---

### 🔬 Synthetic Data Generator

**Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.

#### Features

**AI-Powered Dataset Generation**:
- Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
- Customizable domain, tools, difficulty, and agent type
- Automatic batching for large datasets (parallel generation)
- SMOLTRACE-format output ready for evaluation

**Prompt Template Generation**:
- Customized YAML templates based on the smolagents format
- Optimized for your specific domain and tools
- Included automatically in the dataset card

**Push to HuggingFace Hub**:
- One-click upload to HuggingFace Hub
- Public or private repositories
- Auto-generated README with usage instructions
- Ready to use with SMOLTRACE evaluations

#### How to Use

**Step 1: Configure & Generate Dataset**

1. Navigate to the **🔬 Synthetic Data Generator** tab

2. Configure generation parameters:
   - **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
   - **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
   - **Number of Tasks**: 5-100 tasks (slider)
   - **Difficulty Level**:
     - `balanced` (40% easy, 40% medium, 20% hard)
     - `easy_only` (100% easy tasks)
     - `medium_only` (100% medium tasks)
     - `hard_only` (100% hard tasks)
     - `progressive` (50% easy, 30% medium, 20% hard)
   - **Agent Type**:
     - `tool` (ToolCallingAgent only)
     - `code` (CodeAgent only)
     - `both` (50/50 mix)

3. Click **"🎲 Generate Synthetic Dataset"**

4. Wait for generation (30-120s depending on size):
   - Shows progress message
   - Automatic batching for >20 tasks
   - Parallel API calls for faster generation

**Step 2: Review Generated Content**

1. **Dataset Preview Tab**:
   - View all generated tasks in JSON format
   - Check task IDs, prompts, expected tools, difficulty
   - See dataset statistics:
     - Total tasks
     - Difficulty distribution
     - Agent type distribution
     - Tools coverage

2. **Prompt Template Tab**:
   - View customized YAML prompt template
   - Based on smolagents templates
   - Adapted for your domain and tools
   - Ready to use with ToolCallingAgent or CodeAgent

**Step 3: Push to HuggingFace Hub** (Optional)

1. Enter **Repository Name**:
   - Format: `username/smoltrace-{domain}-tasks`
   - Example: `alice/smoltrace-finance-tasks`
   - Auto-filled with your HF username after generation

2. Set **Visibility**:
   - ☐ Private Repository (unchecked = public)
   - ☑ Private Repository (checked = private)

3. Provide **HuggingFace Token** (optional):
   - Leave empty to use the environment token (HF_TOKEN from Settings)
   - Or paste a token from https://huggingface.co/settings/tokens
   - Requires write permissions

4. Click **"📤 Push to HuggingFace Hub"**

5. Wait for upload (5-30s):
   - Creates dataset repository
   - Uploads tasks
   - Generates README with:
     - Usage instructions
     - Prompt template
     - SMOLTRACE integration code
   - Returns dataset URL

#### Example Workflow

```
Scenario: Create finance evaluation dataset with 20 tasks

1. Configure:
   Domain: "finance"
   Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
   Number of Tasks: 20
   Difficulty: "balanced"
   Agent Type: "both"

2. Click "Generate"
   → AI generates 20 tasks:
     - 8 easy (single tool, straightforward)
     - 8 medium (multiple tools or complex logic)
     - 4 hard (complex reasoning, edge cases)
     - 10 for ToolCallingAgent
     - 10 for CodeAgent
   → Also generates customized prompt template

3. Review Dataset Preview:

   Task 1:
   {
     "id": "finance_stock_price_1",
     "prompt": "What is the current price of AAPL stock?",
     "expected_tool": "get_stock_price",
     "difficulty": "easy",
     "agent_type": "tool",
     "expected_keywords": ["AAPL", "price", "$"]
   }

   Task 15:
   {
     "id": "finance_complex_analysis_15",
     "prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
     "expected_tool": "calculate_roi",
     "expected_tool_calls": 2,
     "difficulty": "hard",
     "agent_type": "code",
     "expected_keywords": ["ROI", "15%", "alert"]
   }

4. Review Prompt Template:
   See customized YAML with:
   - Finance-specific system prompt
   - Tool descriptions for get_stock_price, calculate_roi, etc.
   - Response format guidelines

5. Push to Hub:
   Repository: "yourname/smoltrace-finance-tasks"
   Private: No (public)
   Token: (empty, using environment token)
   → Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
   → README includes usage instructions and prompt template

6. Use in evaluation:

   # Load your custom dataset
   dataset = load_dataset("yourname/smoltrace-finance-tasks")

   # Run SMOLTRACE evaluation
   smoltrace-eval --model openai/gpt-4 \
     --dataset-name yourname/smoltrace-finance-tasks \
     --agent-type both
```
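If you prefer to review or edit the generated tasks locally before publishing, the programmatic equivalent of the "Push to HuggingFace Hub" button is a short `datasets` call. A minimal sketch (repo name and fields follow the example above; this is not the app's internal upload code):

```python
# Sketch: push a reviewed list of generated tasks to the Hub yourself.
# Field names follow the task examples above; the repo name is illustrative.
from datasets import Dataset

tasks = [
    {"id": "finance_stock_price_1",
     "prompt": "What is the current price of AAPL stock?",
     "expected_tool": "get_stock_price",
     "difficulty": "easy",
     "agent_type": "tool",
     "expected_keywords": ["AAPL", "price", "$"]},
    # ... more reviewed tasks ...
]

Dataset.from_list(tasks).push_to_hub(
    "yourname/smoltrace-finance-tasks",  # username/smoltrace-{domain}-tasks
    private=False,                       # set True for a private repository
)
```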
#### Configuration Reference

**Difficulty Levels Explained**:

| Level | Characteristics | Example |
|-------|----------------|---------|
| **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" → get_weather("Tokyo") |
| **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" → get_weather("Tokyo"), get_weather("London"), compare |
| **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |

**Agent Types Explained**:

| Type | Description | Use Case |
|------|-------------|----------|
| **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
| **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
| **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |

#### Best Practices

**Domain Selection**:
- Be specific: "customer_support_saas" > "support"
- Match your use case: Use actual business domain
- Consider tools available: Domain should align with tools

**Tool Names**:
- Use descriptive names: "get_stock_price" > "fetch"
- Match actual tool implementations
- 3-8 tools is ideal (enough variety, not overwhelming)
- Include a mix of data retrieval and action tools

**Number of Tasks**:
- 5-10 tasks: Quick testing, proof of concept
- 20-30 tasks: Solid evaluation dataset
- 50-100 tasks: Comprehensive benchmark

**Difficulty Distribution**:
- `balanced`: Best for general evaluation
- `progressive`: Good for learning/debugging
- `easy_only`: Quick sanity checks
- `hard_only`: Stress testing advanced capabilities

**Quality Assurance**:
- Always review generated tasks before pushing
- Check for domain relevance and variety
- Verify expected tools match your actual tools
- Ensure prompts are clear and executable

#### Troubleshooting

**Generation fails with "Invalid API key"**:
- Go to **⚙️ Settings**
- Configure Gemini API Key
- Get key from https://aistudio.google.com/apikey

**Generated tasks don't match domain**:
- Be more specific in domain description
- Try regenerating with adjusted parameters
- Review prompt template for domain alignment

**Push to Hub fails with "Authentication error"**:
- Verify HuggingFace token has write permissions
- Get token from https://huggingface.co/settings/tokens
- Check token in **⚙️ Settings** or provide directly

**Dataset generation is slow (>60s)**:
- Large requests (>20 tasks) are automatically batched
- Each batch takes 30-120s
- Example: 100 tasks = 5 batches × 60s = ~5 minutes
- This is normal for large datasets

**Tasks are too easy/hard**:
- Adjust difficulty distribution
- Regenerate with different settings
- Mix difficulty levels with `balanced` or `progressive`

#### Advanced Tips

**Iterative Refinement**:
1. Generate 10 tasks with `balanced` difficulty
2. Review quality and variety
3. If satisfied, generate 50-100 tasks with same settings
4. If not, adjust domain/tools and regenerate

**Dataset Versioning**:
- Use version suffixes: `username/smoltrace-finance-tasks-v2`
- Iterate on datasets as tools evolve
- Keep track of which version was used for evaluations

**Combining Datasets**:
- Generate multiple small datasets for different domains
- Use the SMOLTRACE CLI to merge datasets (or merge in Python, as sketched below)
- Create comprehensive multi-domain benchmarks
**Custom Prompt Templates**:
- Generate the prompt template separately
- Customize further based on your needs
- Use in agent initialization before evaluation
- Include in the dataset card for reproducibility

---

### ⚙️ Settings

**Purpose**: Configure API keys, preferences, and authentication.

#### Features

**API Key Configuration**:
- Gemini API Key (for MCP server AI analysis)
- HuggingFace Token (for dataset access + job submission)
- Modal Token ID + Secret (for Modal job submission)
- LLM Provider Keys (OpenAI, Anthropic, etc.)

**Preferences**:
- Default infrastructure (HF Jobs vs Modal)
- Default hardware tier
- Auto-refresh intervals

**Security**:
- Keys stored in browser session only (not on the server)
- HTTPS encryption for all API calls
- Keys never logged or exposed

#### How to Use

**Configure Essential Keys**:
```
1. Go to "⚙️ Settings" tab

2. Enter Gemini API Key:
   - Get from: https://ai.google.dev/
   - Click "Get API Key" → Create project → Generate
   - Paste into field
   - Free tier: 1,500 requests/day

3. Enter HuggingFace Token:
   - Get from: https://huggingface.co/settings/tokens
   - Click "New token" → Name: "TraceMind"
   - Permissions:
     - Read (for viewing datasets)
     - Write (for uploading results)
     - Run Jobs (for evaluation submission)
   - Paste into field

4. Click "Save API Keys"
   → Keys stored in browser session
   → MCP server will use your keys
```

**Configure for Job Submission** (Optional):

For **HuggingFace Jobs**:
```
Already configured if you entered an HF token above with "Run Jobs" permission.
```

For **Modal** (alternative):
```
1. Sign up: https://modal.com
2. Get token: https://modal.com/settings/tokens
3. Copy MODAL_TOKEN_ID (starts with 'ak-')
4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
5. Paste both into Settings → Save
```

For **API Model Providers**:
```
1. Get API key from provider:
   - OpenAI: https://platform.openai.com/api-keys
   - Anthropic: https://console.anthropic.com/settings/keys
   - Google: https://ai.google.dev/
2. Paste into corresponding field in Settings
3. Click "Save LLM Provider Keys"
```

#### Security Best Practices

- **Use environment variables**: For production, set keys via HF Spaces secrets
- **Rotate keys regularly**: Generate new tokens every 3-6 months
- **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
- **Monitor usage**: Check API provider dashboards for unexpected charges

---

## Common Workflows

### Workflow 1: Quick Model Comparison

```
Goal: Compare GPT-4 vs Llama-3.1-8B for production use

Steps:
1. Go to Leaderboard → Load Leaderboard
2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
3. Sort by Success Rate → Note: GPT-4 (95.8%), Llama (93.4%)
4. Sort by Cost → Note: GPT-4 ($0.05), Llama ($0.002)
5. Go to Agent Chat → Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
   → Agent analyzes with MCP tools
   → Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
6. Decision: Use Llama-3.1-8B for production
```
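The "$48K/month" figure quoted in step 5 is simple arithmetic over the per-run costs shown in the leaderboard examples; a quick sanity check:

```python
# Quick sanity check of the savings figure quoted in step 5.
gpt4_cost_per_run = 0.05      # $/run (from the leaderboard examples above)
llama_cost_per_run = 0.002    # $/run
runs_per_month = 1_000_000

savings = (gpt4_cost_per_run - llama_cost_per_run) * runs_per_month
print(f"${savings:,.0f}/month")   # -> $48,000/month
```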
### Workflow 2: Evaluate Custom Model

```
Goal: Evaluate your fine-tuned model on the SMOLTRACE benchmark

Steps:
1. Ensure model is on HuggingFace: username/my-finetuned-model
2. Go to Settings → Configure HF token (with Run Jobs permission)
3. Go to New Evaluation:
   - Model: "username/my-finetuned-model"
   - Infrastructure: HuggingFace Jobs
   - Agent type: both
   - Hardware: auto
4. Click "Estimate Cost" → Review: $1.50, 20 min
5. Click "Submit Evaluation"
6. Go to Job Monitoring → Wait for "Completed" (15-25 min)
7. Go to Leaderboard → Refresh → See your model in table
8. Click your run → Review detailed results
9. Compare vs other models using Agent Chat
```

### Workflow 3: Debug Failed Test

```
Goal: Understand why test_045 failed in your evaluation

Steps:
1. Go to Leaderboard → Find your run → Click to open details
2. Filter to failed tests only
3. Click test_045 → Opens trace visualization
4. Examine waterfall:
   - Span 1: LLM Call (OK)
   - Span 2: Tool Call - "unknown_tool" (ERROR)
   - No Span 3 (execution stopped)
5. Ask Agent: "Why did test_045 fail?"
   → Agent uses debug_trace MCP tool
   → Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
6. Fix: Update agent config to include missing tool
7. Re-run evaluation with fixed config
```

---

## Troubleshooting

### Leaderboard Issues

**Problem**: "Load Leaderboard" button doesn't work
- **Solution**: Check HuggingFace token in Settings (needs Read permission)
- **Solution**: Verify the leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard

**Problem**: AI insights not showing
- **Solution**: Check Gemini API key in Settings
- **Solution**: Wait 5-10 seconds for AI generation to complete

### Agent Chat Issues

**Problem**: Agent responds with "MCP server connection failed"
- **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
- **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings

**Problem**: Agent gives incorrect information
- **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
- **Solution**: Verify the question is clear and specific

### Evaluation Submission Issues

**Problem**: "Submit Evaluation" fails with auth error
- **Solution**: HF token needs "Run Jobs" permission
- **Solution**: Ensure HF Pro account is active ($9/month)
- **Solution**: Verify credit card is on file for compute charges

**Problem**: Job stuck in "Pending" status
- **Solution**: HuggingFace Jobs may have a queue. Wait 5-10 minutes.
- **Solution**: Try Modal as an alternative infrastructure

**Problem**: Job fails with "Out of Memory"
- **Solution**: Model too large for selected hardware
- **Solution**: Increase hardware tier (e.g., t4-small → a10g-small)
- **Solution**: Use auto hardware selection

### Trace Visualization Issues

**Problem**: Traces not loading
- **Solution**: Ensure evaluation completed successfully
- **Solution**: Check that the traces dataset exists on HuggingFace
- **Solution**: Verify HF token has Read permission

**Problem**: GPU metrics missing
- **Solution**: Only available for GPU jobs (not API models)
- **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled

---

## Getting Help

- **📧 GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- **💬 HF Discord**: `#agents-mcp-hackathon-winter25`
- **📖 Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)

---

**Last Updated**: November 21, 2025