# TraceMind MCP Server - Complete API Documentation

This document provides a comprehensive API reference for all MCP components provided by the TraceMind MCP Server.

## Table of Contents

- [MCP Tools (11)](#mcp-tools)
  - [AI-Powered Analysis Tools](#ai-powered-analysis-tools)
  - [Token-Optimized Tools](#token-optimized-tools)
  - [Data Management Tools](#data-management-tools)
- [MCP Resources (3)](#mcp-resources)
- [MCP Prompts (3)](#mcp-prompts)
- [Error Handling](#error-handling)
- [Best Practices](#best-practices)

---
## MCP Tools

### AI-Powered Analysis Tools

These tools use Google Gemini 2.5 Flash to provide intelligent, context-aware analysis of agent evaluation data.

#### 1. analyze_leaderboard

Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
  - Format: `"username/dataset-name"`
- `metric_focus` (str): Primary metric to analyze
  - Options: `"overall"`, `"accuracy"`, `"cost"`, `"latency"`, `"co2"`
  - Default: `"overall"`
- `time_range` (str): Time period to analyze
  - Options: `"last_week"`, `"last_month"`, `"all_time"`
  - Default: `"last_week"`
- `top_n` (int): Number of top models to highlight
  - Range: 1-20
  - Default: 5

**Returns:** String containing AI-generated analysis with:

- Top performers by selected metric
- Trade-off analysis (e.g., accuracy vs cost)
- Trend identification
- Actionable recommendations

**Example Use Case:**

Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.

**Example Call:**

```python
result = await analyze_leaderboard(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric_focus="cost",
    time_range="last_week",
    top_n=5
)
```

**Example Response:**

```
Based on 247 evaluations in the past week:

Top Performers (Cost Focus):
1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy

Trade-off Analysis:
- Llama-3.1 offers best cost/performance ratio at 25x cheaper than GPT-4
- GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
- For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4

Recommendations:
- Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
- Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
- Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)
```

---
#### 2. debug_trace

Analyzes OpenTelemetry trace data and answers specific questions about agent execution.

**Parameters:**

- `trace_dataset` (str): HuggingFace dataset containing traces
  - Format: `"username/smoltrace-traces-model"`
  - Repository name must contain "smoltrace-"
- `trace_id` (str): Specific trace ID to analyze
  - Format: `"trace_abc123"`
- `question` (str): Question about the trace
  - Examples: "Why was tool X called twice?", "Which step took the most time?"
- `include_metrics` (bool): Include GPU metrics in analysis
  - Default: `True`

**Returns:** String containing AI analysis of the trace with:

- Answer to the specific question
- Relevant span details
- Performance insights
- GPU metrics (if available and requested)

**Example Use Case:**

When an agent test fails, understand exactly what happened without manually parsing trace spans.

**Example Call:**

```python
result = await debug_trace(
    trace_dataset="kshitij/smoltrace-traces-gpt4",
    trace_id="trace_abc123",
    question="Why was the search tool called twice?",
    include_metrics=True
)
```

**Example Response:**

```
Based on trace analysis:

Answer:
The agent called the search_web tool twice due to an iterative reasoning pattern:

1. First call (span_003 at 14:23:19.000):
   - Query: "weather in Tokyo"
   - Duration: 890ms
   - Result: 5 results, oldest was 2 days old

2. Second call (span_005 at 14:23:21.200):
   - Query: "latest weather in Tokyo"
   - Duration: 1200ms
   - Modified reasoning: LLM determined first results were stale

Performance Impact:
- Added 2.09s to total execution time
- Cost increase: +$0.0003 (tokens for second reasoning step)
- This is normal behavior for tool-calling agents with iterative reasoning

GPU Metrics:
- N/A (API model, no GPU used)
```

---
#### 3. estimate_cost

Predicts costs, duration, and environmental impact before running evaluations.

**Parameters:**

- `model` (str, required): Model name to evaluate
  - Format: `"provider/model-name"` (e.g., `"openai/gpt-4"`, `"meta-llama/Llama-3.1-8B"`)
- `agent_type` (str): Type of agent evaluation
  - Options: `"tool"`, `"code"`, `"both"`
  - Default: `"both"`
- `num_tests` (int): Number of test cases
  - Range: 1-10000
  - Default: 100
- `hardware` (str): Hardware type
  - Options: `"auto"`, `"cpu"`, `"gpu_a10"`, `"gpu_h200"`
  - Default: `"auto"` (auto-selects based on model)

**Returns:** String containing cost estimate with:

- LLM API costs (for API models)
- HuggingFace Jobs compute costs (for local models)
- Estimated duration
- CO2 emissions estimate
- Hardware recommendations

**Example Use Case:**

Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.

**Example Call:**

```python
result = await estimate_cost(
    model="openai/gpt-4",
    agent_type="both",
    num_tests=1000,
    hardware="auto"
)
```

**Example Response:**

```
Cost Estimate for openai/gpt-4:

LLM API Costs:
- Estimated tokens per test: 1,500
- Token cost: $0.03/1K input, $0.06/1K output
- Total LLM cost: $50.00 (1000 tests)

Compute Costs:
- Recommended hardware: cpu-basic (API model)
- HF Jobs cost: ~$0.05/hr
- Estimated duration: 45 minutes
- Total compute cost: $0.04

Total Cost: $50.04
Cost per test: $0.05
CO2 emissions: ~0.5g (API calls, minimal compute)

Recommendations:
- This is an API model, CPU hardware is sufficient
- For cost optimization, consider Llama-3.1-8B (25x cheaper)
- Estimated runtime: 45 minutes for 1000 tests
```

---
#### 4. compare_runs

Compares two evaluation runs with AI-powered analysis across multiple dimensions.

**Parameters:**

- `run_id_1` (str, required): First run ID from leaderboard
- `run_id_2` (str, required): Second run ID from leaderboard
- `leaderboard_repo` (str): Leaderboard dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `focus` (str): Comparison focus area
  - Options:
    - `"comprehensive"`: All dimensions
    - `"cost"`: Cost efficiency and ROI
    - `"performance"`: Speed and accuracy trade-offs
    - `"eco_friendly"`: Environmental impact
  - Default: `"comprehensive"`

**Returns:** String containing AI comparison with:

- Success rate comparison with statistical significance
- Cost efficiency analysis
- Speed comparison
- Environmental impact (CO2 emissions)
- GPU efficiency (for GPU jobs)

**Example Use Case:**

After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.

**Example Call:**

```python
result = await compare_runs(
    run_id_1="run_abc123",
    run_id_2="run_def456",
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    focus="cost"
)
```

**Example Response:**

```
Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)

Success Rates:
- GPT-4: 95.8%
- Llama-3.1: 93.4%
- Difference: +2.4% for GPT-4 (statistically significant, p<0.05)

Cost Efficiency:
- GPT-4: $0.05 per test, $0.052 per successful test
- Llama-3.1: $0.002 per test, $0.0021 per successful test
- Cost ratio: GPT-4 is 25x more expensive

ROI Analysis:
- For 1M evaluations/month:
  - GPT-4: $50,000/month, 958K successes
  - Llama-3.1: $2,000/month, 934K successes
  - GPT-4 provides 24K more successes for $48K more cost
  - Cost per additional success: $2.00

Recommendation (Cost Focus):
Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement justifies 25x cost.
```

---
#### 5. analyze_results

Analyzes detailed test results and provides optimization recommendations.

**Parameters:**

- `results_repo` (str, required): HuggingFace dataset containing results
  - Format: `"username/smoltrace-results-model-timestamp"`
  - Repository name must contain "smoltrace-results-"
- `analysis_focus` (str): Focus area for analysis
  - Options: `"failures"`, `"performance"`, `"cost"`, `"comprehensive"`
  - Default: `"comprehensive"`
- `max_rows` (int): Maximum test cases to analyze
  - Range: 10-500
  - Default: 100

**Returns:** String containing AI analysis with:

- Failure patterns and root causes
- Performance bottlenecks in specific test cases
- Cost optimization opportunities
- Tool usage patterns
- Task-specific insights (which types work well vs poorly)
- Actionable optimization recommendations

**Example Use Case:**

After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving the success rate.

**Example Call:**

```python
result = await analyze_results(
    results_repo="kshitij/smoltrace-results-gpt4-20251120",
    analysis_focus="failures",
    max_rows=100
)
```

**Example Response:**

```
Analysis of Test Results (100 tests analyzed)

Overall Statistics:
- Success Rate: 89% (89/100 tests passed)
- Average Duration: 3.2s per test
- Total Cost: $4.50 ($0.045 per test)

Failure Analysis (11 failures):

1. Tool Not Found (6 failures):
   - Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
   - Pattern: All failed tests required the 'get_weather' tool
   - Root Cause: Tool definition missing or incorrect name
   - Fix: Ensure 'get_weather' tool is available in agent's tool list

2. Timeout (3 failures):
   - Test IDs: task_034, task_071, task_088
   - Pattern: Complex multi-step tasks with >5 tool calls
   - Root Cause: Exceeding 30s timeout limit
   - Fix: Increase timeout to 60s or simplify complex tasks

3. Incorrect Response (2 failures):
   - Test IDs: task_056, task_072
   - Pattern: Math calculation tasks
   - Root Cause: Model hallucinating numbers instead of using calculator tool
   - Fix: Update prompt to emphasize tool usage for calculations

Performance Insights:
- Fast tasks (<2s): 45 tests - Simple single-tool calls
- Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
- Optimal duration: 2-3s for most tasks

Cost Optimization:
- High-cost tests: task_023 ($0.12) - Used 4K tokens
- Low-cost tests: task_087 ($0.008) - Used 180 tokens
- Recommendation: Optimize prompt to reduce token usage by 20%

Recommendations:
1. Add missing 'get_weather' tool → Fixes 6 failures
2. Increase timeout from 30s to 60s → Fixes 3 failures
3. Strengthen calculator tool instruction → Fixes 2 failures
4. Expected improvement: 89% → 100% success rate
```

---
### Token-Optimized Tools

These tools are specifically designed to minimize token usage when querying leaderboard data.

#### 6. get_top_performers

Gets the top N performing models from the leaderboard with 90% token reduction.

**Performance Optimization:** Returns only the top N models instead of loading the full leaderboard dataset (51 runs), resulting in **90% token reduction**.

**When to Use:** Perfect for queries like "Which model is leading?" or "Show me the top 5 models".

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `metric` (str): Metric to rank by
  - Options: `"success_rate"`, `"total_cost_usd"`, `"avg_duration_ms"`, `"co2_emissions_g"`
  - Default: `"success_rate"`
- `top_n` (int): Number of top models to return
  - Range: 1-20
  - Default: 5

**Returns:** JSON string with:

- Metric used for ranking
- Ranking order (ascending/descending)
- Total runs in leaderboard
- Array of top performers with 10 essential fields

**Benefits:**

- ✅ Token Reduction: 90% fewer tokens vs full dataset
- ✅ Ready to Use: Properly formatted JSON
- ✅ Pre-Sorted: Already ranked by chosen metric
- ✅ Essential Data Only: 10 fields vs 20+ in full dataset

**Example Call:**

```python
result = await get_top_performers(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric="total_cost_usd",
    top_n=3
)
```

**Example Response:**

```json
{
  "metric": "total_cost_usd",
  "order": "ascending",
  "total_runs": 51,
  "top_performers": [
    {
      "run_id": "run_001",
      "model": "meta-llama/Llama-3.1-8B",
      "success_rate": 93.4,
      "total_cost_usd": 0.002,
      "avg_duration_ms": 2100,
      "agent_type": "both",
      "provider": "transformers",
      "submitted_by": "kshitij",
      "timestamp": "2025-11-20T10:30:00Z",
      "total_tests": 100
    },
    ...
  ]
}
```

---
#### 7. get_leaderboard_summary

Gets high-level leaderboard statistics with 99% token reduction.

**Performance Optimization:** Returns only aggregated statistics instead of raw data, resulting in **99% token reduction**.

**When to Use:** Perfect for overview queries like "How many runs are in the leaderboard?" or "What's the average success rate?".

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`

**Returns:** JSON string with:

- Total runs count
- Unique models and submitters
- Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
- Breakdown by agent type
- Breakdown by provider
- Top 3 models by success rate

**Benefits:**

- ✅ Extreme Token Reduction: 99% fewer tokens
- ✅ Ready to Use: Properly formatted JSON
- ✅ Comprehensive Stats: Averages, distributions, breakdowns
- ✅ Quick Insights: Perfect for overview questions

**Example Call:**

```python
result = await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)
```

**Example Response:**

```json
{
  "total_runs": 51,
  "unique_models": 12,
  "unique_submitters": 3,
  "overall_stats": {
    "avg_success_rate": 89.2,
    "best_success_rate": 95.8,
    "worst_success_rate": 78.3,
    "avg_cost_usd": 0.012,
    "avg_duration_ms": 3200,
    "total_co2_g": 45.6
  },
  "by_agent_type": {
    "tool": {"count": 20, "avg_success_rate": 88.5},
    "code": {"count": 18, "avg_success_rate": 87.2},
    "both": {"count": 13, "avg_success_rate": 92.1}
  },
  "by_provider": {
    "litellm": {"count": 30, "avg_success_rate": 91.3},
    "transformers": {"count": 21, "avg_success_rate": 86.4}
  },
  "top_3_models": [
    {"model": "openai/gpt-4", "success_rate": 95.8},
    {"model": "anthropic/claude-3", "success_rate": 94.1},
    {"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
  ]
}
```

---
### Data Management Tools

#### 8. get_dataset

Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.

**⚠️ Important:** For leaderboard queries, prefer using `get_top_performers()` or `get_leaderboard_summary()` to avoid token bloat!

**Security Restriction:** Only datasets with "smoltrace-" in the repository name are allowed.

**Parameters:**

- `dataset_repo` (str, required): HuggingFace dataset repository
  - Repository name must contain "smoltrace-"
  - Format: `"username/smoltrace-type-model"`
- `split` (str): Dataset split to load
  - Default: `"train"`
- `limit` (int): Maximum rows to return
  - Range: 1-200
  - Default: 100

**Returns:** JSON string with:

- Total rows in dataset
- List of column names
- Array of data rows (up to `limit`)

**Primary Use Cases:**

- Load `smoltrace-results-*` datasets for test case details
- Load `smoltrace-traces-*` datasets for OpenTelemetry data
- Load `smoltrace-metrics-*` datasets for GPU metrics
- **NOT recommended** for leaderboard queries (use optimized tools)

**Example Call:**

```python
result = await get_dataset(
    dataset_repo="kshitij/smoltrace-results-gpt4",
    split="train",
    limit=50
)
```
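
To work with the returned JSON programmatically, a client can parse it and iterate over the rows. The sketch below is illustrative only: the `total_rows`, `columns`, `rows`, and per-row `success` key names are assumptions, since the exact schema is defined by the server and the loaded dataset.

```python
import json

# Hypothetical post-processing of a get_dataset() result. The key names
# ("total_rows", "columns", "rows", "success") are assumptions for illustration.
result = await get_dataset(
    dataset_repo="kshitij/smoltrace-results-gpt4",
    split="train",
    limit=50,
)
payload = json.loads(result)

print(f"Rows available: {payload.get('total_rows')}")
print(f"Columns: {payload.get('columns')}")

# Count failing test cases in the returned sample.
failed = [row for row in payload.get("rows", []) if not row.get("success")]
print(f"Failed tests in sample: {len(failed)}")
```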
---

#### 9. generate_synthetic_dataset

Creates domain-specific test datasets for SMOLTRACE evaluations using AI.

**Parameters:**

- `domain` (str, required): Domain for tasks
  - Examples: "e-commerce", "customer service", "finance", "healthcare"
- `tools` (list[str], required): Available tools
  - Example: `["search_web", "get_weather", "calculator"]`
- `num_tasks` (int): Number of tasks to generate
  - Range: 1-100
  - Default: 20
- `difficulty_distribution` (str): Task difficulty mix
  - Options: `"balanced"`, `"easy_only"`, `"medium_only"`, `"hard_only"`, `"progressive"`
  - Default: `"balanced"`
- `agent_type` (str): Target agent type
  - Options: `"tool"`, `"code"`, `"both"`
  - Default: `"both"`

**Returns:** JSON string with:

- `dataset_info`: Metadata (domain, tools, counts, timestamp)
- `tasks`: Array of SMOLTRACE-formatted tasks
- `usage_instructions`: Guide for HuggingFace upload and SMOLTRACE usage

**SMOLTRACE Task Format:**

```json
{
  "id": "unique_identifier",
  "prompt": "Clear, specific task for the agent",
  "expected_tool": "tool_name",
  "expected_tool_calls": 1,
  "difficulty": "easy|medium|hard",
  "agent_type": "tool|code",
  "expected_keywords": ["keyword1", "keyword2"]
}
```

**Difficulty Calibration:**

- **Easy** (40%): Single tool call, straightforward input
- **Medium** (40%): Multiple tool calls OR complex input parsing
- **Hard** (20%): Multiple tools, complex reasoning, edge cases

**Enterprise Use Cases:**

- Custom Tools: Benchmark proprietary APIs
- Industry-Specific: Generate tasks for finance, healthcare, legal
- Internal Workflows: Test company-specific processes

**Example Call:**

```python
result = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
    difficulty_distribution="balanced",
    agent_type="tool"
)
```

---
#### 10. push_dataset_to_hub

Uploads generated datasets to the HuggingFace Hub with proper formatting.

**Parameters:**

- `dataset_name` (str, required): Repository name on HuggingFace
  - Format: `"username/my-dataset"`
- `data` (str or list, required): Dataset content
  - Can be a JSON string or a list of dictionaries
- `description` (str): Dataset description for card
  - Default: Auto-generated
- `private` (bool): Make dataset private
  - Default: `False`

**Returns:** Success message with dataset URL

**Example Workflow:**

1. Generate a synthetic dataset with `generate_synthetic_dataset`
2. Review and modify tasks if needed
3. Upload to HuggingFace with `push_dataset_to_hub`
4. Use in SMOLTRACE evaluations or share with your team

An end-to-end sketch of this workflow follows the example call below.

**Example Call:**

```python
result = await push_dataset_to_hub(
    dataset_name="kshitij/my-custom-evaluation",
    data=generated_tasks,
    description="Custom evaluation dataset for e-commerce agents",
    private=False
)
```
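
Putting the workflow together, a minimal sketch of generate-review-upload might look like the following; the dataset name is illustrative, and the `tasks` and `difficulty` fields follow the return format documented above.

```python
import json

# 1. Generate a synthetic dataset (50 customer-service tasks).
generated = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
    difficulty_distribution="balanced",
    agent_type="tool",
)

# 2. Review the tasks locally; the "tasks" key matches the Returns section above.
tasks = json.loads(generated)["tasks"]
hard_tasks = [t for t in tasks if t["difficulty"] == "hard"]
print(f"Generated {len(tasks)} tasks, {len(hard_tasks)} of them hard")

# 3. Upload the reviewed tasks to the Hub (dataset name is illustrative).
result = await push_dataset_to_hub(
    dataset_name="kshitij/my-customer-service-eval",
    data=tasks,
    description="Synthetic customer-service evaluation tasks for SMOLTRACE",
    private=False,
)
print(result)  # Success message with the dataset URL
```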
---

#### 11. generate_prompt_template

Generates a customized smolagents prompt template for a specific domain and tool set.

**Parameters:**

- `domain` (str, required): Domain for the prompt template
  - Examples: `"finance"`, `"healthcare"`, `"customer_support"`, `"e-commerce"`
- `tool_names` (str, required): Comma-separated list of tool names
  - Format: `"tool1,tool2,tool3"`
  - Example: `"get_stock_price,calculate_roi,fetch_company_info"`
- `agent_type` (str): Agent type
  - Options: `"tool"` (ToolCallingAgent), `"code"` (CodeAgent)
  - Default: `"tool"`

**Returns:** JSON response containing:

- Customized YAML prompt template
- Metadata (domain, tools, agent_type, timestamp)
- Usage instructions

**Use Case:**

When you generate synthetic datasets with `generate_synthetic_dataset`, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.

**Integration:**

The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.

**Example Call:**

```python
result = await generate_prompt_template(
    domain="customer_support",
    tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
    agent_type="tool"
)
```

**Example Response:**

```json
{
  "prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n You are a helpful customer support agent...\n \n Available tools:\n - search_knowledge_base: Search the knowledge base...\n - create_ticket: Create a support ticket...\n ...",
  "metadata": {
    "domain": "customer_support",
    "tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
    "agent_type": "tool",
    "base_template": "ToolCallingAgent",
    "timestamp": "2025-11-21T10:30:00Z"
  },
  "usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
}
```

---
## MCP Resources

Resources provide direct data access without AI analysis. Access via URI scheme.

### 1. leaderboard://{repo}

Direct access to raw leaderboard data in JSON format.

**URI Format:**

```
leaderboard://username/dataset-name
```

**Example:**

```
GET leaderboard://kshitijthakkar/smoltrace-leaderboard
```

**Returns:** JSON array with all evaluation runs, including:

- run_id, model, agent_type, provider
- success_rate, total_tests, successful_tests, failed_tests
- avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
- results_dataset, traces_dataset, metrics_dataset (references)
- timestamp, submitted_by, hf_job_id
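
How a resource is read depends on your MCP client. As one hedged example, reading the leaderboard resource with the official Python MCP SDK over stdio might look like the sketch below; the server launch command is an assumption about your deployment, and the same pattern applies to the `trace://` and `cost://` resources that follow.

```python
import asyncio
import json

from pydantic import AnyUrl
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# The launch command is an assumption for illustration; use whatever command
# or transport your TraceMind MCP Server deployment actually exposes.
server = StdioServerParameters(command="python", args=["server.py"])

async def read_leaderboard() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.read_resource(
                AnyUrl("leaderboard://kshitijthakkar/smoltrace-leaderboard")
            )
            # Text resources arrive as JSON strings in result.contents[*].text.
            runs = json.loads(result.contents[0].text)
            print(f"{len(runs)} runs on the leaderboard")

asyncio.run(read_leaderboard())
```

Since all three resources return JSON, the same parse step works for trace and cost lookups as well.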
---

### 2. trace://{trace_id}/{repo}

Direct access to trace data with OpenTelemetry spans.

**URI Format:**

```
trace://trace_id/username/dataset-name
```

**Example:**

```
GET trace://trace_abc123/kshitij/smoltrace-traces-gpt4
```

**Returns:** JSON with:

- traceId
- spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)

---

### 3. cost://model/{model_name}

Model pricing and hardware cost information.

**URI Format:**

```
cost://model/provider/model-name
```

**Example:**

```
GET cost://model/openai/gpt-4
```

**Returns:** JSON with:

- Model pricing (input/output token costs)
- Recommended hardware tier
- Estimated compute costs
- CO2 emissions per 1K tokens

---
## MCP Prompts

Prompts provide reusable templates for standardized interactions.

### 1. analysis_prompt

Templates for different analysis types.

**Parameters:**

- `analysis_type` (str): Type of analysis
  - Options: `"leaderboard"`, `"cost"`, `"performance"`, `"trace"`
- `focus_area` (str): Specific focus
  - Options: `"overall"`, `"cost"`, `"accuracy"`, `"speed"`, `"eco"`
- `detail_level` (str): Level of detail
  - Options: `"summary"`, `"detailed"`, `"comprehensive"`

**Returns:** Formatted prompt string for use with AI tools

**Example:**

```python
prompt = analysis_prompt(
    analysis_type="leaderboard",
    focus_area="cost",
    detail_level="detailed"
)
# Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."
```

---

### 2. debug_prompt

Templates for debugging scenarios.

**Parameters:**

- `debug_type` (str): Type of debugging
  - Options: `"failure"`, `"performance"`, `"tool_calling"`, `"reasoning"`
- `context` (str): Additional context
  - Options: `"test_failure"`, `"timeout"`, `"unexpected_tool"`, `"reasoning_loop"`

**Returns:** Formatted prompt string

**Example:**

```python
prompt = debug_prompt(
    debug_type="performance",
    context="tool_calling"
)
# Returns: "Analyze tool calling performance. Identify which tools are slow..."
```

---

### 3. optimization_prompt

Templates for optimization goals.

**Parameters:**

- `optimization_goal` (str): Optimization target
  - Options: `"cost"`, `"speed"`, `"accuracy"`, `"co2"`
- `constraints` (str): Constraints to respect
  - Options: `"maintain_quality"`, `"no_accuracy_loss"`, `"budget_limit"`, `"time_limit"`

**Returns:** Formatted prompt string

**Example:**

```python
prompt = optimization_prompt(
    optimization_goal="cost",
    constraints="maintain_quality"
)
# Returns: "Analyze this evaluation setup and recommend cost optimizations..."
```

---
## Error Handling

### Common Error Responses

**Invalid Dataset Repository:**

```json
{
  "error": "Dataset must contain 'smoltrace-' prefix for security",
  "provided": "username/invalid-dataset"
}
```

**Dataset Not Found:**

```json
{
  "error": "Dataset not found on HuggingFace",
  "repository": "username/smoltrace-nonexistent"
}
```

**API Rate Limit:**

```json
{
  "error": "Gemini API rate limit exceeded",
  "retry_after": 60
}
```

**Invalid Parameters:**

```json
{
  "error": "Invalid parameter value",
  "parameter": "top_n",
  "value": 50,
  "allowed_range": "1-20"
}
```
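
Clients can handle these payloads uniformly by parsing tool output as JSON and checking for an `error` key. The sketch below is illustrative, assuming the error shapes shown above; the retry policy and helper name are not part of the server API.

```python
import asyncio
import json

async def call_with_retry(tool, max_attempts: int = 3, **kwargs) -> str:
    """Call an MCP tool and retry when the server reports a rate limit.

    Illustrative helper only; adjust the policy to your own client.
    """
    for attempt in range(max_attempts):
        result = await tool(**kwargs)
        try:
            payload = json.loads(result)
        except (json.JSONDecodeError, TypeError):
            return result  # Plain-text analyses pass straight through.
        if not isinstance(payload, dict) or "error" not in payload:
            return result
        if "rate limit" in payload["error"].lower() and attempt < max_attempts - 1:
            # Honor the server-suggested backoff, defaulting to 60 seconds.
            await asyncio.sleep(payload.get("retry_after", 60))
            continue
        raise RuntimeError(f"Tool call failed: {payload}")
    raise RuntimeError("Exhausted retries without a successful response")

# Usage (illustrative):
# analysis = await call_with_retry(analyze_leaderboard, metric_focus="cost")
```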
---

## Best Practices

### 1. Token Optimization

**DO:**

- Use `get_top_performers()` for "top N" queries (90% token reduction)
- Use `get_leaderboard_summary()` for overview queries (99% token reduction)
- Set an appropriate `limit` when using `get_dataset()`

**DON'T:**

- Use `get_dataset()` for leaderboard queries (loads all 51 runs)
- Request more data than needed
- Ignore the token-optimized tools
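
As an illustration of the first two DO items, an overview question can be answered from the aggregated summary instead of raw leaderboard rows (a minimal sketch; field names follow the `get_leaderboard_summary` example above):

```python
import json

# Preferred: one small, aggregated payload (~99% fewer tokens).
summary = json.loads(await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
))
print(f"{summary['total_runs']} runs, "
      f"avg success rate {summary['overall_stats']['avg_success_rate']}%")

# Avoid for this kind of question: pulling raw leaderboard rows through
# get_dataset() loads every run and consumes far more tokens.
```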
### 2. AI Tool Usage

**DO:**

- Use AI tools (`analyze_leaderboard`, `debug_trace`) for complex analysis
- Provide specific questions to `debug_trace` for focused answers
- Use the `focus` parameter in `compare_runs` for targeted comparisons

**DON'T:**

- Use AI tools for simple data retrieval (use resources instead)
- Make vague requests (be specific for better results)

### 3. Dataset Security

**DO:**

- Only use datasets with the "smoltrace-" prefix
- Verify the dataset exists before requesting it
- Use public datasets or authenticate for private ones

**DON'T:**

- Try to access arbitrary HuggingFace datasets
- Share private dataset URLs without authentication

### 4. Cost Management

**DO:**

- Use `estimate_cost` before running large evaluations
- Compare cost estimates across different models
- Consider token-optimized tools to reduce API costs

**DON'T:**

- Skip cost estimation for expensive operations
- Ignore hardware recommendations
- Overlook CO2 emissions in decision-making
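
For example, estimates for several candidate models can be collected side by side before committing to a large run (a minimal sketch; the candidate list is illustrative):

```python
# Compare estimates for candidate models before launching 1000-test runs.
# estimate_cost returns a plain-text report, as documented above.
candidates = ["openai/gpt-4", "meta-llama/Llama-3.1-8B"]

for model in candidates:
    estimate = await estimate_cost(
        model=model,
        agent_type="both",
        num_tests=1000,
        hardware="auto",
    )
    print(f"=== {model} ===\n{estimate}\n")
```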
---

## Support

For issues or questions:

- 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
- 💬 HF Discord: `#agents-mcp-hackathon-winter25`
- 🏷️ Tag: `building-mcp-track-enterprise`