TraceMind MCP Server - Complete API Documentation

This document provides a comprehensive API reference for all MCP components provided by the TraceMind MCP Server.

Table of Contents

  • MCP Tools
  • MCP Resources
  • MCP Prompts
  • Error Handling
  • Best Practices
  • Support

MCP Tools

AI-Powered Analysis Tools

These tools use Google Gemini 2.5 Flash to provide intelligent, context-aware analysis of agent evaluation data.

1. analyze_leaderboard

Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.

Parameters:

  • leaderboard_repo (str): HuggingFace dataset repository
    • Default: "kshitijthakkar/smoltrace-leaderboard"
    • Format: "username/dataset-name"
  • metric_focus (str): Primary metric to analyze
    • Options: "overall", "accuracy", "cost", "latency", "co2"
    • Default: "overall"
  • time_range (str): Time period to analyze
    • Options: "last_week", "last_month", "all_time"
    • Default: "last_week"
  • top_n (int): Number of top models to highlight
    • Range: 1-20
    • Default: 5

Returns: String containing AI-generated analysis with:

  • Top performers by selected metric
  • Trade-off analysis (e.g., accuracy vs cost)
  • Trend identification
  • Actionable recommendations

Example Use Case: Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.

Example Call:

result = await analyze_leaderboard(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric_focus="cost",
    time_range="last_week",
    top_n=5
)

Example Response:

Based on 247 evaluations in the past week:

Top Performers (Cost Focus):
1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy

Trade-off Analysis:
- Llama-3.1 offers best cost/performance ratio at 25x cheaper than GPT-4
- GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
- For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4

Recommendations:
- Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
- Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
- Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)

2. debug_trace

Analyzes OpenTelemetry trace data and answers specific questions about agent execution.

Parameters:

  • trace_dataset (str): HuggingFace dataset containing traces
    • Format: "username/smoltrace-traces-model"
    • Must contain "smoltrace-" prefix
  • trace_id (str): Specific trace ID to analyze
    • Format: "trace_abc123"
  • question (str): Question about the trace
    • Examples: "Why was tool X called twice?", "Which step took the most time?"
  • include_metrics (bool): Include GPU metrics in analysis
    • Default: True

Returns: String containing AI analysis of the trace with:

  • Answer to the specific question
  • Relevant span details
  • Performance insights
  • GPU metrics (if available and requested)

Example Use Case: When an agent test fails, understand exactly what happened without manually parsing trace spans.

Example Call:

result = await debug_trace(
    trace_dataset="kshitij/smoltrace-traces-gpt4",
    trace_id="trace_abc123",
    question="Why was the search tool called twice?",
    include_metrics=True
)

Example Response:

Based on trace analysis:

Answer:
The agent called the search_web tool twice due to an iterative reasoning pattern:

1. First call (span_003 at 14:23:19.000):
   - Query: "weather in Tokyo"
   - Duration: 890ms
   - Result: 5 results, oldest was 2 days old

2. Second call (span_005 at 14:23:21.200):
   - Query: "latest weather in Tokyo"
   - Duration: 1200ms
   - Modified reasoning: LLM determined first results were stale

Performance Impact:
- Added 2.09s to total execution time
- Cost increase: +$0.0003 (tokens for second reasoning step)
- This is normal behavior for tool-calling agents with iterative reasoning

GPU Metrics:
- N/A (API model, no GPU used)

3. estimate_cost

Predicts costs, duration, and environmental impact before running evaluations.

Parameters:

  • model (str, required): Model name to evaluate
    • Format: "provider/model-name" (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
  • agent_type (str): Type of agent evaluation
    • Options: "tool", "code", "both"
    • Default: "both"
  • num_tests (int): Number of test cases
    • Range: 1-10000
    • Default: 100
  • hardware (str): Hardware type
    • Options: "auto", "cpu", "gpu_a10", "gpu_h200"
    • Default: "auto" (auto-selects based on model)

Returns: String containing cost estimate with:

  • LLM API costs (for API models)
  • HuggingFace Jobs compute costs (for local models)
  • Estimated duration
  • CO2 emissions estimate
  • Hardware recommendations

Example Use Case: Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.

Example Call:

result = await estimate_cost(
    model="openai/gpt-4",
    agent_type="both",
    num_tests=1000,
    hardware="auto"
)

Example Response:

Cost Estimate for openai/gpt-4:

LLM API Costs:
- Estimated tokens per test: 1,500
- Token cost: $0.03/1K input, $0.06/1K output
- Total LLM cost: $50.00 (1000 tests)

Compute Costs:
- Recommended hardware: cpu-basic (API model)
- HF Jobs cost: ~$0.05/hr
- Estimated duration: 45 minutes
- Total compute cost: $0.04

Total Cost: $50.04
Cost per test: $0.05
CO2 emissions: ~0.5g (API calls, minimal compute)

Recommendations:
- This is an API model, CPU hardware is sufficient
- For cost optimization, consider Llama-3.1-8B (25x cheaper)
- Estimated runtime: 45 minutes for 1000 tests

4. compare_runs

Compares two evaluation runs with AI-powered analysis across multiple dimensions.

Parameters:

  • run_id_1 (str, required): First run ID from leaderboard
  • run_id_2 (str, required): Second run ID from leaderboard
  • leaderboard_repo (str): Leaderboard dataset repository
    • Default: "kshitijthakkar/smoltrace-leaderboard"
  • focus (str): Comparison focus area
    • Options:
      • "comprehensive": All dimensions
      • "cost": Cost efficiency and ROI
      • "performance": Speed and accuracy trade-offs
      • "eco_friendly": Environmental impact
    • Default: "comprehensive"

Returns: String containing AI comparison with:

  • Success rate comparison with statistical significance
  • Cost efficiency analysis
  • Speed comparison
  • Environmental impact (CO2 emissions)
  • GPU efficiency (for GPU jobs)

Example Use Case: After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.

Example Call:

result = await compare_runs(
    run_id_1="run_abc123",
    run_id_2="run_def456",
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    focus="cost"
)

Example Response:

Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)

Success Rates:
- GPT-4: 95.8% (96/100 tests)
- Llama-3.1: 93.4% (93/100 tests)
- Difference: +2.4% for GPT-4 (statistically significant, p<0.05)

Cost Efficiency:
- GPT-4: $0.05 per test, $0.052 per successful test
- Llama-3.1: $0.002 per test, $0.0021 per successful test
- Cost ratio: GPT-4 is 25x more expensive

ROI Analysis:
- For 1M evaluations/month:
  - GPT-4: $50,000/month, 958K successes
  - Llama-3.1: $2,000/month, 934K successes
- GPT-4 provides 24K more successes for $48K more cost
- Cost per additional success: $2.00

Recommendation (Cost Focus):
Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement justifies 25x cost.

5. analyze_results

Analyzes detailed test results and provides optimization recommendations.

Parameters:

  • results_repo (str, required): HuggingFace dataset containing results
    • Format: "username/smoltrace-results-model-timestamp"
    • Must contain "smoltrace-results-" prefix
  • analysis_focus (str): Focus area for analysis
    • Options: "failures", "performance", "cost", "comprehensive"
    • Default: "comprehensive"
  • max_rows (int): Maximum test cases to analyze
    • Range: 10-500
    • Default: 100

Returns: String containing AI analysis with:

  • Failure patterns and root causes
  • Performance bottlenecks in specific test cases
  • Cost optimization opportunities
  • Tool usage patterns
  • Task-specific insights (which types work well vs poorly)
  • Actionable optimization recommendations

Example Use Case: After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving success rate.

Example Call:

result = await analyze_results(
    results_repo="kshitij/smoltrace-results-gpt4-20251120",
    analysis_focus="failures",
    max_rows=100
)

Example Response:

Analysis of Test Results (100 tests analyzed)

Overall Statistics:
- Success Rate: 89% (89/100 tests passed)
- Average Duration: 3.2s per test
- Total Cost: $4.50 ($0.045 per test)

Failure Analysis (11 failures):
1. Tool Not Found (6 failures):
   - Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
   - Pattern: All failed tests required the 'get_weather' tool
   - Root Cause: Tool definition missing or incorrect name
   - Fix: Ensure 'get_weather' tool is available in agent's tool list

2. Timeout (3 failures):
   - Test IDs: task_034, task_071, task_088
   - Pattern: Complex multi-step tasks with >5 tool calls
   - Root Cause: Exceeding 30s timeout limit
   - Fix: Increase timeout to 60s or simplify complex tasks

3. Incorrect Response (2 failures):
   - Test IDs: task_056, task_072
   - Pattern: Math calculation tasks
   - Root Cause: Model hallucinating numbers instead of using calculator tool
   - Fix: Update prompt to emphasize tool usage for calculations

Performance Insights:
- Fast tasks (<2s): 45 tests - Simple single-tool calls
- Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
- Optimal duration: 2-3s for most tasks

Cost Optimization:
- High-cost tests: task_023 ($0.12) - Used 4K tokens
- Low-cost tests: task_087 ($0.008) - Used 180 tokens
- Recommendation: Optimize prompt to reduce token usage by 20%

Recommendations:
1. Add missing 'get_weather' tool → Fixes 6 failures
2. Increase timeout from 30s to 60s → Fixes 3 failures
3. Strengthen calculator tool instruction → Fixes 2 failures
4. Expected improvement: 89% → 100% success rate

Token-Optimized Tools

These tools are specifically designed to minimize token usage when querying leaderboard data.

6. get_top_performers

Returns the top N performing models from the leaderboard with a 90% token reduction.

Performance Optimization: Returns only top N models instead of loading the full leaderboard dataset (51 runs), resulting in 90% token reduction.

When to Use: Perfect for queries like "Which model is leading?", "Show me the top 5 models".

Parameters:

  • leaderboard_repo (str): HuggingFace dataset repository
    • Default: "kshitijthakkar/smoltrace-leaderboard"
  • metric (str): Metric to rank by
    • Options: "success_rate", "total_cost_usd", "avg_duration_ms", "co2_emissions_g"
    • Default: "success_rate"
  • top_n (int): Number of top models to return
    • Range: 1-20
    • Default: 5

Returns: JSON string with:

  • Metric used for ranking
  • Ranking order (ascending/descending)
  • Total runs in leaderboard
  • Array of top performers with 10 essential fields

Benefits:

  • ✅ Token Reduction: 90% fewer tokens vs full dataset
  • ✅ Ready to Use: Properly formatted JSON
  • ✅ Pre-Sorted: Already ranked by chosen metric
  • ✅ Essential Data Only: 10 fields vs 20+ in full dataset

Example Call:

result = await get_top_performers(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric="total_cost_usd",
    top_n=3
)

Example Response:

{
  "metric": "total_cost_usd",
  "order": "ascending",
  "total_runs": 51,
  "top_performers": [
    {
      "run_id": "run_001",
      "model": "meta-llama/Llama-3.1-8B",
      "success_rate": 93.4,
      "total_cost_usd": 0.002,
      "avg_duration_ms": 2100,
      "agent_type": "both",
      "provider": "transformers",
      "submitted_by": "kshitij",
      "timestamp": "2025-11-20T10:30:00Z",
      "total_tests": 100
    },
    ...
  ]
}

7. get_leaderboard_summary

Returns high-level leaderboard statistics with a 99% token reduction.

Performance Optimization: Returns only aggregated statistics instead of raw data, resulting in 99% token reduction.

When to Use: Perfect for overview queries like "How many runs are in the leaderboard?", "What's the average success rate?".

Parameters:

  • leaderboard_repo (str): HuggingFace dataset repository
    • Default: "kshitijthakkar/smoltrace-leaderboard"

Returns: JSON string with:

  • Total runs count
  • Unique models and submitters
  • Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
  • Breakdown by agent type
  • Breakdown by provider
  • Top 3 models by success rate

Benefits:

  • ✅ Extreme Token Reduction: 99% fewer tokens
  • ✅ Ready to Use: Properly formatted JSON
  • ✅ Comprehensive Stats: Averages, distributions, breakdowns
  • ✅ Quick Insights: Perfect for overview questions

Example Call:

result = await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)

Example Response:

{
  "total_runs": 51,
  "unique_models": 12,
  "unique_submitters": 3,
  "overall_stats": {
    "avg_success_rate": 89.2,
    "best_success_rate": 95.8,
    "worst_success_rate": 78.3,
    "avg_cost_usd": 0.012,
    "avg_duration_ms": 3200,
    "total_co2_g": 45.6
  },
  "by_agent_type": {
    "tool": {"count": 20, "avg_success_rate": 88.5},
    "code": {"count": 18, "avg_success_rate": 87.2},
    "both": {"count": 13, "avg_success_rate": 92.1}
  },
  "by_provider": {
    "litellm": {"count": 30, "avg_success_rate": 91.3},
    "transformers": {"count": 21, "avg_success_rate": 86.4}
  },
  "top_3_models": [
    {"model": "openai/gpt-4", "success_rate": 95.8},
    {"model": "anthropic/claude-3", "success_rate": 94.1},
    {"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
  ]
}

Data Management Tools

8. get_dataset

Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.

⚠️ Important: For leaderboard queries, prefer using get_top_performers() or get_leaderboard_summary() to avoid token bloat!

Security Restriction: Only datasets with "smoltrace-" in the repository name are allowed.

Parameters:

  • dataset_repo (str, required): HuggingFace dataset repository
    • Must contain "smoltrace-" prefix
    • Format: "username/smoltrace-type-model"
  • split (str): Dataset split to load
    • Default: "train"
  • limit (int): Maximum rows to return
    • Range: 1-200
    • Default: 100

Returns: JSON string with:

  • Total rows in dataset
  • List of column names
  • Array of data rows (up to limit)

Primary Use Cases:

  • Load smoltrace-results-* datasets for test case details
  • Load smoltrace-traces-* datasets for OpenTelemetry data
  • Load smoltrace-metrics-* datasets for GPU metrics
  • NOT recommended for leaderboard queries (use optimized tools)

Example Call:

result = await get_dataset(
    dataset_repo="kshitij/smoltrace-results-gpt4",
    split="train",
    limit=50
)
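
Example Response (illustrative; the top-level fields follow the Returns list above, but the column names and row values are placeholders, not the actual results schema):

{
  "total_rows": 100,
  "columns": ["task_id", "prompt", "expected_tool", "status", "duration_ms"],
  "rows": [
    {"task_id": "task_001", "prompt": "What is the weather in Tokyo?", "expected_tool": "get_weather", "status": "success", "duration_ms": 2100}
  ]
}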

9. generate_synthetic_dataset

Creates domain-specific test datasets for SMOLTRACE evaluations using AI.

Parameters:

  • domain (str, required): Domain for tasks
    • Examples: "e-commerce", "customer service", "finance", "healthcare"
  • tools (list[str], required): Available tools
    • Example: ["search_web", "get_weather", "calculator"]
  • num_tasks (int): Number of tasks to generate
    • Range: 1-100
    • Default: 20
  • difficulty_distribution (str): Task difficulty mix
    • Options: "balanced", "easy_only", "medium_only", "hard_only", "progressive"
    • Default: "balanced"
  • agent_type (str): Target agent type
    • Options: "tool", "code", "both"
    • Default: "both"

Returns: JSON string with:

  • dataset_info: Metadata (domain, tools, counts, timestamp)
  • tasks: Array of SMOLTRACE-formatted tasks
  • usage_instructions: Guide for HuggingFace upload and SMOLTRACE usage

SMOLTRACE Task Format:

{
  "id": "unique_identifier",
  "prompt": "Clear, specific task for the agent",
  "expected_tool": "tool_name",
  "expected_tool_calls": 1,
  "difficulty": "easy|medium|hard",
  "agent_type": "tool|code",
  "expected_keywords": ["keyword1", "keyword2"]
}

Difficulty Calibration:

  • Easy (40%): Single tool call, straightforward input
  • Medium (40%): Multiple tool calls OR complex input parsing
  • Hard (20%): Multiple tools, complex reasoning, edge cases
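
To sanity-check that a generated dataset matches this calibration, a small helper (ours, not part of the server) can tally the difficulty field documented in the task format above:

import json
from collections import Counter

def difficulty_mix(result_json: str) -> dict:
    """Percentage of tasks per difficulty level in a generate_synthetic_dataset result."""
    tasks = json.loads(result_json)["tasks"]
    counts = Counter(task["difficulty"] for task in tasks)
    return {level: round(100 * n / len(tasks), 1) for level, n in counts.items()}

# A "balanced" set should come out near {'easy': 40.0, 'medium': 40.0, 'hard': 20.0}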

Enterprise Use Cases:

  • Custom Tools: Benchmark proprietary APIs
  • Industry-Specific: Generate tasks for finance, healthcare, legal
  • Internal Workflows: Test company-specific processes

Example Call:

result = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
    difficulty_distribution="balanced",
    agent_type="tool"
)

10. push_dataset_to_hub

Uploads generated datasets to the HuggingFace Hub with proper formatting.

Parameters:

  • dataset_name (str, required): Repository name on HuggingFace
    • Format: "username/my-dataset"
  • data (str or list, required): Dataset content
    • Can be JSON string or list of dictionaries
  • description (str): Dataset description for card
    • Default: Auto-generated
  • private (bool): Make dataset private
    • Default: False

Returns: Success message with dataset URL

Example Workflow:

  1. Generate synthetic dataset with generate_synthetic_dataset
  2. Review and modify tasks if needed
  3. Upload to HuggingFace with push_dataset_to_hub
  4. Use in SMOLTRACE evaluations or share with team

Example Call:

result = await push_dataset_to_hub(
    dataset_name="kshitij/my-custom-evaluation",
    data=generated_tasks,
    description="Custom evaluation dataset for e-commerce agents",
    private=False
)
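
Putting the workflow together, a sketch like the following chains the two tools. The domain and tool names here are illustrative, and we assume the JSON shapes documented in each tool's Returns section:

import json

async def build_and_publish():
    # Step 1: generate candidate tasks (illustrative domain and tools)
    generated = await generate_synthetic_dataset(
        domain="e-commerce",
        tools=["search_products", "add_to_cart", "checkout"],
        num_tasks=20,
    )
    tasks = json.loads(generated)["tasks"]
    # Step 2: review; here we simply drop any task missing an expected tool
    tasks = [t for t in tasks if t.get("expected_tool")]
    # Step 3: upload the curated tasks
    return await push_dataset_to_hub(
        dataset_name="kshitij/my-custom-evaluation",
        data=tasks,
        description="Custom evaluation dataset for e-commerce agents",
    )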

11. generate_prompt_template

Generates a customized smolagents prompt template for a specific domain and tool set.

Parameters:

  • domain (str, required): Domain for the prompt template
    • Examples: "finance", "healthcare", "customer_support", "e-commerce"
  • tool_names (str, required): Comma-separated list of tool names
    • Format: "tool1,tool2,tool3"
    • Example: "get_stock_price,calculate_roi,fetch_company_info"
  • agent_type (str): Agent type
    • Options: "tool" (ToolCallingAgent), "code" (CodeAgent)
    • Default: "tool"

Returns: JSON response containing:

  • Customized YAML prompt template
  • Metadata (domain, tools, agent_type, timestamp)
  • Usage instructions

Use Case: When you generate synthetic datasets with generate_synthetic_dataset, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.

Integration: The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.

Example Call:

result = await generate_prompt_template(
    domain="customer_support",
    tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
    agent_type="tool"
)

Example Response:

{
  "prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n  You are a helpful customer support agent...\n  \n  Available tools:\n  - search_knowledge_base: Search the knowledge base...\n  - create_ticket: Create a support ticket...\n  ...",
  "metadata": {
    "domain": "customer_support",
    "tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
    "agent_type": "tool",
    "base_template": "ToolCallingAgent",
    "timestamp": "2025-11-21T10:30:00Z"
  },
  "usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
}

MCP Resources

Resources provide direct data access without AI analysis. Access them via the URI schemes below.

1. leaderboard://{repo}

Direct access to raw leaderboard data in JSON format.

URI Format:

leaderboard://username/dataset-name

Example:

GET leaderboard://kshitijthakkar/smoltrace-leaderboard

Returns: JSON array with all evaluation runs, including:

  • run_id, model, agent_type, provider
  • success_rate, total_tests, successful_tests, failed_tests
  • avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
  • results_dataset, traces_dataset, metrics_dataset (references)
  • timestamp, submitted_by, hf_job_id
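
As an illustration, here is a minimal client-side sketch for reading this resource with the official mcp Python SDK over stdio. The launch command is a placeholder; adapt it to however you run the TraceMind server.

import asyncio
import json

from pydantic import AnyUrl
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Placeholder launch command; point this at your TraceMind server entrypoint.
    server = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.read_resource(
                AnyUrl("leaderboard://kshitijthakkar/smoltrace-leaderboard")
            )
            runs = json.loads(result.contents[0].text)
            print(f"Loaded {len(runs)} runs")

asyncio.run(main())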

2. trace://{trace_id}/{repo}

Direct access to trace data with OpenTelemetry spans.

URI Format:

trace://trace_id/username/dataset-name

Example:

GET trace://trace_abc123/kshitij/smoltrace-traces-gpt4

Returns: JSON with:

  • traceId
  • spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)
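
For example, once the trace JSON is fetched, a one-line helper (ours, not part of the server) can locate the slowest span, assuming startTime/endTime are numeric epoch values; parse them first if your traces store timestamps as strings:

def slowest_span(trace: dict) -> dict:
    """Return the span with the longest duration (numeric startTime/endTime assumed)."""
    return max(trace["spans"], key=lambda s: s["endTime"] - s["startTime"])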

3. cost://model/{model_name}

Model pricing and hardware cost information.

URI Format:

cost://model/provider/model-name

Example:

GET cost://model/openai/gpt-4

Returns: JSON with:

  • Model pricing (input/output token costs)
  • Recommended hardware tier
  • Estimated compute costs
  • CO2 emissions per 1K tokens

MCP Prompts

Prompts provide reusable templates for standardized interactions.

1. analysis_prompt

Templates for different analysis types.

Parameters:

  • analysis_type (str): Type of analysis
    • Options: "leaderboard", "cost", "performance", "trace"
  • focus_area (str): Specific focus
    • Options: "overall", "cost", "accuracy", "speed", "eco"
  • detail_level (str): Level of detail
    • Options: "summary", "detailed", "comprehensive"

Returns: Formatted prompt string for use with AI tools

Example:

prompt = analysis_prompt(
    analysis_type="leaderboard",
    focus_area="cost",
    detail_level="detailed"
)
# Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."

2. debug_prompt

Templates for debugging scenarios.

Parameters:

  • debug_type (str): Type of debugging
    • Options: "failure", "performance", "tool_calling", "reasoning"
  • context (str): Additional context
    • Options: "test_failure", "timeout", "unexpected_tool", "reasoning_loop"

Returns: Formatted prompt string

Example:

prompt = debug_prompt(
    debug_type="performance",
    context="tool_calling"
)
# Returns: "Analyze tool calling performance. Identify which tools are slow..."

3. optimization_prompt

Templates for optimization goals.

Parameters:

  • optimization_goal (str): Optimization target
    • Options: "cost", "speed", "accuracy", "co2"
  • constraints (str): Constraints to respect
    • Options: "maintain_quality", "no_accuracy_loss", "budget_limit", "time_limit"

Returns: Formatted prompt string

Example:

prompt = optimization_prompt(
    optimization_goal="cost",
    constraints="maintain_quality"
)
# Returns: "Analyze this evaluation setup and recommend cost optimizations..."

Error Handling

Common Error Responses

Invalid Dataset Repository:

{
  "error": "Dataset must contain 'smoltrace-' prefix for security",
  "provided": "username/invalid-dataset"
}

Dataset Not Found:

{
  "error": "Dataset not found on HuggingFace",
  "repository": "username/smoltrace-nonexistent"
}

API Rate Limit:

{
  "error": "Gemini API rate limit exceeded",
  "retry_after": 60
}

Invalid Parameters:

{
  "error": "Invalid parameter value",
  "parameter": "top_n",
  "value": 50,
  "allowed_range": "1-20"
}
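
A client can treat these uniformly: attempt to parse the tool result as JSON, surface the error key, and honor retry_after when present. A minimal sketch (the helper name is ours):

import asyncio
import json

async def call_with_retry(tool_fn, max_retries=3, **kwargs):
    """Call an MCP tool, retrying when the server reports a rate limit."""
    for attempt in range(max_retries):
        raw = await tool_fn(**kwargs)
        try:
            payload = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            return raw  # plain-text responses (e.g., AI analysis) pass through
        if isinstance(payload, dict) and "error" in payload:
            wait = payload.get("retry_after")
            if wait is not None and attempt < max_retries - 1:
                await asyncio.sleep(wait)  # honor the server's retry hint
                continue
            raise RuntimeError(f"Tool error: {payload['error']}")
        return payload
    raise RuntimeError("Exhausted retries")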

Best Practices

1. Token Optimization

DO:

  • Use get_top_performers() for "top N" queries (90% token reduction)
  • Use get_leaderboard_summary() for overview queries (99% token reduction)
  • Set appropriate limit when using get_dataset()

DON'T:

  • Use get_dataset() for leaderboard queries (loads all 51 runs)
  • Request more data than needed
  • Ignore token optimization tools
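
To make the trade-off concrete, here is an illustrative pair of calls answering the same "which models lead?" question (parameters follow the tool definitions above):

async def compare_token_usage():
    # Wasteful: pulls raw leaderboard rows, leaving sorting and trimming to the client.
    raw = await get_dataset(
        dataset_repo="kshitijthakkar/smoltrace-leaderboard",
        limit=200,
    )
    # Optimized: the server ranks and trims, returning roughly 90% fewer tokens.
    top = await get_top_performers(
        leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
        metric="success_rate",
        top_n=5,
    )
    return raw, top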

2. AI Tool Usage

DO:

  • Use AI tools (analyze_leaderboard, debug_trace) for complex analysis
  • Provide specific questions to debug_trace for focused answers
  • Use focus parameter in compare_runs for targeted comparisons

DON'T:

  • Use AI tools for simple data retrieval (use resources instead)
  • Make vague requests (be specific for better results)

3. Dataset Security

DO:

  • Only use datasets with "smoltrace-" prefix
  • Verify dataset exists before requesting
  • Use public datasets or authenticate for private ones

DON'T:

  • Try to access arbitrary HuggingFace datasets
  • Share private dataset URLs without authentication

4. Cost Management

DO:

  • Use estimate_cost before running large evaluations
  • Compare cost estimates across different models
  • Consider token-optimized tools to reduce API costs

DON'T:

  • Skip cost estimation for expensive operations
  • Ignore hardware recommendations
  • Overlook CO2 emissions in decision-making

Support

For issues or questions:

  • 📧 GitHub Issues: TraceMind-mcp-server/issues
  • 💬 HF Discord: #agents-mcp-hackathon-winter25
  • 🏷️ Tag: building-mcp-track-enterprise