# TraceMind MCP Server - Complete API Documentation

This document provides a comprehensive API reference for all MCP components provided by TraceMind MCP Server.

## Table of Contents

- [MCP Tools (11)](#mcp-tools)
  - [AI-Powered Analysis Tools](#ai-powered-analysis-tools)
  - [Token-Optimized Tools](#token-optimized-tools)
  - [Data Management Tools](#data-management-tools)
- [MCP Resources (3)](#mcp-resources)
- [MCP Prompts (3)](#mcp-prompts)
- [Error Handling](#error-handling)
- [Best Practices](#best-practices)

---

## MCP Tools

### AI-Powered Analysis Tools

These tools use Google Gemini 2.5 Flash to provide intelligent, context-aware analysis of agent evaluation data.

#### 1. analyze_leaderboard

Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
  - Format: `"username/dataset-name"`
- `metric_focus` (str): Primary metric to analyze
  - Options: `"overall"`, `"accuracy"`, `"cost"`, `"latency"`, `"co2"`
  - Default: `"overall"`
- `time_range` (str): Time period to analyze
  - Options: `"last_week"`, `"last_month"`, `"all_time"`
  - Default: `"last_week"`
- `top_n` (int): Number of top models to highlight
  - Range: 1-20
  - Default: 5

**Returns:** String containing AI-generated analysis with:
- Top performers by selected metric
- Trade-off analysis (e.g., accuracy vs cost)
- Trend identification
- Actionable recommendations

**Example Use Case:**
Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.

**Example Call:**
```python
result = await analyze_leaderboard(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric_focus="cost",
    time_range="last_week",
    top_n=5
)
```

**Example Response:**
```
Based on 247 evaluations in the past week:

Top Performers (Cost Focus):
1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy

Trade-off Analysis:
- Llama-3.1 offers the best cost/performance ratio at 25x cheaper than GPT-4
- GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
- For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4

Recommendations:
- Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
- Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
- Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)
```
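The example calls in this document use each tool's Python signature directly. If you are connecting from your own MCP client, the same call goes through the official `mcp` Python SDK instead. The sketch below is a minimal, hedged example; the launch command and `server.py` file name are assumptions about your local setup, not part of this API:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command; substitute however you run TraceMind MCP Server.
server = StdioServerParameters(command="python", args=["server.py"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Tool name and arguments exactly as documented above.
            result = await session.call_tool(
                "analyze_leaderboard",
                arguments={
                    "leaderboard_repo": "kshitijthakkar/smoltrace-leaderboard",
                    "metric_focus": "cost",
                    "time_range": "last_week",
                    "top_n": 5,
                },
            )
            # The analysis comes back as a single text content block.
            print(result.content[0].text)

asyncio.run(main())
```

---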
#### 2. debug_trace

Analyzes OpenTelemetry trace data and answers specific questions about agent execution.

**Parameters:**

- `trace_dataset` (str): HuggingFace dataset containing traces
  - Format: `"username/smoltrace-traces-model"`
  - Must contain "smoltrace-" prefix
- `trace_id` (str): Specific trace ID to analyze
  - Format: `"trace_abc123"`
- `question` (str): Question about the trace
  - Examples: "Why was tool X called twice?", "Which step took the most time?"
- `include_metrics` (bool): Include GPU metrics in analysis
  - Default: `true`

**Returns:** String containing AI analysis of the trace with:
- Answer to the specific question
- Relevant span details
- Performance insights
- GPU metrics (if available and requested)

**Example Use Case:**
When an agent test fails, understand exactly what happened without manually parsing trace spans.

**Example Call:**
```python
result = await debug_trace(
    trace_dataset="kshitij/smoltrace-traces-gpt4",
    trace_id="trace_abc123",
    question="Why was the search tool called twice?",
    include_metrics=True
)
```

**Example Response:**
```
Based on trace analysis:

Answer: The agent called the search_web tool twice due to an iterative reasoning pattern:

1. First call (span_003 at 14:23:19.000):
   - Query: "weather in Tokyo"
   - Duration: 890ms
   - Result: 5 results, oldest was 2 days old

2. Second call (span_005 at 14:23:21.200):
   - Query: "latest weather in Tokyo"
   - Duration: 1200ms
   - Modified reasoning: LLM determined first results were stale

Performance Impact:
- Added 2.09s to total execution time
- Cost increase: +$0.0003 (tokens for second reasoning step)
- This is normal behavior for tool-calling agents with iterative reasoning

GPU Metrics:
- N/A (API model, no GPU used)
```

---

#### 3. estimate_cost

Predicts costs, duration, and environmental impact before running evaluations.

**Parameters:**

- `model` (str, required): Model name to evaluate
  - Format: `"provider/model-name"` (e.g., `"openai/gpt-4"`, `"meta-llama/Llama-3.1-8B"`)
- `agent_type` (str): Type of agent evaluation
  - Options: `"tool"`, `"code"`, `"both"`
  - Default: `"both"`
- `num_tests` (int): Number of test cases
  - Range: 1-10000
  - Default: 100
- `hardware` (str): Hardware type
  - Options: `"auto"`, `"cpu"`, `"gpu_a10"`, `"gpu_h200"`
  - Default: `"auto"` (auto-selects based on model)

**Returns:** String containing cost estimate with:
- LLM API costs (for API models)
- HuggingFace Jobs compute costs (for local models)
- Estimated duration
- CO2 emissions estimate
- Hardware recommendations

**Example Use Case:**
Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.

**Example Call:**
```python
result = await estimate_cost(
    model="openai/gpt-4",
    agent_type="both",
    num_tests=1000,
    hardware="auto"
)
```

**Example Response:**
```
Cost Estimate for openai/gpt-4:

LLM API Costs:
- Estimated tokens per test: 1,500
- Token cost: $0.03/1K input, $0.06/1K output
- Total LLM cost: $50.00 (1000 tests)

Compute Costs:
- Recommended hardware: cpu-basic (API model)
- HF Jobs cost: ~$0.05/hr
- Estimated duration: 45 minutes
- Total compute cost: $0.04

Total Cost: $50.04
Cost per test: $0.05
CO2 emissions: ~0.5g (API calls, minimal compute)

Recommendations:
- This is an API model, CPU hardware is sufficient
- For cost optimization, consider Llama-3.1-8B (25x cheaper)
- Estimated runtime: 45 minutes for 1000 tests
```
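The "LLM API Costs" line in the estimate follows from simple token arithmetic. The sketch below mirrors it client-side; the 80/20 input/output token split is an assumption for illustration only (the tool evidently uses a different split, since its estimate above is $50.00), so treat the result as a rough reproduction, not the tool's exact formula:

```python
def estimate_llm_api_cost(
    num_tests: int,
    tokens_per_test: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    input_fraction: float = 0.8,  # assumed split, not part of the API
) -> float:
    """Rough client-side mirror of the 'LLM API Costs' line above."""
    input_tokens = tokens_per_test * input_fraction
    output_tokens = tokens_per_test - input_tokens
    per_test = (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return per_test * num_tests

# GPT-4-style pricing from the example response: $0.03/1K input, $0.06/1K output
print(estimate_llm_api_cost(1000, 1500, 0.03, 0.06))  # ~54.0 under the assumed split
```

---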
#### 4. compare_runs

Compares two evaluation runs with AI-powered analysis across multiple dimensions.

**Parameters:**

- `run_id_1` (str, required): First run ID from leaderboard
- `run_id_2` (str, required): Second run ID from leaderboard
- `leaderboard_repo` (str): Leaderboard dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `focus` (str): Comparison focus area
  - Options:
    - `"comprehensive"`: All dimensions
    - `"cost"`: Cost efficiency and ROI
    - `"performance"`: Speed and accuracy trade-offs
    - `"eco_friendly"`: Environmental impact
  - Default: `"comprehensive"`

**Returns:** String containing AI comparison with:
- Success rate comparison with statistical significance
- Cost efficiency analysis
- Speed comparison
- Environmental impact (CO2 emissions)
- GPU efficiency (for GPU jobs)

**Example Use Case:**
After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.

**Example Call:**
```python
result = await compare_runs(
    run_id_1="run_abc123",
    run_id_2="run_def456",
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    focus="cost"
)
```

**Example Response:**
```
Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)

Success Rates:
- GPT-4: 95.8% (96/100 tests)
- Llama-3.1: 93.4% (93/100 tests)
- Difference: +2.4% for GPT-4 (statistically significant, p<0.05)

Cost Efficiency:
- GPT-4: $0.05 per test, $0.052 per successful test
- Llama-3.1: $0.002 per test, $0.0021 per successful test
- Cost ratio: GPT-4 is 25x more expensive

ROI Analysis:
- For 1M evaluations/month:
  - GPT-4: $50,000/month, 958K successes
  - Llama-3.1: $2,000/month, 934K successes
  - GPT-4 provides 24K more successes for $48K more cost
  - Cost per additional success: $2.00

Recommendation (Cost Focus):
Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement
justifies 25x cost.
```
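The "cost per additional success" figure in the ROI analysis is plain marginal-cost arithmetic. A minimal sketch using the numbers from the example response above:

```python
def cost_per_additional_success(
    cost_a: float, successes_a: int, cost_b: float, successes_b: int
) -> float:
    """Marginal cost of each extra success when moving from config B to config A."""
    return (cost_a - cost_b) / (successes_a - successes_b)

# Monthly figures from the example response (1M evaluations/month):
gpt4_cost, gpt4_successes = 50_000, 958_000
llama_cost, llama_successes = 2_000, 934_000

print(cost_per_additional_success(gpt4_cost, gpt4_successes,
                                  llama_cost, llama_successes))
# -> 2.0, i.e. $2.00 per additional success
```

---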
#### 5. analyze_results

Analyzes detailed test results and provides optimization recommendations.

**Parameters:**

- `results_repo` (str, required): HuggingFace dataset containing results
  - Format: `"username/smoltrace-results-model-timestamp"`
  - Must contain "smoltrace-results-" prefix
- `analysis_focus` (str): Focus area for analysis
  - Options: `"failures"`, `"performance"`, `"cost"`, `"comprehensive"`
  - Default: `"comprehensive"`
- `max_rows` (int): Maximum test cases to analyze
  - Range: 10-500
  - Default: 100

**Returns:** String containing AI analysis with:
- Failure patterns and root causes
- Performance bottlenecks in specific test cases
- Cost optimization opportunities
- Tool usage patterns
- Task-specific insights (which types work well vs poorly)
- Actionable optimization recommendations

**Example Use Case:**
After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving the success rate.

**Example Call:**
```python
result = await analyze_results(
    results_repo="kshitij/smoltrace-results-gpt4-20251120",
    analysis_focus="failures",
    max_rows=100
)
```

**Example Response:**
```
Analysis of Test Results (100 tests analyzed)

Overall Statistics:
- Success Rate: 89% (89/100 tests passed)
- Average Duration: 3.2s per test
- Total Cost: $4.50 ($0.045 per test)

Failure Analysis (11 failures):

1. Tool Not Found (6 failures):
   - Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
   - Pattern: All failed tests required the 'get_weather' tool
   - Root Cause: Tool definition missing or incorrect name
   - Fix: Ensure 'get_weather' tool is available in agent's tool list

2. Timeout (3 failures):
   - Test IDs: task_034, task_071, task_088
   - Pattern: Complex multi-step tasks with >5 tool calls
   - Root Cause: Exceeding 30s timeout limit
   - Fix: Increase timeout to 60s or simplify complex tasks

3. Incorrect Response (2 failures):
   - Test IDs: task_056, task_072
   - Pattern: Math calculation tasks
   - Root Cause: Model hallucinating numbers instead of using calculator tool
   - Fix: Update prompt to emphasize tool usage for calculations

Performance Insights:
- Fast tasks (<2s): 45 tests - Simple single-tool calls
- Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
- Optimal duration: 2-3s for most tasks

Cost Optimization:
- High-cost tests: task_023 ($0.12) - Used 4K tokens
- Low-cost tests: task_087 ($0.008) - Used 180 tokens
- Recommendation: Optimize prompt to reduce token usage by 20%

Recommendations:
1. Add missing 'get_weather' tool → Fixes 6 failures
2. Increase timeout from 30s to 60s → Fixes 3 failures
3. Strengthen calculator tool instruction → Fixes 2 failures
4. Expected improvement: 89% → 100% success rate
```

---

### Token-Optimized Tools

These tools are specifically designed to minimize token usage when querying leaderboard data.

#### 6. get_top_performers

Gets the top N performing models from the leaderboard with 90% token reduction.

**Performance Optimization:** Returns only the top N models instead of loading the full leaderboard dataset (51 runs), resulting in **90% token reduction**.

**When to Use:** Perfect for queries like "Which model is leading?" or "Show me the top 5 models".

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `metric` (str): Metric to rank by
  - Options: `"success_rate"`, `"total_cost_usd"`, `"avg_duration_ms"`, `"co2_emissions_g"`
  - Default: `"success_rate"`
- `top_n` (int): Number of top models to return
  - Range: 1-20
  - Default: 5

**Returns:** JSON string with:
- Metric used for ranking
- Ranking order (ascending/descending)
- Total runs in leaderboard
- Array of top performers with 10 essential fields

**Benefits:**
- ✅ Token Reduction: 90% fewer tokens vs full dataset
- ✅ Ready to Use: Properly formatted JSON
- ✅ Pre-Sorted: Already ranked by chosen metric
- ✅ Essential Data Only: 10 fields vs 20+ in full dataset

**Example Call:**
```python
result = await get_top_performers(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric="total_cost_usd",
    top_n=3
)
```

**Example Response:**
```json
{
  "metric": "total_cost_usd",
  "order": "ascending",
  "total_runs": 51,
  "top_performers": [
    {
      "run_id": "run_001",
      "model": "meta-llama/Llama-3.1-8B",
      "success_rate": 93.4,
      "total_cost_usd": 0.002,
      "avg_duration_ms": 2100,
      "agent_type": "both",
      "provider": "transformers",
      "submitted_by": "kshitij",
      "timestamp": "2025-11-20T10:30:00Z",
      "total_tests": 100
    },
    ...
  ]
}
```
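Because the tool returns a JSON string in the shape shown above, downstream code can consume it directly with the standard library. A minimal sketch (run inside an async context, as in the example calls above):

```python
import json

# Inside an async context, as in the example calls above:
raw = await get_top_performers(metric="total_cost_usd", top_n=3)
data = json.loads(raw)

print(f"Ranked by {data['metric']} ({data['order']}), {data['total_runs']} runs total")
for rank, run in enumerate(data["top_performers"], start=1):
    print(f"{rank}. {run['model']}: ${run['total_cost_usd']}/run, "
          f"{run['success_rate']}% success")
```

---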
#### 7. get_leaderboard_summary

Gets high-level leaderboard statistics with 99% token reduction.

**Performance Optimization:** Returns only aggregated statistics instead of raw data, resulting in **99% token reduction**.

**When to Use:** Perfect for overview queries like "How many runs are in the leaderboard?" or "What's the average success rate?".

**Parameters:**

- `leaderboard_repo` (str): HuggingFace dataset repository
  - Default: `"kshitijthakkar/smoltrace-leaderboard"`

**Returns:** JSON string with:
- Total runs count
- Unique models and submitters
- Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
- Breakdown by agent type
- Breakdown by provider
- Top 3 models by success rate

**Benefits:**
- ✅ Extreme Token Reduction: 99% fewer tokens
- ✅ Ready to Use: Properly formatted JSON
- ✅ Comprehensive Stats: Averages, distributions, breakdowns
- ✅ Quick Insights: Perfect for overview questions

**Example Call:**
```python
result = await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)
```

**Example Response:**
```json
{
  "total_runs": 51,
  "unique_models": 12,
  "unique_submitters": 3,
  "overall_stats": {
    "avg_success_rate": 89.2,
    "best_success_rate": 95.8,
    "worst_success_rate": 78.3,
    "avg_cost_usd": 0.012,
    "avg_duration_ms": 3200,
    "total_co2_g": 45.6
  },
  "by_agent_type": {
    "tool": {"count": 20, "avg_success_rate": 88.5},
    "code": {"count": 18, "avg_success_rate": 87.2},
    "both": {"count": 13, "avg_success_rate": 92.1}
  },
  "by_provider": {
    "litellm": {"count": 30, "avg_success_rate": 91.3},
    "transformers": {"count": 21, "avg_success_rate": 86.4}
  },
  "top_3_models": [
    {"model": "openai/gpt-4", "success_rate": 95.8},
    {"model": "anthropic/claude-3", "success_rate": 94.1},
    {"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
  ]
}
```

---

### Data Management Tools

#### 8. get_dataset

Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.

**⚠️ Important:** For leaderboard queries, prefer `get_top_performers()` or `get_leaderboard_summary()` to avoid token bloat!

**Security Restriction:** Only datasets with "smoltrace-" in the repository name are allowed.

**Parameters:**

- `dataset_repo` (str, required): HuggingFace dataset repository
  - Must contain "smoltrace-" prefix
  - Format: `"username/smoltrace-type-model"`
- `split` (str): Dataset split to load
  - Default: `"train"`
- `limit` (int): Maximum rows to return
  - Range: 1-200
  - Default: 100

**Returns:** JSON string with:
- Total rows in dataset
- List of column names
- Array of data rows (up to `limit`)

**Primary Use Cases:**
- Load `smoltrace-results-*` datasets for test case details
- Load `smoltrace-traces-*` datasets for OpenTelemetry data
- Load `smoltrace-metrics-*` datasets for GPU metrics
- **NOT recommended** for leaderboard queries (use the optimized tools)

**Example Call:**
```python
result = await get_dataset(
    dataset_repo="kshitij/smoltrace-results-gpt4",
    split="train",
    limit=50
)
```
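Since the server rejects repositories without the "smoltrace-" marker, it can be convenient to fail fast on the client side before making the call. A minimal sketch; the `assert_smoltrace_repo` helper is hypothetical and not part of this API:

```python
def assert_smoltrace_repo(dataset_repo: str) -> str:
    """Hypothetical client-side guard mirroring the server's security check."""
    if "smoltrace-" not in dataset_repo:
        raise ValueError(
            f"Dataset must contain 'smoltrace-' prefix for security: {dataset_repo}"
        )
    return dataset_repo

# Inside an async context, as in the example calls above:
result = await get_dataset(
    dataset_repo=assert_smoltrace_repo("kshitij/smoltrace-results-gpt4"),
    split="train",
    limit=50,
)
```

---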
#### 9. generate_synthetic_dataset

Creates domain-specific test datasets for SMOLTRACE evaluations using AI.

**Parameters:**

- `domain` (str, required): Domain for tasks
  - Examples: "e-commerce", "customer service", "finance", "healthcare"
- `tools` (list[str], required): Available tools
  - Example: `["search_web", "get_weather", "calculator"]`
- `num_tasks` (int): Number of tasks to generate
  - Range: 1-100
  - Default: 20
- `difficulty_distribution` (str): Task difficulty mix
  - Options: `"balanced"`, `"easy_only"`, `"medium_only"`, `"hard_only"`, `"progressive"`
  - Default: `"balanced"`
- `agent_type` (str): Target agent type
  - Options: `"tool"`, `"code"`, `"both"`
  - Default: `"both"`

**Returns:** JSON string with:
- `dataset_info`: Metadata (domain, tools, counts, timestamp)
- `tasks`: Array of SMOLTRACE-formatted tasks
- `usage_instructions`: Guide for HuggingFace upload and SMOLTRACE usage

**SMOLTRACE Task Format:**
```json
{
  "id": "unique_identifier",
  "prompt": "Clear, specific task for the agent",
  "expected_tool": "tool_name",
  "expected_tool_calls": 1,
  "difficulty": "easy|medium|hard",
  "agent_type": "tool|code",
  "expected_keywords": ["keyword1", "keyword2"]
}
```

**Difficulty Calibration:**
- **Easy** (40%): Single tool call, straightforward input
- **Medium** (40%): Multiple tool calls OR complex input parsing
- **Hard** (20%): Multiple tools, complex reasoning, edge cases

**Enterprise Use Cases:**
- Custom Tools: Benchmark proprietary APIs
- Industry-Specific: Generate tasks for finance, healthcare, legal
- Internal Workflows: Test company-specific processes

**Example Call:**
```python
result = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
    difficulty_distribution="balanced",
    agent_type="tool"
)
```

---

#### 10. push_dataset_to_hub

Uploads generated datasets to HuggingFace Hub with proper formatting.

**Parameters:**

- `dataset_name` (str, required): Repository name on HuggingFace
  - Format: `"username/my-dataset"`
- `data` (str or list, required): Dataset content
  - Can be a JSON string or a list of dictionaries
- `description` (str): Dataset description for the card
  - Default: Auto-generated
- `private` (bool): Make dataset private
  - Default: `False`

**Returns:** Success message with dataset URL

**Example Workflow:**
1. Generate a synthetic dataset with `generate_synthetic_dataset`
2. Review and modify tasks if needed
3. Upload to HuggingFace with `push_dataset_to_hub`
4. Use in SMOLTRACE evaluations or share with your team

A code sketch of this workflow follows after the example call below.

**Example Call:**
```python
result = await push_dataset_to_hub(
    dataset_name="kshitij/my-custom-evaluation",
    data=generated_tasks,
    description="Custom evaluation dataset for e-commerce agents",
    private=False
)
```
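A minimal end-to-end sketch of the workflow above, combining the two tools. The domain, tool names, and repository name are hypothetical placeholders; the `tasks` key and JSON-string return type are as documented under tool #9:

```python
import json

# Inside an async context, as in the example calls above.

# 1. Generate a synthetic dataset (parameters as documented under tool #9).
generated = await generate_synthetic_dataset(
    domain="e-commerce",                       # hypothetical domain
    tools=["search_products", "check_stock"],  # hypothetical tool names
    num_tasks=20,
)

# 2. Review/modify tasks locally; the return value is a JSON string
#    with a `tasks` array in the SMOLTRACE task format.
tasks = json.loads(generated)["tasks"]

# 3. Upload to HuggingFace Hub.
result = await push_dataset_to_hub(
    dataset_name="your-username/my-ecommerce-eval",  # hypothetical repo
    data=tasks,
    description="Synthetic e-commerce agent evaluation tasks",
    private=False,
)
print(result)  # success message with the dataset URL
```

---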
#### 11. generate_prompt_template

Generates a customized smolagents prompt template for a specific domain and tool set.

**Parameters:**

- `domain` (str, required): Domain for the prompt template
  - Examples: `"finance"`, `"healthcare"`, `"customer_support"`, `"e-commerce"`
- `tool_names` (str, required): Comma-separated list of tool names
  - Format: `"tool1,tool2,tool3"`
  - Example: `"get_stock_price,calculate_roi,fetch_company_info"`
- `agent_type` (str): Agent type
  - Options: `"tool"` (ToolCallingAgent), `"code"` (CodeAgent)
  - Default: `"tool"`

**Returns:** JSON response containing:
- Customized YAML prompt template
- Metadata (domain, tools, agent_type, timestamp)
- Usage instructions

**Use Case:** When you generate synthetic datasets with `generate_synthetic_dataset`, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.

**Integration:** The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.

**Example Call:**
```python
result = await generate_prompt_template(
    domain="customer_support",
    tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
    agent_type="tool"
)
```

**Example Response:**
```json
{
  "prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n  You are a helpful customer support agent...\n  \n  Available tools:\n  - search_knowledge_base: Search the knowledge base...\n  - create_ticket: Create a support ticket...\n  ...",
  "metadata": {
    "domain": "customer_support",
    "tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
    "agent_type": "tool",
    "base_template": "ToolCallingAgent",
    "timestamp": "2025-11-21T10:30:00Z"
  },
  "usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
}
```

---

## MCP Resources

Resources provide direct data access without AI analysis. Access them via URI scheme.

### 1. leaderboard://{repo}

Direct access to raw leaderboard data in JSON format.

**URI Format:**
```
leaderboard://username/dataset-name
```

**Example:**
```
GET leaderboard://kshitijthakkar/smoltrace-leaderboard
```

**Returns:** JSON array with all evaluation runs, including:
- run_id, model, agent_type, provider
- success_rate, total_tests, successful_tests, failed_tests
- avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
- results_dataset, traces_dataset, metrics_dataset (references)
- timestamp, submitted_by, hf_job_id

---

### 2. trace://{trace_id}/{repo}

Direct access to trace data with OpenTelemetry spans.

**URI Format:**
```
trace://trace_id/username/dataset-name
```

**Example:**
```
GET trace://trace_abc123/kshitij/smoltrace-traces-gpt4
```

**Returns:** JSON with:
- traceId
- spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)

---

### 3. cost://model/{model_name}

Model pricing and hardware cost information.

**URI Format:**
```
cost://model/provider/model-name
```

**Example:**
```
GET cost://model/openai/gpt-4
```

**Returns:** JSON with:
- Model pricing (input/output token costs)
- Recommended hardware tier
- Estimated compute costs
- CO2 emissions per 1K tokens
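From a client, resources are fetched with the MCP SDK's resource-reading call rather than a tool call. A minimal sketch, reusing the session setup from the client example under `analyze_leaderboard`; the exact return shape shown here follows the `mcp` Python SDK as we understand it and is worth verifying against the SDK version you use:

```python
import json

from pydantic import AnyUrl

# Inside an initialized ClientSession (see the client sketch under analyze_leaderboard):
res = await session.read_resource(
    AnyUrl("leaderboard://kshitijthakkar/smoltrace-leaderboard")
)
runs = json.loads(res.contents[0].text)  # text resource contents hold the JSON array
print(f"{len(runs)} evaluation runs on the leaderboard")
```

---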
## MCP Prompts

Prompts provide reusable templates for standardized interactions.

### 1. analysis_prompt

Templates for different analysis types.

**Parameters:**

- `analysis_type` (str): Type of analysis
  - Options: `"leaderboard"`, `"cost"`, `"performance"`, `"trace"`
- `focus_area` (str): Specific focus
  - Options: `"overall"`, `"cost"`, `"accuracy"`, `"speed"`, `"eco"`
- `detail_level` (str): Level of detail
  - Options: `"summary"`, `"detailed"`, `"comprehensive"`

**Returns:** Formatted prompt string for use with AI tools

**Example:**
```python
prompt = analysis_prompt(
    analysis_type="leaderboard",
    focus_area="cost",
    detail_level="detailed"
)
# Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."
```

---

### 2. debug_prompt

Templates for debugging scenarios.

**Parameters:**

- `debug_type` (str): Type of debugging
  - Options: `"failure"`, `"performance"`, `"tool_calling"`, `"reasoning"`
- `context` (str): Additional context
  - Options: `"test_failure"`, `"timeout"`, `"unexpected_tool"`, `"reasoning_loop"`

**Returns:** Formatted prompt string

**Example:**
```python
prompt = debug_prompt(
    debug_type="performance",
    context="tool_calling"
)
# Returns: "Analyze tool calling performance. Identify which tools are slow..."
```

---

### 3. optimization_prompt

Templates for optimization goals.

**Parameters:**

- `optimization_goal` (str): Optimization target
  - Options: `"cost"`, `"speed"`, `"accuracy"`, `"co2"`
- `constraints` (str): Constraints to respect
  - Options: `"maintain_quality"`, `"no_accuracy_loss"`, `"budget_limit"`, `"time_limit"`

**Returns:** Formatted prompt string

**Example:**
```python
prompt = optimization_prompt(
    optimization_goal="cost",
    constraints="maintain_quality"
)
# Returns: "Analyze this evaluation setup and recommend cost optimizations..."
```

---

## Error Handling

### Common Error Responses

**Invalid Dataset Repository:**
```json
{
  "error": "Dataset must contain 'smoltrace-' prefix for security",
  "provided": "username/invalid-dataset"
}
```

**Dataset Not Found:**
```json
{
  "error": "Dataset not found on HuggingFace",
  "repository": "username/smoltrace-nonexistent"
}
```

**API Rate Limit:**
```json
{
  "error": "Gemini API rate limit exceeded",
  "retry_after": 60
}
```

**Invalid Parameters:**
```json
{
  "error": "Invalid parameter value",
  "parameter": "top_n",
  "value": 50,
  "allowed_range": "1-20"
}
```

---

## Best Practices

### 1. Token Optimization

**DO:**
- Use `get_top_performers()` for "top N" queries (90% token reduction)
- Use `get_leaderboard_summary()` for overview queries (99% token reduction)
- Set appropriate `limit` when using `get_dataset()`

**DON'T:**
- Use `get_dataset()` for leaderboard queries (loads all 51 runs)
- Request more data than needed
- Ignore token optimization tools

### 2. AI Tool Usage

**DO:**
- Use AI tools (`analyze_leaderboard`, `debug_trace`) for complex analysis
- Provide specific questions to `debug_trace` for focused answers
- Use `focus` parameter in `compare_runs` for targeted comparisons

**DON'T:**
- Use AI tools for simple data retrieval (use resources instead)
- Make vague requests (be specific for better results)

### 3. Dataset Security

**DO:**
- Only use datasets with "smoltrace-" prefix
- Verify dataset exists before requesting
- Use public datasets or authenticate for private ones

**DON'T:**
- Try to access arbitrary HuggingFace datasets
- Share private dataset URLs without authentication

### 4. Cost Management

**DO:**
- Use `estimate_cost` before running large evaluations
- Compare cost estimates across different models
- Consider token-optimized tools to reduce API costs

**DON'T:**
- Skip cost estimation for expensive operations
- Ignore hardware recommendations
- Overlook CO2 emissions in decision-making

---

## Support

For issues or questions:

- 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
- 💬 HF Discord: `#agents-mcp-hackathon-winter25`
- 🏷️ Tag: `building-mcp-track-enterprise`