# TraceMind MCP Server - Complete API Documentation
This document provides a comprehensive API reference for all MCP components exposed by the TraceMind MCP Server.
## Table of Contents
- [MCP Tools (11)](#mcp-tools)
- [AI-Powered Analysis Tools](#ai-powered-analysis-tools)
- [Token-Optimized Tools](#token-optimized-tools)
- [Data Management Tools](#data-management-tools)
- [MCP Resources (3)](#mcp-resources)
- [MCP Prompts (3)](#mcp-prompts)
- [Error Handling](#error-handling)
- [Best Practices](#best-practices)
---
## MCP Tools
### AI-Powered Analysis Tools
These tools use Google Gemini 2.5 Flash to provide intelligent, context-aware analysis of agent evaluation data.
#### 1. analyze_leaderboard
Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- Format: `"username/dataset-name"`
- `metric_focus` (str): Primary metric to analyze
- Options: `"overall"`, `"accuracy"`, `"cost"`, `"latency"`, `"co2"`
- Default: `"overall"`
- `time_range` (str): Time period to analyze
- Options: `"last_week"`, `"last_month"`, `"all_time"`
- Default: `"last_week"`
- `top_n` (int): Number of top models to highlight
- Range: 1-20
- Default: 5
**Returns:** String containing AI-generated analysis with:
- Top performers by selected metric
- Trade-off analysis (e.g., accuracy vs cost)
- Trend identification
- Actionable recommendations
**Example Use Case:**
Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
**Example Call:**
```python
result = await analyze_leaderboard(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
metric_focus="cost",
time_range="last_week",
top_n=5
)
```
**Example Response:**
```
Based on 247 evaluations in the past week:
Top Performers (Cost Focus):
1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy
Trade-off Analysis:
- Llama-3.1 offers best cost/performance ratio at 25x cheaper than GPT-4
- GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
- For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4
Recommendations:
- Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
- Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
- Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)
```
---
#### 2. debug_trace
Analyzes OpenTelemetry trace data and answers specific questions about agent execution.
**Parameters:**
- `trace_dataset` (str): HuggingFace dataset containing traces
- Format: `"username/smoltrace-traces-model"`
- Must contain "smoltrace-" prefix
- `trace_id` (str): Specific trace ID to analyze
- Format: `"trace_abc123"`
- `question` (str): Question about the trace
- Examples: "Why was tool X called twice?", "Which step took the most time?"
- `include_metrics` (bool): Include GPU metrics in analysis
- Default: `True`
**Returns:** String containing AI analysis of the trace with:
- Answer to the specific question
- Relevant span details
- Performance insights
- GPU metrics (if available and requested)
**Example Use Case:**
When an agent test fails, understand exactly what happened without manually parsing trace spans.
**Example Call:**
```python
result = await debug_trace(
trace_dataset="kshitij/smoltrace-traces-gpt4",
trace_id="trace_abc123",
question="Why was the search tool called twice?",
include_metrics=True
)
```
**Example Response:**
```
Based on trace analysis:
Answer:
The agent called the search_web tool twice due to an iterative reasoning pattern:
1. First call (span_003 at 14:23:19.000):
- Query: "weather in Tokyo"
- Duration: 890ms
- Result: 5 results, oldest was 2 days old
2. Second call (span_005 at 14:23:21.200):
- Query: "latest weather in Tokyo"
- Duration: 1200ms
- Modified reasoning: LLM determined first results were stale
Performance Impact:
- Added 2.09s to total execution time
- Cost increase: +$0.0003 (tokens for second reasoning step)
- This is normal behavior for tool-calling agents with iterative reasoning
GPU Metrics:
- N/A (API model, no GPU used)
```
---
#### 3. estimate_cost
Predicts costs, duration, and environmental impact before running evaluations.
**Parameters:**
- `model` (str, required): Model name to evaluate
- Format: `"provider/model-name"` (e.g., `"openai/gpt-4"`, `"meta-llama/Llama-3.1-8B"`)
- `agent_type` (str): Type of agent evaluation
- Options: `"tool"`, `"code"`, `"both"`
- Default: `"both"`
- `num_tests` (int): Number of test cases
- Range: 1-10000
- Default: 100
- `hardware` (str): Hardware type
- Options: `"auto"`, `"cpu"`, `"gpu_a10"`, `"gpu_h200"`
- Default: `"auto"` (auto-selects based on model)
**Returns:** String containing cost estimate with:
- LLM API costs (for API models)
- HuggingFace Jobs compute costs (for local models)
- Estimated duration
- CO2 emissions estimate
- Hardware recommendations
**Example Use Case:**
Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
**Example Call:**
```python
result = await estimate_cost(
model="openai/gpt-4",
agent_type="both",
num_tests=1000,
hardware="auto"
)
```
**Example Response:**
```
Cost Estimate for openai/gpt-4:
LLM API Costs:
- Estimated tokens per test: 1,500
- Token cost: $0.03/1K input, $0.06/1K output
- Total LLM cost: $50.00 (1000 tests)
Compute Costs:
- Recommended hardware: cpu-basic (API model)
- HF Jobs cost: ~$0.05/hr
- Estimated duration: 45 minutes
- Total compute cost: $0.04
Total Cost: $50.04
Cost per test: $0.05
CO2 emissions: ~0.5g (API calls, minimal compute)
Recommendations:
- This is an API model, CPU hardware is sufficient
- For cost optimization, consider Llama-3.1-8B (25x cheaper)
- Estimated runtime: 45 minutes for 1000 tests
```
---
#### 4. compare_runs
Compares two evaluation runs with AI-powered analysis across multiple dimensions.
**Parameters:**
- `run_id_1` (str, required): First run ID from leaderboard
- `run_id_2` (str, required): Second run ID from leaderboard
- `leaderboard_repo` (str): Leaderboard dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `focus` (str): Comparison focus area
- Options:
- `"comprehensive"`: All dimensions
- `"cost"`: Cost efficiency and ROI
- `"performance"`: Speed and accuracy trade-offs
- `"eco_friendly"`: Environmental impact
- Default: `"comprehensive"`
**Returns:** String containing AI comparison with:
- Success rate comparison with statistical significance
- Cost efficiency analysis
- Speed comparison
- Environmental impact (CO2 emissions)
- GPU efficiency (for GPU jobs)
**Example Use Case:**
After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.
**Example Call:**
```python
result = await compare_runs(
run_id_1="run_abc123",
run_id_2="run_def456",
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
focus="cost"
)
```
**Example Response:**
```
Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)
Success Rates:
- GPT-4: 95.8% (96/100 tests)
- Llama-3.1: 93.4% (93/100 tests)
- Difference: +2.4% for GPT-4 (statistically significant, p<0.05)
Cost Efficiency:
- GPT-4: $0.05 per test, $0.052 per successful test
- Llama-3.1: $0.002 per test, $0.0021 per successful test
- Cost ratio: GPT-4 is 25x more expensive
ROI Analysis:
- For 1M evaluations/month:
- GPT-4: $50,000/month, 958K successes
- Llama-3.1: $2,000/month, 934K successes
- GPT-4 provides 24K more successes for $48K more cost
- Cost per additional success: $2.00
Recommendation (Cost Focus):
Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement justifies 25x cost.
```
---
#### 5. analyze_results
Analyzes detailed test results and provides optimization recommendations.
**Parameters:**
- `results_repo` (str, required): HuggingFace dataset containing results
- Format: `"username/smoltrace-results-model-timestamp"`
- Must contain "smoltrace-results-" prefix
- `analysis_focus` (str): Focus area for analysis
- Options: `"failures"`, `"performance"`, `"cost"`, `"comprehensive"`
- Default: `"comprehensive"`
- `max_rows` (int): Maximum test cases to analyze
- Range: 10-500
- Default: 100
**Returns:** String containing AI analysis with:
- Failure patterns and root causes
- Performance bottlenecks in specific test cases
- Cost optimization opportunities
- Tool usage patterns
- Task-specific insights (which types work well vs poorly)
- Actionable optimization recommendations
**Example Use Case:**
After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving success rate.
**Example Call:**
```python
result = await analyze_results(
results_repo="kshitij/smoltrace-results-gpt4-20251120",
analysis_focus="failures",
max_rows=100
)
```
**Example Response:**
```
Analysis of Test Results (100 tests analyzed)
Overall Statistics:
- Success Rate: 89% (89/100 tests passed)
- Average Duration: 3.2s per test
- Total Cost: $4.50 ($0.045 per test)
Failure Analysis (11 failures):
1. Tool Not Found (6 failures):
- Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
- Pattern: All failed tests required the 'get_weather' tool
- Root Cause: Tool definition missing or incorrect name
- Fix: Ensure 'get_weather' tool is available in agent's tool list
2. Timeout (3 failures):
- Test IDs: task_034, task_071, task_088
- Pattern: Complex multi-step tasks with >5 tool calls
- Root Cause: Exceeding 30s timeout limit
- Fix: Increase timeout to 60s or simplify complex tasks
3. Incorrect Response (2 failures):
- Test IDs: task_056, task_072
- Pattern: Math calculation tasks
- Root Cause: Model hallucinating numbers instead of using calculator tool
- Fix: Update prompt to emphasize tool usage for calculations
Performance Insights:
- Fast tasks (<2s): 45 tests - Simple single-tool calls
- Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
- Optimal duration: 2-3s for most tasks
Cost Optimization:
- High-cost tests: task_023 ($0.12) - Used 4K tokens
- Low-cost tests: task_087 ($0.008) - Used 180 tokens
- Recommendation: Optimize prompt to reduce token usage by 20%
Recommendations:
1. Add missing 'get_weather' tool → Fixes 6 failures
2. Increase timeout from 30s to 60s → Fixes 3 failures
3. Strengthen calculator tool instruction → Fixes 2 failures
4. Expected improvement: 89% → 100% success rate
```
---
### Token-Optimized Tools
These tools are specifically designed to minimize token usage when querying leaderboard data.
#### 6. get_top_performers
Gets the top N performing models from the leaderboard with 90% token reduction.
**Performance Optimization:** Returns only the top N models instead of loading the full leaderboard dataset (51 runs), resulting in a **90% token reduction**.
**When to Use:** Perfect for queries like "Which model is leading?" or "Show me the top 5 models".
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `metric` (str): Metric to rank by
- Options: `"success_rate"`, `"total_cost_usd"`, `"avg_duration_ms"`, `"co2_emissions_g"`
- Default: `"success_rate"`
- `top_n` (int): Number of top models to return
- Range: 1-20
- Default: 5
**Returns:** JSON string with:
- Metric used for ranking
- Ranking order (ascending/descending)
- Total runs in leaderboard
- Array of top performers with 10 essential fields
**Benefits:**
- ✅ Token Reduction: 90% fewer tokens vs full dataset
- ✅ Ready to Use: Properly formatted JSON
- ✅ Pre-Sorted: Already ranked by chosen metric
- ✅ Essential Data Only: 10 fields vs 20+ in full dataset
**Example Call:**
```python
result = await get_top_performers(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
metric="total_cost_usd",
top_n=3
)
```
**Example Response:**
```json
{
"metric": "total_cost_usd",
"order": "ascending",
"total_runs": 51,
"top_performers": [
{
"run_id": "run_001",
"model": "meta-llama/Llama-3.1-8B",
"success_rate": 93.4,
"total_cost_usd": 0.002,
"avg_duration_ms": 2100,
"agent_type": "both",
"provider": "transformers",
"submitted_by": "kshitij",
"timestamp": "2025-11-20T10:30:00Z",
"total_tests": 100
},
...
]
}
```
---
#### 7. get_leaderboard_summary
Gets high-level leaderboard statistics with 99% token reduction.
**Performance Optimization:** Returns only aggregated statistics instead of raw data, resulting in a **99% token reduction**.
**When to Use:** Perfect for overview queries like "How many runs are in the leaderboard?" or "What's the average success rate?".
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
**Returns:** JSON string with:
- Total runs count
- Unique models and submitters
- Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
- Breakdown by agent type
- Breakdown by provider
- Top 3 models by success rate
**Benefits:**
- ✅ Extreme Token Reduction: 99% fewer tokens
- ✅ Ready to Use: Properly formatted JSON
- ✅ Comprehensive Stats: Averages, distributions, breakdowns
- ✅ Quick Insights: Perfect for overview questions
**Example Call:**
```python
result = await get_leaderboard_summary(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)
```
**Example Response:**
```json
{
"total_runs": 51,
"unique_models": 12,
"unique_submitters": 3,
"overall_stats": {
"avg_success_rate": 89.2,
"best_success_rate": 95.8,
"worst_success_rate": 78.3,
"avg_cost_usd": 0.012,
"avg_duration_ms": 3200,
"total_co2_g": 45.6
},
"by_agent_type": {
"tool": {"count": 20, "avg_success_rate": 88.5},
"code": {"count": 18, "avg_success_rate": 87.2},
"both": {"count": 13, "avg_success_rate": 92.1}
},
"by_provider": {
"litellm": {"count": 30, "avg_success_rate": 91.3},
"transformers": {"count": 21, "avg_success_rate": 86.4}
},
"top_3_models": [
{"model": "openai/gpt-4", "success_rate": 95.8},
{"model": "anthropic/claude-3", "success_rate": 94.1},
{"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
]
}
```
---
### Data Management Tools
#### 8. get_dataset
Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.
**⚠️ Important:** For leaderboard queries, prefer using `get_top_performers()` or `get_leaderboard_summary()` to avoid token bloat!
**Security Restriction:** Only datasets with "smoltrace-" in the repository name are allowed.
**Parameters:**
- `dataset_repo` (str, required): HuggingFace dataset repository
- Must contain "smoltrace-" prefix
- Format: `"username/smoltrace-type-model"`
- `split` (str): Dataset split to load
- Default: `"train"`
- `limit` (int): Maximum rows to return
- Range: 1-200
- Default: 100
**Returns:** JSON string with:
- Total rows in dataset
- List of column names
- Array of data rows (up to `limit`)
**Primary Use Cases:**
- Load `smoltrace-results-*` datasets for test case details
- Load `smoltrace-traces-*` datasets for OpenTelemetry data
- Load `smoltrace-metrics-*` datasets for GPU metrics
- **NOT recommended** for leaderboard queries (use optimized tools)
**Example Call:**
```python
result = await get_dataset(
dataset_repo="kshitij/smoltrace-results-gpt4",
split="train",
limit=50
)
```
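The exact key names of the returned JSON are not pinned down above, so a safe first step is to parse the string and inspect its top-level fields before indexing into it. A minimal, illustrative sketch:
```python
import json

dataset = json.loads(result)  # `result` from the get_dataset call above
# Key names are not fixed in this document; list them before relying on any.
print("Top-level fields:", sorted(dataset.keys()))
```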
---
#### 9. generate_synthetic_dataset
Creates domain-specific test datasets for SMOLTRACE evaluations using AI.
**Parameters:**
- `domain` (str, required): Domain for tasks
- Examples: "e-commerce", "customer service", "finance", "healthcare"
- `tools` (list[str], required): Available tools
- Example: `["search_web", "get_weather", "calculator"]`
- `num_tasks` (int): Number of tasks to generate
- Range: 1-100
- Default: 20
- `difficulty_distribution` (str): Task difficulty mix
- Options: `"balanced"`, `"easy_only"`, `"medium_only"`, `"hard_only"`, `"progressive"`
- Default: `"balanced"`
- `agent_type` (str): Target agent type
- Options: `"tool"`, `"code"`, `"both"`
- Default: `"both"`
**Returns:** JSON string with:
- `dataset_info`: Metadata (domain, tools, counts, timestamp)
- `tasks`: Array of SMOLTRACE-formatted tasks
- `usage_instructions`: Guide for HuggingFace upload and SMOLTRACE usage
**SMOLTRACE Task Format:**
```json
{
"id": "unique_identifier",
"prompt": "Clear, specific task for the agent",
"expected_tool": "tool_name",
"expected_tool_calls": 1,
"difficulty": "easy|medium|hard",
"agent_type": "tool|code",
"expected_keywords": ["keyword1", "keyword2"]
}
```
**Difficulty Calibration:**
- **Easy** (40%): Single tool call, straightforward input
- **Medium** (40%): Multiple tool calls OR complex input parsing
- **Hard** (20%): Multiple tools, complex reasoning, edge cases
**Enterprise Use Cases:**
- Custom Tools: Benchmark proprietary APIs
- Industry-Specific: Generate tasks for finance, healthcare, legal
- Internal Workflows: Test company-specific processes
**Example Call:**
```python
result = await generate_synthetic_dataset(
domain="customer service",
tools=["search_knowledge_base", "create_ticket", "send_email"],
num_tasks=50,
difficulty_distribution="balanced",
agent_type="tool"
)
```
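To sanity-check generated output against the task format and difficulty calibration above, a small illustrative sketch like the following can help (it assumes the generator's JSON string and the task fields shown earlier):
```python
import json
from collections import Counter

REQUIRED_FIELDS = {"id", "prompt", "expected_tool", "expected_tool_calls",
                   "difficulty", "agent_type", "expected_keywords"}

def check_tasks(generated_json: str) -> None:
    # "tasks" is the documented key in the generator's return value
    tasks = json.loads(generated_json)["tasks"]
    for task in tasks:
        missing = REQUIRED_FIELDS.difference(task)
        if missing:
            print(f"Task {task.get('id', '?')} is missing fields: {missing}")
    # Rough view of the difficulty mix (e.g. ~40/40/20 for "balanced")
    print(Counter(task["difficulty"] for task in tasks))
```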
---
#### 10. push_dataset_to_hub
Uploads generated datasets to the HuggingFace Hub with proper formatting.
**Parameters:**
- `dataset_name` (str, required): Repository name on HuggingFace
- Format: `"username/my-dataset"`
- `data` (str or list, required): Dataset content
- Can be JSON string or list of dictionaries
- `description` (str): Dataset description for card
- Default: Auto-generated
- `private` (bool): Make dataset private
- Default: `False`
**Returns:** Success message with dataset URL
**Example Workflow:**
1. Generate synthetic dataset with `generate_synthetic_dataset`
2. Review and modify tasks if needed
3. Upload to HuggingFace with `push_dataset_to_hub`
4. Use in SMOLTRACE evaluations or share with team
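A minimal sketch of steps 1-3, written in the same call style as the other examples in this document (it assumes the generator's JSON string is parsed for its documented `tasks` array):
```python
import json

# Step 1: generate tasks
generated = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
)
tasks = json.loads(generated)["tasks"]  # documented key in the generator's output

# Step 2: review or filter the tasks here as needed

# Step 3: upload to the HuggingFace Hub
result = await push_dataset_to_hub(
    dataset_name="kshitij/my-custom-evaluation",
    data=tasks,
    description="Custom customer-service evaluation tasks",
)
```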
**Example Call:**
```python
result = await push_dataset_to_hub(
dataset_name="kshitij/my-custom-evaluation",
data=generated_tasks,
description="Custom evaluation dataset for e-commerce agents",
private=False
)
```
---
#### 11. generate_prompt_template
Generates a customized smolagents prompt template for a specific domain and tool set.
**Parameters:**
- `domain` (str, required): Domain for the prompt template
- Examples: `"finance"`, `"healthcare"`, `"customer_support"`, `"e-commerce"`
- `tool_names` (str, required): Comma-separated list of tool names
- Format: `"tool1,tool2,tool3"`
- Example: `"get_stock_price,calculate_roi,fetch_company_info"`
- `agent_type` (str): Agent type
- Options: `"tool"` (ToolCallingAgent), `"code"` (CodeAgent)
- Default: `"tool"`
**Returns:** JSON response containing:
- Customized YAML prompt template
- Metadata (domain, tools, agent_type, timestamp)
- Usage instructions
**Use Case:**
When you generate synthetic datasets with `generate_synthetic_dataset`, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.
**Integration:**
The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.
**Example Call:**
```python
result = await generate_prompt_template(
domain="customer_support",
tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
agent_type="tool"
)
```
**Example Response:**
```json
{
"prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n You are a helpful customer support agent...\n \n Available tools:\n - search_knowledge_base: Search the knowledge base...\n - create_ticket: Create a support ticket...\n ...",
"metadata": {
"domain": "customer_support",
"tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
"agent_type": "tool",
"base_template": "ToolCallingAgent",
"timestamp": "2025-11-21T10:30:00Z"
},
"usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
}
```
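Following step 1 of the `usage_instructions` above, the returned template can be written straight to a YAML file. A minimal sketch, assuming `result` holds the JSON string returned by the call above:
```python
import json

response = json.loads(result)  # `result` from the generate_prompt_template call above
with open("customer_support_prompt.yaml", "w") as f:
    f.write(response["prompt_template"])  # documented key in the example response
```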
---
## MCP Resources
Resources provide direct data access without AI analysis. Access them via their URI schemes.
### 1. leaderboard://{repo}
Direct access to raw leaderboard data in JSON format.
**URI Format:**
```
leaderboard://username/dataset-name
```
**Example:**
```
GET leaderboard://kshitijthakkar/smoltrace-leaderboard
```
**Returns:** JSON array with all evaluation runs, including:
- run_id, model, agent_type, provider
- success_rate, total_tests, successful_tests, failed_tests
- avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
- results_dataset, traces_dataset, metrics_dataset (references)
- timestamp, submitted_by, hf_job_id
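A minimal sketch of reading this resource from a client, assuming the official MCP Python SDK (`mcp` package) and an already-connected `ClientSession`; the exact shape of the returned contents may vary by SDK version:
```python
import json
from mcp import ClientSession  # official MCP Python SDK
from pydantic import AnyUrl

async def best_run(session: ClientSession) -> None:
    # `session` is assumed to already be connected to the TraceMind MCP server.
    result = await session.read_resource(
        AnyUrl("leaderboard://kshitijthakkar/smoltrace-leaderboard")
    )
    # The resource payload arrives as JSON text; field names follow the list above.
    runs = json.loads(result.contents[0].text)
    best = max(runs, key=lambda run: run["success_rate"])
    print(best["run_id"], best["model"], best["success_rate"])
```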
---
### 2. trace://{trace_id}/{repo}
Direct access to trace data with OpenTelemetry spans.
**URI Format:**
```
trace://trace_id/username/dataset-name
```
**Example:**
```
GET trace://trace_abc123/kshitij/smoltrace-traces-gpt4
```
**Returns:** JSON with:
- traceId
- spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)
---
### 3. cost://model/{model_name}
Model pricing and hardware cost information.
**URI Format:**
```
cost://model/provider/model-name
```
**Example:**
```
GET cost://model/openai/gpt-4
```
**Returns:** JSON with:
- Model pricing (input/output token costs)
- Recommended hardware tier
- Estimated compute costs
- CO2 emissions per 1K tokens
---
## MCP Prompts
Prompts provide reusable templates for standardized interactions.
### 1. analysis_prompt
Templates for different analysis types.
**Parameters:**
- `analysis_type` (str): Type of analysis
- Options: `"leaderboard"`, `"cost"`, `"performance"`, `"trace"`
- `focus_area` (str): Specific focus
- Options: `"overall"`, `"cost"`, `"accuracy"`, `"speed"`, `"eco"`
- `detail_level` (str): Level of detail
- Options: `"summary"`, `"detailed"`, `"comprehensive"`
**Returns:** Formatted prompt string for use with AI tools
**Example:**
```python
prompt = analysis_prompt(
analysis_type="leaderboard",
focus_area="cost",
detail_level="detailed"
)
# Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."
```
---
### 2. debug_prompt
Templates for debugging scenarios.
**Parameters:**
- `debug_type` (str): Type of debugging
- Options: `"failure"`, `"performance"`, `"tool_calling"`, `"reasoning"`
- `context` (str): Additional context
- Options: `"test_failure"`, `"timeout"`, `"unexpected_tool"`, `"reasoning_loop"`
**Returns:** Formatted prompt string
**Example:**
```python
prompt = debug_prompt(
debug_type="performance",
context="tool_calling"
)
# Returns: "Analyze tool calling performance. Identify which tools are slow..."
```
---
### 3. optimization_prompt
Templates for optimization goals.
**Parameters:**
- `optimization_goal` (str): Optimization target
- Options: `"cost"`, `"speed"`, `"accuracy"`, `"co2"`
- `constraints` (str): Constraints to respect
- Options: `"maintain_quality"`, `"no_accuracy_loss"`, `"budget_limit"`, `"time_limit"`
**Returns:** Formatted prompt string
**Example:**
```python
prompt = optimization_prompt(
optimization_goal="cost",
constraints="maintain_quality"
)
# Returns: "Analyze this evaluation setup and recommend cost optimizations..."
```
---
## Error Handling
### Common Error Responses
**Invalid Dataset Repository:**
```json
{
"error": "Dataset must contain 'smoltrace-' prefix for security",
"provided": "username/invalid-dataset"
}
```
**Dataset Not Found:**
```json
{
"error": "Dataset not found on HuggingFace",
"repository": "username/smoltrace-nonexistent"
}
```
**API Rate Limit:**
```json
{
"error": "Gemini API rate limit exceeded",
"retry_after": 60
}
```
**Invalid Parameters:**
```json
{
"error": "Invalid parameter value",
"parameter": "top_n",
"value": 50,
"allowed_range": "1-20"
}
```
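Tool results are strings that may be plain analysis text or JSON, so a defensive pattern is to attempt JSON parsing and check for an `error` key before using the result. A minimal sketch (the helper name is illustrative, not part of the server API):
```python
import json

def check_for_error(result: str) -> dict | None:
    """Return the parsed error payload if `result` matches one of the error
    JSON responses documented above, otherwise None."""
    try:
        payload = json.loads(result)
    except (json.JSONDecodeError, TypeError):
        return None  # plain-text analysis output, not an error object
    if isinstance(payload, dict) and "error" in payload:
        return payload
    return None

# Usage with any tool result string:
# error = check_for_error(result)
# if error and "retry_after" in error:
#     print(f"Rate limited, retry in {error['retry_after']}s")
```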
---
## Best Practices
### 1. Token Optimization
**DO:**
- Use `get_top_performers()` for "top N" queries (90% token reduction)
- Use `get_leaderboard_summary()` for overview queries (99% token reduction)
- Set appropriate `limit` when using `get_dataset()`
**DON'T:**
- Use `get_dataset()` for leaderboard queries (loads all 51 runs)
- Request more data than needed
- Ignore token optimization tools
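As an illustration of the first two points, in the same call style used throughout this document:
```python
# Preferred: aggregated statistics only (documented ~99% token reduction)
summary = await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)

# Avoid: pulling raw leaderboard rows just to answer an overview question
# raw = await get_dataset(
#     dataset_repo="kshitijthakkar/smoltrace-leaderboard",
#     limit=200,
# )
```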
### 2. AI Tool Usage
**DO:**
- Use AI tools (`analyze_leaderboard`, `debug_trace`) for complex analysis
- Provide specific questions to `debug_trace` for focused answers
- Use `focus` parameter in `compare_runs` for targeted comparisons
**DON'T:**
- Use AI tools for simple data retrieval (use resources instead)
- Make vague requests (be specific for better results)
### 3. Dataset Security
**DO:**
- Only use datasets with "smoltrace-" prefix
- Verify dataset exists before requesting
- Use public datasets or authenticate for private ones
**DON'T:**
- Try to access arbitrary HuggingFace datasets
- Share private dataset URLs without authentication
### 4. Cost Management
**DO:**
- Use `estimate_cost` before running large evaluations
- Compare cost estimates across different models
- Consider token-optimized tools to reduce API costs
**DON'T:**
- Skip cost estimation for expensive operations
- Ignore hardware recommendations
- Overlook CO2 emissions in decision-making
---
## Support
For issues or questions:
- 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
- 💬 HF Discord: `#agents-mcp-hackathon-winter25`
- 🏷️ Tag: `building-mcp-track-enterprise`