# TraceMind MCP Server - Complete API Documentation
This document provides a comprehensive API reference for all MCP components exposed by the TraceMind MCP Server.
## Table of Contents
- [MCP Tools (11)](#mcp-tools)
- [AI-Powered Analysis Tools](#ai-powered-analysis-tools)
- [Token-Optimized Tools](#token-optimized-tools)
- [Data Management Tools](#data-management-tools)
- [MCP Resources (3)](#mcp-resources)
- [MCP Prompts (3)](#mcp-prompts)
- [Error Handling](#error-handling)
- [Best Practices](#best-practices)
---
## MCP Tools
### AI-Powered Analysis Tools
These tools use Google Gemini 2.5 Flash to provide intelligent, context-aware analysis of agent evaluation data.
#### 1. analyze_leaderboard
Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- Format: `"username/dataset-name"`
- `metric_focus` (str): Primary metric to analyze
- Options: `"overall"`, `"accuracy"`, `"cost"`, `"latency"`, `"co2"`
- Default: `"overall"`
- `time_range` (str): Time period to analyze
- Options: `"last_week"`, `"last_month"`, `"all_time"`
- Default: `"last_week"`
- `top_n` (int): Number of top models to highlight
- Range: 1-20
- Default: 5
**Returns:** String containing AI-generated analysis with:
- Top performers by selected metric
- Trade-off analysis (e.g., accuracy vs cost)
- Trend identification
- Actionable recommendations
**Example Use Case:**
Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
**Example Call:**
```python
result = await analyze_leaderboard(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
metric_focus="cost",
time_range="last_week",
top_n=5
)
```
**Example Response:**
```
Based on 247 evaluations in the past week:
Top Performers (Cost Focus):
1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy
Trade-off Analysis:
- Llama-3.1 offers best cost/performance ratio at 25x cheaper than GPT-4
- GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
- For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4
Recommendations:
- Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
- Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
- Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)
```
---
#### 2. debug_trace
Analyzes OpenTelemetry trace data and answers specific questions about agent execution.
**Parameters:**
- `trace_dataset` (str): HuggingFace dataset containing traces
- Format: `"username/smoltrace-traces-model"`
- Must contain "smoltrace-" prefix
- `trace_id` (str): Specific trace ID to analyze
- Format: `"trace_abc123"`
- `question` (str): Question about the trace
- Examples: "Why was tool X called twice?", "Which step took the most time?"
- `include_metrics` (bool): Include GPU metrics in analysis
- Default: `True`
**Returns:** String containing AI analysis of the trace with:
- Answer to the specific question
- Relevant span details
- Performance insights
- GPU metrics (if available and requested)
**Example Use Case:**
When an agent test fails, understand exactly what happened without manually parsing trace spans.
**Example Call:**
```python
result = await debug_trace(
trace_dataset="kshitij/smoltrace-traces-gpt4",
trace_id="trace_abc123",
question="Why was the search tool called twice?",
include_metrics=True
)
```
**Example Response:**
```
Based on trace analysis:
Answer:
The agent called the search_web tool twice due to an iterative reasoning pattern:
1. First call (span_003 at 14:23:19.000):
- Query: "weather in Tokyo"
- Duration: 890ms
- Result: 5 results, oldest was 2 days old
2. Second call (span_005 at 14:23:21.200):
- Query: "latest weather in Tokyo"
- Duration: 1200ms
- Modified reasoning: LLM determined first results were stale
Performance Impact:
- Added 2.09s to total execution time
- Cost increase: +$0.0003 (tokens for second reasoning step)
- This is normal behavior for tool-calling agents with iterative reasoning
GPU Metrics:
- N/A (API model, no GPU used)
```
---
#### 3. estimate_cost
Predicts costs, duration, and environmental impact before running evaluations.
**Parameters:**
- `model` (str, required): Model name to evaluate
- Format: `"provider/model-name"` (e.g., `"openai/gpt-4"`, `"meta-llama/Llama-3.1-8B"`)
- `agent_type` (str): Type of agent evaluation
- Options: `"tool"`, `"code"`, `"both"`
- Default: `"both"`
- `num_tests` (int): Number of test cases
- Range: 1-10000
- Default: 100
- `hardware` (str): Hardware type
- Options: `"auto"`, `"cpu"`, `"gpu_a10"`, `"gpu_h200"`
- Default: `"auto"` (auto-selects based on model)
**Returns:** String containing cost estimate with:
- LLM API costs (for API models)
- HuggingFace Jobs compute costs (for local models)
- Estimated duration
- CO2 emissions estimate
- Hardware recommendations
**Example Use Case:**
Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
**Example Call:**
```python
result = await estimate_cost(
model="openai/gpt-4",
agent_type="both",
num_tests=1000,
hardware="auto"
)
```
**Example Response:**
```
Cost Estimate for openai/gpt-4:
LLM API Costs:
- Estimated tokens per test: 1,500
- Token cost: $0.03/1K input, $0.06/1K output
- Total LLM cost: $50.00 (1000 tests)
Compute Costs:
- Recommended hardware: cpu-basic (API model)
- HF Jobs cost: ~$0.05/hr
- Estimated duration: 45 minutes
- Total compute cost: $0.04
Total Cost: $50.04
Cost per test: $0.05
CO2 emissions: ~0.5g (API calls, minimal compute)
Recommendations:
- This is an API model, CPU hardware is sufficient
- For cost optimization, consider Llama-3.1-8B (25x cheaper)
- Estimated runtime: 45 minutes for 1000 tests
```
---
#### 4. compare_runs
Compares two evaluation runs with AI-powered analysis across multiple dimensions.
**Parameters:**
- `run_id_1` (str, required): First run ID from leaderboard
- `run_id_2` (str, required): Second run ID from leaderboard
- `leaderboard_repo` (str): Leaderboard dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `focus` (str): Comparison focus area
- Options:
- `"comprehensive"`: All dimensions
- `"cost"`: Cost efficiency and ROI
- `"performance"`: Speed and accuracy trade-offs
- `"eco_friendly"`: Environmental impact
- Default: `"comprehensive"`
**Returns:** String containing AI comparison with:
- Success rate comparison with statistical significance
- Cost efficiency analysis
- Speed comparison
- Environmental impact (CO2 emissions)
- GPU efficiency (for GPU jobs)
**Example Use Case:**
After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.
**Example Call:**
```python
result = await compare_runs(
run_id_1="run_abc123",
run_id_2="run_def456",
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
focus="cost"
)
```
**Example Response:**
```
Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)
Success Rates:
- GPT-4: 95.8% (96/100 tests)
- Llama-3.1: 93.4% (93/100 tests)
- Difference: +2.4% for GPT-4 (statistically significant, p<0.05)
Cost Efficiency:
- GPT-4: $0.05 per test, $0.052 per successful test
- Llama-3.1: $0.002 per test, $0.0021 per successful test
- Cost ratio: GPT-4 is 25x more expensive
ROI Analysis:
- For 1M evaluations/month:
- GPT-4: $50,000/month, 958K successes
- Llama-3.1: $2,000/month, 934K successes
- GPT-4 provides 24K more successes for $48K more cost
- Cost per additional success: $2.00
Recommendation (Cost Focus):
Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement justifies 25x cost.
```
---
#### 5. analyze_results
Analyzes detailed test results and provides optimization recommendations.
**Parameters:**
- `results_repo` (str, required): HuggingFace dataset containing results
- Format: `"username/smoltrace-results-model-timestamp"`
- Must contain "smoltrace-results-" prefix
- `analysis_focus` (str): Focus area for analysis
- Options: `"failures"`, `"performance"`, `"cost"`, `"comprehensive"`
- Default: `"comprehensive"`
- `max_rows` (int): Maximum test cases to analyze
- Range: 10-500
- Default: 100
**Returns:** String containing AI analysis with:
- Failure patterns and root causes
- Performance bottlenecks in specific test cases
- Cost optimization opportunities
- Tool usage patterns
- Task-specific insights (which types work well vs poorly)
- Actionable optimization recommendations
**Example Use Case:**
After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving success rate.
**Example Call:**
```python
result = await analyze_results(
results_repo="kshitij/smoltrace-results-gpt4-20251120",
analysis_focus="failures",
max_rows=100
)
```
**Example Response:**
```
Analysis of Test Results (100 tests analyzed)
Overall Statistics:
- Success Rate: 89% (89/100 tests passed)
- Average Duration: 3.2s per test
- Total Cost: $4.50 ($0.045 per test)
Failure Analysis (11 failures):
1. Tool Not Found (6 failures):
- Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
- Pattern: All failed tests required the 'get_weather' tool
- Root Cause: Tool definition missing or incorrect name
- Fix: Ensure 'get_weather' tool is available in agent's tool list
2. Timeout (3 failures):
- Test IDs: task_034, task_071, task_088
- Pattern: Complex multi-step tasks with >5 tool calls
- Root Cause: Exceeding 30s timeout limit
- Fix: Increase timeout to 60s or simplify complex tasks
3. Incorrect Response (2 failures):
- Test IDs: task_056, task_072
- Pattern: Math calculation tasks
- Root Cause: Model hallucinating numbers instead of using calculator tool
- Fix: Update prompt to emphasize tool usage for calculations
Performance Insights:
- Fast tasks (<2s): 45 tests - Simple single-tool calls
- Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
- Optimal duration: 2-3s for most tasks
Cost Optimization:
- High-cost tests: task_023 ($0.12) - Used 4K tokens
- Low-cost tests: task_087 ($0.008) - Used 180 tokens
- Recommendation: Optimize prompt to reduce token usage by 20%
Recommendations:
1. Add missing 'get_weather' tool → Fixes 6 failures
2. Increase timeout from 30s to 60s → Fixes 3 failures
3. Strengthen calculator tool instruction → Fixes 2 failures
4. Expected improvement: 89% → 100% success rate
```
---
### Token-Optimized Tools
These tools are specifically designed to minimize token usage when querying leaderboard data.
#### 6. get_top_performers
Gets the top N performing models from the leaderboard with 90% token reduction.
**Performance Optimization:** Returns only the top N models instead of loading the full leaderboard dataset (51 runs), resulting in a **90% token reduction**.
**When to Use:** Perfect for queries like "Which model is leading?" or "Show me the top 5 models".
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
- `metric` (str): Metric to rank by
- Options: `"success_rate"`, `"total_cost_usd"`, `"avg_duration_ms"`, `"co2_emissions_g"`
- Default: `"success_rate"`
- `top_n` (int): Number of top models to return
- Range: 1-20
- Default: 5
**Returns:** JSON string with:
- Metric used for ranking
- Ranking order (ascending/descending)
- Total runs in leaderboard
- Array of top performers with 10 essential fields
**Benefits:**
- ✅ Token Reduction: 90% fewer tokens vs full dataset
- ✅ Ready to Use: Properly formatted JSON
- ✅ Pre-Sorted: Already ranked by chosen metric
- ✅ Essential Data Only: 10 fields vs 20+ in full dataset
**Example Call:**
```python
result = await get_top_performers(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
metric="total_cost_usd",
top_n=3
)
```
**Example Response:**
```json
{
"metric": "total_cost_usd",
"order": "ascending",
"total_runs": 51,
"top_performers": [
{
"run_id": "run_001",
"model": "meta-llama/Llama-3.1-8B",
"success_rate": 93.4,
"total_cost_usd": 0.002,
"avg_duration_ms": 2100,
"agent_type": "both",
"provider": "transformers",
"submitted_by": "kshitij",
"timestamp": "2025-11-20T10:30:00Z",
"total_tests": 100
},
...
]
}
```
---
#### 7. get_leaderboard_summary
Gets high-level leaderboard statistics with 99% token reduction.
**Performance Optimization:** Returns only aggregated statistics instead of raw data, resulting in a **99% token reduction**.
**When to Use:** Perfect for overview queries like "How many runs are in the leaderboard?" or "What's the average success rate?".
**Parameters:**
- `leaderboard_repo` (str): HuggingFace dataset repository
- Default: `"kshitijthakkar/smoltrace-leaderboard"`
**Returns:** JSON string with:
- Total runs count
- Unique models and submitters
- Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
- Breakdown by agent type
- Breakdown by provider
- Top 3 models by success rate
**Benefits:**
- ✅ Extreme Token Reduction: 99% fewer tokens
- ✅ Ready to Use: Properly formatted JSON
- ✅ Comprehensive Stats: Averages, distributions, breakdowns
- ✅ Quick Insights: Perfect for overview questions
**Example Call:**
```python
result = await get_leaderboard_summary(
leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)
```
**Example Response:**
```json
{
"total_runs": 51,
"unique_models": 12,
"unique_submitters": 3,
"overall_stats": {
"avg_success_rate": 89.2,
"best_success_rate": 95.8,
"worst_success_rate": 78.3,
"avg_cost_usd": 0.012,
"avg_duration_ms": 3200,
"total_co2_g": 45.6
},
"by_agent_type": {
"tool": {"count": 20, "avg_success_rate": 88.5},
"code": {"count": 18, "avg_success_rate": 87.2},
"both": {"count": 13, "avg_success_rate": 92.1}
},
"by_provider": {
"litellm": {"count": 30, "avg_success_rate": 91.3},
"transformers": {"count": 21, "avg_success_rate": 86.4}
},
"top_3_models": [
{"model": "openai/gpt-4", "success_rate": 95.8},
{"model": "anthropic/claude-3", "success_rate": 94.1},
{"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
]
}
```
---
### Data Management Tools
#### 8. get_dataset
Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.
**⚠️ Important:** For leaderboard queries, prefer using `get_top_performers()` or `get_leaderboard_summary()` to avoid token bloat!
**Security Restriction:** Only datasets with "smoltrace-" in the repository name are allowed.
**Parameters:**
- `dataset_repo` (str, required): HuggingFace dataset repository
- Must contain "smoltrace-" prefix
- Format: `"username/smoltrace-type-model"`
- `split` (str): Dataset split to load
- Default: `"train"`
- `limit` (int): Maximum rows to return
- Range: 1-200
- Default: 100
**Returns:** JSON string with:
- Total rows in dataset
- List of column names
- Array of data rows (up to `limit`)
**Primary Use Cases:**
- Load `smoltrace-results-*` datasets for test case details
- Load `smoltrace-traces-*` datasets for OpenTelemetry data
- Load `smoltrace-metrics-*` datasets for GPU metrics
- **NOT recommended** for leaderboard queries (use optimized tools)
**Example Call:**
```python
result = await get_dataset(
dataset_repo="kshitij/smoltrace-results-gpt4",
split="train",
limit=50
)
```
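The exact key names of the returned JSON are not pinned down above, so a safe first step is to parse the string and inspect its top-level fields before indexing into it. A minimal, illustrative sketch:
```python
import json

dataset = json.loads(result)  # `result` from the get_dataset call above
# Key names are not fixed in this document; list them before relying on any.
print("Top-level fields:", sorted(dataset.keys()))
```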
---
#### 9. generate_synthetic_dataset
Creates domain-specific test datasets for SMOLTRACE evaluations using AI.
**Parameters:**
- `domain` (str, required): Domain for tasks
- Examples: "e-commerce", "customer service", "finance", "healthcare"
- `tools` (list[str], required): Available tools
- Example: `["search_web", "get_weather", "calculator"]`
- `num_tasks` (int): Number of tasks to generate
- Range: 1-100
- Default: 20
- `difficulty_distribution` (str): Task difficulty mix
- Options: `"balanced"`, `"easy_only"`, `"medium_only"`, `"hard_only"`, `"progressive"`
- Default: `"balanced"`
- `agent_type` (str): Target agent type
- Options: `"tool"`, `"code"`, `"both"`
- Default: `"both"`
**Returns:** JSON string with:
- `dataset_info`: Metadata (domain, tools, counts, timestamp)
- `tasks`: Array of SMOLTRACE-formatted tasks
- `usage_instructions`: Guide for HuggingFace upload and SMOLTRACE usage
**SMOLTRACE Task Format:**
```json
{
"id": "unique_identifier",
"prompt": "Clear, specific task for the agent",
"expected_tool": "tool_name",
"expected_tool_calls": 1,
"difficulty": "easy|medium|hard",
"agent_type": "tool|code",
"expected_keywords": ["keyword1", "keyword2"]
}
```
**Difficulty Calibration:**
- **Easy** (40%): Single tool call, straightforward input
- **Medium** (40%): Multiple tool calls OR complex input parsing
- **Hard** (20%): Multiple tools, complex reasoning, edge cases
**Enterprise Use Cases:**
- Custom Tools: Benchmark proprietary APIs
- Industry-Specific: Generate tasks for finance, healthcare, legal
- Internal Workflows: Test company-specific processes
**Example Call:**
```python
result = await generate_synthetic_dataset(
domain="customer service",
tools=["search_knowledge_base", "create_ticket", "send_email"],
num_tasks=50,
difficulty_distribution="balanced",
agent_type="tool"
)
```
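To sanity-check generated output against the task format and difficulty calibration above, a small illustrative sketch like the following can help (it assumes the generator's JSON string and the task fields shown earlier):
```python
import json
from collections import Counter

REQUIRED_FIELDS = {"id", "prompt", "expected_tool", "expected_tool_calls",
                   "difficulty", "agent_type", "expected_keywords"}

def check_tasks(generated_json: str) -> None:
    # "tasks" is the documented key in the generator's return value
    tasks = json.loads(generated_json)["tasks"]
    for task in tasks:
        missing = REQUIRED_FIELDS.difference(task)
        if missing:
            print(f"Task {task.get('id', '?')} is missing fields: {missing}")
    # Rough view of the difficulty mix (e.g. ~40/40/20 for "balanced")
    print(Counter(task["difficulty"] for task in tasks))
```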
---
#### 10. push_dataset_to_hub
Uploads generated datasets to the HuggingFace Hub with proper formatting.
**Parameters:**
- `dataset_name` (str, required): Repository name on HuggingFace
- Format: `"username/my-dataset"`
- `data` (str or list, required): Dataset content
- Can be JSON string or list of dictionaries
- `description` (str): Dataset description for card
- Default: Auto-generated
- `private` (bool): Make dataset private
- Default: `False`
**Returns:** Success message with dataset URL
**Example Workflow:**
1. Generate synthetic dataset with `generate_synthetic_dataset`
2. Review and modify tasks if needed
3. Upload to HuggingFace with `push_dataset_to_hub`
4. Use in SMOLTRACE evaluations or share with team
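A minimal sketch of steps 1-3, written in the same call style as the other examples in this document (it assumes the generator's JSON string is parsed for its documented `tasks` array):
```python
import json

# Step 1: generate tasks
generated = await generate_synthetic_dataset(
    domain="customer service",
    tools=["search_knowledge_base", "create_ticket", "send_email"],
    num_tasks=50,
)
tasks = json.loads(generated)["tasks"]  # documented key in the generator's output

# Step 2: review or filter the tasks here as needed

# Step 3: upload to the HuggingFace Hub
result = await push_dataset_to_hub(
    dataset_name="kshitij/my-custom-evaluation",
    data=tasks,
    description="Custom customer-service evaluation tasks",
)
```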
**Example Call:**
```python
result = await push_dataset_to_hub(
dataset_name="kshitij/my-custom-evaluation",
data=generated_tasks,
description="Custom evaluation dataset for e-commerce agents",
private=False
)
```
---
#### 11. generate_prompt_template
Generates a customized smolagents prompt template for a specific domain and tool set.
**Parameters:**
- `domain` (str, required): Domain for the prompt template
- Examples: `"finance"`, `"healthcare"`, `"customer_support"`, `"e-commerce"`
- `tool_names` (str, required): Comma-separated list of tool names
- Format: `"tool1,tool2,tool3"`
- Example: `"get_stock_price,calculate_roi,fetch_company_info"`
- `agent_type` (str): Agent type
- Options: `"tool"` (ToolCallingAgent), `"code"` (CodeAgent)
- Default: `"tool"`
**Returns:** JSON response containing:
- Customized YAML prompt template
- Metadata (domain, tools, agent_type, timestamp)
- Usage instructions
**Use Case:**
When you generate synthetic datasets with `generate_synthetic_dataset`, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.
**Integration:**
The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.
**Example Call:**
```python
result = await generate_prompt_template(
domain="customer_support",
tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
agent_type="tool"
)
```
**Example Response:**
```json
{
"prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n You are a helpful customer support agent...\n \n Available tools:\n - search_knowledge_base: Search the knowledge base...\n - create_ticket: Create a support ticket...\n ...",
"metadata": {
"domain": "customer_support",
"tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
"agent_type": "tool",
"base_template": "ToolCallingAgent",
"timestamp": "2025-11-21T10:30:00Z"
},
"usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
}
```
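Following step 1 of the `usage_instructions` above, the returned template can be written straight to a YAML file. A minimal sketch, assuming `result` holds the JSON string returned by the call above:
```python
import json

response = json.loads(result)  # `result` from the generate_prompt_template call above
with open("customer_support_prompt.yaml", "w") as f:
    f.write(response["prompt_template"])  # documented key in the example response
```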
---
## MCP Resources
Resources provide direct data access without AI analysis. Access them via their URI schemes.
### 1. leaderboard://{repo}
Direct access to raw leaderboard data in JSON format.
**URI Format:**
```
leaderboard://username/dataset-name
```
**Example:**
```
GET leaderboard://kshitijthakkar/smoltrace-leaderboard
```
**Returns:** JSON array with all evaluation runs, including:
- run_id, model, agent_type, provider
- success_rate, total_tests, successful_tests, failed_tests
- avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
- results_dataset, traces_dataset, metrics_dataset (references)
- timestamp, submitted_by, hf_job_id
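A minimal sketch of reading this resource from a client, assuming the official MCP Python SDK (`mcp` package) and an already-connected `ClientSession`; the exact shape of the returned contents may vary by SDK version:
```python
import json
from mcp import ClientSession  # official MCP Python SDK
from pydantic import AnyUrl

async def best_run(session: ClientSession) -> None:
    # `session` is assumed to already be connected to the TraceMind MCP server.
    result = await session.read_resource(
        AnyUrl("leaderboard://kshitijthakkar/smoltrace-leaderboard")
    )
    # The resource payload arrives as JSON text; field names follow the list above.
    runs = json.loads(result.contents[0].text)
    best = max(runs, key=lambda run: run["success_rate"])
    print(best["run_id"], best["model"], best["success_rate"])
```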
---
### 2. trace://{trace_id}/{repo}
Direct access to trace data with OpenTelemetry spans.
**URI Format:**
```
trace://trace_id/username/dataset-name
```
**Example:**
```
GET trace://trace_abc123/kshitij/smoltrace-traces-gpt4
```
**Returns:** JSON with:
- traceId
- spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)
---
### 3. cost://model/{model_name}
Model pricing and hardware cost information.
**URI Format:**
```
cost://model/provider/model-name
```
**Example:**
```
GET cost://model/openai/gpt-4
```
**Returns:** JSON with:
- Model pricing (input/output token costs)
- Recommended hardware tier
- Estimated compute costs
- CO2 emissions per 1K tokens
---
## MCP Prompts
Prompts provide reusable templates for standardized interactions.
### 1. analysis_prompt
Templates for different analysis types.
**Parameters:**
- `analysis_type` (str): Type of analysis
- Options: `"leaderboard"`, `"cost"`, `"performance"`, `"trace"`
- `focus_area` (str): Specific focus
- Options: `"overall"`, `"cost"`, `"accuracy"`, `"speed"`, `"eco"`
- `detail_level` (str): Level of detail
- Options: `"summary"`, `"detailed"`, `"comprehensive"`
**Returns:** Formatted prompt string for use with AI tools
**Example:**
```python
prompt = analysis_prompt(
analysis_type="leaderboard",
focus_area="cost",
detail_level="detailed"
)
# Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."
```
---
### 2. debug_prompt
Templates for debugging scenarios.
**Parameters:**
- `debug_type` (str): Type of debugging
- Options: `"failure"`, `"performance"`, `"tool_calling"`, `"reasoning"`
- `context` (str): Additional context
- Options: `"test_failure"`, `"timeout"`, `"unexpected_tool"`, `"reasoning_loop"`
**Returns:** Formatted prompt string
**Example:**
```python
prompt = debug_prompt(
debug_type="performance",
context="tool_calling"
)
# Returns: "Analyze tool calling performance. Identify which tools are slow..."
```
---
### 3. optimization_prompt
Templates for optimization goals.
**Parameters:**
- `optimization_goal` (str): Optimization target
- Options: `"cost"`, `"speed"`, `"accuracy"`, `"co2"`
- `constraints` (str): Constraints to respect
- Options: `"maintain_quality"`, `"no_accuracy_loss"`, `"budget_limit"`, `"time_limit"`
**Returns:** Formatted prompt string
**Example:**
```python
prompt = optimization_prompt(
optimization_goal="cost",
constraints="maintain_quality"
)
# Returns: "Analyze this evaluation setup and recommend cost optimizations..."
```
---
## Error Handling
### Common Error Responses
**Invalid Dataset Repository:**
```json
{
"error": "Dataset must contain 'smoltrace-' prefix for security",
"provided": "username/invalid-dataset"
}
```
**Dataset Not Found:**
```json
{
"error": "Dataset not found on HuggingFace",
"repository": "username/smoltrace-nonexistent"
}
```
**API Rate Limit:**
```json
{
"error": "Gemini API rate limit exceeded",
"retry_after": 60
}
```
**Invalid Parameters:**
```json
{
"error": "Invalid parameter value",
"parameter": "top_n",
"value": 50,
"allowed_range": "1-20"
}
```
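Tool results are strings that may be plain analysis text or JSON, so a defensive pattern is to attempt JSON parsing and check for an `error` key before using the result. A minimal sketch (the helper name is illustrative, not part of the server API):
```python
import json

def check_for_error(result: str) -> dict | None:
    """Return the parsed error payload if `result` matches one of the error
    JSON responses documented above, otherwise None."""
    try:
        payload = json.loads(result)
    except (json.JSONDecodeError, TypeError):
        return None  # plain-text analysis output, not an error object
    if isinstance(payload, dict) and "error" in payload:
        return payload
    return None

# Usage with any tool result string:
# error = check_for_error(result)
# if error and "retry_after" in error:
#     print(f"Rate limited, retry in {error['retry_after']}s")
```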
---
## Best Practices
### 1. Token Optimization
**DO:**
- Use `get_top_performers()` for "top N" queries (90% token reduction)
- Use `get_leaderboard_summary()` for overview queries (99% token reduction)
- Set appropriate `limit` when using `get_dataset()`
**DON'T:**
- Use `get_dataset()` for leaderboard queries (loads all 51 runs)
- Request more data than needed
- Ignore token optimization tools
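As an illustration of the first two points, in the same call style used throughout this document:
```python
# Preferred: aggregated statistics only (documented ~99% token reduction)
summary = await get_leaderboard_summary(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
)

# Avoid: pulling raw leaderboard rows just to answer an overview question
# raw = await get_dataset(
#     dataset_repo="kshitijthakkar/smoltrace-leaderboard",
#     limit=200,
# )
```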
### 2. AI Tool Usage
**DO:**
- Use AI tools (`analyze_leaderboard`, `debug_trace`) for complex analysis
- Provide specific questions to `debug_trace` for focused answers
- Use `focus` parameter in `compare_runs` for targeted comparisons
**DON'T:**
- Use AI tools for simple data retrieval (use resources instead)
- Make vague requests (be specific for better results)
### 3. Dataset Security
**DO:**
- Only use datasets with "smoltrace-" prefix
- Verify dataset exists before requesting
- Use public datasets or authenticate for private ones
**DON'T:**
- Try to access arbitrary HuggingFace datasets
- Share private dataset URLs without authentication
### 4. Cost Management
**DO:**
- Use `estimate_cost` before running large evaluations
- Compare cost estimates across different models
- Consider token-optimized tools to reduce API costs
**DON'T:**
- Skip cost estimation for expensive operations
- Ignore hardware recommendations
- Overlook CO2 emissions in decision-making
---
## Support
For issues or questions:
- 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
- 💬 HF Discord: `#agents-mcp-hackathon-winter25`
- 🏷️ Tag: `building-mcp-track-enterprise`