feat: Add analyze_results tool for optimization recommendations
Added a new MCP tool, `analyze_results`, that:
- Analyzes individual test case results (not just aggregate data)
- Identifies failure patterns and common error types
- Finds performance bottlenecks and slowest tests
- Analyzes cost patterns and expensive tests
- Provides actionable optimization recommendations
- Supports 4 focus modes: comprehensive, failures, performance, cost
Tool capabilities:
- Loads results datasets (smoltrace-results-*)
- Analyzes up to 500 test cases per request
- Groups analysis by category and difficulty
- Uses Gemini 2.5 Pro for intelligent insights
- Returns markdown with detailed recommendations
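
For reference, a minimal sketch of calling the new tool directly from Python (the keyword arguments mirror the `analyze_results()` signature added in mcp_tools.py; the repository name is a made-up example):

```python
# Sketch only: invoking the new tool outside an MCP client, assuming mcp_tools.py is importable.
import asyncio
from mcp_tools import analyze_results

report = asyncio.run(analyze_results(
    results_repo="username/smoltrace-results-gpt4-20251114",  # hypothetical results dataset
    analysis_focus="failures",   # one of: comprehensive, failures, performance, cost
    max_rows=200,                # the tool clamps this to the 10-500 range
    hf_token=None,               # falls back to the HF_TOKEN environment variable
    gemini_api_key=None,         # falls back to GEMINI_API_KEY
))
print(report)  # markdown report with failure patterns and recommendations
```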
UI updates:
- Added new 'Analyze Results' tab in Gradio UI
- Updated header to show 6 tools instead of 5
- Added import for analyze_results
Documentation updates:
- README now reflects 6 tools and 12 total MCP components
- Updated all tool counts throughout documentation
- Updated FAQ to mention 6 tools
Files changed:
- README.md: +11 -8
- app.py: +77 -2
- mcp_tools.py: +148 -0
README.md:

```diff
@@ -33,12 +33,13 @@ tags:
 
 TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
 
-### 🛠️ **5 AI-Powered Tools**
+### 🛠️ **6 AI-Powered Tools**
 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
 2. **🔍 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
 3. **💰 estimate_cost**: Predict evaluation costs before running
 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
+5. **📊 analyze_results**: Deep dive into test results with optimization recommendations
+6. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
 
 ### 📦 **3 Data Resources**
 1. **leaderboard data**: Direct JSON access to evaluation results
@@ -386,19 +387,20 @@ A: Use the SSE endpoint (`/gradio_api/mcp/sse`) for now, but note that it's deprecated
 A: Streamable HTTP is the newer, more efficient protocol with better error handling and performance. SSE is the legacy protocol being phased out.
 
 **Q: How do I test if my connection works?**
-A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 5 tools, 3 resources, and 3 prompts.
+A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 6 tools, 3 resources, and 3 prompts.
 
 **Q: Can I use this MCP server without authentication?**
 A: The MCP endpoint is publicly accessible. However, the tools may require HuggingFace datasets to be public or accessible with your HF token (configured server-side).
 
 ### Available MCP Components
 
-**Tools** (5):
+**Tools** (6):
 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
 2. **debug_trace**: Trace debugging with AI insights
 3. **estimate_cost**: Cost estimation with optimization recommendations
 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
+5. **analyze_results**: Deep dive into test results with optimization recommendations
+6. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
 
 **Resources** (3):
 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
@@ -555,7 +557,7 @@ TraceMind UI is a Gradio-based agent evaluation platform that uses these MCP tools
 
 ### app.py
 Main Gradio application with:
-- Testing UI for all 5 tools
+- Testing UI for all 6 tools
 - MCP server enabled via `mcp_server=True`
 - API documentation
 
@@ -567,13 +569,14 @@ Google Gemini 2.5 Pro client that:
 - Uses `gemini-2.5-pro-latest` model (can switch to `gemini-2.5-flash-latest`)
 
 ### mcp_tools.py
-Complete MCP implementation with 11 components:
+Complete MCP implementation with 12 components:
 
-**Tools** (5 async functions):
+**Tools** (6 async functions):
 - `analyze_leaderboard()`: AI-powered leaderboard analysis
 - `debug_trace()`: AI-powered trace debugging
 - `estimate_cost()`: AI-powered cost estimation
 - `compare_runs()`: AI-powered run comparison
+- `analyze_results()`: AI-powered results analysis with optimization recommendations
 - `get_dataset()`: Load SMOLTRACE datasets as JSON
 
 **Resources** (3 decorated functions with `@gr.mcp.resource()`):
```
app.py:

```diff
@@ -21,6 +21,7 @@ from mcp_tools import (
     debug_trace,
     estimate_cost,
     compare_runs,
+    analyze_results,
     get_dataset
 )
 
@@ -40,13 +41,14 @@ def create_gradio_ui():
 
 **AI-Powered Analysis for Agent Evaluation Data**
 
-This server provides **5 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
+This server provides **6 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
 
 ### MCP Tools (AI-Powered)
 - 📊 **Analyze Leaderboard**: Get insights from evaluation results
 - 🔍 **Debug Trace**: Understand what happened in a specific test
 - 💰 **Estimate Cost**: Predict evaluation costs before running
 - ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
+- 📊 **Analyze Results**: Deep dive into test results with optimization recommendations
 - 📦 **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
 
 ### MCP Resources (Data Access)
@@ -520,7 +522,80 @@
             outputs=[compare_output]
         )
 
-        # Tab 5: Get Dataset
+        # Tab 5: Analyze Results
+        with gr.Tab("📊 Analyze Results"):
+            gr.Markdown("""
+            ## Analyze Test Results & Get Optimization Recommendations
+
+            Deep dive into individual test case results to identify failure patterns,
+            performance bottlenecks, and cost optimization opportunities.
+            """)
+
+            with gr.Row():
+                results_repo_input = gr.Textbox(
+                    label="Results Repository",
+                    placeholder="e.g., username/smoltrace-results-gpt4-20251114",
+                    info="HuggingFace dataset containing results data"
+                )
+                results_focus = gr.Dropdown(
+                    choices=["comprehensive", "failures", "performance", "cost"],
+                    value="comprehensive",
+                    label="Analysis Focus",
+                    info="What aspect to focus the analysis on"
+                )
+
+            with gr.Row():
+                results_max_rows = gr.Slider(
+                    minimum=10,
+                    maximum=500,
+                    value=100,
+                    step=10,
+                    label="Max Test Cases to Analyze",
+                    info="Limit number of test cases for analysis"
+                )
+
+            results_button = gr.Button("📊 Analyze Results", variant="primary")
+            results_output = gr.Markdown()
+
+            async def run_analyze_results(repo, focus, max_rows, gemini_key, hf_token):
+                """
+                Analyze detailed test results and provide optimization recommendations.
+
+                Args:
+                    repo (str): HuggingFace dataset repository containing results
+                    focus (str): Analysis focus area
+                    max_rows (int): Maximum test cases to analyze
+                    gemini_key (str): Gemini API key from session state
+                    hf_token (str): HuggingFace token from session state
+
+                Returns:
+                    str: Markdown-formatted analysis with recommendations
+                """
+                try:
+                    if not repo:
+                        return "❌ **Error**: Please provide a results repository"
+
+                    # Use user-provided key or fall back to environment variable
+                    api_key = gemini_key if gemini_key and gemini_key.strip() else None
+
+                    result = await analyze_results(
+                        results_repo=repo,
+                        analysis_focus=focus,
+                        max_rows=int(max_rows),
+                        hf_token=hf_token if hf_token and hf_token.strip() else None,
+                        gemini_api_key=api_key
+                    )
+                    return result
+                except Exception as e:
+                    return f"❌ **Error**: {str(e)}"
+
+            results_button.click(
+                fn=run_analyze_results,
+                inputs=[results_repo_input, results_focus, results_max_rows, gemini_key_state, hf_token_state],
+                outputs=[results_output]
+            )
+
+        # Tab 6: Get Dataset
         with gr.Tab("📦 Get Dataset"):
            gr.Markdown("""
            ## Load SMOLTRACE Datasets as JSON
```
mcp_tools.py:

```diff
@@ -576,6 +576,154 @@ Provide eco-conscious recommendations for sustainable AI deployment.
         return f"❌ **Error comparing runs**: {str(e)}"
 
 
+@gr.mcp.tool()
+async def analyze_results(
+    results_repo: str,
+    analysis_focus: str = "comprehensive",
+    max_rows: int = 100,
+    hf_token: Optional[str] = None,
+    gemini_api_key: Optional[str] = None
+) -> str:
+    """
+    Analyze detailed test results and provide optimization recommendations.
+
+    USE THIS TOOL when you need to:
+    - Understand why tests are failing and get recommendations
+    - Identify performance bottlenecks in specific test cases
+    - Find cost optimization opportunities
+    - Get insights about tool usage patterns
+    - Analyze which types of tasks work well vs poorly
+
+    This tool analyzes individual test case results (not aggregate leaderboard data)
+    and uses Google Gemini 2.5 Pro to provide actionable optimization recommendations.
+
+    Args:
+        results_repo (str): HuggingFace dataset repository containing results (e.g., "username/smoltrace-results-gpt4-20251114")
+        analysis_focus (str): Focus area. Options: "failures", "performance", "cost", "comprehensive". Default: "comprehensive"
+        max_rows (int): Maximum test cases to analyze. Default: 100. Range: 10-500
+        hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
+        gemini_api_key (Optional[str]): Google Gemini API key. If None, uses GEMINI_API_KEY environment variable.
+
+    Returns:
+        str: Markdown-formatted analysis with failure patterns, performance insights, cost analysis, and optimization recommendations
+    """
+    try:
+        # Initialize Gemini client
+        gemini_client = GeminiClient(api_key=gemini_api_key) if gemini_api_key else GeminiClient()
+
+        # Load results dataset
+        print(f"Loading results from {results_repo}...")
+        token = hf_token if hf_token else os.getenv("HF_TOKEN")
+        ds = load_dataset(results_repo, split="train", token=token)
+        df = pd.DataFrame(ds)
+
+        if df.empty:
+            return "❌ **Error**: Results dataset is empty"
+
+        # Limit rows
+        max_rows = max(10, min(500, max_rows))
+        df_sample = df.head(max_rows)
+
+        # Calculate statistics
+        total_tests = len(df_sample)
+        successful = df_sample[df_sample['success'] == True]
+        failed = df_sample[df_sample['success'] == False]
+
+        success_rate = (len(successful) / total_tests * 100) if total_tests > 0 else 0
+
+        # Analyze by category/difficulty
+        category_stats = {}
+        if 'category' in df_sample.columns:
+            category_stats = df_sample.groupby('category').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean',
+                'cost_usd': 'sum'
+            }).to_dict()
+
+        difficulty_stats = {}
+        if 'difficulty' in df_sample.columns:
+            difficulty_stats = df_sample.groupby('difficulty').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean'
+            }).to_dict()
+
+        # Find slowest tests
+        slowest_tests = df_sample.nlargest(5, 'execution_time_ms')[
+            ['task_id', 'prompt', 'execution_time_ms', 'success', 'cost_usd']
+        ].to_dict('records')
+
+        # Find most expensive tests
+        if 'cost_usd' in df_sample.columns:
+            most_expensive = df_sample.nlargest(5, 'cost_usd')[
+                ['task_id', 'prompt', 'cost_usd', 'total_tokens', 'success']
+            ].to_dict('records')
+        else:
+            most_expensive = []
+
+        # Analyze failures
+        failure_analysis = []
+        if len(failed) > 0:
+            # Get sample of failures
+            failure_sample = failed.head(10)[
+                ['task_id', 'prompt', 'error', 'error_type', 'tool_called', 'expected_tool']
+            ].to_dict('records')
+
+            # Count error types
+            if 'error_type' in failed.columns:
+                error_type_counts = failed['error_type'].value_counts().to_dict()
+            else:
+                error_type_counts = {}
+
+            failure_analysis = {
+                "total_failures": len(failed),
+                "failure_rate": (len(failed) / total_tests * 100),
+                "error_type_counts": error_type_counts,
+                "sample_failures": failure_sample
+            }
+
+        # Prepare data for Gemini analysis
+        analysis_data = {
+            "results_repo": results_repo,
+            "total_tests_analyzed": total_tests,
+            "overall_stats": {
+                "success_rate": round(success_rate, 2),
+                "successful_tests": len(successful),
+                "failed_tests": len(failed),
+                "avg_execution_time_ms": float(df_sample['execution_time_ms'].mean()),
+                "total_cost_usd": float(df_sample['cost_usd'].sum()) if 'cost_usd' in df_sample.columns else 0,
+                "avg_tokens_per_test": float(df_sample['total_tokens'].mean()) if 'total_tokens' in df_sample.columns else 0
+            },
+            "category_performance": category_stats,
+            "difficulty_performance": difficulty_stats,
+            "slowest_tests": slowest_tests,
+            "most_expensive_tests": most_expensive,
+            "failure_analysis": failure_analysis,
+            "analysis_focus": analysis_focus
+        }
+
+        # Create focus-specific prompt
+        focus_prompts = {
+            "failures": "Focus specifically on failure patterns. Analyze why tests are failing, identify common error types, and provide actionable recommendations to improve success rate.",
+            "performance": "Focus on performance optimization. Analyze execution times, identify bottlenecks, and recommend ways to speed up test execution.",
+            "cost": "Focus on cost optimization. Analyze token usage and costs, identify expensive tests, and recommend ways to reduce evaluation costs.",
+            "comprehensive": "Provide comprehensive analysis covering failures, performance, cost, and overall optimization opportunities."
+        }
+
+        specific_question = focus_prompts.get(analysis_focus, focus_prompts["comprehensive"])
+
+        # Get AI analysis
+        result = await gemini_client.analyze_with_context(
+            data=analysis_data,
+            analysis_type="results",
+            specific_question=specific_question
+        )
+
+        return result
+
+    except Exception as e:
+        return f"❌ **Error analyzing results**: {str(e)}\n\nPlease check:\n- Repository name is correct (should be smoltrace-results-*)\n- You have access to the dataset\n- HF_TOKEN is set correctly"
+
+
 @gr.mcp.tool()
 async def get_dataset(
     dataset_repo: str,
```