Mandark-droid committed on
Commit eb3c2b5 · 1 Parent(s): 266ceb7

feat: Add analyze_results tool for optimization recommendations


Added new MCP tool 'analyze_results' that:
- Analyzes individual test case results (not just aggregate data)
- Identifies failure patterns and common error types
- Finds performance bottlenecks and slowest tests
- Analyzes cost patterns and expensive tests
- Provides actionable optimization recommendations
- Supports 4 focus modes: comprehensive, failures, performance, cost

Tool capabilities:
- Loads results datasets (smoltrace-results-*)
- Analyzes up to 500 test cases per request
- Groups analysis by category and difficulty
- Uses Gemini 2.5 Pro for intelligent insights
- Returns markdown with detailed recommendations
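
Example (a minimal sketch, not part of the commit: it calls the new tool directly as the async function added in mcp_tools.py below; the repository name is a placeholder):

```python
# Minimal usage sketch for the new analyze_results tool (placeholder repo name).
import asyncio
from mcp_tools import analyze_results

async def main():
    report = await analyze_results(
        results_repo="username/smoltrace-results-gpt4-20251114",  # placeholder
        analysis_focus="failures",  # "comprehensive" | "failures" | "performance" | "cost"
        max_rows=100,               # clamped to 10-500 inside the tool
        hf_token=None,              # falls back to the HF_TOKEN env var
        gemini_api_key=None,        # falls back to the GEMINI_API_KEY env var
    )
    print(report)  # markdown report with failure, performance, and cost recommendations

asyncio.run(main())
```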

UI updates:
- Added new 'Analyze Results' tab in Gradio UI
- Updated header to show 6 tools instead of 5
- Added import for analyze_results

Documentation updates:
- README now reflects 6 tools and 12 total MCP components
- Updated all tool counts throughout documentation
- Updated FAQ to mention 6 tools

Files changed (3)
  1. README.md +11 -8
  2. app.py +77 -2
  3. mcp_tools.py +148 -0
README.md CHANGED
@@ -33,12 +33,13 @@ tags:
 
 TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
 
-### 🛠️ **5 AI-Powered Tools**
+### 🛠️ **6 AI-Powered Tools**
 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
 2. **🐛 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
 3. **💰 estimate_cost**: Predict evaluation costs before running
 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
+5. **🔍 analyze_results**: Deep dive into test results with optimization recommendations
+6. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
 
 ### 📦 **3 Data Resources**
 1. **leaderboard data**: Direct JSON access to evaluation results
@@ -386,19 +387,20 @@ A: Use the SSE endpoint (`/gradio_api/mcp/sse`) for now, but note that it's depr
 A: Streamable HTTP is the newer, more efficient protocol with better error handling and performance. SSE is the legacy protocol being phased out.
 
 **Q: How do I test if my connection works?**
-A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 5 tools, 3 resources, and 3 prompts.
+A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 6 tools, 3 resources, and 3 prompts.
 
 **Q: Can I use this MCP server without authentication?**
 A: The MCP endpoint is publicly accessible. However, the tools may require HuggingFace datasets to be public or accessible with your HF token (configured server-side).
 
 ### Available MCP Components
 
-**Tools** (5):
+**Tools** (6):
 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
 2. **debug_trace**: Trace debugging with AI insights
 3. **estimate_cost**: Cost estimation with optimization recommendations
 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
+5. **analyze_results**: Deep dive into test results with optimization recommendations
+6. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
 
 **Resources** (3):
 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
@@ -555,7 +557,7 @@ TraceMind UI is a Gradio-based agent evaluation platform that uses these MCP too
 
 ### app.py
 Main Gradio application with:
-- Testing UI for all 5 tools
+- Testing UI for all 6 tools
 - MCP server enabled via `mcp_server=True`
 - API documentation
 
@@ -567,13 +569,14 @@ Google Gemini 2.5 Pro client that:
 - Uses `gemini-2.5-pro-latest` model (can switch to `gemini-2.5-flash-latest`)
 
 ### mcp_tools.py
-Complete MCP implementation with 11 components:
+Complete MCP implementation with 12 components:
 
-**Tools** (5 async functions):
+**Tools** (6 async functions):
 - `analyze_leaderboard()`: AI-powered leaderboard analysis
 - `debug_trace()`: AI-powered trace debugging
 - `estimate_cost()`: AI-powered cost estimation
 - `compare_runs()`: AI-powered run comparison
+- `analyze_results()`: AI-powered results analysis with optimization recommendations
 - `get_dataset()`: Load SMOLTRACE datasets as JSON
 
 **Resources** (3 decorated functions with `@gr.mcp.resource()`):
app.py CHANGED
@@ -21,6 +21,7 @@ from mcp_tools import (
     debug_trace,
     estimate_cost,
     compare_runs,
+    analyze_results,
     get_dataset
 )
 
@@ -40,13 +41,14 @@ def create_gradio_ui():
 
 **AI-Powered Analysis for Agent Evaluation Data**
 
-This server provides **5 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
+This server provides **6 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
 
 ### MCP Tools (AI-Powered)
 - 📊 **Analyze Leaderboard**: Get insights from evaluation results
 - 🐛 **Debug Trace**: Understand what happened in a specific test
 - 💰 **Estimate Cost**: Predict evaluation costs before running
 - ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
+- 🔍 **Analyze Results**: Deep dive into test results with optimization recommendations
 - 📦 **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
 
 ### MCP Resources (Data Access)
@@ -520,7 +522,80 @@ def create_gradio_ui():
                 outputs=[compare_output]
             )
 
-        # Tab 5: Get Dataset
+        # Tab 5: Analyze Results
+        with gr.Tab("🔍 Analyze Results"):
+            gr.Markdown("""
+            ## Analyze Test Results & Get Optimization Recommendations
+
+            Deep dive into individual test case results to identify failure patterns,
+            performance bottlenecks, and cost optimization opportunities.
+            """)
+
+            with gr.Row():
+                results_repo_input = gr.Textbox(
+                    label="Results Repository",
+                    placeholder="e.g., username/smoltrace-results-gpt4-20251114",
+                    info="HuggingFace dataset containing results data"
+                )
+                results_focus = gr.Dropdown(
+                    choices=["comprehensive", "failures", "performance", "cost"],
+                    value="comprehensive",
+                    label="Analysis Focus",
+                    info="What aspect to focus the analysis on"
+                )
+
+            with gr.Row():
+                results_max_rows = gr.Slider(
+                    minimum=10,
+                    maximum=500,
+                    value=100,
+                    step=10,
+                    label="Max Test Cases to Analyze",
+                    info="Limit number of test cases for analysis"
+                )
+
+            results_button = gr.Button("🔍 Analyze Results", variant="primary")
+            results_output = gr.Markdown()
+
+            async def run_analyze_results(repo, focus, max_rows, gemini_key, hf_token):
+                """
+                Analyze detailed test results and provide optimization recommendations.
+
+                Args:
+                    repo (str): HuggingFace dataset repository containing results
+                    focus (str): Analysis focus area
+                    max_rows (int): Maximum test cases to analyze
+                    gemini_key (str): Gemini API key from session state
+                    hf_token (str): HuggingFace token from session state
+
+                Returns:
+                    str: Markdown-formatted analysis with recommendations
+                """
+                try:
+                    if not repo:
+                        return "❌ **Error**: Please provide a results repository"
+
+                    # Use user-provided key or fall back to environment variable
+                    api_key = gemini_key if gemini_key and gemini_key.strip() else None
+
+                    result = await analyze_results(
+                        results_repo=repo,
+                        analysis_focus=focus,
+                        max_rows=int(max_rows),
+                        hf_token=hf_token if hf_token and hf_token.strip() else None,
+                        gemini_api_key=api_key
+                    )
+                    return result
+                except Exception as e:
+                    return f"❌ **Error**: {str(e)}"
+
+            results_button.click(
+                fn=run_analyze_results,
+                inputs=[results_repo_input, results_focus, results_max_rows, gemini_key_state, hf_token_state],
+                outputs=[results_output]
+            )
+
+        # Tab 6: Get Dataset
         with gr.Tab("📦 Get Dataset"):
             gr.Markdown("""
             ## Load SMOLTRACE Datasets as JSON
mcp_tools.py CHANGED
@@ -576,6 +576,154 @@ Provide eco-conscious recommendations for sustainable AI deployment.
         return f"❌ **Error comparing runs**: {str(e)}"
 
 
+@gr.mcp.tool()
+async def analyze_results(
+    results_repo: str,
+    analysis_focus: str = "comprehensive",
+    max_rows: int = 100,
+    hf_token: Optional[str] = None,
+    gemini_api_key: Optional[str] = None
+) -> str:
+    """
+    Analyze detailed test results and provide optimization recommendations.
+
+    USE THIS TOOL when you need to:
+    - Understand why tests are failing and get recommendations
+    - Identify performance bottlenecks in specific test cases
+    - Find cost optimization opportunities
+    - Get insights about tool usage patterns
+    - Analyze which types of tasks work well vs poorly
+
+    This tool analyzes individual test case results (not aggregate leaderboard data)
+    and uses Google Gemini 2.5 Pro to provide actionable optimization recommendations.
+
+    Args:
+        results_repo (str): HuggingFace dataset repository containing results (e.g., "username/smoltrace-results-gpt4-20251114")
+        analysis_focus (str): Focus area. Options: "failures", "performance", "cost", "comprehensive". Default: "comprehensive"
+        max_rows (int): Maximum test cases to analyze. Default: 100. Range: 10-500
+        hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
+        gemini_api_key (Optional[str]): Google Gemini API key. If None, uses GEMINI_API_KEY environment variable.
+
+    Returns:
+        str: Markdown-formatted analysis with failure patterns, performance insights, cost analysis, and optimization recommendations
+    """
+    try:
+        # Initialize Gemini client
+        gemini_client = GeminiClient(api_key=gemini_api_key) if gemini_api_key else GeminiClient()
+
+        # Load results dataset
+        print(f"Loading results from {results_repo}...")
+        token = hf_token if hf_token else os.getenv("HF_TOKEN")
+        ds = load_dataset(results_repo, split="train", token=token)
+        df = pd.DataFrame(ds)
+
+        if df.empty:
+            return "❌ **Error**: Results dataset is empty"
+
+        # Limit rows
+        max_rows = max(10, min(500, max_rows))
+        df_sample = df.head(max_rows)
+
+        # Calculate statistics
+        total_tests = len(df_sample)
+        successful = df_sample[df_sample['success'] == True]
+        failed = df_sample[df_sample['success'] == False]
+
+        success_rate = (len(successful) / total_tests * 100) if total_tests > 0 else 0
+
+        # Analyze by category/difficulty
+        category_stats = {}
+        if 'category' in df_sample.columns:
+            category_stats = df_sample.groupby('category').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean',
+                'cost_usd': 'sum'
+            }).to_dict()
+
+        difficulty_stats = {}
+        if 'difficulty' in df_sample.columns:
+            difficulty_stats = df_sample.groupby('difficulty').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean'
+            }).to_dict()
+
+        # Find slowest tests
+        slowest_tests = df_sample.nlargest(5, 'execution_time_ms')[
+            ['task_id', 'prompt', 'execution_time_ms', 'success', 'cost_usd']
+        ].to_dict('records')
+
+        # Find most expensive tests
+        if 'cost_usd' in df_sample.columns:
+            most_expensive = df_sample.nlargest(5, 'cost_usd')[
+                ['task_id', 'prompt', 'cost_usd', 'total_tokens', 'success']
+            ].to_dict('records')
+        else:
+            most_expensive = []
+
+        # Analyze failures
+        failure_analysis = []
+        if len(failed) > 0:
+            # Get sample of failures
+            failure_sample = failed.head(10)[
+                ['task_id', 'prompt', 'error', 'error_type', 'tool_called', 'expected_tool']
+            ].to_dict('records')
+
+            # Count error types
+            if 'error_type' in failed.columns:
+                error_type_counts = failed['error_type'].value_counts().to_dict()
+            else:
+                error_type_counts = {}
+
+            failure_analysis = {
+                "total_failures": len(failed),
+                "failure_rate": (len(failed) / total_tests * 100),
+                "error_type_counts": error_type_counts,
+                "sample_failures": failure_sample
+            }
+
+        # Prepare data for Gemini analysis
+        analysis_data = {
+            "results_repo": results_repo,
+            "total_tests_analyzed": total_tests,
+            "overall_stats": {
+                "success_rate": round(success_rate, 2),
+                "successful_tests": len(successful),
+                "failed_tests": len(failed),
+                "avg_execution_time_ms": float(df_sample['execution_time_ms'].mean()),
+                "total_cost_usd": float(df_sample['cost_usd'].sum()) if 'cost_usd' in df_sample.columns else 0,
+                "avg_tokens_per_test": float(df_sample['total_tokens'].mean()) if 'total_tokens' in df_sample.columns else 0
+            },
+            "category_performance": category_stats,
+            "difficulty_performance": difficulty_stats,
+            "slowest_tests": slowest_tests,
+            "most_expensive_tests": most_expensive,
+            "failure_analysis": failure_analysis,
+            "analysis_focus": analysis_focus
+        }
+
+        # Create focus-specific prompt
+        focus_prompts = {
+            "failures": "Focus specifically on failure patterns. Analyze why tests are failing, identify common error types, and provide actionable recommendations to improve success rate.",
+            "performance": "Focus on performance optimization. Analyze execution times, identify bottlenecks, and recommend ways to speed up test execution.",
+            "cost": "Focus on cost optimization. Analyze token usage and costs, identify expensive tests, and recommend ways to reduce evaluation costs.",
+            "comprehensive": "Provide comprehensive analysis covering failures, performance, cost, and overall optimization opportunities."
+        }
+
+        specific_question = focus_prompts.get(analysis_focus, focus_prompts["comprehensive"])
+
+        # Get AI analysis
+        result = await gemini_client.analyze_with_context(
+            data=analysis_data,
+            analysis_type="results",
+            specific_question=specific_question
+        )
+
+        return result
+
+    except Exception as e:
+        return f"❌ **Error analyzing results**: {str(e)}\n\nPlease check:\n- Repository name is correct (should be smoltrace-results-*)\n- You have access to the dataset\n- HF_TOKEN is set correctly"
+
+
 @gr.mcp.tool()
 async def get_dataset(
     dataset_repo: str,