Mandark-droid committed on
Commit eb3c2b5 · 1 Parent(s): 266ceb7

feat: Add analyze_results tool for optimization recommendations


Added new MCP tool 'analyze_results' that:
- Analyzes individual test case results (not just aggregate data)
- Identifies failure patterns and common error types
- Finds performance bottlenecks and slowest tests
- Analyzes cost patterns and expensive tests
- Provides actionable optimization recommendations
- Supports 4 focus modes: comprehensive, failures, performance, cost

Tool capabilities:
- Loads results datasets (smoltrace-results-*)
- Analyzes up to 500 test cases per request
- Groups analysis by category and difficulty
- Uses Gemini 2.5 Pro for intelligent insights
- Returns markdown with detailed recommendations
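
Example (a minimal sketch, not part of the commit: it calls the new tool directly as the async function added in mcp_tools.py below; the repository name is a placeholder):

```python
# Minimal usage sketch for the new analyze_results tool (placeholder repo name).
import asyncio
from mcp_tools import analyze_results

async def main():
    report = await analyze_results(
        results_repo="username/smoltrace-results-gpt4-20251114",  # placeholder
        analysis_focus="failures",  # "comprehensive" | "failures" | "performance" | "cost"
        max_rows=100,               # clamped to 10-500 inside the tool
        hf_token=None,              # falls back to the HF_TOKEN env var
        gemini_api_key=None,        # falls back to the GEMINI_API_KEY env var
    )
    print(report)  # markdown report with failure, performance, and cost recommendations

asyncio.run(main())
```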

UI updates:
- Added new 'Analyze Results' tab in Gradio UI
- Updated header to show 6 tools instead of 5
- Added import for analyze_results

Documentation updates:
- README now reflects 6 tools and 12 total MCP components
- Updated all tool counts throughout documentation
- Updated FAQ to mention 6 tools

Files changed (3)
  1. README.md +11 -8
  2. app.py +77 -2
  3. mcp_tools.py +148 -0
README.md CHANGED
@@ -33,12 +33,13 @@ tags:
 
 TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
 
-### 🛠️ **5 AI-Powered Tools**
+### 🛠️ **6 AI-Powered Tools**
 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
 2. **🐛 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
 3. **💰 estimate_cost**: Predict evaluation costs before running
 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
+5. **🔍 analyze_results**: Deep dive into test results with optimization recommendations
+6. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
 
 ### 📦 **3 Data Resources**
 1. **leaderboard data**: Direct JSON access to evaluation results
@@ -386,19 +387,20 @@ A: Use the SSE endpoint (`/gradio_api/mcp/sse`) for now, but note that it's depr
 A: Streamable HTTP is the newer, more efficient protocol with better error handling and performance. SSE is the legacy protocol being phased out.
 
 **Q: How do I test if my connection works?**
-A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 5 tools, 3 resources, and 3 prompts.
+A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 6 tools, 3 resources, and 3 prompts.
 
 **Q: Can I use this MCP server without authentication?**
 A: The MCP endpoint is publicly accessible. However, the tools may require HuggingFace datasets to be public or accessible with your HF token (configured server-side).
 
 ### Available MCP Components
 
-**Tools** (5):
+**Tools** (6):
 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
 2. **debug_trace**: Trace debugging with AI insights
 3. **estimate_cost**: Cost estimation with optimization recommendations
 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
-5. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
+5. **analyze_results**: Deep dive into test results with optimization recommendations
+6. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
 
 **Resources** (3):
 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
@@ -555,7 +557,7 @@ TraceMind UI is a Gradio-based agent evaluation platform that uses these MCP too
 
 ### app.py
 Main Gradio application with:
-- Testing UI for all 5 tools
+- Testing UI for all 6 tools
 - MCP server enabled via `mcp_server=True`
 - API documentation
 
@@ -567,13 +569,14 @@ Google Gemini 2.5 Pro client that:
 - Uses `gemini-2.5-pro-latest` model (can switch to `gemini-2.5-flash-latest`)
 
 ### mcp_tools.py
-Complete MCP implementation with 11 components:
+Complete MCP implementation with 12 components:
 
-**Tools** (5 async functions):
+**Tools** (6 async functions):
 - `analyze_leaderboard()`: AI-powered leaderboard analysis
 - `debug_trace()`: AI-powered trace debugging
 - `estimate_cost()`: AI-powered cost estimation
 - `compare_runs()`: AI-powered run comparison
+- `analyze_results()`: AI-powered results analysis with optimization recommendations
 - `get_dataset()`: Load SMOLTRACE datasets as JSON
 
 **Resources** (3 decorated functions with `@gr.mcp.resource()`):
app.py CHANGED
@@ -21,6 +21,7 @@ from mcp_tools import (
     debug_trace,
     estimate_cost,
     compare_runs,
+    analyze_results,
     get_dataset
 )
 
@@ -40,13 +41,14 @@ def create_gradio_ui():
 
 **AI-Powered Analysis for Agent Evaluation Data**
 
-This server provides **5 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
+This server provides **6 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
 
 ### MCP Tools (AI-Powered)
 - 📊 **Analyze Leaderboard**: Get insights from evaluation results
 - 🐛 **Debug Trace**: Understand what happened in a specific test
 - 💰 **Estimate Cost**: Predict evaluation costs before running
 - ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
+- 🔍 **Analyze Results**: Deep dive into test results with optimization recommendations
 - 📦 **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
 
 ### MCP Resources (Data Access)
@@ -520,7 +522,80 @@ def create_gradio_ui():
                 outputs=[compare_output]
             )
 
-        # Tab 5: Get Dataset
+        # Tab 5: Analyze Results
+        with gr.Tab("🔍 Analyze Results"):
+            gr.Markdown("""
+            ## Analyze Test Results & Get Optimization Recommendations
+
+            Deep dive into individual test case results to identify failure patterns,
+            performance bottlenecks, and cost optimization opportunities.
+            """)
+
+            with gr.Row():
+                results_repo_input = gr.Textbox(
+                    label="Results Repository",
+                    placeholder="e.g., username/smoltrace-results-gpt4-20251114",
+                    info="HuggingFace dataset containing results data"
+                )
+                results_focus = gr.Dropdown(
+                    choices=["comprehensive", "failures", "performance", "cost"],
+                    value="comprehensive",
+                    label="Analysis Focus",
+                    info="What aspect to focus the analysis on"
+                )
+
+            with gr.Row():
+                results_max_rows = gr.Slider(
+                    minimum=10,
+                    maximum=500,
+                    value=100,
+                    step=10,
+                    label="Max Test Cases to Analyze",
+                    info="Limit number of test cases for analysis"
+                )
+
+            results_button = gr.Button("🔍 Analyze Results", variant="primary")
+            results_output = gr.Markdown()
+
+            async def run_analyze_results(repo, focus, max_rows, gemini_key, hf_token):
+                """
+                Analyze detailed test results and provide optimization recommendations.
+
+                Args:
+                    repo (str): HuggingFace dataset repository containing results
+                    focus (str): Analysis focus area
+                    max_rows (int): Maximum test cases to analyze
+                    gemini_key (str): Gemini API key from session state
+                    hf_token (str): HuggingFace token from session state
+
+                Returns:
+                    str: Markdown-formatted analysis with recommendations
+                """
+                try:
+                    if not repo:
+                        return "❌ **Error**: Please provide a results repository"
+
+                    # Use user-provided key or fall back to environment variable
+                    api_key = gemini_key if gemini_key and gemini_key.strip() else None
+
+                    result = await analyze_results(
+                        results_repo=repo,
+                        analysis_focus=focus,
+                        max_rows=int(max_rows),
+                        hf_token=hf_token if hf_token and hf_token.strip() else None,
+                        gemini_api_key=api_key
+                    )
+                    return result
+                except Exception as e:
+                    return f"❌ **Error**: {str(e)}"
+
+            results_button.click(
+                fn=run_analyze_results,
+                inputs=[results_repo_input, results_focus, results_max_rows, gemini_key_state, hf_token_state],
+                outputs=[results_output]
+            )
+
+        # Tab 6: Get Dataset
         with gr.Tab("📦 Get Dataset"):
             gr.Markdown("""
             ## Load SMOLTRACE Datasets as JSON
mcp_tools.py CHANGED
@@ -576,6 +576,154 @@ Provide eco-conscious recommendations for sustainable AI deployment.
         return f"❌ **Error comparing runs**: {str(e)}"
 
 
+@gr.mcp.tool()
+async def analyze_results(
+    results_repo: str,
+    analysis_focus: str = "comprehensive",
+    max_rows: int = 100,
+    hf_token: Optional[str] = None,
+    gemini_api_key: Optional[str] = None
+) -> str:
+    """
+    Analyze detailed test results and provide optimization recommendations.
+
+    USE THIS TOOL when you need to:
+    - Understand why tests are failing and get recommendations
+    - Identify performance bottlenecks in specific test cases
+    - Find cost optimization opportunities
+    - Get insights about tool usage patterns
+    - Analyze which types of tasks work well vs poorly
+
+    This tool analyzes individual test case results (not aggregate leaderboard data)
+    and uses Google Gemini 2.5 Pro to provide actionable optimization recommendations.
+
+    Args:
+        results_repo (str): HuggingFace dataset repository containing results (e.g., "username/smoltrace-results-gpt4-20251114")
+        analysis_focus (str): Focus area. Options: "failures", "performance", "cost", "comprehensive". Default: "comprehensive"
+        max_rows (int): Maximum test cases to analyze. Default: 100. Range: 10-500
+        hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
+        gemini_api_key (Optional[str]): Google Gemini API key. If None, uses GEMINI_API_KEY environment variable.
+
+    Returns:
+        str: Markdown-formatted analysis with failure patterns, performance insights, cost analysis, and optimization recommendations
+    """
+    try:
+        # Initialize Gemini client
+        gemini_client = GeminiClient(api_key=gemini_api_key) if gemini_api_key else GeminiClient()
+
+        # Load results dataset
+        print(f"Loading results from {results_repo}...")
+        token = hf_token if hf_token else os.getenv("HF_TOKEN")
+        ds = load_dataset(results_repo, split="train", token=token)
+        df = pd.DataFrame(ds)
+
+        if df.empty:
+            return "❌ **Error**: Results dataset is empty"
+
+        # Limit rows
+        max_rows = max(10, min(500, max_rows))
+        df_sample = df.head(max_rows)
+
+        # Calculate statistics
+        total_tests = len(df_sample)
+        successful = df_sample[df_sample['success'] == True]
+        failed = df_sample[df_sample['success'] == False]
+
+        success_rate = (len(successful) / total_tests * 100) if total_tests > 0 else 0
+
+        # Analyze by category/difficulty
+        category_stats = {}
+        if 'category' in df_sample.columns:
+            category_stats = df_sample.groupby('category').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean',
+                'cost_usd': 'sum'
+            }).to_dict()
+
+        difficulty_stats = {}
+        if 'difficulty' in df_sample.columns:
+            difficulty_stats = df_sample.groupby('difficulty').agg({
+                'success': ['count', 'sum', 'mean'],
+                'execution_time_ms': 'mean'
+            }).to_dict()
+
+        # Find slowest tests
+        slowest_tests = df_sample.nlargest(5, 'execution_time_ms')[
+            ['task_id', 'prompt', 'execution_time_ms', 'success', 'cost_usd']
+        ].to_dict('records')
+
+        # Find most expensive tests
+        if 'cost_usd' in df_sample.columns:
+            most_expensive = df_sample.nlargest(5, 'cost_usd')[
+                ['task_id', 'prompt', 'cost_usd', 'total_tokens', 'success']
+            ].to_dict('records')
+        else:
+            most_expensive = []
+
+        # Analyze failures
+        failure_analysis = []
+        if len(failed) > 0:
+            # Get sample of failures
+            failure_sample = failed.head(10)[
+                ['task_id', 'prompt', 'error', 'error_type', 'tool_called', 'expected_tool']
+            ].to_dict('records')
+
+            # Count error types
+            if 'error_type' in failed.columns:
+                error_type_counts = failed['error_type'].value_counts().to_dict()
+            else:
+                error_type_counts = {}
+
+            failure_analysis = {
+                "total_failures": len(failed),
+                "failure_rate": (len(failed) / total_tests * 100),
+                "error_type_counts": error_type_counts,
+                "sample_failures": failure_sample
+            }
+
+        # Prepare data for Gemini analysis
+        analysis_data = {
+            "results_repo": results_repo,
+            "total_tests_analyzed": total_tests,
+            "overall_stats": {
+                "success_rate": round(success_rate, 2),
+                "successful_tests": len(successful),
+                "failed_tests": len(failed),
+                "avg_execution_time_ms": float(df_sample['execution_time_ms'].mean()),
+                "total_cost_usd": float(df_sample['cost_usd'].sum()) if 'cost_usd' in df_sample.columns else 0,
+                "avg_tokens_per_test": float(df_sample['total_tokens'].mean()) if 'total_tokens' in df_sample.columns else 0
+            },
+            "category_performance": category_stats,
+            "difficulty_performance": difficulty_stats,
+            "slowest_tests": slowest_tests,
+            "most_expensive_tests": most_expensive,
+            "failure_analysis": failure_analysis,
+            "analysis_focus": analysis_focus
+        }
+
+        # Create focus-specific prompt
+        focus_prompts = {
+            "failures": "Focus specifically on failure patterns. Analyze why tests are failing, identify common error types, and provide actionable recommendations to improve success rate.",
+            "performance": "Focus on performance optimization. Analyze execution times, identify bottlenecks, and recommend ways to speed up test execution.",
+            "cost": "Focus on cost optimization. Analyze token usage and costs, identify expensive tests, and recommend ways to reduce evaluation costs.",
+            "comprehensive": "Provide comprehensive analysis covering failures, performance, cost, and overall optimization opportunities."
+        }
+
+        specific_question = focus_prompts.get(analysis_focus, focus_prompts["comprehensive"])
+
+        # Get AI analysis
+        result = await gemini_client.analyze_with_context(
+            data=analysis_data,
+            analysis_type="results",
+            specific_question=specific_question
+        )
+
+        return result
+
+    except Exception as e:
+        return f"❌ **Error analyzing results**: {str(e)}\n\nPlease check:\n- Repository name is correct (should be smoltrace-results-*)\n- You have access to the dataset\n- HF_TOKEN is set correctly"
+
+
 @gr.mcp.tool()
 async def get_dataset(
     dataset_repo: str,