kshitijthakkar committed on
Commit
6982f0b
·
1 Parent(s): e4b0c31

docs: Deploy final documentation package

Files changed (3)
  1. ARCHITECTURE.md +987 -0
  2. DOCUMENTATION.md +918 -0
  3. README.md +186 -770
ARCHITECTURE.md ADDED
@@ -0,0 +1,987 @@
1
+ # TraceMind MCP Server - Technical Architecture
2
+
3
+ This document provides a deep technical dive into the TraceMind MCP Server architecture, implementation details, and deployment configuration.
4
+
5
+ ## Table of Contents
6
+
7
+ - [System Overview](#system-overview)
8
+ - [Project Structure](#project-structure)
9
+ - [Core Components](#core-components)
10
+ - [MCP Protocol Implementation](#mcp-protocol-implementation)
11
+ - [Gemini Integration](#gemini-integration)
12
+ - [Data Flow](#data-flow)
13
+ - [Deployment Architecture](#deployment-architecture)
14
+ - [Development Workflow](#development-workflow)
15
+ - [Performance Considerations](#performance-considerations)
16
+ - [Security](#security)
17
+
18
+ ---
19
+
20
+ ## System Overview
21
+
22
+ TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides AI-powered analysis tools for agent evaluation data. It serves as the backend intelligence layer for the TraceMind ecosystem.
23
+
24
+ ### Technology Stack
25
+
26
+ | Component | Technology | Version | Purpose |
27
+ |-----------|-----------|---------|---------|
28
+ | **Framework** | Gradio | 6.x | Native MCP support with `@gr.mcp.*` decorators |
29
+ | **AI Model** | Google Gemini | 2.5 Flash Lite | AI-powered analysis and insights |
30
+ | **Data Source** | HuggingFace Datasets | Latest | Load evaluation datasets |
31
+ | **Protocol** | MCP | 1.0 | Model Context Protocol for tool exposure |
32
+ | **Transport** | SSE | - | Server-Sent Events for real-time communication |
33
+ | **Deployment** | Docker | - | HuggingFace Spaces containerized deployment |
34
+ | **Language** | Python | 3.10+ | Core implementation |
35
+
36
+ ### Architecture Diagram
37
+
38
+ ```
39
+ ┌──────────────────────────────────────────────────────────────┐
40
+ │ MCP Clients (External) │
41
+ │ - Claude Desktop │
42
+ │ - VS Code (Continue, Cursor, Cline) │
43
+ │ - TraceMind-AI (Track 2) │
44
+ └────────────────┬─────────────────────────────────────────────┘
45
+
46
+ │ MCP Protocol
47
+ │ (SSE Transport)
48
+
49
+ ┌──────────────────────────────────────────────────────────────┐
50
+ │ TraceMind MCP Server (HuggingFace Spaces) │
51
+ │ │
52
+ │ ┌──────────────────────────────────────────────────────┐ │
53
+ │ │ Gradio App (app.py) │ │
54
+ │ │ - MCP Server Endpoint (mcp_server=True) │ │
55
+ │ │ - Testing UI (Gradio Blocks) │ │
56
+ │ │ - Configuration Management │ │
57
+ │ └─────────────┬────────────────────────────────────────┘ │
58
+ │ │ │
59
+ │ ↓ │
60
+ │ ┌──────────────────────────────────────────────────────┐ │
61
+ │ │ MCP Tools (mcp_tools.py) │ │
62
+ │ │ - 11 Tools (@gr.mcp.tool()) │ │
63
+ │ │ - 3 Resources (@gr.mcp.resource()) │ │
64
+ │ │ - 3 Prompts (@gr.mcp.prompt()) │ │
65
+ │ └─────────────┬────────────────────────────────────────┘ │
66
+ │ │ │
67
+ │ ↓ │
68
+ │ ┌──────────────────────────────────────────────────────┐ │
69
+ │ │ Gemini Client (gemini_client.py) │ │
70
+ │ │ - API Authentication │ │
71
+ │ │ - Prompt Engineering │ │
72
+ │ │ - Response Parsing │ │
73
+ │ └─────────────┬────────────────────────────────────────┘ │
74
+ │ │ │
75
+ └────────────────┼──────────────────────────────────────────────┘
76
+
77
+
78
+ ┌────────────────┐
79
+ │ External APIs │
80
+ │ - Gemini API │
81
+ │ - HF Datasets │
82
+ └────────────────┘
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Project Structure
88
+
89
+ ```
90
+ TraceMind-mcp-server/
91
+ ├── app.py # Main entry point, Gradio UI
92
+ ├── mcp_tools.py # MCP tool implementations (11 tools + 3 resources + 3 prompts)
93
+ ├── gemini_client.py # Google Gemini API client
94
+ ├── requirements.txt # Python dependencies
95
+ ├── Dockerfile # Container configuration
96
+ ├── .env.example # Environment variable template
97
+ ├── .gitignore # Git ignore rules
98
+ ├── README.md # Project documentation
99
+ └── DOCUMENTATION.md # Complete API reference
100
+
101
+ Total: 8 files (excluding docs)
102
+ Lines of Code: ~3,500 lines (breakdown below)
103
+ ```
104
+
105
+ ### File Sizes
106
+
107
+ | File | Lines | Purpose |
108
+ |------|-------|---------|
109
+ | `app.py` | ~1,200 | Gradio UI + MCP server setup + testing interface |
110
+ | `mcp_tools.py` | ~2,100 | All 17 MCP components (tools, resources, prompts) |
111
+ | `gemini_client.py` | ~200 | Gemini API integration |
112
+ | `requirements.txt` | ~20 | Dependencies |
113
+ | `Dockerfile` | ~30 | Deployment configuration |
114
+
115
+ ---
116
+
117
+ ## Core Components
118
+
119
+ ### 1. app.py - Main Application
120
+
121
+ **Purpose**: Entry point for HuggingFace Spaces deployment, provides both MCP server and testing UI.
122
+
123
+ **Key Responsibilities**:
124
+ - Initialize Gradio app with `mcp_server=True`
125
+ - Create testing interface for all MCP tools
126
+ - Handle configuration (API keys, settings)
127
+ - Manage client connections
128
+
129
+ **Architecture**:
130
+
131
+ ```python
132
+ # app.py structure
133
+ import gradio as gr
134
+ from gemini_client import GeminiClient
135
+ from mcp_tools import * # All tool implementations
136
+
137
+ # 1. Initialize Gemini client (with fallback)
138
+ default_gemini_client = GeminiClient()
139
+
140
+ # 2. Create Gradio UI for testing
141
+ def create_gradio_ui():
142
+ with gr.Blocks() as demo:
143
+ # Settings tab for API key configuration
144
+ # Tab for each MCP tool (11 tabs)
145
+ # Tab for testing resources
146
+ # Tab for testing prompts
147
+ # API documentation tab
148
+ return demo
149
+
150
+ # 3. Launch with MCP server enabled
151
+ if __name__ == "__main__":
152
+ demo = create_gradio_ui()
153
+ demo.launch(
154
+ mcp_server=True, # ← Enables MCP endpoint
155
+ share=False,
156
+ server_name="0.0.0.0",
157
+ server_port=7860
158
+ )
159
+ ```
160
+
161
+ **MCP Enablement**:
162
+ - `mcp_server=True` in `demo.launch()` automatically:
163
+ - Exposes `/gradio_api/mcp/sse` endpoint
164
+ - Discovers all `@gr.mcp.tool()`, `@gr.mcp.resource()`, `@gr.mcp.prompt()` decorated functions
165
+ - Generates MCP tool schemas from function signatures and docstrings
166
+ - Handles MCP protocol communication (SSE transport)
167
+
168
+ **Testing Interface**:
169
+ - **Settings Tab**: Configure Gemini API key and HF token
170
+ - **Tool Tabs** (11): One tab per tool for manual testing
171
+ - Input fields for all parameters
172
+ - Submit button
173
+ - Output display (Markdown or JSON)
174
+ - **Resources Tab**: Test resource URIs
175
+ - **Prompts Tab**: Test prompt templates
176
+ - **API Documentation Tab**: Generated from tool docstrings
177
+
178
+ ---
179
+
180
+ ### 2. mcp_tools.py - MCP Components
181
+
182
+ **Purpose**: Implements all 17 MCP components (11 tools + 3 resources + 3 prompts).
183
+
184
+ **Structure**:
185
+
186
+ ```python
187
+ # mcp_tools.py structure
188
+ import gradio as gr
189
+ from gemini_client import GeminiClient
190
+ from datasets import load_dataset
191
+
192
+ # ============ TOOLS (11) ============
193
+
194
+ @gr.mcp.tool()
195
+ async def analyze_leaderboard(...) -> str:
196
+ """Tool docstring (becomes MCP description)"""
197
+ # 1. Load data from HuggingFace
198
+ # 2. Process/filter data
199
+ # 3. Call Gemini for AI analysis
200
+ # 4. Return formatted response
201
+ pass
202
+
203
+ @gr.mcp.tool()
204
+ async def debug_trace(...) -> str:
205
+ """Debug traces with AI assistance"""
206
+ pass
207
+
208
+ # ... (9 more tools)
209
+
210
+ # ============ RESOURCES (3) ============
211
+
212
+ @gr.mcp.resource()
213
+ def get_leaderboard_data(uri: str) -> str:
214
+ """URI: leaderboard://{repo}"""
215
+ # Parse URI
216
+ # Load dataset
217
+ # Return raw JSON
218
+ pass
219
+
220
+ @gr.mcp.resource()
221
+ def get_trace_data(uri: str) -> str:
222
+ """URI: trace://{trace_id}/{repo}"""
223
+ pass
224
+
225
+ @gr.mcp.resource()
226
+ def get_cost_data(uri: str) -> str:
227
+ """URI: cost://model/{model_name}"""
228
+ pass
229
+
230
+ # ============ PROMPTS (3) ============
231
+
232
+ @gr.mcp.prompt()
233
+ def analysis_prompt(analysis_type: str, ...) -> str:
234
+ """Generate analysis prompt templates"""
235
+ pass
236
+
237
+ @gr.mcp.prompt()
238
+ def debug_prompt(debug_type: str, ...) -> str:
239
+ """Generate debug prompt templates"""
240
+ pass
241
+
242
+ @gr.mcp.prompt()
243
+ def optimization_prompt(optimization_goal: str, ...) -> str:
244
+ """Generate optimization prompt templates"""
245
+ pass
246
+ ```
247
+
248
+ **Design Patterns**:
249
+
250
+ 1. **Decorator-Based Registration**:
251
+ ```python
252
+ @gr.mcp.tool() # Gradio automatically registers as MCP tool
253
+ async def tool_name(...) -> str:
254
+ """Docstring becomes tool description in MCP schema"""
255
+ pass
256
+ ```
257
+
258
+ 2. **Structured Docstrings**:
259
+ ```python
260
+ """
261
+ Brief one-line description.
262
+
263
+ Longer detailed description explaining purpose and behavior.
264
+
265
+ Args:
266
+ param1 (type): Description of param1
267
+ param2 (type): Description of param2. Default: value
268
+
269
+ Returns:
270
+ type: Description of return value
271
+ """
272
+ ```
273
+ Gradio parses this docstring to generate the MCP tool schema automatically.
274
+
275
+ 3. **Error Handling**:
276
+ ```python
277
+ try:
278
+ # Tool implementation
279
+ return result
280
+ except Exception as e:
281
+ return f"❌ **Error**: {str(e)}"
282
+ ```
283
+ All errors returned as user-friendly strings.
284
+
285
+ 4. **Async/Await**:
286
+ All tools are `async` for efficient I/O operations (API calls, dataset loading).
287
+
288
+ ---
289
+
290
+ ### 3. gemini_client.py - AI Integration
291
+
292
+ **Purpose**: Handles all interactions with Google Gemini 2.5 Flash Lite API.
293
+
294
+ **Key Features**:
295
+ - API authentication
296
+ - Prompt engineering for different analysis types
297
+ - Response parsing and formatting
298
+ - Error handling and retries
299
+ - Token optimization
300
+
301
+ **Class Structure**:
302
+
303
+ ```python
304
+ class GeminiClient:
305
+ def __init__(self, api_key: str, model_name: str):
306
+ """Initialize with API key and model"""
307
+ self.api_key = api_key
308
+ self.model = genai.GenerativeModel(model_name)
309
+ self.generation_config = {
310
+ "temperature": 0.7,
311
+ "top_p": 0.95,
312
+ "max_output_tokens": 4096, # Optimized for HF Spaces
313
+ }
314
+ self.request_timeout = 30 # 30s timeout
315
+
316
+ async def analyze_with_context(
317
+ self,
318
+ data: Dict,
319
+ analysis_type: str,
320
+ specific_question: Optional[str] = None
321
+ ) -> str:
322
+ """
323
+ Core analysis method used by all AI-powered tools
324
+
325
+ Args:
326
+ data: Data to analyze (dict or JSON)
327
+ analysis_type: "leaderboard", "trace", "cost_estimate", "comparison", "results"
328
+ specific_question: Optional specific question
329
+
330
+ Returns:
331
+ Markdown-formatted analysis
332
+ """
333
+ # 1. Build system prompt based on analysis_type
334
+ system_prompt = self._get_system_prompt(analysis_type)
335
+
336
+ # 2. Format data for context
337
+ data_str = json.dumps(data, indent=2)
338
+
339
+ # 3. Build user prompt
340
+ user_prompt = f"{system_prompt}\n\nData:\n{data_str}"
341
+ if specific_question:
342
+ user_prompt += f"\n\nSpecific Question: {specific_question}"
343
+
344
+ # 4. Call Gemini API
345
+ response = await self.model.generate_content_async(
346
+ user_prompt,
347
+ generation_config=self.generation_config,
348
+ request_options={"timeout": self.request_timeout}
349
+ )
350
+
351
+ # 5. Extract and return text
352
+ return response.text
353
+
354
+ def _get_system_prompt(self, analysis_type: str) -> str:
355
+ """Get specialized system prompt for each analysis type"""
356
+ prompts = {
357
+ "leaderboard": """You are an expert AI agent performance analyst.
358
+ Analyze evaluation leaderboard data and provide:
359
+ - Top performers by key metrics
360
+ - Trade-off analysis (cost vs accuracy)
361
+ - Trend identification
362
+ - Actionable recommendations
363
+ Format: Markdown with clear sections.""",
364
+
365
+ "trace": """You are an expert at debugging AI agent executions.
366
+ Analyze OpenTelemetry trace data and:
367
+ - Answer specific questions about execution
368
+ - Identify performance bottlenecks
369
+ - Explain reasoning chain
370
+ - Provide optimization suggestions
371
+ Format: Clear, concise explanation.""",
372
+
373
+ "cost_estimate": """You are a cost optimization expert.
374
+ Analyze cost estimation data and provide:
375
+ - Detailed cost breakdown
376
+ - Hardware recommendations
377
+ - Cost optimization opportunities
378
+ - ROI analysis
379
+ Format: Structured breakdown with recommendations.""",
380
+
381
+ # ... more prompts for other analysis types
382
+ }
383
+ return prompts.get(analysis_type, prompts["leaderboard"])
384
+ ```
385
+
386
+ **Optimization Strategies**:
387
+ - **Token Reduction**: `max_output_tokens: 4096` (reduced from 8192) for faster responses
388
+ - **Request Timeout**: 30s timeout for HF Spaces compatibility
389
+ - **Temperature**: 0.7 for balanced creativity and consistency
390
+ - **Model Selection**: `gemini-2.5-flash-lite` for speed (can switch to `gemini-2.5-flash` for quality)
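+
+ Put together, a minimal client configuration reflecting these settings might look like the following sketch (the `GEMINI_MODEL` override is an assumption for illustration; the deployed client may hardcode the model name):
+
+ ```python
+ import os
+ import google.generativeai as genai
+
+ # Assumption: an optional GEMINI_MODEL env var to switch between speed and quality.
+ genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
+ model = genai.GenerativeModel(
+     os.getenv("GEMINI_MODEL", "gemini-2.5-flash-lite"),
+     generation_config={
+         "temperature": 0.7,         # balanced creativity vs consistency
+         "top_p": 0.95,
+         "max_output_tokens": 4096,  # keep responses fast on HF Spaces
+     },
+ )
+ ```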
391
+
392
+ ---
393
+
394
+ ## MCP Protocol Implementation
395
+
396
+ ### How Gradio's Native MCP Support Works
397
+
398
+ Gradio 6+ provides native MCP server capabilities through decorators and automatic schema generation.
399
+
400
+ **1. Tool Registration**:
401
+ ```python
402
+ @gr.mcp.tool() # ← This decorator tells Gradio to expose this as an MCP tool
403
+ async def my_tool(param1: str, param2: int = 10) -> str:
404
+ """
405
+ Brief description (used in MCP tool schema).
406
+
407
+ Args:
408
+ param1 (str): Description of param1
409
+ param2 (int): Description of param2. Default: 10
410
+
411
+ Returns:
412
+ str: Description of return value
413
+ """
414
+ return f"Result: {param1}, {param2}"
415
+ ```
416
+
417
+ **What Gradio does automatically**:
418
+ - Parses function signature to extract parameter names and types
419
+ - Parses docstring to extract descriptions
420
+ - Generates MCP tool schema:
421
+ ```json
422
+ {
423
+ "name": "my_tool",
424
+ "description": "Brief description (used in MCP tool schema).",
425
+ "inputSchema": {
426
+ "type": "object",
427
+ "properties": {
428
+ "param1": {
429
+ "type": "string",
430
+ "description": "Description of param1"
431
+ },
432
+ "param2": {
433
+ "type": "integer",
434
+ "default": 10,
435
+ "description": "Description of param2. Default: 10"
436
+ }
437
+ },
438
+ "required": ["param1"]
439
+ }
440
+ }
441
+ ```
442
+
443
+ **2. Resource Registration**:
444
+ ```python
445
+ @gr.mcp.resource()
446
+ def get_resource(uri: str) -> str:
447
+ """
448
+ Resource description.
449
+
450
+ Args:
451
+ uri (str): Resource URI (e.g., "leaderboard://repo/name")
452
+
453
+ Returns:
454
+ str: JSON data
455
+ """
456
+ # Parse URI
457
+ # Load data
458
+ # Return JSON string
459
+ pass
460
+ ```
461
+
462
+ **3. Prompt Registration**:
463
+ ```python
464
+ @gr.mcp.prompt()
465
+ def generate_prompt(prompt_type: str, context: str) -> str:
466
+ """
467
+ Generate reusable prompt templates.
468
+
469
+ Args:
470
+ prompt_type (str): Type of prompt
471
+ context (str): Context for prompt generation
472
+
473
+ Returns:
474
+ str: Generated prompt text
475
+ """
476
+ return f"Prompt template for {prompt_type} with {context}"
477
+ ```
478
+
479
+ ### MCP Endpoint URLs
480
+
481
+ When `demo.launch(mcp_server=True)` is called:
482
+
483
+ **SSE Endpoint** (Primary):
484
+ ```
485
+ https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
486
+ ```
487
+
488
+ **Streamable HTTP Endpoint** (Alternative):
489
+ ```
490
+ https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/
491
+ ```
492
+
493
+ ### Client Configuration
494
+
495
+ **Claude Desktop** (`claude_desktop_config.json`):
496
+ ```json
497
+ {
498
+ "mcpServers": {
499
+ "tracemind": {
500
+ "url": "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse",
501
+ "transport": "sse"
502
+ }
503
+ }
504
+ }
505
+ ```
506
+
507
+ **Python MCP Client**:
508
+ ```python
509
+ from mcp import ClientSession, ServerParameters
510
+
511
+ session = ClientSession(
512
+ ServerParameters(
513
+ url="https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse",
514
+ transport="sse"
515
+ )
516
+ )
517
+ await session.__aenter__()
518
+
519
+ # List tools
520
+ tools = await session.list_tools()
521
+
522
+ # Call tool
523
+ result = await session.call_tool("analyze_leaderboard", arguments={
524
+ "metric_focus": "cost",
525
+ "top_n": 5
526
+ })
527
+ ```
528
+
529
+ ---
530
+
531
+ ## Gemini Integration
532
+
533
+ ### API Configuration
534
+
535
+ **Environment Variable**:
536
+ ```bash
537
+ GEMINI_API_KEY=your_api_key_here
538
+ ```
539
+
540
+ **Initialization**:
541
+ ```python
542
+ import google.generativeai as genai
543
+
544
+ genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
545
+ model = genai.GenerativeModel("gemini-2.5-flash-lite")
546
+ ```
547
+
548
+ ### Prompt Engineering Strategy
549
+
550
+ **1. System Prompts by Analysis Type**:
551
+ Each analysis type (leaderboard, trace, cost, comparison, results) has a specialized system prompt that:
552
+ - Defines the AI's role and expertise
553
+ - Specifies output format (markdown, structured sections)
554
+ - Lists key insights to include
555
+ - Sets tone (professional, concise, actionable)
556
+
557
+ **2. Context Injection**:
558
+ ```python
559
+ user_prompt = f"""
560
+ {system_prompt}
561
+
562
+ Data to Analyze:
563
+ {json.dumps(data, indent=2)}
564
+
565
+ Specific Question: {question}
566
+ """
567
+ ```
568
+
569
+ **3. Output Formatting**:
570
+ - All responses in Markdown
571
+ - Clear sections: Top Performers, Key Insights, Trade-offs, Recommendations
572
+ - Bullet points for readability
573
+ - Code blocks for technical details
574
+
575
+ ### Rate Limiting & Error Handling
576
+
577
+ **Rate Limits** (Gemini 2.5 Flash Lite free tier):
578
+ - 1,500 requests per day
579
+ - 1 request per second
580
+
581
+ **Error Handling Strategy**:
582
+ ```python
583
+ try:
584
+ response = await model.generate_content_async(...)
585
+ return response.text
586
+ except google.api_core.exceptions.ResourceExhausted:
587
+ return "❌ **Rate limit exceeded**. Please try again in a few seconds."
588
+ except google.api_core.exceptions.DeadlineExceeded:
589
+ return "❌ **Request timeout**. The analysis is taking too long. Try with less data."
590
+ except Exception as e:
591
+ return f"❌ **Error**: {str(e)}"
592
+ ```
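+
+ The retries listed among `gemini_client.py`'s key features could be layered on top as a simple backoff loop (a sketch; the helper name and attempt counts are illustrative, not the exact production logic):
+
+ ```python
+ import asyncio
+ import google.api_core.exceptions as gexc
+
+ async def _generate_with_retry(model, prompt, config, attempts: int = 3) -> str:
+     """Retry transient rate-limit errors with exponential backoff (1s, 2s, 4s)."""
+     for attempt in range(attempts):
+         try:
+             response = await model.generate_content_async(prompt, generation_config=config)
+             return response.text
+         except gexc.ResourceExhausted:
+             if attempt == attempts - 1:
+                 raise
+             await asyncio.sleep(2 ** attempt)
+ ```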
593
+
594
+ ---
595
+
596
+ ## Data Flow
597
+
598
+ ### Tool Execution Flow
599
+
600
+ ```
601
+ 1. MCP Client (e.g., Claude Desktop, TraceMind-AI)
602
+ └─→ Calls: analyze_leaderboard(metric_focus="cost", top_n=5)
603
+
604
+ 2. Gradio MCP Server (app.py)
605
+ └─→ Routes to: analyze_leaderboard() in mcp_tools.py
606
+
607
+ 3. MCP Tool Function (mcp_tools.py)
608
+ ├─→ Load data from HuggingFace Datasets
609
+ │ └─→ ds = load_dataset("kshitijthakkar/smoltrace-leaderboard")
610
+
611
+ ├─→ Process/filter data
612
+ │ └─→ Filter by time range, sort by metric
613
+
614
+ ├─→ Call Gemini Client
615
+ │ └─→ gemini_client.analyze_with_context(data, "leaderboard")
616
+
617
+ └─→ Return formatted response
618
+
619
+ 4. Gemini Client (gemini_client.py)
620
+ ├─→ Build system prompt
621
+ ├─→ Format data as JSON
622
+ ├─→ Call Gemini API
623
+ │ └─→ model.generate_content_async(prompt)
624
+ └─→ Return AI-generated analysis
625
+
626
+ 5. Response Path (back through stack)
627
+ └─→ Gemini → gemini_client → mcp_tool → Gradio → MCP Client
628
+
629
+ 6. MCP Client (displays result to user)
630
+ └─→ Shows markdown-formatted analysis
631
+ ```
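+
+ For orientation, steps 3-4 above collapse into something like the following sketch (the sorting key, helper name, and `GeminiClient` constructor arguments are illustrative assumptions rather than the exact production code):
+
+ ```python
+ import os
+ from datasets import load_dataset
+ from gemini_client import GeminiClient
+
+ async def analyze_leaderboard_sketch(metric_focus: str = "cost", top_n: int = 5) -> str:
+     # 3a. Load and reduce the data (cheapest runs first for the "cost" focus)
+     rows = list(load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train"))
+     rows.sort(key=lambda r: r.get("total_cost_usd", 0.0))
+     # 3b/4. Hand the trimmed payload to Gemini for analysis
+     gemini = GeminiClient(os.getenv("GEMINI_API_KEY"), "gemini-2.5-flash-lite")
+     return await gemini.analyze_with_context(
+         data={"metric_focus": metric_focus, "runs": rows[:top_n]},
+         analysis_type="leaderboard",
+     )
+ ```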
632
+
633
+ ### Resource Access Flow
634
+
635
+ ```
636
+ 1. MCP Client
637
+ └─→ Accesses: leaderboard://kshitijthakkar/smoltrace-leaderboard
638
+
639
+ 2. Gradio MCP Server
640
+ └─→ Routes to: get_leaderboard_data(uri)
641
+
642
+ 3. Resource Function
643
+ ├─→ Parse URI to extract repo name
644
+ ├─→ Load dataset from HuggingFace
645
+ ├─→ Convert to JSON
646
+ └─→ Return raw JSON string
647
+
648
+ 4. MCP Client
649
+ └─→ Receives raw JSON data (no AI processing)
650
+ ```
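+
+ A resource handler along these lines could implement step 3 (a minimal sketch; the helper name and the assumption that the repo exposes a `train` split are ours, not taken from the actual implementation):
+
+ ```python
+ import json
+ from datasets import load_dataset
+
+ def _leaderboard_resource_sketch(uri: str) -> str:
+     """Parse a leaderboard:// URI and return the rows as raw JSON (no AI processing)."""
+     repo = uri.replace("leaderboard://", "", 1)  # "username/dataset-name"
+     rows = load_dataset(repo, split="train")
+     return json.dumps([dict(row) for row in rows], default=str)
+ ```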
651
+
652
+ ---
653
+
654
+ ## Deployment Architecture
655
+
656
+ ### HuggingFace Spaces Deployment
657
+
658
+ **Platform**: HuggingFace Spaces
659
+ **SDK**: Docker (for custom dependencies)
660
+ **Hardware**: CPU Basic (free tier) - sufficient for API calls and dataset loading
661
+ **URL**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
662
+
663
+ ### Dockerfile
664
+
665
+ ```dockerfile
666
+ # Base image
667
+ FROM python:3.10-slim
668
+
669
+ # Set working directory
670
+ WORKDIR /app
671
+
672
+ # Copy requirements
673
+ COPY requirements.txt .
674
+
675
+ # Install dependencies
676
+ RUN pip install --no-cache-dir -r requirements.txt
677
+
678
+ # Copy application files
679
+ COPY app.py .
680
+ COPY mcp_tools.py .
681
+ COPY gemini_client.py .
682
+
683
+ # Expose port
684
+ EXPOSE 7860
685
+
686
+ # Set environment variables
687
+ ENV GRADIO_SERVER_NAME="0.0.0.0"
688
+ ENV GRADIO_SERVER_PORT="7860"
689
+
690
+ # Run application
691
+ CMD ["python", "app.py"]
692
+ ```
693
+
694
+ ### Environment Variables (HF Spaces Secrets)
695
+
696
+ ```bash
697
+ # Required
698
+ GEMINI_API_KEY=your_gemini_api_key_here
699
+
700
+ # Optional (for testing)
701
+ HF_TOKEN=your_huggingface_token_here
702
+ ```
703
+
704
+ ### Scaling Considerations
705
+
706
+ **Current Setup** (Free Tier):
707
+ - Hardware: CPU Basic
708
+ - Concurrent Users: ~10-20
709
+ - Request Latency: 2-5 seconds (AI analysis)
710
+ - Rate Limit: Gemini API (1,500 req/day)
711
+
712
+ **If Scaling Needed**:
713
+ 1. **Upgrade Hardware**: CPU Basic → CPU Upgrade (2x performance)
714
+ 2. **Caching**: Add Redis for caching frequent queries
715
+ 3. **API Key Pool**: Rotate multiple Gemini API keys to bypass rate limits
716
+ 4. **Load Balancing**: Deploy multiple Spaces instances with load balancer
717
+
718
+ ---
719
+
720
+ ## Development Workflow
721
+
722
+ ### Local Development Setup
723
+
724
+ ```bash
725
+ # 1. Clone repository
726
+ git clone https://github.com/Mandark-droid/TraceMind-mcp-server.git
727
+ cd TraceMind-mcp-server
728
+
729
+ # 2. Create virtual environment
730
+ python -m venv venv
731
+ source venv/bin/activate # Windows: venv\Scripts\activate
732
+
733
+ # 3. Install dependencies
734
+ pip install -r requirements.txt
735
+
736
+ # 4. Configure environment
737
+ cp .env.example .env
738
+ # Edit .env with your API keys
739
+
740
+ # 5. Run locally
741
+ python app.py
742
+
743
+ # 6. Access
744
+ # - Gradio UI: http://localhost:7860
745
+ # - MCP Endpoint: http://localhost:7860/gradio_api/mcp/sse
746
+ ```
747
+
748
+ ### Testing MCP Tools
749
+
750
+ **Option 1: Gradio UI** (Easiest):
751
+ ```
752
+ 1. Run app.py
753
+ 2. Open http://localhost:7860
754
+ 3. Navigate to tool tab (e.g., "📊 Analyze Leaderboard")
755
+ 4. Fill in parameters
756
+ 5. Click submit button
757
+ 6. View results
758
+ ```
759
+
760
+ **Option 2: Python MCP Client**:
761
+ ```python
762
+ from mcp import ClientSession, ServerParameters
763
+
764
+ async def test_tool():
765
+ session = ClientSession(
766
+ ServerParameters(
767
+ url="http://localhost:7860/gradio_api/mcp/sse",
768
+ transport="sse"
769
+ )
770
+ )
771
+ await session.__aenter__()
772
+
773
+ result = await session.call_tool("analyze_leaderboard", {
774
+ "metric_focus": "cost",
775
+ "top_n": 3
776
+ })
777
+
778
+ print(result.content[0].text)
779
+
780
+ import asyncio
781
+ asyncio.run(test_tool())
782
+ ```
783
+
784
+ ### Adding New MCP Tools
785
+
786
+ **Step 1: Add function to mcp_tools.py**:
787
+ ```python
788
+ @gr.mcp.tool()
789
+ async def new_tool_name(
790
+ param1: str,
791
+ param2: int = 10
792
+ ) -> str:
793
+ """
794
+ Brief description of what this tool does.
795
+
796
+ Detailed explanation of the tool's purpose and behavior.
797
+
798
+ Args:
799
+ param1 (str): Description of param1 with examples
800
+ param2 (int): Description of param2. Default: 10
801
+
802
+ Returns:
803
+ str: Description of what the function returns
804
+ """
805
+ try:
806
+ # Implementation
807
+ result = f"Processed: {param1} with {param2}"
808
+ return result
809
+ except Exception as e:
810
+ return f"❌ **Error**: {str(e)}"
811
+ ```
812
+
813
+ **Step 2: Add testing UI to app.py** (optional):
814
+ ```python
815
+ with gr.Tab("🆕 New Tool"):
816
+ gr.Markdown("## New Tool Name")
817
+ param1_input = gr.Textbox(label="Param 1")
818
+ param2_input = gr.Number(label="Param 2", value=10)
819
+ submit_btn = gr.Button("Execute")
820
+ output = gr.Markdown()
821
+
822
+ submit_btn.click(
823
+ fn=new_tool_name,
824
+ inputs=[param1_input, param2_input],
825
+ outputs=output
826
+ )
827
+ ```
828
+
829
+ **Step 3: Test**:
830
+ ```bash
831
+ python app.py
832
+ # Visit http://localhost:7860
833
+ # Test in new tab
834
+ ```
835
+
836
+ **Step 4: Deploy**:
837
+ ```bash
838
+ git add mcp_tools.py app.py
839
+ git commit -m "feat: Add new_tool_name MCP tool"
840
+ git push origin main
841
+ # HF Spaces auto-deploys
842
+ ```
843
+
844
+ ---
845
+
846
+ ## Performance Considerations
847
+
848
+ ### 1. Token Optimization
849
+
850
+ **Problem**: Loading full datasets consumes excessive tokens in AI analysis.
851
+
852
+ **Solutions**:
853
+ - **get_top_performers**: Returns only top N models (90% token reduction)
854
+ - **get_leaderboard_summary**: Returns aggregated stats (99% token reduction)
855
+ - **Data sampling**: Limit rows when loading datasets (max_rows parameter)
856
+
857
+ **Example**:
858
+ ```python
859
+ # ❌ BAD: Loads 51 rows, ~50K tokens
860
+ full_data = load_dataset("kshitijthakkar/smoltrace-leaderboard")
861
+
862
+ # ✅ GOOD: Returns top 5, ~5K tokens (90% reduction)
863
+ top_5 = await get_top_performers(top_n=5)
864
+
865
+ # ✅ BETTER: Returns summary, ~500 tokens (99% reduction)
866
+ summary = await get_leaderboard_summary()
867
+ ```
868
+
869
+ ### 2. Async Operations
870
+
871
+ All tools are `async` for efficient I/O:
872
+ ```python
873
+ @gr.mcp.tool()
874
+ async def tool_name(...): # ← async
875
+ ds = load_dataset(...) # ← Blocks on I/O
876
+ result = await gemini_client.analyze(...) # ← async API call
877
+ return result
878
+ ```
879
+
880
+ Benefits:
881
+ - Non-blocking API calls
882
+ - Multiple concurrent requests
883
+ - Better resource utilization
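+
+ Note that `load_dataset` itself is synchronous; one way to keep the event loop responsive (a sketch, not necessarily what the server does today) is to offload it to a worker thread:
+
+ ```python
+ import asyncio
+ from datasets import load_dataset
+
+ async def load_dataset_async(repo: str, split: str = "train"):
+     # Run the blocking HuggingFace call in a worker thread so other
+     # MCP requests can be served while the download/parsing happens.
+     return await asyncio.to_thread(load_dataset, repo, split=split)
+ ```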
884
+
885
+ ### 3. Caching (Future Enhancement)
886
+
887
+ **Current**: No caching (stateless)
888
+ **Future**: Add Redis for caching frequent queries
889
+
890
+ ```python
891
+ import redis
892
+ from functools import wraps
893
+
894
+ redis_client = redis.Redis(...)
895
+
896
+ def cache_result(ttl=300):
897
+ def decorator(func):
898
+ @wraps(func)
899
+ async def wrapper(*args, **kwargs):
900
+ # Generate cache key
901
+ cache_key = f"{func.__name__}:{hash((args, tuple(kwargs.items())))}"
902
+
903
+ # Check cache
904
+ cached = redis_client.get(cache_key)
905
+ if cached:
906
+ return cached.decode()
907
+
908
+ # Execute function
909
+ result = await func(*args, **kwargs)
910
+
911
+ # Store in cache
912
+ redis_client.setex(cache_key, ttl, result)
913
+
914
+ return result
915
+ return wrapper
916
+ return decorator
917
+
918
+ @gr.mcp.tool()
919
+ @cache_result(ttl=300) # 5-minute cache
920
+ async def analyze_leaderboard(...):
921
+ pass
922
+ ```
923
+
924
+ ---
925
+
926
+ ## Security
927
+
928
+ ### API Key Management
929
+
930
+ **Storage**:
931
+ - Development: `.env` file (gitignored)
932
+ - Production: HuggingFace Spaces Secrets (encrypted)
933
+
934
+ **Access**:
935
+ ```python
936
+ # gemini_client.py
937
+ api_key = os.getenv("GEMINI_API_KEY")
938
+ if not api_key:
939
+ raise ValueError("GEMINI_API_KEY not set")
940
+ ```
941
+
942
+ **Never**:
943
+ - ❌ Hardcode API keys in source code
944
+ - ❌ Commit `.env` to git
945
+ - ❌ Expose keys in client-side JavaScript
946
+ - ❌ Log API keys in console/files
947
+
948
+ ### Input Validation
949
+
950
+ **Dataset Repository Validation**:
951
+ ```python
952
+ # Only allow "smoltrace-" prefix datasets
953
+ if "smoltrace-" not in dataset_repo:
954
+ return "❌ Error: Dataset must contain 'smoltrace-' prefix for security"
955
+ ```
956
+
957
+ **Parameter Validation**:
958
+ ```python
959
+ # Constrain ranges
960
+ top_n = max(1, min(20, top_n)) # Clamp between 1-20
961
+ max_rows = max(10, min(500, max_rows)) # Clamp between 10-500
962
+ ```
963
+
964
+ ### Rate Limiting
965
+
966
+ **Gemini API**:
967
+ - Free tier: 1,500 requests/day
968
+ - Handled by Google (automatic)
969
+ - Errors returned as user-friendly messages
970
+
971
+ **HuggingFace Datasets**:
972
+ - No rate limits for public datasets
973
+ - Private datasets require HF token
974
+
975
+ ---
976
+
977
+ ## Related Documentation
978
+
979
+ - [README.md](README.md) - Overview and quick start
980
+ - [DOCUMENTATION.md](DOCUMENTATION.md) - Complete API reference
981
+ - [TraceMind-AI Architecture](ARCHITECTURE_TRACEMIND_AI.md) - Client-side architecture
982
+
983
+ ---
984
+
985
+ **Last Updated**: November 21, 2025
986
+ **Version**: 1.0.0
987
+ **Track**: Building MCP (Enterprise)
DOCUMENTATION.md ADDED
@@ -0,0 +1,918 @@
1
+ # TraceMind MCP Server - Complete API Documentation
2
+
3
+ This document provides a comprehensive API reference for all MCP components exposed by the TraceMind MCP Server.
4
+
5
+ ## Table of Contents
6
+
7
+ - [MCP Tools (11)](#mcp-tools)
8
+ - [AI-Powered Analysis Tools](#ai-powered-analysis-tools)
9
+ - [Token-Optimized Tools](#token-optimized-tools)
10
+ - [Data Management Tools](#data-management-tools)
11
+ - [MCP Resources (3)](#mcp-resources)
12
+ - [MCP Prompts (3)](#mcp-prompts)
13
+ - [Error Handling](#error-handling)
14
+ - [Best Practices](#best-practices)
15
+
16
+ ---
17
+
18
+ ## MCP Tools
19
+
20
+ ### AI-Powered Analysis Tools
21
+
22
+ These tools use Google Gemini 2.5 Flash Lite to provide intelligent, context-aware analysis of agent evaluation data.
23
+
24
+ #### 1. analyze_leaderboard
25
+
26
+ Analyzes evaluation leaderboard data from HuggingFace datasets and generates AI-powered insights.
27
+
28
+ **Parameters:**
29
+ - `leaderboard_repo` (str): HuggingFace dataset repository
30
+ - Default: `"kshitijthakkar/smoltrace-leaderboard"`
31
+ - Format: `"username/dataset-name"`
32
+ - `metric_focus` (str): Primary metric to analyze
33
+ - Options: `"overall"`, `"accuracy"`, `"cost"`, `"latency"`, `"co2"`
34
+ - Default: `"overall"`
35
+ - `time_range` (str): Time period to analyze
36
+ - Options: `"last_week"`, `"last_month"`, `"all_time"`
37
+ - Default: `"last_week"`
38
+ - `top_n` (int): Number of top models to highlight
39
+ - Range: 1-20
40
+ - Default: 5
41
+
42
+ **Returns:** String containing AI-generated analysis with:
43
+ - Top performers by selected metric
44
+ - Trade-off analysis (e.g., accuracy vs cost)
45
+ - Trend identification
46
+ - Actionable recommendations
47
+
48
+ **Example Use Case:**
49
+ Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
50
+
51
+ **Example Call:**
52
+ ```python
53
+ result = await analyze_leaderboard(
54
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
55
+ metric_focus="cost",
56
+ time_range="last_week",
57
+ top_n=5
58
+ )
59
+ ```
60
+
61
+ **Example Response:**
62
+ ```
63
+ Based on 247 evaluations in the past week:
64
+
65
+ Top Performers (Cost Focus):
66
+ 1. meta-llama/Llama-3.1-8B: $0.002 per run, 93.4% accuracy
67
+ 2. mistralai/Mistral-7B: $0.003 per run, 91.2% accuracy
68
+ 3. openai/gpt-3.5-turbo: $0.008 per run, 94.1% accuracy
69
+
70
+ Trade-off Analysis:
71
+ - Llama-3.1 offers best cost/performance ratio at 25x cheaper than GPT-4
72
+ - GPT-4 leads in accuracy (95.8%) but costs $0.05 per run
73
+ - For production with 1M runs/month: Llama-3.1 saves $48,000 vs GPT-4
74
+
75
+ Recommendations:
76
+ - Cost-sensitive: Use Llama-3.1-8B (93% accuracy, minimal cost)
77
+ - Accuracy-critical: Use GPT-4 (96% accuracy, premium cost)
78
+ - Balanced: Use GPT-3.5-Turbo (94% accuracy, moderate cost)
79
+ ```
80
+
81
+ ---
82
+
83
+ #### 2. debug_trace
84
+
85
+ Analyzes OpenTelemetry trace data and answers specific questions about agent execution.
86
+
87
+ **Parameters:**
88
+ - `trace_dataset` (str): HuggingFace dataset containing traces
89
+ - Format: `"username/smoltrace-traces-model"`
90
+ - Must contain "smoltrace-" prefix
91
+ - `trace_id` (str): Specific trace ID to analyze
92
+ - Format: `"trace_abc123"`
93
+ - `question` (str): Question about the trace
94
+ - Examples: "Why was tool X called twice?", "Which step took the most time?"
95
+ - `include_metrics` (bool): Include GPU metrics in analysis
96
+ - Default: `true`
97
+
98
+ **Returns:** String containing AI analysis of the trace with:
99
+ - Answer to the specific question
100
+ - Relevant span details
101
+ - Performance insights
102
+ - GPU metrics (if available and requested)
103
+
104
+ **Example Use Case:**
105
+ When an agent test fails, understand exactly what happened without manually parsing trace spans.
106
+
107
+ **Example Call:**
108
+ ```python
109
+ result = await debug_trace(
110
+ trace_dataset="kshitij/smoltrace-traces-gpt4",
111
+ trace_id="trace_abc123",
112
+ question="Why was the search tool called twice?",
113
+ include_metrics=True
114
+ )
115
+ ```
116
+
117
+ **Example Response:**
118
+ ```
119
+ Based on trace analysis:
120
+
121
+ Answer:
122
+ The agent called the search_web tool twice due to an iterative reasoning pattern:
123
+
124
+ 1. First call (span_003 at 14:23:19.000):
125
+ - Query: "weather in Tokyo"
126
+ - Duration: 890ms
127
+ - Result: 5 results, oldest was 2 days old
128
+
129
+ 2. Second call (span_005 at 14:23:21.200):
130
+ - Query: "latest weather in Tokyo"
131
+ - Duration: 1200ms
132
+ - Modified reasoning: LLM determined first results were stale
133
+
134
+ Performance Impact:
135
+ - Added 2.09s to total execution time
136
+ - Cost increase: +$0.0003 (tokens for second reasoning step)
137
+ - This is normal behavior for tool-calling agents with iterative reasoning
138
+
139
+ GPU Metrics:
140
+ - N/A (API model, no GPU used)
141
+ ```
142
+
143
+ ---
144
+
145
+ #### 3. estimate_cost
146
+
147
+ Predicts costs, duration, and environmental impact before running evaluations.
148
+
149
+ **Parameters:**
150
+ - `model` (str, required): Model name to evaluate
151
+ - Format: `"provider/model-name"` (e.g., `"openai/gpt-4"`, `"meta-llama/Llama-3.1-8B"`)
152
+ - `agent_type` (str): Type of agent evaluation
153
+ - Options: `"tool"`, `"code"`, `"both"`
154
+ - Default: `"both"`
155
+ - `num_tests` (int): Number of test cases
156
+ - Range: 1-10000
157
+ - Default: 100
158
+ - `hardware` (str): Hardware type
159
+ - Options: `"auto"`, `"cpu"`, `"gpu_a10"`, `"gpu_h200"`
160
+ - Default: `"auto"` (auto-selects based on model)
161
+
162
+ **Returns:** String containing cost estimate with:
163
+ - LLM API costs (for API models)
164
+ - HuggingFace Jobs compute costs (for local models)
165
+ - Estimated duration
166
+ - CO2 emissions estimate
167
+ - Hardware recommendations
168
+
169
+ **Example Use Case:**
170
+ Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
171
+
172
+ **Example Call:**
173
+ ```python
174
+ result = await estimate_cost(
175
+ model="openai/gpt-4",
176
+ agent_type="both",
177
+ num_tests=1000,
178
+ hardware="auto"
179
+ )
180
+ ```
181
+
182
+ **Example Response:**
183
+ ```
184
+ Cost Estimate for openai/gpt-4:
185
+
186
+ LLM API Costs:
187
+ - Estimated tokens per test: 1,500
188
+ - Token cost: $0.03/1K input, $0.06/1K output
189
+ - Total LLM cost: $50.00 (1000 tests)
190
+
191
+ Compute Costs:
192
+ - Recommended hardware: cpu-basic (API model)
193
+ - HF Jobs cost: ~$0.05/hr
194
+ - Estimated duration: 45 minutes
195
+ - Total compute cost: $0.04
196
+
197
+ Total Cost: $50.04
198
+ Cost per test: $0.05
199
+ CO2 emissions: ~0.5g (API calls, minimal compute)
200
+
201
+ Recommendations:
202
+ - This is an API model, CPU hardware is sufficient
203
+ - For cost optimization, consider Llama-3.1-8B (25x cheaper)
204
+ - Estimated runtime: 45 minutes for 1000 tests
205
+ ```
206
+
207
+ ---
208
+
209
+ #### 4. compare_runs
210
+
211
+ Compares two evaluation runs with AI-powered analysis across multiple dimensions.
212
+
213
+ **Parameters:**
214
+ - `run_id_1` (str, required): First run ID from leaderboard
215
+ - `run_id_2` (str, required): Second run ID from leaderboard
216
+ - `leaderboard_repo` (str): Leaderboard dataset repository
217
+ - Default: `"kshitijthakkar/smoltrace-leaderboard"`
218
+ - `focus` (str): Comparison focus area
219
+ - Options:
220
+ - `"comprehensive"`: All dimensions
221
+ - `"cost"`: Cost efficiency and ROI
222
+ - `"performance"`: Speed and accuracy trade-offs
223
+ - `"eco_friendly"`: Environmental impact
224
+ - Default: `"comprehensive"`
225
+
226
+ **Returns:** String containing AI comparison with:
227
+ - Success rate comparison with statistical significance
228
+ - Cost efficiency analysis
229
+ - Speed comparison
230
+ - Environmental impact (CO2 emissions)
231
+ - GPU efficiency (for GPU jobs)
232
+
233
+ **Example Use Case:**
234
+ After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment.
235
+
236
+ **Example Call:**
237
+ ```python
238
+ result = await compare_runs(
239
+ run_id_1="run_abc123",
240
+ run_id_2="run_def456",
241
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
242
+ focus="cost"
243
+ )
244
+ ```
245
+
246
+ **Example Response:**
247
+ ```
248
+ Comparison: GPT-4 vs Llama-3.1-8B (Cost Focus)
249
+
250
+ Success Rates:
251
+ - GPT-4: 95.8% (96/100 tests)
252
+ - Llama-3.1: 93.4% (93/100 tests)
253
+ - Difference: +2.4% for GPT-4 (statistically significant, p<0.05)
254
+
255
+ Cost Efficiency:
256
+ - GPT-4: $0.05 per test, $0.052 per successful test
257
+ - Llama-3.1: $0.002 per test, $0.0021 per successful test
258
+ - Cost ratio: GPT-4 is 25x more expensive
259
+
260
+ ROI Analysis:
261
+ - For 1M evaluations/month:
262
+ - GPT-4: $50,000/month, 958K successes
263
+ - Llama-3.1: $2,000/month, 934K successes
264
+ - GPT-4 provides 24K more successes for $48K more cost
265
+ - Cost per additional success: $2.00
266
+
267
+ Recommendation (Cost Focus):
268
+ Use Llama-3.1-8B for cost-sensitive workloads where 93% accuracy is acceptable.
269
+ Switch to GPT-4 only for accuracy-critical tasks where the 2.4% improvement justifies 25x cost.
270
+ ```
271
+
272
+ ---
273
+
274
+ #### 5. analyze_results
275
+
276
+ Analyzes detailed test results and provides optimization recommendations.
277
+
278
+ **Parameters:**
279
+ - `results_repo` (str, required): HuggingFace dataset containing results
280
+ - Format: `"username/smoltrace-results-model-timestamp"`
281
+ - Must contain "smoltrace-results-" prefix
282
+ - `analysis_focus` (str): Focus area for analysis
283
+ - Options: `"failures"`, `"performance"`, `"cost"`, `"comprehensive"`
284
+ - Default: `"comprehensive"`
285
+ - `max_rows` (int): Maximum test cases to analyze
286
+ - Range: 10-500
287
+ - Default: 100
288
+
289
+ **Returns:** String containing AI analysis with:
290
+ - Failure patterns and root causes
291
+ - Performance bottlenecks in specific test cases
292
+ - Cost optimization opportunities
293
+ - Tool usage patterns
294
+ - Task-specific insights (which types work well vs poorly)
295
+ - Actionable optimization recommendations
296
+
297
+ **Example Use Case:**
298
+ After running an evaluation, analyze the detailed test results to understand why certain tests are failing and get specific recommendations for improving success rate.
299
+
300
+ **Example Call:**
301
+ ```python
302
+ result = await analyze_results(
303
+ results_repo="kshitij/smoltrace-results-gpt4-20251120",
304
+ analysis_focus="failures",
305
+ max_rows=100
306
+ )
307
+ ```
308
+
309
+ **Example Response:**
310
+ ```
311
+ Analysis of Test Results (100 tests analyzed)
312
+
313
+ Overall Statistics:
314
+ - Success Rate: 89% (89/100 tests passed)
315
+ - Average Duration: 3.2s per test
316
+ - Total Cost: $4.50 ($0.045 per test)
317
+
318
+ Failure Analysis (11 failures):
319
+ 1. Tool Not Found (6 failures):
320
+ - Test IDs: task_012, task_045, task_067, task_089, task_091, task_093
321
+ - Pattern: All failed tests required the 'get_weather' tool
322
+ - Root Cause: Tool definition missing or incorrect name
323
+ - Fix: Ensure 'get_weather' tool is available in agent's tool list
324
+
325
+ 2. Timeout (3 failures):
326
+ - Test IDs: task_034, task_071, task_088
327
+ - Pattern: Complex multi-step tasks with >5 tool calls
328
+ - Root Cause: Exceeding 30s timeout limit
329
+ - Fix: Increase timeout to 60s or simplify complex tasks
330
+
331
+ 3. Incorrect Response (2 failures):
332
+ - Test IDs: task_056, task_072
333
+ - Pattern: Math calculation tasks
334
+ - Root Cause: Model hallucinating numbers instead of using calculator tool
335
+ - Fix: Update prompt to emphasize tool usage for calculations
336
+
337
+ Performance Insights:
338
+ - Fast tasks (<2s): 45 tests - Simple single-tool calls
339
+ - Slow tasks (>5s): 12 tests - Multi-step reasoning with 3+ tools
340
+ - Optimal duration: 2-3s for most tasks
341
+
342
+ Cost Optimization:
343
+ - High-cost tests: task_023 ($0.12) - Used 4K tokens
344
+ - Low-cost tests: task_087 ($0.008) - Used 180 tokens
345
+ - Recommendation: Optimize prompt to reduce token usage by 20%
346
+
347
+ Recommendations:
348
+ 1. Add missing 'get_weather' tool → Fixes 6 failures
349
+ 2. Increase timeout from 30s to 60s → Fixes 3 failures
350
+ 3. Strengthen calculator tool instruction → Fixes 2 failures
351
+ 4. Expected improvement: 89% → 100% success rate
352
+ ```
353
+
354
+ ---
355
+
356
+ ### Token-Optimized Tools
357
+
358
+ These tools are specifically designed to minimize token usage when querying leaderboard data.
359
+
360
+ #### 6. get_top_performers
361
+
362
+ Get top N performing models from leaderboard with 90% token reduction.
363
+
364
+ **Performance Optimization:** Returns only top N models instead of loading the full leaderboard dataset (51 runs), resulting in **90% token reduction**.
365
+
366
+ **When to Use:** Perfect for queries like "Which model is leading?", "Show me the top 5 models".
367
+
368
+ **Parameters:**
369
+ - `leaderboard_repo` (str): HuggingFace dataset repository
370
+ - Default: `"kshitijthakkar/smoltrace-leaderboard"`
371
+ - `metric` (str): Metric to rank by
372
+ - Options: `"success_rate"`, `"total_cost_usd"`, `"avg_duration_ms"`, `"co2_emissions_g"`
373
+ - Default: `"success_rate"`
374
+ - `top_n` (int): Number of top models to return
375
+ - Range: 1-20
376
+ - Default: 5
377
+
378
+ **Returns:** JSON string with:
379
+ - Metric used for ranking
380
+ - Ranking order (ascending/descending)
381
+ - Total runs in leaderboard
382
+ - Array of top performers with 10 essential fields
383
+
384
+ **Benefits:**
385
+ - ✅ Token Reduction: 90% fewer tokens vs full dataset
386
+ - ✅ Ready to Use: Properly formatted JSON
387
+ - ✅ Pre-Sorted: Already ranked by chosen metric
388
+ - ✅ Essential Data Only: 10 fields vs 20+ in full dataset
389
+
390
+ **Example Call:**
391
+ ```python
392
+ result = await get_top_performers(
393
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
394
+ metric="total_cost_usd",
395
+ top_n=3
396
+ )
397
+ ```
398
+
399
+ **Example Response:**
400
+ ```json
401
+ {
402
+ "metric": "total_cost_usd",
403
+ "order": "ascending",
404
+ "total_runs": 51,
405
+ "top_performers": [
406
+ {
407
+ "run_id": "run_001",
408
+ "model": "meta-llama/Llama-3.1-8B",
409
+ "success_rate": 93.4,
410
+ "total_cost_usd": 0.002,
411
+ "avg_duration_ms": 2100,
412
+ "agent_type": "both",
413
+ "provider": "transformers",
414
+ "submitted_by": "kshitij",
415
+ "timestamp": "2025-11-20T10:30:00Z",
416
+ "total_tests": 100
417
+ },
418
+ ...
419
+ ]
420
+ }
421
+ ```
422
+
423
+ ---
424
+
425
+ #### 7. get_leaderboard_summary
426
+
427
+ Get high-level leaderboard statistics with 99% token reduction.
428
+
429
+ **Performance Optimization:** Returns only aggregated statistics instead of raw data, resulting in **99% token reduction**.
430
+
431
+ **When to Use:** Perfect for overview queries like "How many runs are in the leaderboard?", "What's the average success rate?".
432
+
433
+ **Parameters:**
434
+ - `leaderboard_repo` (str): HuggingFace dataset repository
435
+ - Default: `"kshitijthakkar/smoltrace-leaderboard"`
436
+
437
+ **Returns:** JSON string with:
438
+ - Total runs count
439
+ - Unique models and submitters
440
+ - Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
441
+ - Breakdown by agent type
442
+ - Breakdown by provider
443
+ - Top 3 models by success rate
444
+
445
+ **Benefits:**
446
+ - ✅ Extreme Token Reduction: 99% fewer tokens
447
+ - ✅ Ready to Use: Properly formatted JSON
448
+ - ✅ Comprehensive Stats: Averages, distributions, breakdowns
449
+ - ✅ Quick Insights: Perfect for overview questions
450
+
451
+ **Example Call:**
452
+ ```python
453
+ result = await get_leaderboard_summary(
454
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
455
+ )
456
+ ```
457
+
458
+ **Example Response:**
459
+ ```json
460
+ {
461
+ "total_runs": 51,
462
+ "unique_models": 12,
463
+ "unique_submitters": 3,
464
+ "overall_stats": {
465
+ "avg_success_rate": 89.2,
466
+ "best_success_rate": 95.8,
467
+ "worst_success_rate": 78.3,
468
+ "avg_cost_usd": 0.012,
469
+ "avg_duration_ms": 3200,
470
+ "total_co2_g": 45.6
471
+ },
472
+ "by_agent_type": {
473
+ "tool": {"count": 20, "avg_success_rate": 88.5},
474
+ "code": {"count": 18, "avg_success_rate": 87.2},
475
+ "both": {"count": 13, "avg_success_rate": 92.1}
476
+ },
477
+ "by_provider": {
478
+ "litellm": {"count": 30, "avg_success_rate": 91.3},
479
+ "transformers": {"count": 21, "avg_success_rate": 86.4}
480
+ },
481
+ "top_3_models": [
482
+ {"model": "openai/gpt-4", "success_rate": 95.8},
483
+ {"model": "anthropic/claude-3", "success_rate": 94.1},
484
+ {"model": "meta-llama/Llama-3.1-8B", "success_rate": 93.4}
485
+ ]
486
+ }
487
+ ```
488
+
489
+ ---
490
+
491
+ ### Data Management Tools
492
+
493
+ #### 8. get_dataset
494
+
495
+ Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON.
496
+
497
+ **⚠️ Important:** For leaderboard queries, prefer using `get_top_performers()` or `get_leaderboard_summary()` to avoid token bloat!
498
+
499
+ **Security Restriction:** Only datasets with "smoltrace-" in the repository name are allowed.
500
+
501
+ **Parameters:**
502
+ - `dataset_repo` (str, required): HuggingFace dataset repository
503
+ - Must contain "smoltrace-" prefix
504
+ - Format: `"username/smoltrace-type-model"`
505
+ - `split` (str): Dataset split to load
506
+ - Default: `"train"`
507
+ - `limit` (int): Maximum rows to return
508
+ - Range: 1-200
509
+ - Default: 100
510
+
511
+ **Returns:** JSON string with:
512
+ - Total rows in dataset
513
+ - List of column names
514
+ - Array of data rows (up to `limit`)
515
+
516
+ **Primary Use Cases:**
517
+ - Load `smoltrace-results-*` datasets for test case details
518
+ - Load `smoltrace-traces-*` datasets for OpenTelemetry data
519
+ - Load `smoltrace-metrics-*` datasets for GPU metrics
520
+ - **NOT recommended** for leaderboard queries (use optimized tools)
521
+
522
+ **Example Call:**
523
+ ```python
524
+ result = await get_dataset(
525
+ dataset_repo="kshitij/smoltrace-results-gpt4",
526
+ split="train",
527
+ limit=50
528
+ )
529
+ ```
530
+
531
+ ---
532
+
533
+ #### 9. generate_synthetic_dataset
534
+
535
+ Creates domain-specific test datasets for SMOLTRACE evaluations using AI.
536
+
537
+ **Parameters:**
538
+ - `domain` (str, required): Domain for tasks
539
+ - Examples: "e-commerce", "customer service", "finance", "healthcare"
540
+ - `tools` (list[str], required): Available tools
541
+ - Example: `["search_web", "get_weather", "calculator"]`
542
+ - `num_tasks` (int): Number of tasks to generate
543
+ - Range: 1-100
544
+ - Default: 20
545
+ - `difficulty_distribution` (str): Task difficulty mix
546
+ - Options: `"balanced"`, `"easy_only"`, `"medium_only"`, `"hard_only"`, `"progressive"`
547
+ - Default: `"balanced"`
548
+ - `agent_type` (str): Target agent type
549
+ - Options: `"tool"`, `"code"`, `"both"`
550
+ - Default: `"both"`
551
+
552
+ **Returns:** JSON string with:
553
+ - `dataset_info`: Metadata (domain, tools, counts, timestamp)
554
+ - `tasks`: Array of SMOLTRACE-formatted tasks
555
+ - `usage_instructions`: Guide for HuggingFace upload and SMOLTRACE usage
556
+
557
+ **SMOLTRACE Task Format:**
558
+ ```json
559
+ {
560
+ "id": "unique_identifier",
561
+ "prompt": "Clear, specific task for the agent",
562
+ "expected_tool": "tool_name",
563
+ "expected_tool_calls": 1,
564
+ "difficulty": "easy|medium|hard",
565
+ "agent_type": "tool|code",
566
+ "expected_keywords": ["keyword1", "keyword2"]
567
+ }
568
+ ```
569
+
570
+ **Difficulty Calibration:**
571
+ - **Easy** (40%): Single tool call, straightforward input
572
+ - **Medium** (40%): Multiple tool calls OR complex input parsing
573
+ - **Hard** (20%): Multiple tools, complex reasoning, edge cases
574
+
575
+ **Enterprise Use Cases:**
576
+ - Custom Tools: Benchmark proprietary APIs
577
+ - Industry-Specific: Generate tasks for finance, healthcare, legal
578
+ - Internal Workflows: Test company-specific processes
579
+
580
+ **Example Call:**
581
+ ```python
582
+ result = await generate_synthetic_dataset(
583
+ domain="customer service",
584
+ tools=["search_knowledge_base", "create_ticket", "send_email"],
585
+ num_tasks=50,
586
+ difficulty_distribution="balanced",
587
+ agent_type="tool"
588
+ )
589
+ ```
590
+
591
+ ---
592
+
593
+ #### 10. push_dataset_to_hub
594
+
595
+ Upload generated datasets to HuggingFace Hub with proper formatting.
596
+
597
+ **Parameters:**
598
+ - `dataset_name` (str, required): Repository name on HuggingFace
599
+ - Format: `"username/my-dataset"`
600
+ - `data` (str or list, required): Dataset content
601
+ - Can be JSON string or list of dictionaries
602
+ - `description` (str): Dataset description for card
603
+ - Default: Auto-generated
604
+ - `private` (bool): Make dataset private
605
+ - Default: `False`
606
+
607
+ **Returns:** Success message with dataset URL
608
+
609
+ **Example Workflow:**
610
+ 1. Generate synthetic dataset with `generate_synthetic_dataset`
611
+ 2. Review and modify tasks if needed
612
+ 3. Upload to HuggingFace with `push_dataset_to_hub`
613
+ 4. Use in SMOLTRACE evaluations or share with team
614
+
615
+ **Example Call:**
616
+ ```python
617
+ result = await push_dataset_to_hub(
618
+ dataset_name="kshitij/my-custom-evaluation",
619
+ data=generated_tasks,
620
+ description="Custom evaluation dataset for e-commerce agents",
621
+ private=False
622
+ )
623
+ ```
624
+
625
+ ---
626
+
627
+ #### 11. generate_prompt_template
628
+
629
+ Generate customized smolagents prompt template for a specific domain and tool set.
630
+
631
+ **Parameters:**
632
+ - `domain` (str, required): Domain for the prompt template
633
+ - Examples: `"finance"`, `"healthcare"`, `"customer_support"`, `"e-commerce"`
634
+ - `tool_names` (str, required): Comma-separated list of tool names
635
+ - Format: `"tool1,tool2,tool3"`
636
+ - Example: `"get_stock_price,calculate_roi,fetch_company_info"`
637
+ - `agent_type` (str): Agent type
638
+ - Options: `"tool"` (ToolCallingAgent), `"code"` (CodeAgent)
639
+ - Default: `"tool"`
640
+
641
+ **Returns:** JSON response containing:
642
+ - Customized YAML prompt template
643
+ - Metadata (domain, tools, agent_type, timestamp)
644
+ - Usage instructions
645
+
646
+ **Use Case:**
647
+ When you generate synthetic datasets with `generate_synthetic_dataset`, use this tool to create a matching prompt template that agents can use during evaluation. This ensures your evaluation setup is complete and ready to run.
648
+
649
+ **Integration:**
650
+ The generated prompt template can be included in your HuggingFace dataset card, making it easy for anyone to run evaluations with your dataset.
651
+
652
+ **Example Call:**
653
+ ```python
654
+ result = await generate_prompt_template(
655
+ domain="customer_support",
656
+ tool_names="search_knowledge_base,create_ticket,send_email,escalate_to_human",
657
+ agent_type="tool"
658
+ )
659
+ ```
660
+
661
+ **Example Response:**
662
+ ```json
663
+ {
664
+ "prompt_template": "---\nname: customer_support_agent\ndescription: An AI agent for customer support tasks...\n\ninstructions: |-\n You are a helpful customer support agent...\n \n Available tools:\n - search_knowledge_base: Search the knowledge base...\n - create_ticket: Create a support ticket...\n ...",
665
+ "metadata": {
666
+ "domain": "customer_support",
667
+ "tools": ["search_knowledge_base", "create_ticket", "send_email", "escalate_to_human"],
668
+ "agent_type": "tool",
669
+ "base_template": "ToolCallingAgent",
670
+ "timestamp": "2025-11-21T10:30:00Z"
671
+ },
672
+ "usage_instructions": "1. Save the prompt_template to a file (e.g., customer_support_prompt.yaml)\n2. Use with SMOLTRACE: smoltrace-eval --model your-model --prompt-file customer_support_prompt.yaml\n3. Or include in your dataset card for easy evaluation"
673
+ }
674
+ ```
675
+
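+ The first usage instruction can be scripted directly from the JSON response. A minimal sketch in Python (`raw_response` stands for the JSON string returned by the tool; the file-name scheme is just an example):
+
+ ```python
+ import json
+
+ def save_prompt_template(raw_response: str) -> str:
+     """Write the prompt_template field of a generate_prompt_template response to a YAML file."""
+     result = json.loads(raw_response)
+     path = f"{result['metadata']['domain']}_prompt.yaml"  # e.g. customer_support_prompt.yaml
+     with open(path, "w", encoding="utf-8") as f:
+         f.write(result["prompt_template"])
+     return path
+ ```
+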
676
+ ---
677
+
678
+ ## MCP Resources
679
+
680
+ Resources provide direct data access without AI analysis. Access via URI scheme.
681
+
682
+ ### 1. leaderboard://{repo}
683
+
684
+ Direct access to raw leaderboard data in JSON format.
685
+
686
+ **URI Format:**
687
+ ```
688
+ leaderboard://username/dataset-name
689
+ ```
690
+
691
+ **Example:**
692
+ ```
693
+ GET leaderboard://kshitijthakkar/smoltrace-leaderboard
694
+ ```
695
+
696
+ **Returns:** JSON array with all evaluation runs, including:
697
+ - run_id, model, agent_type, provider
698
+ - success_rate, total_tests, successful_tests, failed_tests
699
+ - avg_duration_ms, total_tokens, total_cost_usd, co2_emissions_g
700
+ - results_dataset, traces_dataset, metrics_dataset (references)
701
+ - timestamp, submitted_by, hf_job_id
702
+
703
+ ---
704
+
705
+ ### 2. trace://{trace_id}/{repo}
706
+
707
+ Direct access to trace data with OpenTelemetry spans.
708
+
709
+ **URI Format:**
710
+ ```
711
+ trace://trace_id/username/dataset-name
712
+ ```
713
+
714
+ **Example:**
715
+ ```
716
+ GET trace://trace_abc123/kshitij/agent-traces-gpt4
717
+ ```
718
+
719
+ **Returns:** JSON with:
720
+ - traceId
721
+ - spans array (spanId, parentSpanId, name, kind, startTime, endTime, attributes, status)
722
+
723
+ ---
724
+
725
+ ### 3. cost://model/{model_name}
726
+
727
+ Model pricing and hardware cost information.
728
+
729
+ **URI Format:**
730
+ ```
731
+ cost://model/provider/model-name
732
+ ```
733
+
734
+ **Example:**
735
+ ```
736
+ GET cost://model/openai/gpt-4
737
+ ```
738
+
739
+ **Returns:** JSON with:
740
+ - Model pricing (input/output token costs)
741
+ - Recommended hardware tier
742
+ - Estimated compute costs
743
+ - CO2 emissions per 1K tokens
744
+
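+ **Programmatic access (sketch):**
+
+ Resources can also be read from an MCP client session. A minimal sketch, assuming the official `mcp` Python SDK (`pip install mcp`) and the public SSE endpoint listed in the README; the URI is passed exactly as shown above:
+
+ ```python
+ import asyncio
+ from mcp import ClientSession
+ from mcp.client.sse import sse_client
+
+ SSE_URL = "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
+
+ async def read_cost_resource() -> str:
+     async with sse_client(SSE_URL) as (read_stream, write_stream):
+         async with ClientSession(read_stream, write_stream) as session:
+             await session.initialize()
+             # Resources return raw JSON without any AI analysis
+             result = await session.read_resource("cost://model/openai/gpt-4")
+             return result.contents[0].text
+
+ print(asyncio.run(read_cost_resource()))
+ ```
+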
745
+ ---
746
+
747
+ ## MCP Prompts
748
+
749
+ Prompts provide reusable templates for standardized interactions.
750
+
751
+ ### 1. analysis_prompt
752
+
753
+ Templates for different analysis types.
754
+
755
+ **Parameters:**
756
+ - `analysis_type` (str): Type of analysis
757
+ - Options: `"leaderboard"`, `"cost"`, `"performance"`, `"trace"`
758
+ - `focus_area` (str): Specific focus
759
+ - Options: `"overall"`, `"cost"`, `"accuracy"`, `"speed"`, `"eco"`
760
+ - `detail_level` (str): Level of detail
761
+ - Options: `"summary"`, `"detailed"`, `"comprehensive"`
762
+
763
+ **Returns:** Formatted prompt string for use with AI tools
764
+
765
+ **Example:**
766
+ ```python
767
+ prompt = analysis_prompt(
768
+ analysis_type="leaderboard",
769
+ focus_area="cost",
770
+ detail_level="detailed"
771
+ )
772
+ # Returns: "Provide a detailed analysis of cost efficiency in the leaderboard..."
773
+ ```
774
+
775
+ ---
776
+
777
+ ### 2. debug_prompt
778
+
779
+ Templates for debugging scenarios.
780
+
781
+ **Parameters:**
782
+ - `debug_type` (str): Type of debugging
783
+ - Options: `"failure"`, `"performance"`, `"tool_calling"`, `"reasoning"`
784
+ - `context` (str): Additional context
785
+ - Options: `"test_failure"`, `"timeout"`, `"unexpected_tool"`, `"reasoning_loop"`
786
+
787
+ **Returns:** Formatted prompt string
788
+
789
+ **Example:**
790
+ ```python
791
+ prompt = debug_prompt(
792
+ debug_type="performance",
793
+ context="tool_calling"
794
+ )
795
+ # Returns: "Analyze tool calling performance. Identify which tools are slow..."
796
+ ```
797
+
798
+ ---
799
+
800
+ ### 3. optimization_prompt
801
+
802
+ Templates for optimization goals.
803
+
804
+ **Parameters:**
805
+ - `optimization_goal` (str): Optimization target
806
+ - Options: `"cost"`, `"speed"`, `"accuracy"`, `"co2"`
807
+ - `constraints` (str): Constraints to respect
808
+ - Options: `"maintain_quality"`, `"no_accuracy_loss"`, `"budget_limit"`, `"time_limit"`
809
+
810
+ **Returns:** Formatted prompt string
811
+
812
+ **Example:**
813
+ ```python
814
+ prompt = optimization_prompt(
815
+ optimization_goal="cost",
816
+ constraints="maintain_quality"
817
+ )
818
+ # Returns: "Analyze this evaluation setup and recommend cost optimizations..."
819
+ ```
820
+
821
+ ---
822
+
823
+ ## Error Handling
824
+
825
+ ### Common Error Responses
826
+
827
+ **Invalid Dataset Repository:**
828
+ ```json
829
+ {
830
+ "error": "Dataset must contain 'smoltrace-' prefix for security",
831
+ "provided": "username/invalid-dataset"
832
+ }
833
+ ```
834
+
835
+ **Dataset Not Found:**
836
+ ```json
837
+ {
838
+ "error": "Dataset not found on HuggingFace",
839
+ "repository": "username/smoltrace-nonexistent"
840
+ }
841
+ ```
842
+
843
+ **API Rate Limit:**
844
+ ```json
845
+ {
846
+ "error": "Gemini API rate limit exceeded",
847
+ "retry_after": 60
848
+ }
849
+ ```
850
+
851
+ **Invalid Parameters:**
852
+ ```json
853
+ {
854
+ "error": "Invalid parameter value",
855
+ "parameter": "top_n",
856
+ "value": 50,
857
+ "allowed_range": "1-20"
858
+ }
859
+ ```
860
+
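+ A minimal client-side sketch for handling these payloads (the helper name is illustrative, not part of the server API):
+
+ ```python
+ import json
+
+ def parse_tool_response(raw: str) -> dict:
+     """Parse a TraceMind tool response and raise on the error payloads documented above."""
+     payload = json.loads(raw)
+     if isinstance(payload, dict) and "error" in payload:
+         if "retry_after" in payload:
+             # Rate-limit errors carry a retry hint in seconds
+             raise RuntimeError(f"Rate limited, retry in {payload['retry_after']}s")
+         raise ValueError(payload["error"])
+     return payload
+ ```
+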
861
+ ---
862
+
863
+ ## Best Practices
864
+
865
+ ### 1. Token Optimization
866
+
867
+ **DO:**
868
+ - Use `get_top_performers()` for "top N" queries (90% token reduction; see the sketch below)
869
+ - Use `get_leaderboard_summary()` for overview queries (99% token reduction)
870
+ - Set appropriate `limit` when using `get_dataset()`
871
+
872
+ **DON'T:**
873
+ - Use `get_dataset()` for leaderboard queries (loads all 51 runs)
874
+ - Request more data than needed
875
+ - Ignore token optimization tools
876
+
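+ A short sketch of the preferred call order, written in the same style as the Example Call blocks above (the `limit` parameter name for `get_dataset` is assumed from its row-limit description):
+
+ ```python
+ # Overview question: aggregated stats only (99% fewer tokens than the full dataset)
+ summary = await get_leaderboard_summary(
+     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
+ )
+
+ # "Top N" question: ranked, trimmed rows (90% fewer tokens)
+ top_cost = await get_top_performers(
+     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
+     metric="total_cost_usd",
+     top_n=5,
+ )
+
+ # Full rows only for non-leaderboard datasets, and always with a small limit
+ traces = await get_dataset("username/smoltrace-traces-gpt4", limit=50)
+ ```
+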
877
+ ### 2. AI Tool Usage
878
+
879
+ **DO:**
880
+ - Use AI tools (`analyze_leaderboard`, `debug_trace`) for complex analysis
881
+ - Provide specific questions to `debug_trace` for focused answers
882
+ - Use `focus` parameter in `compare_runs` for targeted comparisons
883
+
884
+ **DON'T:**
885
+ - Use AI tools for simple data retrieval (use resources instead)
886
+ - Make vague requests (be specific for better results)
887
+
888
+ ### 3. Dataset Security
889
+
890
+ **DO:**
891
+ - Only use datasets with "smoltrace-" prefix
892
+ - Verify dataset exists before requesting
893
+ - Use public datasets or authenticate for private ones
894
+
895
+ **DON'T:**
896
+ - Try to access arbitrary HuggingFace datasets
897
+ - Share private dataset URLs without authentication
898
+
899
+ ### 4. Cost Management
900
+
901
+ **DO:**
902
+ - Use `estimate_cost` before running large evaluations
903
+ - Compare cost estimates across different models
904
+ - Consider token-optimized tools to reduce API costs
905
+
906
+ **DON'T:**
907
+ - Skip cost estimation for expensive operations
908
+ - Ignore hardware recommendations
909
+ - Overlook CO2 emissions in decision-making
910
+
911
+ ---
912
+
913
+ ## Support
914
+
915
+ For issues or questions:
916
+ - 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
917
+ - 💬 HF Discord: `#agents-mcp-hackathon-winter25`
918
+ - 🏷️ Tag: `building-mcp-track-enterprise`
README.md CHANGED
@@ -23,497 +23,143 @@ tags:
23
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-mcp-server/assets/Logo.png" alt="TraceMind MCP Server Logo" width="200"/>
24
  </p>
25
 
26
- **AI-Powered Analysis Tools for Agent Evaluation Data**
27
 
28
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
29
- [![Track 1](https://img.shields.io/badge/Track-Building%20MCP%20(Enterprise)-blue)](https://github.com/modelcontextprotocol/hackathon)
30
- [![HF Space](https://img.shields.io/badge/HuggingFace-TraceMind--MCP--Server-yellow?logo=huggingface)](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)
31
- [![Google Gemini](https://img.shields.io/badge/Powered%20by-Google%20Gemini%202.5%20Pro-orange)](https://ai.google.dev/)
32
 
33
  > **🎯 Track 1 Submission**: Building MCP (Enterprise)
34
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
35
 
36
- ## Overview
37
-
38
- TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
39
-
40
- ### 🏗️ **Built on Open Source Foundation**
41
 
42
- This MCP server is part of a complete agent evaluation ecosystem built on two foundational open-source projects:
43
 
44
- **🔭 TraceVerde (genai_otel_instrument)** - Automatic OpenTelemetry Instrumentation
45
- - **What**: Zero-code OTEL instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
46
- - **Why**: Captures every LLM call, tool usage, and agent step automatically
47
- - **Links**: [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
48
 
49
- **📊 SMOLTRACE** - Agent Evaluation Engine
50
- - **What**: Lightweight, production-ready evaluation framework with OTEL tracing built-in
51
- - **Why**: Generates structured datasets (leaderboard, results, traces, metrics) that this MCP server analyzes
52
- - **Links**: [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
53
 
54
- **The Flow**: `TraceVerde` instruments your agents → `SMOLTRACE` evaluates them → `TraceMind MCP Server` provides AI-powered analysis of the results
55
 
56
  ---
57
 
58
- ### 🛠️ **9 AI-Powered & Optimized Tools**
59
- 1. **📊 analyze_leaderboard**: Generate AI-powered insights from evaluation leaderboard data
60
- 2. **🐛 debug_trace**: Debug specific agent execution traces using OpenTelemetry data with AI assistance
61
- 3. **💰 estimate_cost**: Predict evaluation costs before running with AI-powered recommendations
62
- 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
63
- 5. **🏆 get_top_performers**: Get top N models from leaderboard (optimized for quick queries, avoids token bloat)
64
- 6. **📈 get_leaderboard_summary**: Get high-level leaderboard statistics (optimized for overview queries)
65
- 7. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
66
- 8. **🧪 generate_synthetic_dataset**: Create domain-specific test datasets for SMOLTRACE evaluations (supports up to 100 tasks with parallel batched generation)
67
- 9. **📤 push_dataset_to_hub**: Upload generated datasets to HuggingFace Hub
68
-
69
- ### 📦 **3 Data Resources**
70
- 1. **leaderboard data**: Direct JSON access to evaluation results
71
- 2. **trace data**: Raw OpenTelemetry trace data with spans
72
- 3. **cost data**: Model pricing and hardware cost information
73
-
74
- ### 📝 **3 Prompt Templates**
75
- 1. **analysis prompts**: Standardized templates for different analysis types
76
- 2. **debug prompts**: Templates for debugging scenarios
77
- 3. **optimization prompts**: Templates for optimization goals
78
-
79
- All analysis is powered by **Google Gemini 2.5 Flash** for intelligent, context-aware insights.
80
-
81
  ## 🔗 Quick Links
82
 
83
- - **Gradio UI**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
84
- - **MCP Endpoint (SSE - Recommended)**: `https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse`
85
- - **MCP Endpoint (Streamable HTTP)**: `https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/`
86
- - **Auto-Config**: Add `MCP-1st-Birthday/TraceMind-mcp-server` at https://huggingface.co/settings/mcp
87
-
88
- > 💡 **Tip**: Use the Auto-Config link above for the easiest setup! It generates the correct config for your MCP client automatically.
89
-
90
- ## 📱 Social Media & Demo
91
 
92
- **📢 Announcement Post**: [Coming Soon - X/LinkedIn post]
93
-
94
- **🎥 Demo Video**: [Coming Soon - YouTube/Loom link showing MCP server integration with Claude Desktop]
95
 
96
  ---
97
 
98
- ## Why This MCP Server?
99
-
100
- **Problem**: Agent evaluation generates massive amounts of data (leaderboards, traces, metrics), but developers struggle to:
101
- - Understand which models perform best for their use case
102
- - Debug why specific agent executions failed
103
- - Estimate costs before running expensive evaluations
104
-
105
- **Solution**: This MCP server provides AI-powered analysis tools that connect to HuggingFace datasets and deliver actionable insights in natural language.
106
-
107
- **Impact**: Developers can make informed decisions about agent configurations, debug issues faster, and optimize costs—all through a simple MCP interface.
108
-
109
- ## Features
110
-
111
- ### 🎯 Track 1 Compliance: Building MCP (Enterprise)
112
-
113
- - ✅ **Complete MCP Implementation**: Tools, Resources, AND Prompts
114
- - ✅ **MCP Standard Compliant**: Built with Gradio's native MCP support (`@gr.mcp.*` decorators)
115
- - ✅ **Production-Ready**: Deployable to HuggingFace Spaces with SSE transport
116
- - ✅ **Testing Interface**: Beautiful Gradio UI for testing all components
117
- - ✅ **Enterprise Focus**: Cost optimization, debugging, decision support, and custom dataset generation
118
- - ✅ **Google Gemini Powered**: Leverages Gemini 2.5 Flash for intelligent analysis
119
- - ✅ **17 Total Components**: 11 Tools + 3 Resources + 3 Prompts
120
-
121
- ### 🛠️ Eleven Production-Ready Tools
122
-
123
- #### 1. analyze_leaderboard
124
-
125
- Analyzes evaluation leaderboard data from HuggingFace datasets and provides:
126
- - Top performers by selected metric (accuracy, cost, latency, CO2)
127
- - Trade-off analysis (e.g., "GPT-4 is most accurate but Llama-3.1 is 25x cheaper")
128
- - Trend identification
129
- - Actionable recommendations
130
-
131
- **Example Use Case**: Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
132
-
133
- #### 2. debug_trace
134
-
135
- Analyzes OpenTelemetry trace data and answers specific questions like:
136
- - "Why was tool X called twice?"
137
- - "Which step took the most time?"
138
- - "Why did this test fail?"
139
-
140
- **Example Use Case**: When an agent test fails, understand exactly what happened without manually parsing trace spans.
141
-
142
- #### 3. estimate_cost
143
-
144
- Predicts costs before running evaluations:
145
- - LLM API costs (token-based)
146
- - HuggingFace Jobs compute costs
147
- - CO2 emissions estimate
148
- - Hardware recommendations
149
-
150
- **Example Use Case**: Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
151
-
152
- #### 4. compare_runs
153
-
154
- Compares two evaluation runs with AI-powered analysis across multiple dimensions:
155
- - Success rate comparison with statistical significance
156
- - Cost efficiency analysis (total cost, cost per test, cost per successful test)
157
- - Speed comparison (average duration, throughput)
158
- - Environmental impact (CO2 emissions per test)
159
- - GPU efficiency (for GPU jobs)
160
-
161
- **Focus Options**:
162
- - `comprehensive`: Complete comparison across all dimensions
163
- - `cost`: Detailed cost efficiency and ROI analysis
164
- - `performance`: Speed and accuracy trade-off analysis
165
- - `eco_friendly`: Environmental impact and carbon footprint comparison
166
-
167
- **Example Use Case**: After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment based on your priorities (accuracy, cost, speed, or environmental impact).
168
-
169
- #### 5. get_top_performers
170
-
171
- Get top performing models from leaderboard with optimized token usage.
172
-
173
- **⚡ Performance Optimization**: This tool returns only the top N models (5-20 runs) instead of loading the full leaderboard dataset (51 runs), resulting in **90% token reduction** compared to using `get_dataset()`.
174
-
175
- **When to Use**: Perfect for queries like:
176
- - "Which model is leading?"
177
- - "Show me the top 5 models"
178
- - "What's the best model for cost efficiency?"
179
-
180
- **Parameters**:
181
- - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
182
- - `metric` (str): Metric to rank by - "success_rate", "total_cost_usd", "avg_duration_ms", or "co2_emissions_g" (default: "success_rate")
183
- - `top_n` (int): Number of top models to return (range: 1-20, default: 5)
184
-
185
- **Returns**: Properly formatted JSON with:
186
- - Metric used for ranking
187
- - Ranking order (ascending/descending)
188
- - Total runs in leaderboard
189
- - Array of top performers with essential fields only (10 fields vs 20+ in full dataset)
190
-
191
- **Benefits**:
192
- - ✅ **Token Reduction**: Returns 5-20 runs instead of all 51 runs (90% fewer tokens)
193
- - ✅ **Ready to Use**: Properly formatted JSON (no parsing needed, no string conversion issues)
194
- - ✅ **Pre-Sorted**: Already sorted by your chosen metric
195
- - ✅ **Essential Data Only**: Includes only 10 essential columns to minimize token usage
196
-
197
- **Example Use Case**: An agent needs to quickly answer "What are the top 3 most cost-effective models?" without consuming excessive tokens by loading the entire leaderboard dataset.
198
-
199
- #### 6. get_leaderboard_summary
200
-
201
- Get high-level leaderboard statistics without loading individual runs.
202
-
203
- **⚡ Performance Optimization**: This tool returns only aggregated statistics instead of raw data, resulting in **99% token reduction** compared to using `get_dataset()` on the full leaderboard.
204
-
205
- **When to Use**: Perfect for overview queries like:
206
- - "How many runs are in the leaderboard?"
207
- - "What's the average success rate across all models?"
208
- - "Give me an overview of evaluation results"
209
-
210
- **Parameters**:
211
- - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
212
-
213
- **Returns**: Properly formatted JSON with:
214
- - Total runs count
215
- - Unique models and submitters count
216
- - Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
217
- - Breakdown by agent type (tool/code/both)
218
- - Breakdown by provider (litellm/transformers)
219
- - Top 3 models by success rate
220
-
221
- **Benefits**:
222
- - ✅ **Extreme Token Reduction**: Returns summary stats instead of 51 runs (99% fewer tokens)
223
- - ✅ **Ready to Use**: Properly formatted JSON (no parsing needed)
224
- - ✅ **Comprehensive Stats**: Includes averages, distributions, and breakdowns
225
- - ✅ **Quick Insights**: Perfect for "overview" and "summary" questions
226
-
227
- **Example Use Case**: An agent needs to provide a high-level overview of evaluation results without loading 51 individual runs and consuming 50K+ tokens.
228
-
229
- #### 7. get_dataset
230
-
231
- Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON:
232
- - Simple, flexible tool that returns complete dataset with metadata
233
- - Works with any dataset containing "smoltrace-" prefix
234
- - Returns total rows, columns list, and data array
235
- - Automatically sorts by timestamp if available
236
- - Configurable row limit (1-200) to manage token usage
237
-
238
- **⚠️ Important**: For leaderboard queries, **prefer using `get_top_performers()` or `get_leaderboard_summary()` instead** - they're specifically optimized to avoid token bloat!
239
-
240
- **Security Restriction**: Only datasets with "smoltrace-" in the repository name are allowed.
241
-
242
- **Primary Use Cases**:
243
- - Load `smoltrace-results-*` datasets to see individual test case details
244
- - Load `smoltrace-traces-*` datasets to access OpenTelemetry trace data
245
- - Load `smoltrace-metrics-*` datasets to get GPU performance data
246
- - For leaderboard queries: **Use `get_top_performers()` or `get_leaderboard_summary()` instead!**
247
-
248
- **Recommended Workflow**:
249
- 1. For overview: Use `get_leaderboard_summary()` (99% token reduction)
250
- 2. For top N queries: Use `get_top_performers()` (90% token reduction)
251
- 3. For specific run IDs: Use `get_dataset()` only when you need non-leaderboard datasets
252
-
253
- **Example Use Case**: When you need to load trace data or results data for a specific run, use `get_dataset("username/smoltrace-traces-gpt4")`. For leaderboard queries, use the optimized tools instead.
254
-
255
- #### 8. generate_synthetic_dataset
256
-
257
- Generates domain-specific synthetic test datasets for SMOLTRACE evaluations using Google Gemini 2.5 Flash:
258
- - AI-powered task generation tailored to your domain
259
- - Custom tool specifications
260
- - Configurable difficulty distribution (balanced, easy_only, medium_only, hard_only, progressive)
261
- - Target specific agent types (tool, code, or both)
262
- - Output follows SMOLTRACE task format exactly
263
- - Supports up to 100 tasks with parallel batched generation
264
-
265
- **SMOLTRACE Task Format**:
266
- Each generated task includes:
267
- ```json
268
- {
269
- "id": "unique_identifier",
270
- "prompt": "Clear, specific task for the agent",
271
- "expected_tool": "tool_name",
272
- "expected_tool_calls": 1,
273
- "difficulty": "easy|medium|hard",
274
- "agent_type": "tool|code",
275
- "expected_keywords": ["keyword1", "keyword2"]
276
- }
277
- ```
278
-
279
- **Enterprise Use Cases**:
280
- - **Custom Tools**: Create benchmarks for your proprietary APIs and tools
281
- - **Industry-Specific**: Generate tasks for finance, healthcare, legal, manufacturing, etc.
282
- - **Internal Workflows**: Test agents on company-specific processes
283
- - **Rapid Prototyping**: Quickly create evaluation datasets without manual curation
284
-
285
- **Difficulty Calibration**:
286
- - **Easy** (40%): Single tool call, straightforward input, clear expected output
287
- - **Medium** (40%): Multiple tool calls OR complex input parsing OR conditional logic
288
- - **Hard** (20%): Multiple tools, complex reasoning, edge cases, error handling
289
-
290
- **Output Includes**:
291
- - `dataset_info`: Metadata (domain, tools, counts, timestamp)
292
- - `tasks`: Ready-to-use SMOLTRACE task array
293
- - `usage_instructions`: Step-by-step guide for HuggingFace upload and SMOLTRACE usage
294
-
295
- **Example Use Case**: A financial services company wants to evaluate their customer service agent that uses custom tools for stock quotes, portfolio analysis, and transaction processing. They use this tool to generate 50 realistic tasks covering common customer inquiries across different difficulty levels, then run SMOLTRACE evaluations to benchmark different LLM models before deployment.
296
-
297
- #### 9. push_dataset_to_hub
298
-
299
- Upload generated datasets to HuggingFace Hub with proper formatting and metadata:
300
- - Automatically formats data for HuggingFace datasets library
301
- - Handles authentication via HF_TOKEN
302
- - Validates dataset structure before upload
303
- - Supports both public and private datasets
304
- - Adds comprehensive metadata (description, tags, license)
305
- - Creates dataset card with usage instructions
306
-
307
- **Parameters**:
308
- - `dataset_name`: Repository name on HuggingFace (e.g., "username/my-dataset")
309
- - `data`: Dataset content (list of dictionaries or JSON string)
310
- - `description`: Dataset description for the card
311
- - `private`: Whether to make the dataset private (default: False)
312
-
313
- **Example Workflow**:
314
- 1. Generate synthetic dataset with `generate_synthetic_dataset`
315
- 2. Review and modify tasks if needed
316
- 3. Upload to HuggingFace with `push_dataset_to_hub`
317
- 4. Use in SMOLTRACE evaluations or share with team
318
-
319
- **Example Use Case**: After generating a custom evaluation dataset for your domain, upload it to HuggingFace to share with your team, version control your benchmarks, or make it publicly available for the community.
320
-
321
-
322
- ## MCP Resources Usage
323
-
324
- Resources provide direct data access without AI analysis:
325
-
326
- ```python
327
- # Access leaderboard data
328
- GET leaderboard://kshitijthakkar/smoltrace-leaderboard
329
- # Returns: JSON with all evaluation runs
330
-
331
- # Access specific trace
332
- GET trace://trace_abc123/username/agent-traces-gpt4
333
- # Returns: JSON with trace spans and attributes
334
-
335
- # Get model cost information
336
- GET cost://model/openai/gpt-4
337
- # Returns: JSON with pricing and hardware costs
338
- ```
339
-
340
- ## MCP Prompts Usage
341
-
342
- Prompts provide reusable templates for standardized interactions:
343
-
344
- ```python
345
- # Get analysis prompt template
346
- analysis_prompt(analysis_type="leaderboard", focus_area="cost", detail_level="detailed")
347
- # Returns: "Provide a detailed analysis. Analyze cost efficiency in the leaderboard..."
348
-
349
- # Get debug prompt template
350
- debug_prompt(debug_type="performance", context="tool_calling")
351
- # Returns: "Analyze tool calling performance. Identify which tools are slow..."
352
-
353
- # Get optimization prompt template
354
- optimization_prompt(optimization_goal="cost", constraints="maintain_quality")
355
- # Returns: "Analyze this evaluation setup and recommend cost optimizations..."
356
- ```
357
-
358
- Use these prompts when interacting with the tools to get consistent, high-quality analysis.
359
-
360
- ## Quick Start
361
-
362
- ### 1. Installation
363
-
364
- ```bash
365
- git clone https://github.com/Mandark-droid/TraceMind-mcp-server.git
366
- cd TraceMind-mcp-server
367
-
368
- # Create virtual environment
369
- python -m venv venv
370
- source venv/bin/activate # On Windows: venv\Scripts\activate
371
-
372
- # Install dependencies (note: gradio[mcp] includes MCP support)
373
- pip install -r requirements.txt
374
- ```
375
-
376
- ### 2. Environment Setup
377
-
378
- Create `.env` file:
379
-
380
- ```bash
381
- cp .env.example .env
382
- # Edit .env and add your API keys
383
- ```
384
-
385
- Get your keys:
386
- - **Gemini API Key**: https://ai.google.dev/
387
- - **HuggingFace Token**: https://huggingface.co/settings/tokens
388
-
389
- ### 3. Run Locally
390
-
391
- ```bash
392
- python app.py
393
- ```
394
 
395
- Open http://localhost:7860 to test the tools via Gradio interface.
396
 
397
- ### 4. Test with Live Data
398
-
399
- Try the live example with real HuggingFace dataset:
400
-
401
- **In the Gradio UI, Tab "📊 Analyze Leaderboard":**
402
 
403
  ```
404
- Leaderboard Repository: kshitijthakkar/smoltrace-leaderboard
405
- Metric Focus: overall
406
- Time Range: last_week
407
- Top N Models: 5
 
 
 
 
 
 
 
 
 
 
 
 
408
  ```
409
 
410
- Click "🔍 Analyze" and get AI-powered insights from live data!
411
-
412
- ## 🎯 For Hackathon Judges & Visitors
413
-
414
- ### Using Your Own API Keys (Recommended)
415
-
416
- This MCP server has pre-configured API keys in HuggingFace Spaces Secrets for quick testing. However, **to prevent credit issues during evaluation**, we strongly recommend using your own API keys:
417
-
418
- #### Option 1: Configure in MCP Server UI (Simplest)
419
-
420
- 1. **Open the MCP Server Space**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
421
- 2. Navigate to the **⚙️ Settings** tab
422
- 3. Enter your own **Gemini API Key** and **HuggingFace Token**
423
- 4. Click **"Save & Override Keys"**
424
- 5. ✅ Your keys will be used for all MCP tool calls in this session
425
-
426
- **Then you can**:
427
- - Use any tool in the tabs above
428
- - Connect from TraceMind-AI (it will automatically use your keys configured here)
429
- - Test with Claude Desktop (will use your keys)
430
-
431
- #### Option 2: For TraceMind-AI Integration
432
-
433
- If you're testing the complete TraceMind platform (Track 2 - MCP in Action):
434
-
435
- 1. **Configure MCP Server** (as described above)
436
- 2. **Open TraceMind-AI**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
437
- 3. Navigate to **⚙️ Settings** in TraceMind-AI
438
- 4. Enter your API keys there as well
439
- 5. ✅ Both apps will use your keys
440
 
441
- ### Why Two Settings Screens?
 
442
 
443
- - **TraceMind-AI** (Track 2) is the user-facing UI - calls MCP server for intelligent analysis
444
- - **TraceMind MCP Server** (Track 1) is the backend service - provides MCP tools
445
- - They run in **separate browser sessions** → need separate configuration
446
- - Configuring both ensures your keys are used throughout the evaluation flow
447
 
448
- ### Getting Free API Keys
449
 
450
- Both APIs have generous free tiers perfect for hackathon evaluation:
 
 
451
 
452
- **Google Gemini API Key**:
453
- - Go to https://ai.google.dev/
454
- - Click "Get API Key" Create project → Generate key
455
- - **Free tier**: 1,500 requests/day
456
 
457
- **HuggingFace Token**:
458
- - Go to https://huggingface.co/settings/tokens
459
- - Click "New token" → Name it (e.g., "TraceMind Access")
460
- - **Permissions**:
461
- - Select "Read" for viewing datasets (sufficient for most tools)
462
- - Select "Write" if you want to use `push_dataset_to_hub` tool to upload synthetic datasets
463
- - **Recommended**: Use "Write" permissions for full functionality
464
- - No rate limits for public dataset access
465
 
466
- ### Default Configuration (If You Don't Configure)
467
 
468
- If you don't configure your own keys, the MCP server will use our pre-configured keys from HuggingFace Spaces Secrets. This is fine for quick testing, but please note:
469
- - Uses our API credits
470
- - May hit rate limits during high traffic
471
- - Recommended only for brief testing
472
 
473
- ## MCP Integration
 
 
 
 
 
474
 
475
- ### How It Works
 
 
476
 
477
- This Gradio app uses `mcp_server=True` in the launch configuration, which automatically:
478
- - Exposes all async functions with proper docstrings as MCP tools
479
- - Handles MCP protocol communication
480
- - Provides MCP interfaces via:
481
- - **Streamable HTTP** (recommended) - Modern streaming protocol
482
- - **SSE** (deprecated) - Server-Sent Events for legacy compatibility
483
 
484
- ### Connecting from MCP Clients
485
 
486
- Once deployed to HuggingFace Spaces, your MCP server will be available at:
 
 
 
487
 
488
- **🎯 MCP Endpoint (SSE - Recommended)**:
489
- ```
490
- https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
491
- ```
492
 
493
- **MCP Endpoint (Streamable HTTP)**:
494
- ```
495
- https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/
496
- ```
497
-
498
- **Note**: Both SSE and Streamable HTTP endpoints are fully supported. The SSE endpoint is recommended for most MCP clients.
499
 
500
- ### Easiest Way to Connect
501
 
502
- **Recommended for all users** - HuggingFace provides an automatic configuration generator:
503
 
504
- 1. **Visit**: https://huggingface.co/settings/mcp (while logged in)
505
- 2. **Add Space**: Enter `MCP-1st-Birthday/TraceMind-mcp-server`
506
- 3. **Select Client**: Choose Claude Desktop, VSCode, Cursor, etc.
507
- 4. **Copy Config**: Get the auto-generated configuration snippet
508
- 5. **Paste & Restart**: Add to your client's config file and restart
509
 
510
- This automatically configures the correct endpoint URL and transport method for your chosen client!
511
 
512
- ### 🔧 Manual Configuration (Advanced)
 
 
 
 
513
 
514
- If you prefer to manually configure your MCP client:
515
 
516
- **Claude Desktop (`claude_desktop_config.json`)**:
517
  ```json
518
  {
519
  "mcpServers": {
@@ -525,7 +171,7 @@ If you prefer to manually configure your MCP client:
525
  }
526
  ```
527
 
528
- **VSCode / Cursor (`settings.json` or `.cursor/mcp.json`)**:
529
  ```json
530
  {
531
  "mcp.servers": {
@@ -537,375 +183,145 @@ If you prefer to manually configure your MCP client:
537
  }
538
  ```
539
 
540
- **Cline / Other MCP Clients**:
541
- - **URL**: `https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse`
542
- - **Transport**: `sse` (or use streamable HTTP endpoint with `streamable-http` transport)
543
-
544
- ### ❓ Connection FAQ
545
-
546
- **Q: Which endpoint should I use?**
547
- A: Use the **Streamable HTTP endpoint** (`/gradio_api/mcp/`) for all new connections. It's the modern, recommended protocol.
548
-
549
- **Q: My client only supports SSE. What should I do?**
550
- A: Use the SSE endpoint (`/gradio_api/mcp/sse`) for now, but note that it's deprecated. Consider upgrading your client if possible.
551
-
552
- **Q: What's the difference between the two transports?**
553
- A: Streamable HTTP is the newer, more efficient protocol with better error handling and performance. SSE is the legacy protocol being phased out.
554
-
555
- **Q: How do I test if my connection works?**
556
- A: After configuring your client, restart it and look for "tracemind" in your available MCP tools/servers. You should see 7 tools, 3 resources, and 3 prompts.
557
-
558
- **Q: Can I use this MCP server without authentication?**
559
- A: The MCP endpoint is publicly accessible. However, the tools may require HuggingFace datasets to be public or accessible with your HF token (configured server-side).
560
-
561
- ### Available MCP Components
562
-
563
- **Tools** (9):
564
- 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Flash
565
- 2. **debug_trace**: Trace debugging with AI insights
566
- 3. **estimate_cost**: Cost estimation with optimization recommendations
567
- 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
568
- 5. **get_top_performers**: Get top N models from leaderboard (optimized, 90% token reduction)
569
- 6. **get_leaderboard_summary**: Get leaderboard statistics (optimized, 99% token reduction)
570
- 7. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
571
- 8. **generate_synthetic_dataset**: Create domain-specific test datasets with AI
572
- 9. **push_dataset_to_hub**: Upload datasets to HuggingFace Hub
573
-
574
- **Resources** (3):
575
- 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
576
- 2. **trace://{trace_id}/{repo}**: Direct access to trace data with spans
577
- 3. **cost://model/{model_name}**: Model pricing and hardware cost information
578
-
579
- **Prompts** (3):
580
- 1. **analysis_prompt**: Reusable templates for different analysis types
581
- 2. **debug_prompt**: Reusable templates for debugging scenarios
582
- 3. **optimization_prompt**: Reusable templates for optimization goals
583
-
584
- See full API documentation in the Gradio interface under "📖 API Documentation" tab.
585
-
586
- ## Architecture
587
 
 
588
  ```
589
- TraceMind-mcp-server/
590
- ├── app.py # Gradio UI + MCP server (mcp_server=True)
591
- ├── gemini_client.py # Google Gemini 2.5 Flash integration
592
- ├── mcp_tools.py # 7 tool implementations
593
- ├── requirements.txt # Python dependencies
594
- ├── .env.example # Environment variable template
595
- ├── .gitignore
596
- └── README.md
597
  ```
598
 
599
- **Key Technologies**:
600
- - **Gradio 6 with MCP support**: `gradio[mcp]` provides native MCP server capabilities
601
- - **Google Gemini 2.5 Flash**: Latest AI model for intelligent analysis
602
- - **HuggingFace Datasets**: Data source for evaluations
603
- - **Streamable HTTP Transport**: Modern streaming protocol for MCP communication (recommended)
604
- - **SSE Transport**: Server-Sent Events for legacy MCP compatibility (deprecated)
605
-
606
- ## Deploy to HuggingFace Spaces
607
-
608
- ### 1. Create Space
609
-
610
- Go to https://huggingface.co/new-space
611
-
612
- - **Space name**: `TraceMind-mcp-server`
613
- - **License**: AGPL-3.0
614
- - **SDK**: Gradio
615
- - **Hardware**: CPU Basic (free tier works fine)
616
-
617
- ### 2. Add Files
618
-
619
- Upload all files from this repository to your Space:
620
- - `app.py`
621
- - `gemini_client.py`
622
- - `mcp_tools.py`
623
- - `requirements.txt`
624
- - `README.md`
625
 
626
- ### 3. Add Secrets
627
 
628
- In Space settings → Variables and secrets, add:
629
- - `GEMINI_API_KEY`: Your Gemini API key
630
- - `HF_TOKEN`: Your HuggingFace token
631
-
632
- ### 4. Add Hackathon Tag
633
-
634
- In Space settings → Tags, add:
635
- - `building-mcp-track-enterprise`
636
-
637
- ### 5. Access Your MCP Server
638
-
639
- Your MCP server will be publicly available at:
640
-
641
- **Gradio UI**:
642
- ```
643
- https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
644
- ```
645
-
646
- **MCP Endpoint (SSE - Recommended)**:
647
- ```
648
- https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
649
- ```
650
-
651
- **MCP Endpoint (Streamable HTTP)**:
652
- ```
653
- https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/
654
- ```
655
-
656
- Use the **Easiest Way to Connect** section above to configure your MCP client automatically!
657
-
658
- ## Testing
659
-
660
- ### Test 1: Analyze Leaderboard (Live Data)
661
-
662
- ```bash
663
- # In Gradio UI - Tab "📊 Analyze Leaderboard":
664
- Repository: kshitijthakkar/smoltrace-leaderboard
665
- Metric: overall
666
- Time Range: last_week
667
- Top N: 5
668
- Click "🔍 Analyze"
669
- ```
670
-
671
- **Expected**: AI-generated analysis of top performing models from live HuggingFace dataset
672
-
673
- ### Test 2: Estimate Cost
674
-
675
- ```bash
676
- # In Gradio UI - Tab "💰 Estimate Cost":
677
- Model: openai/gpt-4
678
- Agent Type: both
679
- Number of Tests: 100
680
- Hardware: auto
681
- Click "💰 Estimate"
682
- ```
683
-
684
- **Expected**: Cost breakdown with LLM costs, HF Jobs costs, duration, and CO2 estimate
685
-
686
- ### Test 3: Debug Trace
687
-
688
- Note: This requires actual trace data from an evaluation run. For testing purposes, this will show an error about missing data, which is expected behavior.
689
-
690
- ## Hackathon Submission
691
-
692
- ### Track 1: Building MCP (Enterprise)
693
-
694
- **Tag**: `building-mcp-track-enterprise`
695
 
696
- **Why Enterprise Track?**
697
- - Solves real business problems (cost optimization, debugging, decision support)
698
- - Production-ready tools with clear ROI
699
- - Integrates with enterprise data infrastructure (HuggingFace datasets)
700
 
701
- **Technology Stack**
702
- - **AI Analysis**: Google Gemini 2.5 Flash for all intelligent insights
703
- - **MCP Framework**: Gradio 6 with native MCP support
704
- - **Data Source**: HuggingFace Datasets
705
- - **Transport**: Streamable HTTP (recommended) and SSE (deprecated)
706
 
707
- ## Related Project: TraceMind-AI (Track 2)
708
 
709
- This MCP server is designed to be consumed by **[TraceMind-AI](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind)** (separate submission for Track 2: MCP in Action).
710
 
711
- **Links**:
712
- - **Live Demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
713
- - **GitHub**: https://github.com/Mandark-droid/TraceMind-AI
 
 
 
714
 
715
- TraceMind-AI is a Gradio-based agent evaluation platform that uses these MCP tools to provide:
716
- - AI-powered leaderboard insights with autonomous agent chat
717
- - Interactive trace debugging with MCP-powered Q&A
718
- - Real-time cost estimation and comparison
719
- - Complete evaluation workflow visualization
720
 
721
- ## File Descriptions
 
 
 
 
722
 
723
- ### app.py
724
- Main Gradio application with:
725
- - Testing UI for all 7 tools
726
- - MCP server enabled via `mcp_server=True`
727
- - API documentation
728
 
729
- ### gemini_client.py
730
- Google Gemini 2.5 Flash client that:
731
- - Handles API authentication
732
- - Provides specialized analysis methods for different data types
733
- - Formats prompts for optimal results
734
- - Uses `gemini-2.5-pro-latest` model (can switch to `gemini-2.5-flash-latest`)
735
 
736
- ### mcp_tools.py
737
- Complete MCP implementation with 13 components:
738
 
739
- **Tools** (9 async functions):
740
- - `analyze_leaderboard()`: AI-powered leaderboard analysis
741
- - `debug_trace()`: AI-powered trace debugging
742
- - `estimate_cost()`: AI-powered cost estimation
743
- - `compare_runs()`: AI-powered run comparison
744
- - `get_top_performers()`: Optimized tool to get top N models (90% token reduction)
745
- - `get_leaderboard_summary()`: Optimized tool for leaderboard statistics (99% token reduction)
746
- - `get_dataset()`: Load SMOLTRACE datasets as JSON (use optimized tools for leaderboard!)
747
- - `generate_synthetic_dataset()`: Create domain-specific test datasets with AI
748
- - `push_dataset_to_hub()`: Upload datasets to HuggingFace Hub
 
 
 
 
 
 
 
 
 
 
 
 
749
 
750
- **Resources** (3 decorated functions with `@gr.mcp.resource()`):
751
- - `get_leaderboard_data()`: Raw leaderboard JSON data
752
- - `get_trace_data()`: Raw trace JSON data with spans
753
- - `get_cost_data()`: Model pricing and hardware cost JSON
754
 
755
- **Prompts** (3 decorated functions with `@gr.mcp.prompt()`):
756
- - `analysis_prompt()`: Templates for different analysis types
757
- - `debug_prompt()`: Templates for debugging scenarios
758
- - `optimization_prompt()`: Templates for optimization goals
759
 
760
- Each function includes:
761
- - Appropriate decorator (`@gr.mcp.tool()`, `@gr.mcp.resource()`, or `@gr.mcp.prompt()`)
762
- - Detailed docstring with "Args:" section
763
- - Type hints for all parameters and return values
764
- - Descriptive function name (becomes the MCP component name)
765
 
766
- ## Environment Variables
767
 
768
- Required environment variables:
769
 
770
  ```bash
771
- GEMINI_API_KEY=your_gemini_api_key_here
772
- HF_TOKEN=your_huggingface_token_here
773
- ```
774
-
775
- ## Development
776
-
777
- ### Running Tests
778
 
779
- ```bash
780
- # Test Gemini client
781
- python -c "from gemini_client import GeminiClient; client = GeminiClient(); print('✅ Gemini client initialized')"
782
 
783
- # Test with live leaderboard data
784
  python app.py
785
- # Open browser, test "Analyze Leaderboard" tab
786
- ```
787
-
788
- ### Adding New Tools
789
-
790
- To add a new MCP tool (with Gradio's native MCP support):
791
-
792
- 1. **Add function to `mcp_tools.py`** with proper docstring:
793
- ```python
794
- async def your_new_tool(
795
- gemini_client: GeminiClient,
796
- param1: str,
797
- param2: int = 10
798
- ) -> str:
799
- """
800
- Brief description of what the tool does.
801
-
802
- Longer description explaining the tool's purpose and behavior.
803
-
804
- Args:
805
- gemini_client (GeminiClient): Initialized Gemini client for AI analysis
806
- param1 (str): Description of param1 with examples if helpful
807
- param2 (int): Description of param2. Default: 10
808
-
809
- Returns:
810
- str: Description of what the function returns
811
- """
812
- # Your implementation
813
- return result
814
- ```
815
-
816
- 2. **Add UI tab in `app.py`** (optional, for testing):
817
- ```python
818
- with gr.Tab("Your Tool"):
819
- # Add UI components
820
- # Wire up to your_new_tool()
821
  ```
822
 
823
- 3. That's it! Gradio automatically exposes it as an MCP tool based on:
824
- - Function name (becomes tool name)
825
- - Docstring (becomes tool description)
826
- - Args section (becomes parameter descriptions)
827
- - Type hints (become parameter types)
828
-
829
- ### Switching to Gemini 2.5 Flash
830
-
831
- For faster (but slightly less capable) responses, switch to Gemini 2.5 Flash:
832
-
833
- ```python
834
- # In app.py, change:
835
- gemini_client = GeminiClient(model_name="gemini-2.5-flash-latest")
836
- ```
837
-
838
- ## 🙏 Credits & Acknowledgments
839
-
840
- ### Hackathon Sponsors
841
 
842
- Special thanks to the sponsors of **MCP's 1st Birthday Hackathon** (November 14-30, 2025):
843
-
844
- - **🤗 HuggingFace** - Hosting platform and dataset infrastructure
845
- - **🧠 Google Gemini** - AI analysis powered by Gemini 2.5 Flash API
846
- - **⚡ Modal** - Serverless infrastructure partner
847
- - **🏢 Anthropic** - MCP protocol creators
848
- - **🎨 Gradio** - Native MCP framework support
849
- - **🎙️ ElevenLabs** - Audio AI capabilities
850
- - **🦙 SambaNova** - High-performance AI infrastructure
851
- - **🎯 Blaxel** - Additional compute credits
852
 
853
- ### Related Open Source Projects
854
 
855
- This MCP server builds upon our open source agent evaluation ecosystem:
 
 
 
856
 
857
- #### 📊 SMOLTRACE - Agent Evaluation Engine
858
- - **Description**: Lightweight, production-ready evaluation framework for AI agents with OpenTelemetry instrumentation
859
- - **GitHub**: [https://github.com/Mandark-droid/SMOLTRACE](https://github.com/Mandark-droid/SMOLTRACE)
860
- - **PyPI**: [https://pypi.org/project/smoltrace/](https://pypi.org/project/smoltrace/)
861
 
862
- #### 🔭 TraceVerde - GenAI OpenTelemetry Instrumentation
863
- - **Description**: Automatic OpenTelemetry instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
864
- - **GitHub**: [https://github.com/Mandark-droid/genai_otel_instrument](https://github.com/Mandark-droid/genai_otel_instrument)
865
- - **PyPI**: [https://pypi.org/project/genai-otel-instrument](https://pypi.org/project/genai-otel-instrument)
866
 
867
- ### Built By
868
 
 
869
  **Track**: Building MCP (Enterprise)
870
  **Author**: Kshitij Thakkar
871
  **Powered by**: Google Gemini 2.5 Flash
872
  **Built with**: Gradio (native MCP support)
873
 
874
- ---
875
 
876
- ## 📄 License
877
 
878
- AGPL-3.0 License
879
 
880
- This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
881
 
882
  ---
883
 
884
- ## 💬 Support
885
-
886
- For issues or questions:
887
- - 📧 Open an issue on GitHub
888
- - 💬 Join the [HuggingFace Discord](https://discord.gg/huggingface) - Channel: `#agents-mcp-hackathon-winter25`
889
- - 🏷️ Tag `building-mcp-track-enterprise` for hackathon-related questions
890
- - 🐦 Follow us on X: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)
891
-
892
- ## Changelog
893
-
894
- ### v1.0.0 (2025-11-14)
895
- - Initial release for MCP Hackathon
896
- - **Complete MCP Implementation**: 17 components total
897
- - 11 AI-powered and optimized tools:
898
- - analyze_leaderboard, debug_trace, estimate_cost, compare_runs, analyze_results (AI-powered analysis)
899
- - get_top_performers, get_leaderboard_summary (optimized for token reduction)
900
- - get_dataset, generate_synthetic_dataset, generate_prompt_template, push_dataset_to_hub (data management)
901
- - 3 data resources (leaderboard, trace, cost data)
902
- - 3 prompt templates (analysis, debug, optimization)
903
- - Gradio native MCP support with decorators (`@gr.mcp.*`)
904
- - Google Gemini 2.5 Flash integration for all AI analysis
905
- - Live HuggingFace dataset integration
906
- - **Performance Optimizations**:
907
- - get_top_performers: 90% token reduction vs full leaderboard
908
- - get_leaderboard_summary: 99% token reduction vs full leaderboard
909
- - Proper JSON serialization (no string conversion issues)
910
- - SSE transport for MCP communication
911
- - Production-ready for HuggingFace Spaces deployment
 
23
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-mcp-server/assets/Logo.png" alt="TraceMind MCP Server Logo" width="200"/>
24
  </p>
25
 
26
+ **AI-Powered Analysis Tools for Agent Evaluation**
27
 
28
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
29
+ [![Track 1: Building MCP](https://img.shields.io/badge/Track-Building%20MCP%20(Enterprise)-blue)](https://github.com/modelcontextprotocol/hackathon)
30
+ [![Powered by Google Gemini](https://img.shields.io/badge/Powered%20by-Google%20Gemini%202.5%20Flash-orange)](https://ai.google.dev/)
 
31
 
32
  > **🎯 Track 1 Submission**: Building MCP (Enterprise)
33
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
34
 
35
+ ---
 
 
 
 
36
 
37
+ ## Why This MCP Server?
38
 
39
+ **Problem**: Agent evaluation generates mountains of data—leaderboards, traces, metrics—but developers struggle to extract actionable insights.
 
 
 
40
 
41
+ **Solution**: This MCP server provides **11 AI-powered tools** that transform raw evaluation data into clear answers:
42
+ - *"Which model is best for my use case?"*
43
+ - *"Why did this agent execution fail?"*
44
+ - *"How much will this evaluation cost?"*
45
 
46
+ **Powered by Google Gemini 2.5 Flash** for intelligent, context-aware analysis of agent performance data.
47
 
48
  ---
49
 
 
50
  ## 🔗 Quick Links
51
 
52
+ - **🌐 Live Demo**: [TraceMind-mcp-server Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)
53
+ - **⚡ Auto-Config**: Add `MCP-1st-Birthday/TraceMind-mcp-server` at https://huggingface.co/settings/mcp
54
+ - **📖 Full Docs**: See [DOCUMENTATION.md](DOCUMENTATION.md) for complete technical reference
55
+ - **🎬 Quick Demo (5 min)**: [Watch on Loom](https://www.loom.com/share/d4d0003f06fa4327b46ba5c081bdf835)
56
+ - **📺 Full Demo (20 min)**: [Watch on Loom](https://www.loom.com/share/de559bb0aef749559c79117b7f951250)
 
 
 
57
 
58
+ **MCP Endpoints**:
59
+ - SSE (Recommended): `https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse`
60
+ - Streamable HTTP: `https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/`
61
 
62
  ---
63
 
64
+ ## The TraceMind Ecosystem
 
 
65
 
66
+ This MCP server is part of a **complete agent evaluation platform** built from four interconnected projects:
67
 
68
+ <p align="center">
69
+ <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
70
+ </p>
 
 
71
 
72
  ```
73
+ 🔭 TraceVerde                        📊 SMOLTRACE
+ (genai_otel_instrument)              (Evaluation Engine)
+             ↓                                 ↓
+        Instruments                       Evaluates
+         LLM calls                          agents
+             ↓                                 ↓
+             └───────────────┬─────────────────┘
+                             ↓
+                    Generates Datasets
+             (leaderboard, traces, metrics)
+                             ↓
+             ┌───────────────┴─────────────────┐
+             ↓                                 ↓
+ 🛠️ TraceMind MCP Server              🧠 TraceMind-AI
+ (This Project - Track 1)             (UI Platform - Track 2)
+     Analyzes with AI                 Visualizes & Interacts
89
  ```
90
 
91
+ ### The Foundation
 
 
 
92
 
93
+ **🔭 TraceVerde** - Zero-code OpenTelemetry instrumentation for LLM frameworks
94
+ → [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
95
 
96
+ **📊 SMOLTRACE** - Lightweight evaluation engine that generates structured datasets
97
+ → [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
 
 
98
 
99
+ ### The Platform
100
 
101
+ **🛠️ TraceMind MCP Server** (This Project) - Provides MCP tools for AI-powered analysis
102
+ → **Track 1**: Building MCP (Enterprise)
103
+ → [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) | [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server)
104
 
105
+ **🧠 TraceMind-AI** - Gradio UI that consumes MCP tools for interactive evaluation
106
+ → [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) | [GitHub](https://github.com/Mandark-droid/TraceMind-AI)
107
+ → **Track 2**: MCP in Action (Enterprise)
 
108
 
109
+ ---
 
 
 
 
 
 
 
110
 
111
+ ## What's Included
112
 
113
+ ### 11 AI-Powered Tools
 
 
 
114
 
115
+ **Core Analysis** (AI-Powered by Gemini 2.5 Flash):
116
+ 1. **📊 analyze_leaderboard** - Generate insights from evaluation data
117
+ 2. **🐛 debug_trace** - Debug agent execution traces with AI assistance
118
+ 3. **💰 estimate_cost** - Predict costs before running evaluations
119
+ 4. **⚖️ compare_runs** - Compare two evaluation runs with AI analysis
120
+ 5. **📋 analyze_results** - Analyze detailed test results with optimization recommendations
121
 
122
+ **Token-Optimized Tools**:
123
+ 6. **🏆 get_top_performers** - Get top N models (90% token reduction vs. full dataset)
124
+ 7. **📈 get_leaderboard_summary** - High-level statistics (99% token reduction)
125
 
126
+ **Data Management**:
127
+ 8. **📦 get_dataset** - Load SMOLTRACE datasets as JSON
128
+ 9. **🧪 generate_synthetic_dataset** - Create domain-specific test datasets with AI (up to 100 tasks)
129
+ 10. **📤 push_dataset_to_hub** - Upload datasets to HuggingFace
130
+ 11. **📝 generate_prompt_template** - Generate customized smolagents prompt templates
 
131
 
132
+ ### 3 Data Resources
133
 
134
+ Direct JSON access without AI analysis:
135
+ - **leaderboard://{repo}** - Raw evaluation results
136
+ - **trace://{trace_id}/{repo}** - OpenTelemetry spans
137
+ - **cost://model/{model}** - Pricing information
138
 
139
+ ### 3 Prompt Templates
 
 
 
140
 
141
+ Standardized templates for consistent analysis:
142
+ - **analysis_prompt** - Different analysis types (leaderboard, cost, performance)
143
+ - **debug_prompt** - Debugging scenarios
144
+ - **optimization_prompt** - Optimization goals
 
 
145
 
146
+ **Total: 17 MCP Components** (11 + 3 + 3)
147
 
148
+ ---
149
 
150
+ ## Quick Start
 
 
 
 
151
 
152
+ ### 1. Connect to the Live Server
153
 
154
+ **Easiest Method** (Recommended):
155
+ 1. Visit https://huggingface.co/settings/mcp (while logged in)
156
+ 2. Add Space: `MCP-1st-Birthday/TraceMind-mcp-server`
157
+ 3. Select your MCP client (Claude Desktop, VSCode, Cursor, etc.)
158
+ 4. Copy the auto-generated config and paste into your client
159
 
160
+ **Manual Configuration** (Advanced):
161
 
162
+ For Claude Desktop (`claude_desktop_config.json`):
163
  ```json
164
  {
165
  "mcpServers": {
 
171
  }
172
  ```
173
 
174
+ For VSCode/Cursor (`settings.json`):
175
  ```json
176
  {
177
  "mcp.servers": {
 
183
  }
184
  ```
185
 
186
+ ### 2. Try It Out
 
 
 
187
 
188
+ Open your MCP client and try:
189
  ```
190
+ "Analyze the leaderboard at kshitijthakkar/smoltrace-leaderboard and show me the top 5 models"
 
191
  ```
192
 
193
+ You should see AI-powered insights generated by Gemini 2.5 Flash!
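+
+ Prefer scripting over chat? The same query can be made programmatically. A minimal sketch, assuming the official `mcp` Python SDK (`pip install mcp`) and the SSE endpoint above; tool names and parameters follow the documentation, though the exact registered names depend on how Gradio exposes them:
+
+ ```python
+ import asyncio
+
+ from mcp import ClientSession
+ from mcp.client.sse import sse_client
+
+ SSE_URL = "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
+
+ async def main() -> None:
+     # Open an SSE connection and an MCP session against the public server
+     async with sse_client(SSE_URL) as (read_stream, write_stream):
+         async with ClientSession(read_stream, write_stream) as session:
+             await session.initialize()
+
+             # Discover the available tools
+             tools = await session.list_tools()
+             print("Tools:", [tool.name for tool in tools.tools])
+
+             # Ask for the top 5 models, ranked by success rate
+             result = await session.call_tool(
+                 "get_top_performers",
+                 {
+                     "leaderboard_repo": "kshitijthakkar/smoltrace-leaderboard",
+                     "metric": "success_rate",
+                     "top_n": 5,
+                 },
+             )
+             print(result.content[0].text)
+
+ asyncio.run(main())
+ ```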
 
 
 
194
 
195
+ ### 3. Using Your Own API Keys (Recommended)
196
 
197
+ To avoid rate limits during evaluation:
198
+ 1. Visit the [MCP Server Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)
199
+ 2. Go to **⚙️ Settings** tab
200
+ 3. Enter your **Gemini API Key** and **HuggingFace Token**
201
+ 4. Click **"Save & Override Keys"**
 
 
 
202
 
203
+ **Get Free API Keys**:
204
+ - **Gemini**: https://ai.google.dev/ (1,500 requests/day free)
205
+ - **HuggingFace**: https://huggingface.co/settings/tokens (unlimited for public datasets)
 
206
 
207
+ ---
 
 
 
 
208
 
209
+ ## For Hackathon Judges
210
 
211
+ ### Track 1 Compliance
212
 
213
+ - **Complete MCP Implementation**: 11 Tools + 3 Resources + 3 Prompts (17 total)
214
+ - **MCP Standard Compliant**: Built with Gradio's native `@gr.mcp.*` decorators
215
+ - **Production-Ready**: Deployed to HuggingFace Spaces with SSE transport
216
+ - **Enterprise Focus**: Cost optimization, debugging, decision support
217
+ - **Google Gemini Powered**: All AI analysis uses Gemini 2.5 Flash
218
+ - **Interactive Testing**: Beautiful Gradio UI for testing all components
219
 
220
+ ### 🎯 Key Innovations
 
 
 
 
221
 
222
+ 1. **Token Optimization**: `get_top_performers` and `get_leaderboard_summary` reduce token usage by 90-99%
223
+ 2. **AI-Powered Synthetic Data**: Generate domain-specific test datasets + matching prompt templates
224
+ 3. **Complete Ecosystem**: Part of 4-project platform with TraceVerde → SMOLTRACE → MCP Server → TraceMind-AI
225
+ 4. **Real Data Integration**: Works with live HuggingFace datasets from SMOLTRACE evaluations
226
+ 5. **Test Results Analysis**: Deep-dive into individual test cases with `analyze_results` tool
227
 
228
+ ### 📹 Demo Materials
 
 
 
 
229
 
230
+ - **🎥 Demo Videos**: [Quick Demo (5 min)](https://www.loom.com/share/d4d0003f06fa4327b46ba5c081bdf835) | [Full Demo (20 min)](https://www.loom.com/share/de559bb0aef749559c79117b7f951250)
231
+ - **📢 Social Post**: [Coming Soon - Link to announcement]
 
 
 
 
232
 
233
+ ---
 
234
 
235
+ ## Documentation
236
+
237
+ **For quick evaluation**:
238
+ - Read this README for overview
239
+ - Visit the [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) to test tools
240
+ - Use the Auto-Config link to connect your MCP client
241
+
242
+ **For deep dives**:
243
+ - [DOCUMENTATION.md](DOCUMENTATION.md) - Complete API reference
244
+ - Tool descriptions and parameters
245
+ - Resource URIs and schemas
246
+ - Prompt template details
247
+ - Example use cases
248
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
249
+ - Project structure
250
+ - MCP protocol implementation
251
+ - Gemini integration details
252
+ - Deployment guide
253
+ - [UI_GUIDE.md](UI_GUIDE.md) - Gradio interface walkthrough
254
+ - Tab-by-tab explanations
255
+ - Testing workflows
256
+ - Configuration options
257
 
258
+ ---
 
 
 
259
 
260
+ ## Technology Stack
 
 
 
261
 
262
+ - **AI Model**: Google Gemini 2.5 Flash (via Google AI SDK)
263
+ - **MCP Framework**: Gradio 6 with native MCP support (`@gr.mcp.*` decorators)
264
+ - **Data Source**: HuggingFace Datasets API
265
+ - **Transport**: SSE (recommended) + Streamable HTTP
266
+ - **Deployment**: HuggingFace Spaces (Docker SDK)
267
 
268
+ ---
269
 
270
+ ## Run Locally (Optional)
271
 
272
  ```bash
273
+ # Clone and setup
274
+ git clone https://github.com/Mandark-droid/TraceMind-mcp-server.git
275
+ cd TraceMind-mcp-server
276
+ python -m venv venv
277
+ source venv/bin/activate # Windows: venv\Scripts\activate
278
+ pip install -r requirements.txt
 
279
 
280
+ # Configure API keys
281
+ cp .env.example .env
282
+ # Edit .env with your GEMINI_API_KEY and HF_TOKEN
283
 
284
+ # Run the server
285
  python app.py
 
 
286
  ```
287
 
288
+ Visit http://localhost:7860 to test the tools via Gradio UI.
 
 
 
 
289
 
290
+ ---
 
 
 
 
 
 
 
 
 
291
 
292
+ ## Related Projects
293
 
294
+ **🧠 TraceMind-AI** (Track 2 - MCP in Action):
295
+ - Live Demo: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
296
+ - Consumes this MCP server for AI-powered agent evaluation UI
297
+ - Features autonomous agent chat, trace visualization, job submission
298
 
299
+ **📊 Foundation Libraries**:
300
+ - TraceVerde: https://github.com/Mandark-droid/genai_otel_instrument
301
+ - SMOLTRACE: https://github.com/Mandark-droid/SMOLTRACE
 
302
 
303
+ ---
 
 
 
304
 
305
+ ## Credits
306
 
307
+ **Built for**: MCP's 1st Birthday Hackathon (Nov 14-30, 2025)
308
  **Track**: Building MCP (Enterprise)
309
  **Author**: Kshitij Thakkar
310
  **Powered by**: Google Gemini 2.5 Flash
311
  **Built with**: Gradio (native MCP support)
312
 
313
+ **Sponsors**: HuggingFace • Google Gemini • Modal • Anthropic • Gradio • ElevenLabs • SambaNova • Blaxel
314
 
315
+ ---
316
 
317
+ ## License
318
 
319
+ AGPL-3.0 - See [LICENSE](LICENSE) for details
320
 
321
  ---
322
 
323
+ ## Support
324
+
325
+ - 📧 GitHub Issues: [TraceMind-mcp-server/issues](https://github.com/Mandark-droid/TraceMind-mcp-server/issues)
326
+ - 💬 HF Discord: `#mcp-1st-birthday-official🏆`
327
+ - 🏷️ Tag: `building-mcp-track-enterprise`