TraceMind-AI - MCP Integration Guide

This document explains how TraceMind-AI integrates with MCP servers to provide AI-powered agent evaluation.


Overview

TraceMind-AI demonstrates enterprise MCP client usage as part of the Track 2: MCP in Action submission. It showcases two distinct patterns of MCP integration:

  1. Direct MCP Client: a Python client that connects to the remote MCP server over SSE transport
  2. Autonomous Agent: a smolagents-based agent that uses the MCP tools for multi-step reasoning

Both patterns consume the same MCP server (TraceMind-mcp-server) to provide AI-powered analysis of agent evaluation data.


Dual MCP Integration

Pattern 1: Direct MCP Client Integration

Where: Leaderboard insights, cost estimation dialogs, trace debugging

How it works:

# TraceMind-AI calls MCP server directly
mcp_client = get_sync_mcp_client()
insights = mcp_client.analyze_leaderboard(
    metric_focus="overall",
    time_range="last_week",
    top_n=5
)
# Display insights in UI

Use cases:

  • Generate leaderboard insights when user clicks "Load Leaderboard"
  • Estimate costs when user clicks "Estimate Cost" in New Evaluation form
  • Debug traces when user asks questions in trace visualization

Advantages:

  • Direct, fast execution
  • Synchronous API (easy to integrate with Gradio)
  • Predictable, structured responses

Pattern 2: Autonomous Agent with MCP Tools

Where: Agent Chat tab

How it works:

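# smolagents agent discovers and uses MCP tools autonomously
from smolagents import ToolCallingAgent, MCPClient

# Connect to the MCP server over SSE
# (the "transport" key needs a recent smolagents release)
mcp_client = MCPClient({"url": mcp_server_url, "transport": "sse"})

# Agent initialized with the tools exposed by the MCP server
agent = ToolCallingAgent(
    tools=mcp_client.get_tools(),
    model=model_client,
)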

# User asks question
result = agent.run("What are the top 3 models and their costs?")

# Agent plans:
#   1. Call get_top_performers MCP tool
#   2. Extract costs from results
#   3. Format and present to user

Use cases:

  • Answer complex questions requiring multi-step analysis
  • Compare models across multiple dimensions
  • Plan evaluation strategies with cost estimates
  • Provide recommendations based on leaderboard data

Advantages:

  • Natural language interface
  • Multi-step reasoning
  • Autonomous tool selection
  • Context-aware responses

Architecture

System Overview

┌───────────────────────────────────────────────────────────────┐
│ TraceMind-AI (Gradio App) - Track 2                           │
│                                                               │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ UI Layer (Gradio)                                         │ │
│ │  - Leaderboard tab                                        │ │
│ │  - Agent Chat tab                                         │ │
│ │  - New Evaluation tab                                     │ │
│ │  - Trace Visualization tab                                │ │
│ └────────────┬─────────────────────────────┬────────────────┘ │
│              ↓                             ↓                  │
│  ┌───────────────────────┐   ┌──────────────────────────┐     │
│  │ Direct MCP Client     │   │ Autonomous Agent         │     │
│  │ (sync_wrapper.py)     │   │ (smolagents)             │     │
│  │                       │   │                          │     │
│  │ - Synchronous API     │   │ - Multi-step reasoning   │     │
│  │ - Tool calling        │   │ - Tool discovery         │     │
│  │ - Error handling      │   │ - Context management     │     │
│  └───────────┬───────────┘   └─────────────┬────────────┘     │
│              └────────────────┬────────────┘                  │
│                               ↓                               │
│                         MCP Protocol                          │
│                         (SSE Transport)                       │
└───────────────────────────────┬───────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│ TraceMind MCP Server - Track 1                                │
│ https://huggingface.co/spaces/MCP-1st-Birthday/               │
│ TraceMind-mcp-server                                          │
│                                                               │
│ 11 AI-Powered Tools:                                          │
│  - analyze_leaderboard                                        │
│  - debug_trace                                                │
│  - estimate_cost                                              │
│  - compare_runs                                               │
│  - analyze_results                                            │
│  - get_top_performers                                         │
│  - get_leaderboard_summary                                    │
│  - get_dataset                                                │
│  - generate_synthetic_dataset                                 │
│  - push_dataset_to_hub                                        │
│  - generate_prompt_template                                   │
└───────────────────────────────────────────────────────────────┘

MCP Client Implementation

File Structure

TraceMind-AI/
├── mcp_client/
│   ├── __init__.py
│   ├── client.py              # Async MCP client
│   └── sync_wrapper.py        # Synchronous wrapper for Gradio
├── agent/
│   ├── __init__.py
│   └── smolagents_setup.py    # Agent with MCP integration
└── app.py                     # Main Gradio app

Async MCP Client (client.py)

from contextlib import AsyncExitStack

from mcp import ClientSession
from mcp.client.sse import sse_client

class TraceMindMCPClient:
    """Async MCP client for TraceMind MCP Server"""

    def __init__(self, mcp_server_url: str):
        self.mcp_server_url = mcp_server_url
        self.session = None
        self.available_tools = {}
        # Keeps the SSE streams and the session open between calls
        self._exit_stack = AsyncExitStack()

    async def connect(self):
        """Establish connection to MCP server via SSE"""
        # For HTTP-based MCP servers (HuggingFace Spaces):
        # open the SSE streams, then start a session on top of them
        read_stream, write_stream = await self._exit_stack.enter_async_context(
            sse_client(self.mcp_server_url)
        )
        self.session = await self._exit_stack.enter_async_context(
            ClientSession(read_stream, write_stream)
        )
        await self.session.initialize()

        # List available tools
        tools_result = await self.session.list_tools()
        self.available_tools = {tool.name: tool for tool in tools_result.tools}

        print(f"Connected to MCP server. Available tools: {list(self.available_tools.keys())}")

    async def call_tool(self, tool_name: str, arguments: dict) -> str:
        """Call an MCP tool with given arguments"""
        if not self.session:
            raise RuntimeError("MCP client not connected. Call connect() first.")

        if tool_name not in self.available_tools:
            raise ValueError(f"Tool '{tool_name}' not available. Available: {list(self.available_tools.keys())}")

        # Call the tool
        result = await self.session.call_tool(tool_name, arguments=arguments)

        # Extract text response (first content item only)
        if result.content and len(result.content) > 0:
            return result.content[0].text
        return ""

    async def analyze_leaderboard(self, **kwargs) -> str:
        """Wrapper for analyze_leaderboard tool"""
        return await self.call_tool("analyze_leaderboard", kwargs)

    async def estimate_cost(self, **kwargs) -> str:
        """Wrapper for estimate_cost tool"""
        return await self.call_tool("estimate_cost", kwargs)

    async def debug_trace(self, **kwargs) -> str:
        """Wrapper for debug_trace tool"""
        return await self.call_tool("debug_trace", kwargs)

    async def compare_runs(self, **kwargs) -> str:
        """Wrapper for compare_runs tool"""
        return await self.call_tool("compare_runs", kwargs)

    async def get_top_performers(self, **kwargs) -> str:
        """Wrapper for get_top_performers tool"""
        return await self.call_tool("get_top_performers", kwargs)

    async def disconnect(self):
        """Close MCP connection"""
        # Closes the session first, then the SSE streams
        await self._exit_stack.aclose()
        self.session = None
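
The `call_tool` method above returns only the first content item. A tool result can carry several items, and not all of them need to be text, so a more defensive extraction may be worth having. The helper below is a sketch rather than part of TraceMind-AI: `extract_text` is a hypothetical name, and `SimpleNamespace` objects stand in for the SDK's content types so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def extract_text(content: Optional[Iterable[Any]]) -> str:
    """Join the text of every item that has a string `text` attribute."""
    parts = []
    for item in content or []:
        text = getattr(item, "text", None)
        if isinstance(text, str):
            parts.append(text)
    return "\n".join(parts)


# Stand-ins for a tool result: two text items and one image-like item.
fake_content = [
    SimpleNamespace(type="text", text="Top model: A"),
    SimpleNamespace(type="image", data="...", mimeType="image/png"),
    SimpleNamespace(type="text", text="Cheapest model: B"),
]
print(extract_text(fake_content))
```

Items without text are skipped silently here; raising or logging instead may be preferable if non-text results are unexpected.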

Synchronous Wrapper (sync_wrapper.py)

import asyncio
import os
from typing import Optional
from .client import TraceMindMCPClient

class SyncMCPClient:
    """Synchronous wrapper for async MCP client (Gradio-compatible)"""

    def __init__(self, mcp_server_url: str):
        self.mcp_server_url = mcp_server_url
        self.async_client = TraceMindMCPClient(mcp_server_url)
        self._connected = False
        # One dedicated loop: the MCP session stays bound to the loop it was
        # created on, so every call has to run on that same loop
        self._loop = asyncio.new_event_loop()

    def _run_async(self, coro):
        """Run async coroutine in sync context"""
        # Not safe for concurrent calls from several threads; guard with a
        # lock if Gradio handlers may overlap
        return self._loop.run_until_complete(coro)

    def initialize(self):
        """Connect to MCP server"""
        if not self._connected:
            self._run_async(self.async_client.connect())
            self._connected = True

    def analyze_leaderboard(self, **kwargs) -> str:
        """Synchronous wrapper for analyze_leaderboard"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.analyze_leaderboard(**kwargs))

    def estimate_cost(self, **kwargs) -> str:
        """Synchronous wrapper for estimate_cost"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.estimate_cost(**kwargs))

    def debug_trace(self, **kwargs) -> str:
        """Synchronous wrapper for debug_trace"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.debug_trace(**kwargs))

    # ... (similar wrappers for other tools)

# Global instance for use in Gradio app
_mcp_client: Optional[SyncMCPClient] = None

def get_sync_mcp_client() -> SyncMCPClient:
    """Get or create global sync MCP client instance"""
    global _mcp_client
    if _mcp_client is None:
        mcp_server_url = os.getenv(
            "MCP_SERVER_URL",
            "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
        )
        _mcp_client = SyncMCPClient(mcp_server_url)
    return _mcp_client

Usage in Gradio App

# app.py
import gradio as gr
from datasets import load_dataset
from mcp_client.sync_wrapper import get_sync_mcp_client

# Initialize MCP client
mcp_client = get_sync_mcp_client()
mcp_client.initialize()

# Use in Gradio event handlers
def load_leaderboard():
    """Load leaderboard and generate AI insights"""
    # Load dataset (assumes the data lives in the "train" split)
    ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
    df = ds.to_pandas()

    # Get AI insights from MCP server
    try:
        insights = mcp_client.analyze_leaderboard(
            metric_focus="overall",
            time_range="last_week",
            top_n=5
        )
    except Exception as e:
        insights = f"❌ Error generating insights: {str(e)}"

    return df, insights

# Gradio UI
with gr.Blocks() as app:
    with gr.Tab("📊 Leaderboard"):
        load_btn = gr.Button("Load Leaderboard")
        insights_md = gr.Markdown(label="AI Insights")
        leaderboard_table = gr.Dataframe()

        load_btn.click(
            fn=load_leaderboard,
            outputs=[leaderboard_table, insights_md]
        )

Agent Framework Integration

smolagents Setup

# agent/smolagents_setup.py
from smolagents import ToolCallingAgent, MCPClient, HfApiModel
import os

def create_agent():
    """Create smolagents agent with MCP tool access"""

    # 1. Configure MCP client
    mcp_server_url = os.getenv(
        "MCP_SERVER_URL",
        "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
    )

    # The "transport" key needs a recent smolagents release
    mcp_client = MCPClient({"url": mcp_server_url, "transport": "sse"})

    # 2. Configure LLM
    # (HfApiModel is named InferenceClientModel in newer smolagents releases)
    model = HfApiModel(
        model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
        token=os.getenv("HF_TOKEN")
    )

    # 3. Create agent with the tools discovered from the MCP server
    agent = ToolCallingAgent(
        tools=mcp_client.get_tools(),
        model=model,
        max_steps=10,
        verbosity_level=1
    )

    return agent

def run_agent_query(agent: ToolCallingAgent, query: str, show_reasoning: bool = False):
    """Run agent query and return response"""
    try:
        # Set verbosity based on show_reasoning flag
        if show_reasoning:
            agent.verbosity_level = 2  # Show tool execution logs
        else:
            agent.verbosity_level = 0  # Only show final answer

        # Run agent
        result = agent.run(query)

        return result
    except Exception as e:
        return f"❌ Agent error: {str(e)}"

Agent Chat UI

# app.py
import gradio as gr
from agent.smolagents_setup import create_agent, run_agent_query

# Initialize agent (once at startup)
agent = create_agent()

def agent_chat(message: str, history: list, show_reasoning: bool):
    """Handle agent chat interaction"""
    # Run agent query
    response = run_agent_query(agent, message, show_reasoning)

    # Update chat history
    history.append((message, response))

    return history, ""

# Gradio UI
with gr.Blocks() as app:
    with gr.Tab("🤖 Agent Chat"):
        gr.Markdown("## Autonomous Agent with MCP Tools")
        gr.Markdown("Ask questions about agent evaluations. The agent has access to all MCP tools.")

        chatbot = gr.Chatbot(label="Agent Chat")
        msg = gr.Textbox(label="Your Question", placeholder="What are the top 3 models and their costs?")
        show_reasoning = gr.Checkbox(label="Show Agent Reasoning", value=False)

        # Quick action buttons
        with gr.Row():
            quick_top = gr.Button("Quick: Top Models")
            quick_cost = gr.Button("Quick: Cost Estimate")
            quick_load = gr.Button("Quick: Load Leaderboard")

        # Event handlers
        msg.submit(agent_chat, [msg, chatbot, show_reasoning], [chatbot, msg])

        quick_top.click(
            lambda h, sr: agent_chat(
                "What are the top 5 models by success rate with their costs?",
                h,
                sr
            ),
            [chatbot, show_reasoning],
            [chatbot, msg]
        )

MCP Tools Usage

Tools Used in TraceMind-AI

| Tool | Where Used | Purpose |
|---|---|---|
| analyze_leaderboard | Leaderboard tab | Generate AI insights when user loads leaderboard |
| estimate_cost | New Evaluation tab | Predict costs before submitting evaluation |
| debug_trace | Trace Visualization | Answer questions about execution traces |
| compare_runs | Compare Runs/Agent Chat | Compare two evaluation runs side-by-side |
| analyze_results | Agent Chat | Analyze detailed test results with optimization recommendations |
| get_top_performers | Agent Chat | Efficiently fetch top N models (90% token reduction) |
| get_leaderboard_summary | Agent Chat | Get high-level statistics (99% token reduction) |
| get_dataset | Agent Chat | Load SMOLTRACE datasets for detailed analysis |

Example Tool Calls

Example 1: Leaderboard Insights

# User clicks "Load Leaderboard" button
insights = mcp_client.analyze_leaderboard(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric_focus="overall",
    time_range="last_week",
    top_n=5
)

# Return insights from the click handler; Gradio renders it in the
# Markdown component listed in outputs (insights_md)

Example 2: Cost Estimation

# User fills New Evaluation form and clicks "Estimate Cost"
estimate = mcp_client.estimate_cost(
    model="meta-llama/Llama-3.1-8B",
    agent_type="both",
    num_tests=100,
    hardware="auto"
)

# Display in dialog
gr.Info(estimate)

Example 3: Agent Multi-Step Query

# User asks: "What are the top 3 models and how much do they cost?"

# Agent reasoning (internal):
#   Step 1: Need to get top models by success rate
#   → Call get_top_performers(metric="success_rate", top_n=3)
#
#   Step 2: Extract cost information from results
#   → Parse JSON response, get "total_cost_usd" field
#
#   Step 3: Format response for user
#   → Create markdown table with model names, success rates, costs

# Agent response:
"""
Here are the top 3 models by success rate:

1. **GPT-4**: 95.8% success rate, $0.05 per run
2. **Claude-3**: 94.1% success rate, $0.04 per run
3. **Llama-3.1-8B**: 93.4% success rate, $0.002 per run

GPT-4 leads in accuracy but is 25x more expensive than Llama-3.1.
For cost-sensitive workloads, Llama-3.1 offers the best value.
"""

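Steps 2 and 3 of the reasoning above amount to parsing the tool's JSON output and rendering a table. A minimal sketch follows. Only the `total_cost_usd` field name comes from this guide; the `model` and `success_rate` keys, the sample values, and the `format_top_models` helper are assumptions made for illustration, and the real response schema may differ.

```python
import json


def format_top_models(tool_output: str) -> str:
    """Turn a JSON list of runs into a markdown table."""
    runs = json.loads(tool_output)
    lines = [
        "| Model | Success rate | Cost (USD) |",
        "|---|---|---|",
    ]
    for run in runs:
        lines.append(
            f"| {run['model']} | {run['success_rate']:.1f}% "
            f"| {run['total_cost_usd']:.3f} |"
        )
    return "\n".join(lines)


# Made-up payload standing in for a get_top_performers response.
sample = json.dumps([
    {"model": "model-a", "success_rate": 95.8, "total_cost_usd": 0.05},
    {"model": "model-b", "success_rate": 93.4, "total_cost_usd": 0.002},
])
print(format_top_models(sample))
```

In the agent these steps are carried out by the LLM itself; the code only makes the data flow explicit.
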
Development Guide

Adding New MCP Tool Integration

  1. Add method to async client (client.py):
async def new_tool_name(self, **kwargs) -> str:
    """Wrapper for new_tool_name MCP tool"""
    return await self.call_tool("new_tool_name", kwargs)
  2. Add synchronous wrapper (sync_wrapper.py):
def new_tool_name(self, **kwargs) -> str:
    """Synchronous wrapper for new_tool_name"""
    if not self._connected:
        self.initialize()
    return self._run_async(self.async_client.new_tool_name(**kwargs))
  3. Use in Gradio app (app.py):
def handle_new_tool():
    result = mcp_client.new_tool_name(param1="value1", param2="value2")
    return result

Note: The agent discovers new tools from the MCP server automatically, so the agent setup needs no code changes.
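
If the per-tool boilerplate in the first two steps becomes tedious, one option is to forward unknown attribute names to `call_tool` dynamically. This is a sketch of that idea, not how TraceMind-AI is implemented: `DynamicToolProxy` and `FakeAsyncClient` are hypothetical names, and the fake client exists only so the example runs without a server.

```python
import asyncio
from typing import Any, Callable


class FakeAsyncClient:
    """Stand-in for an async MCP client exposing call_tool()."""

    async def call_tool(self, tool_name: str, arguments: dict) -> str:
        return f"{tool_name} called with {sorted(arguments)}"


class DynamicToolProxy:
    """Expose every tool as a synchronous method without per-tool wrappers."""

    def __init__(self, async_client: Any):
        self._async_client = async_client

    def __getattr__(self, tool_name: str) -> Callable[..., str]:
        # __getattr__ only runs for names not found normally, so real
        # attributes such as _async_client are unaffected.
        if tool_name.startswith("_"):
            raise AttributeError(tool_name)

        def call(**kwargs: Any) -> str:
            # asyncio.run is enough for this stateless fake client.
            return asyncio.run(self._async_client.call_tool(tool_name, kwargs))

        return call


proxy = DynamicToolProxy(FakeAsyncClient())
print(proxy.new_tool_name(param1="value1", param2="value2"))
```

The trade-off is that explicit wrappers give editor completion and a place for docstrings, while the proxy accepts any name and only fails at call time. A client holding a persistent session would also need to reuse a single event loop, as `_run_async` does, instead of calling `asyncio.run` each time.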

Testing MCP Integration

Test 1: Connection

python -c "from mcp_client.sync_wrapper import get_sync_mcp_client; client = get_sync_mcp_client(); client.initialize(); print('✅ MCP client connected')"

Test 2: Tool Call

from mcp_client.sync_wrapper import get_sync_mcp_client

client = get_sync_mcp_client()
client.initialize()

result = client.analyze_leaderboard(
    metric_focus="cost",
    time_range="last_week",
    top_n=3
)

print(result)

Test 3: Agent

from agent.smolagents_setup import create_agent, run_agent_query

agent = create_agent()
response = run_agent_query(agent, "What are the top 3 models?", show_reasoning=True)
print(response)

Debugging MCP Issues

Issue: Connection timeout

  • Check: MCP server is running at specified URL
  • Check: Network connectivity to HuggingFace Spaces
  • Check: SSE transport is enabled on server
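
A remote server can be slow to answer on first contact, for example while a hosted Space is still starting, so a timeout does not always mean a misconfiguration. Retrying with exponential backoff is a common way to ride this out. The sketch below is generic: `retry_with_backoff` is a hypothetical helper, and `flaky_connect` simulates a server that fails twice before succeeding.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated connection that fails twice, then succeeds.
calls = {"count": 0}


def flaky_connect() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated connection timeout")
    return "connected"


delays = []  # collect the waits instead of actually sleeping
print(retry_with_backoff(flaky_connect, sleep=delays.append))
print(delays)
```

With the client from this guide, the call would look like `retry_with_backoff(client.initialize)`.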

Issue: Tool not found

  • Check: MCP server has the tool implemented
  • Check: Tool name matches exactly (case-sensitive)
  • Check: Client initialized successfully (call initialize() first)
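
Because tool names must match exactly, a typo such as `analyse_leaderboard` yields a bare "not available" error. A small helper can suggest the closest known name. This sketch uses the standard library's `difflib`; `suggest_tool` is a hypothetical helper, and the tool list is copied from the architecture diagram in this guide.

```python
import difflib
from typing import List, Optional

KNOWN_TOOLS = [
    "analyze_leaderboard", "debug_trace", "estimate_cost", "compare_runs",
    "analyze_results", "get_top_performers", "get_leaderboard_summary",
    "get_dataset", "generate_synthetic_dataset", "push_dataset_to_hub",
    "generate_prompt_template",
]


def suggest_tool(name: str, available: List[str]) -> Optional[str]:
    """Return the closest available tool name, or None if nothing is similar."""
    matches = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(suggest_tool("analyse_leaderboard", KNOWN_TOOLS))
print(suggest_tool("Debug_Trace", KNOWN_TOOLS))
```

In practice the list would come from the connected client's `available_tools` rather than a hardcoded constant.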

Issue: Agent not using MCP tools

  • Check: MCPClient is properly configured in agent setup
  • Check: Agent has max_steps > 0 to allow tool usage
  • Check: Query requires tool usage (not answerable from agent's knowledge alone)

Performance Considerations

Token Optimization

Problem: Loading the full leaderboard dataset consumes excessive tokens.

Solution: Use token-optimized MCP tools.

# ❌ BAD: Loads all 51 runs (50K+ tokens)
leaderboard = mcp_client.get_dataset("kshitijthakkar/smoltrace-leaderboard")

# βœ… GOOD: Returns only top 5 (5K tokens, 90% reduction)
top_performers = mcp_client.get_top_performers(top_n=5)

# βœ… BETTER: Returns summary stats (500 tokens, 99% reduction)
summary = mcp_client.get_leaderboard_summary()
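
The percentages in those comments follow directly from the token counts. As a quick check, using the approximate figures quoted above (`token_reduction` is a hypothetical helper name):

```python
def token_reduction(baseline_tokens: int, optimized_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline."""
    return 100.0 * (baseline_tokens - optimized_tokens) / baseline_tokens


full_dataset = 50_000  # approximate size of the full leaderboard
top_five = 5_000       # approximate size of a get_top_performers response
summary_only = 500     # approximate size of a get_leaderboard_summary response

print(f"top 5:   {token_reduction(full_dataset, top_five):.0f}% fewer tokens")
print(f"summary: {token_reduction(full_dataset, summary_only):.0f}% fewer tokens")
```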

Caching

Problem: Repeated identical MCP calls waste time and credits.

Solution: Implement client-side caching.

from functools import lru_cache
import time

@lru_cache(maxsize=32)
def cached_analyze_leaderboard(metric_focus: str, time_range: str, top_n: int, cache_key: int):
    """Cached MCP call with TTL via cache_key"""
    return mcp_client.analyze_leaderboard(
        metric_focus=metric_focus,
        time_range=time_range,
        top_n=top_n
    )

# Use with 5-minute cache TTL
cache_key = int(time.time() // 300)  # Changes every 5 minutes
insights = cached_analyze_leaderboard("overall", "last_week", 5, cache_key)
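
The `cache_key` trick works because `lru_cache` treats every distinct argument tuple as a separate entry: integer-dividing the clock by 300 gives the same value for all calls inside one five-minute window and a new value afterwards. The sketch below demonstrates the mechanism with injected timestamps and a stub in place of the MCP call; `fetch_insights`, `time_bucket`, and `call_log` are hypothetical names.

```python
from functools import lru_cache

TTL_SECONDS = 300
call_log = []  # records every time the stub "server" is actually hit


def time_bucket(now: float, ttl: int = TTL_SECONDS) -> int:
    """Map a timestamp to the index of its TTL-sized window."""
    return int(now // ttl)


@lru_cache(maxsize=32)
def fetch_insights(metric_focus: str, bucket: int) -> str:
    """Stub for an expensive MCP call; `bucket` only varies the cache key."""
    call_log.append((metric_focus, bucket))
    return f"insights for {metric_focus}"


fetch_insights("overall", time_bucket(1000.0))  # miss: bucket 3
fetch_insights("overall", time_bucket(1100.0))  # hit: still bucket 3
fetch_insights("overall", time_bucket(1300.0))  # miss: bucket 4
print(len(call_log))
```

Windows are aligned to the clock, so an entry can expire anywhere between 0 and 300 seconds after it was stored. Stale entries also stay in memory until `maxsize` evicts them.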

Async Optimization

Problem: Sequential MCP calls block the UI.

Solution: Use async for parallel calls.

import asyncio
from datasets import load_dataset

async def load_leaderboard_with_insights():
    """Load leaderboard and insights in parallel"""
    # mcp_client is the synchronous client from sync_wrapper.py.
    # Both calls block, so run each in a worker thread and wait for both.
    leaderboard, insights = await asyncio.gather(
        asyncio.to_thread(load_dataset, "kshitijthakkar/smoltrace-leaderboard"),
        asyncio.to_thread(mcp_client.analyze_leaderboard, metric_focus="overall"),
    )

    return leaderboard, insights

Security Considerations

API Key Management

DO:

  • Store API keys in environment variables or HF Spaces secrets
  • Use session-only storage in Gradio (not server-side persistence)
  • Rotate keys regularly

DON'T:

  • Hardcode API keys in source code
  • Expose keys in client-side JavaScript
  • Log API keys in console or files
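
These rules can be sketched as a small loader that reads a key from the environment and a helper that masks it before logging. `require_env` and `redact` are hypothetical helpers, and `EXAMPLE_API_KEY` is a placeholder variable set inside the snippet purely so that it runs.

```python
import os


def require_env(name: str) -> str:
    """Read a secret from the environment; fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def redact(secret: str, visible: int = 4) -> str:
    """Mask a secret for logs, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder value for demonstration only; a real key would come from the
# deployment environment or Spaces secrets, never from source code.
os.environ["EXAMPLE_API_KEY"] = "sk-demo-1234567890"
api_key = require_env("EXAMPLE_API_KEY")
print(f"Loaded key {redact(api_key)}")
```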

MCP Server Trust

Verify MCP server authenticity:

  • Use HTTPS URLs only
  • Verify domain ownership (huggingface.co spaces)
  • Review MCP server code before connecting (open source)
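
The first two checks can be automated before the client connects. This sketch validates the scheme and host with `urllib.parse`; `is_trusted_mcp_url` is a hypothetical helper, and the allow-list is an assumption based on the `hf.space` and `huggingface.co` URLs used in this guide.

```python
from urllib.parse import urlparse

# Assumed allow-list; adjust to the hosts you actually trust.
TRUSTED_HOST_SUFFIXES = (".hf.space", ".huggingface.co")
TRUSTED_HOSTS = ("huggingface.co",)


def is_trusted_mcp_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host is on the allow-list."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    return host in TRUSTED_HOSTS or host.endswith(TRUSTED_HOST_SUFFIXES)


print(is_trusted_mcp_url(
    "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
))
print(is_trusted_mcp_url("http://mcp-1st-birthday-tracemind-mcp-server.hf.space/"))
print(is_trusted_mcp_url("https://evil-hf.space.example.com/sse"))
```

The leading dot in each suffix matters: it stops look-alike hosts such as `evilhf.space` from passing. A host check does not replace reviewing the server code, since anyone can publish a Space under these domains.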

Limit tool access:

  • Only connect to trusted MCP servers
  • Review tool permissions before use
  • Implement rate limiting for tool calls
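
Rate limiting can be added on the client side without touching the server. Below is a sketch of a sliding-window limiter. `SlidingWindowLimiter` is a hypothetical class, the limit of 3 calls per 60 seconds is arbitrary, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._calls: Deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit a call if the window has room; else refuse."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False


# Fake clock so the example is deterministic.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_calls=3, period=60.0, clock=lambda: fake_now[0])

results = [limiter.allow() for _ in range(4)]  # fourth call exceeds the limit
fake_now[0] = 61.0                             # the window has passed
results.append(limiter.allow())
print(results)
```

A handler would call `limiter.allow()` before each MCP tool call and show a "try again later" message when it returns `False`.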


Last Updated: November 21, 2025