Mandark-droid committed on
Commit a3116de · 1 Parent(s): c5a5a1d

TraceMind MCP Server V1 Working and tested.

Files changed (9)
  1. .env.example +8 -0
  2. .gitignore +34 -0
  3. API_KEY_CONFIGURATION.md +261 -0
  4. LICENSE +680 -0
  5. README.md +586 -8
  6. app.py +1006 -0
  7. gemini_client.py +185 -0
  8. mcp_tools.py +943 -0
  9. requirements.txt +11 -0
.env.example ADDED
@@ -0,0 +1,8 @@
+ # Google Gemini API Key
+ # Get from: https://ai.google.dev/
+ GEMINI_API_KEY=your_gemini_api_key_here
+
+ # HuggingFace Token
+ # Get from: https://huggingface.co/settings/tokens
+ # Needs read access to datasets
+ HF_TOKEN=your_huggingface_token_here
.gitignore ADDED
@@ -0,0 +1,34 @@
+ # Environment
+ .env
+ .venv/
+ venv/
+ env/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
+
+ # IDEs
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Gradio
+ flagged/
+ gradio_cached_examples/
+
+ # Logs
+ *.log
API_KEY_CONFIGURATION.md ADDED
@@ -0,0 +1,261 @@
+ # API Key Configuration Feature
+
+ ## Overview
+
+ Users can now configure their API keys directly through the TraceMind MCP Server UI. These user-provided keys override environment variables for the current session.
+
+ ## What's New
+
+ ### 1. Settings Tab (⚙️)
+ A new Settings tab has been added as the first tab in the UI, where users can:
+ - Enter their **Google Gemini API Key**
+ - Enter their **HuggingFace Token**
+ - Save keys for the current session
+ - Clear session keys and revert to environment variables
+
+ ### 2. Session-Only Storage
+ - API keys are stored in Gradio's session state
+ - Keys are **NOT** persisted to disk or cookies
+ - Keys are automatically cleared when the browser session ends
+ - Each user session has its own isolated key storage
+
+ ### 3. Automatic Key Validation
+ - Gemini API keys are validated when saved by creating a test client
+ - Invalid keys are rejected with clear error messages
+ - Users receive immediate feedback on key validity
+
+ ## How to Use
+
+ ### Option 1: Configure via UI (Recommended)
+
+ 1. **Navigate to the Settings Tab**
+    - Open the TraceMind MCP Server app
+    - Click on the "⚙️ Settings" tab (first tab)
+
+ 2. **Enter Your Keys**
+    - **Gemini API Key**: Get from https://aistudio.google.com/app/apikey
+    - **HuggingFace Token**: Get from https://huggingface.co/settings/tokens
+
+ 3. **Save Keys**
+    - Click "💾 Save API Keys for This Session"
+    - Wait for validation confirmation
+    - Keys are now active for all tools
+
+ 4. **Use Any Tool**
+    - Navigate to any other tab (Analyze Leaderboard, Debug Trace, etc.)
+    - Tools will automatically use your configured keys
+    - No additional configuration needed
+
+ ### Option 2: Environment Variables (Still Supported)
+
+ You can still use environment variables as before:
+
+ ```bash
+ export GEMINI_API_KEY="your-key-here"
+ export HF_TOKEN="your-token-here"
+ ```
+
+ **Note**: UI-configured keys always override environment variables.
+
+ ## Technical Details
+
+ ### Architecture Changes
+
+ #### 1. UI Layer (`app.py`)
+ - Added Settings tab with key input forms
+ - Implemented session state management with `gr.State()` (see the sketch below)
+ - Updated all tool functions to accept API keys as parameters
+ - Added key validation and error handling
+
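+ A minimal sketch of how this session wiring could look; the handler and component names here are illustrative, not the exact ones in `app.py`:
+
+ ```python
+ import gradio as gr
+
+ def save_keys(gemini_key, hf_token, state):
+     """Illustrative handler: stash keys in the per-session state dict."""
+     state = dict(state or {})
+     if gemini_key:
+         state["gemini_api_key"] = gemini_key.strip()
+     if hf_token:
+         state["hf_token"] = hf_token.strip()
+     return state, "✅ Keys saved for this session"
+
+ def clear_keys(_state):
+     """Illustrative handler: drop session keys, reverting to env vars."""
+     return {}, "Session keys cleared - environment variables are active again"
+
+ with gr.Blocks() as demo:
+     key_state = gr.State({})  # isolated per browser session, never persisted
+     gemini_box = gr.Textbox(label="Gemini API Key", type="password")
+     hf_box = gr.Textbox(label="HuggingFace Token", type="password")
+     status = gr.Markdown()
+     gr.Button("💾 Save API Keys for This Session").click(
+         save_keys, [gemini_box, hf_box, key_state], [key_state, status]
+     )
+     gr.Button("Clear Session Keys").click(clear_keys, [key_state], [key_state, status])
+ ```
+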
+ #### 2. Tool Layer (`mcp_tools.py`)
+ - Updated all functions to accept an optional `hf_token` parameter
+ - Modified `load_dataset()` calls to use user-provided tokens
+ - Added fallback to environment variables when no token is provided (see the sketch below)
+ - Functions updated:
+   - `analyze_leaderboard()`
+   - `debug_trace()`
+   - `compare_runs()`
+   - `get_dataset()`
+   - `get_leaderboard_data()` (MCP Resource)
+   - `get_trace_data()` (MCP Resource)
+
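+ A minimal sketch of that fallback, assuming a recent `datasets` release where `load_dataset()` accepts a `token=` argument (the helper names and `split` choice are illustrative):
+
+ ```python
+ import os
+ from typing import Optional
+
+ from datasets import load_dataset
+
+ def _resolve_hf_token(hf_token: Optional[str] = None) -> Optional[str]:
+     """User-provided token (UI) wins; otherwise fall back to the environment."""
+     return hf_token or os.environ.get("HF_TOKEN")
+
+ def load_eval_dataset(repo_id: str, hf_token: Optional[str] = None):
+     # token=None still works for public datasets or a cached `huggingface-cli login`
+     return load_dataset(repo_id, split="train", token=_resolve_hf_token(hf_token))
+ ```
+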
+ #### 3. Client Layer (`gemini_client.py`)
+ - `GeminiClient.__init__()` already supported an optional `api_key` parameter
+ - No changes needed - already designed for key override
+
+ ### Key Features
+
+ 1. **Priority Order** (see the sketch after this list):
+    ```
+    User-provided key (UI) > Environment variable > Error
+    ```
+
+ 2. **Validation**:
+    - Gemini keys: Validated by creating a test `GeminiClient`
+    - HF tokens: Accepted without validation (validated on first use)
+
+ 3. **Error Handling**:
+    - Clear error messages when keys are missing
+    - Helpful prompts to configure keys in the Settings tab
+    - Validation errors shown immediately
+
+ 4. **Session Management**:
+    - Keys stored in `gr.State()` (Gradio session state)
+    - Isolated per user in multi-user environments
+    - Automatically cleared on session end
+
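+ In code, the priority order and save-time validation could look roughly like this, assuming `GeminiClient` raises on a bad key when constructed (per the client layer notes above); the function names are illustrative:
+
+ ```python
+ import os
+ from typing import Optional
+
+ from gemini_client import GeminiClient
+
+ def resolve_gemini_key(user_key: Optional[str]) -> str:
+     """User-provided key (UI) > environment variable > error."""
+     key = user_key or os.environ.get("GEMINI_API_KEY")
+     if not key:
+         raise ValueError(
+             "No API key configured - add one in the ⚙️ Settings tab "
+             "or set GEMINI_API_KEY."
+         )
+     return key
+
+ def validate_gemini_key(user_key: str) -> str:
+     """Build a throwaway client to check the key, as the Settings tab does."""
+     try:
+         GeminiClient(api_key=user_key)
+         return "✅ Gemini API key validated and saved"
+     except Exception as exc:  # surface the reason without logging the key itself
+         return f"❌ Gemini API key invalid: {exc}"
+ ```
+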
+ ## Security Considerations
+
+ ### ✅ Secure Practices
+
+ 1. **No Persistence**: Keys are never written to disk
+ 2. **Session Isolation**: Each user has isolated key storage
+ 3. **Password Fields**: Keys displayed as `type="password"` (hidden)
+ 4. **No Logging**: Keys not logged or exposed in error messages
+
+ ### ⚠️ Security Notes
+
+ - **HTTPS Required**: Always use HTTPS in production to protect keys in transit
+ - **Public Spaces**: Be cautious when using this on public HuggingFace Spaces
+ - **Shared Environments**: Each browser session is isolated, but the server process still has access to any key you enter
+ - **Recommendation**: Use environment variables for production deployments
+
+ ## Examples
+
+ ### Example 1: First-Time User
+
+ ```
+ 1. User opens app (no env vars set)
+ 2. User sees "⚠️ Status: No API key configured" in Settings
+ 3. User enters Gemini API key and HF token
+ 4. User clicks "Save API Keys"
+ 5. User sees "✅ Gemini API key validated and saved"
+ 6. User switches to "Analyze Leaderboard" tab
+ 7. Tool works using user-provided keys
+ ```
+
+ ### Example 2: Overriding Environment Variables
+
+ ```
+ 1. User has GEMINI_API_KEY set in environment
+ 2. User wants to test with a different key
+ 3. User enters new key in Settings tab
+ 4. User clicks "Save API Keys"
+ 5. All tools now use the new key (not the env var)
+ 6. User clicks "Clear Session Keys" to revert
+ 7. Tools now use environment variable again
+ ```
+
+ ### Example 3: Error Handling
+
+ ```
+ 1. User enters invalid Gemini API key
+ 2. User clicks "Save API Keys"
+ 3. User sees "❌ Gemini API key invalid: [error message]"
+ 4. User corrects the key and tries again
+ 5. User sees "✅ Gemini API key validated and saved"
+ ```
+
+ ## API Changes
+
+ ### Function Signatures
+
+ All tool functions now accept optional API key parameters:
+
+ ```python
+ # Before
+ async def analyze_leaderboard(
+     gemini_client: GeminiClient,
+     leaderboard_repo: str = "...",
+     ...
+ ) -> str:
+
+ # After
+ async def analyze_leaderboard(
+     gemini_client: GeminiClient,
+     leaderboard_repo: str = "...",
+     ...,
+     hf_token: Optional[str] = None  # NEW
+ ) -> str:
+ ```
+
+ ### Backward Compatibility
+
+ - ✅ All existing code continues to work
+ - ✅ Environment variables still supported
+ - ✅ No breaking changes to MCP protocol
+ - ✅ Optional parameters have sensible defaults
+
+ ## Testing Checklist
+
+ - [x] UI renders Settings tab correctly
+ - [x] Gemini API key input works (password field)
+ - [x] HF token input works (password field)
+ - [x] Save button validates and stores keys
+ - [x] Clear button reverts to environment variables
+ - [ ] All tools use user-provided Gemini key
+ - [ ] All tools use user-provided HF token
+ - [ ] Invalid Gemini key shows error
+ - [ ] Missing keys show helpful error messages
+ - [ ] Session isolation works in multi-user scenario
+ - [ ] Keys cleared on browser close
+
+ ## Future Enhancements
+
+ 1. **Key Persistence** (Optional):
+    - Add opt-in browser localStorage support
+    - Warning about security implications
+
+ 2. **Multiple Key Profiles**:
+    - Save multiple key configurations
+    - Quick switch between profiles
+
+ 3. **Usage Tracking**:
+    - Show API usage per session
+    - Cost estimation based on usage
+
+ 4. **Token Expiration**:
+    - Detect expired HF tokens
+    - Prompt for refresh
+
+ ## Troubleshooting
+
+ ### Keys Not Working
+
+ **Problem**: Tools show "No API key configured" error
+
+ **Solutions**:
+ 1. Check that you clicked the "Save API Keys" button
+ 2. Look for the validation success message
+ 3. Try refreshing the page and re-entering keys
+ 4. Check the browser console for errors
+
+ ### Validation Fails
+
+ **Problem**: "❌ Gemini API key invalid" error
+
+ **Solutions**:
+ 1. Verify the key was copied correctly (no extra spaces)
+ 2. Check that the key is active at https://aistudio.google.com/app/apikey
+ 3. Ensure you have API quota remaining
+ 4. Try generating a new key
+
+ ### Dataset Access Denied
+
+ **Problem**: "Error loading dataset: Access denied"
+
+ **Solutions**:
+ 1. Verify the HF token is correct
+ 2. Check that the token has read permissions
+ 3. Ensure the dataset is public or you have access
+ 4. Try using a new token
+
+ ## Support
+
+ For issues or questions:
+ - Check the Settings tab for status messages
+ - Review error messages in tool outputs
+ - Open an issue on GitHub with:
+   - Steps to reproduce
+   - Error messages (DO NOT include actual API keys)
+   - Browser and OS information
LICENSE ADDED
@@ -0,0 +1,680 @@
+ TraceMind MCP Server - AI-powered MCP server for agent evaluation analysis
+ Copyright (C) 2025 Kshitij Thakkar
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ ================================================================================
+
+ GNU AFFERO GENERAL PUBLIC LICENSE
+ Version 3, 19 November 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU Affero General Public License is a free, copyleft license for
+ software and other kinds of works, specifically designed to ensure
+ cooperation with the community in the case of network server software.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ our General Public Licenses are intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ Developers that use our General Public Licenses protect your rights
+ with two steps: (1) assert copyright on the software, and (2) offer
+ you this License which gives you legal permission to copy, distribute
+ and/or modify the software.
+
+ A secondary benefit of defending all users' freedom is that
+ improvements made in alternate versions of the program, if they
+ receive widespread use, become available for other developers to
+ incorporate. Many developers of free software are heartened and
+ encouraged by the resulting cooperation. However, in the case of
+ software used on network servers, this result may fail to come about.
+ The GNU General Public License permits making a modified version and
+ letting the public access it on a server without ever releasing its
+ source code to the public.
+
+ The GNU Affero General Public License is designed specifically to
+ ensure that, in such cases, the modified source code becomes available
+ to the community. It requires the operator of a network server to
+ provide the source code of the modified version running there to the
+ users of that server. Therefore, public use of a modified version, on
+ a publicly accessible server, gives the public access to the source
+ code of the modified version.
+
+ An older license, called the Affero General Public License and
+ published by Affero, was designed to accomplish similar goals. This is
+ a different license, not a version of the Affero GPL, but Affero has
+ released a new version of the Affero GPL which permits relicensing under
+ this license.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU Affero General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Remote Network Interaction; Use with the GNU General Public License.
+
+ Notwithstanding any other provision of this License, if you modify the
+ Program, your modified version must prominently offer all users
+ interacting with it remotely through a computer network (if your version
+ supports such interaction) an opportunity to receive the Corresponding
+ Source of your version by providing access to the Corresponding Source
+ from a network server at no charge, through some standard or customary
+ means of facilitating copying of software. This Corresponding Source
+ shall include the Corresponding Source for any work covered by version 3
+ of the GNU General Public License that is incorporated pursuant to the
+ following paragraph.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the work with which it is combined will remain governed by version
+ 3 of the GNU General Public License.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU Affero General Public License from time to time. Such new versions
+ will be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU Affero General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU Affero General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU Affero General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ TraceMind MCP Server - AI-powered MCP server for agent evaluation analysis
+ Copyright (C) 2025 Kshitij Thakkar
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ For contact information: kshitijthakkar@rocketmail.com
+ GitHub: https://github.com/Mandark-droid/TraceMind-mcp-server
+
+ If your software can interact with users remotely through a computer
+ network, you should also make sure that it provides a way for users to
+ get its source. For example, if your program is a web application, its
+ interface could display a "Source" link that leads users to an archive
+ of the code. There are many ways you could offer source, and different
+ solutions will be better for different programs; see section 13 for the
+ specific requirements.
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU AGPL, see
+ <https://www.gnu.org/licenses/>.
README.md CHANGED
@@ -1,14 +1,592 @@
  ---
- title: TraceMind Mcp Server
- emoji: 🏃
- colorFrom: green
- colorTo: blue
+ title: TraceMind MCP Server
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
+ app_port: 7860
  pinned: false
  license: agpl-3.0
- short_description: AI-powered MCP server for model/agent evaluation analysis
+ short_description: AI-powered MCP server for agent evaluation analysis with Gemini 2.5 Pro
+ tags:
+ - building-mcp-track-enterprise
+ - mcp
+ - gradio
+ - gemini
+ - agent-evaluation
+ - leaderboard
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
19
 
20
+ # TraceMind MCP Server
21
+
22
+ **AI-Powered Analysis Tools for Agent Evaluation Data**
23
+
24
+ [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
25
+ [![Track](https://img.shields.io/badge/Track-Building%20MCP%20(Enterprise)-green)](https://github.com/modelcontextprotocol/hackathon)
26
+ [![Google Gemini](https://img.shields.io/badge/Powered%20by-Google%20Gemini%202.5%20Pro-orange)](https://ai.google.dev/)
27
+
28
+ > **🎯 Track 1 Submission**: Building MCP (Enterprise)
29
+ > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
30
+
31
+ ## Overview
32
+
33
+ TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
34
+
35
+ ### 🛠️ **5 AI-Powered Tools**
36
+ 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
37
+ 2. **🐛 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
38
+ 3. **💰 estimate_cost**: Predict evaluation costs before running
39
+ 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
40
+ 5. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
41
+
42
+ ### 📦 **3 Data Resources**
43
+ 1. **leaderboard data**: Direct JSON access to evaluation results
44
+ 2. **trace data**: Raw OpenTelemetry trace data with spans
45
+ 3. **cost data**: Model pricing and hardware cost information
46
+
47
+ ### 📝 **3 Prompt Templates**
48
+ 1. **analysis prompts**: Standardized templates for different analysis types
49
+ 2. **debug prompts**: Templates for debugging scenarios
50
+ 3. **optimization prompts**: Templates for optimization goals
51
+
52
+ All analysis is powered by **Google Gemini 2.5 Pro** for intelligent, context-aware insights.
53
+
54
+ ## 📱 Social Media & Demo
55
+
56
+ **📢 Announcement Post**: [Coming Soon - X/LinkedIn post]
57
+
58
+ **🎥 Demo Video**: [Coming Soon - YouTube/Loom link showing MCP server integration with Claude Desktop]
59
+
60
+ ---
61
+
62
+ ## Why This MCP Server?
63
+
64
+ **Problem**: Agent evaluation generates massive amounts of data (leaderboards, traces, metrics), but developers struggle to:
65
+ - Understand which models perform best for their use case
66
+ - Debug why specific agent executions failed
67
+ - Estimate costs before running expensive evaluations
68
+
69
+ **Solution**: This MCP server provides AI-powered analysis tools that connect to HuggingFace datasets and deliver actionable insights in natural language.
70
+
71
+ **Impact**: Developers can make informed decisions about agent configurations, debug issues faster, and optimize costs—all through a simple MCP interface.
72
+
73
+ ## Features
74
+
75
+ ### 🎯 Track 1 Compliance: Building MCP (Enterprise)
76
+
77
+ - ✅ **Complete MCP Implementation**: Tools, Resources, AND Prompts
78
+ - ✅ **MCP Standard Compliant**: Built with Gradio's native MCP support (`@gr.mcp.*` decorators)
79
+ - ✅ **Production-Ready**: Deployable to HuggingFace Spaces with SSE transport
80
+ - ✅ **Testing Interface**: Beautiful Gradio UI for testing all components
81
+ - ✅ **Enterprise Focus**: Cost optimization, debugging, and decision support
82
+ - ✅ **Google Gemini Powered**: Leverages Gemini 2.5 Pro for intelligent analysis
83
+ - ✅ **11 Total Components**: 5 Tools + 3 Resources + 3 Prompts
84
+
85
+ ### 🛠️ Five Production-Ready Tools
86
+
87
+ #### 1. analyze_leaderboard
88
+
89
+ Analyzes evaluation leaderboard data from HuggingFace datasets and provides:
90
+ - Top performers by selected metric (accuracy, cost, latency, CO2)
91
+ - Trade-off analysis (e.g., "GPT-4 is most accurate but Llama-3.1 is 25x cheaper")
92
+ - Trend identification
93
+ - Actionable recommendations
94
+
95
+ **Example Use Case**: Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
96
+
97
+ #### 2. debug_trace
98
+
99
+ Analyzes OpenTelemetry trace data and answers specific questions like:
100
+ - "Why was tool X called twice?"
101
+ - "Which step took the most time?"
102
+ - "Why did this test fail?"
103
+
104
+ **Example Use Case**: When an agent test fails, understand exactly what happened without manually parsing trace spans.
105
+
106
+ #### 3. estimate_cost
107
+
108
+ Predicts costs before running evaluations:
109
+ - LLM API costs (token-based)
110
+ - HuggingFace Jobs compute costs
111
+ - CO2 emissions estimate
112
+ - Hardware recommendations
113
+
114
+ **Example Use Case**: Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
115
+
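+ As a rough back-of-envelope for the LLM portion (illustrative prices and token counts, not live rates):
+
+ ```python
+ # Back-of-envelope LLM API cost math (all numbers are assumptions)
+ num_tests = 100
+ in_tok, out_tok = 2_000, 500          # assumed tokens per test
+ in_price, out_price = 30.0, 60.0      # assumed $ per 1M input/output tokens
+ cost = num_tests * (in_tok * in_price + out_tok * out_price) / 1_000_000
+ print(f"${cost:.2f}")                 # -> $9.00
+ ```
+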
116
+ #### 4. compare_runs
117
+
118
+ Compares two evaluation runs with AI-powered analysis across multiple dimensions:
119
+ - Success rate comparison with statistical significance
120
+ - Cost efficiency analysis (total cost, cost per test, cost per successful test)
121
+ - Speed comparison (average duration, throughput)
122
+ - Environmental impact (CO2 emissions per test)
123
+ - GPU efficiency (for GPU jobs)
124
+
125
+ **Focus Options**:
126
+ - `comprehensive`: Complete comparison across all dimensions
127
+ - `cost`: Detailed cost efficiency and ROI analysis
128
+ - `performance`: Speed and accuracy trade-off analysis
129
+ - `eco_friendly`: Environmental impact and carbon footprint comparison
130
+
131
+ **Example Use Case**: After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment based on your priorities (accuracy, cost, speed, or environmental impact).
132
+
133
+ #### 5. get_dataset
134
+
135
+ Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON:
136
+ - Simple, flexible tool that returns complete dataset with metadata
137
+ - Works with any dataset whose repository name contains the "smoltrace-" prefix
138
+ - Returns total rows, columns list, and data array
139
+ - Automatically sorts by timestamp if available
140
+ - Configurable row limit (1-200) to manage token usage
141
+
142
+ **Security Restriction**: Only datasets with "smoltrace-" in the repository name are allowed.
143
+
144
+ **Primary Use Cases**:
145
+ - Load `smoltrace-leaderboard` to find run IDs and model names
146
+ - Discover supporting datasets via `results_dataset`, `traces_dataset`, `metrics_dataset` fields
147
+ - Load `smoltrace-results-*` datasets to see individual test case details
148
+ - Load `smoltrace-traces-*` datasets to access OpenTelemetry trace data
149
+ - Load `smoltrace-metrics-*` datasets to get GPU performance data
150
+ - Answer specific questions requiring raw data access
151
+
152
+ **Example Workflow**:
153
+ 1. LLM calls `get_dataset("kshitijthakkar/smoltrace-leaderboard")` to see all runs
154
+ 2. Examines the JSON response to find run IDs, models, and supporting dataset names
155
+ 3. Calls `get_dataset("username/smoltrace-results-gpt4")` to load detailed results
156
+ 4. Can now answer questions like "What are the last 10 run IDs?" or "Which models were tested?"
157
+
158
+ **Example Use Case**: When the user asks "Can you provide me with the list of last 10 runIds and model names?", the LLM loads the leaderboard dataset and extracts the requested information from the JSON response.
159
+
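+ Outside of MCP, the same data can be inspected directly with the `datasets` library (column names below are illustrative; check the actual schema):
+
+ ```python
+ # Hedged sketch: peek at the leaderboard dataset that get_dataset wraps
+ from datasets import load_dataset
+
+ ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
+ print(ds.column_names)                          # e.g. run_id, model, results_dataset, ...
+ for row in ds.select(range(min(10, len(ds)))):
+     print(row.get("run_id"), row.get("model"))  # assumed column names
+ ```
+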
160
+ ## MCP Resources Usage
161
+
162
+ Resources provide direct data access without AI analysis:
163
+
164
+ ```python
165
+ # Access leaderboard data
166
+ GET leaderboard://kshitijthakkar/smoltrace-leaderboard
167
+ # Returns: JSON with all evaluation runs
168
+
169
+ # Access specific trace
170
+ GET trace://trace_abc123/username/agent-traces-gpt4
171
+ # Returns: JSON with trace spans and attributes
172
+
173
+ # Get model cost information
174
+ GET cost://model/openai/gpt-4
175
+ # Returns: JSON with pricing and hardware costs
176
+ ```
177
+
178
+ ## MCP Prompts Usage
179
+
180
+ Prompts provide reusable templates for standardized interactions:
181
+
182
+ ```python
183
+ # Get analysis prompt template
184
+ analysis_prompt(analysis_type="leaderboard", focus_area="cost", detail_level="detailed")
185
+ # Returns: "Provide a detailed analysis. Analyze cost efficiency in the leaderboard..."
186
+
187
+ # Get debug prompt template
188
+ debug_prompt(debug_type="performance", context="tool_calling")
189
+ # Returns: "Analyze tool calling performance. Identify which tools are slow..."
190
+
191
+ # Get optimization prompt template
192
+ optimization_prompt(optimization_goal="cost", constraints="maintain_quality")
193
+ # Returns: "Analyze this evaluation setup and recommend cost optimizations..."
194
+ ```
195
+
196
+ Use these prompts when interacting with the tools to get consistent, high-quality analysis.
197
+
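+ For example, a client (or a local script) might pair a prompt template with a tool call like this (a sketch using this repo's `mcp_tools.py`; argument values are illustrative):
+
+ ```python
+ # Illustrative pairing of an MCP prompt with an MCP tool
+ import asyncio
+ from gemini_client import GeminiClient
+ from mcp_tools import analysis_prompt, analyze_leaderboard
+
+ async def main():
+     template = analysis_prompt("leaderboard", "cost", "detailed")
+     print(template)  # standardized instructions to steer the analysis
+     report = await analyze_leaderboard(
+         gemini_client=GeminiClient(),
+         leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
+         metric_focus="cost",
+         time_range="last_week",
+         top_n=5,
+     )
+     print(report)
+
+ asyncio.run(main())
+ ```
+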
198
+ ## Quick Start
199
+
200
+ ### 1. Installation
201
+
202
+ ```bash
203
+ git clone https://github.com/Mandark-droid/TraceMind-mcp-server.git
204
+ cd TraceMind-mcp-server
205
+
206
+ # Create virtual environment
207
+ python -m venv venv
208
+ source venv/bin/activate # On Windows: venv\Scripts\activate
209
+
210
+ # Install dependencies (note: gradio[mcp] includes MCP support)
211
+ pip install -r requirements.txt
212
+ ```
213
+
214
+ ### 2. Environment Setup
215
+
216
+ Create `.env` file:
217
+
218
+ ```bash
219
+ cp .env.example .env
220
+ # Edit .env and add your API keys
221
+ ```
222
+
223
+ Get your keys:
224
+ - **Gemini API Key**: https://ai.google.dev/
225
+ - **HuggingFace Token**: https://huggingface.co/settings/tokens
226
+
227
+ ### 3. Run Locally
228
+
229
+ ```bash
230
+ python app.py
231
+ ```
232
+
233
+ Open http://localhost:7860 to test the tools via Gradio interface.
234
+
235
+ ### 4. Test with Live Data
236
+
237
+ Try the live example with a real HuggingFace dataset:
238
+
239
+ **In the Gradio UI, Tab "📊 Analyze Leaderboard":**
240
+
241
+ ```
242
+ Leaderboard Repository: kshitijthakkar/smoltrace-leaderboard
243
+ Metric Focus: overall
244
+ Time Range: last_week
245
+ Top N Models: 5
246
+ ```
247
+
248
+ Click "🔍 Analyze" and get AI-powered insights from live data!
249
+
250
+ ## MCP Integration
251
+
252
+ ### How It Works
253
+
254
+ This Gradio app uses `mcp_server=True` in the launch configuration, which automatically:
255
+ - Exposes all async functions with proper docstrings as MCP tools
256
+ - Handles MCP protocol communication
257
+ - Provides a standard MCP interface via SSE (Server-Sent Events)
258
+
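+ A minimal sketch of the pattern (hypothetical `echo` tool; assumes `gradio[mcp]` is installed):
+
+ ```python
+ # Any typed, docstring'd function becomes an MCP tool when
+ # mcp_server=True is passed to launch().
+ import gradio as gr
+
+ def echo(text: str) -> str:
+     """Echo the input text back unchanged.
+
+     Args:
+         text (str): Text to echo back
+     """
+     return text
+
+ demo = gr.Interface(fn=echo, inputs="text", outputs="text")
+ demo.launch(mcp_server=True)  # also serves /gradio_api/mcp/sse
+ ```
+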
259
+ ### Connecting from MCP Clients
260
+
261
+ Once deployed to HuggingFace Spaces, your MCP server will be available at:
262
+
263
+ **MCP Endpoint (SSE)**:
264
+ ```
265
+ https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
266
+ ```
267
+
268
+ **Schema Endpoint**:
269
+ ```
270
+ https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/schema
271
+ ```
272
+
273
+ Configure your MCP client (Claude Desktop, Cursor, Cline, etc.) with the SSE endpoint.
274
+
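+ For example, a typical Claude Desktop entry looks roughly like this (exact config format varies by client; `mcp-remote` bridges SSE servers for clients that only speak stdio):
+
+ ```json
+ {
+   "mcpServers": {
+     "tracemind": {
+       "command": "npx",
+       "args": ["mcp-remote", "https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"]
+     }
+   }
+ }
+ ```
+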
275
+ ### Available MCP Components
276
+
277
+ **Tools** (5):
278
+ 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
279
+ 2. **debug_trace**: Trace debugging with AI insights
280
+ 3. **estimate_cost**: Cost estimation with optimization recommendations
281
+ 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
282
+ 5. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
283
+
284
+ **Resources** (3):
285
+ 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
286
+ 2. **trace://{trace_id}/{repo}**: Direct access to trace data with spans
287
+ 3. **cost://model/{model_name}**: Model pricing and hardware cost information
288
+
289
+ **Prompts** (3):
290
+ 1. **analysis_prompt**: Reusable templates for different analysis types
291
+ 2. **debug_prompt**: Reusable templates for debugging scenarios
292
+ 3. **optimization_prompt**: Reusable templates for optimization goals
293
+
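+ The same endpoints are also reachable over Gradio's regular HTTP API, which is handy for smoke-testing (a sketch; endpoint names can be listed with `view_api()`):
+
+ ```python
+ # Hedged sketch: probe the deployed Space with gradio_client
+ from gradio_client import Client
+
+ client = Client("kshitijthakkar/TraceMind-mcp-server")
+ client.view_api()  # prints available endpoints and their parameters
+ ```
+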
294
+ See full API documentation in the Gradio interface under "📖 API Documentation" tab.
295
+
296
+ ## Architecture
297
+
298
+ ```
299
+ TraceMind-mcp-server/
300
+ ├── app.py # Gradio UI + MCP server (mcp_server=True)
301
+ ├── gemini_client.py # Google Gemini 2.5 Pro integration
302
+ ├── mcp_tools.py # 5 tools + 3 resources + 3 prompts (11 MCP components)
303
+ ├── requirements.txt # Python dependencies
304
+ ├── .env.example # Environment variable template
305
+ ├── .gitignore
306
+ └── README.md
307
+ ```
308
+
309
+ **Key Technologies**:
310
+ - **Gradio 6 with MCP support**: `gradio[mcp]` provides native MCP server capabilities
311
+ - **Google Gemini 2.5 Pro**: Latest AI model for intelligent analysis
312
+ - **HuggingFace Datasets**: Data source for evaluations
313
+ - **SSE Transport**: Server-Sent Events for real-time MCP communication
314
+
315
+ ## Deploy to HuggingFace Spaces
316
+
317
+ ### 1. Create Space
318
+
319
+ Go to https://huggingface.co/new-space
320
+
321
+ - **Space name**: `TraceMind-mcp-server`
322
+ - **License**: AGPL-3.0 (to match this repository's LICENSE file)
323
+ - **SDK**: Gradio
324
+ - **Hardware**: CPU Basic (free tier works fine)
325
+
326
+ ### 2. Add Files
327
+
328
+ Upload all files from this repository to your Space:
329
+ - `app.py`
330
+ - `gemini_client.py`
331
+ - `mcp_tools.py`
332
+ - `requirements.txt`
333
+ - `README.md`
334
+
335
+ ### 3. Add Secrets
336
+
337
+ In Space settings → Variables and secrets, add:
338
+ - `GEMINI_API_KEY`: Your Gemini API key
339
+ - `HF_TOKEN`: Your HuggingFace token
340
+
341
+ ### 4. Add Hackathon Tag
342
+
343
+ In Space settings → Tags, add:
344
+ - `building-mcp-track-enterprise`
345
+
346
+ ### 5. Access Your MCP Server
347
+
348
+ Your MCP server will be publicly available at:
349
+ ```
350
+ https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server
351
+ ```
352
+
353
+ ## Testing
354
+
355
+ ### Test 1: Analyze Leaderboard (Live Data)
356
+
357
+ ```bash
358
+ # In Gradio UI - Tab "📊 Analyze Leaderboard":
359
+ Repository: kshitijthakkar/smoltrace-leaderboard
360
+ Metric: overall
361
+ Time Range: last_week
362
+ Top N: 5
363
+ Click "🔍 Analyze"
364
+ ```
365
+
366
+ **Expected**: AI-generated analysis of the top-performing models from the live HuggingFace dataset
367
+
368
+ ### Test 2: Estimate Cost
369
+
370
+ ```bash
371
+ # In Gradio UI - Tab "💰 Estimate Cost":
372
+ Model: openai/gpt-4
373
+ Agent Type: both
374
+ Number of Tests: 100
375
+ Hardware: auto
376
+ Click "💰 Estimate"
377
+ ```
378
+
379
+ **Expected**: Cost breakdown with LLM costs, HF Jobs costs, duration, and CO2 estimate
380
+
381
+ ### Test 3: Debug Trace
382
+
383
+ Note: This requires actual trace data from an evaluation run. Without real trace data, the tool returns an error about missing data, which is the expected behavior.
384
+
385
+ ## Hackathon Submission
386
+
387
+ ### Track 1: Building MCP (Enterprise)
388
+
389
+ **Tag**: `building-mcp-track-enterprise`
390
+
391
+ **Why Enterprise Track?**
392
+ - Solves real business problems (cost optimization, debugging, decision support)
393
+ - Production-ready tools with clear ROI
394
+ - Integrates with enterprise data infrastructure (HuggingFace datasets)
395
+
396
+ **Technology Stack**
397
+ - **AI Analysis**: Google Gemini 2.5 Pro for all intelligent insights
398
+ - **MCP Framework**: Gradio 6 with native MCP support
399
+ - **Data Source**: HuggingFace Datasets
400
+ - **Transport**: SSE (Server-Sent Events)
401
+
402
+ ## Related Project: TraceMind UI (Track 2)
403
+
404
+ This MCP server is designed to be consumed by **TraceMind UI** (separate submission for Track 2: MCP in Action).
405
+
406
+ TraceMind UI is a Gradio-based agent evaluation platform that uses these MCP tools to provide:
407
+ - AI-powered leaderboard insights
408
+ - Interactive trace debugging
409
+ - Pre-evaluation cost estimates
410
+
411
+ ## File Descriptions
412
+
413
+ ### app.py
414
+ Main Gradio application with:
415
+ - Testing UI for all 5 tools
416
+ - MCP server enabled via `mcp_server=True`
417
+ - API documentation
418
+
419
+ ### gemini_client.py
420
+ Google Gemini 2.5 Pro client that:
421
+ - Handles API authentication
422
+ - Provides specialized analysis methods for different data types
423
+ - Formats prompts for optimal results
424
+ - Uses `gemini-2.5-flash` by default (can be switched to other Gemini models, e.g. `gemini-2.5-pro`)
425
+
426
+ ### mcp_tools.py
427
+ Complete MCP implementation with 11 components:
428
+
429
+ **Tools** (5 async functions):
430
+ - `analyze_leaderboard()`: AI-powered leaderboard analysis
431
+ - `debug_trace()`: AI-powered trace debugging
432
+ - `estimate_cost()`: AI-powered cost estimation
433
+ - `compare_runs()`: AI-powered run comparison
434
+ - `get_dataset()`: Load SMOLTRACE datasets as JSON
435
+
436
+ **Resources** (3 decorated functions with `@gr.mcp.resource()`):
437
+ - `get_leaderboard_data()`: Raw leaderboard JSON data
438
+ - `get_trace_data()`: Raw trace JSON data with spans
439
+ - `get_cost_data()`: Model pricing and hardware cost JSON
440
+
441
+ **Prompts** (3 decorated functions with `@gr.mcp.prompt()`):
442
+ - `analysis_prompt()`: Templates for different analysis types
443
+ - `debug_prompt()`: Templates for debugging scenarios
444
+ - `optimization_prompt()`: Templates for optimization goals
445
+
446
+ Each function includes:
447
+ - Appropriate decorator (`@gr.mcp.tool()`, `@gr.mcp.resource()`, or `@gr.mcp.prompt()`)
448
+ - Detailed docstring with "Args:" section
449
+ - Type hints for all parameters and return values
450
+ - Descriptive function name (becomes the MCP component name)
451
+
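+ A hedged sketch of the resource/prompt flavor of this convention (decorator arguments mirror how this repo uses them; exact signatures may differ across Gradio versions, and the function bodies are placeholders):
+
+ ```python
+ import json
+ import gradio as gr
+
+ @gr.mcp.resource("cost://model/{model_name}")
+ def get_model_cost(model_name: str) -> str:
+     """Return pricing info for a model as a JSON string.
+
+     Args:
+         model_name (str): Model identifier, e.g. "openai/gpt-4"
+     """
+     return json.dumps({"model": model_name, "input_per_1m_usd": 30.0})  # placeholder values
+
+ @gr.mcp.prompt()
+ def example_analysis_prompt(analysis_type: str = "leaderboard") -> str:
+     """Return a reusable analysis prompt template.
+
+     Args:
+         analysis_type (str): One of "leaderboard", "trace", "cost"
+     """
+     return f"Provide a detailed {analysis_type} analysis."
+ ```
+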
452
+ ## Environment Variables
453
+
454
+ Required environment variables:
455
+
456
+ ```bash
457
+ GEMINI_API_KEY=your_gemini_api_key_here
458
+ HF_TOKEN=your_huggingface_token_here
459
+ ```
460
+
461
+ ## Development
462
+
463
+ ### Running Tests
464
+
465
+ ```bash
466
+ # Test Gemini client
467
+ python -c "from gemini_client import GeminiClient; client = GeminiClient(); print('✅ Gemini client initialized')"
468
+
469
+ # Test with live leaderboard data
470
+ python app.py
471
+ # Open browser, test "Analyze Leaderboard" tab
472
+ ```
473
+
474
+ ### Adding New Tools
475
+
476
+ To add a new MCP tool (with Gradio's native MCP support):
477
+
478
+ 1. **Add function to `mcp_tools.py`** with proper docstring:
479
+ ```python
480
+ async def your_new_tool(
481
+ gemini_client: GeminiClient,
482
+ param1: str,
483
+ param2: int = 10
484
+ ) -> str:
485
+ """
486
+ Brief description of what the tool does.
487
+
488
+ Longer description explaining the tool's purpose and behavior.
489
+
490
+ Args:
491
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
492
+ param1 (str): Description of param1 with examples if helpful
493
+ param2 (int): Description of param2. Default: 10
494
+
495
+ Returns:
496
+ str: Description of what the function returns
497
+ """
498
+ # Your implementation
499
+ return result
500
+ ```
501
+
502
+ 2. **Add UI tab in `app.py`** (optional, for testing):
503
+ ```python
504
+ with gr.Tab("Your Tool"):
505
+ # Add UI components
506
+ # Wire up to your_new_tool()
507
+ ```
508
+
509
+ 3. That's it! Gradio automatically exposes it as an MCP tool based on:
510
+ - Function name (becomes tool name)
511
+ - Docstring (becomes tool description)
512
+ - Args section (becomes parameter descriptions)
513
+ - Type hints (become parameter types)
514
+
515
+ ### Switching Gemini Models
+
+ The client defaults to `gemini-2.5-flash`. For deeper (but slower and costlier) analysis, switch to Gemini 2.5 Pro:
+
+ ```python
+ # In app.py, change:
+ gemini_client = GeminiClient(model_name="gemini-2.5-pro")
+ ```
523
+
524
+ ## 🙏 Credits & Acknowledgments
525
+
526
+ ### Hackathon Sponsors
527
+
528
+ Special thanks to the sponsors of **MCP's 1st Birthday Hackathon** (November 14-30, 2025):
529
+
530
+ - **🤗 HuggingFace** - Hosting platform and dataset infrastructure
531
+ - **🧠 Google Gemini** - AI analysis powered by Gemini 2.5 Pro API
532
+ - **⚡ Modal** - Serverless infrastructure partner
533
+ - **🏢 Anthropic** - MCP protocol creators
534
+ - **🎨 Gradio** - Native MCP framework support
535
+ - **🎙️ ElevenLabs** - Audio AI capabilities
536
+ - **🦙 SambaNova** - High-performance AI infrastructure
537
+ - **🎯 Blaxel** - Additional compute credits
538
+
539
+ ### Related Open Source Projects
540
+
541
+ This MCP server builds upon our open source agent evaluation ecosystem:
542
+
543
+ #### 📊 SMOLTRACE - Agent Evaluation Engine
544
+ - **Description**: Lightweight, production-ready evaluation framework for AI agents with OpenTelemetry instrumentation
545
+ - **GitHub**: [https://github.com/Mandark-droid/SMOLTRACE](https://github.com/Mandark-droid/SMOLTRACE)
546
+ - **PyPI**: [https://pypi.org/project/smoltrace/](https://pypi.org/project/smoltrace/)
547
+ - **Social**: [@smoltrace on X](https://twitter.com/smoltrace)
548
+
549
+ #### 🔭 TraceVerde - GenAI OpenTelemetry Instrumentation
550
+ - **Description**: Automatic OpenTelemetry instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
551
+ - **GitHub**: [https://github.com/Mandark-droid/genai_otel_instrument](https://github.com/Mandark-droid/genai_otel_instrument)
552
+ - **PyPI**: [https://pypi.org/project/genai-otel-instrument](https://pypi.org/project/genai-otel-instrument)
553
+ - **Social**: [@genai_otel on X](https://twitter.com/genai_otel)
554
+
555
+ ### Built By
556
+
557
+ **Track**: Building MCP (Enterprise)
558
+ **Author**: Kshitij Thakkar
559
+ **Powered by**: Google Gemini 2.5 Pro
560
+ **Built with**: Gradio 6 (native MCP support)
561
+
562
+ ---
563
+
564
+ ## 📄 License
565
+
566
+ AGPL-3.0 License
567
+
568
+ This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
569
+
570
+ ---
571
+
572
+ ## 💬 Support
573
+
574
+ For issues or questions:
575
+ - 📧 Open an issue on GitHub
576
+ - 💬 Join the [HuggingFace Discord](https://discord.gg/huggingface) - Channel: `#agents-mcp-hackathon-winter25`
577
+ - 🏷️ Tag `building-mcp-track-enterprise` for hackathon-related questions
578
+ - 🐦 Follow us on X: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)
579
+
580
+ ## Changelog
581
+
582
+ ### v1.0.0 (2025-11-14)
583
+ - Initial release for MCP Hackathon
584
+ - **Complete MCP Implementation**: 11 components total
585
+ - 5 AI-powered tools (analyze_leaderboard, debug_trace, estimate_cost, compare_runs, get_dataset)
586
+ - 3 data resources (leaderboard, trace, cost data)
587
+ - 3 prompt templates (analysis, debug, optimization)
588
+ - Gradio native MCP support with decorators (`@gr.mcp.*`)
589
+ - Google Gemini 2.5 Pro integration for all AI analysis
590
+ - Live HuggingFace dataset integration
591
+ - SSE transport for MCP communication
592
+ - Production-ready for HuggingFace Spaces deployment
app.py ADDED
@@ -0,0 +1,1006 @@
1
+ """
2
+ TraceMind MCP Server - Gradio Interface with MCP Support
3
+
4
+ This server provides AI-powered analysis tools for agent evaluation data:
5
+ 1. analyze_leaderboard: Summarize trends and insights from leaderboard
6
+ 2. debug_trace: Debug specific agent execution traces
7
+ 3. estimate_cost: Predict evaluation costs before running
8
+ 4. compare_runs: Compare two evaluation runs with AI-powered analysis
9
+ 5. get_dataset: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
10
+ """
11
+
12
+ import os
13
+ import gradio as gr
14
+ from typing import Optional, Dict, Any
15
+ from datetime import datetime
16
+
17
+ # Local imports
18
+ from gemini_client import GeminiClient
19
+ from mcp_tools import (
20
+ analyze_leaderboard,
21
+ debug_trace,
22
+ estimate_cost,
23
+ compare_runs,
24
+ get_dataset
25
+ )
26
+
27
+ # Initialize default Gemini client (fallback if user doesn't provide key)
28
+ try:
29
+ default_gemini_client = GeminiClient()
30
+ except ValueError:
31
+ default_gemini_client = None # Will prompt user to enter API key
32
+
33
+ # Gradio Interface for Testing
34
+ def create_gradio_ui():
35
+ """Create Gradio UI for testing MCP tools"""
36
+
37
+ with gr.Blocks(title="TraceMind MCP Server", theme=gr.themes.Soft()) as demo:
38
+ gr.Markdown("""
39
+ # 🤖 TraceMind MCP Server
40
+
41
+ **AI-Powered Analysis for Agent Evaluation Data**
42
+
43
+ This server provides **5 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
44
+
45
+ ### MCP Tools (AI-Powered)
46
+ - 📊 **Analyze Leaderboard**: Get insights from evaluation results
47
+ - 🐛 **Debug Trace**: Understand what happened in a specific test
48
+ - 💰 **Estimate Cost**: Predict evaluation costs before running
49
+ - ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
50
+ - 📦 **Get Dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
51
+
52
+ ### MCP Resources (Data Access)
53
+ - 📊 **leaderboard://{repo}**: Raw leaderboard data
54
+ - 🔍 **trace://{trace_id}/{repo}**: Raw trace data
55
+ - 💰 **cost://model/{model_name}**: Model pricing data
56
+
57
+ ### MCP Prompts (Templates)
58
+ - 📝 **analysis_prompt**: Templates for analysis requests
59
+ - 🐛 **debug_prompt**: Templates for debugging traces
60
+ - ⚡ **optimization_prompt**: Templates for optimization recommendations
61
+
62
+ All powered by **Google Gemini 2.5 Pro**.
63
+
64
+ ## For Track 2 Integration
65
+
66
+ **HuggingFace Space URL**: `https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server`
67
+
68
+ **MCP Endpoint**: `https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse`
69
+
70
+ **Schema**: `https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/schema`
71
+ """)
72
+
73
+ # Session state for API keys
74
+ gemini_key_state = gr.State(value=os.getenv("GEMINI_API_KEY", ""))
75
+ hf_token_state = gr.State(value=os.getenv("HF_TOKEN", ""))
76
+
77
+ with gr.Tabs():
78
+ # Tab 0: Settings (API Keys)
79
+ with gr.Tab("⚙️ Settings"):
80
+ gr.Markdown("""
81
+ ## 🔑 API Key Configuration
82
+
83
+ Configure your API keys here. These will override environment variables for this session only.
84
+
85
+ **Why configure here?**
86
+ - No need to set environment variables
87
+ - Test with different API keys easily
88
+ - Secure session-only storage (not persisted)
89
+
90
+ **Security Note**: API keys are stored in session state only and are not saved permanently.
91
+ """)
92
+
93
+ with gr.Row():
94
+ with gr.Column():
95
+ gr.Markdown("### Google Gemini API Key")
96
+ gemini_key_input = gr.Textbox(
97
+ label="Gemini API Key",
98
+ placeholder="Enter your Google Gemini API key",
99
+ type="password",
100
+ value=os.getenv("GEMINI_API_KEY", ""),
101
+ info="Get your key from: https://aistudio.google.com/app/apikey"
102
+ )
103
+ gemini_status = gr.Markdown("Status: Using environment variable" if os.getenv("GEMINI_API_KEY") else "⚠️ Status: No API key configured")
104
+
105
+ with gr.Column():
106
+ gr.Markdown("### HuggingFace Token")
107
+ hf_token_input = gr.Textbox(
108
+ label="HuggingFace Token",
109
+ placeholder="Enter your HuggingFace token",
110
+ type="password",
111
+ value=os.getenv("HF_TOKEN", ""),
112
+ info="Get your token from: https://huggingface.co/settings/tokens"
113
+ )
114
+ hf_status = gr.Markdown("Status: Using environment variable" if os.getenv("HF_TOKEN") else "⚠️ Status: No token configured")
115
+
116
+ with gr.Row():
117
+ save_keys_button = gr.Button("💾 Save API Keys for This Session", variant="primary", size="lg")
118
+ clear_keys_button = gr.Button("🗑️ Clear Session Keys", variant="secondary")
119
+
120
+ keys_save_status = gr.Markdown("")
121
+
122
+ def save_api_keys(gemini_key, hf_token):
123
+ """
124
+ Save API keys to session state.
125
+
126
+ Args:
127
+ gemini_key (str): Google Gemini API key
128
+ hf_token (str): HuggingFace token
129
+
130
+ Returns:
131
+ tuple: Updated state values and status message
132
+ """
133
+ status_messages = []
134
+
135
+ # Validate and save Gemini key
136
+ if gemini_key and gemini_key.strip():
137
+ try:
138
+ # Test the key by creating a client
139
+ test_client = GeminiClient(api_key=gemini_key.strip())
140
+ gemini_saved = gemini_key.strip()
141
+ status_messages.append("✅ Gemini API key validated and saved")
142
+ except Exception as e:
143
+ gemini_saved = os.getenv("GEMINI_API_KEY", "")
144
+ status_messages.append(f"❌ Gemini API key invalid: {str(e)}")
145
+ else:
146
+ gemini_saved = os.getenv("GEMINI_API_KEY", "")
147
+ status_messages.append("ℹ️ Gemini API key cleared (using environment variable if set)")
148
+
149
+ # Validate and save HF token
150
+ if hf_token and hf_token.strip():
151
+ hf_saved = hf_token.strip()
152
+ status_messages.append("✅ HuggingFace token saved")
153
+ else:
154
+ hf_saved = os.getenv("HF_TOKEN", "")
155
+ status_messages.append("ℹ️ HuggingFace token cleared (using environment variable if set)")
156
+
157
+ status_markdown = "\n\n".join(status_messages)
158
+
159
+ return gemini_saved, hf_saved, f"### Save Status\n\n{status_markdown}"
160
+
161
+ def clear_api_keys():
162
+ """
163
+ Clear session API keys and revert to environment variables.
164
+
165
+ Returns:
166
+ tuple: Cleared state values and status message
167
+ """
168
+ env_gemini = os.getenv("GEMINI_API_KEY", "")
169
+ env_hf = os.getenv("HF_TOKEN", "")
170
+
171
+ status = "### Keys Cleared\n\nReverted to environment variables.\n\n"
172
+ if env_gemini:
173
+ status += "✅ Using GEMINI_API_KEY from environment\n\n"
174
+ else:
175
+ status += "⚠️ No GEMINI_API_KEY in environment\n\n"
176
+
177
+ if env_hf:
178
+ status += "✅ Using HF_TOKEN from environment"
179
+ else:
180
+ status += "⚠️ No HF_TOKEN in environment"
181
+
182
+ return env_gemini, env_hf, status
183
+
184
+ save_keys_button.click(
185
+ fn=save_api_keys,
186
+ inputs=[gemini_key_input, hf_token_input],
187
+ outputs=[gemini_key_state, hf_token_state, keys_save_status]
188
+ )
189
+
190
+ clear_keys_button.click(
191
+ fn=clear_api_keys,
192
+ inputs=[],
193
+ outputs=[gemini_key_state, hf_token_state, keys_save_status]
194
+ )
195
+
196
+ gr.Markdown("""
197
+ ---
198
+
199
+ ### How It Works
200
+
201
+ 1. **Enter your API keys** in the fields above
202
+ 2. **Click "Save API Keys"** to validate and store them for this session
203
+ 3. **Use any tool** - they will automatically use your configured keys
204
+ 4. **Keys are session-only** - they won't be saved when you close the browser
205
+
206
+ ### Environment Variables (Alternative)
207
+
208
+ You can also set these as environment variables:
209
+ ```bash
210
+ export GEMINI_API_KEY="your-key-here"
211
+ export HF_TOKEN="your-token-here"
212
+ ```
213
+
214
+ UI-configured keys will always override environment variables.
215
+ """)
216
+
217
+ # Tab 1: Analyze Leaderboard
218
+ with gr.Tab("📊 Analyze Leaderboard"):
219
+ gr.Markdown("### Get AI-powered insights from evaluation leaderboard")
220
+
221
+ with gr.Row():
222
+ with gr.Column():
223
+ lb_repo = gr.Textbox(
224
+ label="Leaderboard Repository",
225
+ value="kshitijthakkar/smoltrace-leaderboard",
226
+ placeholder="username/dataset-name"
227
+ )
228
+ lb_metric = gr.Dropdown(
229
+ label="Metric Focus",
230
+ choices=["overall", "accuracy", "cost", "latency", "co2"],
231
+ value="overall"
232
+ )
233
+ lb_time = gr.Dropdown(
234
+ label="Time Range",
235
+ choices=["last_week", "last_month", "all_time"],
236
+ value="last_week"
237
+ )
238
+ lb_top_n = gr.Slider(
239
+ label="Top N Models",
240
+ minimum=3,
241
+ maximum=10,
242
+ value=5,
243
+ step=1
244
+ )
245
+ lb_button = gr.Button("🔍 Analyze", variant="primary")
246
+
247
+ with gr.Column():
248
+ lb_output = gr.Markdown(label="Analysis Results")
249
+
250
+ async def run_analyze_leaderboard(repo, metric, time_range, top_n, gemini_key, hf_token):
251
+ """
252
+ Analyze agent evaluation leaderboard and generate AI-powered insights.
253
+
254
+ This tool loads agent evaluation data from HuggingFace datasets and uses
255
+ Google Gemini 2.5 Pro to provide intelligent analysis of top performers,
256
+ trends, cost/performance trade-offs, and actionable recommendations.
257
+
258
+ Args:
259
+ repo (str): HuggingFace dataset repository containing leaderboard data
260
+ metric (str): Primary metric to focus analysis on - "overall", "accuracy", "cost", "latency", or "co2"
261
+ time_range (str): Time range for analysis - "last_week", "last_month", or "all_time"
262
+ top_n (int): Number of top models to highlight in analysis (3-10)
263
+ gemini_key (str): Gemini API key from session state
264
+ hf_token (str): HuggingFace token from session state
265
+
266
+ Returns:
267
+ str: Markdown-formatted analysis with top performers, trends, and recommendations
268
+ """
269
+ try:
270
+ # Create GeminiClient with user-provided key or fallback to default
271
+ if gemini_key and gemini_key.strip():
272
+ client = GeminiClient(api_key=gemini_key)
273
+ elif default_gemini_client:
274
+ client = default_gemini_client
275
+ else:
276
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
277
+
278
+ result = await analyze_leaderboard(
279
+ gemini_client=client,
280
+ leaderboard_repo=repo,
281
+ metric_focus=metric,
282
+ time_range=time_range,
283
+ top_n=int(top_n),
284
+ hf_token=hf_token if hf_token and hf_token.strip() else None
285
+ )
286
+ return result
287
+ except Exception as e:
288
+ return f"❌ **Error**: {str(e)}"
289
+
290
+ lb_button.click(
291
+ fn=run_analyze_leaderboard,
292
+ inputs=[lb_repo, lb_metric, lb_time, lb_top_n, gemini_key_state, hf_token_state],
293
+ outputs=[lb_output]
294
+ )
295
+
296
+ # Tab 2: Debug Trace
297
+ with gr.Tab("🐛 Debug Trace"):
298
+ gr.Markdown("### Ask questions about specific agent execution traces")
299
+
300
+ with gr.Row():
301
+ with gr.Column():
302
+ trace_id = gr.Textbox(
303
+ label="Trace ID",
304
+ placeholder="trace_abc123",
305
+ info="Get this from the Run Detail screen"
306
+ )
307
+ traces_repo = gr.Textbox(
308
+ label="Traces Repository",
309
+ placeholder="username/agent-traces-model-timestamp",
310
+ info="Dataset containing trace data"
311
+ )
312
+ question = gr.Textbox(
313
+ label="Your Question",
314
+ placeholder="Why was tool X called twice?",
315
+ lines=3
316
+ )
317
+ trace_button = gr.Button("🔍 Analyze", variant="primary")
318
+
319
+ with gr.Column():
320
+ trace_output = gr.Markdown(label="Debug Analysis")
321
+
322
+ async def run_debug_trace(trace_id_val, traces_repo_val, question_val, gemini_key, hf_token):
323
+ """
324
+ Debug a specific agent execution trace using OpenTelemetry data.
325
+
326
+ This tool analyzes OpenTelemetry trace data from agent executions and uses
327
+ Google Gemini 2.5 Pro to answer specific questions about the execution flow,
328
+ identify bottlenecks, explain agent behavior, and provide debugging insights.
329
+
330
+ Args:
331
+ trace_id_val (str): Unique identifier for the trace to analyze (e.g., "trace_abc123")
332
+ traces_repo_val (str): HuggingFace dataset repository containing trace data
333
+ question_val (str): Specific question about the trace (optional, defaults to general analysis)
334
+ gemini_key (str): Gemini API key from session state
335
+ hf_token (str): HuggingFace token from session state
336
+
337
+ Returns:
338
+ str: Markdown-formatted debug analysis with step-by-step breakdown and answers
339
+ """
340
+ try:
341
+ if not trace_id_val or not traces_repo_val:
342
+ return "❌ **Error**: Please provide both Trace ID and Traces Repository"
343
+
344
+ # Create GeminiClient with user-provided key or fallback to default
345
+ if gemini_key and gemini_key.strip():
346
+ client = GeminiClient(api_key=gemini_key)
347
+ elif default_gemini_client:
348
+ client = default_gemini_client
349
+ else:
350
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
351
+
352
+ result = await debug_trace(
353
+ gemini_client=client,
354
+ trace_id=trace_id_val,
355
+ traces_repo=traces_repo_val,
356
+ question=question_val or "Analyze this trace",
357
+ hf_token=hf_token if hf_token and hf_token.strip() else None
358
+ )
359
+ return result
360
+ except Exception as e:
361
+ return f"❌ **Error**: {str(e)}"
362
+
363
+ trace_button.click(
364
+ fn=run_debug_trace,
365
+ inputs=[trace_id, traces_repo, question, gemini_key_state, hf_token_state],
366
+ outputs=[trace_output]
367
+ )
368
+
369
+ # Tab 3: Estimate Cost
370
+ with gr.Tab("💰 Estimate Cost"):
371
+ gr.Markdown("### Predict evaluation costs before running")
372
+
373
+ with gr.Row():
374
+ with gr.Column():
375
+ cost_model = gr.Textbox(
376
+ label="Model",
377
+ placeholder="openai/gpt-4 or meta-llama/Llama-3.1-8B",
378
+ info="Use litellm format (provider/model)"
379
+ )
380
+ cost_agent_type = gr.Dropdown(
381
+ label="Agent Type",
382
+ choices=["tool", "code", "both"],
383
+ value="both"
384
+ )
385
+ cost_num_tests = gr.Slider(
386
+ label="Number of Tests",
387
+ minimum=10,
388
+ maximum=1000,
389
+ value=100,
390
+ step=10
391
+ )
392
+ cost_hardware = gr.Dropdown(
393
+ label="Hardware Type",
394
+ choices=["auto", "cpu", "gpu_a10", "gpu_h200"],
395
+ value="auto",
396
+ info="'auto' will choose based on model type"
397
+ )
398
+ cost_button = gr.Button("💰 Estimate", variant="primary")
399
+
400
+ with gr.Column():
401
+ cost_output = gr.Markdown(label="Cost Estimate")
402
+
403
+ async def run_estimate_cost(model, agent_type, num_tests, hardware, gemini_key):
404
+ """
405
+ Estimate the cost, duration, and CO2 emissions of running agent evaluations.
406
+
407
+ This tool predicts costs before running evaluations by calculating LLM API costs,
408
+ HuggingFace Jobs compute costs, and CO2 emissions. Uses Google Gemini 2.5 Pro
409
+ to provide detailed cost breakdown and optimization recommendations.
410
+
411
+ Args:
412
+ model (str): Model identifier in litellm format (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
413
+ agent_type (str): Type of agent capabilities to test - "tool", "code", or "both"
414
+ num_tests (int): Number of test cases to run (10-1000)
415
+ hardware (str): Hardware type for HF Jobs - "auto", "cpu", "gpu_a10", or "gpu_h200"
416
+ gemini_key (str): Gemini API key from session state
417
+
418
+ Returns:
419
+ str: Markdown-formatted cost estimate with LLM costs, HF Jobs costs, duration, CO2, and tips
420
+ """
421
+ try:
422
+ if not model:
423
+ return "❌ **Error**: Please provide a model name"
424
+
425
+ # Create GeminiClient with user-provided key or fallback to default
426
+ if gemini_key and gemini_key.strip():
427
+ client = GeminiClient(api_key=gemini_key)
428
+ elif default_gemini_client:
429
+ client = default_gemini_client
430
+ else:
431
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
432
+
433
+ result = await estimate_cost(
434
+ gemini_client=client,
435
+ model=model,
436
+ agent_type=agent_type,
437
+ num_tests=int(num_tests),
438
+ hardware=hardware
439
+ )
440
+ return result
441
+ except Exception as e:
442
+ return f"❌ **Error**: {str(e)}"
443
+
444
+ cost_button.click(
445
+ fn=run_estimate_cost,
446
+ inputs=[cost_model, cost_agent_type, cost_num_tests, cost_hardware, gemini_key_state],
447
+ outputs=[cost_output]
448
+ )
449
+
450
+ # Tab 4: Compare Runs
451
+ with gr.Tab("⚖️ Compare Runs"):
452
+ gr.Markdown("""
453
+ ## Compare Two Evaluation Runs
454
+
455
+ Compare two evaluation runs with AI-powered analysis across multiple dimensions:
456
+ success rate, cost efficiency, speed, environmental impact, and more.
457
+ """)
458
+
459
+ with gr.Row():
460
+ with gr.Column():
461
+ compare_run_id_1 = gr.Textbox(
462
+ label="First Run ID",
463
+ placeholder="e.g., run_abc123",
464
+ info="Enter the run_id from the leaderboard"
465
+ )
466
+ with gr.Column():
467
+ compare_run_id_2 = gr.Textbox(
468
+ label="Second Run ID",
469
+ placeholder="e.g., run_xyz789",
470
+ info="Enter the run_id to compare against"
471
+ )
472
+
473
+ with gr.Row():
474
+ compare_focus = gr.Dropdown(
475
+ choices=["comprehensive", "cost", "performance", "eco_friendly"],
476
+ value="comprehensive",
477
+ label="Comparison Focus",
478
+ info="Choose what aspect to focus the comparison on"
479
+ )
480
+ compare_repo = gr.Textbox(
481
+ label="Leaderboard Repository",
482
+ value="kshitijthakkar/smoltrace-leaderboard",
483
+ info="HuggingFace dataset containing leaderboard data"
484
+ )
485
+
486
+ compare_button = gr.Button("🔍 Compare Runs", variant="primary")
487
+ compare_output = gr.Markdown()
488
+
489
+ async def run_compare_runs(run_id_1, run_id_2, focus, repo, gemini_key, hf_token):
490
+ """
491
+ Compare two evaluation runs and generate AI-powered comparative analysis.
492
+
493
+ This tool fetches data for two evaluation runs from the leaderboard and uses
494
+ Google Gemini 2.5 Pro to provide intelligent comparison across multiple dimensions:
495
+ success rate, cost efficiency, speed, environmental impact, and use case recommendations.
496
+
497
+ Args:
498
+ run_id_1 (str): First run ID from the leaderboard to compare
499
+ run_id_2 (str): Second run ID from the leaderboard to compare against
500
+ focus (str): Focus area - "comprehensive", "cost", "performance", or "eco_friendly"
501
+ repo (str): HuggingFace dataset repository containing leaderboard data
502
+ gemini_key (str): Gemini API key from session state
503
+ hf_token (str): HuggingFace token from session state
504
+
505
+ Returns:
506
+ str: Markdown-formatted comparative analysis with winners, trade-offs, and recommendations
507
+ """
508
+ try:
509
+ # Create GeminiClient with user-provided key or fallback to default
510
+ if gemini_key and gemini_key.strip():
511
+ client = GeminiClient(api_key=gemini_key)
512
+ elif default_gemini_client:
513
+ client = default_gemini_client
514
+ else:
515
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
516
+
517
+ result = await compare_runs(
518
+ gemini_client=client,
519
+ run_id_1=run_id_1,
520
+ run_id_2=run_id_2,
521
+ leaderboard_repo=repo,
522
+ comparison_focus=focus,
523
+ hf_token=hf_token if hf_token and hf_token.strip() else None
524
+ )
525
+ return result
526
+ except Exception as e:
527
+ return f"❌ **Error**: {str(e)}"
528
+
529
+ compare_button.click(
530
+ fn=run_compare_runs,
531
+ inputs=[compare_run_id_1, compare_run_id_2, compare_focus, compare_repo, gemini_key_state, hf_token_state],
532
+ outputs=[compare_output]
533
+ )
534
+
535
+ # Tab 5: Get Dataset
536
+ with gr.Tab("📦 Get Dataset"):
537
+ gr.Markdown("""
538
+ ## Load SMOLTRACE Datasets as JSON
539
+
540
+ This tool loads datasets with the **smoltrace-** prefix and returns the raw data as JSON.
541
+ Use this to access leaderboard data, results datasets, traces datasets, or metrics datasets.
542
+
543
+ **Restriction**: Only datasets with "smoltrace-" in the name are allowed for security.
544
+
545
+ **Tip**: If you don't know which dataset to load, first load the leaderboard to see
546
+ dataset references in the `results_dataset`, `traces_dataset`, `metrics_dataset` fields.
547
+ """)
548
+
549
+ with gr.Row():
550
+ dataset_repo_input = gr.Textbox(
551
+ label="Dataset Repository (must contain 'smoltrace-')",
552
+ placeholder="e.g., kshitijthakkar/smoltrace-leaderboard",
553
+ value="kshitijthakkar/smoltrace-leaderboard",
554
+ info="HuggingFace dataset repository path with smoltrace- prefix"
555
+ )
556
+ dataset_max_rows = gr.Slider(
557
+ minimum=1,
558
+ maximum=200,
559
+ value=50,
560
+ step=1,
561
+ label="Max Rows",
562
+ info="Limit rows to avoid token limits"
563
+ )
564
+
565
+ dataset_button = gr.Button("📥 Load Dataset", variant="primary")
566
+ dataset_output = gr.JSON(label="Dataset JSON Output")
567
+
568
+ async def run_get_dataset(repo, max_rows, hf_token):
569
+ """
570
+ Load SMOLTRACE datasets from HuggingFace and return as JSON.
571
+
572
+ This tool loads datasets with the "smoltrace-" prefix and returns the raw data
573
+ as JSON. Use this to access leaderboard data, results datasets, traces datasets,
574
+ or metrics datasets. Only datasets with "smoltrace-" in the name are allowed.
575
+
576
+ Args:
577
+ repo (str): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
578
+ max_rows (int): Maximum number of rows to return (1-200, default 50)
579
+ hf_token (str): HuggingFace token from session state
580
+
581
+ Returns:
582
+ dict: JSON object with dataset data, metadata, total rows, and column names
583
+ """
584
+ try:
585
+ import json
586
+ result = await get_dataset(
587
+ dataset_repo=repo,
588
+ max_rows=int(max_rows),
589
+ hf_token=hf_token if hf_token and hf_token.strip() else None
590
+ )
591
+ # Parse JSON string back to dict for JSON component
592
+ return json.loads(result)
593
+ except Exception as e:
594
+ return {"error": str(e)}
595
+
596
+ dataset_button.click(
597
+ fn=run_get_dataset,
598
+ inputs=[dataset_repo_input, dataset_max_rows, hf_token_state],
599
+ outputs=[dataset_output]
600
+ )
601
+
602
+ # Tab 6: MCP Resources & Prompts
603
+ with gr.Tab("🔌 MCP Resources & Prompts"):
604
+ gr.Markdown("""
605
+ ## MCP Resources & Prompts
606
+
607
+ Beyond the 5 MCP Tools, this server also exposes **MCP Resources** and **MCP Prompts**
608
+ that MCP clients can use directly.
609
+
610
+ ### MCP Resources (Read-Only Data Access)
611
+
612
+ Resources provide direct access to data without AI processing:
613
+
614
+ #### 1. `leaderboard://{repo}`
615
+ Get raw leaderboard data in JSON format.
616
+
617
+ **Example**: `leaderboard://kshitijthakkar/smoltrace-leaderboard`
618
+
619
+ **Returns**: JSON with all evaluation runs
620
+
621
+ #### 2. `trace://{trace_id}/{repo}`
622
+ Get raw trace data for a specific trace.
623
+
624
+ **Example**: `trace://trace_abc123/kshitijthakkar/smoltrace-traces-gpt4`
625
+
626
+ **Returns**: JSON with OpenTelemetry spans
627
+
628
+ #### 3. `cost://model/{model_name}`
629
+ Get cost information for a specific model.
630
+
631
+ **Example**: `cost://model/openai/gpt-4`
632
+
633
+ **Returns**: JSON with pricing data
634
+
635
+ ---
636
+
637
+ ### MCP Prompts (Reusable Templates)
638
+
639
+ Prompts provide standardized templates for common workflows:
640
+
641
+ #### 1. `analysis_prompt(analysis_type, focus_area, detail_level)`
642
+ Generate analysis prompt templates.
643
+
644
+ **Parameters**:
645
+ - `analysis_type`: "leaderboard", "trace", "cost"
646
+ - `focus_area`: "overall", "performance", "cost", "efficiency"
647
+ - `detail_level`: "summary", "detailed", "comprehensive"
648
+
649
+ #### 2. `debug_prompt(debug_type, context)`
650
+ Generate debugging prompt templates.
651
+
652
+ **Parameters**:
653
+ - `debug_type`: "error", "performance", "behavior", "optimization"
654
+ - `context`: "agent_execution", "tool_calling", "llm_reasoning"
655
+
656
+ #### 3. `optimization_prompt(optimization_goal, constraints)`
657
+ Generate optimization prompt templates.
658
+
659
+ **Parameters**:
660
+ - `optimization_goal`: "cost", "speed", "quality", "efficiency"
661
+ - `constraints`: "maintain_quality", "maintain_speed", "no_constraints"
662
+
663
+ ---
664
+
665
+ ### Testing MCP Resources
666
+
667
+ Test resources directly from this UI:
668
+ """)
669
+
670
+ with gr.Row():
671
+ with gr.Column():
672
+ gr.Markdown("#### Test Leaderboard Resource")
673
+ resource_lb_repo = gr.Textbox(
674
+ label="Repository",
675
+ value="kshitijthakkar/smoltrace-leaderboard"
676
+ )
677
+ resource_lb_button = gr.Button("Fetch Leaderboard Data", variant="primary")
678
+ resource_lb_output = gr.JSON(label="Resource Output")
679
+
680
+ def test_leaderboard_resource(repo):
681
+ """
682
+ Test the leaderboard MCP resource by fetching raw leaderboard data.
683
+
684
+ Args:
685
+ repo (str): HuggingFace dataset repository name
686
+
687
+ Returns:
688
+ dict: JSON object with leaderboard data
689
+ """
690
+ from mcp_tools import get_leaderboard_data
691
+ import json
692
+ result = get_leaderboard_data(repo)
693
+ return json.loads(result)
694
+
695
+ resource_lb_button.click(
696
+ fn=test_leaderboard_resource,
697
+ inputs=[resource_lb_repo],
698
+ outputs=[resource_lb_output]
699
+ )
700
+
701
+ with gr.Column():
702
+ gr.Markdown("#### Test Cost Resource")
703
+ resource_cost_model = gr.Textbox(
704
+ label="Model Name",
705
+ value="openai/gpt-4"
706
+ )
707
+ resource_cost_button = gr.Button("Fetch Cost Data", variant="primary")
708
+ resource_cost_output = gr.JSON(label="Resource Output")
709
+
710
+ def test_cost_resource(model):
711
+ """
712
+ Test the cost MCP resource by fetching model pricing data.
713
+
714
+ Args:
715
+ model (str): Model identifier (e.g., "openai/gpt-4")
716
+
717
+ Returns:
718
+ dict: JSON object with cost and pricing information
719
+ """
720
+ from mcp_tools import get_cost_data
721
+ import json
722
+ result = get_cost_data(model)
723
+ return json.loads(result)
724
+
725
+ resource_cost_button.click(
726
+ fn=test_cost_resource,
727
+ inputs=[resource_cost_model],
728
+ outputs=[resource_cost_output]
729
+ )
730
+
731
+ gr.Markdown("---")
732
+ gr.Markdown("### Testing MCP Prompts")
733
+ gr.Markdown("Generate prompt templates for different scenarios:")
734
+
735
+ with gr.Row():
736
+ with gr.Column():
737
+ prompt_type = gr.Radio(
738
+ label="Prompt Type",
739
+ choices=["analysis_prompt", "debug_prompt", "optimization_prompt"],
740
+ value="analysis_prompt"
741
+ )
742
+
743
+ # Analysis prompt params
744
+ with gr.Group(visible=True) as analysis_group:
745
+ analysis_type = gr.Dropdown(
746
+ label="Analysis Type",
747
+ choices=["leaderboard", "trace", "cost"],
748
+ value="leaderboard"
749
+ )
750
+ focus_area = gr.Dropdown(
751
+ label="Focus Area",
752
+ choices=["overall", "performance", "cost", "efficiency"],
753
+ value="overall"
754
+ )
755
+ detail_level = gr.Dropdown(
756
+ label="Detail Level",
757
+ choices=["summary", "detailed", "comprehensive"],
758
+ value="detailed"
759
+ )
760
+
761
+ # Debug prompt params
762
+ with gr.Group(visible=False) as debug_group:
763
+ debug_type = gr.Dropdown(
764
+ label="Debug Type",
765
+ choices=["error", "performance", "behavior", "optimization"],
766
+ value="error"
767
+ )
768
+ debug_context = gr.Dropdown(
769
+ label="Context",
770
+ choices=["agent_execution", "tool_calling", "llm_reasoning"],
771
+ value="agent_execution"
772
+ )
773
+
774
+ # Optimization prompt params
775
+ with gr.Group(visible=False) as optimization_group:
776
+ optimization_goal = gr.Dropdown(
777
+ label="Optimization Goal",
778
+ choices=["cost", "speed", "quality", "efficiency"],
779
+ value="cost"
780
+ )
781
+ constraints = gr.Dropdown(
782
+ label="Constraints",
783
+ choices=["maintain_quality", "maintain_speed", "no_constraints"],
784
+ value="maintain_quality"
785
+ )
786
+
787
+ prompt_button = gr.Button("Generate Prompt", variant="primary")
788
+
789
+ with gr.Column():
790
+ prompt_output = gr.Textbox(
791
+ label="Generated Prompt Template",
792
+ lines=10,
793
+ max_lines=20
794
+ )
795
+
796
+ def toggle_prompt_groups(prompt_type):
797
+ """
798
+ Toggle visibility of prompt parameter groups based on selected prompt type.
799
+
800
+ Args:
801
+ prompt_type (str): The type of prompt selected
802
+
803
+ Returns:
804
+ dict: Gradio update objects for group visibility
805
+ """
806
+ return {
807
+ analysis_group: gr.update(visible=(prompt_type == "analysis_prompt")),
808
+ debug_group: gr.update(visible=(prompt_type == "debug_prompt")),
809
+ optimization_group: gr.update(visible=(prompt_type == "optimization_prompt"))
810
+ }
811
+
812
+ prompt_type.change(
813
+ fn=toggle_prompt_groups,
814
+ inputs=[prompt_type],
815
+ outputs=[analysis_group, debug_group, optimization_group]
816
+ )
817
+
818
+ def generate_prompt(
819
+ prompt_type,
820
+ analysis_type_val, focus_area_val, detail_level_val,
821
+ debug_type_val, debug_context_val,
822
+ optimization_goal_val, constraints_val
823
+ ):
824
+ """
825
+ Generate a prompt template based on the selected type and parameters.
826
+
827
+ Args:
828
+ prompt_type (str): Type of prompt to generate
829
+ analysis_type_val (str): Analysis type parameter
830
+ focus_area_val (str): Focus area parameter
831
+ detail_level_val (str): Detail level parameter
832
+ debug_type_val (str): Debug type parameter
833
+ debug_context_val (str): Debug context parameter
834
+ optimization_goal_val (str): Optimization goal parameter
835
+ constraints_val (str): Constraints parameter
836
+
837
+ Returns:
838
+ str: Generated prompt template text
839
+ """
840
+ from mcp_tools import analysis_prompt, debug_prompt, optimization_prompt
841
+
842
+ if prompt_type == "analysis_prompt":
843
+ return analysis_prompt(analysis_type_val, focus_area_val, detail_level_val)
844
+ elif prompt_type == "debug_prompt":
845
+ return debug_prompt(debug_type_val, debug_context_val)
846
+ elif prompt_type == "optimization_prompt":
847
+ return optimization_prompt(optimization_goal_val, constraints_val)
848
+
849
+ prompt_button.click(
850
+ fn=generate_prompt,
851
+ inputs=[
852
+ prompt_type,
853
+ analysis_type, focus_area, detail_level,
854
+ debug_type, debug_context,
855
+ optimization_goal, constraints
856
+ ],
857
+ outputs=[prompt_output]
858
+ )
859
+
860
+ # Tab 7: API Documentation
861
+ with gr.Tab("📖 API Documentation"):
862
+ gr.Markdown("""
863
+ ## MCP Tool Specifications
864
+
865
+ ### 1. analyze_leaderboard
866
+
867
+ **Description**: Generate AI-powered insights from evaluation leaderboard data
868
+
869
+ **Parameters**:
870
+ - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
871
+ - `metric_focus` (str): "overall", "accuracy", "cost", "latency", or "co2" (default: "overall")
872
+ - `time_range` (str): "last_week", "last_month", or "all_time" (default: "last_week")
873
+ - `top_n` (int): Number of top models to highlight (default: 5, min: 3, max: 10)
874
+
875
+ **Returns**: Markdown-formatted analysis with top performers, trends, and recommendations
876
+
877
+ ---
878
+
879
+ ### 2. debug_trace
880
+
881
+ **Description**: Answer questions about specific agent execution traces
882
+
883
+ **Parameters**:
884
+ - `trace_id` (str, required): Unique identifier for the trace
885
+ - `traces_repo` (str, required): HuggingFace dataset repository with trace data
886
+ - `question` (str): Specific question about the trace (default: "Analyze this trace and explain what happened")
887
+
888
+ **Returns**: Markdown-formatted debug analysis with step-by-step breakdown
889
+
890
+ ---
891
+
892
+ ### 3. estimate_cost
893
+
894
+ **Description**: Predict evaluation costs before running
895
+
896
+ **Parameters**:
897
+ - `model` (str, required): Model identifier in litellm format (e.g., "openai/gpt-4")
898
+ - `agent_type` (str, required): "tool", "code", or "both"
899
+ - `num_tests` (int): Number of test cases (default: 100, min: 10, max: 1000)
900
+ - `hardware` (str): "auto", "cpu", "gpu_a10", or "gpu_h200" (default: "auto")
901
+
902
+ **Returns**: Markdown-formatted cost estimate with breakdown and optimization tips
903
+
904
+ ---
905
+
906
+ ### 4. compare_runs
907
+
908
+ **Description**: Compare two evaluation runs with AI-powered analysis
909
+
910
+ **Parameters**:
911
+ - `run_id_1` (str, required): First run ID from the leaderboard
912
+ - `run_id_2` (str, required): Second run ID to compare against
913
+ - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
914
+ - `comparison_focus` (str): "comprehensive", "cost", "performance", or "eco_friendly" (default: "comprehensive")
915
+
916
+ **Returns**: Markdown-formatted comparative analysis with winner for each category, trade-offs, and recommendations
917
+
918
+ **Focus Options**:
919
+ - `comprehensive`: Complete comparison across all dimensions (success rate, cost, speed, CO2, GPU)
920
+ - `cost`: Detailed cost efficiency analysis and ROI
921
+ - `performance`: Speed and accuracy trade-off analysis
922
+ - `eco_friendly`: Environmental impact and carbon footprint comparison
923
+
924
+ ---
925
+
926
+ ### 5. get_dataset
927
+
928
+ **Description**: Load SMOLTRACE datasets from HuggingFace and return as JSON
929
+
930
+ **Parameters**:
931
+ - `dataset_repo` (str, required): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
932
+ - `max_rows` (int): Maximum number of rows to return (default: 50, range: 1-200)
933
+
934
+ **Returns**: JSON object with dataset data and metadata
935
+
936
+ **Restriction**: Only datasets with "smoltrace-" in the repository name are allowed for security.
937
+
938
+ **Use Cases**:
939
+ - Load smoltrace-leaderboard to find run IDs, model names, and supporting dataset references
940
+ - Load smoltrace-results-* datasets to see individual test case details
941
+ - Load smoltrace-traces-* datasets to access OpenTelemetry trace data
942
+ - Load smoltrace-metrics-* datasets to get GPU metrics and performance data
943
+
944
+ **Workflow**:
945
+ 1. Call `get_dataset("kshitijthakkar/smoltrace-leaderboard")` to see all runs
946
+ 2. Find the `results_dataset`, `traces_dataset`, or `metrics_dataset` field for a specific run
947
+ 3. Call `get_dataset(dataset_repo)` with that smoltrace-* dataset name to get detailed data
948
+
949
+ ---
950
+
951
+ ## MCP Integration
952
+
953
+ This Gradio app is MCP-enabled. When deployed to HuggingFace Spaces, it can be accessed via MCP clients.
954
+
955
+ **Space URL**: `https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server`
956
+
957
+ ### What's Exposed via MCP:
958
+
959
+ #### 5 MCP Tools (AI-Powered)
960
+ The five tools above (`analyze_leaderboard`, `debug_trace`, `estimate_cost`, `compare_runs`, `get_dataset`)
961
+ are automatically exposed as MCP tools and can be called from any MCP client.
962
+
963
+ #### 3 MCP Resources (Data Access)
964
+ - `leaderboard://{repo}` - Raw leaderboard data
965
+ - `trace://{trace_id}/{repo}` - Raw trace data
966
+ - `cost://model/{model_name}` - Model pricing data
967
+
968
+ #### 3 MCP Prompts (Templates)
969
+ - `analysis_prompt(analysis_type, focus_area, detail_level)` - Analysis templates
970
+ - `debug_prompt(debug_type, context)` - Debug templates
971
+ - `optimization_prompt(optimization_goal, constraints)` - Optimization templates
972
+
973
+ **See the "🔌 MCP Resources & Prompts" tab to test these features.**
974
+ """)
975
+
976
+ gr.Markdown("""
977
+ ---
978
+
979
+ ## Environment Variables
980
+
981
+ Required:
982
+ - `GEMINI_API_KEY`: Your Google Gemini API key
983
+ - `HF_TOKEN`: Your HuggingFace token (for dataset access)
984
+
985
+ ## Source Code
986
+
987
+ This server is part of the TraceMind project submission for MCP's 1st Birthday Hackathon.
988
+
989
+ **Track 1**: Building MCP (Enterprise)
990
+ **Tag**: `building-mcp-track-enterprise`
991
+ """)
992
+
993
+ return demo
994
+
995
+ if __name__ == "__main__":
996
+ # Create Gradio interface
997
+ demo = create_gradio_ui()
998
+
999
+ # Launch with MCP server enabled
1000
+ # share=True creates a temporary public HTTPS URL for testing with Claude Code
1001
+ demo.launch(
1002
+ server_name="0.0.0.0",
1003
+ server_port=7860,
1004
+ share=True, # Creates temporary HTTPS URL (e.g., https://abc123.gradio.live)
1005
+ mcp_server=True # Enable MCP server functionality
1006
+ )
gemini_client.py ADDED
@@ -0,0 +1,185 @@
1
+ """
2
+ Gemini Client for TraceMind MCP Server
3
+
4
+ Handles all interactions with the Google Gemini API (default model: gemini-2.5-flash)
5
+ """
6
+
7
+ import os
8
+ import google.generativeai as genai
9
+ from typing import Optional, Dict, Any, List
10
+ import json
11
+
12
+ class GeminiClient:
13
+ """Client for Google Gemini API"""
14
+
15
+ def __init__(self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"):
16
+ """
17
+ Initialize Gemini client
18
+
19
+ Args:
20
+ api_key: Gemini API key (defaults to GEMINI_API_KEY env var)
21
+ model_name: Model to use (default: gemini-2.5-flash, can also use gemini-2.5-flash-lite)
22
+ """
23
+ self.api_key = api_key or os.getenv("GEMINI_API_KEY")
24
+ if not self.api_key:
25
+ raise ValueError("Gemini API key not provided and GEMINI_API_KEY environment variable not set")
26
+
27
+ # Configure API
28
+ genai.configure(api_key=self.api_key)
29
+
30
+ # Initialize model
31
+ self.model = genai.GenerativeModel(model_name)
32
+
33
+ # Generation config for consistent outputs
34
+ self.generation_config = {
35
+ "temperature": 0.7,
36
+ "top_p": 0.95,
37
+ "top_k": 40,
38
+ "max_output_tokens": 8192,
39
+ }
40
+
41
+ async def analyze_with_context(
42
+ self,
43
+ data: Dict[str, Any],
44
+ analysis_type: str,
45
+ specific_question: Optional[str] = None
46
+ ) -> str:
47
+ """
48
+ Analyze data with Gemini, providing context about the analysis type
49
+
50
+ Args:
51
+ data: Data to analyze (will be converted to JSON)
52
+ analysis_type: Type of analysis ("leaderboard", "trace", "cost_estimate")
53
+ specific_question: Optional specific question to answer
54
+
55
+ Returns:
56
+ Markdown-formatted analysis
57
+ """
58
+
59
+ # Build prompt based on analysis type
60
+ if analysis_type == "leaderboard":
61
+ system_prompt = """You are an expert AI agent performance analyst.
62
+
63
+ You are analyzing evaluation leaderboard data from agent benchmarks. Your task is to:
64
+ 1. Identify top performers across key metrics (accuracy, cost, latency, CO2)
65
+ 2. Explain trade-offs between different approaches (API vs local models, GPU types)
66
+ 3. Identify trends and patterns
67
+ 4. Provide actionable recommendations
68
+
69
+ Focus on insights that would help developers choose the right agent configuration for their use case.
70
+
71
+ Format your response in clear markdown with sections for:
72
+ - **Top Performers**
73
+ - **Key Insights**
74
+ - **Trade-offs**
75
+ - **Recommendations**
76
+ """
77
+
78
+ elif analysis_type == "trace":
79
+ system_prompt = """You are an expert agent debugging specialist.
80
+
81
+ You are analyzing OpenTelemetry trace data from agent execution. Your task is to:
82
+ 1. Understand the sequence of operations (LLM calls, tool calls, etc.)
83
+ 2. Identify performance bottlenecks or inefficiencies
84
+ 3. Explain why certain decisions were made
85
+ 4. Answer the specific question asked
86
+
87
+ Focus on providing clear explanations that help developers understand agent behavior.
88
+
89
+ Format your response in clear markdown with relevant code snippets and timing information.
90
+ """
91
+
92
+ elif analysis_type == "cost_estimate":
93
+ system_prompt = """You are an expert in LLM cost optimization and cloud resource estimation.
94
+
95
+ You are estimating the cost of running agent evaluations. Your task is to:
96
+ 1. Calculate LLM API costs based on token usage patterns
97
+ 2. Estimate HuggingFace Jobs compute costs
98
+ 3. Predict CO2 emissions
99
+ 4. Provide cost optimization recommendations
100
+
101
+ Focus on giving accurate estimates with clear breakdowns.
102
+
103
+ Format your response in clear markdown with cost breakdowns and optimization tips.
104
+ """
105
+
106
+ else:
107
+ system_prompt = "You are a helpful AI assistant analyzing agent evaluation data."
108
+
109
+ # Build user prompt
110
+ data_json = json.dumps(data, indent=2)
111
+
112
+ user_prompt = f"{system_prompt}\n\n**Data to analyze:**\n```json\n{data_json}\n```\n\n"
113
+
114
+ if specific_question:
115
+ user_prompt += f"**Specific question:** {specific_question}\n\n"
116
+
117
+ user_prompt += "Provide your analysis:"
118
+
119
+ # Generate response
120
+ try:
121
+ response = await self.model.generate_content_async(
122
+ user_prompt,
123
+ generation_config=self.generation_config
124
+ )
125
+
126
+ return response.text
127
+
128
+ except Exception as e:
129
+ return f"Error generating analysis: {str(e)}"
130
+
131
+ async def generate_summary(
132
+ self,
133
+ text: str,
134
+ max_words: int = 100
135
+ ) -> str:
136
+ """
137
+ Generate a concise summary of text
138
+
139
+ Args:
140
+ text: Text to summarize
141
+ max_words: Maximum words in summary
142
+
143
+ Returns:
144
+ Summary text
145
+ """
146
+ prompt = f"Summarize the following in {max_words} words or less:\n\n{text}"
147
+
148
+ try:
149
+ response = await self.model.generate_content_async(prompt)
150
+ return response.text
151
+ except Exception as e:
152
+ return f"Error generating summary: {str(e)}"
153
+
154
+ async def answer_question(
155
+ self,
156
+ context: str,
157
+ question: str
158
+ ) -> str:
159
+ """
160
+ Answer a question given context
161
+
162
+ Args:
163
+ context: Context information
164
+ question: Question to answer
165
+
166
+ Returns:
167
+ Answer
168
+ """
169
+ prompt = f"""Based on the following context, answer the question.
170
+
171
+ **Context:**
172
+ {context}
173
+
174
+ **Question:** {question}
175
+
176
+ **Answer:**"""
177
+
178
+ try:
179
+ response = await self.model.generate_content_async(
180
+ prompt,
181
+ generation_config=self.generation_config
182
+ )
183
+ return response.text
184
+ except Exception as e:
185
+ return f"Error answering question: {str(e)}"
mcp_tools.py ADDED
@@ -0,0 +1,943 @@
1
+ """
2
+ MCP Tool Implementations for TraceMind
3
+
4
+ Implements:
5
+ - 5 MCP Tools: analyze_leaderboard, debug_trace, estimate_cost, compare_runs, get_dataset
6
+ - 3 MCP Resources: leaderboard data, trace data, cost data
7
+ - 3 MCP Prompts: analysis prompts, debug prompts, optimization prompts
8
+
9
+ With Gradio's native MCP support (mcp_server=True), these are automatically
10
+ exposed based on decorators (@gr.mcp.tool, @gr.mcp.resource, @gr.mcp.prompt),
11
+ docstrings, and type hints.
12
+ """
13
+
14
+ import os
15
+ import json
16
+ from typing import Optional
17
+ from datasets import load_dataset
18
+ import pandas as pd
19
+ from datetime import datetime, timedelta
20
+ import gradio as gr
21
+
22
+ from gemini_client import GeminiClient
23
+
24
+
25
+ async def analyze_leaderboard(
26
+ gemini_client: GeminiClient,
27
+ leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard",
28
+ metric_focus: str = "overall",
29
+ time_range: str = "last_week",
30
+ top_n: int = 5,
31
+ hf_token: Optional[str] = None
32
+ ) -> str:
33
+ """
34
+ Analyze evaluation leaderboard and generate AI-powered insights.
35
+
36
+ This tool loads agent evaluation data from HuggingFace datasets and uses
37
+ Google Gemini (gemini-2.5-flash by default) to provide intelligent analysis of top performers,
38
+ trends, cost/performance trade-offs, and actionable recommendations.
39
+
40
+ Args:
41
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
42
+ leaderboard_repo (str): HuggingFace dataset repository containing leaderboard data. Default: "kshitijthakkar/smoltrace-leaderboard"
43
+ metric_focus (str): Primary metric to focus analysis on. Options: "overall", "accuracy", "cost", "latency", "co2". Default: "overall"
44
+ time_range (str): Time range for analysis. Options: "last_week", "last_month", "all_time". Default: "last_week"
45
+ top_n (int): Number of top models to highlight in analysis. Must be between 3 and 10. Default: 5
46
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
47
+
48
+ Returns:
49
+ str: Markdown-formatted analysis with top performers, insights, trade-offs, and recommendations
50
+ """
51
+ try:
52
+ # Load leaderboard data from HuggingFace
53
+ print(f"Loading leaderboard from {leaderboard_repo}...")
54
+
55
+ # Use user-provided token or fall back to environment variable
56
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
57
+ ds = load_dataset(leaderboard_repo, split="train", token=token)
58
+ df = pd.DataFrame(ds)
59
+
60
+ # Filter by time range
61
+ if time_range != "all_time":
62
+ df['timestamp'] = pd.to_datetime(df['timestamp'])
63
+ now = datetime.now()
64
+
65
+ if time_range == "last_week":
66
+ cutoff = now - timedelta(days=7)
67
+ elif time_range == "last_month":
68
+ cutoff = now - timedelta(days=30)
69
+
70
+ df = df[df['timestamp'] >= cutoff]
71
+
72
+ # Sort by metric
73
+ metric_column_map = {
74
+ "overall": "success_rate",
75
+ "accuracy": "success_rate",
76
+ "cost": "total_cost_usd",
77
+ "latency": "avg_duration_ms",
78
+ "co2": "co2_emissions_g"
79
+ }
80
+
81
+ sort_column = metric_column_map.get(metric_focus, "success_rate")
82
+ ascending = metric_focus in ["cost", "latency", "co2"] # Lower is better for these
83
+
84
+ df_sorted = df.sort_values(sort_column, ascending=ascending)
85
+
86
+ # Get top N
87
+ top_models = df_sorted.head(top_n)
88
+
89
+ # Prepare data summary for Gemini
90
+ analysis_data = {
91
+ "total_evaluations": len(df),
92
+ "time_range": time_range,
93
+ "metric_focus": metric_focus,
94
+ "top_models": top_models[[
95
+ "model", "agent_type", "provider",
96
+ "success_rate", "total_cost_usd", "avg_duration_ms",
97
+ "co2_emissions_g", "submitted_by"
98
+ ]].to_dict('records'),
99
+ "summary_stats": {
100
+ "avg_success_rate": float(df['success_rate'].mean()),
101
+ "avg_cost": float(df['total_cost_usd'].mean()),
102
+ "avg_duration_ms": float(df['avg_duration_ms'].mean()),
103
+ "total_co2_g": float(df['co2_emissions_g'].sum()),
104
+ "models_tested": df['model'].nunique(),
105
+ "unique_submitters": df['submitted_by'].nunique()
106
+ }
107
+ }
108
+
109
+ # Get AI analysis from Gemini
110
+ result = await gemini_client.analyze_with_context(
111
+ data=analysis_data,
112
+ analysis_type="leaderboard",
113
+ specific_question=f"Focus on {metric_focus} performance. What are the key insights?"
114
+ )
115
+
116
+ return result
117
+
118
+ except Exception as e:
119
+ return f"❌ **Error analyzing leaderboard**: {str(e)}\n\nPlease check:\n- Repository name is correct\n- You have access to the dataset\n- HF_TOKEN is set correctly"
120
+
121
+
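A quick sketch of the sort-direction logic used above, with made-up rows (lower is better for cost, latency, and CO2):

```python
import pandas as pd

# Made-up leaderboard rows
df = pd.DataFrame([
    {"model": "gpt-4", "success_rate": 0.92, "total_cost_usd": 1.80},
    {"model": "llama-3.1-8b", "success_rate": 0.84, "total_cost_usd": 0.35},
])

metric_focus = "cost"
sort_column = {"overall": "success_rate", "cost": "total_cost_usd"}[metric_focus]
ascending = metric_focus in ["cost", "latency", "co2"]  # lower is better for these
print(df.sort_values(sort_column, ascending=ascending).iloc[0]["model"])  # -> llama-3.1-8b
```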
122
+ async def debug_trace(
123
+ gemini_client: GeminiClient,
124
+ trace_id: str,
125
+ traces_repo: str,
126
+ question: str = "Analyze this trace and explain what happened",
127
+ hf_token: Optional[str] = None
128
+ ) -> str:
129
+ """
130
+ Debug a specific agent execution trace using OpenTelemetry data.
131
+
132
+ This tool analyzes OpenTelemetry trace data from agent executions and uses
133
+ Google Gemini (gemini-2.5-flash by default) to answer specific questions about the execution flow,
134
+ identify bottlenecks, and explain agent behavior.
135
+
136
+ Args:
137
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
138
+ trace_id (str): Unique identifier for the trace to analyze (e.g., "trace_abc123")
139
+ traces_repo (str): HuggingFace dataset repository containing trace data (e.g., "username/agent-traces-model-timestamp")
140
+ question (str): Specific question about the trace. Default: "Analyze this trace and explain what happened"
141
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
142
+
143
+ Returns:
144
+ str: Markdown-formatted debug analysis with step-by-step breakdown, timing information, and answer to the question
145
+ """
146
+ try:
147
+ # Load traces dataset
148
+ print(f"Loading traces from {traces_repo}...")
149
+
150
+ # Use user-provided token or fall back to environment variable
151
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
152
+ ds = load_dataset(traces_repo, split="train", token=token)
153
+ df = pd.DataFrame(ds)
154
+
155
+ # Find the specific trace
156
+ trace_data = df[df['trace_id'] == trace_id]
157
+
158
+ if len(trace_data) == 0:
159
+ return f"❌ **Trace not found**: No trace with ID `{trace_id}` in repository `{traces_repo}`"
160
+
161
+ trace_row = trace_data.iloc[0]
162
+
163
+ # Parse spans (OpenTelemetry format)
164
+ spans = trace_row['spans']
165
+ if isinstance(spans, str):
166
+ spans = json.loads(spans)
168
+
169
+ # Helper function to handle different OTEL timestamp field formats
170
+ def get_timestamp(span, field):
171
+ """Get timestamp handling multiple OTEL formats"""
172
+ # Try different field name variations
173
+ for key in [field, f"{field}UnixNano", f"{field}_unix_nano", "timeUnixNano"]:
174
+ if key in span:
175
+ return span[key]
176
+ return 0
177
+
178
+ # Build trace analysis data
179
+ start_time = get_timestamp(spans[0], 'startTime')
180
+ end_time = get_timestamp(spans[-1], 'endTime')
181
+
182
+ trace_analysis = {
183
+ "trace_id": trace_id,
184
+ "run_id": trace_row.get('run_id', 'unknown'),
185
+ "total_duration_ms": (end_time - start_time) / 1_000_000 if end_time > start_time else 0,
186
+ "num_spans": len(spans),
187
+ "spans": []
188
+ }
189
+
190
+ # Process each span
191
+ for span in spans:
192
+ span_start = get_timestamp(span, 'startTime')
193
+ span_end = get_timestamp(span, 'endTime')
194
+
195
+ span_info = {
196
+ "name": span.get('name', 'Unknown'),
197
+ "kind": span.get('kind', 'INTERNAL'),
198
+ "duration_ms": (span_end - span_start) / 1_000_000 if span_end > span_start else 0,
199
+ "attributes": span.get('attributes', {}),
200
+ "status": span.get('status', {}).get('code', 'UNKNOWN')
201
+ }
202
+ trace_analysis["spans"].append(span_info)
203
+
204
+ # Get AI analysis from Gemini
205
+ result = await gemini_client.analyze_with_context(
206
+ data=trace_analysis,
207
+ analysis_type="trace",
208
+ specific_question=question
209
+ )
210
+
211
+ return result
212
+
213
+ except Exception as e:
214
+ return f"❌ **Error debugging trace**: {str(e)}\n\nPlease check:\n- Trace ID is correct\n- Repository name is correct\n- You have access to the dataset"
215
+
216
+
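A small sketch of how the `get_timestamp` helper above tolerates the different OTEL field spellings (the span values here are invented):

```python
# Invented span: nanosecond epoch timestamps under the *UnixNano spelling
span = {
    "name": "llm.call",
    "startTimeUnixNano": 1_700_000_000_000_000_000,
    "endTimeUnixNano": 1_700_000_000_250_000_000,
}

def get_timestamp(span, field):
    for key in [field, f"{field}UnixNano", f"{field}_unix_nano", "timeUnixNano"]:
        if key in span:
            return span[key]
    return 0

dur_ms = (get_timestamp(span, "endTime") - get_timestamp(span, "startTime")) / 1_000_000
print(dur_ms)  # -> 250.0 (nanoseconds converted to milliseconds)
```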
217
+ async def estimate_cost(
218
+ gemini_client: GeminiClient,
219
+ model: str,
220
+ agent_type: str,
221
+ num_tests: int = 100,
222
+ hardware: str = "auto"
223
+ ) -> str:
224
+ """
225
+ Estimate the cost, duration, and CO2 emissions of running agent evaluations.
226
+
227
+ This tool predicts costs before running evaluations by calculating LLM API costs,
228
+ HuggingFace Jobs compute costs, and CO2 emissions. Uses Google Gemini
229
+ to provide cost breakdown and optimization recommendations.
230
+
231
+ Args:
232
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
233
+ model (str): Model identifier in litellm format (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
234
+ agent_type (str): Type of agent capabilities to test. Options: "tool", "code", "both"
235
+ num_tests (int): Number of test cases to run. Must be between 10 and 1000. Default: 100
236
+ hardware (str): Hardware type for HuggingFace Jobs. Options: "auto", "cpu", "gpu_a10", "gpu_h200". Default: "auto"
237
+
238
+ Returns:
239
+ str: Markdown-formatted cost estimate with breakdown of LLM costs, HF Jobs costs, duration, CO2 emissions, and optimization tips
240
+ """
241
+ try:
242
+ # Determine if API or local model
243
+ is_api_model = any(provider in model.lower() for provider in ["openai", "anthropic", "google", "cohere"])
244
+
245
+ # Auto-select hardware
246
+ if hardware == "auto":
247
+ hardware = "cpu" if is_api_model else "gpu_a10"
248
+
249
+ # Cost data (simplified estimates)
250
+ llm_costs = {
251
+ "openai/gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
252
+ "openai/gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
253
+ "anthropic/claude-3-opus": {"input": 0.015, "output": 0.075},
254
+ "anthropic/claude-3-sonnet": {"input": 0.003, "output": 0.015},
255
+ "meta-llama/Llama-3.1-8B": {"input": 0, "output": 0}, # Local model
256
+ "default": {"input": 0.001, "output": 0.002}
257
+ }
258
+
259
+ hf_jobs_costs = {
260
+ "cpu": 0.60, # per hour
261
+ "gpu_a10": 1.10, # per hour
262
+ "gpu_h200": 4.50 # per hour
263
+ }
264
+
265
+ # Get model costs
266
+ model_cost = llm_costs.get(model, llm_costs["default"])
267
+
268
+ # Estimate token usage per test
269
+ # Tool agent: ~200 tokens input, ~150 output
270
+ # Code agent: ~300 tokens input, ~400 output
271
+ # Both: ~400 tokens input, ~500 output
272
+ token_estimates = {
273
+ "tool": {"input": 200, "output": 150},
274
+ "code": {"input": 300, "output": 400},
275
+ "both": {"input": 400, "output": 500}
276
+ }
277
+
278
+ tokens_per_test = token_estimates.get(agent_type, token_estimates["both"])  # fall back to "both" for unknown agent_type
279
+
280
+ # Calculate LLM costs
281
+ llm_cost_per_test = (
282
+ (tokens_per_test["input"] / 1000) * model_cost["input"] +
283
+ (tokens_per_test["output"] / 1000) * model_cost["output"]
284
+ )
285
+ total_llm_cost = llm_cost_per_test * num_tests
286
+
287
+ # Estimate duration (seconds per test)
288
+ if is_api_model:
289
+ duration_per_test = 3.0 # API models are fast
290
+ else:
291
+ duration_per_test = 8.0 # Local models slower but depends on GPU
292
+
293
+ total_duration_hours = (duration_per_test * num_tests) / 3600
294
+
295
+ # Calculate HF Jobs costs
296
+ jobs_hourly_rate = hf_jobs_costs.get(hardware, hf_jobs_costs["cpu"])
297
+ total_jobs_cost = total_duration_hours * jobs_hourly_rate
298
+
299
+ # Estimate CO2 (rough estimates)
300
+ co2_per_hour = {
301
+ "cpu": 0.05, # kg CO2
302
+ "gpu_a10": 0.15,
303
+ "gpu_h200": 0.30
304
+ }
305
+
306
+ total_co2_kg = total_duration_hours * co2_per_hour.get(hardware, 0.05)
307
+
308
+ # Prepare estimate data
309
+ estimate_data = {
310
+ "model": model,
311
+ "agent_type": agent_type,
312
+ "num_tests": num_tests,
313
+ "hardware": hardware,
314
+ "is_api_model": is_api_model,
315
+ "estimates": {
316
+ "llm_cost_usd": round(total_llm_cost, 4),
317
+ "llm_cost_per_test": round(llm_cost_per_test, 4),
318
+ "jobs_cost_usd": round(total_jobs_cost, 4),
319
+ "total_cost_usd": round(total_llm_cost + total_jobs_cost, 4),
320
+ "duration_hours": round(total_duration_hours, 2),
321
+ "duration_per_test_seconds": round(duration_per_test, 2),
322
+ "co2_emissions_kg": round(total_co2_kg, 3),
323
+ "tokens_per_test": tokens_per_test
324
+ }
325
+ }
326
+
327
+ # Get AI analysis from Gemini
328
+ result = await gemini_client.analyze_with_context(
329
+ data=estimate_data,
330
+ analysis_type="cost_estimate",
331
+ specific_question="Provide cost breakdown and optimization recommendations"
332
+ )
333
+
334
+ return result
335
+
336
+ except Exception as e:
337
+ return f"❌ **Error estimating cost**: {str(e)}"
338
+
339
+
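A worked check of the arithmetic above for `openai/gpt-4` with a `tool` agent over 100 tests:

```python
# Worked check: openai/gpt-4, agent_type="tool" (200 in / 150 out tokens), 100 tests
rate = {"input": 0.03, "output": 0.06}      # USD per 1K tokens
tokens = {"input": 200, "output": 150}      # per test
per_test = tokens["input"] / 1000 * rate["input"] + tokens["output"] / 1000 * rate["output"]
assert abs(per_test - 0.015) < 1e-9         # $0.015 per test
llm_total = per_test * 100                  # $1.50 LLM cost
hours = 3.0 * 100 / 3600                    # API model at 3 s/test -> ~0.083 h
jobs_total = hours * 0.60                   # "auto" picks cpu for API models ($0.60/h) -> $0.05
print(round(llm_total + jobs_total, 2))     # -> 1.55
```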
340
+ async def compare_runs(
341
+ gemini_client: GeminiClient,
342
+ run_id_1: str,
343
+ run_id_2: str,
344
+ leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard",
345
+ comparison_focus: str = "comprehensive",
346
+ hf_token: Optional[str] = None
347
+ ) -> str:
348
+ """
349
+ Compare two evaluation runs and generate AI-powered comparative analysis.
350
+
351
+ This tool fetches data for two evaluation runs from the leaderboard and uses
352
+ Google Gemini (gemini-2.5-flash by default) to provide intelligent comparison across multiple dimensions:
353
+ success rate, cost efficiency, speed, environmental impact, and use case recommendations.
354
+
355
+ Args:
356
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
357
+ run_id_1 (str): First run ID to compare
358
+ run_id_2 (str): Second run ID to compare
359
+ leaderboard_repo (str): HuggingFace dataset repository containing leaderboard data. Default: "kshitijthakkar/smoltrace-leaderboard"
360
+ comparison_focus (str): Focus area for comparison. Options: "comprehensive", "cost", "performance", "eco_friendly". Default: "comprehensive"
361
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
362
+
363
+ Returns:
364
+ str: Markdown-formatted comparative analysis with winner for each category, trade-offs, and use case recommendations
365
+ """
366
+ try:
367
+ # Load leaderboard data
368
+ # Use user-provided token or fall back to environment variable
369
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
370
+ dataset = load_dataset(leaderboard_repo, split="train", token=token)
371
+ df = pd.DataFrame(dataset)
372
+
373
+ # Find the two runs
374
+ run1 = df[df['run_id'] == run_id_1]
375
+ run2 = df[df['run_id'] == run_id_2]
376
+
377
+ if run1.empty:
378
+ return f"❌ **Error**: Run ID '{run_id_1}' not found in leaderboard"
379
+ if run2.empty:
380
+ return f"❌ **Error**: Run ID '{run_id_2}' not found in leaderboard"
381
+
382
+ run1_data = run1.iloc[0].to_dict()
383
+ run2_data = run2.iloc[0].to_dict()
384
+
385
+ # Build comparison context for Gemini
386
+ comparison_data = {
387
+ "run_1": {
388
+ "run_id": run1_data.get('run_id'),
389
+ "model": run1_data.get('model'),
390
+ "agent_type": run1_data.get('agent_type'),
391
+ "success_rate": run1_data.get('success_rate'),
392
+ "total_tests": run1_data.get('total_tests'),
393
+ "successful_tests": run1_data.get('successful_tests'),
394
+ "avg_duration_ms": run1_data.get('avg_duration_ms'),
395
+ "total_cost_usd": run1_data.get('total_cost_usd'),
396
+ "avg_cost_per_test_usd": run1_data.get('avg_cost_per_test_usd'),
397
+ "co2_emissions_g": run1_data.get('co2_emissions_g'),
398
+ "gpu_utilization_avg": run1_data.get('gpu_utilization_avg'),
399
+ "total_tokens": run1_data.get('total_tokens'),
400
+ "provider": run1_data.get('provider'),
401
+ "job_type": run1_data.get('job_type'),
402
+ "timestamp": run1_data.get('timestamp')
403
+ },
404
+ "run_2": {
405
+ "run_id": run2_data.get('run_id'),
406
+ "model": run2_data.get('model'),
407
+ "agent_type": run2_data.get('agent_type'),
408
+ "success_rate": run2_data.get('success_rate'),
409
+ "total_tests": run2_data.get('total_tests'),
410
+ "successful_tests": run2_data.get('successful_tests'),
411
+ "avg_duration_ms": run2_data.get('avg_duration_ms'),
412
+ "total_cost_usd": run2_data.get('total_cost_usd'),
413
+ "avg_cost_per_test_usd": run2_data.get('avg_cost_per_test_usd'),
414
+ "co2_emissions_g": run2_data.get('co2_emissions_g'),
415
+ "gpu_utilization_avg": run2_data.get('gpu_utilization_avg'),
416
+ "total_tokens": run2_data.get('total_tokens'),
417
+ "provider": run2_data.get('provider'),
418
+ "job_type": run2_data.get('job_type'),
419
+ "timestamp": run2_data.get('timestamp')
420
+ },
421
+ "comparison_focus": comparison_focus
422
+ }
423
+
424
+ # Create comparison prompt based on focus
425
+ if comparison_focus == "comprehensive":
426
+ prompt = f"""
427
+ You are analyzing a comparison between two agent evaluation runs. Provide a comprehensive analysis covering all aspects.
428
+
429
+ **Run 1 ({comparison_data['run_1']['model']}):**
430
+ {json.dumps(comparison_data['run_1'], indent=2)}
431
+
432
+ **Run 2 ({comparison_data['run_2']['model']}):**
433
+ {json.dumps(comparison_data['run_2'], indent=2)}
434
+
435
+ Please provide a detailed comparison in the following format:
436
+
437
+ ## 📊 Head-to-Head Comparison
438
+
439
+ ### 🎯 Accuracy Winner
440
+ [Which run has better success rate and by how much? Explain significance]
441
+
442
+ ### ⚡ Speed Winner
443
+ [Which run is faster and by how much? Include average duration comparison]
444
+
445
+ ### 💰 Cost Winner
446
+ [Which run is more cost-effective? Compare total cost AND cost per test]
447
+
448
+ ### 🌱 Eco-Friendly Winner
449
+ [Which run has lower CO2 emissions? Calculate the difference]
450
+
451
+ ### 🔧 GPU Efficiency Winner (if applicable)
452
+ [For GPU jobs, which has better utilization? Explain implications]
453
+
454
+ ## 📈 Performance Summary
455
+
456
+ ### Run 1 Strengths
457
+ - [List 3-4 key strengths]
458
+
459
+ ### Run 2 Strengths
460
+ - [List 3-4 key strengths]
461
+
462
+ ## 💡 Use Case Recommendations
463
+
464
+ ### When to Choose Run 1 ({comparison_data['run_1']['model']})
465
+ [Specific scenarios where Run 1 is the better choice]
466
+
467
+ ### When to Choose Run 2 ({comparison_data['run_2']['model']})
468
+ [Specific scenarios where Run 2 is the better choice]
469
+
470
+ ## ⚖️ Overall Recommendation
471
+ [Based on the analysis, provide a balanced recommendation considering different priorities]
472
+
473
+ Be specific with numbers and percentages. Make the comparison actionable and insightful.
474
+ """
475
+ elif comparison_focus == "cost":
476
+ prompt = f"""
477
+ Compare these two evaluation runs focusing specifically on cost efficiency:
478
+
479
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
480
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
481
+
482
+ Provide detailed cost analysis:
483
+ 1. Which run has lower total cost and by what percentage?
484
+ 2. Cost per test comparison - which is more efficient?
485
+ 3. Calculate cost per successful test (accounting for failures)
486
+ 4. Token usage efficiency - cost per 1000 tokens
487
+ 5. ROI analysis - is higher cost justified by better accuracy?
488
+ 6. Scaling implications - at 1000 tests, what would each cost?
489
+
490
+ Provide actionable cost optimization recommendations.
491
+ """
492
+ elif comparison_focus == "performance":
493
+ prompt = f"""
494
+ Compare these two evaluation runs focusing on performance (speed + accuracy):
495
+
496
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
497
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
498
+
499
+ Analyze:
500
+ 1. Success rate difference - statistical significance?
501
+ 2. Speed comparison - average duration per test
502
+ 3. Which delivers faster results without sacrificing accuracy?
503
+ 4. Throughput analysis - tests per minute
504
+ 5. Quality vs Speed trade-off assessment
505
+ 6. GPU utilization efficiency (if applicable)
506
+
507
+ Recommend which run offers best performance for production workloads.
508
+ """
509
+ elif comparison_focus == "eco_friendly":
510
+ prompt = f"""
511
+ Compare these two evaluation runs focusing on environmental impact:
512
+
513
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
514
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
515
+
516
+ Analyze:
517
+ 1. CO2 emissions comparison - which is greener?
518
+ 2. Emissions per test and per successful test
519
+ 3. GPU vs API model environmental trade-offs
520
+ 4. Energy efficiency based on duration and GPU utilization
521
+ 5. Emissions reduction if scaled to 10,000 tests
522
+ 6. Carbon offset cost comparison
523
+
524
+ Provide eco-conscious recommendations for sustainable AI deployment.
525
+ """
526
+
527
+ # Get AI analysis from Gemini
528
+ analysis = await gemini_client.analyze_with_context(
529
+ comparison_data,
530
+ analysis_type="comparison",
531
+ specific_question=prompt
532
+ )
533
+
534
+ return analysis
535
+
536
+ except Exception as e:
537
+ return f"❌ **Error comparing runs**: {str(e)}"
538
+
539
+
540
+ async def get_dataset(
541
+ dataset_repo: str,
542
+ max_rows: int = 50,
543
+ hf_token: Optional[str] = None
544
+ ) -> str:
545
+ """
546
+ Load SMOLTRACE datasets from HuggingFace and return as JSON.
547
+
548
+ This tool loads datasets with the "smoltrace-" prefix and returns the raw data
549
+ as JSON. Use this to access:
550
+ - Leaderboard data (kshitijthakkar/smoltrace-leaderboard)
551
+ - Results datasets (e.g., username/smoltrace-results-*)
552
+ - Traces datasets (e.g., username/smoltrace-traces-*)
553
+ - Metrics datasets (e.g., username/smoltrace-metrics-*)
554
+ - Any other smoltrace-prefixed evaluation dataset
555
+
556
+ If you don't know which dataset to load, first load the leaderboard to see
557
+ the dataset references in the results_dataset, traces_dataset, metrics_dataset,
558
+ and dataset_used fields.
559
+
560
+ Args:
561
+ dataset_repo (str): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
562
+ max_rows (int): Maximum number of rows to return. Default: 50. Range: 1-200
563
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
564
+
565
+ Returns:
566
+ str: JSON object with dataset data and metadata
567
+ """
568
+ try:
569
+ # Validate dataset has smoltrace- prefix
570
+ if "smoltrace-" not in dataset_repo:
571
+ return json.dumps({
572
+ "dataset_repo": dataset_repo,
573
+ "error": "Only datasets with 'smoltrace-' prefix are allowed. Please use smoltrace-leaderboard or other smoltrace-* datasets.",
574
+ "data": []
575
+ }, indent=2)
576
+
577
+ # Load dataset from HuggingFace
578
+ # Use user-provided token or fall back to environment variable
579
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
580
+ dataset = load_dataset(dataset_repo, split="train", token=token)
581
+ df = pd.DataFrame(dataset)
582
+
583
+ if df.empty:
584
+ return json.dumps({
585
+ "dataset_repo": dataset_repo,
586
+ "error": "Dataset is empty",
587
+ "total_rows": 0,
588
+ "data": []
589
+ }, indent=2)
590
+
591
+ # Get total row count before limiting
592
+ total_rows = len(df)
593
+
594
+ # Limit rows to avoid overwhelming the context
595
+ max_rows = max(1, min(200, max_rows))
596
+
597
+ # Sort by timestamp if available (newest first)
598
+ if "timestamp" in df.columns:
599
+ df = df.sort_values("timestamp", ascending=False)
600
+
601
+ df_limited = df.head(max_rows)
602
+
603
+ # Convert to list of dictionaries
604
+ data = df_limited.to_dict(orient="records")
605
+
606
+ # Build response with metadata
607
+ result = {
608
+ "dataset_repo": dataset_repo,
609
+ "total_rows": total_rows,
610
+ "rows_returned": len(data),
611
+ "columns": list(df.columns),
612
+ "data": data
613
+ }
614
+
615
+ return json.dumps(result, indent=2, default=str)
616
+
617
+ except Exception as e:
618
+ return json.dumps({
619
+ "dataset_repo": dataset_repo,
620
+ "error": f"Failed to load dataset: {str(e)}",
621
+ "data": []
622
+ }, indent=2)
623
+
624
+
625
+ # ============================================================================
626
+ # MCP RESOURCES - Expose data for retrieval by MCP clients
627
+ # ============================================================================
628
+
629
+ @gr.mcp.resource("leaderboard://{repo}")
630
+ def get_leaderboard_data(repo: str = "kshitijthakkar/smoltrace-leaderboard", hf_token: Optional[str] = None) -> str:
631
+ """
632
+ Get raw leaderboard data from HuggingFace dataset.
633
+
634
+ This resource provides direct access to leaderboard data in JSON format,
635
+ allowing MCP clients to retrieve and process evaluation results.
636
+
637
+ Args:
638
+ repo (str): HuggingFace dataset repository name. Default: "kshitijthakkar/smoltrace-leaderboard"
639
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
640
+
641
+ Returns:
642
+ str: JSON string containing leaderboard data with all evaluation runs
643
+ """
644
+ try:
645
+ # Use user-provided token or fall back to environment variable
646
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
647
+ ds = load_dataset(repo, split="train", token=token)
648
+ df = pd.DataFrame(ds)
649
+
650
+ # Convert to JSON with proper formatting
651
+ data = df.to_dict('records')
652
+ return json.dumps({
653
+ "total_runs": len(data),
654
+ "repository": repo,
655
+ "data": data
656
+ }, indent=2)
657
+
658
+ except Exception as e:
659
+ return json.dumps({
660
+ "error": str(e),
661
+ "repository": repo
662
+ })
663
+
664
+
665
+ @gr.mcp.resource("trace://{trace_id}/{repo}")
666
+ def get_trace_data(trace_id: str, repo: str, hf_token: Optional[str] = None) -> str:
667
+ """
668
+ Get raw trace data for a specific trace ID from HuggingFace dataset.
669
+
670
+ This resource provides direct access to OpenTelemetry trace data,
671
+ allowing MCP clients to retrieve detailed execution information.
672
+
673
+ Args:
674
+ trace_id (str): Unique identifier for the trace (e.g., "trace_abc123")
675
+ repo (str): HuggingFace dataset repository containing traces (e.g., "username/agent-traces-model")
676
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
677
+
678
+ Returns:
679
+ str: JSON string containing trace data with all spans and attributes
680
+ """
681
+ try:
682
+ # Use user-provided token or fall back to environment variable
683
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
684
+ ds = load_dataset(repo, split="train", token=token)
685
+ df = pd.DataFrame(ds)
686
+
687
+ # Find specific trace
688
+ trace_data = df[df['trace_id'] == trace_id]
689
+
690
+ if len(trace_data) == 0:
691
+ return json.dumps({
692
+ "error": f"Trace {trace_id} not found",
693
+ "trace_id": trace_id,
694
+ "repository": repo
695
+ })
696
+
697
+ trace_row = trace_data.iloc[0]
698
+
699
+ # Parse spans if they're stored as string
700
+ spans = trace_row['spans']
701
+ if isinstance(spans, str):
702
+ spans = json.loads(spans)
703
+
704
+ return json.dumps({
705
+ "trace_id": trace_id,
706
+ "repository": repo,
707
+ "run_id": trace_row.get('run_id', 'unknown'),
708
+ "spans": spans
709
+ }, indent=2)
710
+
711
+ except Exception as e:
712
+ return json.dumps({
713
+ "error": str(e),
714
+ "trace_id": trace_id,
715
+ "repository": repo
716
+ })
717
+
718
+
719
+ @gr.mcp.resource("cost://model/{model_name}")
720
+ def get_cost_data(model_name: str) -> str:
721
+ """
722
+ Get cost information for a specific model.
723
+
724
+ This resource provides pricing data for LLM models and hardware configurations,
725
+ helping users understand evaluation costs.
726
+
727
+ Args:
728
+ model_name (str): Model identifier (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
729
+
730
+ Returns:
731
+ str: JSON string containing cost data for the model
732
+ """
733
+ # Cost database
734
+ llm_costs = {
735
+ "openai/gpt-4": {
736
+ "input_per_1k_tokens": 0.03,
737
+ "output_per_1k_tokens": 0.06,
738
+ "type": "api",
739
+ "provider": "openai"
740
+ },
741
+ "openai/gpt-3.5-turbo": {
742
+ "input_per_1k_tokens": 0.0015,
743
+ "output_per_1k_tokens": 0.002,
744
+ "type": "api",
745
+ "provider": "openai"
746
+ },
747
+ "anthropic/claude-3-opus": {
748
+ "input_per_1k_tokens": 0.015,
749
+ "output_per_1k_tokens": 0.075,
750
+ "type": "api",
751
+ "provider": "anthropic"
752
+ },
753
+ "anthropic/claude-3-sonnet": {
754
+ "input_per_1k_tokens": 0.003,
755
+ "output_per_1k_tokens": 0.015,
756
+ "type": "api",
757
+ "provider": "anthropic"
758
+ },
759
+ "meta-llama/Llama-3.1-8B": {
760
+ "input_per_1k_tokens": 0,
761
+ "output_per_1k_tokens": 0,
762
+ "type": "local",
763
+ "provider": "meta",
764
+ "requires_gpu": True,
765
+ "recommended_hardware": "gpu_a10"
766
+ }
767
+ }
768
+
769
+ hardware_costs = {
770
+ "cpu": {"hourly_rate_usd": 0.60, "type": "cpu"},
771
+ "gpu_a10": {"hourly_rate_usd": 1.10, "type": "gpu", "model": "A10"},
772
+ "gpu_h200": {"hourly_rate_usd": 4.50, "type": "gpu", "model": "H200"}
773
+ }
774
+
775
+ model_cost = llm_costs.get(model_name)
776
+
777
+ if model_cost:
778
+ return json.dumps({
779
+ "model": model_name,
780
+ "cost_data": model_cost,
781
+ "hardware_options": hardware_costs,
782
+ "currency": "USD"
783
+ }, indent=2)
784
+ else:
785
+ return json.dumps({
786
+ "model": model_name,
787
+ "error": "Model not found in cost database",
788
+ "available_models": list(llm_costs.keys()),
789
+ "hardware_options": hardware_costs
790
+ }, indent=2)
791
+
792
+
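A quick local check of this resource function (a pure lookup, no network; assumes the decorated function stays directly callable):

```python
import json
from mcp_tools import get_cost_data

info = json.loads(get_cost_data("openai/gpt-4"))
print(info["cost_data"]["input_per_1k_tokens"])  # -> 0.03
```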
793
+ # ============================================================================
794
+ # MCP PROMPTS - Reusable prompt templates for common workflows
795
+ # ============================================================================
796
+
797
+ @gr.mcp.prompt()
798
+ def analysis_prompt(
799
+ analysis_type: str = "leaderboard",
800
+ focus_area: str = "overall",
801
+ detail_level: str = "detailed"
802
+ ) -> str:
803
+ """
804
+ Generate a prompt template for analyzing agent evaluation data.
805
+
806
+ This prompt helps standardize analysis requests across different
807
+ evaluation data types and focus areas.
808
+
809
+ Args:
810
+ analysis_type (str): Type of analysis. Options: "leaderboard", "trace", "cost". Default: "leaderboard"
811
+ focus_area (str): What to focus on. Options: "overall", "performance", "cost", "efficiency". Default: "overall"
812
+ detail_level (str): Level of detail. Options: "summary", "detailed", "comprehensive". Default: "detailed"
813
+
814
+ Returns:
815
+ str: Formatted prompt template for analysis
816
+ """
817
+ templates = {
818
+ "leaderboard": {
819
+ "overall": "Analyze the agent evaluation leaderboard data comprehensively. Identify top performers across all metrics (accuracy, cost, latency, CO2), explain trade-offs between different approaches, and provide actionable recommendations for model selection.",
820
+ "performance": "Focus on performance metrics in the leaderboard. Compare success rates and accuracy across different models and agent types. Identify which configurations achieve the highest success rates and explain why.",
821
+ "cost": "Analyze cost efficiency in the leaderboard. Compare costs across different models and identify the best cost-performance ratios. Recommend the most cost-effective configurations for different use cases.",
822
+ "efficiency": "Evaluate efficiency metrics including latency, GPU utilization, and CO2 emissions. Identify the most efficient models and explain how to optimize for speed while maintaining quality."
823
+ },
824
+ "trace": {
825
+ "overall": "Analyze this agent execution trace comprehensively. Explain the sequence of operations, identify any bottlenecks or inefficiencies, and suggest optimizations.",
826
+ "performance": "Focus on performance aspects of this trace. Identify which steps took the most time, explain why, and suggest ways to improve execution speed.",
827
+ "cost": "Analyze the cost implications of this trace execution. Break down token usage and API calls, calculate costs, and suggest ways to reduce expenses.",
828
+ "efficiency": "Evaluate the efficiency of this trace. Identify redundant operations, suggest ways to optimize the execution flow, and recommend best practices."
829
+ },
830
+ "cost": {
831
+ "overall": "Analyze the cost estimation comprehensively. Break down LLM API costs, infrastructure costs, and provide optimization recommendations.",
832
+ "performance": "Focus on the cost-performance trade-off. Compare different hardware options and explain which provides the best value.",
833
+ "cost": "Deep dive into cost breakdown. Explain each cost component in detail and provide specific recommendations for cost reduction.",
834
+ "efficiency": "Analyze cost efficiency. Compare different model configurations and recommend the most cost-effective approach for the given use case."
835
+ }
836
+ }
837
+
838
+ detail_prefixes = {
839
+ "summary": "Provide a brief, high-level summary. ",
840
+ "detailed": "Provide a detailed analysis with specific insights. ",
841
+ "comprehensive": "Provide a comprehensive, in-depth analysis with detailed recommendations. "
842
+ }
843
+
844
+ prefix = detail_prefixes.get(detail_level, detail_prefixes["detailed"])
845
+ template = templates.get(analysis_type, {}).get(focus_area, templates["leaderboard"]["overall"])
846
+
847
+ return f"{prefix}{template}"
848
+
849
+
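For illustration, the composer simply prepends a detail prefix to a focus template (again assuming the decorated function remains directly callable):

```python
from mcp_tools import analysis_prompt

print(analysis_prompt(analysis_type="trace", focus_area="performance", detail_level="summary"))
# -> "Provide a brief, high-level summary. Focus on performance aspects of this trace. ..."
```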
850
+ @gr.mcp.prompt()
851
+ def debug_prompt(
852
+ debug_type: str = "error",
853
+ context: str = "agent_execution"
854
+ ) -> str:
855
+ """
856
+ Generate a prompt template for debugging agent traces.
857
+
858
+ This prompt helps standardize debugging requests for different
859
+ types of issues and contexts.
860
+
861
+ Args:
862
+ debug_type (str): Type of debugging. Options: "error", "performance", "behavior", "optimization". Default: "error"
863
+ context (str): Execution context. Options: "agent_execution", "tool_calling", "llm_reasoning". Default: "agent_execution"
864
+
865
+ Returns:
866
+ str: Formatted prompt template for debugging
867
+ """
868
+ templates = {
869
+ "error": {
870
+ "agent_execution": "Debug this agent execution trace to identify why it failed. Analyze each step in the execution flow, identify where the error occurred, explain the root cause, and suggest how to fix it.",
871
+ "tool_calling": "Debug this tool calling sequence. Identify which tool call failed or produced unexpected results, explain why it happened, and suggest corrections.",
872
+ "llm_reasoning": "Debug the LLM reasoning in this trace. Analyze the prompts and responses, identify where the reasoning went wrong, and suggest improvements to the prompts or approach."
873
+ },
874
+ "performance": {
875
+ "agent_execution": "Analyze this trace for performance issues. Identify bottlenecks, measure time spent in each component, and recommend optimizations to improve execution speed.",
876
+ "tool_calling": "Analyze tool calling performance. Identify which tools are slow, explain why, and suggest ways to optimize tool execution or caching.",
877
+ "llm_reasoning": "Analyze LLM reasoning efficiency. Identify unnecessary calls, redundant reasoning steps, and suggest ways to streamline the reasoning process."
878
+ },
879
+ "behavior": {
880
+ "agent_execution": "Analyze the agent's behavior in this trace. Explain why the agent made certain decisions, whether the behavior is expected, and suggest improvements if needed.",
881
+ "tool_calling": "Analyze tool selection behavior. Explain why certain tools were called, whether the choices were optimal, and suggest alternative approaches if applicable.",
882
+ "llm_reasoning": "Analyze the LLM's reasoning patterns. Explain the logic flow, identify any unexpected reasoning, and suggest how to guide the model toward better decisions."
883
+ },
884
+ "optimization": {
885
+ "agent_execution": "Analyze this trace for optimization opportunities. Identify redundant operations, suggest caching strategies, and recommend ways to reduce costs and improve efficiency.",
886
+ "tool_calling": "Optimize tool usage in this trace. Suggest ways to reduce tool calls, batch operations, or use more efficient alternatives.",
887
+ "llm_reasoning": "Optimize LLM usage. Suggest ways to reduce token usage, improve prompt efficiency, and achieve the same results with lower costs."
888
+ }
889
+ }
890
+
891
+ template = templates.get(debug_type, {}).get(context, templates["error"]["agent_execution"])
892
+ return template
893
+
894
+
895
+ @gr.mcp.prompt()
896
+ def optimization_prompt(
897
+ optimization_goal: str = "cost",
898
+ constraints: str = "maintain_quality"
899
+ ) -> str:
900
+ """
901
+ Generate a prompt template for optimization recommendations.
902
+
903
+ This prompt helps standardize optimization requests for different
904
+ goals and constraints.
905
+
906
+ Args:
907
+ optimization_goal (str): What to optimize. Options: "cost", "speed", "quality", "efficiency". Default: "cost"
908
+ constraints (str): Constraints to consider. Options: "maintain_quality", "maintain_speed", "no_constraints". Default: "maintain_quality"
909
+
910
+ Returns:
911
+ str: Formatted prompt template for optimization
912
+ """
913
+ templates = {
914
+ "cost": {
915
+ "maintain_quality": "Analyze this evaluation setup and recommend cost optimizations while maintaining quality. Consider cheaper models, optimized prompts, caching strategies, and hardware selection. Quantify potential savings.",
916
+ "maintain_speed": "Recommend cost optimizations while maintaining execution speed. Consider model alternatives, batch processing, and infrastructure choices that reduce costs without adding latency.",
917
+ "no_constraints": "Recommend aggressive cost optimizations. Identify all opportunities to reduce expenses, even if it means trade-offs in quality or speed. Prioritize maximum cost reduction."
918
+ },
919
+ "speed": {
920
+ "maintain_quality": "Recommend speed optimizations while maintaining quality. Consider parallel execution, caching, faster models with similar accuracy, and infrastructure upgrades. Quantify potential speedups.",
921
+ "maintain_cost": "Recommend speed optimizations within the current cost budget. Suggest configuration changes, caching strategies, and optimizations that don't increase expenses.",
922
+ "no_constraints": "Recommend aggressive speed optimizations. Identify all opportunities to reduce latency, even if it increases costs. Prioritize maximum performance."
923
+ },
924
+ "quality": {
925
+ "maintain_cost": "Recommend quality improvements within the current cost budget. Suggest better prompts, model configurations, and strategies that improve accuracy without increasing expenses.",
926
+ "maintain_speed": "Recommend quality improvements while maintaining execution speed. Suggest prompt improvements, reasoning enhancements, and configurations that improve accuracy without adding latency.",
927
+ "no_constraints": "Recommend quality improvements without budget constraints. Suggest the best models, optimal configurations, and strategies to maximize accuracy and success rates."
928
+ },
929
+ "efficiency": {
930
+ "maintain_quality": "Recommend overall efficiency improvements. Optimize for the best cost-speed-quality balance. Identify waste, suggest streamlined processes, and provide holistic optimization strategies.",
931
+ "maintain_cost": "Recommend efficiency improvements within budget. Focus on reducing waste, optimizing resource usage, and getting better results with the same cost.",
932
+ "maintain_speed": "Recommend efficiency improvements maintaining speed. Reduce unnecessary operations, optimize resource usage, and improve output quality without adding latency."
933
+ }
934
+ }
935
+
936
+ # Fall back when the goal/constraint combination has no template defined
937
+ # (e.g. optimization_goal="speed" with constraints="maintain_speed")
938
+ goal_templates = templates.get(optimization_goal, templates["cost"])
939
+ if constraints not in goal_templates:
940
+ constraints = next(iter(goal_templates))
941
+
942
+ template = goal_templates[constraints]
943
+ return template
requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ # Core dependencies
2
+ gradio[mcp]>=6.0.0.dev1
3
+ google-generativeai>=0.8.0
4
+ datasets>=2.14.0
5
+ pandas>=2.0.0
6
+
7
+ # HuggingFace
8
+ huggingface-hub>=0.20.0
9
+
10
+ # Utilities
11
+ python-dotenv>=1.0.0