Mandark-droid committed on
Commit a3116de · 1 Parent(s): c5a5a1d

TraceMind MCP Server V1 Working and tested.

Files changed (9)
  1. .env.example +8 -0
  2. .gitignore +34 -0
  3. API_KEY_CONFIGURATION.md +261 -0
  4. LICENSE +680 -0
  5. README.md +586 -8
  6. app.py +1006 -0
  7. gemini_client.py +185 -0
  8. mcp_tools.py +943 -0
  9. requirements.txt +11 -0
.env.example ADDED
@@ -0,0 +1,8 @@
+ # Google Gemini API Key
+ # Get from: https://ai.google.dev/
+ GEMINI_API_KEY=your_gemini_api_key_here
+
+ # HuggingFace Token
+ # Get from: https://huggingface.co/settings/tokens
+ # Needs read access to datasets
+ HF_TOKEN=your_huggingface_token_here
.gitignore ADDED
@@ -0,0 +1,34 @@
+ # Environment
+ .env
+ .venv/
+ venv/
+ env/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
+
+ # IDEs
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Gradio
+ flagged/
+ gradio_cached_examples/
+
+ # Logs
+ *.log
API_KEY_CONFIGURATION.md ADDED
@@ -0,0 +1,261 @@
+ # API Key Configuration Feature
+
+ ## Overview
+
+ Users can now configure their API keys directly through the TraceMind MCP Server UI. These user-provided keys override environment variables for the current session.
+
+ ## What's New
+
+ ### 1. Settings Tab (⚙️)
+ A new Settings tab has been added as the first tab in the UI, where users can:
+ - Enter their **Google Gemini API Key**
+ - Enter their **HuggingFace Token**
+ - Save keys for the current session
+ - Clear session keys and revert to environment variables
+
+ ### 2. Session-Only Storage
+ - API keys are stored in Gradio's session state
+ - Keys are **NOT** persisted to disk or cookies
+ - Keys are automatically cleared when the browser session ends
+ - Each user session has its own isolated key storage
+
+ ### 3. Automatic Key Validation
+ - Gemini API keys are validated when saved by creating a test client
+ - Invalid keys are rejected with clear error messages
+ - Users receive immediate feedback on key validity
+
+ ## How to Use
+
+ ### Option 1: Configure via UI (Recommended)
+
+ 1. **Navigate to the Settings Tab**
+    - Open the TraceMind MCP Server app
+    - Click on the "⚙️ Settings" tab (first tab)
+
+ 2. **Enter Your Keys**
+    - **Gemini API Key**: Get from https://aistudio.google.com/app/apikey
+    - **HuggingFace Token**: Get from https://huggingface.co/settings/tokens
+
+ 3. **Save Keys**
+    - Click "💾 Save API Keys for This Session"
+    - Wait for validation confirmation
+    - Keys are now active for all tools
+
+ 4. **Use Any Tool**
+    - Navigate to any other tab (Analyze Leaderboard, Debug Trace, etc.)
+    - Tools will automatically use your configured keys
+    - No additional configuration needed
+
+ ### Option 2: Environment Variables (Still Supported)
+
+ You can still use environment variables as before:
+
+ ```bash
+ export GEMINI_API_KEY="your-key-here"
+ export HF_TOKEN="your-token-here"
+ ```
+
+ **Note**: UI-configured keys always override environment variables.
+
+ ## Technical Details
+
+ ### Architecture Changes
+
+ #### 1. UI Layer (`app.py`)
+ - Added Settings tab with key input forms
+ - Implemented session state management with `gr.State()` (see the sketch below)
+ - Updated all tool functions to accept API keys as parameters
+ - Added key validation and error handling
+
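+ A minimal sketch of how this session wiring could look; the handler and component names here are illustrative, not the exact ones in `app.py`:
+
+ ```python
+ import gradio as gr
+
+ def save_keys(gemini_key, hf_token, state):
+     """Illustrative handler: stash keys in the per-session state dict."""
+     state = dict(state or {})
+     if gemini_key:
+         state["gemini_api_key"] = gemini_key.strip()
+     if hf_token:
+         state["hf_token"] = hf_token.strip()
+     return state, "✅ Keys saved for this session"
+
+ def clear_keys(_state):
+     """Illustrative handler: drop session keys, reverting to env vars."""
+     return {}, "Session keys cleared - environment variables are active again"
+
+ with gr.Blocks() as demo:
+     key_state = gr.State({})  # isolated per browser session, never persisted
+     gemini_box = gr.Textbox(label="Gemini API Key", type="password")
+     hf_box = gr.Textbox(label="HuggingFace Token", type="password")
+     status = gr.Markdown()
+     gr.Button("💾 Save API Keys for This Session").click(
+         save_keys, [gemini_box, hf_box, key_state], [key_state, status]
+     )
+     gr.Button("Clear Session Keys").click(clear_keys, [key_state], [key_state, status])
+ ```
+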
+ #### 2. Tool Layer (`mcp_tools.py`)
+ - Updated all functions to accept an optional `hf_token` parameter
+ - Modified `load_dataset()` calls to use user-provided tokens
+ - Added fallback to environment variables when no token is provided (see the sketch below)
+ - Functions updated:
+   - `analyze_leaderboard()`
+   - `debug_trace()`
+   - `compare_runs()`
+   - `get_dataset()`
+   - `get_leaderboard_data()` (MCP Resource)
+   - `get_trace_data()` (MCP Resource)
+
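+ A minimal sketch of that fallback, assuming a recent `datasets` release where `load_dataset()` accepts a `token=` argument (the helper names and `split` choice are illustrative):
+
+ ```python
+ import os
+ from typing import Optional
+
+ from datasets import load_dataset
+
+ def _resolve_hf_token(hf_token: Optional[str] = None) -> Optional[str]:
+     """User-provided token (UI) wins; otherwise fall back to the environment."""
+     return hf_token or os.environ.get("HF_TOKEN")
+
+ def load_eval_dataset(repo_id: str, hf_token: Optional[str] = None):
+     # token=None still works for public datasets or a cached `huggingface-cli login`
+     return load_dataset(repo_id, split="train", token=_resolve_hf_token(hf_token))
+ ```
+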
+ #### 3. Client Layer (`gemini_client.py`)
+ - `GeminiClient.__init__()` already supported an optional `api_key` parameter
+ - No changes needed - already designed for key override
+
+ ### Key Features
+
+ 1. **Priority Order** (see the sketch after this list):
+    ```
+    User-provided key (UI) > Environment variable > Error
+    ```
+
+ 2. **Validation**:
+    - Gemini keys: Validated by creating a test `GeminiClient`
+    - HF tokens: Accepted without validation (validated on first use)
+
+ 3. **Error Handling**:
+    - Clear error messages when keys are missing
+    - Helpful prompts to configure keys in the Settings tab
+    - Validation errors shown immediately
+
+ 4. **Session Management**:
+    - Keys stored in `gr.State()` (Gradio session state)
+    - Isolated per user in multi-user environments
+    - Automatically cleared on session end
+
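+ In code, the priority order and save-time validation could look roughly like this, assuming `GeminiClient` raises on a bad key when constructed (per the client layer notes above); the function names are illustrative:
+
+ ```python
+ import os
+ from typing import Optional
+
+ from gemini_client import GeminiClient
+
+ def resolve_gemini_key(user_key: Optional[str]) -> str:
+     """User-provided key (UI) > environment variable > error."""
+     key = user_key or os.environ.get("GEMINI_API_KEY")
+     if not key:
+         raise ValueError(
+             "No API key configured - add one in the ⚙️ Settings tab "
+             "or set GEMINI_API_KEY."
+         )
+     return key
+
+ def validate_gemini_key(user_key: str) -> str:
+     """Build a throwaway client to check the key, as the Settings tab does."""
+     try:
+         GeminiClient(api_key=user_key)
+         return "✅ Gemini API key validated and saved"
+     except Exception as exc:  # surface the reason without logging the key itself
+         return f"❌ Gemini API key invalid: {exc}"
+ ```
+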
+ ## Security Considerations
+
+ ### ✅ Secure Practices
+
+ 1. **No Persistence**: Keys are never written to disk
+ 2. **Session Isolation**: Each user has isolated key storage
+ 3. **Password Fields**: Keys displayed as `type="password"` (hidden)
+ 4. **No Logging**: Keys not logged or exposed in error messages
+
+ ### ⚠️ Security Notes
+
+ - **HTTPS Required**: Always use HTTPS in production to protect keys in transit
+ - **Public Spaces**: Be cautious when using this on public HuggingFace Spaces
+ - **Shared Environments**: Each browser session is isolated, but the server process still has access to any key you enter
+ - **Recommendation**: Use environment variables for production deployments
+
+ ## Examples
+
+ ### Example 1: First-Time User
+
+ ```
+ 1. User opens app (no env vars set)
+ 2. User sees "⚠️ Status: No API key configured" in Settings
+ 3. User enters Gemini API key and HF token
+ 4. User clicks "Save API Keys"
+ 5. User sees "✅ Gemini API key validated and saved"
+ 6. User switches to "Analyze Leaderboard" tab
+ 7. Tool works using user-provided keys
+ ```
+
+ ### Example 2: Overriding Environment Variables
+
+ ```
+ 1. User has GEMINI_API_KEY set in environment
+ 2. User wants to test with a different key
+ 3. User enters new key in Settings tab
+ 4. User clicks "Save API Keys"
+ 5. All tools now use the new key (not the env var)
+ 6. User clicks "Clear Session Keys" to revert
+ 7. Tools now use environment variable again
+ ```
+
+ ### Example 3: Error Handling
+
+ ```
+ 1. User enters invalid Gemini API key
+ 2. User clicks "Save API Keys"
+ 3. User sees "❌ Gemini API key invalid: [error message]"
+ 4. User corrects the key and tries again
+ 5. User sees "✅ Gemini API key validated and saved"
+ ```
+
+ ## API Changes
+
+ ### Function Signatures
+
+ All tool functions now accept optional API key parameters:
+
+ ```python
+ # Before
+ async def analyze_leaderboard(
+     gemini_client: GeminiClient,
+     leaderboard_repo: str = "...",
+     ...
+ ) -> str:
+
+ # After
+ async def analyze_leaderboard(
+     gemini_client: GeminiClient,
+     leaderboard_repo: str = "...",
+     ...,
+     hf_token: Optional[str] = None  # NEW
+ ) -> str:
+ ```
+
+ ### Backward Compatibility
+
+ - ✅ All existing code continues to work
+ - ✅ Environment variables still supported
+ - ✅ No breaking changes to MCP protocol
+ - ✅ Optional parameters have sensible defaults
+
+ ## Testing Checklist
+
+ - [x] UI renders Settings tab correctly
+ - [x] Gemini API key input works (password field)
+ - [x] HF token input works (password field)
+ - [x] Save button validates and stores keys
+ - [x] Clear button reverts to environment variables
+ - [ ] All tools use user-provided Gemini key
+ - [ ] All tools use user-provided HF token
+ - [ ] Invalid Gemini key shows error
+ - [ ] Missing keys show helpful error messages
+ - [ ] Session isolation works in multi-user scenario
+ - [ ] Keys cleared on browser close
+
+ ## Future Enhancements
+
+ 1. **Key Persistence** (Optional):
+    - Add opt-in browser localStorage support
+    - Warning about security implications
+
+ 2. **Multiple Key Profiles**:
+    - Save multiple key configurations
+    - Quick switch between profiles
+
+ 3. **Usage Tracking**:
+    - Show API usage per session
+    - Cost estimation based on usage
+
+ 4. **Token Expiration**:
+    - Detect expired HF tokens
+    - Prompt for refresh
+
+ ## Troubleshooting
+
+ ### Keys Not Working
+
+ **Problem**: Tools show "No API key configured" error
+
+ **Solutions**:
+ 1. Check that you clicked the "Save API Keys" button
+ 2. Look for the validation success message
+ 3. Try refreshing the page and re-entering keys
+ 4. Check the browser console for errors
+
+ ### Validation Fails
+
+ **Problem**: "❌ Gemini API key invalid" error
+
+ **Solutions**:
+ 1. Verify the key was copied correctly (no extra spaces)
+ 2. Check that the key is active at https://aistudio.google.com/app/apikey
+ 3. Ensure you have API quota remaining
+ 4. Try generating a new key
+
+ ### Dataset Access Denied
+
+ **Problem**: "Error loading dataset: Access denied"
+
+ **Solutions**:
+ 1. Verify the HF token is correct
+ 2. Check that the token has read permissions
+ 3. Ensure the dataset is public or you have access
+ 4. Try using a new token
+
+ ## Support
+
+ For issues or questions:
+ - Check the Settings tab for status messages
+ - Review error messages in tool outputs
+ - Open an issue on GitHub with:
+   - Steps to reproduce
+   - Error messages (DO NOT include actual API keys)
+   - Browser and OS information
LICENSE ADDED
@@ -0,0 +1,680 @@
+ TraceMind MCP Server - AI-powered MCP server for agent evaluation analysis
+ Copyright (C) 2025 Kshitij Thakkar
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ ================================================================================
+
+ GNU AFFERO GENERAL PUBLIC LICENSE
+ Version 3, 19 November 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU Affero General Public License is a free, copyleft license for
+ software and other kinds of works, specifically designed to ensure
+ cooperation with the community in the case of network server software.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ our General Public Licenses are intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ Developers that use our General Public Licenses protect your rights
+ with two steps: (1) assert copyright on the software, and (2) offer
+ you this License which gives you legal permission to copy, distribute
+ and/or modify the software.
+
+ A secondary benefit of defending all users' freedom is that
+ improvements made in alternate versions of the program, if they
+ receive widespread use, become available for other developers to
+ incorporate. Many developers of free software are heartened and
+ encouraged by the resulting cooperation. However, in the case of
+ software used on network servers, this result may fail to come about.
+ The GNU General Public License permits making a modified version and
+ letting the public access it on a server without ever releasing its
+ source code to the public.
+
+ The GNU Affero General Public License is designed specifically to
+ ensure that, in such cases, the modified source code becomes available
+ to the community. It requires the operator of a network server to
+ provide the source code of the modified version running there to the
+ users of that server. Therefore, public use of a modified version, on
+ a publicly accessible server, gives the public access to the source
+ code of the modified version.
+
+ An older license, called the Affero General Public License and
+ published by Affero, was designed to accomplish similar goals. This is
+ a different license, not a version of the Affero GPL, but Affero has
+ released a new version of the Affero GPL which permits relicensing under
+ this license.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU Affero General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Remote Network Interaction; Use with the GNU General Public License.
+
+ Notwithstanding any other provision of this License, if you modify the
+ Program, your modified version must prominently offer all users
+ interacting with it remotely through a computer network (if your version
+ supports such interaction) an opportunity to receive the Corresponding
+ Source of your version by providing access to the Corresponding Source
+ from a network server at no charge, through some standard or customary
+ means of facilitating copying of software. This Corresponding Source
+ shall include the Corresponding Source for any work covered by version 3
+ of the GNU General Public License that is incorporated pursuant to the
+ following paragraph.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the work with which it is combined will remain governed by version
+ 3 of the GNU General Public License.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU Affero General Public License from time to time. Such new versions
+ will be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU Affero General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU Affero General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU Affero General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ TraceMind MCP Server - AI-powered MCP server for agent evaluation analysis
+ Copyright (C) 2025 Kshitij Thakkar
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ For contact information: kshitijthakkar@rocketmail.com
+ GitHub: https://github.com/Mandark-droid/TraceMind-mcp-server
+
+ If your software can interact with users remotely through a computer
+ network, you should also make sure that it provides a way for users to
+ get its source. For example, if your program is a web application, its
+ interface could display a "Source" link that leads users to an archive
+ of the code. There are many ways you could offer source, and different
+ solutions will be better for different programs; see section 13 for the
+ specific requirements.
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU AGPL, see
+ <https://www.gnu.org/licenses/>.
README.md CHANGED
@@ -1,14 +1,592 @@
  ---
- title: TraceMind Mcp Server
- emoji: 🏃
- colorFrom: green
- colorTo: blue
+ title: TraceMind MCP Server
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
+ app_port: 7860
  pinned: false
  license: agpl-3.0
- short_description: AI-powered MCP server for model/agent evaluation analysis
+ short_description: AI-powered MCP server for agent evaluation analysis with Gemini 2.5 Pro
+ tags:
+ - building-mcp-track-enterprise
+ - mcp
+ - gradio
+ - gemini
+ - agent-evaluation
+ - leaderboard
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
19
 
20
+ # TraceMind MCP Server
21
+
22
+ **AI-Powered Analysis Tools for Agent Evaluation Data**
23
+
24
+ [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
25
+ [![Track](https://img.shields.io/badge/Track-Building%20MCP%20(Enterprise)-green)](https://github.com/modelcontextprotocol/hackathon)
26
+ [![Google Gemini](https://img.shields.io/badge/Powered%20by-Google%20Gemini%202.5%20Pro-orange)](https://ai.google.dev/)
27
+
28
+ > **🎯 Track 1 Submission**: Building MCP (Enterprise)
29
+ > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
30
+
31
+ ## Overview
32
+
33
+ TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
34
+
35
+ ### 🛠️ **5 AI-Powered Tools**
36
+ 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
37
+ 2. **🐛 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
38
+ 3. **💰 estimate_cost**: Predict evaluation costs before running
39
+ 4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
40
+ 5. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
41
+
42
+ ### 📦 **3 Data Resources**
43
+ 1. **leaderboard data**: Direct JSON access to evaluation results
44
+ 2. **trace data**: Raw OpenTelemetry trace data with spans
45
+ 3. **cost data**: Model pricing and hardware cost information
46
+
47
+ ### 📝 **3 Prompt Templates**
48
+ 1. **analysis prompts**: Standardized templates for different analysis types
49
+ 2. **debug prompts**: Templates for debugging scenarios
50
+ 3. **optimization prompts**: Templates for optimization goals
51
+
52
+ All analysis is powered by **Google Gemini 2.5 Pro** for intelligent, context-aware insights.
53
+
54
+ ## 📱 Social Media & Demo
55
+
56
+ **📢 Announcement Post**: [Coming Soon - X/LinkedIn post]
57
+
58
+ **🎥 Demo Video**: [Coming Soon - YouTube/Loom link showing MCP server integration with Claude Desktop]
59
+
60
+ ---
61
+
62
+ ## Why This MCP Server?
63
+
64
+ **Problem**: Agent evaluation generates massive amounts of data (leaderboards, traces, metrics), but developers struggle to:
65
+ - Understand which models perform best for their use case
66
+ - Debug why specific agent executions failed
67
+ - Estimate costs before running expensive evaluations
68
+
69
+ **Solution**: This MCP server provides AI-powered analysis tools that connect to HuggingFace datasets and deliver actionable insights in natural language.
70
+
71
+ **Impact**: Developers can make informed decisions about agent configurations, debug issues faster, and optimize costs—all through a simple MCP interface.
72
+
73
+ ## Features
74
+
75
+ ### 🎯 Track 1 Compliance: Building MCP (Enterprise)
76
+
77
+ - ✅ **Complete MCP Implementation**: Tools, Resources, AND Prompts
78
+ - ✅ **MCP Standard Compliant**: Built with Gradio's native MCP support (`@gr.mcp.*` decorators)
79
+ - ✅ **Production-Ready**: Deployable to HuggingFace Spaces with SSE transport
80
+ - ✅ **Testing Interface**: Beautiful Gradio UI for testing all components
81
+ - ✅ **Enterprise Focus**: Cost optimization, debugging, and decision support
82
+ - ✅ **Google Gemini Powered**: Leverages Gemini 2.5 Pro for intelligent analysis
83
+ - ✅ **11 Total Components**: 5 Tools + 3 Resources + 3 Prompts
84
+
85
+ ### 🛠️ Five Production-Ready Tools
86
+
87
+ #### 1. analyze_leaderboard
88
+
89
+ Analyzes evaluation leaderboard data from HuggingFace datasets and provides:
90
+ - Top performers by selected metric (accuracy, cost, latency, CO2)
91
+ - Trade-off analysis (e.g., "GPT-4 is most accurate but Llama-3.1 is 25x cheaper")
92
+ - Trend identification
93
+ - Actionable recommendations
94
+
95
+ **Example Use Case**: Before choosing a model for production, get AI-powered insights on which configuration offers the best cost/performance for your requirements.
96
+
97
+ #### 2. debug_trace
98
+
99
+ Analyzes OpenTelemetry trace data and answers specific questions like:
100
+ - "Why was tool X called twice?"
101
+ - "Which step took the most time?"
102
+ - "Why did this test fail?"
103
+
104
+ **Example Use Case**: When an agent test fails, understand exactly what happened without manually parsing trace spans.
105
+
106
+ #### 3. estimate_cost
107
+
108
+ Predicts costs before running evaluations:
109
+ - LLM API costs (token-based)
110
+ - HuggingFace Jobs compute costs
111
+ - CO2 emissions estimate
112
+ - Hardware recommendations
113
+
114
+ **Example Use Case**: Compare the cost of evaluating GPT-4 vs Llama-3.1 across 1000 tests before committing resources.
115
+
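+ As a rough back-of-envelope for the LLM portion (illustrative prices and token counts, not live rates):
+
+ ```python
+ # Back-of-envelope LLM API cost math (all numbers are assumptions)
+ num_tests = 100
+ in_tok, out_tok = 2_000, 500          # assumed tokens per test
+ in_price, out_price = 30.0, 60.0      # assumed $ per 1M input/output tokens
+ cost = num_tests * (in_tok * in_price + out_tok * out_price) / 1_000_000
+ print(f"${cost:.2f}")                 # -> $9.00
+ ```
+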
116
+ #### 4. compare_runs
117
+
118
+ Compares two evaluation runs with AI-powered analysis across multiple dimensions:
119
+ - Success rate comparison with statistical significance
120
+ - Cost efficiency analysis (total cost, cost per test, cost per successful test)
121
+ - Speed comparison (average duration, throughput)
122
+ - Environmental impact (CO2 emissions per test)
123
+ - GPU efficiency (for GPU jobs)
124
+
125
+ **Focus Options**:
126
+ - `comprehensive`: Complete comparison across all dimensions
127
+ - `cost`: Detailed cost efficiency and ROI analysis
128
+ - `performance`: Speed and accuracy trade-off analysis
129
+ - `eco_friendly`: Environmental impact and carbon footprint comparison
130
+
131
+ **Example Use Case**: After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment based on your priorities (accuracy, cost, speed, or environmental impact).
132
+
133
+ #### 5. get_dataset
134
+
135
+ Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON:
136
+ - Simple, flexible tool that returns complete dataset with metadata
137
+ - Works with any dataset whose repository name contains the "smoltrace-" prefix
138
+ - Returns total rows, columns list, and data array
139
+ - Automatically sorts by timestamp if available
140
+ - Configurable row limit (1-200) to manage token usage
141
+
142
+ **Security Restriction**: Only datasets with "smoltrace-" in the repository name are allowed.
143
+
144
+ **Primary Use Cases**:
145
+ - Load `smoltrace-leaderboard` to find run IDs and model names
146
+ - Discover supporting datasets via `results_dataset`, `traces_dataset`, `metrics_dataset` fields
147
+ - Load `smoltrace-results-*` datasets to see individual test case details
148
+ - Load `smoltrace-traces-*` datasets to access OpenTelemetry trace data
149
+ - Load `smoltrace-metrics-*` datasets to get GPU performance data
150
+ - Answer specific questions requiring raw data access
151
+
152
+ **Example Workflow**:
153
+ 1. LLM calls `get_dataset("kshitijthakkar/smoltrace-leaderboard")` to see all runs
154
+ 2. Examines the JSON response to find run IDs, models, and supporting dataset names
155
+ 3. Calls `get_dataset("username/smoltrace-results-gpt4")` to load detailed results
156
+ 4. Can now answer questions like "What are the last 10 run IDs?" or "Which models were tested?"
157
+
158
+ **Example Use Case**: When the user asks "Can you provide me with the list of last 10 runIds and model names?", the LLM loads the leaderboard dataset and extracts the requested information from the JSON response.
159
+
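+ Outside of MCP, the same data can be inspected directly with the `datasets` library (column names below are illustrative; check the actual schema):
+
+ ```python
+ # Hedged sketch: peek at the leaderboard dataset that get_dataset wraps
+ from datasets import load_dataset
+
+ ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
+ print(ds.column_names)                          # e.g. run_id, model, results_dataset, ...
+ for row in ds.select(range(min(10, len(ds)))):
+     print(row.get("run_id"), row.get("model"))  # assumed column names
+ ```
+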
160
+ ## MCP Resources Usage
161
+
162
+ Resources provide direct data access without AI analysis:
163
+
164
+ ```python
165
+ # Access leaderboard data
166
+ GET leaderboard://kshitijthakkar/smoltrace-leaderboard
167
+ # Returns: JSON with all evaluation runs
168
+
169
+ # Access specific trace
170
+ GET trace://trace_abc123/username/agent-traces-gpt4
171
+ # Returns: JSON with trace spans and attributes
172
+
173
+ # Get model cost information
174
+ GET cost://model/openai/gpt-4
175
+ # Returns: JSON with pricing and hardware costs
176
+ ```
177
+
178
+ ## MCP Prompts Usage
179
+
180
+ Prompts provide reusable templates for standardized interactions:
181
+
182
+ ```python
183
+ # Get analysis prompt template
184
+ analysis_prompt(analysis_type="leaderboard", focus_area="cost", detail_level="detailed")
185
+ # Returns: "Provide a detailed analysis. Analyze cost efficiency in the leaderboard..."
186
+
187
+ # Get debug prompt template
188
+ debug_prompt(debug_type="performance", context="tool_calling")
189
+ # Returns: "Analyze tool calling performance. Identify which tools are slow..."
190
+
191
+ # Get optimization prompt template
192
+ optimization_prompt(optimization_goal="cost", constraints="maintain_quality")
193
+ # Returns: "Analyze this evaluation setup and recommend cost optimizations..."
194
+ ```
195
+
196
+ Use these prompts when interacting with the tools to get consistent, high-quality analysis.
197
+
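+ For example, a client (or a local script) might pair a prompt template with a tool call like this (a sketch using this repo's `mcp_tools.py`; argument values are illustrative):
+
+ ```python
+ # Illustrative pairing of an MCP prompt with an MCP tool
+ import asyncio
+ from gemini_client import GeminiClient
+ from mcp_tools import analysis_prompt, analyze_leaderboard
+
+ async def main():
+     template = analysis_prompt("leaderboard", "cost", "detailed")
+     print(template)  # standardized instructions to steer the analysis
+     report = await analyze_leaderboard(
+         gemini_client=GeminiClient(),
+         leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
+         metric_focus="cost",
+         time_range="last_week",
+         top_n=5,
+     )
+     print(report)
+
+ asyncio.run(main())
+ ```
+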
198
+ ## Quick Start
199
+
200
+ ### 1. Installation
201
+
202
+ ```bash
203
+ git clone https://github.com/Mandark-droid/TraceMind-mcp-server.git
204
+ cd TraceMind-mcp-server
205
+
206
+ # Create virtual environment
207
+ python -m venv venv
208
+ source venv/bin/activate # On Windows: venv\Scripts\activate
209
+
210
+ # Install dependencies (note: gradio[mcp] includes MCP support)
211
+ pip install -r requirements.txt
212
+ ```
213
+
214
+ ### 2. Environment Setup
215
+
216
+ Create `.env` file:
217
+
218
+ ```bash
219
+ cp .env.example .env
220
+ # Edit .env and add your API keys
221
+ ```
222
+
223
+ Get your keys:
224
+ - **Gemini API Key**: https://ai.google.dev/
225
+ - **HuggingFace Token**: https://huggingface.co/settings/tokens
226
+
227
+ ### 3. Run Locally
228
+
229
+ ```bash
230
+ python app.py
231
+ ```
232
+
233
+ Open http://localhost:7860 to test the tools via Gradio interface.
234
+
235
+ ### 4. Test with Live Data
236
+
237
+ Try the live example with a real HuggingFace dataset:
238
+
239
+ **In the Gradio UI, Tab "📊 Analyze Leaderboard":**
240
+
241
+ ```
242
+ Leaderboard Repository: kshitijthakkar/smoltrace-leaderboard
243
+ Metric Focus: overall
244
+ Time Range: last_week
245
+ Top N Models: 5
246
+ ```
247
+
248
+ Click "🔍 Analyze" and get AI-powered insights from live data!
249
+
250
+ ## MCP Integration
251
+
252
+ ### How It Works
253
+
254
+ This Gradio app uses `mcp_server=True` in the launch configuration, which automatically:
255
+ - Exposes all async functions with proper docstrings as MCP tools
256
+ - Handles MCP protocol communication
257
+ - Provides a standard MCP interface via SSE (Server-Sent Events)
258
+
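+ A minimal sketch of the pattern (hypothetical `echo` tool; assumes `gradio[mcp]` is installed):
+
+ ```python
+ # Any typed, docstring'd function becomes an MCP tool when
+ # mcp_server=True is passed to launch().
+ import gradio as gr
+
+ def echo(text: str) -> str:
+     """Echo the input text back unchanged.
+
+     Args:
+         text (str): Text to echo back
+     """
+     return text
+
+ demo = gr.Interface(fn=echo, inputs="text", outputs="text")
+ demo.launch(mcp_server=True)  # also serves /gradio_api/mcp/sse
+ ```
+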
259
+ ### Connecting from MCP Clients
260
+
261
+ Once deployed to HuggingFace Spaces, your MCP server will be available at:
262
+
263
+ **MCP Endpoint (SSE)**:
264
+ ```
265
+ https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
266
+ ```
267
+
268
+ **Schema Endpoint**:
269
+ ```
270
+ https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/schema
271
+ ```
272
+
273
+ Configure your MCP client (Claude Desktop, Cursor, Cline, etc.) with the SSE endpoint.
274
+
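+ For example, a typical Claude Desktop entry looks roughly like this (exact config format varies by client; `mcp-remote` bridges SSE servers for clients that only speak stdio):
+
+ ```json
+ {
+   "mcpServers": {
+     "tracemind": {
+       "command": "npx",
+       "args": ["mcp-remote", "https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"]
+     }
+   }
+ }
+ ```
+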
275
+ ### Available MCP Components
276
+
277
+ **Tools** (5):
278
+ 1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
279
+ 2. **debug_trace**: Trace debugging with AI insights
280
+ 3. **estimate_cost**: Cost estimation with optimization recommendations
281
+ 4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
282
+ 5. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
283
+
284
+ **Resources** (3):
285
+ 1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON
286
+ 2. **trace://{trace_id}/{repo}**: Direct access to trace data with spans
287
+ 3. **cost://model/{model_name}**: Model pricing and hardware cost information
288
+
289
+ **Prompts** (3):
290
+ 1. **analysis_prompt**: Reusable templates for different analysis types
291
+ 2. **debug_prompt**: Reusable templates for debugging scenarios
292
+ 3. **optimization_prompt**: Reusable templates for optimization goals
293
+
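+ The same endpoints are also reachable over Gradio's regular HTTP API, which is handy for smoke-testing (a sketch; endpoint names can be listed with `view_api()`):
+
+ ```python
+ # Hedged sketch: probe the deployed Space with gradio_client
+ from gradio_client import Client
+
+ client = Client("kshitijthakkar/TraceMind-mcp-server")
+ client.view_api()  # prints available endpoints and their parameters
+ ```
+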
294
+ See full API documentation in the Gradio interface under "📖 API Documentation" tab.
295
+
296
+ ## Architecture
297
+
298
+ ```
299
+ TraceMind-mcp-server/
300
+ ├── app.py # Gradio UI + MCP server (mcp_server=True)
301
+ ├── gemini_client.py # Google Gemini 2.5 Pro integration
302
+ ├── mcp_tools.py # 5 tools + 3 resources + 3 prompts (11 MCP components)
303
+ ├── requirements.txt # Python dependencies
304
+ ├── .env.example # Environment variable template
305
+ ├── .gitignore
306
+ └── README.md
307
+ ```
308
+
309
+ **Key Technologies**:
310
+ - **Gradio 6 with MCP support**: `gradio[mcp]` provides native MCP server capabilities
311
+ - **Google Gemini 2.5 Pro**: Latest AI model for intelligent analysis
312
+ - **HuggingFace Datasets**: Data source for evaluations
313
+ - **SSE Transport**: Server-Sent Events for real-time MCP communication
314
+
315
+ ## Deploy to HuggingFace Spaces
316
+
317
+ ### 1. Create Space
318
+
319
+ Go to https://huggingface.co/new-space
320
+
321
+ - **Space name**: `TraceMind-mcp-server`
322
+ - **License**: AGPL-3.0 (to match this repository's LICENSE file)
323
+ - **SDK**: Gradio
324
+ - **Hardware**: CPU Basic (free tier works fine)
325
+
326
+ ### 2. Add Files
327
+
328
+ Upload all files from this repository to your Space:
329
+ - `app.py`
330
+ - `gemini_client.py`
331
+ - `mcp_tools.py`
332
+ - `requirements.txt`
333
+ - `README.md`
334
+
335
+ ### 3. Add Secrets
336
+
337
+ In Space settings → Variables and secrets, add:
338
+ - `GEMINI_API_KEY`: Your Gemini API key
339
+ - `HF_TOKEN`: Your HuggingFace token
340
+
341
+ ### 4. Add Hackathon Tag
342
+
343
+ In Space settings → Tags, add:
344
+ - `building-mcp-track-enterprise`
345
+
346
+ ### 5. Access Your MCP Server
347
+
348
+ Your MCP server will be publicly available at:
349
+ ```
350
+ https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server
351
+ ```
352
+
353
+ ## Testing
354
+
355
+ ### Test 1: Analyze Leaderboard (Live Data)
356
+
357
+ ```bash
358
+ # In Gradio UI - Tab "📊 Analyze Leaderboard":
359
+ Repository: kshitijthakkar/smoltrace-leaderboard
360
+ Metric: overall
361
+ Time Range: last_week
362
+ Top N: 5
363
+ Click "🔍 Analyze"
364
+ ```
365
+
366
+ **Expected**: AI-generated analysis of the top-performing models from the live HuggingFace dataset
367
+
368
+ ### Test 2: Estimate Cost
369
+
370
+ ```bash
371
+ # In Gradio UI - Tab "💰 Estimate Cost":
372
+ Model: openai/gpt-4
373
+ Agent Type: both
374
+ Number of Tests: 100
375
+ Hardware: auto
376
+ Click "💰 Estimate"
377
+ ```
378
+
379
+ **Expected**: Cost breakdown with LLM costs, HF Jobs costs, duration, and CO2 estimate
380
+
381
+ ### Test 3: Debug Trace
382
+
383
+ Note: This requires actual trace data from an evaluation run. Without real trace data, the tool returns an error about missing data, which is the expected behavior.
384
+
385
+ ## Hackathon Submission
386
+
387
+ ### Track 1: Building MCP (Enterprise)
388
+
389
+ **Tag**: `building-mcp-track-enterprise`
390
+
391
+ **Why Enterprise Track?**
392
+ - Solves real business problems (cost optimization, debugging, decision support)
393
+ - Production-ready tools with clear ROI
394
+ - Integrates with enterprise data infrastructure (HuggingFace datasets)
395
+
396
+ **Technology Stack**
397
+ - **AI Analysis**: Google Gemini 2.5 Pro for all intelligent insights
398
+ - **MCP Framework**: Gradio 6 with native MCP support
399
+ - **Data Source**: HuggingFace Datasets
400
+ - **Transport**: SSE (Server-Sent Events)
401
+
402
+ ## Related Project: TraceMind UI (Track 2)
403
+
404
+ This MCP server is designed to be consumed by **TraceMind UI** (separate submission for Track 2: MCP in Action).
405
+
406
+ TraceMind UI is a Gradio-based agent evaluation platform that uses these MCP tools to provide:
407
+ - AI-powered leaderboard insights
408
+ - Interactive trace debugging
409
+ - Pre-evaluation cost estimates
410
+
411
+ ## File Descriptions
412
+
413
+ ### app.py
414
+ Main Gradio application with:
415
+ - Testing UI for all 5 tools
416
+ - MCP server enabled via `mcp_server=True`
417
+ - API documentation
418
+
419
+ ### gemini_client.py
420
+ Google Gemini 2.5 Pro client that:
421
+ - Handles API authentication
422
+ - Provides specialized analysis methods for different data types
423
+ - Formats prompts for optimal results
424
+ - Uses `gemini-2.5-flash` by default (can be switched to other Gemini models, e.g. `gemini-2.5-pro`)
425
+
426
+ ### mcp_tools.py
427
+ Complete MCP implementation with 11 components:
428
+
429
+ **Tools** (5 async functions):
430
+ - `analyze_leaderboard()`: AI-powered leaderboard analysis
431
+ - `debug_trace()`: AI-powered trace debugging
432
+ - `estimate_cost()`: AI-powered cost estimation
433
+ - `compare_runs()`: AI-powered run comparison
434
+ - `get_dataset()`: Load SMOLTRACE datasets as JSON
435
+
436
+ **Resources** (3 decorated functions with `@gr.mcp.resource()`):
437
+ - `get_leaderboard_data()`: Raw leaderboard JSON data
438
+ - `get_trace_data()`: Raw trace JSON data with spans
439
+ - `get_cost_data()`: Model pricing and hardware cost JSON
440
+
441
+ **Prompts** (3 decorated functions with `@gr.mcp.prompt()`):
442
+ - `analysis_prompt()`: Templates for different analysis types
443
+ - `debug_prompt()`: Templates for debugging scenarios
444
+ - `optimization_prompt()`: Templates for optimization goals
445
+
446
+ Each function includes:
447
+ - Appropriate decorator (`@gr.mcp.tool()`, `@gr.mcp.resource()`, or `@gr.mcp.prompt()`)
448
+ - Detailed docstring with "Args:" section
449
+ - Type hints for all parameters and return values
450
+ - Descriptive function name (becomes the MCP component name)
451
+
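+ A hedged sketch of the resource/prompt flavor of this convention (decorator arguments mirror how this repo uses them; exact signatures may differ across Gradio versions, and the function bodies are placeholders):
+
+ ```python
+ import json
+ import gradio as gr
+
+ @gr.mcp.resource("cost://model/{model_name}")
+ def get_model_cost(model_name: str) -> str:
+     """Return pricing info for a model as a JSON string.
+
+     Args:
+         model_name (str): Model identifier, e.g. "openai/gpt-4"
+     """
+     return json.dumps({"model": model_name, "input_per_1m_usd": 30.0})  # placeholder values
+
+ @gr.mcp.prompt()
+ def example_analysis_prompt(analysis_type: str = "leaderboard") -> str:
+     """Return a reusable analysis prompt template.
+
+     Args:
+         analysis_type (str): One of "leaderboard", "trace", "cost"
+     """
+     return f"Provide a detailed {analysis_type} analysis."
+ ```
+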
452
+ ## Environment Variables
453
+
454
+ Required environment variables:
455
+
456
+ ```bash
457
+ GEMINI_API_KEY=your_gemini_api_key_here
458
+ HF_TOKEN=your_huggingface_token_here
459
+ ```
460
+
461
+ ## Development
462
+
463
+ ### Running Tests
464
+
465
+ ```bash
466
+ # Test Gemini client
467
+ python -c "from gemini_client import GeminiClient; client = GeminiClient(); print('✅ Gemini client initialized')"
468
+
469
+ # Test with live leaderboard data
470
+ python app.py
471
+ # Open browser, test "Analyze Leaderboard" tab
472
+ ```
473
+
474
+ ### Adding New Tools
475
+
476
+ To add a new MCP tool (with Gradio's native MCP support):
477
+
478
+ 1. **Add function to `mcp_tools.py`** with proper docstring:
479
+ ```python
480
+ async def your_new_tool(
481
+ gemini_client: GeminiClient,
482
+ param1: str,
483
+ param2: int = 10
484
+ ) -> str:
485
+ """
486
+ Brief description of what the tool does.
487
+
488
+ Longer description explaining the tool's purpose and behavior.
489
+
490
+ Args:
491
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
492
+ param1 (str): Description of param1 with examples if helpful
493
+ param2 (int): Description of param2. Default: 10
494
+
495
+ Returns:
496
+ str: Description of what the function returns
497
+ """
498
+ # Your implementation
499
+ return result
500
+ ```
501
+
502
+ 2. **Add UI tab in `app.py`** (optional, for testing):
503
+ ```python
504
+ with gr.Tab("Your Tool"):
505
+ # Add UI components
506
+ # Wire up to your_new_tool()
507
+ ```
508
+
509
+ 3. That's it! Gradio automatically exposes it as an MCP tool based on:
510
+ - Function name (becomes tool name)
511
+ - Docstring (becomes tool description)
512
+ - Args section (becomes parameter descriptions)
513
+ - Type hints (become parameter types)
514
+
515
+ ### Switching Gemini Models
+
+ The client defaults to `gemini-2.5-flash`. For deeper (but slower and costlier) analysis, switch to Gemini 2.5 Pro:
+
+ ```python
+ # In app.py, change:
+ gemini_client = GeminiClient(model_name="gemini-2.5-pro")
+ ```
523
+
524
+ ## 🙏 Credits & Acknowledgments
525
+
526
+ ### Hackathon Sponsors
527
+
528
+ Special thanks to the sponsors of **MCP's 1st Birthday Hackathon** (November 14-30, 2025):
529
+
530
+ - **🤗 HuggingFace** - Hosting platform and dataset infrastructure
531
+ - **🧠 Google Gemini** - AI analysis powered by Gemini 2.5 Pro API
532
+ - **⚡ Modal** - Serverless infrastructure partner
533
+ - **🏢 Anthropic** - MCP protocol creators
534
+ - **🎨 Gradio** - Native MCP framework support
535
+ - **🎙️ ElevenLabs** - Audio AI capabilities
536
+ - **🦙 SambaNova** - High-performance AI infrastructure
537
+ - **🎯 Blaxel** - Additional compute credits
538
+
539
+ ### Related Open Source Projects
540
+
541
+ This MCP server builds upon our open source agent evaluation ecosystem:
542
+
543
+ #### 📊 SMOLTRACE - Agent Evaluation Engine
544
+ - **Description**: Lightweight, production-ready evaluation framework for AI agents with OpenTelemetry instrumentation
545
+ - **GitHub**: [https://github.com/Mandark-droid/SMOLTRACE](https://github.com/Mandark-droid/SMOLTRACE)
546
+ - **PyPI**: [https://pypi.org/project/smoltrace/](https://pypi.org/project/smoltrace/)
547
+ - **Social**: [@smoltrace on X](https://twitter.com/smoltrace)
548
+
549
+ #### 🔭 TraceVerde - GenAI OpenTelemetry Instrumentation
550
+ - **Description**: Automatic OpenTelemetry instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
551
+ - **GitHub**: [https://github.com/Mandark-droid/genai_otel_instrument](https://github.com/Mandark-droid/genai_otel_instrument)
552
+ - **PyPI**: [https://pypi.org/project/genai-otel-instrument](https://pypi.org/project/genai-otel-instrument)
553
+ - **Social**: [@genai_otel on X](https://twitter.com/genai_otel)
554
+
555
+ ### Built By
556
+
557
+ **Track**: Building MCP (Enterprise)
558
+ **Author**: Kshitij Thakkar
559
+ **Powered by**: Google Gemini 2.5 Pro
560
+ **Built with**: Gradio 6 (native MCP support)
561
+
562
+ ---
563
+
564
+ ## 📄 License
565
+
566
+ AGPL-3.0 License
567
+
568
+ This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
569
+
570
+ ---
571
+
572
+ ## 💬 Support
573
+
574
+ For issues or questions:
575
+ - 📧 Open an issue on GitHub
576
+ - 💬 Join the [HuggingFace Discord](https://discord.gg/huggingface) - Channel: `#agents-mcp-hackathon-winter25`
577
+ - 🏷️ Tag `building-mcp-track-enterprise` for hackathon-related questions
578
+ - 🐦 Follow us on X: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)
579
+
580
+ ## Changelog
581
+
582
+ ### v1.0.0 (2025-11-14)
583
+ - Initial release for MCP Hackathon
584
+ - **Complete MCP Implementation**: 11 components total
585
+ - 5 AI-powered tools (analyze_leaderboard, debug_trace, estimate_cost, compare_runs, get_dataset)
586
+ - 3 data resources (leaderboard, trace, cost data)
587
+ - 3 prompt templates (analysis, debug, optimization)
588
+ - Gradio native MCP support with decorators (`@gr.mcp.*`)
589
+ - Google Gemini 2.5 Pro integration for all AI analysis
590
+ - Live HuggingFace dataset integration
591
+ - SSE transport for MCP communication
592
+ - Production-ready for HuggingFace Spaces deployment
app.py ADDED
@@ -0,0 +1,1006 @@
1
+ """
2
+ TraceMind MCP Server - Gradio Interface with MCP Support
3
+
4
+ This server provides AI-powered analysis tools for agent evaluation data:
5
+ 1. analyze_leaderboard: Summarize trends and insights from leaderboard
6
+ 2. debug_trace: Debug specific agent execution traces
7
+ 3. estimate_cost: Predict evaluation costs before running
8
+ 4. compare_runs: Compare two evaluation runs with AI-powered analysis
9
+ 5. get_dataset: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
10
+ """
11
+
12
+ import os
13
+ import gradio as gr
14
+ from typing import Optional, Dict, Any
15
+ from datetime import datetime
16
+
17
+ # Local imports
18
+ from gemini_client import GeminiClient
19
+ from mcp_tools import (
20
+ analyze_leaderboard,
21
+ debug_trace,
22
+ estimate_cost,
23
+ compare_runs,
24
+ get_dataset
25
+ )
26
+
27
+ # Initialize default Gemini client (fallback if user doesn't provide key)
28
+ try:
29
+ default_gemini_client = GeminiClient()
30
+ except ValueError:
31
+ default_gemini_client = None # Will prompt user to enter API key
32
+
33
+ # Gradio Interface for Testing
34
+ def create_gradio_ui():
35
+ """Create Gradio UI for testing MCP tools"""
36
+
37
+ with gr.Blocks(title="TraceMind MCP Server", theme=gr.themes.Soft()) as demo:
38
+ gr.Markdown("""
39
+ # 🤖 TraceMind MCP Server
40
+
41
+ **AI-Powered Analysis for Agent Evaluation Data**
42
+
43
+ This server provides **5 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
44
+
45
+ ### MCP Tools (AI-Powered)
46
+ - 📊 **Analyze Leaderboard**: Get insights from evaluation results
47
+ - 🐛 **Debug Trace**: Understand what happened in a specific test
48
+ - 💰 **Estimate Cost**: Predict evaluation costs before running
49
+ - ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
50
+ - 📦 **Get Dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
51
+
52
+ ### MCP Resources (Data Access)
53
+ - 📊 **leaderboard://{repo}**: Raw leaderboard data
54
+ - 🔍 **trace://{trace_id}/{repo}**: Raw trace data
55
+ - 💰 **cost://model/{model_name}**: Model pricing data
56
+
57
+ ### MCP Prompts (Templates)
58
+ - 📝 **analysis_prompt**: Templates for analysis requests
59
+ - 🐛 **debug_prompt**: Templates for debugging traces
60
+ - ⚡ **optimization_prompt**: Templates for optimization recommendations
61
+
62
+ All powered by **Google Gemini 2.5 Pro**.
63
+
64
+ ## For Track 2 Integration
65
+
66
+ **HuggingFace Space URL**: `https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server`
67
+
68
+ **MCP Endpoint**: `https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/sse`
69
+
70
+ **Schema**: `https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/schema`
71
+ """)
72
+
73
+ # Session state for API keys
74
+ gemini_key_state = gr.State(value=os.getenv("GEMINI_API_KEY", ""))
75
+ hf_token_state = gr.State(value=os.getenv("HF_TOKEN", ""))
76
+
77
+ with gr.Tabs():
78
+ # Tab 0: Settings (API Keys)
79
+ with gr.Tab("⚙️ Settings"):
80
+ gr.Markdown("""
81
+ ## 🔑 API Key Configuration
82
+
83
+ Configure your API keys here. These will override environment variables for this session only.
84
+
85
+ **Why configure here?**
86
+ - No need to set environment variables
87
+ - Test with different API keys easily
88
+ - Secure session-only storage (not persisted)
89
+
90
+ **Security Note**: API keys are stored in session state only and are not saved permanently.
91
+ """)
92
+
93
+ with gr.Row():
94
+ with gr.Column():
95
+ gr.Markdown("### Google Gemini API Key")
96
+ gemini_key_input = gr.Textbox(
97
+ label="Gemini API Key",
98
+ placeholder="Enter your Google Gemini API key",
99
+ type="password",
100
+ value=os.getenv("GEMINI_API_KEY", ""),
101
+ info="Get your key from: https://aistudio.google.com/app/apikey"
102
+ )
103
+ gemini_status = gr.Markdown("Status: Using environment variable" if os.getenv("GEMINI_API_KEY") else "⚠️ Status: No API key configured")
104
+
105
+ with gr.Column():
106
+ gr.Markdown("### HuggingFace Token")
107
+ hf_token_input = gr.Textbox(
108
+ label="HuggingFace Token",
109
+ placeholder="Enter your HuggingFace token",
110
+ type="password",
111
+ value=os.getenv("HF_TOKEN", ""),
112
+ info="Get your token from: https://huggingface.co/settings/tokens"
113
+ )
114
+ hf_status = gr.Markdown("Status: Using environment variable" if os.getenv("HF_TOKEN") else "⚠️ Status: No token configured")
115
+
116
+ with gr.Row():
117
+ save_keys_button = gr.Button("💾 Save API Keys for This Session", variant="primary", size="lg")
118
+ clear_keys_button = gr.Button("🗑️ Clear Session Keys", variant="secondary")
119
+
120
+ keys_save_status = gr.Markdown("")
121
+
122
+ def save_api_keys(gemini_key, hf_token):
123
+ """
124
+ Save API keys to session state.
125
+
126
+ Args:
127
+ gemini_key (str): Google Gemini API key
128
+ hf_token (str): HuggingFace token
129
+
130
+ Returns:
131
+ tuple: Updated state values and status message
132
+ """
133
+ status_messages = []
134
+
135
+ # Validate and save Gemini key
136
+ if gemini_key and gemini_key.strip():
137
+ try:
138
+ # Test the key by creating a client
139
+ test_client = GeminiClient(api_key=gemini_key.strip())
140
+ gemini_saved = gemini_key.strip()
141
+ status_messages.append("✅ Gemini API key validated and saved")
142
+ except Exception as e:
143
+ gemini_saved = os.getenv("GEMINI_API_KEY", "")
144
+ status_messages.append(f"❌ Gemini API key invalid: {str(e)}")
145
+ else:
146
+ gemini_saved = os.getenv("GEMINI_API_KEY", "")
147
+ status_messages.append("ℹ️ Gemini API key cleared (using environment variable if set)")
148
+
149
+ # Validate and save HF token
150
+ if hf_token and hf_token.strip():
151
+ hf_saved = hf_token.strip()
152
+ status_messages.append("✅ HuggingFace token saved")
153
+ else:
154
+ hf_saved = os.getenv("HF_TOKEN", "")
155
+ status_messages.append("ℹ️ HuggingFace token cleared (using environment variable if set)")
156
+
157
+ status_markdown = "\n\n".join(status_messages)
158
+
159
+ return gemini_saved, hf_saved, f"### Save Status\n\n{status_markdown}"
160
+
161
+ def clear_api_keys():
162
+ """
163
+ Clear session API keys and revert to environment variables.
164
+
165
+ Returns:
166
+ tuple: Cleared state values and status message
167
+ """
168
+ env_gemini = os.getenv("GEMINI_API_KEY", "")
169
+ env_hf = os.getenv("HF_TOKEN", "")
170
+
171
+ status = "### Keys Cleared\n\nReverted to environment variables.\n\n"
172
+ if env_gemini:
173
+ status += "✅ Using GEMINI_API_KEY from environment\n\n"
174
+ else:
175
+ status += "⚠️ No GEMINI_API_KEY in environment\n\n"
176
+
177
+ if env_hf:
178
+ status += "✅ Using HF_TOKEN from environment"
179
+ else:
180
+ status += "⚠️ No HF_TOKEN in environment"
181
+
182
+ return env_gemini, env_hf, status
183
+
184
+ save_keys_button.click(
185
+ fn=save_api_keys,
186
+ inputs=[gemini_key_input, hf_token_input],
187
+ outputs=[gemini_key_state, hf_token_state, keys_save_status]
188
+ )
189
+
190
+ clear_keys_button.click(
191
+ fn=clear_api_keys,
192
+ inputs=[],
193
+ outputs=[gemini_key_state, hf_token_state, keys_save_status]
194
+ )
195
+
196
+ gr.Markdown("""
197
+ ---
198
+
199
+ ### How It Works
200
+
201
+ 1. **Enter your API keys** in the fields above
202
+ 2. **Click "Save API Keys"** to validate and store them for this session
203
+ 3. **Use any tool** - they will automatically use your configured keys
204
+ 4. **Keys are session-only** - they won't be saved when you close the browser
205
+
206
+ ### Environment Variables (Alternative)
207
+
208
+ You can also set these as environment variables:
209
+ ```bash
210
+ export GEMINI_API_KEY="your-key-here"
211
+ export HF_TOKEN="your-token-here"
212
+ ```
213
+
214
+ UI-configured keys will always override environment variables.
215
+ """)
216
+
217
+ # Tab 1: Analyze Leaderboard
218
+ with gr.Tab("📊 Analyze Leaderboard"):
219
+ gr.Markdown("### Get AI-powered insights from evaluation leaderboard")
220
+
221
+ with gr.Row():
222
+ with gr.Column():
223
+ lb_repo = gr.Textbox(
224
+ label="Leaderboard Repository",
225
+ value="kshitijthakkar/smoltrace-leaderboard",
226
+ placeholder="username/dataset-name"
227
+ )
228
+ lb_metric = gr.Dropdown(
229
+ label="Metric Focus",
230
+ choices=["overall", "accuracy", "cost", "latency", "co2"],
231
+ value="overall"
232
+ )
233
+ lb_time = gr.Dropdown(
234
+ label="Time Range",
235
+ choices=["last_week", "last_month", "all_time"],
236
+ value="last_week"
237
+ )
238
+ lb_top_n = gr.Slider(
239
+ label="Top N Models",
240
+ minimum=3,
241
+ maximum=10,
242
+ value=5,
243
+ step=1
244
+ )
245
+ lb_button = gr.Button("🔍 Analyze", variant="primary")
246
+
247
+ with gr.Column():
248
+ lb_output = gr.Markdown(label="Analysis Results")
249
+
250
+ async def run_analyze_leaderboard(repo, metric, time_range, top_n, gemini_key, hf_token):
251
+ """
252
+ Analyze agent evaluation leaderboard and generate AI-powered insights.
253
+
254
+ This tool loads agent evaluation data from HuggingFace datasets and uses
255
+ Google Gemini 2.5 Pro to provide intelligent analysis of top performers,
256
+ trends, cost/performance trade-offs, and actionable recommendations.
257
+
258
+ Args:
259
+ repo (str): HuggingFace dataset repository containing leaderboard data
260
+ metric (str): Primary metric to focus analysis on - "overall", "accuracy", "cost", "latency", or "co2"
261
+ time_range (str): Time range for analysis - "last_week", "last_month", or "all_time"
262
+ top_n (int): Number of top models to highlight in analysis (3-10)
263
+ gemini_key (str): Gemini API key from session state
264
+ hf_token (str): HuggingFace token from session state
265
+
266
+ Returns:
267
+ str: Markdown-formatted analysis with top performers, trends, and recommendations
268
+ """
269
+ try:
270
+ # Create GeminiClient with user-provided key or fallback to default
271
+ if gemini_key and gemini_key.strip():
272
+ client = GeminiClient(api_key=gemini_key)
273
+ elif default_gemini_client:
274
+ client = default_gemini_client
275
+ else:
276
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
277
+
278
+ result = await analyze_leaderboard(
279
+ gemini_client=client,
280
+ leaderboard_repo=repo,
281
+ metric_focus=metric,
282
+ time_range=time_range,
283
+ top_n=int(top_n),
284
+ hf_token=hf_token if hf_token and hf_token.strip() else None
285
+ )
286
+ return result
287
+ except Exception as e:
288
+ return f"❌ **Error**: {str(e)}"
289
+
290
+ lb_button.click(
291
+ fn=run_analyze_leaderboard,
292
+ inputs=[lb_repo, lb_metric, lb_time, lb_top_n, gemini_key_state, hf_token_state],
293
+ outputs=[lb_output]
294
+ )
295
+
296
+ # Tab 2: Debug Trace
297
+ with gr.Tab("🐛 Debug Trace"):
298
+ gr.Markdown("### Ask questions about specific agent execution traces")
299
+
300
+ with gr.Row():
301
+ with gr.Column():
302
+ trace_id = gr.Textbox(
303
+ label="Trace ID",
304
+ placeholder="trace_abc123",
305
+ info="Get this from the Run Detail screen"
306
+ )
307
+ traces_repo = gr.Textbox(
308
+ label="Traces Repository",
309
+ placeholder="username/agent-traces-model-timestamp",
310
+ info="Dataset containing trace data"
311
+ )
312
+ question = gr.Textbox(
313
+ label="Your Question",
314
+ placeholder="Why was tool X called twice?",
315
+ lines=3
316
+ )
317
+ trace_button = gr.Button("🔍 Analyze", variant="primary")
318
+
319
+ with gr.Column():
320
+ trace_output = gr.Markdown(label="Debug Analysis")
321
+
322
+ async def run_debug_trace(trace_id_val, traces_repo_val, question_val, gemini_key, hf_token):
323
+ """
324
+ Debug a specific agent execution trace using OpenTelemetry data.
325
+
326
+ This tool analyzes OpenTelemetry trace data from agent executions and uses
327
+ Google Gemini 2.5 Pro to answer specific questions about the execution flow,
328
+ identify bottlenecks, explain agent behavior, and provide debugging insights.
329
+
330
+ Args:
331
+ trace_id_val (str): Unique identifier for the trace to analyze (e.g., "trace_abc123")
332
+ traces_repo_val (str): HuggingFace dataset repository containing trace data
333
+ question_val (str): Specific question about the trace (optional, defaults to general analysis)
334
+ gemini_key (str): Gemini API key from session state
335
+ hf_token (str): HuggingFace token from session state
336
+
337
+ Returns:
338
+ str: Markdown-formatted debug analysis with step-by-step breakdown and answers
339
+ """
340
+ try:
341
+ if not trace_id_val or not traces_repo_val:
342
+ return "❌ **Error**: Please provide both Trace ID and Traces Repository"
343
+
344
+ # Create GeminiClient with user-provided key or fallback to default
345
+ if gemini_key and gemini_key.strip():
346
+ client = GeminiClient(api_key=gemini_key)
347
+ elif default_gemini_client:
348
+ client = default_gemini_client
349
+ else:
350
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
351
+
352
+ result = await debug_trace(
353
+ gemini_client=client,
354
+ trace_id=trace_id_val,
355
+ traces_repo=traces_repo_val,
356
+ question=question_val or "Analyze this trace",
357
+ hf_token=hf_token if hf_token and hf_token.strip() else None
358
+ )
359
+ return result
360
+ except Exception as e:
361
+ return f"❌ **Error**: {str(e)}"
362
+
363
+ trace_button.click(
364
+ fn=run_debug_trace,
365
+ inputs=[trace_id, traces_repo, question, gemini_key_state, hf_token_state],
366
+ outputs=[trace_output]
367
+ )
368
+
369
+ # Tab 3: Estimate Cost
370
+ with gr.Tab("💰 Estimate Cost"):
371
+ gr.Markdown("### Predict evaluation costs before running")
372
+
373
+ with gr.Row():
374
+ with gr.Column():
375
+ cost_model = gr.Textbox(
376
+ label="Model",
377
+ placeholder="openai/gpt-4 or meta-llama/Llama-3.1-8B",
378
+ info="Use litellm format (provider/model)"
379
+ )
380
+ cost_agent_type = gr.Dropdown(
381
+ label="Agent Type",
382
+ choices=["tool", "code", "both"],
383
+ value="both"
384
+ )
385
+ cost_num_tests = gr.Slider(
386
+ label="Number of Tests",
387
+ minimum=10,
388
+ maximum=1000,
389
+ value=100,
390
+ step=10
391
+ )
392
+ cost_hardware = gr.Dropdown(
393
+ label="Hardware Type",
394
+ choices=["auto", "cpu", "gpu_a10", "gpu_h200"],
395
+ value="auto",
396
+ info="'auto' will choose based on model type"
397
+ )
398
+ cost_button = gr.Button("💰 Estimate", variant="primary")
399
+
400
+ with gr.Column():
401
+ cost_output = gr.Markdown(label="Cost Estimate")
402
+
403
+ async def run_estimate_cost(model, agent_type, num_tests, hardware, gemini_key):
404
+ """
405
+ Estimate the cost, duration, and CO2 emissions of running agent evaluations.
406
+
407
+ This tool predicts costs before running evaluations by calculating LLM API costs,
408
+ HuggingFace Jobs compute costs, and CO2 emissions. Uses Google Gemini 2.5 Pro
409
+ to provide detailed cost breakdown and optimization recommendations.
410
+
411
+ Args:
412
+ model (str): Model identifier in litellm format (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
413
+ agent_type (str): Type of agent capabilities to test - "tool", "code", or "both"
414
+ num_tests (int): Number of test cases to run (10-1000)
415
+ hardware (str): Hardware type for HF Jobs - "auto", "cpu", "gpu_a10", or "gpu_h200"
416
+ gemini_key (str): Gemini API key from session state
417
+
418
+ Returns:
419
+ str: Markdown-formatted cost estimate with LLM costs, HF Jobs costs, duration, CO2, and tips
420
+ """
421
+ try:
422
+ if not model:
423
+ return "❌ **Error**: Please provide a model name"
424
+
425
+ # Create GeminiClient with user-provided key or fallback to default
426
+ if gemini_key and gemini_key.strip():
427
+ client = GeminiClient(api_key=gemini_key)
428
+ elif default_gemini_client:
429
+ client = default_gemini_client
430
+ else:
431
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
432
+
433
+ result = await estimate_cost(
434
+ gemini_client=client,
435
+ model=model,
436
+ agent_type=agent_type,
437
+ num_tests=int(num_tests),
438
+ hardware=hardware
439
+ )
440
+ return result
441
+ except Exception as e:
442
+ return f"❌ **Error**: {str(e)}"
443
+
444
+ cost_button.click(
445
+ fn=run_estimate_cost,
446
+ inputs=[cost_model, cost_agent_type, cost_num_tests, cost_hardware, gemini_key_state],
447
+ outputs=[cost_output]
448
+ )
449
+
450
+ # Tab 4: Compare Runs
451
+ with gr.Tab("⚖️ Compare Runs"):
452
+ gr.Markdown("""
453
+ ## Compare Two Evaluation Runs
454
+
455
+ Compare two evaluation runs with AI-powered analysis across multiple dimensions:
456
+ success rate, cost efficiency, speed, environmental impact, and more.
457
+ """)
458
+
459
+ with gr.Row():
460
+ with gr.Column():
461
+ compare_run_id_1 = gr.Textbox(
462
+ label="First Run ID",
463
+ placeholder="e.g., run_abc123",
464
+ info="Enter the run_id from the leaderboard"
465
+ )
466
+ with gr.Column():
467
+ compare_run_id_2 = gr.Textbox(
468
+ label="Second Run ID",
469
+ placeholder="e.g., run_xyz789",
470
+ info="Enter the run_id to compare against"
471
+ )
472
+
473
+ with gr.Row():
474
+ compare_focus = gr.Dropdown(
475
+ choices=["comprehensive", "cost", "performance", "eco_friendly"],
476
+ value="comprehensive",
477
+ label="Comparison Focus",
478
+ info="Choose what aspect to focus the comparison on"
479
+ )
480
+ compare_repo = gr.Textbox(
481
+ label="Leaderboard Repository",
482
+ value="kshitijthakkar/smoltrace-leaderboard",
483
+ info="HuggingFace dataset containing leaderboard data"
484
+ )
485
+
486
+ compare_button = gr.Button("🔍 Compare Runs", variant="primary")
487
+ compare_output = gr.Markdown()
488
+
489
+ async def run_compare_runs(run_id_1, run_id_2, focus, repo, gemini_key, hf_token):
490
+ """
491
+ Compare two evaluation runs and generate AI-powered comparative analysis.
492
+
493
+ This tool fetches data for two evaluation runs from the leaderboard and uses
494
+ Google Gemini 2.5 Pro to provide intelligent comparison across multiple dimensions:
495
+ success rate, cost efficiency, speed, environmental impact, and use case recommendations.
496
+
497
+ Args:
498
+ run_id_1 (str): First run ID from the leaderboard to compare
499
+ run_id_2 (str): Second run ID from the leaderboard to compare against
500
+ focus (str): Focus area - "comprehensive", "cost", "performance", or "eco_friendly"
501
+ repo (str): HuggingFace dataset repository containing leaderboard data
502
+ gemini_key (str): Gemini API key from session state
503
+ hf_token (str): HuggingFace token from session state
504
+
505
+ Returns:
506
+ str: Markdown-formatted comparative analysis with winners, trade-offs, and recommendations
507
+ """
508
+ try:
509
+ # Create GeminiClient with user-provided key or fallback to default
510
+ if gemini_key and gemini_key.strip():
511
+ client = GeminiClient(api_key=gemini_key)
512
+ elif default_gemini_client:
513
+ client = default_gemini_client
514
+ else:
515
+ return "❌ **Error**: No Gemini API key configured. Please set it in the Settings tab."
516
+
517
+ result = await compare_runs(
518
+ gemini_client=client,
519
+ run_id_1=run_id_1,
520
+ run_id_2=run_id_2,
521
+ leaderboard_repo=repo,
522
+ comparison_focus=focus,
523
+ hf_token=hf_token if hf_token and hf_token.strip() else None
524
+ )
525
+ return result
526
+ except Exception as e:
527
+ return f"❌ **Error**: {str(e)}"
528
+
529
+ compare_button.click(
530
+ fn=run_compare_runs,
531
+ inputs=[compare_run_id_1, compare_run_id_2, compare_focus, compare_repo, gemini_key_state, hf_token_state],
532
+ outputs=[compare_output]
533
+ )
534
+
535
+ # Tab 5: Get Dataset
536
+ with gr.Tab("📦 Get Dataset"):
537
+ gr.Markdown("""
538
+ ## Load SMOLTRACE Datasets as JSON
539
+
540
+ This tool loads datasets with the **smoltrace-** prefix and returns the raw data as JSON.
541
+ Use this to access leaderboard data, results datasets, traces datasets, or metrics datasets.
542
+
543
+ **Restriction**: Only datasets with "smoltrace-" in the name are allowed for security.
544
+
545
+ **Tip**: If you don't know which dataset to load, first load the leaderboard to see
546
+ dataset references in the `results_dataset`, `traces_dataset`, `metrics_dataset` fields.
547
+ """)
548
+
549
+ with gr.Row():
550
+ dataset_repo_input = gr.Textbox(
551
+ label="Dataset Repository (must contain 'smoltrace-')",
552
+ placeholder="e.g., kshitijthakkar/smoltrace-leaderboard",
553
+ value="kshitijthakkar/smoltrace-leaderboard",
554
+ info="HuggingFace dataset repository path with smoltrace- prefix"
555
+ )
556
+ dataset_max_rows = gr.Slider(
557
+ minimum=1,
558
+ maximum=200,
559
+ value=50,
560
+ step=1,
561
+ label="Max Rows",
562
+ info="Limit rows to avoid token limits"
563
+ )
564
+
565
+ dataset_button = gr.Button("📥 Load Dataset", variant="primary")
566
+ dataset_output = gr.JSON(label="Dataset JSON Output")
567
+
568
+ async def run_get_dataset(repo, max_rows, hf_token):
569
+ """
570
+ Load SMOLTRACE datasets from HuggingFace and return as JSON.
571
+
572
+ This tool loads datasets with the "smoltrace-" prefix and returns the raw data
573
+ as JSON. Use this to access leaderboard data, results datasets, traces datasets,
574
+ or metrics datasets. Only datasets with "smoltrace-" in the name are allowed.
575
+
576
+ Args:
577
+ repo (str): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
578
+ max_rows (int): Maximum number of rows to return (1-200, default 50)
579
+ hf_token (str): HuggingFace token from session state
580
+
581
+ Returns:
582
+ dict: JSON object with dataset data, metadata, total rows, and column names
583
+ """
584
+ try:
585
+ import json
586
+ result = await get_dataset(
587
+ dataset_repo=repo,
588
+ max_rows=int(max_rows),
589
+ hf_token=hf_token if hf_token and hf_token.strip() else None
590
+ )
591
+ # Parse JSON string back to dict for JSON component
592
+ return json.loads(result)
593
+ except Exception as e:
594
+ return {"error": str(e)}
595
+
596
+ dataset_button.click(
597
+ fn=run_get_dataset,
598
+ inputs=[dataset_repo_input, dataset_max_rows, hf_token_state],
599
+ outputs=[dataset_output]
600
+ )
601
+
602
+ # Tab 6: MCP Resources & Prompts
603
+ with gr.Tab("🔌 MCP Resources & Prompts"):
604
+ gr.Markdown("""
605
+ ## MCP Resources & Prompts
606
+
607
+ Beyond the 5 MCP Tools, this server also exposes **MCP Resources** and **MCP Prompts**
608
+ that MCP clients can use directly.
609
+
610
+ ### MCP Resources (Read-Only Data Access)
611
+
612
+ Resources provide direct access to data without AI processing:
613
+
614
+ #### 1. `leaderboard://{repo}`
615
+ Get raw leaderboard data in JSON format.
616
+
617
+ **Example**: `leaderboard://kshitijthakkar/smoltrace-leaderboard`
618
+
619
+ **Returns**: JSON with all evaluation runs
620
+
621
+ #### 2. `trace://{trace_id}/{repo}`
622
+ Get raw trace data for a specific trace.
623
+
624
+ **Example**: `trace://trace_abc123/kshitijthakkar/smoltrace-traces-gpt4`
625
+
626
+ **Returns**: JSON with OpenTelemetry spans
627
+
628
+ #### 3. `cost://model/{model_name}`
629
+ Get cost information for a specific model.
630
+
631
+ **Example**: `cost://model/openai/gpt-4`
632
+
633
+ **Returns**: JSON with pricing data
634
+
635
+ ---
636
+
637
+ ### MCP Prompts (Reusable Templates)
638
+
639
+ Prompts provide standardized templates for common workflows:
640
+
641
+ #### 1. `analysis_prompt(analysis_type, focus_area, detail_level)`
642
+ Generate analysis prompt templates.
643
+
644
+ **Parameters**:
645
+ - `analysis_type`: "leaderboard", "trace", "cost"
646
+ - `focus_area`: "overall", "performance", "cost", "efficiency"
647
+ - `detail_level`: "summary", "detailed", "comprehensive"
648
+
649
+ #### 2. `debug_prompt(debug_type, context)`
650
+ Generate debugging prompt templates.
651
+
652
+ **Parameters**:
653
+ - `debug_type`: "error", "performance", "behavior", "optimization"
654
+ - `context`: "agent_execution", "tool_calling", "llm_reasoning"
655
+
656
+ #### 3. `optimization_prompt(optimization_goal, constraints)`
657
+ Generate optimization prompt templates.
658
+
659
+ **Parameters**:
660
+ - `optimization_goal`: "cost", "speed", "quality", "efficiency"
661
+ - `constraints`: "maintain_quality", "maintain_speed", "no_constraints"
662
+
663
+ ---
664
+
665
+ ### Testing MCP Resources
666
+
667
+ Test resources directly from this UI:
668
+ """)
669
+
670
+ with gr.Row():
671
+ with gr.Column():
672
+ gr.Markdown("#### Test Leaderboard Resource")
673
+ resource_lb_repo = gr.Textbox(
674
+ label="Repository",
675
+ value="kshitijthakkar/smoltrace-leaderboard"
676
+ )
677
+ resource_lb_button = gr.Button("Fetch Leaderboard Data", variant="primary")
678
+ resource_lb_output = gr.JSON(label="Resource Output")
679
+
680
+ def test_leaderboard_resource(repo):
681
+ """
682
+ Test the leaderboard MCP resource by fetching raw leaderboard data.
683
+
684
+ Args:
685
+ repo (str): HuggingFace dataset repository name
686
+
687
+ Returns:
688
+ dict: JSON object with leaderboard data
689
+ """
690
+ from mcp_tools import get_leaderboard_data
691
+ import json
692
+ result = get_leaderboard_data(repo)
693
+ return json.loads(result)
694
+
695
+ resource_lb_button.click(
696
+ fn=test_leaderboard_resource,
697
+ inputs=[resource_lb_repo],
698
+ outputs=[resource_lb_output]
699
+ )
700
+
701
+ with gr.Column():
702
+ gr.Markdown("#### Test Cost Resource")
703
+ resource_cost_model = gr.Textbox(
704
+ label="Model Name",
705
+ value="openai/gpt-4"
706
+ )
707
+ resource_cost_button = gr.Button("Fetch Cost Data", variant="primary")
708
+ resource_cost_output = gr.JSON(label="Resource Output")
709
+
710
+ def test_cost_resource(model):
711
+ """
712
+ Test the cost MCP resource by fetching model pricing data.
713
+
714
+ Args:
715
+ model (str): Model identifier (e.g., "openai/gpt-4")
716
+
717
+ Returns:
718
+ dict: JSON object with cost and pricing information
719
+ """
720
+ from mcp_tools import get_cost_data
721
+ import json
722
+ result = get_cost_data(model)
723
+ return json.loads(result)
724
+
725
+ resource_cost_button.click(
726
+ fn=test_cost_resource,
727
+ inputs=[resource_cost_model],
728
+ outputs=[resource_cost_output]
729
+ )
730
+
731
+ gr.Markdown("---")
732
+ gr.Markdown("### Testing MCP Prompts")
733
+ gr.Markdown("Generate prompt templates for different scenarios:")
734
+
735
+ with gr.Row():
736
+ with gr.Column():
737
+ prompt_type = gr.Radio(
738
+ label="Prompt Type",
739
+ choices=["analysis_prompt", "debug_prompt", "optimization_prompt"],
740
+ value="analysis_prompt"
741
+ )
742
+
743
+ # Analysis prompt params
744
+ with gr.Group(visible=True) as analysis_group:
745
+ analysis_type = gr.Dropdown(
746
+ label="Analysis Type",
747
+ choices=["leaderboard", "trace", "cost"],
748
+ value="leaderboard"
749
+ )
750
+ focus_area = gr.Dropdown(
751
+ label="Focus Area",
752
+ choices=["overall", "performance", "cost", "efficiency"],
753
+ value="overall"
754
+ )
755
+ detail_level = gr.Dropdown(
756
+ label="Detail Level",
757
+ choices=["summary", "detailed", "comprehensive"],
758
+ value="detailed"
759
+ )
760
+
761
+ # Debug prompt params
762
+ with gr.Group(visible=False) as debug_group:
763
+ debug_type = gr.Dropdown(
764
+ label="Debug Type",
765
+ choices=["error", "performance", "behavior", "optimization"],
766
+ value="error"
767
+ )
768
+ debug_context = gr.Dropdown(
769
+ label="Context",
770
+ choices=["agent_execution", "tool_calling", "llm_reasoning"],
771
+ value="agent_execution"
772
+ )
773
+
774
+ # Optimization prompt params
775
+ with gr.Group(visible=False) as optimization_group:
776
+ optimization_goal = gr.Dropdown(
777
+ label="Optimization Goal",
778
+ choices=["cost", "speed", "quality", "efficiency"],
779
+ value="cost"
780
+ )
781
+ constraints = gr.Dropdown(
782
+ label="Constraints",
783
+ choices=["maintain_quality", "maintain_speed", "no_constraints"],
784
+ value="maintain_quality"
785
+ )
786
+
787
+ prompt_button = gr.Button("Generate Prompt", variant="primary")
788
+
789
+ with gr.Column():
790
+ prompt_output = gr.Textbox(
791
+ label="Generated Prompt Template",
792
+ lines=10,
793
+ max_lines=20
794
+ )
795
+
796
+ def toggle_prompt_groups(prompt_type):
797
+ """
798
+ Toggle visibility of prompt parameter groups based on selected prompt type.
799
+
800
+ Args:
801
+ prompt_type (str): The type of prompt selected
802
+
803
+ Returns:
804
+ dict: Gradio update objects for group visibility
805
+ """
806
+ return {
807
+ analysis_group: gr.update(visible=(prompt_type == "analysis_prompt")),
808
+ debug_group: gr.update(visible=(prompt_type == "debug_prompt")),
809
+ optimization_group: gr.update(visible=(prompt_type == "optimization_prompt"))
810
+ }
811
+
812
+ prompt_type.change(
813
+ fn=toggle_prompt_groups,
814
+ inputs=[prompt_type],
815
+ outputs=[analysis_group, debug_group, optimization_group]
816
+ )
817
+
818
+ def generate_prompt(
819
+ prompt_type,
820
+ analysis_type_val, focus_area_val, detail_level_val,
821
+ debug_type_val, debug_context_val,
822
+ optimization_goal_val, constraints_val
823
+ ):
824
+ """
825
+ Generate a prompt template based on the selected type and parameters.
826
+
827
+ Args:
828
+ prompt_type (str): Type of prompt to generate
829
+ analysis_type_val (str): Analysis type parameter
830
+ focus_area_val (str): Focus area parameter
831
+ detail_level_val (str): Detail level parameter
832
+ debug_type_val (str): Debug type parameter
833
+ debug_context_val (str): Debug context parameter
834
+ optimization_goal_val (str): Optimization goal parameter
835
+ constraints_val (str): Constraints parameter
836
+
837
+ Returns:
838
+ str: Generated prompt template text
839
+ """
840
+ from mcp_tools import analysis_prompt, debug_prompt, optimization_prompt
841
+
842
+ if prompt_type == "analysis_prompt":
843
+ return analysis_prompt(analysis_type_val, focus_area_val, detail_level_val)
844
+ elif prompt_type == "debug_prompt":
845
+ return debug_prompt(debug_type_val, debug_context_val)
846
+ elif prompt_type == "optimization_prompt":
847
+ return optimization_prompt(optimization_goal_val, constraints_val)
848
+
849
+ prompt_button.click(
850
+ fn=generate_prompt,
851
+ inputs=[
852
+ prompt_type,
853
+ analysis_type, focus_area, detail_level,
854
+ debug_type, debug_context,
855
+ optimization_goal, constraints
856
+ ],
857
+ outputs=[prompt_output]
858
+ )
859
+
860
+ # Tab 7: API Documentation
861
+ with gr.Tab("📖 API Documentation"):
862
+ gr.Markdown("""
863
+ ## MCP Tool Specifications
864
+
865
+ ### 1. analyze_leaderboard
866
+
867
+ **Description**: Generate AI-powered insights from evaluation leaderboard data
868
+
869
+ **Parameters**:
870
+ - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
871
+ - `metric_focus` (str): "overall", "accuracy", "cost", "latency", or "co2" (default: "overall")
872
+ - `time_range` (str): "last_week", "last_month", or "all_time" (default: "last_week")
873
+ - `top_n` (int): Number of top models to highlight (default: 5, min: 3, max: 10)
874
+
875
+ **Returns**: Markdown-formatted analysis with top performers, trends, and recommendations
876
+
877
+ ---
878
+
879
+ ### 2. debug_trace
880
+
881
+ **Description**: Answer questions about specific agent execution traces
882
+
883
+ **Parameters**:
884
+ - `trace_id` (str, required): Unique identifier for the trace
885
+ - `traces_repo` (str, required): HuggingFace dataset repository with trace data
886
+ - `question` (str): Specific question about the trace (default: "Analyze this trace and explain what happened")
887
+
888
+ **Returns**: Markdown-formatted debug analysis with step-by-step breakdown
889
+
890
+ ---
891
+
892
+ ### 3. estimate_cost
893
+
894
+ **Description**: Predict evaluation costs before running
895
+
896
+ **Parameters**:
897
+ - `model` (str, required): Model identifier in litellm format (e.g., "openai/gpt-4")
898
+ - `agent_type` (str, required): "tool", "code", or "both"
899
+ - `num_tests` (int): Number of test cases (default: 100, min: 10, max: 1000)
900
+ - `hardware` (str): "auto", "cpu", "gpu_a10", or "gpu_h200" (default: "auto")
901
+
902
+ **Returns**: Markdown-formatted cost estimate with breakdown and optimization tips
903
+
904
+ ---
905
+
906
+ ### 4. compare_runs
907
+
908
+ **Description**: Compare two evaluation runs with AI-powered analysis
909
+
910
+ **Parameters**:
911
+ - `run_id_1` (str, required): First run ID from the leaderboard
912
+ - `run_id_2` (str, required): Second run ID to compare against
913
+ - `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
914
+ - `comparison_focus` (str): "comprehensive", "cost", "performance", or "eco_friendly" (default: "comprehensive")
915
+
916
+ **Returns**: Markdown-formatted comparative analysis with winner for each category, trade-offs, and recommendations
917
+
918
+ **Focus Options**:
919
+ - `comprehensive`: Complete comparison across all dimensions (success rate, cost, speed, CO2, GPU)
920
+ - `cost`: Detailed cost efficiency analysis and ROI
921
+ - `performance`: Speed and accuracy trade-off analysis
922
+ - `eco_friendly`: Environmental impact and carbon footprint comparison
923
+
924
+ ---
925
+
926
+ ### 5. get_dataset
927
+
928
+ **Description**: Load SMOLTRACE datasets from HuggingFace and return as JSON
929
+
930
+ **Parameters**:
931
+ - `dataset_repo` (str, required): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
932
+ - `max_rows` (int): Maximum number of rows to return (default: 50, range: 1-200)
933
+
934
+ **Returns**: JSON object with dataset data and metadata
935
+
936
+ **Restriction**: Only datasets with "smoltrace-" in the repository name are allowed for security.
937
+
938
+ **Use Cases**:
939
+ - Load smoltrace-leaderboard to find run IDs, model names, and supporting dataset references
940
+ - Load smoltrace-results-* datasets to see individual test case details
941
+ - Load smoltrace-traces-* datasets to access OpenTelemetry trace data
942
+ - Load smoltrace-metrics-* datasets to get GPU metrics and performance data
943
+
944
+ **Workflow**:
945
+ 1. Call `get_dataset("kshitijthakkar/smoltrace-leaderboard")` to see all runs
946
+ 2. Find the `results_dataset`, `traces_dataset`, or `metrics_dataset` field for a specific run
947
+ 3. Call `get_dataset(dataset_repo)` with that smoltrace-* dataset name to get detailed data
948
+
949
+ ---
950
+
951
+ ## MCP Integration
952
+
953
+ This Gradio app is MCP-enabled. When deployed to HuggingFace Spaces, it can be accessed via MCP clients.
954
+
955
+ **Space URL**: `https://huggingface.co/spaces/kshitijthakkar/TraceMind-mcp-server`
956
+
957
+ ### What's Exposed via MCP:
958
+
959
+ #### 5 MCP Tools (AI-Powered)
960
+ The five tools above (`analyze_leaderboard`, `debug_trace`, `estimate_cost`, `compare_runs`, `get_dataset`)
961
+ are automatically exposed as MCP tools and can be called from any MCP client.
962
+
963
+ #### 3 MCP Resources (Data Access)
964
+ - `leaderboard://{repo}` - Raw leaderboard data
965
+ - `trace://{trace_id}/{repo}` - Raw trace data
966
+ - `cost://model/{model_name}` - Model pricing data
967
+
968
+ #### 3 MCP Prompts (Templates)
969
+ - `analysis_prompt(analysis_type, focus_area, detail_level)` - Analysis templates
970
+ - `debug_prompt(debug_type, context)` - Debug templates
971
+ - `optimization_prompt(optimization_goal, constraints)` - Optimization templates
972
+
973
+ **See the "🔌 MCP Resources & Prompts" tab to test these features.**
974
+ """)
975
+
976
+ gr.Markdown("""
977
+ ---
978
+
979
+ ## Environment Variables
980
+
981
+ Required:
982
+ - `GEMINI_API_KEY`: Your Google Gemini API key
983
+ - `HF_TOKEN`: Your HuggingFace token (for dataset access)
984
+
985
+ ## Source Code
986
+
987
+ This server is part of the TraceMind project submission for MCP's 1st Birthday Hackathon.
988
+
989
+ **Track 1**: Building MCP (Enterprise)
990
+ **Tag**: `building-mcp-track-enterprise`
991
+ """)
992
+
993
+ return demo
994
+
995
+ if __name__ == "__main__":
996
+ # Create Gradio interface
997
+ demo = create_gradio_ui()
998
+
999
+ # Launch with MCP server enabled
1000
+ # share=True creates a temporary public HTTPS URL for testing with Claude Code
1001
+ demo.launch(
1002
+ server_name="0.0.0.0",
1003
+ server_port=7860,
1004
+ share=True, # Creates temporary HTTPS URL (e.g., https://abc123.gradio.live)
1005
+ mcp_server=True # Enable MCP server functionality
1006
+ )
gemini_client.py ADDED
@@ -0,0 +1,185 @@
1
+ """
2
+ Gemini Client for TraceMind MCP Server
3
+
4
+ Handles all interactions with the Google Gemini API (default model: gemini-2.5-flash)
5
+ """
6
+
7
+ import os
8
+ import google.generativeai as genai
9
+ from typing import Optional, Dict, Any, List
10
+ import json
11
+
12
+ class GeminiClient:
13
+ """Client for Google Gemini API"""
14
+
15
+ def __init__(self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"):
16
+ """
17
+ Initialize Gemini client
18
+
19
+ Args:
20
+ api_key: Gemini API key (defaults to GEMINI_API_KEY env var)
21
+ model_name: Model to use (default: gemini-2.5-flash, can also use gemini-2.5-flash-lite)
22
+ """
23
+ self.api_key = api_key or os.getenv("GEMINI_API_KEY")
24
+ if not self.api_key:
25
+ raise ValueError("Gemini API key not provided and GEMINI_API_KEY environment variable not set")
26
+
27
+ # Configure API
28
+ genai.configure(api_key=self.api_key)
29
+
30
+ # Initialize model
31
+ self.model = genai.GenerativeModel(model_name)
32
+
33
+ # Generation config for consistent outputs
34
+ self.generation_config = {
35
+ "temperature": 0.7,
36
+ "top_p": 0.95,
37
+ "top_k": 40,
38
+ "max_output_tokens": 8192,
39
+ }
40
+
41
+ async def analyze_with_context(
42
+ self,
43
+ data: Dict[str, Any],
44
+ analysis_type: str,
45
+ specific_question: Optional[str] = None
46
+ ) -> str:
47
+ """
48
+ Analyze data with Gemini, providing context about the analysis type
49
+
50
+ Args:
51
+ data: Data to analyze (will be converted to JSON)
52
+ analysis_type: Type of analysis ("leaderboard", "trace", "cost_estimate")
53
+ specific_question: Optional specific question to answer
54
+
55
+ Returns:
56
+ Markdown-formatted analysis
57
+ """
58
+
59
+ # Build prompt based on analysis type
60
+ if analysis_type == "leaderboard":
61
+ system_prompt = """You are an expert AI agent performance analyst.
62
+
63
+ You are analyzing evaluation leaderboard data from agent benchmarks. Your task is to:
64
+ 1. Identify top performers across key metrics (accuracy, cost, latency, CO2)
65
+ 2. Explain trade-offs between different approaches (API vs local models, GPU types)
66
+ 3. Identify trends and patterns
67
+ 4. Provide actionable recommendations
68
+
69
+ Focus on insights that would help developers choose the right agent configuration for their use case.
70
+
71
+ Format your response in clear markdown with sections for:
72
+ - **Top Performers**
73
+ - **Key Insights**
74
+ - **Trade-offs**
75
+ - **Recommendations**
76
+ """
77
+
78
+ elif analysis_type == "trace":
79
+ system_prompt = """You are an expert agent debugging specialist.
80
+
81
+ You are analyzing OpenTelemetry trace data from agent execution. Your task is to:
82
+ 1. Understand the sequence of operations (LLM calls, tool calls, etc.)
83
+ 2. Identify performance bottlenecks or inefficiencies
84
+ 3. Explain why certain decisions were made
85
+ 4. Answer the specific question asked
86
+
87
+ Focus on providing clear explanations that help developers understand agent behavior.
88
+
89
+ Format your response in clear markdown with relevant code snippets and timing information.
90
+ """
91
+
92
+ elif analysis_type == "cost_estimate":
93
+ system_prompt = """You are an expert in LLM cost optimization and cloud resource estimation.
94
+
95
+ You are estimating the cost of running agent evaluations. Your task is to:
96
+ 1. Calculate LLM API costs based on token usage patterns
97
+ 2. Estimate HuggingFace Jobs compute costs
98
+ 3. Predict CO2 emissions
99
+ 4. Provide cost optimization recommendations
100
+
101
+ Focus on giving accurate estimates with clear breakdowns.
102
+
103
+ Format your response in clear markdown with cost breakdowns and optimization tips.
104
+ """
105
+
106
+ else:
107
+ system_prompt = "You are a helpful AI assistant analyzing agent evaluation data."
108
+
109
+ # Build user prompt
110
+ data_json = json.dumps(data, indent=2)
111
+
112
+ user_prompt = f"{system_prompt}\n\n**Data to analyze:**\n```json\n{data_json}\n```\n\n"
113
+
114
+ if specific_question:
115
+ user_prompt += f"**Specific question:** {specific_question}\n\n"
116
+
117
+ user_prompt += "Provide your analysis:"
118
+
119
+ # Generate response
120
+ try:
121
+ response = await self.model.generate_content_async(
122
+ user_prompt,
123
+ generation_config=self.generation_config
124
+ )
125
+
126
+ return response.text
127
+
128
+ except Exception as e:
129
+ return f"Error generating analysis: {str(e)}"
130
+
131
+ async def generate_summary(
132
+ self,
133
+ text: str,
134
+ max_words: int = 100
135
+ ) -> str:
136
+ """
137
+ Generate a concise summary of text
138
+
139
+ Args:
140
+ text: Text to summarize
141
+ max_words: Maximum words in summary
142
+
143
+ Returns:
144
+ Summary text
145
+ """
146
+ prompt = f"Summarize the following in {max_words} words or less:\n\n{text}"
147
+
148
+ try:
149
+ response = await self.model.generate_content_async(prompt)
150
+ return response.text
151
+ except Exception as e:
152
+ return f"Error generating summary: {str(e)}"
153
+
154
+ async def answer_question(
155
+ self,
156
+ context: str,
157
+ question: str
158
+ ) -> str:
159
+ """
160
+ Answer a question given context
161
+
162
+ Args:
163
+ context: Context information
164
+ question: Question to answer
165
+
166
+ Returns:
167
+ Answer
168
+ """
169
+ prompt = f"""Based on the following context, answer the question.
170
+
171
+ **Context:**
172
+ {context}
173
+
174
+ **Question:** {question}
175
+
176
+ **Answer:**"""
177
+
178
+ try:
179
+ response = await self.model.generate_content_async(
180
+ prompt,
181
+ generation_config=self.generation_config
182
+ )
183
+ return response.text
184
+ except Exception as e:
185
+ return f"Error answering question: {str(e)}"
mcp_tools.py ADDED
@@ -0,0 +1,943 @@
1
+ """
2
+ MCP Tool Implementations for TraceMind
3
+
4
+ Implements:
5
+ - 5 MCP Tools: analyze_leaderboard, debug_trace, estimate_cost, compare_runs, get_dataset
6
+ - 3 MCP Resources: leaderboard data, trace data, cost data
7
+ - 3 MCP Prompts: analysis prompts, debug prompts, optimization prompts
8
+
9
+ With Gradio's native MCP support (mcp_server=True), these are automatically
10
+ exposed based on decorators (@gr.mcp.tool, @gr.mcp.resource, @gr.mcp.prompt),
11
+ docstrings, and type hints.
12
+ """
13
+
14
+ import os
15
+ import json
16
+ from typing import Optional
17
+ from datasets import load_dataset
18
+ import pandas as pd
19
+ from datetime import datetime, timedelta
20
+ import gradio as gr
21
+
22
+ from gemini_client import GeminiClient
23
+
24
+
25
+ async def analyze_leaderboard(
26
+ gemini_client: GeminiClient,
27
+ leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard",
28
+ metric_focus: str = "overall",
29
+ time_range: str = "last_week",
30
+ top_n: int = 5,
31
+ hf_token: Optional[str] = None
32
+ ) -> str:
33
+ """
34
+ Analyze evaluation leaderboard and generate AI-powered insights.
35
+
36
+ This tool loads agent evaluation data from HuggingFace datasets and uses
37
+ Google Gemini (gemini-2.5-flash by default) to provide intelligent analysis of top performers,
38
+ trends, cost/performance trade-offs, and actionable recommendations.
39
+
40
+ Args:
41
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
42
+ leaderboard_repo (str): HuggingFace dataset repository containing leaderboard data. Default: "kshitijthakkar/smoltrace-leaderboard"
43
+ metric_focus (str): Primary metric to focus analysis on. Options: "overall", "accuracy", "cost", "latency", "co2". Default: "overall"
44
+ time_range (str): Time range for analysis. Options: "last_week", "last_month", "all_time". Default: "last_week"
45
+ top_n (int): Number of top models to highlight in analysis. Must be between 3 and 10. Default: 5
46
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
47
+
48
+ Returns:
49
+ str: Markdown-formatted analysis with top performers, insights, trade-offs, and recommendations
50
+ """
51
+ try:
52
+ # Load leaderboard data from HuggingFace
53
+ print(f"Loading leaderboard from {leaderboard_repo}...")
54
+
55
+ # Use user-provided token or fall back to environment variable
56
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
57
+ ds = load_dataset(leaderboard_repo, split="train", token=token)
58
+ df = pd.DataFrame(ds)
59
+
60
+ # Filter by time range
61
+ if time_range != "all_time":
62
+ df['timestamp'] = pd.to_datetime(df['timestamp'])
63
+ now = datetime.now()
64
+
65
+ if time_range == "last_week":
66
+ cutoff = now - timedelta(days=7)
67
+ elif time_range == "last_month":
68
+ cutoff = now - timedelta(days=30)
69
+
70
+ df = df[df['timestamp'] >= cutoff]
71
+
72
+ # Sort by metric
73
+ metric_column_map = {
74
+ "overall": "success_rate",
75
+ "accuracy": "success_rate",
76
+ "cost": "total_cost_usd",
77
+ "latency": "avg_duration_ms",
78
+ "co2": "co2_emissions_g"
79
+ }
80
+
81
+ sort_column = metric_column_map.get(metric_focus, "success_rate")
82
+ ascending = metric_focus in ["cost", "latency", "co2"] # Lower is better for these
83
+
84
+ df_sorted = df.sort_values(sort_column, ascending=ascending)
85
+
86
+ # Get top N
87
+ top_models = df_sorted.head(top_n)
88
+
89
+ # Prepare data summary for Gemini
90
+ analysis_data = {
91
+ "total_evaluations": len(df),
92
+ "time_range": time_range,
93
+ "metric_focus": metric_focus,
94
+ "top_models": top_models[[
95
+ "model", "agent_type", "provider",
96
+ "success_rate", "total_cost_usd", "avg_duration_ms",
97
+ "co2_emissions_g", "submitted_by"
98
+ ]].to_dict('records'),
99
+ "summary_stats": {
100
+ "avg_success_rate": float(df['success_rate'].mean()),
101
+ "avg_cost": float(df['total_cost_usd'].mean()),
102
+ "avg_duration_ms": float(df['avg_duration_ms'].mean()),
103
+ "total_co2_g": float(df['co2_emissions_g'].sum()),
104
+ "models_tested": df['model'].nunique(),
105
+ "unique_submitters": df['submitted_by'].nunique()
106
+ }
107
+ }
108
+
109
+ # Get AI analysis from Gemini
110
+ result = await gemini_client.analyze_with_context(
111
+ data=analysis_data,
112
+ analysis_type="leaderboard",
113
+ specific_question=f"Focus on {metric_focus} performance. What are the key insights?"
114
+ )
115
+
116
+ return result
117
+
118
+ except Exception as e:
119
+ return f"❌ **Error analyzing leaderboard**: {str(e)}\n\nPlease check:\n- Repository name is correct\n- You have access to the dataset\n- HF_TOKEN is set correctly"
120
+
121
+
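A quick sketch of the sort-direction logic used above, with made-up rows (lower is better for cost, latency, and CO2):

```python
import pandas as pd

# Made-up leaderboard rows
df = pd.DataFrame([
    {"model": "gpt-4", "success_rate": 0.92, "total_cost_usd": 1.80},
    {"model": "llama-3.1-8b", "success_rate": 0.84, "total_cost_usd": 0.35},
])

metric_focus = "cost"
sort_column = {"overall": "success_rate", "cost": "total_cost_usd"}[metric_focus]
ascending = metric_focus in ["cost", "latency", "co2"]  # lower is better for these
print(df.sort_values(sort_column, ascending=ascending).iloc[0]["model"])  # -> llama-3.1-8b
```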
122
+ async def debug_trace(
123
+ gemini_client: GeminiClient,
124
+ trace_id: str,
125
+ traces_repo: str,
126
+ question: str = "Analyze this trace and explain what happened",
127
+ hf_token: Optional[str] = None
128
+ ) -> str:
129
+ """
130
+ Debug a specific agent execution trace using OpenTelemetry data.
131
+
132
+ This tool analyzes OpenTelemetry trace data from agent executions and uses
133
+ Google Gemini (gemini-2.5-flash by default) to answer specific questions about the execution flow,
134
+ identify bottlenecks, and explain agent behavior.
135
+
136
+ Args:
137
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
138
+ trace_id (str): Unique identifier for the trace to analyze (e.g., "trace_abc123")
139
+ traces_repo (str): HuggingFace dataset repository containing trace data (e.g., "username/agent-traces-model-timestamp")
140
+ question (str): Specific question about the trace. Default: "Analyze this trace and explain what happened"
141
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
142
+
143
+ Returns:
144
+ str: Markdown-formatted debug analysis with step-by-step breakdown, timing information, and answer to the question
145
+ """
146
+ try:
147
+ # Load traces dataset
148
+ print(f"Loading traces from {traces_repo}...")
149
+
150
+ # Use user-provided token or fall back to environment variable
151
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
152
+ ds = load_dataset(traces_repo, split="train", token=token)
153
+ df = pd.DataFrame(ds)
154
+
155
+ # Find the specific trace
156
+ trace_data = df[df['trace_id'] == trace_id]
157
+
158
+ if len(trace_data) == 0:
159
+ return f"❌ **Trace not found**: No trace with ID `{trace_id}` in repository `{traces_repo}`"
160
+
161
+ trace_row = trace_data.iloc[0]
162
+
163
+ # Parse spans (OpenTelemetry format)
164
+ spans = trace_row['spans']
165
+ if isinstance(spans, str):
166
+ spans = json.loads(spans)
168
+
169
+ # Helper function to handle different OTEL timestamp field formats
170
+ def get_timestamp(span, field):
171
+ """Get timestamp handling multiple OTEL formats"""
172
+ # Try different field name variations
173
+ for key in [field, f"{field}UnixNano", f"{field}_unix_nano", "timeUnixNano"]:
174
+ if key in span:
175
+ return span[key]
176
+ return 0
177
+
178
+ # Build trace analysis data
179
+ start_time = get_timestamp(spans[0], 'startTime')
180
+ end_time = get_timestamp(spans[-1], 'endTime')
181
+
182
+ trace_analysis = {
183
+ "trace_id": trace_id,
184
+ "run_id": trace_row.get('run_id', 'unknown'),
185
+ "total_duration_ms": (end_time - start_time) / 1_000_000 if end_time > start_time else 0,
186
+ "num_spans": len(spans),
187
+ "spans": []
188
+ }
189
+
190
+ # Process each span
191
+ for span in spans:
192
+ span_start = get_timestamp(span, 'startTime')
193
+ span_end = get_timestamp(span, 'endTime')
194
+
195
+ span_info = {
196
+ "name": span.get('name', 'Unknown'),
197
+ "kind": span.get('kind', 'INTERNAL'),
198
+ "duration_ms": (span_end - span_start) / 1_000_000 if span_end > span_start else 0,
199
+ "attributes": span.get('attributes', {}),
200
+ "status": span.get('status', {}).get('code', 'UNKNOWN')
201
+ }
202
+ trace_analysis["spans"].append(span_info)
203
+
204
+ # Get AI analysis from Gemini
205
+ result = await gemini_client.analyze_with_context(
206
+ data=trace_analysis,
207
+ analysis_type="trace",
208
+ specific_question=question
209
+ )
210
+
211
+ return result
212
+
213
+ except Exception as e:
214
+ return f"❌ **Error debugging trace**: {str(e)}\n\nPlease check:\n- Trace ID is correct\n- Repository name is correct\n- You have access to the dataset"
215
+
216
+
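A small sketch of how the `get_timestamp` helper above tolerates the different OTEL field spellings (the span values here are invented):

```python
# Invented span: nanosecond epoch timestamps under the *UnixNano spelling
span = {
    "name": "llm.call",
    "startTimeUnixNano": 1_700_000_000_000_000_000,
    "endTimeUnixNano": 1_700_000_000_250_000_000,
}

def get_timestamp(span, field):
    for key in [field, f"{field}UnixNano", f"{field}_unix_nano", "timeUnixNano"]:
        if key in span:
            return span[key]
    return 0

dur_ms = (get_timestamp(span, "endTime") - get_timestamp(span, "startTime")) / 1_000_000
print(dur_ms)  # -> 250.0 (nanoseconds converted to milliseconds)
```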
217
+ async def estimate_cost(
218
+ gemini_client: GeminiClient,
219
+ model: str,
220
+ agent_type: str,
221
+ num_tests: int = 100,
222
+ hardware: str = "auto"
223
+ ) -> str:
224
+ """
225
+ Estimate the cost, duration, and CO2 emissions of running agent evaluations.
226
+
227
+ This tool predicts costs before running evaluations by calculating LLM API costs,
228
+ HuggingFace Jobs compute costs, and CO2 emissions. Uses Google Gemini
229
+ to provide cost breakdown and optimization recommendations.
230
+
231
+ Args:
232
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
233
+ model (str): Model identifier in litellm format (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
234
+ agent_type (str): Type of agent capabilities to test. Options: "tool", "code", "both"
235
+ num_tests (int): Number of test cases to run. Must be between 10 and 1000. Default: 100
236
+ hardware (str): Hardware type for HuggingFace Jobs. Options: "auto", "cpu", "gpu_a10", "gpu_h200". Default: "auto"
237
+
238
+ Returns:
239
+ str: Markdown-formatted cost estimate with breakdown of LLM costs, HF Jobs costs, duration, CO2 emissions, and optimization tips
240
+ """
241
+ try:
242
+ # Determine if API or local model
243
+ is_api_model = any(provider in model.lower() for provider in ["openai", "anthropic", "google", "cohere"])
244
+
245
+ # Auto-select hardware
246
+ if hardware == "auto":
247
+ hardware = "cpu" if is_api_model else "gpu_a10"
248
+
249
+ # Cost data (simplified estimates)
250
+ llm_costs = {
251
+ "openai/gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
252
+ "openai/gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
253
+ "anthropic/claude-3-opus": {"input": 0.015, "output": 0.075},
254
+ "anthropic/claude-3-sonnet": {"input": 0.003, "output": 0.015},
255
+ "meta-llama/Llama-3.1-8B": {"input": 0, "output": 0}, # Local model
256
+ "default": {"input": 0.001, "output": 0.002}
257
+ }
258
+
259
+ hf_jobs_costs = {
260
+ "cpu": 0.60, # per hour
261
+ "gpu_a10": 1.10, # per hour
262
+ "gpu_h200": 4.50 # per hour
263
+ }
264
+
265
+ # Get model costs
266
+ model_cost = llm_costs.get(model, llm_costs["default"])
267
+
268
+ # Estimate token usage per test
269
+ # Tool agent: ~200 tokens input, ~150 output
270
+ # Code agent: ~300 tokens input, ~400 output
271
+ # Both: ~400 tokens input, ~500 output
272
+ token_estimates = {
273
+ "tool": {"input": 200, "output": 150},
274
+ "code": {"input": 300, "output": 400},
275
+ "both": {"input": 400, "output": 500}
276
+ }
277
+
278
+ tokens_per_test = token_estimates.get(agent_type, token_estimates["both"])  # fall back to "both" for unknown agent_type
279
+
280
+ # Calculate LLM costs
281
+ llm_cost_per_test = (
282
+ (tokens_per_test["input"] / 1000) * model_cost["input"] +
283
+ (tokens_per_test["output"] / 1000) * model_cost["output"]
284
+ )
285
+ total_llm_cost = llm_cost_per_test * num_tests
286
+
287
+ # Estimate duration (seconds per test)
288
+ if is_api_model:
289
+ duration_per_test = 3.0 # API models are fast
290
+ else:
291
+ duration_per_test = 8.0 # Local models slower but depends on GPU
292
+
293
+ total_duration_hours = (duration_per_test * num_tests) / 3600
294
+
295
+ # Calculate HF Jobs costs
296
+ jobs_hourly_rate = hf_jobs_costs.get(hardware, hf_jobs_costs["cpu"])
297
+ total_jobs_cost = total_duration_hours * jobs_hourly_rate
298
+
299
+ # Estimate CO2 (rough estimates)
300
+ co2_per_hour = {
301
+ "cpu": 0.05, # kg CO2
302
+ "gpu_a10": 0.15,
303
+ "gpu_h200": 0.30
304
+ }
305
+
306
+ total_co2_kg = total_duration_hours * co2_per_hour.get(hardware, 0.05)
307
+
308
+ # Prepare estimate data
309
+ estimate_data = {
310
+ "model": model,
311
+ "agent_type": agent_type,
312
+ "num_tests": num_tests,
313
+ "hardware": hardware,
314
+ "is_api_model": is_api_model,
315
+ "estimates": {
316
+ "llm_cost_usd": round(total_llm_cost, 4),
317
+ "llm_cost_per_test": round(llm_cost_per_test, 4),
318
+ "jobs_cost_usd": round(total_jobs_cost, 4),
319
+ "total_cost_usd": round(total_llm_cost + total_jobs_cost, 4),
320
+ "duration_hours": round(total_duration_hours, 2),
321
+ "duration_per_test_seconds": round(duration_per_test, 2),
322
+ "co2_emissions_kg": round(total_co2_kg, 3),
323
+ "tokens_per_test": tokens_per_test
324
+ }
325
+ }
326
+
327
+ # Get AI analysis from Gemini
328
+ result = await gemini_client.analyze_with_context(
329
+ data=estimate_data,
330
+ analysis_type="cost_estimate",
331
+ specific_question="Provide cost breakdown and optimization recommendations"
332
+ )
333
+
334
+ return result
335
+
336
+ except Exception as e:
337
+ return f"❌ **Error estimating cost**: {str(e)}"
338
+
339
+
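A worked check of the arithmetic above for `openai/gpt-4` with a `tool` agent over 100 tests:

```python
# Worked check: openai/gpt-4, agent_type="tool" (200 in / 150 out tokens), 100 tests
rate = {"input": 0.03, "output": 0.06}      # USD per 1K tokens
tokens = {"input": 200, "output": 150}      # per test
per_test = tokens["input"] / 1000 * rate["input"] + tokens["output"] / 1000 * rate["output"]
assert abs(per_test - 0.015) < 1e-9         # $0.015 per test
llm_total = per_test * 100                  # $1.50 LLM cost
hours = 3.0 * 100 / 3600                    # API model at 3 s/test -> ~0.083 h
jobs_total = hours * 0.60                   # "auto" picks cpu for API models ($0.60/h) -> $0.05
print(round(llm_total + jobs_total, 2))     # -> 1.55
```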
340
+ async def compare_runs(
341
+ gemini_client: GeminiClient,
342
+ run_id_1: str,
343
+ run_id_2: str,
344
+ leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard",
345
+ comparison_focus: str = "comprehensive",
346
+ hf_token: Optional[str] = None
347
+ ) -> str:
348
+ """
349
+ Compare two evaluation runs and generate AI-powered comparative analysis.
350
+
351
+ This tool fetches data for two evaluation runs from the leaderboard and uses
352
+ Google Gemini (gemini-2.5-flash by default) to provide intelligent comparison across multiple dimensions:
353
+ success rate, cost efficiency, speed, environmental impact, and use case recommendations.
354
+
355
+ Args:
356
+ gemini_client (GeminiClient): Initialized Gemini client for AI analysis
357
+ run_id_1 (str): First run ID to compare
358
+ run_id_2 (str): Second run ID to compare
359
+ leaderboard_repo (str): HuggingFace dataset repository containing leaderboard data. Default: "kshitijthakkar/smoltrace-leaderboard"
360
+ comparison_focus (str): Focus area for comparison. Options: "comprehensive", "cost", "performance", "eco_friendly". Default: "comprehensive"
361
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
362
+
363
+ Returns:
364
+ str: Markdown-formatted comparative analysis with winner for each category, trade-offs, and use case recommendations
365
+ """
366
+ try:
367
+ # Load leaderboard data
368
+ # Use user-provided token or fall back to environment variable
369
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
370
+ dataset = load_dataset(leaderboard_repo, split="train", token=token)
371
+ df = pd.DataFrame(dataset)
372
+
373
+ # Find the two runs
374
+ run1 = df[df['run_id'] == run_id_1]
375
+ run2 = df[df['run_id'] == run_id_2]
376
+
377
+ if run1.empty:
378
+ return f"❌ **Error**: Run ID '{run_id_1}' not found in leaderboard"
379
+ if run2.empty:
380
+ return f"❌ **Error**: Run ID '{run_id_2}' not found in leaderboard"
381
+
382
+ run1_data = run1.iloc[0].to_dict()
383
+ run2_data = run2.iloc[0].to_dict()
384
+
385
+ # Build comparison context for Gemini
386
+ comparison_data = {
387
+ "run_1": {
388
+ "run_id": run1_data.get('run_id'),
389
+ "model": run1_data.get('model'),
390
+ "agent_type": run1_data.get('agent_type'),
391
+ "success_rate": run1_data.get('success_rate'),
392
+ "total_tests": run1_data.get('total_tests'),
393
+ "successful_tests": run1_data.get('successful_tests'),
394
+ "avg_duration_ms": run1_data.get('avg_duration_ms'),
395
+ "total_cost_usd": run1_data.get('total_cost_usd'),
396
+ "avg_cost_per_test_usd": run1_data.get('avg_cost_per_test_usd'),
397
+ "co2_emissions_g": run1_data.get('co2_emissions_g'),
398
+ "gpu_utilization_avg": run1_data.get('gpu_utilization_avg'),
399
+ "total_tokens": run1_data.get('total_tokens'),
400
+ "provider": run1_data.get('provider'),
401
+ "job_type": run1_data.get('job_type'),
402
+ "timestamp": run1_data.get('timestamp')
403
+ },
404
+ "run_2": {
405
+ "run_id": run2_data.get('run_id'),
406
+ "model": run2_data.get('model'),
407
+ "agent_type": run2_data.get('agent_type'),
408
+ "success_rate": run2_data.get('success_rate'),
409
+ "total_tests": run2_data.get('total_tests'),
410
+ "successful_tests": run2_data.get('successful_tests'),
411
+ "avg_duration_ms": run2_data.get('avg_duration_ms'),
412
+ "total_cost_usd": run2_data.get('total_cost_usd'),
413
+ "avg_cost_per_test_usd": run2_data.get('avg_cost_per_test_usd'),
414
+ "co2_emissions_g": run2_data.get('co2_emissions_g'),
415
+ "gpu_utilization_avg": run2_data.get('gpu_utilization_avg'),
416
+ "total_tokens": run2_data.get('total_tokens'),
417
+ "provider": run2_data.get('provider'),
418
+ "job_type": run2_data.get('job_type'),
419
+ "timestamp": run2_data.get('timestamp')
420
+ },
421
+ "comparison_focus": comparison_focus
422
+ }
423
+
424
+ # Create comparison prompt based on focus
425
+ if comparison_focus == "comprehensive":
426
+ prompt = f"""
427
+ You are analyzing a comparison between two agent evaluation runs. Provide a comprehensive analysis covering all aspects.
428
+
429
+ **Run 1 ({comparison_data['run_1']['model']}):**
430
+ {json.dumps(comparison_data['run_1'], indent=2)}
431
+
432
+ **Run 2 ({comparison_data['run_2']['model']}):**
433
+ {json.dumps(comparison_data['run_2'], indent=2)}
434
+
435
+ Please provide a detailed comparison in the following format:
436
+
437
+ ## 📊 Head-to-Head Comparison
438
+
439
+ ### 🎯 Accuracy Winner
440
+ [Which run has better success rate and by how much? Explain significance]
441
+
442
+ ### ⚡ Speed Winner
443
+ [Which run is faster and by how much? Include average duration comparison]
444
+
445
+ ### 💰 Cost Winner
446
+ [Which run is more cost-effective? Compare total cost AND cost per test]
447
+
448
+ ### 🌱 Eco-Friendly Winner
449
+ [Which run has lower CO2 emissions? Calculate the difference]
450
+
451
+ ### 🔧 GPU Efficiency Winner (if applicable)
452
+ [For GPU jobs, which has better utilization? Explain implications]
453
+
454
+ ## 📈 Performance Summary
455
+
456
+ ### Run 1 Strengths
457
+ - [List 3-4 key strengths]
458
+
459
+ ### Run 2 Strengths
460
+ - [List 3-4 key strengths]
461
+
462
+ ## 💡 Use Case Recommendations
463
+
464
+ ### When to Choose Run 1 ({comparison_data['run_1']['model']})
465
+ [Specific scenarios where Run 1 is the better choice]
466
+
467
+ ### When to Choose Run 2 ({comparison_data['run_2']['model']})
468
+ [Specific scenarios where Run 2 is the better choice]
469
+
470
+ ## ⚖️ Overall Recommendation
471
+ [Based on the analysis, provide a balanced recommendation considering different priorities]
472
+
473
+ Be specific with numbers and percentages. Make the comparison actionable and insightful.
474
+ """
475
+ elif comparison_focus == "cost":
476
+ prompt = f"""
477
+ Compare these two evaluation runs focusing specifically on cost efficiency:
478
+
479
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
480
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
481
+
482
+ Provide detailed cost analysis:
483
+ 1. Which run has lower total cost and by what percentage?
484
+ 2. Cost per test comparison - which is more efficient?
485
+ 3. Calculate cost per successful test (accounting for failures)
486
+ 4. Token usage efficiency - cost per 1000 tokens
487
+ 5. ROI analysis - is higher cost justified by better accuracy?
488
+ 6. Scaling implications - at 1000 tests, what would each cost?
489
+
490
+ Provide actionable cost optimization recommendations.
491
+ """
492
+ elif comparison_focus == "performance":
493
+ prompt = f"""
494
+ Compare these two evaluation runs focusing on performance (speed + accuracy):
495
+
496
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
497
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
498
+
499
+ Analyze:
500
+ 1. Success rate difference - statistical significance?
501
+ 2. Speed comparison - average duration per test
502
+ 3. Which delivers faster results without sacrificing accuracy?
503
+ 4. Throughput analysis - tests per minute
504
+ 5. Quality vs Speed trade-off assessment
505
+ 6. GPU utilization efficiency (if applicable)
506
+
507
+ Recommend which run offers best performance for production workloads.
508
+ """
509
+ elif comparison_focus == "eco_friendly":
510
+ prompt = f"""
511
+ Compare these two evaluation runs focusing on environmental impact:
512
+
513
+ **Run 1:** {json.dumps(comparison_data['run_1'], indent=2)}
514
+ **Run 2:** {json.dumps(comparison_data['run_2'], indent=2)}
515
+
516
+ Analyze:
517
+ 1. CO2 emissions comparison - which is greener?
518
+ 2. Emissions per test and per successful test
519
+ 3. GPU vs API model environmental trade-offs
520
+ 4. Energy efficiency based on duration and GPU utilization
521
+ 5. Emissions reduction if scaled to 10,000 tests
522
+ 6. Carbon offset cost comparison
523
+
524
+ Provide eco-conscious recommendations for sustainable AI deployment.
525
+ """
526
+
527
+ # Get AI analysis from Gemini
528
+ analysis = await gemini_client.analyze_with_context(
529
+ comparison_data,
530
+ analysis_type="comparison",
531
+ specific_question=prompt
532
+ )
533
+
534
+ return analysis
535
+
536
+ except Exception as e:
537
+ return f"❌ **Error comparing runs**: {str(e)}"
538
+
539
+
540
+ async def get_dataset(
541
+ dataset_repo: str,
542
+ max_rows: int = 50,
543
+ hf_token: Optional[str] = None
544
+ ) -> str:
545
+ """
546
+ Load SMOLTRACE datasets from HuggingFace and return as JSON.
547
+
548
+ This tool loads datasets with the "smoltrace-" prefix and returns the raw data
549
+ as JSON. Use this to access:
550
+ - Leaderboard data (kshitijthakkar/smoltrace-leaderboard)
551
+ - Results datasets (e.g., username/smoltrace-results-*)
552
+ - Traces datasets (e.g., username/smoltrace-traces-*)
553
+ - Metrics datasets (e.g., username/smoltrace-metrics-*)
554
+ - Any other smoltrace-prefixed evaluation dataset
555
+
556
+ If you don't know which dataset to load, first load the leaderboard to see
557
+ the dataset references in the results_dataset, traces_dataset, metrics_dataset,
558
+ and dataset_used fields.
559
+
560
+ Args:
561
+ dataset_repo (str): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
562
+ max_rows (int): Maximum number of rows to return. Default: 50. Range: 1-200
563
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
564
+
565
+ Returns:
566
+ str: JSON object with dataset data and metadata
567
+ """
568
+ try:
569
+ # Validate dataset has smoltrace- prefix
570
+ if "smoltrace-" not in dataset_repo:
571
+ return json.dumps({
572
+ "dataset_repo": dataset_repo,
573
+ "error": "Only datasets with 'smoltrace-' prefix are allowed. Please use smoltrace-leaderboard or other smoltrace-* datasets.",
574
+ "data": []
575
+ }, indent=2)
576
+
577
+ # Load dataset from HuggingFace
578
+ # Use user-provided token or fall back to environment variable
579
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
580
+ dataset = load_dataset(dataset_repo, split="train", token=token)
581
+ df = pd.DataFrame(dataset)
582
+
583
+ if df.empty:
584
+ return json.dumps({
585
+ "dataset_repo": dataset_repo,
586
+ "error": "Dataset is empty",
587
+ "total_rows": 0,
588
+ "data": []
589
+ }, indent=2)
590
+
591
+ # Get total row count before limiting
592
+ total_rows = len(df)
593
+
594
+ # Limit rows to avoid overwhelming the context
595
+ max_rows = max(1, min(200, max_rows))
596
+
597
+ # Sort by timestamp if available (newest first)
598
+ if "timestamp" in df.columns:
599
+ df = df.sort_values("timestamp", ascending=False)
600
+
601
+ df_limited = df.head(max_rows)
602
+
603
+ # Convert to list of dictionaries
604
+ data = df_limited.to_dict(orient="records")
605
+
606
+ # Build response with metadata
607
+ result = {
608
+ "dataset_repo": dataset_repo,
609
+ "total_rows": total_rows,
610
+ "rows_returned": len(data),
611
+ "columns": list(df.columns),
612
+ "data": data
613
+ }
614
+
615
+ return json.dumps(result, indent=2, default=str)
616
+
617
+ except Exception as e:
618
+ return json.dumps({
619
+ "dataset_repo": dataset_repo,
620
+ "error": f"Failed to load dataset: {str(e)}",
621
+ "data": []
622
+ }, indent=2)
623
+
624
+
625
+ # ============================================================================
626
+ # MCP RESOURCES - Expose data for retrieval by MCP clients
627
+ # ============================================================================
628
+
629
+ @gr.mcp.resource("leaderboard://{repo}")
630
+ def get_leaderboard_data(repo: str = "kshitijthakkar/smoltrace-leaderboard", hf_token: Optional[str] = None) -> str:
631
+ """
632
+ Get raw leaderboard data from HuggingFace dataset.
633
+
634
+ This resource provides direct access to leaderboard data in JSON format,
635
+ allowing MCP clients to retrieve and process evaluation results.
636
+
637
+ Args:
638
+ repo (str): HuggingFace dataset repository name. Default: "kshitijthakkar/smoltrace-leaderboard"
639
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
640
+
641
+ Returns:
642
+ str: JSON string containing leaderboard data with all evaluation runs
643
+ """
644
+ try:
645
+ # Use user-provided token or fall back to environment variable
646
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
647
+ ds = load_dataset(repo, split="train", token=token)
648
+ df = pd.DataFrame(ds)
649
+
650
+ # Convert to JSON with proper formatting
651
+ data = df.to_dict('records')
652
+ return json.dumps({
653
+ "total_runs": len(data),
654
+ "repository": repo,
655
+ "data": data
656
+ }, indent=2)
657
+
658
+ except Exception as e:
659
+ return json.dumps({
660
+ "error": str(e),
661
+ "repository": repo
662
+ })
663
+
664
+
665
+ @gr.mcp.resource("trace://{trace_id}/{repo}")
666
+ def get_trace_data(trace_id: str, repo: str, hf_token: Optional[str] = None) -> str:
667
+ """
668
+ Get raw trace data for a specific trace ID from HuggingFace dataset.
669
+
670
+ This resource provides direct access to OpenTelemetry trace data,
671
+ allowing MCP clients to retrieve detailed execution information.
672
+
673
+ Args:
674
+ trace_id (str): Unique identifier for the trace (e.g., "trace_abc123")
675
+ repo (str): HuggingFace dataset repository containing traces (e.g., "username/agent-traces-model")
676
+ hf_token (Optional[str]): HuggingFace token for dataset access. If None, uses HF_TOKEN environment variable.
677
+
678
+ Returns:
679
+ str: JSON string containing trace data with all spans and attributes
680
+ """
681
+ try:
682
+ # Use user-provided token or fall back to environment variable
683
+ token = hf_token if hf_token else os.getenv("HF_TOKEN")
684
+ ds = load_dataset(repo, split="train", token=token)
685
+ df = pd.DataFrame(ds)
686
+
687
+ # Find specific trace
688
+ trace_data = df[df['trace_id'] == trace_id]
689
+
690
+ if len(trace_data) == 0:
691
+ return json.dumps({
692
+ "error": f"Trace {trace_id} not found",
693
+ "trace_id": trace_id,
694
+ "repository": repo
695
+ })
696
+
697
+ trace_row = trace_data.iloc[0]
698
+
699
+ # Parse spans if they're stored as string
700
+ spans = trace_row['spans']
701
+ if isinstance(spans, str):
702
+ spans = json.loads(spans)
703
+
704
+ return json.dumps({
705
+ "trace_id": trace_id,
706
+ "repository": repo,
707
+ "run_id": trace_row.get('run_id', 'unknown'),
708
+ "spans": spans
709
+ }, indent=2)
710
+
711
+ except Exception as e:
712
+ return json.dumps({
713
+ "error": str(e),
714
+ "trace_id": trace_id,
715
+ "repository": repo
716
+ })
717
+
718
+
719
+ @gr.mcp.resource("cost://model/{model_name}")
720
+ def get_cost_data(model_name: str) -> str:
721
+ """
722
+ Get cost information for a specific model.
723
+
724
+ This resource provides pricing data for LLM models and hardware configurations,
725
+ helping users understand evaluation costs.
726
+
727
+ Args:
728
+ model_name (str): Model identifier (e.g., "openai/gpt-4", "meta-llama/Llama-3.1-8B")
729
+
730
+ Returns:
731
+ str: JSON string containing cost data for the model
732
+ """
733
+ # Cost database
734
+ llm_costs = {
735
+ "openai/gpt-4": {
736
+ "input_per_1k_tokens": 0.03,
737
+ "output_per_1k_tokens": 0.06,
738
+ "type": "api",
739
+ "provider": "openai"
740
+ },
741
+ "openai/gpt-3.5-turbo": {
742
+ "input_per_1k_tokens": 0.0015,
743
+ "output_per_1k_tokens": 0.002,
744
+ "type": "api",
745
+ "provider": "openai"
746
+ },
747
+ "anthropic/claude-3-opus": {
748
+ "input_per_1k_tokens": 0.015,
749
+ "output_per_1k_tokens": 0.075,
750
+ "type": "api",
751
+ "provider": "anthropic"
752
+ },
753
+ "anthropic/claude-3-sonnet": {
754
+ "input_per_1k_tokens": 0.003,
755
+ "output_per_1k_tokens": 0.015,
756
+ "type": "api",
757
+ "provider": "anthropic"
758
+ },
759
+ "meta-llama/Llama-3.1-8B": {
760
+ "input_per_1k_tokens": 0,
761
+ "output_per_1k_tokens": 0,
762
+ "type": "local",
763
+ "provider": "meta",
764
+ "requires_gpu": True,
765
+ "recommended_hardware": "gpu_a10"
766
+ }
767
+ }
768
+
769
+ hardware_costs = {
770
+ "cpu": {"hourly_rate_usd": 0.60, "type": "cpu"},
771
+ "gpu_a10": {"hourly_rate_usd": 1.10, "type": "gpu", "model": "A10"},
772
+ "gpu_h200": {"hourly_rate_usd": 4.50, "type": "gpu", "model": "H200"}
773
+ }
774
+
775
+ model_cost = llm_costs.get(model_name)
776
+
777
+ if model_cost:
778
+ return json.dumps({
779
+ "model": model_name,
780
+ "cost_data": model_cost,
781
+ "hardware_options": hardware_costs,
782
+ "currency": "USD"
783
+ }, indent=2)
784
+ else:
785
+ return json.dumps({
786
+ "model": model_name,
787
+ "error": "Model not found in cost database",
788
+ "available_models": list(llm_costs.keys()),
789
+ "hardware_options": hardware_costs
790
+ }, indent=2)
791
+
792
+
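A quick local check of this resource function (a pure lookup, no network; assumes the decorated function stays directly callable):

```python
import json
from mcp_tools import get_cost_data

info = json.loads(get_cost_data("openai/gpt-4"))
print(info["cost_data"]["input_per_1k_tokens"])  # -> 0.03
```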
793
+ # ============================================================================
794
+ # MCP PROMPTS - Reusable prompt templates for common workflows
795
+ # ============================================================================
796
+
797
+ @gr.mcp.prompt()
798
+ def analysis_prompt(
799
+ analysis_type: str = "leaderboard",
800
+ focus_area: str = "overall",
801
+ detail_level: str = "detailed"
802
+ ) -> str:
803
+ """
804
+ Generate a prompt template for analyzing agent evaluation data.
805
+
806
+ This prompt helps standardize analysis requests across different
807
+ evaluation data types and focus areas.
808
+
809
+ Args:
810
+ analysis_type (str): Type of analysis. Options: "leaderboard", "trace", "cost". Default: "leaderboard"
811
+ focus_area (str): What to focus on. Options: "overall", "performance", "cost", "efficiency". Default: "overall"
812
+ detail_level (str): Level of detail. Options: "summary", "detailed", "comprehensive". Default: "detailed"
813
+
814
+ Returns:
815
+ str: Formatted prompt template for analysis
816
+ """
817
+ templates = {
818
+ "leaderboard": {
819
+ "overall": "Analyze the agent evaluation leaderboard data comprehensively. Identify top performers across all metrics (accuracy, cost, latency, CO2), explain trade-offs between different approaches, and provide actionable recommendations for model selection.",
820
+ "performance": "Focus on performance metrics in the leaderboard. Compare success rates and accuracy across different models and agent types. Identify which configurations achieve the highest success rates and explain why.",
821
+ "cost": "Analyze cost efficiency in the leaderboard. Compare costs across different models and identify the best cost-performance ratios. Recommend the most cost-effective configurations for different use cases.",
822
+ "efficiency": "Evaluate efficiency metrics including latency, GPU utilization, and CO2 emissions. Identify the most efficient models and explain how to optimize for speed while maintaining quality."
823
+ },
824
+ "trace": {
825
+ "overall": "Analyze this agent execution trace comprehensively. Explain the sequence of operations, identify any bottlenecks or inefficiencies, and suggest optimizations.",
826
+ "performance": "Focus on performance aspects of this trace. Identify which steps took the most time, explain why, and suggest ways to improve execution speed.",
827
+ "cost": "Analyze the cost implications of this trace execution. Break down token usage and API calls, calculate costs, and suggest ways to reduce expenses.",
828
+ "efficiency": "Evaluate the efficiency of this trace. Identify redundant operations, suggest ways to optimize the execution flow, and recommend best practices."
829
+ },
830
+ "cost": {
831
+ "overall": "Analyze the cost estimation comprehensively. Break down LLM API costs, infrastructure costs, and provide optimization recommendations.",
832
+ "performance": "Focus on the cost-performance trade-off. Compare different hardware options and explain which provides the best value.",
833
+ "cost": "Deep dive into cost breakdown. Explain each cost component in detail and provide specific recommendations for cost reduction.",
834
+ "efficiency": "Analyze cost efficiency. Compare different model configurations and recommend the most cost-effective approach for the given use case."
835
+ }
836
+ }
837
+
838
+ detail_prefixes = {
839
+ "summary": "Provide a brief, high-level summary. ",
840
+ "detailed": "Provide a detailed analysis with specific insights. ",
841
+ "comprehensive": "Provide a comprehensive, in-depth analysis with detailed recommendations. "
842
+ }
843
+
844
+ prefix = detail_prefixes.get(detail_level, detail_prefixes["detailed"])
845
+ template = templates.get(analysis_type, {}).get(focus_area, templates["leaderboard"]["overall"])
846
+
847
+ return f"{prefix}{template}"
848
+
849
+
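For illustration, the composer simply prepends a detail prefix to a focus template (again assuming the decorated function remains directly callable):

```python
from mcp_tools import analysis_prompt

print(analysis_prompt(analysis_type="trace", focus_area="performance", detail_level="summary"))
# -> "Provide a brief, high-level summary. Focus on performance aspects of this trace. ..."
```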
850
+ @gr.mcp.prompt()
851
+ def debug_prompt(
852
+ debug_type: str = "error",
853
+ context: str = "agent_execution"
854
+ ) -> str:
855
+ """
856
+ Generate a prompt template for debugging agent traces.
857
+
858
+ This prompt helps standardize debugging requests for different
859
+ types of issues and contexts.
860
+
861
+ Args:
862
+ debug_type (str): Type of debugging. Options: "error", "performance", "behavior", "optimization". Default: "error"
863
+ context (str): Execution context. Options: "agent_execution", "tool_calling", "llm_reasoning". Default: "agent_execution"
864
+
865
+ Returns:
866
+ str: Formatted prompt template for debugging
867
+ """
868
+ templates = {
869
+ "error": {
870
+ "agent_execution": "Debug this agent execution trace to identify why it failed. Analyze each step in the execution flow, identify where the error occurred, explain the root cause, and suggest how to fix it.",
871
+ "tool_calling": "Debug this tool calling sequence. Identify which tool call failed or produced unexpected results, explain why it happened, and suggest corrections.",
872
+ "llm_reasoning": "Debug the LLM reasoning in this trace. Analyze the prompts and responses, identify where the reasoning went wrong, and suggest improvements to the prompts or approach."
873
+ },
874
+ "performance": {
875
+ "agent_execution": "Analyze this trace for performance issues. Identify bottlenecks, measure time spent in each component, and recommend optimizations to improve execution speed.",
876
+ "tool_calling": "Analyze tool calling performance. Identify which tools are slow, explain why, and suggest ways to optimize tool execution or caching.",
877
+ "llm_reasoning": "Analyze LLM reasoning efficiency. Identify unnecessary calls, redundant reasoning steps, and suggest ways to streamline the reasoning process."
878
+ },
879
+ "behavior": {
880
+ "agent_execution": "Analyze the agent's behavior in this trace. Explain why the agent made certain decisions, whether the behavior is expected, and suggest improvements if needed.",
881
+ "tool_calling": "Analyze tool selection behavior. Explain why certain tools were called, whether the choices were optimal, and suggest alternative approaches if applicable.",
882
+ "llm_reasoning": "Analyze the LLM's reasoning patterns. Explain the logic flow, identify any unexpected reasoning, and suggest how to guide the model toward better decisions."
883
+ },
884
+ "optimization": {
885
+ "agent_execution": "Analyze this trace for optimization opportunities. Identify redundant operations, suggest caching strategies, and recommend ways to reduce costs and improve efficiency.",
886
+ "tool_calling": "Optimize tool usage in this trace. Suggest ways to reduce tool calls, batch operations, or use more efficient alternatives.",
887
+ "llm_reasoning": "Optimize LLM usage. Suggest ways to reduce token usage, improve prompt efficiency, and achieve the same results with lower costs."
888
+ }
889
+ }
890
+
891
+ template = templates.get(debug_type, {}).get(context, templates["error"]["agent_execution"])
892
+ return template
893
+
894
+
895
+ @gr.mcp.prompt()
896
+ def optimization_prompt(
897
+ optimization_goal: str = "cost",
898
+ constraints: str = "maintain_quality"
899
+ ) -> str:
900
+ """
901
+ Generate a prompt template for optimization recommendations.
902
+
903
+ This prompt helps standardize optimization requests for different
904
+ goals and constraints.
905
+
906
+ Args:
907
+ optimization_goal (str): What to optimize. Options: "cost", "speed", "quality", "efficiency". Default: "cost"
908
+ constraints (str): Constraints to consider. Options: "maintain_quality", "maintain_speed", "no_constraints". Default: "maintain_quality"
909
+
910
+ Returns:
911
+ str: Formatted prompt template for optimization
912
+ """
913
+ templates = {
914
+ "cost": {
915
+ "maintain_quality": "Analyze this evaluation setup and recommend cost optimizations while maintaining quality. Consider cheaper models, optimized prompts, caching strategies, and hardware selection. Quantify potential savings.",
916
+ "maintain_speed": "Recommend cost optimizations while maintaining execution speed. Consider model alternatives, batch processing, and infrastructure choices that reduce costs without adding latency.",
917
+ "no_constraints": "Recommend aggressive cost optimizations. Identify all opportunities to reduce expenses, even if it means trade-offs in quality or speed. Prioritize maximum cost reduction."
918
+ },
919
+ "speed": {
920
+ "maintain_quality": "Recommend speed optimizations while maintaining quality. Consider parallel execution, caching, faster models with similar accuracy, and infrastructure upgrades. Quantify potential speedups.",
921
+ "maintain_cost": "Recommend speed optimizations within the current cost budget. Suggest configuration changes, caching strategies, and optimizations that don't increase expenses.",
922
+ "no_constraints": "Recommend aggressive speed optimizations. Identify all opportunities to reduce latency, even if it increases costs. Prioritize maximum performance."
923
+ },
924
+ "quality": {
925
+ "maintain_cost": "Recommend quality improvements within the current cost budget. Suggest better prompts, model configurations, and strategies that improve accuracy without increasing expenses.",
926
+ "maintain_speed": "Recommend quality improvements while maintaining execution speed. Suggest prompt improvements, reasoning enhancements, and configurations that improve accuracy without adding latency.",
927
+ "no_constraints": "Recommend quality improvements without budget constraints. Suggest the best models, optimal configurations, and strategies to maximize accuracy and success rates."
928
+ },
929
+ "efficiency": {
930
+ "maintain_quality": "Recommend overall efficiency improvements. Optimize for the best cost-speed-quality balance. Identify waste, suggest streamlined processes, and provide holistic optimization strategies.",
931
+ "maintain_cost": "Recommend efficiency improvements within budget. Focus on reducing waste, optimizing resource usage, and getting better results with the same cost.",
932
+ "maintain_speed": "Recommend efficiency improvements maintaining speed. Reduce unnecessary operations, optimize resource usage, and improve output quality without adding latency."
933
+ }
934
+ }
935
+
936
+ # Fall back when the goal/constraint combination has no template defined
937
+ # (e.g. optimization_goal="speed" with constraints="maintain_speed")
938
+ goal_templates = templates.get(optimization_goal, templates["cost"])
939
+ if constraints not in goal_templates:
940
+ constraints = next(iter(goal_templates))
941
+
942
+ template = goal_templates[constraints]
943
+ return template
requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ # Core dependencies
2
+ gradio[mcp]>=6.0.0.dev1
3
+ google-generativeai>=0.8.0
4
+ datasets>=2.14.0
5
+ pandas>=2.0.0
6
+
7
+ # HuggingFace
8
+ huggingface-hub>=0.20.0
9
+
10
+ # Utilities
11
+ python-dotenv>=1.0.0