Mandark-droid commited on
Commit
4a16168
Β·
1 Parent(s): fbd2ae8

Add synthetic dataset generation tools for custom SMOLTRACE evaluations

Browse files

Enable users to create domain-specific test datasets when standard benchmarks don't fit their use case. Enterprise users can now generate custom evaluation datasets for proprietary tools, industry-specific workflows, and specialized agent capabilities.

Key features:
- generate_synthetic_dataset: AI-powered generation of SMOLTRACE-format tasks (5-100 tasks)
- Parallel batched generation: Automatically splits large requests into concurrent batches
- Extended timeout: 120s per batch to support 100-task generations
- push_dataset_to_hub: Direct upload to HuggingFace with naming validation
- Complete API documentation for both new tools

Technical improvements:
- Parallel execution with asyncio.gather for 5x speedup on large datasets
- Fair distribution of difficulty/agent_type across batches
- Partial success handling: continues if some batches fail
- Switch to gemini-2.5-flash-lite for cost efficiency

Files changed (4) hide show
  1. README.md +8 -7
  2. app.py +364 -28
  3. gemini_client.py +2 -2
  4. mcp_tools.py +497 -5
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: blue
5
  colorTo: purple
6
  sdk: docker
7
  app_port: 7860
8
- pinned: false
9
  license: agpl-3.0
10
  short_description: MCP server for agent evaluation with Gemini 2.5 Pro
11
  tags:
@@ -32,13 +32,14 @@ tags:
32
 
33
  TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
34
 
35
- ### πŸ› οΈ **6 AI-Powered Tools**
36
  1. **πŸ“Š analyze_leaderboard**: Generate insights from evaluation leaderboard data
37
  2. **πŸ› debug_trace**: Debug specific agent execution traces using OpenTelemetry data
38
  3. **πŸ’° estimate_cost**: Predict evaluation costs before running
39
  4. **βš–οΈ compare_runs**: Compare two evaluation runs with AI-powered analysis
40
- 5. **πŸ” analyze_results**: Deep dive into test results with optimization recommendations
41
- 6. **πŸ“¦ get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
 
42
 
43
  ### πŸ“¦ **3 Data Resources**
44
  1. **leaderboard data**: Direct JSON access to evaluation results
@@ -93,11 +94,11 @@ All analysis is powered by **Google Gemini 2.5 Pro** for intelligent, context-aw
93
  - βœ… **MCP Standard Compliant**: Built with Gradio's native MCP support (`@gr.mcp.*` decorators)
94
  - βœ… **Production-Ready**: Deployable to HuggingFace Spaces with SSE transport
95
  - βœ… **Testing Interface**: Beautiful Gradio UI for testing all components
96
- - βœ… **Enterprise Focus**: Cost optimization, debugging, and decision support
97
  - βœ… **Google Gemini Powered**: Leverages Gemini 2.5 Pro for intelligent analysis
98
- - βœ… **11 Total Components**: 5 Tools + 3 Resources + 3 Prompts
99
 
100
- ### πŸ› οΈ Five Production-Ready Tools
101
 
102
  #### 1. analyze_leaderboard
103
 
 
5
  colorTo: purple
6
  sdk: docker
7
  app_port: 7860
8
+ pinned: true
9
  license: agpl-3.0
10
  short_description: MCP server for agent evaluation with Gemini 2.5 Pro
11
  tags:
 
32
 
33
  TraceMind MCP Server is a Gradio-based MCP (Model Context Protocol) server that provides a complete MCP implementation with:
34
 
35
+ ### πŸ› οΈ **7 AI-Powered Tools**
36
  1. **πŸ“Š analyze_leaderboard**: Generate insights from evaluation leaderboard data
37
  2. **πŸ› debug_trace**: Debug specific agent execution traces using OpenTelemetry data
38
  3. **πŸ’° estimate_cost**: Predict evaluation costs before running
39
  4. **βš–οΈ compare_runs**: Compare two evaluation runs with AI-powered analysis
40
+ 5. **πŸ“¦ get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
41
+ 6. **πŸ§ͺ generate_synthetic_dataset**: Create domain-specific test datasets for SMOLTRACE evaluations (supports up to 100 tasks with parallel batched generation)
42
+ 7. **πŸ“€ push_dataset_to_hub**: Upload generated datasets to HuggingFace Hub
43
 
44
  ### πŸ“¦ **3 Data Resources**
45
  1. **leaderboard data**: Direct JSON access to evaluation results
 
94
  - βœ… **MCP Standard Compliant**: Built with Gradio's native MCP support (`@gr.mcp.*` decorators)
95
  - βœ… **Production-Ready**: Deployable to HuggingFace Spaces with SSE transport
96
  - βœ… **Testing Interface**: Beautiful Gradio UI for testing all components
97
+ - βœ… **Enterprise Focus**: Cost optimization, debugging, decision support, and custom dataset generation
98
  - βœ… **Google Gemini Powered**: Leverages Gemini 2.5 Pro for intelligent analysis
99
+ - βœ… **13 Total Components**: 7 Tools + 3 Resources + 3 Prompts
100
 
101
+ ### πŸ› οΈ Seven Production-Ready Tools
102
 
103
  #### 1. analyze_leaderboard
104
 
app.py CHANGED
@@ -1,19 +1,49 @@
1
  """
2
- TraceMind MCP Server - Gradio Interface with MCP Support
3
-
4
- This server provides AI-powered analysis tools for agent evaluation data:
5
- 1. analyze_leaderboard: Summarize trends and insights from leaderboard
6
- 2. debug_trace: Debug specific agent execution traces
7
- 3. estimate_cost: Predict evaluation costs before running
8
- 4. compare_runs: Compare two evaluation runs with AI-powered analysis
9
- 5. get_dataset: Load any HuggingFace dataset as JSON for flexible analysis
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  """
11
 
12
  import os
 
13
  import gradio as gr
14
  from typing import Optional, Dict, Any
15
  from datetime import datetime
16
 
 
 
 
 
 
 
 
 
17
  # Local imports
18
  from gemini_client import GeminiClient
19
  from mcp_tools import (
@@ -21,8 +51,9 @@ from mcp_tools import (
21
  debug_trace,
22
  estimate_cost,
23
  compare_runs,
24
- analyze_results,
25
- get_dataset
 
26
  )
27
 
28
  # Initialize default Gemini client (fallback if user doesn't provide key)
@@ -42,15 +73,16 @@ def create_gradio_ui():
42
 
43
  **AI-Powered Analysis for Agent Evaluation Data**
44
 
45
- This server provides **6 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
46
 
47
  ### MCP Tools (AI-Powered)
48
  - πŸ“Š **Analyze Leaderboard**: Get insights from evaluation results
49
  - πŸ› **Debug Trace**: Understand what happened in a specific test
50
  - πŸ’° **Estimate Cost**: Predict evaluation costs before running
51
  - βš–οΈ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
52
- - πŸ” **Analyze Results**: Deep dive into test results with optimization recommendations
53
  - πŸ“¦ **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
 
 
54
 
55
  ### MCP Resources (Data Access)
56
  - πŸ“Š **leaderboard://{repo}**: Raw leaderboard data
@@ -493,12 +525,181 @@ def create_gradio_ui():
493
  outputs=[dataset_output]
494
  )
495
 
496
- # Tab 6: MCP Resources & Prompts
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
497
  with gr.Tab("πŸ”Œ MCP Resources & Prompts"):
498
  gr.Markdown("""
499
  ## MCP Resources & Prompts
500
 
501
- Beyond the 5 MCP Tools, this server also exposes **MCP Resources** and **MCP Prompts**
502
  that MCP clients can use directly.
503
 
504
  ### MCP Resources (Read-Only Data Access)
@@ -751,7 +952,7 @@ def create_gradio_ui():
751
  outputs=[prompt_output]
752
  )
753
 
754
- # Tab 7: API Documentation
755
  with gr.Tab("πŸ“– API Documentation"):
756
  gr.Markdown("""
757
  ## MCP Tool Specifications
@@ -842,6 +1043,95 @@ def create_gradio_ui():
842
 
843
  ---
844
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
845
  ## MCP Integration
846
 
847
  This Gradio app is MCP-enabled. When deployed to HuggingFace Spaces, it can be accessed via MCP clients.
@@ -854,8 +1144,8 @@ def create_gradio_ui():
854
 
855
  ### What's Exposed via MCP:
856
 
857
- #### 5 MCP Tools (AI-Powered)
858
- The five tools above (`analyze_leaderboard`, `debug_trace`, `estimate_cost`, `compare_runs`, `get_dataset`)
859
  are automatically exposed as MCP tools and can be called from any MCP client.
860
 
861
  #### 3 MCP Resources (Data Access)
@@ -891,14 +1181,60 @@ def create_gradio_ui():
891
  return demo
892
 
893
  if __name__ == "__main__":
894
- # Create Gradio interface
895
- demo = create_gradio_ui()
896
-
897
- # Launch with MCP server enabled
898
- # share=True creates a temporary public HTTPS URL for testing with Claude Code
899
- demo.launch(
900
- server_name="0.0.0.0",
901
- server_port=7860,
902
- #share=True, # Creates temporary HTTPS URL (e.g., https://abc123.gradio.live)
903
- mcp_server=True # Enable MCP server functionality
904
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  """
2
+ TraceMind MCP Server - Hugging Face Space Entry Point (Track 1)
3
+
4
+ This file serves as the entry point for HuggingFace Space deployment.
5
+ Exposes 7 AI-powered MCP tools + 3 Resources + 3 Prompts via Gradio's native MCP support.
6
+
7
+ Architecture:
8
+ User β†’ MCP Client (Claude Desktop, Continue, Cline, etc.)
9
+ β†’ MCP Endpoint (Gradio SSE)
10
+ β†’ TraceMind MCP Server (this file)
11
+ β†’ Tools (mcp_tools.py)
12
+ β†’ Google Gemini 2.5 Pro API
13
+
14
+ For Track 1: Building MCP Servers - Enterprise Category
15
+ https://huggingface.co/MCP-1st-Birthday
16
+
17
+ Tools Provided:
18
+ πŸ“Š analyze_leaderboard - AI-powered leaderboard analysis
19
+ πŸ› debug_trace - Debug agent execution traces with AI
20
+ πŸ’° estimate_cost - Predict evaluation costs before running
21
+ βš–οΈ compare_runs - Compare evaluation runs with AI analysis
22
+ πŸ“¦ get_dataset - Load SMOLTRACE datasets as JSON
23
+ πŸ§ͺ generate_synthetic_dataset - Create domain-specific test datasets
24
+ πŸ“€ push_dataset_to_hub - Upload datasets to HuggingFace Hub
25
+
26
+ Compatible with:
27
+ - Claude Desktop (via Gradio MCP support)
28
+ - Continue.dev (VS Code extension)
29
+ - Cline (VS Code extension)
30
+ - Any MCP client supporting Gradio's MCP protocol
31
  """
32
 
33
  import os
34
+ import logging
35
  import gradio as gr
36
  from typing import Optional, Dict, Any
37
  from datetime import datetime
38
 
39
+ # Configure logging
40
+ logging.basicConfig(
41
+ level=logging.INFO,
42
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
43
+ handlers=[logging.StreamHandler()]
44
+ )
45
+ logger = logging.getLogger(__name__)
46
+
47
  # Local imports
48
  from gemini_client import GeminiClient
49
  from mcp_tools import (
 
51
  debug_trace,
52
  estimate_cost,
53
  compare_runs,
54
+ get_dataset,
55
+ generate_synthetic_dataset,
56
+ push_dataset_to_hub
57
  )
58
 
59
  # Initialize default Gemini client (fallback if user doesn't provide key)
 
73
 
74
  **AI-Powered Analysis for Agent Evaluation Data**
75
 
76
+ This server provides **7 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:
77
 
78
  ### MCP Tools (AI-Powered)
79
  - πŸ“Š **Analyze Leaderboard**: Get insights from evaluation results
80
  - πŸ› **Debug Trace**: Understand what happened in a specific test
81
  - πŸ’° **Estimate Cost**: Predict evaluation costs before running
82
  - βš–οΈ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
 
83
  - πŸ“¦ **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
84
+ - πŸ§ͺ **Generate Synthetic Dataset**: Create domain-specific test datasets for SMOLTRACE
85
+ - πŸ“€ **Push to Hub**: Upload generated datasets to HuggingFace Hub
86
 
87
  ### MCP Resources (Data Access)
88
  - πŸ“Š **leaderboard://{repo}**: Raw leaderboard data
 
525
  outputs=[dataset_output]
526
  )
527
 
528
+ # Tab 6: Generate Synthetic Dataset
529
+ with gr.Tab("πŸ§ͺ Generate Synthetic Dataset"):
530
+ gr.Markdown("""
531
+ ## Create Domain-Specific Test Datasets for SMOLTRACE
532
+
533
+ Use AI to generate synthetic evaluation tasks tailored to your domain and tools.
534
+ Perfect for creating custom benchmarks when standard datasets don't fit your use case.
535
+
536
+ **🎯 Enterprise Use Case**: Quickly create evaluation datasets for:
537
+ - Custom tools and APIs your agents use
538
+ - Industry-specific domains (finance, healthcare, legal, etc.)
539
+ - Internal workflows and processes
540
+ - Specialized agent capabilities
541
+
542
+ **Output Format**: SMOLTRACE-compatible task dataset ready for HuggingFace upload
543
+ """)
544
+
545
+ with gr.Row():
546
+ with gr.Column():
547
+ synth_domain = gr.Textbox(
548
+ label="Domain",
549
+ placeholder="e.g., finance, healthcare, travel, ecommerce, customer_support",
550
+ value="travel",
551
+ info="The domain/industry for your synthetic tasks"
552
+ )
553
+ synth_tools = gr.Textbox(
554
+ label="Tool Names (comma-separated)",
555
+ placeholder="e.g., get_weather,search_flights,book_hotel,currency_converter",
556
+ value="get_weather,search_flights,book_hotel",
557
+ info="Names of tools your agent can use",
558
+ lines=2
559
+ )
560
+ synth_num_tasks = gr.Slider(
561
+ label="Number of Tasks",
562
+ minimum=5,
563
+ maximum=100,
564
+ value=10,
565
+ step=1,
566
+ info="Total number of synthetic tasks to generate"
567
+ )
568
+ synth_difficulty = gr.Dropdown(
569
+ label="Difficulty Distribution",
570
+ choices=["balanced", "easy_only", "medium_only", "hard_only", "progressive"],
571
+ value="balanced",
572
+ info="How to distribute task difficulty"
573
+ )
574
+ synth_agent_type = gr.Dropdown(
575
+ label="Agent Type",
576
+ choices=["both", "tool", "code"],
577
+ value="both",
578
+ info="Target agent type for the tasks"
579
+ )
580
+ synth_button = gr.Button("πŸ§ͺ Generate Synthetic Dataset", variant="primary", size="lg")
581
+
582
+ with gr.Column():
583
+ synth_output = gr.JSON(label="Generated Dataset (JSON)")
584
+
585
+ gr.Markdown("""
586
+ ### πŸ“ Next Steps
587
+
588
+ After generation:
589
+ 1. **Copy the `tasks` array** from the JSON output above
590
+ 2. **Use the "Push to Hub" tab** to upload directly to HuggingFace
591
+ 3. **Or upload manually** following the instructions in the output
592
+
593
+ **πŸ’‘ Tip**: The generated dataset includes usage instructions and follows SMOLTRACE naming convention!
594
+ """)
595
+
596
+ async def run_generate_synthetic(domain, tools, num_tasks, difficulty, agent_type):
597
+ """Generate synthetic dataset with async support."""
598
+ try:
599
+ import json
600
+ result = await generate_synthetic_dataset(
601
+ domain=domain,
602
+ tool_names=tools,
603
+ num_tasks=int(num_tasks),
604
+ difficulty_distribution=difficulty,
605
+ agent_type=agent_type
606
+ )
607
+ return json.loads(result)
608
+ except Exception as e:
609
+ return {"error": str(e)}
610
+
611
+ synth_button.click(
612
+ fn=run_generate_synthetic,
613
+ inputs=[synth_domain, synth_tools, synth_num_tasks, synth_difficulty, synth_agent_type],
614
+ outputs=[synth_output]
615
+ )
616
+
617
+ # Tab 7: Push Dataset to Hub
618
+ with gr.Tab("πŸ“€ Push to Hub"):
619
+ gr.Markdown("""
620
+ ## Upload Generated Dataset to HuggingFace Hub
621
+
622
+ Upload your synthetic dataset (from the previous tab or any SMOLTRACE-format dataset)
623
+ directly to HuggingFace Hub.
624
+
625
+ **Requirements**:
626
+ - HuggingFace account
627
+ - API token with write permissions ([Get one here](https://huggingface.co/settings/tokens))
628
+ - Dataset in SMOLTRACE format
629
+
630
+ **Naming Convention**: `{username}/smoltrace-{domain}-tasks` or `{username}/smoltrace-{domain}-tasks-v1`
631
+ """)
632
+
633
+ with gr.Row():
634
+ with gr.Column():
635
+ push_dataset_json = gr.Textbox(
636
+ label="Dataset JSON (tasks array)",
637
+ placeholder='[{"id": "task_001", "prompt": "...", "expected_tool": "...", ...}]',
638
+ info="Paste the 'tasks' array from generate_synthetic_dataset output",
639
+ lines=10
640
+ )
641
+ push_repo_name = gr.Textbox(
642
+ label="Repository Name",
643
+ placeholder="your-username/smoltrace-finance-tasks",
644
+ info="HuggingFace repo name (follow SMOLTRACE convention)",
645
+ value=""
646
+ )
647
+ push_hf_token = gr.Textbox(
648
+ label="HuggingFace Token",
649
+ placeholder="hf_...",
650
+ info="API token with write permissions",
651
+ type="password"
652
+ )
653
+ push_private = gr.Checkbox(
654
+ label="Make dataset private",
655
+ value=False,
656
+ info="Private datasets are only visible to you"
657
+ )
658
+ push_button = gr.Button("πŸ“€ Push to HuggingFace Hub", variant="primary", size="lg")
659
+
660
+ with gr.Column():
661
+ push_output = gr.JSON(label="Upload Result")
662
+
663
+ gr.Markdown("""
664
+ ### πŸŽ‰ After Upload
665
+
666
+ Once uploaded, you can:
667
+ 1. **View your dataset** at the URL provided in the output
668
+ 2. **Use in SMOLTRACE** evaluations with the command shown
669
+ 3. **Share with your team** (if public) or manage access (if private)
670
+
671
+ **Example**: After uploading to `company/smoltrace-finance-tasks`:
672
+ ```bash
673
+ smoltrace-eval --model openai/gpt-4 --dataset-name company/smoltrace-finance-tasks
674
+ ```
675
+ """)
676
+
677
+ async def run_push_dataset(dataset_json, repo_name, hf_token, private):
678
+ """Push dataset to hub with async support."""
679
+ try:
680
+ import json
681
+ result = await push_dataset_to_hub(
682
+ dataset_json=dataset_json,
683
+ repo_name=repo_name,
684
+ hf_token=hf_token,
685
+ private=private
686
+ )
687
+ return json.loads(result)
688
+ except Exception as e:
689
+ return {"error": str(e)}
690
+
691
+ push_button.click(
692
+ fn=run_push_dataset,
693
+ inputs=[push_dataset_json, push_repo_name, push_hf_token, push_private],
694
+ outputs=[push_output]
695
+ )
696
+
697
+ # Tab 9: MCP Resources & Prompts
698
  with gr.Tab("πŸ”Œ MCP Resources & Prompts"):
699
  gr.Markdown("""
700
  ## MCP Resources & Prompts
701
 
702
+ Beyond the 7 MCP Tools, this server also exposes **MCP Resources** and **MCP Prompts**
703
  that MCP clients can use directly.
704
 
705
  ### MCP Resources (Read-Only Data Access)
 
952
  outputs=[prompt_output]
953
  )
954
 
955
+ # Tab 10: API Documentation
956
  with gr.Tab("πŸ“– API Documentation"):
957
  gr.Markdown("""
958
  ## MCP Tool Specifications
 
1043
 
1044
  ---
1045
 
1046
+ ### 6. generate_synthetic_dataset
1047
+
1048
+ **Description**: Generate domain-specific synthetic test datasets for SMOLTRACE evaluations using AI
1049
+
1050
+ **Parameters**:
1051
+ - `domain` (str, required): The domain for synthetic tasks (e.g., "finance", "healthcare", "travel", "ecommerce", "customer_support")
1052
+ - `tool_names` (str, required): Comma-separated list of tool names to include (e.g., "get_weather,search_web,calculator")
1053
+ - `num_tasks` (int): Number of synthetic tasks to generate (default: 10, range: 5-100)
1054
+ - `difficulty_distribution` (str): How to distribute task difficulty (default: "balanced")
1055
+ - Options: "balanced" (40% easy, 40% medium, 20% hard), "easy_only", "medium_only", "hard_only", "progressive" (50% easy, 30% medium, 20% hard)
1056
+ - `agent_type` (str): Target agent type for tasks (default: "both")
1057
+ - Options: "tool" (ToolCallingAgent), "code" (CodeAgent), "both" (50/50 mix)
1058
+
1059
+ **Returns**: JSON object with dataset_info (including batch statistics), tasks array (SMOLTRACE format), and usage_instructions
1060
+
1061
+ **πŸš€ Batched Generation**:
1062
+ - Requests >20 tasks are automatically split into parallel batches
1063
+ - Each batch generates up to 20 tasks concurrently
1064
+ - Example: 100 tasks = 5 parallel batches (20 tasks each)
1065
+ - Timeout: 120 seconds per batch
1066
+ - Token limit: 8,192 per batch (40,960 total for 100 tasks)
1067
+
1068
+ **Performance**:
1069
+ - 5-20 tasks: Single batch, ~30-60 seconds
1070
+ - 21-100 tasks: Multiple parallel batches, ~60-120 seconds per batch
1071
+
1072
+ **SMOLTRACE Task Format**:
1073
+ Each task includes: `id`, `prompt`, `expected_tool`, `expected_tool_calls` (optional), `difficulty`, `agent_type`, `expected_keywords` (optional)
1074
+
1075
+ **Use Cases**:
1076
+ - Create custom evaluation datasets for industry-specific domains
1077
+ - Test agents with proprietary tools and APIs
1078
+ - Generate benchmarks for internal workflows
1079
+ - Rapid prototyping of evaluation scenarios
1080
+
1081
+ ---
1082
+
1083
+ ### 7. push_dataset_to_hub
1084
+
1085
+ **Description**: Push a generated synthetic dataset to HuggingFace Hub
1086
+
1087
+ **Parameters**:
1088
+ - `dataset_json` (str, required): JSON string containing the tasks array from generate_synthetic_dataset
1089
+ - `repo_name` (str, required): HuggingFace repository name following SMOLTRACE naming convention
1090
+ - Format: `{username}/smoltrace-{domain}-tasks` or `{username}/smoltrace-{domain}-tasks-v{version}`
1091
+ - Examples: `kshitij/smoltrace-finance-tasks`, `kshitij/smoltrace-healthcare-tasks-v2`
1092
+ - `hf_token` (str, required): HuggingFace API token with write permissions
1093
+ - `private` (bool): Whether to create a private repository (default: False)
1094
+
1095
+ **Returns**: JSON object with upload status, repository URL, and dataset information
1096
+
1097
+ **Validation**:
1098
+ - βœ… Checks SMOLTRACE naming convention (`smoltrace-` prefix required)
1099
+ - βœ… Validates all tasks have required fields (id, prompt, expected_tool, difficulty, agent_type)
1100
+ - βœ… Verifies HuggingFace token has write permissions
1101
+ - βœ… Handles repository creation if it doesn't exist
1102
+
1103
+ **Workflow**:
1104
+ 1. Generate synthetic dataset using `generate_synthetic_dataset`
1105
+ 2. Extract the `tasks` array from the response JSON
1106
+ 3. Convert tasks array to JSON string
1107
+ 4. Call `push_dataset_to_hub` with the JSON string and desired repo name
1108
+ 5. Share the dataset URL with your team or use in SMOLTRACE evaluations
1109
+
1110
+ **Example Integration**:
1111
+ ```python
1112
+ # Step 1: Generate dataset
1113
+ result = generate_synthetic_dataset(
1114
+ domain="finance",
1115
+ tool_names="get_stock_price,calculate_roi,fetch_company_info",
1116
+ num_tasks=50
1117
+ )
1118
+
1119
+ # Step 2: Extract tasks
1120
+ import json
1121
+ data = json.loads(result)
1122
+ tasks_json = json.dumps(data["tasks"])
1123
+
1124
+ # Step 3: Push to HuggingFace
1125
+ push_result = push_dataset_to_hub(
1126
+ dataset_json=tasks_json,
1127
+ repo_name="your-username/smoltrace-finance-tasks",
1128
+ hf_token="hf_xxx",
1129
+ private=False
1130
+ )
1131
+ ```
1132
+
1133
+ ---
1134
+
1135
  ## MCP Integration
1136
 
1137
  This Gradio app is MCP-enabled. When deployed to HuggingFace Spaces, it can be accessed via MCP clients.
 
1144
 
1145
  ### What's Exposed via MCP:
1146
 
1147
+ #### 7 MCP Tools (AI-Powered)
1148
+ The seven tools above (`analyze_leaderboard`, `debug_trace`, `estimate_cost`, `compare_runs`, `get_dataset`, `generate_synthetic_dataset`, `push_dataset_to_hub`)
1149
  are automatically exposed as MCP tools and can be called from any MCP client.
1150
 
1151
  #### 3 MCP Resources (Data Access)
 
1181
  return demo
1182
 
1183
  if __name__ == "__main__":
1184
+ logger.info("=" * 70)
1185
+ logger.info("TraceMind MCP Server - HuggingFace Space (Track 1)")
1186
+ logger.info("=" * 70)
1187
+ logger.info("MCP Server: TraceMind Agent Evaluation Platform v1.0.0")
1188
+ logger.info("Protocol: Model Context Protocol (MCP)")
1189
+ logger.info("Transport: Gradio Native MCP Support (SSE)")
1190
+ logger.info("MCP Endpoint: https://kshitijthakkar-tracemind-mcp-server.hf.space/gradio_api/mcp/")
1191
+ logger.info("=" * 70)
1192
+ logger.info("Features:")
1193
+ logger.info(" βœ“ 7 AI-Powered Tools (Leaderboard + Trace + Cost + Dataset)")
1194
+ logger.info(" βœ“ 3 Real-Time Resources (leaderboard, trace, cost data)")
1195
+ logger.info(" βœ“ 3 Prompt Templates (analysis, debug, optimization)")
1196
+ logger.info(" βœ“ Google Gemini 2.5 Pro - Intelligent Analysis")
1197
+ logger.info(" βœ“ HuggingFace Dataset Integration")
1198
+ logger.info(" βœ“ SMOLTRACE Format Support")
1199
+ logger.info(" βœ“ Synthetic Dataset Generation")
1200
+ logger.info("=" * 70)
1201
+ logger.info("Tool Categories:")
1202
+ logger.info(" πŸ“Š Analysis: analyze_leaderboard, compare_runs")
1203
+ logger.info(" πŸ› Debugging: debug_trace")
1204
+ logger.info(" οΏ½οΏ½ Cost: estimate_cost")
1205
+ logger.info(" πŸ“¦ Data: get_dataset")
1206
+ logger.info(" πŸ§ͺ Generation: generate_synthetic_dataset, push_dataset_to_hub")
1207
+ logger.info("=" * 70)
1208
+ logger.info("Compatible Clients:")
1209
+ logger.info(" β€’ Claude Desktop")
1210
+ logger.info(" β€’ Continue.dev (VS Code)")
1211
+ logger.info(" β€’ Cline (VS Code)")
1212
+ logger.info(" β€’ Any MCP-compatible client")
1213
+ logger.info("=" * 70)
1214
+ logger.info("How to Connect (Claude Desktop/HF MCP Client):")
1215
+ logger.info(" 1. Go to https://huggingface.co/settings/mcp")
1216
+ logger.info(" 2. Add Space: kshitijthakkar-tracemind-mcp-server")
1217
+ logger.info(" 3. Start using TraceMind tools in your MCP client!")
1218
+ logger.info("=" * 70)
1219
+ logger.info("Starting Gradio UI + MCP Server on 0.0.0.0:7860...")
1220
+ logger.info("Waiting for connections...")
1221
+ logger.info("=" * 70)
1222
+
1223
+ try:
1224
+ # Create Gradio interface
1225
+ demo = create_gradio_ui()
1226
+
1227
+ # Launch with MCP server enabled
1228
+ demo.launch(
1229
+ server_name="0.0.0.0",
1230
+ server_port=7860,
1231
+ mcp_server=True # Enable MCP server functionality
1232
+ )
1233
+
1234
+ except Exception as e:
1235
+ logger.error(f"Failed to start server: {e}")
1236
+ logger.error("Check that:")
1237
+ logger.error(" 1. GEMINI_API_KEY environment variable is set")
1238
+ logger.error(" 2. Port 7860 is available")
1239
+ logger.error(" 3. All dependencies are installed")
1240
+ raise
gemini_client.py CHANGED
@@ -12,13 +12,13 @@ import json
12
  class GeminiClient:
13
  """Client for Google Gemini API"""
14
 
15
- def __init__(self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"):
16
  """
17
  Initialize Gemini client
18
 
19
  Args:
20
  api_key: Gemini API key (defaults to GEMINI_API_KEY env var)
21
- model_name: Model to use (default: gemini-2.5-flash, can also use gemini-2.5-flash-lite)
22
  """
23
  self.api_key = api_key or os.getenv("GEMINI_API_KEY")
24
  if not self.api_key:
 
12
  class GeminiClient:
13
  """Client for Google Gemini API"""
14
 
15
+ def __init__(self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash-lite"):
16
  """
17
  Initialize Gemini client
18
 
19
  Args:
20
  api_key: Gemini API key (defaults to GEMINI_API_KEY env var)
21
+ model_name: Model to use (default: gemini-2.5-flash-lite, can also use gemini-2.5-flash)
22
  """
23
  self.api_key = api_key or os.getenv("GEMINI_API_KEY")
24
  if not self.api_key:
mcp_tools.py CHANGED
@@ -1,14 +1,34 @@
1
  """
2
- MCP Tool Implementations for TraceMind
3
 
4
- Implements:
5
- - 5 MCP Tools: analyze_leaderboard, debug_trace, estimate_cost, compare_runs, get_dataset
6
- - 3 MCP Resources: leaderboard data, trace data, cost data
7
- - 3 MCP Prompts: analysis prompts, debug prompts, optimization prompts
8
 
9
  With Gradio's native MCP support (mcp_server=True), these are automatically
10
  exposed based on decorators (@gr.mcp.tool, @gr.mcp.resource, @gr.mcp.prompt),
11
  docstrings, and type hints.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
12
  """
13
 
14
  import os
@@ -1114,3 +1134,475 @@ def optimization_prompt(
1114
 
1115
  template = templates.get(optimization_goal, {}).get(constraints, templates["cost"]["maintain_quality"])
1116
  return template
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  """
2
+ MCP Tool Implementations for TraceMind MCP Server
3
 
4
+ This module implements 13 MCP components (7 Tools + 3 Resources + 3 Prompts) for
5
+ AI-powered agent evaluation analysis.
 
 
6
 
7
  With Gradio's native MCP support (mcp_server=True), these are automatically
8
  exposed based on decorators (@gr.mcp.tool, @gr.mcp.resource, @gr.mcp.prompt),
9
  docstrings, and type hints.
10
+
11
+ πŸ› οΈ Tools (7 AI-Powered):
12
+ πŸ“Š analyze_leaderboard - Get AI insights from evaluation leaderboard data
13
+ πŸ› debug_trace - Debug agent execution traces with AI assistance
14
+ πŸ’° estimate_cost - Predict evaluation costs with AI recommendations
15
+ βš–οΈ compare_runs - Compare two evaluation runs with AI analysis
16
+ πŸ“¦ get_dataset - Load SMOLTRACE datasets as JSON for flexible analysis
17
+ πŸ§ͺ generate_synthetic_dataset - Create domain-specific test datasets
18
+ πŸ“€ push_dataset_to_hub - Upload datasets to HuggingFace Hub
19
+
20
+ πŸ“¦ Resources (3 Data Access):
21
+ leaderboard://{repo} - Raw leaderboard data in JSON format
22
+ trace://{trace_id}/{repo} - Raw OpenTelemetry trace data
23
+ cost://model/{model_name} - Model pricing and hardware cost data
24
+
25
+ πŸ“ Prompts (3 Templates):
26
+ analysis_prompt - Standardized templates for analysis requests
27
+ debug_prompt - Standardized templates for debugging scenarios
28
+ optimization_prompt - Standardized templates for optimization goals
29
+
30
+ All AI analysis powered by Google Gemini 2.5 Pro.
31
+ Track 1: Building MCP Servers - Enterprise Category
32
  """
33
 
34
  import os
 
1134
 
1135
  template = templates.get(optimization_goal, {}).get(constraints, templates["cost"]["maintain_quality"])
1136
  return template
1137
+
1138
+
1139
+ # ========================================
1140
+ # NEW TOOLS: Synthetic Dataset Generation
1141
+ # ========================================
1142
+
1143
+ @gr.mcp.tool()
1144
+ async def generate_synthetic_dataset(
1145
+ domain: str,
1146
+ tool_names: str,
1147
+ num_tasks: int = 10,
1148
+ difficulty_distribution: str = "balanced",
1149
+ agent_type: str = "both"
1150
+ ) -> str:
1151
+ """
1152
+ Generate domain-specific synthetic test datasets for SMOLTRACE evaluations using AI.
1153
+
1154
+ This tool uses Google Gemini 2.5 Pro to create realistic, domain-specific evaluation
1155
+ tasks that follow the SMOLTRACE task dataset format. Perfect for creating custom
1156
+ benchmarks when standard datasets don't fit your use case.
1157
+
1158
+ **πŸš€ Batched Generation for Scale**:
1159
+ - Requests >20 tasks are automatically split into parallel batches
1160
+ - Utilizes Gemini's large context window efficiently
1161
+ - Supports up to 100 tasks with 120s timeout per batch
1162
+ - Example: 100 tasks = 5 parallel batches (20 tasks each)
1163
+
1164
+ **Enterprise Use Case**: Quickly create evaluation datasets for:
1165
+ - Custom tools and APIs your agents use
1166
+ - Industry-specific domains (finance, healthcare, legal, manufacturing, etc.)
1167
+ - Internal workflows and business processes
1168
+ - Specialized agent capabilities
1169
+
1170
+ **Security**: Requires GEMINI_API_KEY environment variable.
1171
+
1172
+ Args:
1173
+ domain (str): The domain for synthetic tasks (e.g., "finance", "healthcare", "travel", "ecommerce", "customer_support")
1174
+ tool_names (str): Comma-separated list of tool names to include (e.g., "get_weather,search_web,calculator")
1175
+ num_tasks (int): Number of synthetic tasks to generate. Must be between 5 and 100. Default: 10
1176
+ - 5-20 tasks: Single batch (fast, ~30-60s)
1177
+ - 21-100 tasks: Multiple parallel batches (slower, ~60-120s per batch)
1178
+ difficulty_distribution (str): How to distribute task difficulty. Options: "balanced" (40% easy, 40% medium, 20% hard), "easy_only", "medium_only", "hard_only", "progressive" (50% easy, 30% medium, 20% hard). Default: "balanced"
1179
+ agent_type (str): Target agent type for tasks. Options: "tool" (ToolCallingAgent), "code" (CodeAgent), "both" (50/50 mix). Default: "both"
1180
+
1181
+ Returns:
1182
+ str: JSON-formatted response with dataset_info (including batch statistics), tasks array (SMOLTRACE format), and usage_instructions
1183
+ """
1184
+ try:
1185
+ # Initialize Gemini client
1186
+ gemini_client = GeminiClient()
1187
+
1188
+ # Validate inputs
1189
+ if num_tasks < 5 or num_tasks > 100:
1190
+ return json.dumps({
1191
+ "error": "num_tasks must be between 5 and 100",
1192
+ "num_tasks_provided": num_tasks
1193
+ }, indent=2)
1194
+
1195
+ # Parse tool names
1196
+ tools = [tool.strip() for tool in tool_names.split(",") if tool.strip()]
1197
+ if len(tools) == 0:
1198
+ return json.dumps({
1199
+ "error": "At least one tool name must be provided",
1200
+ "tool_names_provided": tool_names
1201
+ }, indent=2)
1202
+
1203
+ # Calculate distributions
1204
+ difficulty_counts = _calculate_difficulty_distribution(num_tasks, difficulty_distribution)
1205
+ agent_type_counts = _calculate_agent_type_distribution(num_tasks, agent_type)
1206
+
1207
+ # Create generation prompt
1208
+ generation_prompt = f"""You are an expert at creating synthetic evaluation datasets for AI agents.
1209
+
1210
+ Generate {num_tasks} synthetic test tasks for the **{domain}** domain following the SMOLTRACE task format.
1211
+
1212
+ **Available Tools**: {", ".join(tools)}
1213
+
1214
+ **Difficulty Distribution**:
1215
+ - Easy ({difficulty_counts['easy']} tasks): Single tool call, straightforward input, clear expected output
1216
+ - Medium ({difficulty_counts['medium']} tasks): Multiple tool calls OR complex input parsing OR conditional logic
1217
+ - Hard ({difficulty_counts['hard']} tasks): Multiple tools, complex reasoning, edge cases, error handling
1218
+
1219
+ **Agent Type Distribution**:
1220
+ - Tool Agent ({agent_type_counts['tool']} tasks): Uses ToolCallingAgent - declarative tool calling
1221
+ - Code Agent ({agent_type_counts['code']} tasks): Uses CodeAgent - writes Python code with tools
1222
+
1223
+ **SMOLTRACE Task Format** (required structure):
1224
+ ```json
1225
+ {{
1226
+ "id": "string - unique identifier like '{domain.lower()}_{{tool}}_{{number}}'",
1227
+ "prompt": "string - clear, specific task description",
1228
+ "expected_tool": "string - the tool name that should be used",
1229
+ "expected_tool_calls": "integer - how many times the tool should be called (optional, default 1)",
1230
+ "difficulty": "string - 'easy', 'medium', or 'hard'",
1231
+ "agent_type": "string - 'tool' or 'code'",
1232
+ "expected_keywords": "array of strings - keywords expected in response (optional)"
1233
+ }}
1234
+ ```
1235
+
1236
+ **Generation Guidelines**:
1237
+ 1. **Domain Specificity**: Make tasks realistic and specific to the {domain} domain
1238
+ 2. **Tool Usage**: Ensure each task requires using one of: {", ".join(tools)}
1239
+ 3. **Prompt Quality**: Write clear, unambiguous prompts that an agent can execute
1240
+ 4. **Expected Keywords**: Include 2-4 expected keywords for validation (optional but recommended)
1241
+ 5. **Variety**: Vary the tasks to cover different aspects of the domain
1242
+
1243
+ **IMPORTANT**: Return ONLY a valid JSON array of tasks. No explanatory text, no markdown formatting, no code blocks. Just the raw JSON array starting with [ and ending with ].
1244
+
1245
+ Generate exactly {num_tasks} tasks:"""
1246
+
1247
+ print(f"[GENERATE_SYNTHETIC_DATASET] Generating {num_tasks} tasks for domain '{domain}'...")
1248
+ print(f"[GENERATE_SYNTHETIC_DATASET] Tools: {', '.join(tools)}")
1249
+
1250
+ # Import required modules
1251
+ import asyncio
1252
+ import google.generativeai as genai
1253
+
1254
+ # Determine batching strategy
1255
+ # Gemini can handle ~20 tasks per call with 8192 token output limit
1256
+ TASKS_PER_BATCH = 20
1257
+ num_batches = (num_tasks + TASKS_PER_BATCH - 1) // TASKS_PER_BATCH # Ceiling division
1258
+
1259
+ if num_batches > 1:
1260
+ print(f"[GENERATE_SYNTHETIC_DATASET] Large request detected. Splitting into {num_batches} parallel batches...")
1261
+
1262
+ # Create batch generation tasks
1263
+ async def generate_batch(batch_num: int, batch_size: int, batch_difficulty: dict, batch_agent_type: dict):
1264
+ """Generate a single batch of tasks"""
1265
+ batch_prompt = f"""You are an expert at creating synthetic evaluation datasets for AI agents.
1266
+
1267
+ Generate {batch_size} synthetic test tasks for the **{domain}** domain following the SMOLTRACE task format.
1268
+
1269
+ **Available Tools**: {", ".join(tools)}
1270
+
1271
+ **Difficulty Distribution for this batch**:
1272
+ - Easy ({batch_difficulty['easy']} tasks): Single tool call, straightforward input, clear expected output
1273
+ - Medium ({batch_difficulty['medium']} tasks): Multiple tool calls OR complex input parsing OR conditional logic
1274
+ - Hard ({batch_difficulty['hard']} tasks): Multiple tools, complex reasoning, edge cases, error handling
1275
+
1276
+ **Agent Type Distribution for this batch**:
1277
+ - Tool Agent ({batch_agent_type['tool']} tasks): Uses ToolCallingAgent - declarative tool calling
1278
+ - Code Agent ({batch_agent_type['code']} tasks): Uses CodeAgent - writes Python code with tools
1279
+
1280
+ **SMOLTRACE Task Format** (required structure):
1281
+ ```json
1282
+ {{
1283
+ "id": "string - unique identifier like '{domain.lower()}_{{tool}}_batch{batch_num}_{{number}}'",
1284
+ "prompt": "string - clear, specific task description",
1285
+ "expected_tool": "string - the tool name that should be used",
1286
+ "expected_tool_calls": "integer - how many times the tool should be called (optional, default 1)",
1287
+ "difficulty": "string - 'easy', 'medium', or 'hard'",
1288
+ "agent_type": "string - 'tool' or 'code'",
1289
+ "expected_keywords": "array of strings - keywords expected in response (optional)"
1290
+ }}
1291
+ ```
1292
+
1293
+ **Generation Guidelines**:
1294
+ 1. **Domain Specificity**: Make tasks realistic and specific to the {domain} domain
1295
+ 2. **Tool Usage**: Ensure each task requires using one of: {", ".join(tools)}
1296
+ 3. **Prompt Quality**: Write clear, unambiguous prompts that an agent can execute
1297
+ 4. **Expected Keywords**: Include 2-4 expected keywords for validation (optional but recommended)
1298
+ 5. **Variety**: Vary the tasks to cover different aspects of the domain
1299
+ 6. **Unique IDs**: Include 'batch{batch_num}' in task IDs to ensure uniqueness across batches
1300
+
1301
+ **IMPORTANT**: Return ONLY a valid JSON array of tasks. No explanatory text, no markdown formatting, no code blocks. Just the raw JSON array starting with [ and ending with ].
1302
+
1303
+ Generate exactly {batch_size} tasks:"""
1304
+
1305
+ generation_config = {
1306
+ "temperature": 0.8, # Higher for creativity and diversity
1307
+ "top_p": 0.95,
1308
+ "top_k": 40,
1309
+ "max_output_tokens": 8192,
1310
+ }
1311
+
1312
+ try:
1313
+ response = await asyncio.wait_for(
1314
+ gemini_client.model.generate_content_async(
1315
+ batch_prompt,
1316
+ generation_config=generation_config
1317
+ ),
1318
+ timeout=120.0 # 120 seconds per batch for larger datasets
1319
+ )
1320
+ return response.text, None
1321
+ except Exception as e:
1322
+ return None, str(e)
1323
+
1324
+ # Split difficulty and agent type distributions across batches
1325
+ def split_distribution(total_counts: dict, num_batches: int, batch_num: int, remaining_tasks: int):
1326
+ """Split distribution counts across batches fairly"""
1327
+ batch_counts = {}
1328
+ for key, total in total_counts.items():
1329
+ # Calculate fair share for this batch
1330
+ base_share = total // num_batches
1331
+ extra = 1 if batch_num < (total % num_batches) else 0
1332
+ batch_counts[key] = min(base_share + extra, remaining_tasks)
1333
+ return batch_counts
1334
+
1335
+ # Generate all batches in parallel
1336
+ batch_tasks = []
1337
+ remaining_tasks = num_tasks
1338
+
1339
+ for batch_num in range(num_batches):
1340
+ batch_size = min(TASKS_PER_BATCH, remaining_tasks)
1341
+
1342
+ # Calculate distributions for this batch
1343
+ batch_difficulty = split_distribution(difficulty_counts, num_batches, batch_num, batch_size)
1344
+ batch_agent_type = split_distribution(agent_type_counts, num_batches, batch_num, batch_size)
1345
+
1346
+ batch_tasks.append(generate_batch(batch_num, batch_size, batch_difficulty, batch_agent_type))
1347
+ remaining_tasks -= batch_size
1348
+
1349
+ print(f"[GENERATE_SYNTHETIC_DATASET] Executing {num_batches} parallel Gemini API calls...")
1350
+
1351
+ # Execute all batches in parallel
1352
+ batch_results = await asyncio.gather(*batch_tasks)
1353
+
1354
+ # Combine and validate results
1355
+ all_tasks = []
1356
+ errors = []
1357
+
1358
+ for batch_num, (response_text, error) in enumerate(batch_results):
1359
+ if error:
1360
+ errors.append(f"Batch {batch_num} failed: {error}")
1361
+ continue
1362
+
1363
+ try:
1364
+ # Clean response (remove markdown if present)
1365
+ cleaned_response = response_text.strip()
1366
+ if cleaned_response.startswith("```"):
1367
+ import re
1368
+ match = re.search(r'```(?:json)?\s*\n(.*?)\n```', cleaned_response, re.DOTALL)
1369
+ if match:
1370
+ cleaned_response = match.group(1)
1371
+
1372
+ # Parse JSON
1373
+ batch_tasks_parsed = json.loads(cleaned_response)
1374
+
1375
+ if not isinstance(batch_tasks_parsed, list):
1376
+ errors.append(f"Batch {batch_num} did not return a JSON array")
1377
+ continue
1378
+
1379
+ all_tasks.extend(batch_tasks_parsed)
1380
+
1381
+ except json.JSONDecodeError as e:
1382
+ errors.append(f"Batch {batch_num} JSON parsing failed: {str(e)}")
1383
+
1384
+ # Check if we got enough tasks
1385
+ if len(all_tasks) == 0:
1386
+ return json.dumps({
1387
+ "error": "All batches failed to generate tasks",
1388
+ "batch_errors": errors,
1389
+ "suggestion": "Check GEMINI_API_KEY and try again"
1390
+ }, indent=2)
1391
+
1392
+ if errors:
1393
+ print(f"[GENERATE_SYNTHETIC_DATASET] Warning: Some batches failed: {errors}")
1394
+
1395
+ print(f"[GENERATE_SYNTHETIC_DATASET] Successfully generated {len(all_tasks)} tasks across {num_batches} batch(es)")
1396
+
1397
+ # Validate required fields for all tasks
1398
+ synthetic_tasks = all_tasks
1399
+ required_fields = ["id", "prompt", "expected_tool", "difficulty", "agent_type"]
1400
+ for i, task in enumerate(synthetic_tasks):
1401
+ missing_fields = [field for field in required_fields if field not in task]
1402
+ if missing_fields:
1403
+ return json.dumps({
1404
+ "error": f"Task {i} is missing required fields: {missing_fields}",
1405
+ "task": task
1406
+ }, indent=2)
1407
+
1408
+ # Return formatted dataset with metadata
1409
+ result = {
1410
+ "dataset_info": {
1411
+ "domain": domain,
1412
+ "tools": tools,
1413
+ "num_tasks_requested": num_tasks,
1414
+ "num_tasks_generated": len(synthetic_tasks),
1415
+ "num_batches": num_batches,
1416
+ "batches_succeeded": num_batches - len(errors),
1417
+ "batches_failed": len(errors) if errors else 0,
1418
+ "batch_errors": errors if errors else None,
1419
+ "difficulty_distribution": difficulty_counts,
1420
+ "agent_type_distribution": agent_type_counts,
1421
+ "generated_at": datetime.now().isoformat(),
1422
+ "smoltrace_naming_convention": f"{{username}}/smoltrace-{domain.lower()}-tasks",
1423
+ "warning": f"⚠️ {len(errors)} batch(es) failed. Generated {len(synthetic_tasks)}/{num_tasks} tasks." if errors else None
1424
+ },
1425
+ "tasks": synthetic_tasks,
1426
+ "usage_instructions": {
1427
+ "format": "SMOLTRACE task dataset format",
1428
+ "naming_convention": f"Follow SMOLTRACE naming: {{username}}/smoltrace-{domain.lower()}-tasks or {{username}}/smoltrace-{domain.lower()}-tasks-v1 for versioning",
1429
+ "how_to_upload": [
1430
+ "Option 1: Use the push_dataset_to_hub tool in this MCP server",
1431
+ "Option 2: Manual upload with Python code (see example_code below)"
1432
+ ],
1433
+ "example_code": f"""from datasets import Dataset
1434
+
1435
+ # Extract tasks from this response
1436
+ tasks = result["tasks"]
1437
+
1438
+ # Create and push to HuggingFace (following SMOLTRACE naming convention)
1439
+ dataset = Dataset.from_list(tasks)
1440
+ dataset.push_to_hub("your-username/smoltrace-{domain.lower()}-tasks")
1441
+
1442
+ # Use in SMOLTRACE evaluation
1443
+ # smoltrace-eval --model openai/gpt-4 --dataset-name your-username/smoltrace-{domain.lower()}-tasks"""
1444
+ }
1445
+ }
1446
+
1447
+ return json.dumps(result, indent=2, default=str)
1448
+
1449
+ except Exception as e:
1450
+ return json.dumps({
1451
+ "error": f"Failed to generate synthetic dataset: {str(e)}",
1452
+ "domain": domain,
1453
+ "tools": tool_names
1454
+ }, indent=2)
1455
+
1456
+
1457
+ @gr.mcp.tool()
1458
+ async def push_dataset_to_hub(
1459
+ dataset_json: str,
1460
+ repo_name: str,
1461
+ hf_token: str,
1462
+ private: bool = False
1463
+ ) -> str:
1464
+ """
1465
+ Push a generated synthetic dataset to HuggingFace Hub.
1466
+
1467
+ This tool uploads datasets created by generate_synthetic_dataset (or any SMOLTRACE-format
1468
+ dataset) to HuggingFace Hub, making them ready for use in SMOLTRACE evaluations.
1469
+
1470
+ **Naming Convention**: Repo name should follow SMOLTRACE convention:
1471
+ - Format: {username}/smoltrace-{domain}-tasks or {username}/smoltrace-{domain}-tasks-v{version}
1472
+ - Examples: "mycompany/smoltrace-finance-tasks", "alice/smoltrace-healthcare-tasks-v2"
1473
+
1474
+ **Security**: Requires valid HuggingFace token with write permissions.
1475
+
1476
+ Args:
1477
+ dataset_json (str): JSON string containing the tasks array (from generate_synthetic_dataset output, use the "tasks" field)
1478
+ repo_name (str): HuggingFace repository name following SMOLTRACE naming: {username}/smoltrace-{domain}-tasks
1479
+ hf_token (str): HuggingFace API token with write permissions (get from https://huggingface.co/settings/tokens)
1480
+ private (bool): Whether to create a private dataset. Default: False (public)
1481
+
1482
+ Returns:
1483
+ str: JSON response with upload status, dataset URL, and next steps
1484
+ """
1485
+ try:
1486
+ from huggingface_hub import HfApi
1487
+
1488
+ # Validate repo name follows SMOLTRACE convention
1489
+ if "smoltrace-" not in repo_name and "-tasks" not in repo_name:
1490
+ return json.dumps({
1491
+ "warning": "Repository name doesn't follow SMOLTRACE naming convention",
1492
+ "expected_format": "{username}/smoltrace-{domain}-tasks or {username}/smoltrace-{domain}-tasks-v{version}",
1493
+ "your_repo_name": repo_name,
1494
+ "recommendation": "Consider renaming to follow the convention for consistency with SMOLTRACE ecosystem",
1495
+ "proceeding": "Continuing with upload..."
1496
+ }, indent=2)
1497
+
1498
+ # Parse dataset JSON
1499
+ try:
1500
+ tasks = json.loads(dataset_json)
1501
+ if not isinstance(tasks, list):
1502
+ return json.dumps({
1503
+ "error": "dataset_json must be a JSON array of tasks",
1504
+ "type_received": str(type(tasks))
1505
+ }, indent=2)
1506
+ except json.JSONDecodeError as e:
1507
+ return json.dumps({
1508
+ "error": "Invalid JSON in dataset_json",
1509
+ "parse_error": str(e)
1510
+ }, indent=2)
1511
+
1512
+ # Validate task structure
1513
+ required_fields = ["id", "prompt", "expected_tool", "difficulty", "agent_type"]
1514
+ for i, task in enumerate(tasks):
1515
+ missing_fields = [field for field in required_fields if field not in task]
1516
+ if missing_fields:
1517
+ return json.dumps({
1518
+ "error": f"Task {i} is missing required SMOLTRACE fields: {missing_fields}",
1519
+ "task": task
1520
+ }, indent=2)
1521
+
1522
+ # Create dataset and push to hub
1523
+ from datasets import Dataset
1524
+
1525
+ dataset = Dataset.from_list(tasks)
1526
+
1527
+ print(f"[PUSH_DATASET_TO_HUB] Uploading {len(tasks)} tasks to {repo_name}...")
1528
+
1529
+ # Push to hub
1530
+ dataset.push_to_hub(
1531
+ repo_name,
1532
+ token=hf_token,
1533
+ private=private
1534
+ )
1535
+
1536
+ # Return success response
1537
+ result = {
1538
+ "status": "success",
1539
+ "message": f"Successfully uploaded {len(tasks)} tasks to HuggingFace Hub",
1540
+ "dataset_info": {
1541
+ "repository": repo_name,
1542
+ "num_tasks": len(tasks),
1543
+ "visibility": "private" if private else "public",
1544
+ "dataset_url": f"https://huggingface.co/datasets/{repo_name}"
1545
+ },
1546
+ "next_steps": {
1547
+ "view_dataset": f"https://huggingface.co/datasets/{repo_name}",
1548
+ "use_in_smoltrace": f"smoltrace-eval --model openai/gpt-4 --dataset-name {repo_name}",
1549
+ "share_with_team": f"Team members can access at https://huggingface.co/datasets/{repo_name}" if not private else "Dataset is private - share access via HuggingFace settings"
1550
+ }
1551
+ }
1552
+
1553
+ return json.dumps(result, indent=2)
1554
+
1555
+ except ImportError:
1556
+ return json.dumps({
1557
+ "error": "Required packages not installed",
1558
+ "missing_packages": "datasets, huggingface_hub",
1559
+ "install_command": "pip install datasets huggingface_hub"
1560
+ }, indent=2)
1561
+ except Exception as e:
1562
+ return json.dumps({
1563
+ "error": f"Failed to push dataset to hub: {str(e)}",
1564
+ "repo_name": repo_name
1565
+ }, indent=2)
1566
+
1567
+
1568
+ # Helper functions for synthetic dataset generation
1569
+ def _calculate_difficulty_distribution(num_tasks: int, difficulty_distribution: str) -> dict:
1570
+ """Calculate how many tasks of each difficulty to generate."""
1571
+ if difficulty_distribution == "balanced":
1572
+ easy = int(num_tasks * 0.4)
1573
+ medium = int(num_tasks * 0.4)
1574
+ hard = num_tasks - easy - medium
1575
+ elif difficulty_distribution == "easy_only":
1576
+ easy, medium, hard = num_tasks, 0, 0
1577
+ elif difficulty_distribution == "medium_only":
1578
+ easy, medium, hard = 0, num_tasks, 0
1579
+ elif difficulty_distribution == "hard_only":
1580
+ easy, medium, hard = 0, 0, num_tasks
1581
+ elif difficulty_distribution == "progressive":
1582
+ easy = int(num_tasks * 0.5)
1583
+ medium = int(num_tasks * 0.3)
1584
+ hard = num_tasks - easy - medium
1585
+ else:
1586
+ # Default to balanced
1587
+ easy = int(num_tasks * 0.4)
1588
+ medium = int(num_tasks * 0.4)
1589
+ hard = num_tasks - easy - medium
1590
+
1591
+ return {"easy": easy, "medium": medium, "hard": hard}
1592
+
1593
+
1594
+ def _calculate_agent_type_distribution(num_tasks: int, agent_type: str) -> dict:
1595
+ """Calculate how many tasks for each agent type to generate."""
1596
+ if agent_type == "tool":
1597
+ return {"tool": num_tasks, "code": 0}
1598
+ elif agent_type == "code":
1599
+ return {"tool": 0, "code": num_tasks}
1600
+ elif agent_type == "both":
1601
+ tool_count = num_tasks // 2
1602
+ code_count = num_tasks - tool_count
1603
+ return {"tool": tool_count, "code": code_count}
1604
+ else:
1605
+ # Default to both
1606
+ tool_count = num_tasks // 2
1607
+ code_count = num_tasks - tool_count
1608
+ return {"tool": tool_count, "code": code_count}