fix: Remove app.queue() and revert defensive type handling that broke MCP tools
The app.queue(default_concurrency_limit=4) added in commit 02ca5d1 caused
ClosedResourceError on every MCP tool call because the global MCP client
connection was being used across multiple worker processes/threads.
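For context, a minimal sketch of the failure pattern (assuming the Space's rough shape; `FakeMCPClient` and `handle_query` are illustrative stand-ins, not the actual code, and only the Gradio calls mirror the diff below):

```python
# Hypothetical reduction of the bug: one MCP connection created at module scope,
# shared by every Gradio handler invocation. The stand-in client never fails,
# but the real one raises anyio.ClosedResourceError when its streams are used
# from a different worker thread/task than the one that opened them.
import gradio as gr

class FakeMCPClient:
    """Stand-in for the real MCP client connection (illustrative only)."""
    def call_tool(self, name: str, args: dict) -> dict:
        return {"tool": name, "args": args}

mcp_client = FakeMCPClient()  # single global connection, created at import time

def handle_query(question: str) -> str:
    # With default_concurrency_limit=4, Gradio may run this handler on several
    # worker threads at once -- all of them reusing the one connection above.
    return str(mcp_client.call_tool("run_get_leaderboard_summary", {"question": question}))

with gr.Blocks() as app:
    question = gr.Textbox(label="Question")
    answer = gr.Textbox(label="Answer")
    question.submit(handle_query, question, answer)

if __name__ == "__main__":
    app.queue(default_concurrency_limit=4, max_size=4)  # the call this commit removes
    app.launch(server_name="0.0.0.0", server_port=7860)
```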
The defensive type handling added in commit 6022c4b was an attempt to fix
the symptoms but made the problem worse by adding json.loads() calls that
failed when MCP tools returned dicts.
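For reference, json.loads() only accepts str/bytes, so feeding it an already-parsed dict raises a TypeError; the toy payload below merely illustrates the error class described above:

```python
import json

payload = {"summary": {"total_runs": 51}}  # made-up example of a dict return value

print(json.dumps(payload))  # fine: dict -> JSON string

try:
    json.loads(payload)     # fails: loads() expects str, bytes or bytearray
except TypeError as err:
    print(f"TypeError: {err}")
```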
This commit:
- Removes app.queue() to restore single-process execution
- Reverts prompts/code_agent.yaml to the version from commit d78d01f
- Restores working state for demo recording
Related commits:
- 02ca5d1: Configure queue for 4 concurrent users (BROKE MCP)
- 6022c4b: Add defensive type handling (MADE IT WORSE)
- d78d01f: Fix JSON parsing (LAST WORKING VERSION)
Files changed:
- app.py +0 -3
- prompts/code_agent.yaml +8 -27
app.py

@@ -3952,9 +3952,6 @@ if __name__ == "__main__":
     print(f"Data Source: {os.getenv('DATA_SOURCE', 'both')}")
     print(f"JSON Path: {os.getenv('JSON_DATA_PATH', './sample_data')}")
 
-    # Configure queue to handle up to 4 concurrent users
-    app.queue(default_concurrency_limit=4, max_size=4)
-
     app.launch(
         server_name="0.0.0.0",
         server_port=7860,
prompts/code_agent.yaml

@@ -25,17 +25,13 @@ system_prompt: |-
 ---
 Task: "What are the top 3 performing models on the leaderboard and how much do they cost?"
 
-Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!).
+Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). This tool returns a dict ready to use (no json.loads needed).
 ```python
-
-top_raw = run_get_top_performers(
+top_models_data = run_get_top_performers(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=3
 )
-# Defensive: handle both string and dict returns
-top_models_data = json.loads(top_raw) if isinstance(top_raw, str) else top_raw
-
 print(f"Top 3 models by {top_models_data['metric_ranked_by']}:")
 for model in top_models_data['top_performers']:
     print(f" - {model['model']}: {model['success_rate']}% success, ${model['total_cost_usd']}/run")

@@ -73,23 +69,20 @@ system_prompt: |-
 ---
 Task: "Analyze the current leaderboard and show me the top performing models with their costs"
 
-Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset.
+Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. MCP tools return dicts ready to use.
 ```python
-import json
 # Get overview statistics
-summary_raw = run_get_leaderboard_summary(
+summary_data = run_get_leaderboard_summary(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
 )
-summary_data = json.loads(summary_raw) if isinstance(summary_raw, str) else summary_raw
 summary = summary_data['summary']
 
 # Get top 5 performers
-top_raw = run_get_top_performers(
+top_models_data = run_get_top_performers(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=5
 )
-top_models_data = json.loads(top_raw) if isinstance(top_raw, str) else top_raw
 top_models = top_models_data['top_performers']
 
 print(f"Leaderboard Overview:")

@@ -131,22 +124,15 @@ system_prompt: |-
 ---
 Task: "Create a synthetic dataset of 20 finance-related tasks for testing agents with stock price and ROI calculation tools"
 
-Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty.
+Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty. The tool returns a dict ready to use.
 ```python
-
-synthetic_raw = run_generate_synthetic_dataset(
+synthetic_result = run_generate_synthetic_dataset(
     domain="finance",
     tool_names="get_stock_price,calculate_roi,fetch_company_info",
     num_tasks=20,
     difficulty_distribution="balanced",
     agent_type="both"
 )
-# Defensive: handle both string and dict returns
-if isinstance(synthetic_raw, str):
-    synthetic_result = json.loads(synthetic_raw)
-else:
-    synthetic_result = synthetic_raw
-
 print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks")
 print(f"Batches used: {synthetic_result['dataset_info']['num_batches']}")
 print(f"Difficulty distribution: {synthetic_result['dataset_info']['difficulty_distribution']}")

@@ -246,12 +232,7 @@ system_prompt: |-
 - For overview questions (e.g., "how many runs", "average success rate"): Use `run_get_leaderboard_summary()` (99% token savings!)
 - For leaderboard analysis with AI insights: Use `run_analyze_leaderboard()`
 - ONLY use `run_get_dataset()` for non-leaderboard datasets (traces, results, metrics)
-- **IMPORTANT
-```python
-result_raw = run_tool(...)
-result = json.loads(result_raw) if isinstance(result_raw, str) else result_raw
-```
-Then access dict keys safely: `result['key']`. Use json.dumps() when converting dict to string (e.g., for push_dataset_to_hub).
+- **IMPORTANT**: All MCP tools return dict/list objects ready to use - DO NOT use json.loads()! Only use json.dumps() when you need to convert a dict to a JSON string (e.g., for push_dataset_to_hub).
 5. Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters.
 6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'.
 7. Never create any notional variables in our code, as having these in your logs will derail you from the true variables.