KhalilGuetari committed on
Commit 64e67e1 · 1 Parent(s): 11bba08

Document technical details

README.md CHANGED
@@ -15,51 +15,133 @@ tags:
  - building-mcp-track-consumer
  ---

- # HuggingFace EDA MCP Server

- An MCP (Model Context Protocol) server that provides tools for Exploratory Data Analysis (EDA) of HuggingFace datasets.

- TODO Add use cases details

- ## Features

- Add details about the features, caching, parquet, etc.

- - **Dataset Metadata**: Retrieve comprehensive information about HuggingFace datasets including size, features, splits, and configurations
- - **Dataset Sampling**: Get samples from any dataset split for quick exploration
- - **Feature Analysis**: Perform basic EDA with automatic optimization
-   - Uses HuggingFace Dataset Viewer API for full dataset statistics (when available)
-   - Automatic fallback to sample-based analysis
-   - Supports multiple data types: numerical, categorical, text, image, audio
-   - Includes histograms, distributions, and missing value analysis
- - **Text Search**: Search for text in dataset columns using the Dataset Viewer API
-   - Only text columns are searched
-   - Only parquet datasets are supported
-   - Supports pagination with offset and length parameters

- ## Usage

- This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.

- ### Available Tools

- 1. **get_dataset_metadata**: Get detailed information about a dataset including size, features, splits, and download statistics
- 2. **get_dataset_sample**: Retrieve sample rows from a dataset for quick exploration
- 3. **analyze_dataset_features**: Perform comprehensive exploratory analysis with automatic optimization
-    - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
-    - Falls back to sample-based analysis for other formats
-    - Returns feature types, statistics, histograms, and missing value analysis
- 4. **search_text_in_dataset**: Search for text in dataset columns
-    - Search text in text columns using the Dataset Viewer API
-    - Only parquet datasets are supported
-    - Supports pagination for large result sets

  ## MCP Client Configuration

- Under the hood, tools use DatasetViewer and HfApi to get information on datasets. A HuggingFace Token `hf-api-token` is necessary to use those.

- - **Gradio UI** on the HF space, the token used is a token set in the space's secrets
- - **MCP server**: set up your HF Token in the MCP configuration headers like in the following example:

  ```json
  {
@@ -74,7 +156,7 @@ Under the hood, tools use DatasetViewer and HfApi to get information on datasets
  }
  ```

- Or with a `npx` command:

  ```json
  {
@@ -94,6 +176,51 @@ Or with a `npx` command:
  }
  ```

  ## License

  Apache License 2.0

  - building-mcp-track-consumer
  ---

+ # 📊 HuggingFace EDA MCP Server

+ > 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/spaces/huggingface/hf-1st-birthday-hackathon)

+ An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

+ Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.

+ **Use cases:**
+ - **Dataset discovery**:
+   - Inspect metadata, schemas, and samples to evaluate datasets before use
+   - Use it in conjunction with HuggingFace MCP `search_dataset` for even more powerful dataset discovery
+ - **Exploratory data analysis**:
+   - Analyze feature distributions, detect missing values, and review statistics
+   - Ask your AI assistant to build reports and visualizations
+ - **Content search**: Find specific examples in datasets using text search

+ ## Available Tools

+ ### `get_dataset_metadata`

+ Retrieve comprehensive metadata about a HuggingFace dataset.

+ | Parameter | Type | Required | Default | Description |
+ |-----------|------|----------|---------|-------------|
+ | `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
+ | `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

+ **Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.
+
+ ---
+
+ ### `get_dataset_sample`
+
+ Retrieve sample rows from a dataset for quick exploration.
+
+ | Parameter | Type | Required | Default | Description |
+ |-----------|------|----------|---------|-------------|
+ | `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
+ | `split` | string | ❌ | `train` | Dataset split to sample from |
+ | `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
+ | `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
+ | `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
+
+ **Returns:** Sample data rows with schema information and sampling metadata.
+
+ ---
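
As an illustration of how a client might invoke this tool, the sketch below builds a JSON-RPC 2.0 `tools/call` request, the message shape MCP uses for tool invocation. `build_tool_call` is a hypothetical helper that does not appear in this commit; the argument names and the 10,000 cap are taken from the parameter table above, and the tool name the hosted server actually registers may differ (some servers prefix it).

```python
import json

MAX_SAMPLES = 10_000  # cap taken from the parameter table above


def build_tool_call(dataset_id, split="train", num_samples=10,
                    config_name=None, streaming=True, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for get_dataset_sample.

    Hypothetical helper for illustration; it only constructs the payload
    and does not send anything over the network.
    """
    if not 1 <= num_samples <= MAX_SAMPLES:
        raise ValueError(f"num_samples must be between 1 and {MAX_SAMPLES}")
    arguments = {
        "dataset_id": dataset_id,
        "split": split,
        "num_samples": num_samples,
        "streaming": streaming,
    }
    # Only send config_name when the caller chose one.
    if config_name is not None:
        arguments["config_name"] = config_name
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "get_dataset_sample", "arguments": arguments},
    }


if __name__ == "__main__":
    print(json.dumps(build_tool_call("imdb", num_samples=5), indent=2))
```

Validating `num_samples` on the client side is optional; the project structure lists a `validation.py` module, so the server presumably performs its own checks.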
+
+ ### `analyze_dataset_features`
+
+ Perform exploratory data analysis on dataset features with automatic optimization.
+
+ | Parameter | Type | Required | Default | Description |
+ |-----------|------|----------|---------|-------------|
+ | `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
+ | `split` | string | ❌ | `train` | Dataset split to analyze |
+ | `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
+ | `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
+
+ **Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
+
+ ---
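
To make the numerical part of the return value concrete, here is a minimal sketch of sample-based statistics for a single column. It is not the code in `tools/analysis.py`; `summarize_numeric` and the output keys are hypothetical names chosen for illustration, and only the mean/std/min/max and missing-value fields named above are computed.

```python
import math
import statistics


def summarize_numeric(values):
    """Summarize one numerical column; None and NaN count as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = len(values) - len(present)
    summary = {
        "count": len(present),
        "missing": missing,
        "missing_ratio": missing / len(values) if values else 0.0,
    }
    if present:
        summary.update(
            mean=statistics.fmean(present),
            # Sample standard deviation needs at least two points.
            std=statistics.stdev(present) if len(present) > 1 else 0.0,
            min=min(present),
            max=max(present),
        )
    return summary


if __name__ == "__main__":
    print(summarize_numeric([1, 2, 3, None]))
```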
+
+ ### `search_text_in_dataset`
+
+ Search for text in dataset columns using the Dataset Viewer API.
+
+ | Parameter | Type | Required | Default | Description |
+ |-----------|------|----------|---------|-------------|
+ | `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
+ | `config_name` | string | ✅ | - | Configuration name |
+ | `split` | string | ✅ | - | Split name |
+ | `query` | string | ✅ | - | Search query |
+ | `offset` | int | ❌ | `0` | Pagination offset |
+ | `length` | int | ❌ | `10` | Number of results to return |
+
+ **Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
+
+ ---
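
The tool parameters map closely onto the query string of the public Dataset Viewer `/search` endpoint. The sketch below only builds that URL; `build_search_url` is a hypothetical helper, and how `dataset_viewer_adapter.py` actually issues the request (headers, token handling) is not shown in this commit.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://datasets-server.huggingface.co/search"


def build_search_url(dataset_id, config_name, split, query, offset=0, length=10):
    """Build a Dataset Viewer /search URL from the tool's parameters."""
    if offset < 0 or length < 1:
        raise ValueError("offset must be >= 0 and length must be >= 1")
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes the slash in namespaced dataset ids.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Third page of ten results for the query "great movie".
    print(build_search_url("stanfordnlp/imdb", "plain_text", "train",
                           "great movie", offset=20))
```

Paging through results is then a matter of increasing `offset` by `length` on each call.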
+
+ ## How It Works
+
+ ### API Integrations
+
+ The server leverages multiple HuggingFace APIs:
+
+ | API | Used For |
+ |-----|----------|
+ | **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
+ | **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
+ | **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
+
+ ### Data Loading Strategy
+
+ - **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
+ - **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
+ - **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
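
The three bullets above can be sketched as a "statistics first, sample second" routine. This is an assumption-laden outline rather than the code in `dataset_service.py`: `take_samples` and `analyze` are hypothetical names, and the two callables stand in for the Dataset Viewer request and the streaming loader.

```python
from itertools import islice


def take_samples(rows, num_samples):
    """Read at most num_samples rows from an iterable without loading the rest."""
    return list(islice(rows, num_samples))


def analyze(fetch_statistics, stream_rows, sample_size=1000):
    """Prefer pre-computed statistics; fall back to a streamed sample.

    fetch_statistics: callable returning a statistics dict, or None if absent.
    stream_rows: callable returning an iterable of rows.
    """
    try:
        stats = fetch_statistics()
        if stats is not None:
            return {"source": "statistics_api", "statistics": stats}
    except Exception:
        # Sketch only: real code would catch narrower, known error types.
        pass
    sample = take_samples(stream_rows(), sample_size)
    return {"source": "sample", "num_rows": len(sample), "rows": sample}


if __name__ == "__main__":
    def unavailable():
        raise RuntimeError("statistics not available")

    result = analyze(unavailable, lambda: iter(range(10**9)), sample_size=3)
    print(result["source"], result["rows"])
```

With the `datasets` library installed, `stream_rows` could be `lambda: load_dataset("imdb", split="train", streaming=True)`, since a streamed dataset is iterable and only the first `sample_size` rows would be read.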
+
+ ### Caching
+
+ Results are cached locally to reduce API calls:
+
+ | Cache Type | TTL | Location |
+ |------------|-----|----------|
+ | Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
+ | Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
+ | Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
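
A file-based cache with a one-hour TTL, as in the table above, could look roughly like the following. The class name, JSON file layout, and key sanitization are assumptions made for this sketch; the actual on-disk format used by the server is not described in this commit.

```python
import json
import time
from pathlib import Path


class TTLFileCache:
    """Minimal JSON file cache where entries expire after ttl_seconds."""

    def __init__(self, directory, ttl_seconds=3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, key):
        # Dataset ids such as "stanfordnlp/imdb" contain a slash.
        return self.directory / f"{key.replace('/', '__')}.json"

    def set(self, key, value, now=None):
        stored_at = time.time() if now is None else now
        entry = {"stored_at": stored_at, "value": value}
        self._path(key).write_text(json.dumps(entry))

    def get(self, key, now=None):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # drop the stale entry
            return None
        return entry["value"]
```

A metadata cache would then be created with something like `TTLFileCache(Path.home() / ".cache/hf_eda_mcp/metadata")`; the `now` parameter exists only to make expiry easy to test.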
+
+ ### Parquet Requirements
+
+ Some features require datasets with `builder_name="parquet"`:
+ - **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
+ - **Full statistics**: Pre-computed stats are only available for parquet datasets
+
+ ### Error Handling
+
+ - Automatic retry with exponential backoff for transient network errors
+ - Graceful fallback from statistics API to sample-based analysis
+ - Descriptive error messages with suggestions for common issues
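
For readers unfamiliar with the first bullet, retry with exponential backoff can be sketched as below. The repository exposes `retry_with_backoff` and `RetryConfig` in `hf_eda_mcp.error_handling`, but their signatures are not visible in this diff, so `retry_call` and its parameters here are hypothetical stand-ins.

```python
import random
import time


def retry_call(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying transient errors with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays of 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_call(flaky, base_delay=0.01))
```

Errors outside `retry_on` propagate immediately, which keeps non-transient failures such as authentication errors from being retried.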

  ## MCP Client Configuration

+ Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

+ **Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
+
+ ### With URL

  ```json
  {

  }
  ```

+ ### With mcp-remote

  ```json
  {

  }
  ```

+ ## Project Structure
+
+ ```
+ src/hf_eda_mcp/
+ ├── server.py                      # Gradio app with MCP server setup
+ ├── config.py                      # Server configuration (env vars, defaults)
+ ├── validation.py                  # Input validation for all tools
+ ├── error_handling.py              # Retry logic, error formatting
+ ├── tools/                         # MCP tools (exposed via Gradio)
+ │   ├── metadata.py                # get_dataset_metadata
+ │   ├── sampling.py                # get_dataset_sample
+ │   ├── analysis.py                # analyze_dataset_features
+ │   └── search.py                  # search_text_in_dataset
+ ├── services/                      # Business logic layer
+ │   └── dataset_service.py         # Caching, data loading, statistics
+ └── integrations/
+     ├── dataset_viewer_adapter.py  # Dataset Viewer API client
+     └── hf_client.py               # HuggingFace Hub API wrapper (HfApi)
+ ```
+
+ ## Local Development
+
+ ### Setup
+
+ ```bash
+ # Install pdm
+ brew install pdm
+
+ # Clone the repository
+ git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
+ cd hf-eda-mcp
+
+ # Install dependencies
+ pdm install
+
+ # Set your HuggingFace token
+ export HF_TOKEN=hf_xxx
+ # or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)
+
+ # Run the server
+ pdm run hf-eda-mcp
+ ```
+
+ The server starts at `http://localhost:7860` with the MCP endpoint at `/gradio_api/mcp/`.
+
  ## License

  Apache License 2.0
scripts/playground/analysis_tool_playground.py CHANGED
@@ -2,7 +2,7 @@ import os
  import logging
  from pprint import pprint
  from dotenv import load_dotenv
- from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
+ from hf_eda_mcp.integrations.dataset_viewer_adapter import DatasetViewerAdapter
  from hf_eda_mcp.tools.analysis import analyze_dataset_features

  load_dotenv()
src/hf_eda_mcp/{services → integrations}/dataset_viewer_adapter.py RENAMED
File without changes
src/hf_eda_mcp/services/dataset_service.py CHANGED
@@ -21,7 +21,7 @@ from hf_eda_mcp.integrations.hf_client import (
      AuthenticationError,
      NetworkError
  )
- from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
+ from hf_eda_mcp.integrations.dataset_viewer_adapter import DatasetViewerAdapter
  from hf_eda_mcp.error_handling import (
      retry_with_backoff,
      RetryConfig,