hf-eda-mcp

KhalilGuetari committed 10 days ago

Commit

64e67e1

1 Parent(s): 11bba08

Document technical details

Files changed (4)

README.md +160 -33
scripts/playground/analysis_tool_playground.py +1 -1
src/hf_eda_mcp/{services → integrations}/dataset_viewer_adapter.py +0 -0
src/hf_eda_mcp/services/dataset_service.py +1 -1

README.md CHANGED

@@ -15,51 +15,133 @@ tags:
   - building-mcp-track-consumer
 ---
-# HuggingFace EDA MCP Server
-An MCP (Model Context Protocol) server that provides tools for Exploratory Data Analysis (EDA) of HuggingFace datasets.
-TODO Add use cases details
-## Features
-Add details about the features, caching, parquet, etc.
-- **Dataset Metadata**: Retrieve comprehensive information about HuggingFace datasets including size, features, splits, and configurations
-- **Dataset Sampling**: Get samples from any dataset split for quick exploration
-- **Feature Analysis**: Perform basic EDA with automatic optimization
-  - Uses HuggingFace Dataset Viewer API for full dataset statistics (when available)
-  - Automatic fallback to sample-based analysis
-  - Supports multiple data types: numerical, categorical, text, image, audio
-  - Includes histograms, distributions, and missing value analysis
-- **Text Search**: Search for text in dataset columns using the Dataset Viewer API
-  - Only text columns are searched
-  - Only parquet datasets are supported
-  - Supports pagination with offset and length parameters
-## Usage
-This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.
-### Available Tools
-1. **get_dataset_metadata**: Get detailed information about a dataset including size, features, splits, and download statistics
-2. **get_dataset_sample**: Retrieve sample rows from a dataset for quick exploration
-3. **analyze_dataset_features**: Perform comprehensive exploratory analysis with automatic optimization
-   - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
-   - Falls back to sample-based analysis for other formats
-   - Returns feature types, statistics, histograms, and missing value analysis
-4. **search_text_in_dataset**: Search for text in dataset columns
-   - Search text in text columns using the Dataset Viewer API
-   - Only parquet datasets are supported
-   - Supports pagination for large result sets
 ## MCP Client Configuration
-Under the hood, tools use DatasetViewer and HfApi to get information on datasets. A HuggingFace Token `hf-api-token` is necessary to use those.
-- **Gradio UI** on the HF space, the token used is a token set in the space's secrets
-- **MCP server**: set up your HF Token in the MCP configuration headers like in the following example:
 ```json
 {
@@ -74,7 +156,7 @@ Under the hood, tools use DatasetViewer and HfApi to get information on datasets
 }
 ```
-Or with a `npx` command:
 ```json
 {
@@ -94,6 +176,51 @@ Or with a `npx` command:
 }
 ```
 ## License
 Apache License 2.0

   - building-mcp-track-consumer
 ---
+# 📊 HuggingFace EDA MCP Server
+> 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/spaces/huggingface/hf-1st-birthday-hackathon)
+An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
+Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
+**Use cases:**
+- **Dataset discovery**:
+  - Inspect metadata, schemas, and samples to evaluate datasets before use
+  - Use it in conjunction with HuggingFace MCP `search_dataset` for even more powerful dataset discovery
+- **Exploratory data analysis**:
+  - Analyze feature distributions, detect missing values, and review statistics
+  - Ask your AI assistant to build reports and visualizations
+- **Content search**: Find specific examples in datasets using text search
+## Available Tools
+### `get_dataset_metadata`
+Retrieve comprehensive metadata about a HuggingFace dataset.
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
+| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
+**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.
+---
+### `get_dataset_sample`
+Retrieve sample rows from a dataset for quick exploration.
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
+| `split` | string | ❌ | `train` | Dataset split to sample from |
+| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
+| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
+| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
+**Returns:** Sample data rows with schema information and sampling metadata.
+---
+### `analyze_dataset_features`
+Perform exploratory data analysis on dataset features with automatic optimization.
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
+| `split` | string | ❌ | `train` | Dataset split to analyze |
+| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
+| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
+**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
+---
+### `search_text_in_dataset`
+Search for text in dataset columns using the Dataset Viewer API.
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
+| `config_name` | string | ✅ | - | Configuration name |
+| `split` | string | ✅ | - | Split name |
+| `query` | string | ✅ | - | Search query |
+| `offset` | int | ❌ | `0` | Pagination offset |
+| `length` | int | ❌ | `10` | Number of results to return |
+**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
+---
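As a rough illustration of the request that could sit behind `search_text_in_dataset`, the sketch below builds a URL for the Dataset Viewer `/search` endpoint from the parameters in the table above. The base URL constant and the helper name `build_search_url` are assumptions made for this example; the adapter's actual code is not part of this commit.

```python
from urllib.parse import urlencode

# Assumed public base URL of the Dataset Viewer API.
DATASET_VIEWER_BASE_URL = "https://datasets-server.huggingface.co"


def build_search_url(
    dataset_id: str,
    config_name: str,
    split: str,
    query: str,
    offset: int = 0,
    length: int = 10,
) -> str:
    """Build a /search request URL (hypothetical helper, not the project's code)."""
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,   # pagination start
        "length": length,   # number of rows to return
    }
    # urlencode percent-escapes the "/" in ids such as "stanfordnlp/imdb".
    return f"{DATASET_VIEWER_BASE_URL}/search?{urlencode(params)}"
```

A client would then issue a GET on the returned URL, sending the HuggingFace token in an `Authorization: Bearer ...` header when the dataset is private or gated.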
+## How It Works
+### API Integrations
+The server leverages multiple HuggingFace APIs:
+| API | Used For |
+|-----|----------|
+| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
+| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
+| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
+### Data Loading Strategy
+- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
+- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
+- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
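The statistics-first strategy with a sample-based fallback can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `take_samples`, `analyze`, and the two callables passed in are hypothetical names, and the returned dictionaries are placeholders for whatever the real analysis produces.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def take_samples(rows: Iterable[dict], num_samples: int) -> list[dict]:
    """Pull at most num_samples rows from a (possibly streaming) iterator."""
    return list(islice(rows, num_samples))


def analyze(
    fetch_statistics: Callable[[], Optional[dict]],
    stream_rows: Callable[[], Iterable[dict]],
    sample_size: int = 1000,
) -> dict:
    """Prefer pre-computed statistics; otherwise fall back to a sample."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Sketch only: a real implementation would catch narrower error types.
        stats = None
    if stats is not None:
        return {"source": "statistics_api", "statistics": stats}
    # Fallback: only the first sample_size rows are ever materialized.
    sample = take_samples(stream_rows(), sample_size)
    return {"source": "sample", "num_rows": len(sample), "rows": sample}
```

With the `datasets` library, `stream_rows` could be something like `lambda: load_dataset("imdb", split="train", streaming=True)`, since a streaming dataset is an iterable of row dictionaries.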
+### Caching
+Results are cached locally to reduce API calls:
+| Cache Type | TTL | Location |
+|------------|-----|----------|
+| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
+| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
+| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
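One simple way to get the behaviour in the table above is a small file-backed cache with a time-to-live. The sketch below is an assumption about how such a cache could look, not the code in `dataset_service.py`; the class name `FileTTLCache`, the JSON file layout, and the key-to-filename mapping are all made up for illustration.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional

ONE_HOUR = 3600  # TTL from the table above, in seconds


class FileTTLCache:
    """Hypothetical JSON file cache whose entries expire after ttl_seconds."""

    def __init__(self, directory: Path, ttl_seconds: float = ONE_HOUR) -> None:
        self.directory = Path(directory)
        self.ttl_seconds = ttl_seconds
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Replace "/" so ids like "org/name" map to a single file name.
        return self.directory / (key.replace("/", "__") + ".json")

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired: drop the stale file
            return None
        return entry["value"]

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        payload = {"stored_at": now, "value": value}
        self._path(key).write_text(json.dumps(payload))
```

The optional `now` argument exists only to make expiry easy to exercise in tests; normal callers would omit it.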
+### Parquet Requirements
+Some features require datasets with `builder_name="parquet"`:
+- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
+- **Full statistics**: Pre-computed stats are only available for parquet datasets
+### Error Handling
+- Automatic retry with exponential backoff for transient network errors
+- Graceful fallback from statistics API to sample-based analysis
+- Descriptive error messages with suggestions for common issues
 ## MCP Client Configuration
+Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
+**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
+### With URL
 ```json
 {
 }
 ```
+### With mcp-remote
 ```json
 {
 }
 ```
+## Project Structure
+```
+src/hf_eda_mcp/
+├── server.py                 # Gradio app with MCP server setup
+├── config.py                 # Server configuration (env vars, defaults)
+├── validation.py             # Input validation for all tools
+├── error_handling.py         # Retry logic, error formatting
+├── tools/                    # MCP tools (exposed via Gradio)
+│   ├── metadata.py           # get_dataset_metadata
+│   ├── sampling.py           # get_dataset_sample
+│   ├── analysis.py           # analyze_dataset_features
+│   └── search.py             # search_text_in_dataset
+├── services/                 # Business logic layer
+│   └── dataset_service.py    # Caching, data loading, statistics
+└── integrations/
+    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
+    └── hf_client.py          # HuggingFace Hub API wrapper (HfApi)
+```
+## Local Development
+### Setup
+```bash
+# Install pdm
+brew install pdm
+# Clone the repository
+git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
+cd hf-eda-mcp
+# Install dependencies
+pdm install
+# Set your HuggingFace token
+export HF_TOKEN=hf_xxx
+# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)
+# Run the server
+pdm run hf-eda-mcp
+```
+The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
 ## License
 Apache License 2.0

scripts/playground/analysis_tool_playground.py CHANGED

@@ -2,7 +2,7 @@ import os
 import logging
 from pprint import pprint
 from dotenv import load_dotenv
-from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
 from hf_eda_mcp.tools.analysis import analyze_dataset_features
 load_dotenv()

 import logging
 from pprint import pprint
 from dotenv import load_dotenv
+from hf_eda_mcp.integrations.dataset_viewer_adapter import DatasetViewerAdapter
 from hf_eda_mcp.tools.analysis import analyze_dataset_features
 load_dotenv()

src/hf_eda_mcp/{services → integrations}/dataset_viewer_adapter.py RENAMED

File without changes

src/hf_eda_mcp/services/dataset_service.py CHANGED

@@ -21,7 +21,7 @@ from hf_eda_mcp.integrations.hf_client import (
     AuthenticationError,
     NetworkError
 )
-from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
 from hf_eda_mcp.error_handling import (
     retry_with_backoff,
     RetryConfig,

     AuthenticationError,
     NetworkError
 )
+from hf_eda_mcp.integrations.dataset_viewer_adapter import DatasetViewerAdapter
 from hf_eda_mcp.error_handling import (
     retry_with_backoff,
     RetryConfig,
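
The hunk above imports `retry_with_backoff` and `RetryConfig` from `hf_eda_mcp.error_handling`, but their implementation is not part of this diff. For readers unfamiliar with the pattern, here is a minimal sketch of retrying with exponential backoff. It deliberately uses different, hypothetical names (`BackoffSettings`, `call_with_backoff`) and default values chosen for illustration, so it should not be read as the project's actual API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class BackoffSettings:
    """Hypothetical retry settings; the values are illustrative defaults."""
    max_attempts: int = 3
    base_delay: float = 0.5  # seconds before the first retry
    factor: float = 2.0      # delay multiplier after each failure


def call_with_backoff(
    func: Callable[[], T],
    settings: Optional[BackoffSettings] = None,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with exponentially growing delays."""
    settings = settings or BackoffSettings()
    delay = settings.base_delay
    for attempt in range(1, settings.max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == settings.max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= settings.factor
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; production code often also adds random jitter to the delay to avoid synchronized retries.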