KhalilGuetari committed
Commit 11df203 · 1 Parent(s): 69d9c55

Set up base gradio server
.kiro/specs/hf-eda-mcp-server/design.md ADDED
@@ -0,0 +1,305 @@
# Design Document

## Overview

The hf-eda-mcp system is designed as a Gradio-based MCP server that provides exploratory data analysis tools for HuggingFace datasets. The system leverages Gradio's built-in MCP server capabilities to automatically expose EDA functions as MCP tools, enabling seamless integration with AI assistants and other MCP-compatible systems.

The architecture follows a modular approach: core EDA functionality is implemented as separate Python functions, which are then wrapped in Gradio interfaces and automatically converted to MCP tools through Gradio's native MCP integration.

## Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "MCP Client (AI Assistant)"
        A[AI Assistant]
    end

    subgraph "hf-eda-mcp Server"
        B[Gradio App with MCP Server]
        C[EDA Tools Module]
        D[Dataset Service]
        E[HuggingFace Integration]
    end

    subgraph "External Services"
        F[HuggingFace Hub]
    end

    A -->|MCP Protocol| B
    B --> C
    C --> D
    D --> E
    E -->|API Calls| F
```

### Component Architecture

The system is organized into the following key components:

1. **Gradio MCP Server**: The main application that hosts the MCP server and web interface
2. **EDA Tools Module**: Core analysis functions for dataset exploration
3. **Dataset Service**: Handles dataset loading, caching, and metadata retrieval
4. **HuggingFace Integration**: Manages authentication and API interactions with HF Hub

## Components and Interfaces

### 1. Gradio MCP Server (`src/hf_eda_mcp/server.py`)

**Purpose**: Main application entry point that creates Gradio interfaces for EDA tools and enables MCP server functionality.

**Key Responsibilities**:
- Initialize Gradio app with MCP server enabled
- Define Gradio interfaces for each EDA tool
- Handle MCP protocol communication
- Manage server configuration and startup

**Interface**:
```python
def create_gradio_app() -> gr.Blocks:
    """Create and configure the main Gradio application with MCP server."""

def launch_server(port: int = 7860, mcp_server: bool = True) -> None:
    """Launch the Gradio app with MCP server enabled."""
```

### 2. EDA Tools Module (`src/hf_eda_mcp/tools/`)

**Purpose**: Contains individual EDA functions that will be exposed as MCP tools.

#### Dataset Metadata Tool (`tools/metadata.py`)
```python
def get_dataset_metadata(dataset_id: str, config_name: str | None = None) -> dict:
    """
    Retrieve comprehensive metadata for a HuggingFace dataset.

    Args:
        dataset_id: HuggingFace dataset identifier (e.g., 'squad', 'glue')
        config_name: Optional configuration name for multi-config datasets

    Returns:
        Dictionary containing dataset metadata including:
        - Basic info (size, splits, features)
        - Configuration details
        - Download statistics
        - Dataset card information
    """
```
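As a sketch of the normalization step behind this tool, the metadata dict can be built defensively from a `huggingface_hub` `DatasetInfo`-like object. The helper name `metadata_to_dict` is illustrative, not part of the spec; copying attributes with `getattr` keeps the function tolerant of missing fields across `huggingface_hub` versions:

```python
from typing import Any


def metadata_to_dict(info: Any) -> dict:
    """Normalize a DatasetInfo-like object into a plain dict.

    Attributes that are absent on the source object simply map to None,
    so the function degrades gracefully instead of raising.
    """
    fields = ("id", "author", "downloads", "likes", "tags", "created_at", "last_modified")
    return {name: getattr(info, name, None) for name in fields}
```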

#### Dataset Sampling Tool (`tools/sampling.py`)
```python
def get_dataset_sample(
    dataset_id: str,
    split: str = "train",
    num_samples: int = 10,
    config_name: str | None = None
) -> dict:
    """
    Retrieve a sample of rows from a HuggingFace dataset.

    Args:
        dataset_id: HuggingFace dataset identifier
        split: Dataset split to sample from
        num_samples: Number of samples to retrieve
        config_name: Optional configuration name

    Returns:
        Dictionary containing sampled data and metadata
    """
```
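For large datasets, the design calls for streaming rather than a full download. A minimal sketch of bounded sampling, assuming the rows come from an iterable such as one returned by `datasets.load_dataset(..., streaming=True)` (the helper name `take_sample` is illustrative):

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_sample(rows: Iterable[Dict[str, Any]], num_samples: int) -> List[Dict[str, Any]]:
    """Materialize at most num_samples rows from an iterable dataset.

    Because islice stops after num_samples items, a streaming dataset
    never fetches more data than the requested sample.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be positive")
    return list(islice(rows, num_samples))
```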

#### Basic Analysis Tool (`tools/analysis.py`)
```python
def analyze_dataset_features(
    dataset_id: str,
    split: str = "train",
    sample_size: int = 1000,
    config_name: str | None = None
) -> dict:
    """
    Perform basic exploratory analysis on dataset features.

    Args:
        dataset_id: HuggingFace dataset identifier
        split: Dataset split to analyze
        sample_size: Number of samples to use for analysis
        config_name: Optional configuration name

    Returns:
        Dictionary containing feature analysis results:
        - Feature types and distributions
        - Missing value statistics
        - Summary statistics for numerical features
        - Unique value counts for categorical features
    """
```
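The per-feature analysis described above can be sketched with the standard library alone. `summarize_feature` is an illustrative helper, not part of the spec, and a real implementation would need type dispatch beyond this simple numeric check:

```python
import statistics
from typing import Any, Dict, List


def summarize_feature(rows: List[Dict[str, Any]], feature: str) -> Dict[str, Any]:
    """Compute missing-value and summary statistics for one feature."""
    values = [row.get(feature) for row in rows]
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    summary: Dict[str, Any] = {
        "feature_name": feature,
        "missing_count": missing,
        "missing_percentage": 100.0 * missing / len(values) if values else 0.0,
        "unique_count": len(set(present)),
    }
    # Only attach numeric statistics when every present value is numeric.
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        summary["statistics"] = {
            "mean": statistics.fmean(present),
            "min": min(present),
            "max": max(present),
        }
    return summary
```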

### 3. Dataset Service (`src/hf_eda_mcp/services/dataset_service.py`)

**Purpose**: Centralized service for dataset operations, caching, and metadata management.

**Key Responsibilities**:
- Load datasets from HuggingFace Hub
- Cache dataset metadata and samples
- Handle authentication for private datasets
- Manage dataset configurations and splits

**Interface**:
```python
class DatasetService:
    def __init__(self, cache_dir: str | None = None, token: str | None = None):
        """Initialize dataset service with optional caching and authentication."""

    def load_dataset_info(self, dataset_id: str, config_name: str | None = None) -> DatasetInfo:
        """Load dataset information from HuggingFace Hub."""

    def load_dataset_sample(self, dataset_id: str, split: str, num_samples: int, config_name: str | None = None) -> Dataset:
        """Load a sample from the specified dataset."""

    def get_cached_metadata(self, dataset_id: str, config_name: str | None = None) -> dict:
        """Retrieve cached metadata or fetch if not available."""
```
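The `get_cached_metadata` behaviour can be sketched as a small keyed cache. `MetadataCache` is an illustrative name, and a production version would add eviction, TTLs, and optional disk persistence:

```python
from typing import Callable, Dict, Optional, Tuple


class MetadataCache:
    """In-memory cache keyed by (dataset_id, config_name).

    The fetch callback is invoked only on a cache miss, so repeated
    metadata requests for the same dataset hit the Hub API once.
    """

    def __init__(self, fetch: Callable[[str, Optional[str]], dict]):
        self._fetch = fetch
        self._store: Dict[Tuple[str, Optional[str]], dict] = {}

    def get(self, dataset_id: str, config_name: Optional[str] = None) -> dict:
        key = (dataset_id, config_name)
        if key not in self._store:
            self._store[key] = self._fetch(dataset_id, config_name)
        return self._store[key]
```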

### 4. HuggingFace Integration (`src/hf_eda_mcp/integrations/hf_client.py`)

**Purpose**: Handles all interactions with HuggingFace Hub API and datasets library.

**Key Responsibilities**:
- Authenticate with HuggingFace Hub
- Fetch dataset information using huggingface_hub
- Load datasets using the datasets library
- Handle errors and rate limiting

## Data Models

### Dataset Metadata Model
```python
from datetime import datetime
from typing import Dict, List

from pydantic import BaseModel

class DatasetMetadata(BaseModel):
    id: str
    author: str
    description: str
    features: Dict[str, str]
    splits: Dict[str, int]
    configs: List[str]
    size_bytes: int
    downloads: int
    likes: int
    tags: List[str]
    created_at: datetime
    last_modified: datetime
```

### Analysis Result Model
```python
from typing import Any, Dict

from pydantic import BaseModel

class FeatureAnalysis(BaseModel):
    feature_name: str
    feature_type: str
    missing_count: int
    missing_percentage: float
    unique_count: int
    statistics: Dict[str, Any]  # Mean, std, min, max for numerical; top values for categorical
```

### Sample Data Model
```python
from typing import Any, Dict, List

from pydantic import BaseModel

class DatasetSample(BaseModel):
    dataset_id: str
    split: str
    config_name: str
    sample_size: int
    data: List[Dict[str, Any]]
    schema: Dict[str, str]
```

## Error Handling

### Error Categories and Handling Strategy

1. **Dataset Not Found Errors**
   - Return structured error response with suggestions
   - Log error for monitoring
   - Provide helpful error messages to users

2. **Authentication Errors**
   - Handle private dataset access gracefully
   - Provide clear instructions for authentication
   - Support both token-based and login-based auth

3. **Network and API Errors**
   - Implement retry logic with exponential backoff
   - Cache successful responses to reduce API calls
   - Provide fallback responses when possible

4. **Data Processing Errors**
   - Validate input parameters before processing
   - Handle malformed or unexpected data gracefully
   - Provide partial results when possible

### Error Response Format
```python
from typing import Any, Dict, List

from pydantic import BaseModel

class ErrorResponse(BaseModel):
    error_type: str
    message: str
    details: Dict[str, Any]
    suggestions: List[str]
```
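The retry-with-exponential-backoff strategy listed under network errors can be sketched as a plain stdlib helper (`with_retries` is an illustrative name; a real version would retry only on transient exception types):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call fn, retrying failures with exponential backoff (0.5s, 1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("attempts must be >= 1")
```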

## Testing Strategy

### Unit Testing
- Test individual EDA functions with mock datasets
- Validate data processing and analysis logic
- Test error handling for various edge cases
- Mock HuggingFace API calls for consistent testing

### Integration Testing
- Test Gradio interface creation and MCP tool exposure
- Validate end-to-end dataset loading and analysis workflows
- Test authentication and private dataset access
- Verify MCP protocol compliance

### Performance Testing
- Test with large datasets to ensure efficient sampling
- Validate caching mechanisms for repeated requests
- Monitor memory usage during dataset processing
- Test concurrent request handling

### Test Data Strategy
- Use small, well-known public datasets for testing
- Create mock datasets for edge case testing
- Test with various dataset formats and configurations
- Include datasets with missing values and data quality issues

## Configuration and Deployment

### Environment Configuration
- Support for HuggingFace authentication tokens
- Configurable cache directory and size limits
- Adjustable sampling limits and timeouts
- Optional logging and monitoring configuration

### Deployment Options
1. **Local Development**: Run as a standalone Gradio app
2. **HuggingFace Spaces**: Deploy as a hosted MCP server
3. **MCP Client Integration**: Direct integration with MCP-compatible systems

### MCP Server Configuration
The server will be configured to work with standard MCP clients through Gradio's built-in MCP support:

```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "url": "https://your-space.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
.kiro/specs/hf-eda-mcp-server/requirements.md ADDED
@@ -0,0 +1,60 @@
# Requirements Document

## Introduction

This document specifies the requirements for hf-eda-mcp, a Model Context Protocol (MCP) server that provides tools for Exploratory Data Analysis (EDA) of datasets hosted on HuggingFace. The system will enable AI assistants to perform structured dataset exploration and analysis through MCP-compatible interfaces.

## Glossary

- **MCP Server**: A server implementation following the Model Context Protocol that provides tools accessible to AI systems
- **HuggingFace Dataset**: A dataset hosted on the HuggingFace Hub platform
- **EDA Tool**: A function that performs exploratory data analysis operations on datasets
- **Dataset Metadata Tool**: Tool that fetches information about a dataset, including size, features, splits, and configuration details
- **Gradio**: A Python library for building web interfaces and applications
- **Dataset Sample**: A subset of rows from a dataset used for analysis and preview

## Requirements

### Requirement 1

**User Story:** As a data scientist, I want to retrieve metadata from HuggingFace datasets, so that I can understand the structure and properties of datasets before analysis.

#### Acceptance Criteria

1. WHEN a dataset identifier is provided, THE MCP Server SHALL retrieve comprehensive metadata including dataset size, feature types, splits, and configuration details
2. THE MCP Server SHALL validate dataset identifiers and return appropriate error messages for invalid or non-existent datasets
3. THE MCP Server SHALL format metadata in a structured, readable format for AI assistant consumption
4. THE MCP Server SHALL handle datasets with multiple configurations and return configuration-specific metadata when requested

### Requirement 2

**User Story:** As an AI assistant, I want to access dataset samples through MCP tools, so that I can perform analysis on actual data content.

#### Acceptance Criteria

1. WHEN a dataset sample is requested, THE MCP Server SHALL return a configurable number of rows from the specified dataset
2. THE MCP Server SHALL support sampling from different dataset splits (train, validation, test)
3. WHERE a specific configuration is specified, THE MCP Server SHALL return samples from that configuration
4. THE MCP Server SHALL handle large datasets efficiently by implementing appropriate streaming and sampling strategies

### Requirement 3

**User Story:** As a developer, I want the MCP server to follow standard MCP protocols, so that it can integrate seamlessly with MCP-compatible AI systems.

#### Acceptance Criteria

1. THE MCP Server SHALL implement the standard MCP protocol for tool discovery and execution
2. THE MCP Server SHALL provide proper tool schemas and descriptions for all available EDA functions
3. THE MCP Server SHALL handle MCP requests and responses according to protocol specifications
4. THE MCP Server SHALL support graceful error handling and return appropriate MCP error responses

### Requirement 4

**User Story:** As a data analyst, I want basic exploratory analysis tools, so that I can quickly understand dataset characteristics and quality.

#### Acceptance Criteria

1. THE MCP Server SHALL provide tools to analyze dataset feature distributions and statistics
2. THE MCP Server SHALL identify missing values and data quality issues in dataset samples
3. THE MCP Server SHALL generate summary statistics for numerical and categorical features
4. WHERE applicable, THE MCP Server SHALL detect and report potential data anomalies or inconsistencies
.kiro/specs/hf-eda-mcp-server/tasks.md ADDED
@@ -0,0 +1,102 @@
# Implementation Plan

- [x] 1. Set up project structure and dependencies
  - Create package directory structure following Python best practices
  - Configure pyproject.toml with required dependencies (gradio, datasets, huggingface_hub)
  - Set up basic package initialization files
  - _Requirements: 3.1, 4.1, 4.2_

- [ ] 2. Implement HuggingFace integration layer
  - [ ] 2.1 Create HuggingFace client wrapper
    - Write HfClient class to handle authentication and API interactions
    - Implement dataset info retrieval using huggingface_hub
    - Add error handling for authentication and network issues
    - _Requirements: 1.2, 4.3_

  - [ ] 2.2 Implement dataset service with caching
    - Create DatasetService class for centralized dataset operations
    - Add metadata caching to reduce API calls
    - Implement dataset loading and sampling functionality
    - _Requirements: 1.1, 2.1, 2.2_

- [ ] 3. Create core EDA tools
  - [ ] 3.1 Implement dataset metadata tool
    - Write get_dataset_metadata function to retrieve comprehensive dataset information
    - Format metadata response with dataset size, features, splits, and configuration details
    - Handle multi-configuration datasets appropriately
    - _Requirements: 1.1, 1.3, 1.4_

  - [ ] 3.2 Implement dataset sampling tool
    - Create get_dataset_sample function for retrieving dataset samples
    - Support different splits (train, validation, test) and configurable sample sizes
    - Implement efficient sampling strategies for large datasets
    - _Requirements: 2.1, 2.2, 2.3_

  - [ ] 3.3 Implement basic analysis tool
    - Write analyze_dataset_features function for exploratory data analysis
    - Generate feature statistics, missing value analysis, and data quality insights
    - Handle different data types (numerical, categorical, text) appropriately
    - _Requirements: 4.1, 4.2, 4.3, 4.4_

- [ ] 4. Create Gradio interfaces and MCP server
  - [ ] 4.1 Design Gradio interfaces for each EDA tool
    - Create Gradio interface for metadata retrieval with appropriate input/output components
    - Build interface for dataset sampling with split and sample size controls
    - Design interface for feature analysis with configuration options
    - _Requirements: 3.1, 3.2_

  - [ ] 4.2 Implement main Gradio application
    - Create main Gradio app that combines all EDA tool interfaces
    - Enable MCP server functionality using Gradio's built-in MCP support
    - Configure proper tool descriptions and schemas for MCP exposure
    - _Requirements: 3.1, 3.2, 3.3_

  - [ ] 4.3 Add server configuration and startup
    - Implement server launch function with configurable parameters
    - Add environment variable support for authentication and configuration
    - Include proper logging and error handling for server operations
    - _Requirements: 4.1, 4.2, 4.4_

- [ ] 5. Implement error handling and validation
  - [ ] 5.1 Add input validation for all tools
    - Validate dataset identifiers and configuration names
    - Check split names and sample size parameters
    - Provide helpful error messages for invalid inputs
    - _Requirements: 1.2, 2.1_

  - [ ] 5.2 Implement comprehensive error handling
    - Handle dataset not found errors with suggestions
    - Manage authentication errors for private datasets
    - Add retry logic for network and API failures
    - _Requirements: 1.2, 4.3_

  - [ ]* 5.3 Write unit tests for core functionality
    - Create tests for HuggingFace client and dataset service
    - Test EDA tools with mock datasets and various edge cases
    - Validate error handling and input validation logic
    - _Requirements: 1.1, 2.1, 4.1_

- [ ] 6. Integration and deployment setup
  - [ ] 6.1 Create main entry point and CLI
    - Implement main module for running the server
    - Add command-line interface for server configuration
    - Include help documentation and usage examples
    - _Requirements: 4.1, 4.2_

  - [ ] 6.2 Add deployment configuration
    - Create configuration for HuggingFace Spaces deployment
    - Add Docker configuration for containerized deployment
    - Include MCP client configuration examples
    - _Requirements: 4.1, 4.2_

  - [ ]* 6.3 Write integration tests
    - Test end-to-end workflows from MCP client perspective
    - Validate Gradio interface functionality and MCP tool exposure
    - Test with real HuggingFace datasets for integration validation
    - _Requirements: 3.1, 3.2, 3.3_

- [ ]* 7. Documentation and examples
  - Create comprehensive README with installation and usage instructions
  - Add example MCP client configurations for popular clients
  - Include API documentation for all available tools
  - _Requirements: 4.2_
.kiro/steering/product.md ADDED
@@ -0,0 +1,16 @@
# Product Overview

**hf-eda-mcp** is an MCP (Model Context Protocol) server that provides tools for Exploratory Data Analysis (EDA) of HuggingFace datasets.

## Purpose
- Enables AI assistants to perform data analysis on HuggingFace datasets
- Provides structured tools for dataset exploration and visualization
- Integrates with MCP-compatible AI systems for seamless data analysis workflows

## Target Users
- Data scientists and researchers working with HuggingFace datasets
- AI developers building applications that need dataset analysis capabilities
- Anyone needing programmatic access to dataset exploration tools

## License
Apache License 2.0 - Open source project encouraging community contributions and commercial use.
.kiro/steering/structure.md ADDED
@@ -0,0 +1,49 @@
# Project Structure

## Current Organization
```
hf-eda-mcp/
├── .git/             # Git version control
├── .kiro/            # Kiro AI assistant configuration
│   ├── settings/     # Kiro settings (MCP config, etc.)
│   └── steering/     # AI guidance documents
├── .vscode/          # VSCode configuration
├── .gitignore        # Python-focused gitignore
├── LICENSE           # Apache 2.0 license
└── README.md         # Project documentation
```

## Expected Structure (for MCP server)
Based on MCP server conventions, the project will likely expand to:

```
hf-eda-mcp/
├── scripts/
├── src/
│   └── hf_eda_mcp/       # Main package directory
│       ├── __init__.py   # Package initialization
│       ├── server.py     # MCP server implementation
│       └── tools/        # EDA tool implementations
├── tests/                # Test suite
├── docs/                 # Documentation
├── requirements.txt      # Dependencies (or pyproject.toml)
└── setup.py              # Package setup (or pyproject.toml)
```

## Naming Conventions
- **Package/Module**: snake_case (hf_eda_mcp)
- **Classes**: PascalCase (DatasetAnalyzer)
- **Functions/Variables**: snake_case (analyze_dataset)
- **Constants**: UPPER_SNAKE_CASE (DEFAULT_BATCH_SIZE)

## File Organization Principles
- Keep MCP tools modular and focused
- Separate data processing logic from MCP server logic
- Use clear, descriptive names for EDA functions
- Group related analysis tools together
- Follow Python package structure best practices

## Configuration Files
- Use `.kiro/settings/mcp.json` for MCP server configuration
- Environment variables for sensitive data (API keys, etc.)
- Support multiple dependency management systems
.kiro/steering/tech.md ADDED
@@ -0,0 +1,36 @@
# Technology Stack

## Primary Technologies
- **Python**: Core programming language
- **MCP (Model Context Protocol)**: Server framework for AI tool integration
- **HuggingFace**: Dataset ecosystem and APIs

## Development Environment
- **Python Package Management**: pdm (a pdm.lock is checked in)
- **Virtual Environments**: Standard Python venv/virtualenv workflow
- **IDE**: VSCode with Kiro agent integration

## Key Dependencies
- HuggingFace libraries (datasets, transformers ecosystem, gradio)
- MCP server framework (gradio's built-in MCP support)
- Data analysis libraries (likely pandas, numpy, matplotlib/seaborn for EDA)

## Common Commands
```bash
# Environment setup
pdm sync

# Testing
pytest
# OR
python -m pytest

# Linting and formatting
ruff check .
ruff format .
```

## MCP Integration
- Designed to run as an MCP server
- Provides tools accessible to MCP-compatible AI systems
- Configuration through standard MCP server protocols
.vscode/settings.json ADDED
@@ -0,0 +1,3 @@
{
    "kiroAgent.configureMCP": "Enabled"
}
pdm.lock ADDED
The diff for this file is too large to render.
pyproject.toml ADDED
@@ -0,0 +1,41 @@
[project]
name = "hf-eda-mcp"
version = "0.1.0"
description = "MCP server for EDA on HuggingFace datasets"
authors = [
    {name = "Khalil Guetari", email = "khalil.guetari@momentslab.com"},
]
dependencies = [
    "gradio>=5.49.1",
    "datasets>=4.3.0",
    "huggingface_hub>=0.20.0",
    "pydantic>=2.0.0",
    "pandas>=2.0.0",
    "numpy>=1.24.0",
]
requires-python = ">=3.13"
readme = "README.md"
license = {text = "Apache-2.0"}

[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

[project.scripts]
hf-eda-mcp = "hf_eda_mcp.server:launch_server"

[tool.pdm]
distribution = true

[tool.pdm.dev-dependencies]
test = [
    "pytest>=7.0.0",
    "pytest-asyncio>=0.21.0",
    "pytest-cov>=4.0.0",
]
lint = [
    "ruff>=0.1.0",
    "black>=23.0.0",
    "mypy>=1.0.0",
]
src/hf_eda_mcp/__init__.py ADDED
@@ -0,0 +1,11 @@
"""
HuggingFace EDA MCP Server package.

A Model Context Protocol (MCP) server that provides tools for
Exploratory Data Analysis (EDA) of datasets hosted on HuggingFace.
"""

from .server import create_gradio_app, launch_server

__version__ = "0.1.0"
__all__ = ["create_gradio_app", "launch_server"]
src/hf_eda_mcp/__main__.py ADDED
@@ -0,0 +1,46 @@
"""
Main entry point for the hf-eda-mcp server.

This module allows the package to be run as a module using:
    python -m hf_eda_mcp
"""

import argparse
import sys

from .server import launch_server


def main():
    """Main entry point with command line argument parsing."""
    parser = argparse.ArgumentParser(
        description="HuggingFace EDA MCP Server",
        prog="hf-eda-mcp",
    )

    parser.add_argument(
        "--port",
        type=int,
        default=7860,
        help="Port to run the server on (default: 7860)",
    )

    parser.add_argument(
        "--no-mcp",
        action="store_true",
        help="Disable MCP server functionality",
    )

    args = parser.parse_args()

    try:
        launch_server(port=args.port, mcp_server=not args.no_mcp)
    except KeyboardInterrupt:
        print("\nServer stopped by user")
        sys.exit(0)
    except Exception as e:
        print(f"Error starting server: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
src/hf_eda_mcp/integrations/__init__.py ADDED
@@ -0,0 +1,8 @@
"""
Integration module for external services.

This package contains integration classes for HuggingFace Hub
and other external services.
"""

__all__ = []
src/hf_eda_mcp/integrations/hf_client.py ADDED
@@ -0,0 +1,7 @@
"""
HuggingFace client wrapper for API interactions.

This module will be implemented in task 2.1.
"""

# Placeholder - will be implemented in task 2.1
src/hf_eda_mcp/server.py ADDED
@@ -0,0 +1,30 @@
"""
Main Gradio application with MCP server functionality.

This module provides the main entry point for the hf-eda-mcp server,
creating Gradio interfaces for EDA tools and enabling MCP server functionality.
"""

import gradio as gr


def create_gradio_app() -> gr.Blocks:
    """Create and configure the main Gradio application with MCP server."""
    # Placeholder implementation - will be expanded in later tasks
    with gr.Blocks(title="HF EDA MCP Server") as app:
        gr.Markdown("# HuggingFace EDA MCP Server")
        gr.Markdown("MCP server for exploratory data analysis of HuggingFace datasets.")

    return app


def launch_server(port: int = 7860, mcp_server: bool = True) -> None:
    """Launch the Gradio app with MCP server enabled."""
    app = create_gradio_app()

    # Forward the mcp_server flag so MCP tool exposure can be toggled off
    # (e.g. via the CLI's --no-mcp option).
    app.launch(server_port=port, mcp_server=mcp_server, share=False, show_error=True)


if __name__ == "__main__":
    launch_server()
src/hf_eda_mcp/services/__init__.py ADDED
@@ -0,0 +1,8 @@
"""
Services module for dataset operations and integrations.

This package contains service classes for dataset management, caching,
and external API integrations.
"""

__all__ = []
src/hf_eda_mcp/services/dataset_service.py ADDED
@@ -0,0 +1,7 @@
"""
Dataset service for centralized dataset operations and caching.

This module will be implemented in task 2.2.
"""

# Placeholder - will be implemented in task 2.2
src/hf_eda_mcp/tools/__init__.py ADDED
@@ -0,0 +1,7 @@
"""
EDA tools module for HuggingFace datasets.

This package contains individual EDA functions that will be exposed as MCP tools.
"""

__all__ = []
src/hf_eda_mcp/tools/analysis.py ADDED
@@ -0,0 +1,7 @@
"""
Basic analysis tool for exploratory data analysis.

This module will be implemented in task 3.3.
"""

# Placeholder - will be implemented in task 3.3
src/hf_eda_mcp/tools/metadata.py ADDED
@@ -0,0 +1,7 @@
"""
Dataset metadata tool for retrieving HuggingFace dataset information.

This module will be implemented in task 3.1.
"""

# Placeholder - will be implemented in task 3.1
src/hf_eda_mcp/tools/sampling.py ADDED
@@ -0,0 +1,7 @@
"""
Dataset sampling tool for retrieving dataset samples.

This module will be implemented in task 3.2.
"""

# Placeholder - will be implemented in task 3.2
tests/__init__.py ADDED
File without changes