| # Data Ingestion Scripts | |
| Documentation for scripts that download and process geographic datasets. | |
| --- | |
| ## Overview | |
| Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources: | |
| - OpenStreetMap via Geofabrik | |
| - Humanitarian Data Exchange (HDX) | |
| - World Bank Open Data | |
| - STRI GIS Portal | |
| - Kontur Population | |
- Global reference datasets (e.g., Natural Earth)
| --- | |
## Scripts Reference
| ### 1. download_geofabrik.py | |
| Downloads OpenStreetMap data for Panama from Geofabrik. | |
| **Usage**: | |
| ```bash | |
| cd backend | |
| python scripts/download_geofabrik.py | |
| ``` | |
| **What it downloads**: | |
| - Roads network | |
| - Buildings | |
| - POI (points of interest) | |
| - Natural features | |
| **Output**: GeoJSON files in `backend/data/osm/` | |
| **Schedule**: Run monthly for updates | |
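**Example** (loading an output with GeoPandas; the file name here is hypothetical):
```python
import geopandas as gpd

# Hypothetical file name; check backend/data/osm/ for the actual outputs
roads = gpd.read_file("backend/data/osm/roads.geojson")
print(f"{len(roads)} road features, CRS: {roads.crs}")
```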
| --- | |
| ### 2. download_hdx_panama.py | |
| Downloads administrative boundaries from Humanitarian Data Exchange. | |
| **Usage**: | |
| ```bash | |
| python scripts/download_hdx_panama.py | |
| ``` | |
| **Downloads**: | |
| - Level 1: Provinces (10 features) | |
| - Level 2: Districts (81 features) | |
| - Level 3: Corregimientos (679 features) | |
| **Output**: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson` | |
| **Schedule**: Annual updates | |
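**Example** (verifying the downloaded levels; paths match the output above):
```python
import geopandas as gpd
from pathlib import Path

data_dir = Path("backend/data/hdx")
for level in (1, 2, 3):
    gdf = gpd.read_file(data_dir / f"pan_admin{level}_2021.geojson")
    print(f"Level {level}: {len(gdf)} features")  # expect 10, 81, 679
```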
| --- | |
| ### 3. download_worldbank.py | |
| Downloads World Bank development indicators. | |
| **Usage**: | |
| ```bash | |
| python scripts/download_worldbank.py | |
| ``` | |
| **Indicators**: | |
| - GDP per capita | |
| - Life expectancy | |
| - Access to electricity | |
| - Internet usage | |
- Additional indicators configured in the script
| **Output**: `backend/data/worldbank/indicators.geojson` | |
| **Processing**: Joins indicator data with country geometries | |
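**Example** (a minimal sketch of that join; file paths and column names are hypothetical):
```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: country polygons keyed by ISO3 and a flat indicator table
countries = gpd.read_file("countries.geojson")
indicators = pd.read_csv("indicators.csv")  # e.g. columns: iso3, gdp_per_capita

# Attach indicator columns to each country geometry
joined = countries.merge(indicators, left_on="ISO_A3", right_on="iso3", how="left")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")
```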
| --- | |
| ### 4. download_stri_data.py | |
| Downloads datasets from STRI GIS Portal. | |
| **Usage**: | |
| ```bash | |
| python scripts/download_stri_data.py | |
| ``` | |
| **Downloads**: | |
| - Protected areas | |
| - Forest cover | |
| - Environmental datasets | |
| **Output**: `backend/data/stri/*.geojson` | |
**Note**: Uses the ArcGIS REST API
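**Example** (a typical ArcGIS REST layer query; the layer URL is hypothetical):
```python
import geopandas as gpd
import requests

# Hypothetical layer URL; the real endpoints live on the STRI portal's server
LAYER_URL = "https://example.org/arcgis/rest/services/ProtectedAreas/MapServer/0/query"

params = {"where": "1=1", "outFields": "*", "f": "geojson"}
resp = requests.get(LAYER_URL, params=params, timeout=60)
resp.raise_for_status()
gdf = gpd.GeoDataFrame.from_features(resp.json()["features"], crs="EPSG:4326")
print(f"{len(gdf)} features")
```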
| --- | |
| ### 5. stri_catalog_scraper.py | |
| Discovers and catalogs all available STRI datasets. | |
| **Usage**: | |
| ```bash | |
| python scripts/stri_catalog_scraper.py | |
| ``` | |
| **Output**: JSON catalog of 100+ STRI datasets with metadata | |
| **Features**: | |
| - Priority scoring | |
| - Temporal dataset detection | |
| - REST endpoint generation | |
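**Example** (a hypothetical sketch of endpoint generation against an ArcGIS services catalog; the base URL is a placeholder):
```python
import requests

BASE = "https://example.org/arcgis/rest/services"  # placeholder host

def list_query_endpoints():
    """Walk the ArcGIS services catalog and collect layer query URLs."""
    catalog = requests.get(BASE, params={"f": "json"}, timeout=30).json()
    endpoints = []
    for svc in catalog.get("services", []):
        svc_url = f"{BASE}/{svc['name']}/{svc['type']}"
        info = requests.get(svc_url, params={"f": "json"}, timeout=30).json()
        for layer in info.get("layers", []):
            endpoints.append(f"{svc_url}/{layer['id']}/query")
    return endpoints
```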
| --- | |
| ### 6. create_province_layer.py | |
| Creates province-level socioeconomic data layer. | |
| **Usage**: | |
| ```bash | |
| python scripts/create_province_layer.py | |
| ``` | |
| **Combines**: | |
| - INEC Census data | |
| - MPI (poverty index) | |
| - Administrative geometries | |
| **Output**: `backend/data/socioeconomic/province_socioeconomic.geojson` | |
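**Example** (a minimal sketch of the attribute join; the census file and join key are hypothetical):
```python
import geopandas as gpd
import pandas as pd

provinces = gpd.read_file("backend/data/hdx/pan_admin1_2021.geojson")
census = pd.read_csv("inec_census.csv")  # hypothetical file with a province_name column

layer = provinces.merge(census, on="province_name", how="left")  # hypothetical join key
layer.to_file("backend/data/socioeconomic/province_socioeconomic.geojson", driver="GeoJSON")
```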
| --- | |
| ### 7. download_global_datasets.py | |
| Downloads global reference datasets. | |
| **Usage**: | |
| ```bash | |
| python scripts/download_global_datasets.py | |
| ``` | |
| **Downloads**: | |
| - Natural Earth country boundaries | |
| - Global admin boundaries | |
| - Reference layers | |
| **Output**: `backend/data/global/*.geojson` | |
| --- | |
| ### 8. register_global_datasets.py | |
| Registers global datasets in catalog.json. | |
| **Usage**: | |
| ```bash | |
| python scripts/register_global_datasets.py | |
| ``` | |
| **Action**: Adds dataset entries to `backend/data/catalog.json` | |
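**Example** (conceptually, a read-modify-write of the catalog; the entry shown is hypothetical):
```python
import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

# Hypothetical entry following the catalog format documented below
catalog["natural_earth_countries"] = {
    "path": "global/natural_earth_countries.geojson",
    "description": "Natural Earth country boundaries",
    "categories": ["reference"],
    "tags": ["countries", "global"],
}
catalog_path.write_text(json.dumps(catalog, indent=2))
```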
| --- | |
| ## Adding New Data Sources | |
| ### Step-by-Step Guide | |
| #### 1. Create Download Script | |
Create `backend/scripts/download_custom_data.py`:
| ```python | |
import requests
from pathlib import Path

def download_custom_data():
    """Download a custom dataset and save it as GeoJSON."""
    # Define output path relative to this script
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data, failing loudly on HTTP errors
    url = "https://example.com/data.geojson"
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    output_file.write_text(response.text)
    print(f"Downloaded to {output_file}")

if __name__ == "__main__":
    download_custom_data()
| ``` | |
| #### 2. Update Catalog | |
| Add entry to `backend/data/catalog.json`: | |
| ```json | |
| { | |
| "custom_data": { | |
| "path": "custom/custom_data.geojson", | |
| "description": "Short description for display", | |
| "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what data represents, coverage area, and typical use cases.", | |
| "categories": ["infrastructure"], | |
| "tags": ["roads", "transport", "panama"], | |
| "schema": { | |
| "columns": ["name", "type", "length_km", "geom"], | |
| "geometry_type": "LineString" | |
| } | |
| } | |
| } | |
| ``` | |
| **Key Fields**: | |
| - `path`: Relative path from `backend/data/` | |
| - `description`: Human-readable short description | |
| - `semantic_description`: Detailed description for AI semantic search | |
| - `categories`: Classify dataset | |
| - `tags`: Keywords for filtering | |
| - `schema`: Optional column and geometry info | |
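**Example** (a quick sanity check of catalog entries; assumes every entry carries the fields above):
```python
import json
from pathlib import Path

data_dir = Path("backend/data")
catalog = json.loads((data_dir / "catalog.json").read_text())

required = {"path", "description", "semantic_description", "categories", "tags"}
for name, entry in catalog.items():
    missing = required - entry.keys()
    assert not missing, f"{name}: missing fields {missing}"
    assert (data_dir / entry["path"]).exists(), f"{name}: file not found at {entry['path']}"
print(f"Validated {len(catalog)} catalog entries")
```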
| #### 3. Regenerate Embeddings | |
| ```bash | |
# Run from the repository root
rm backend/data/embeddings.npy
| python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()" | |
| ``` | |
| This generates vector embeddings for the new dataset description. | |
| #### 4. Test Discovery | |
| ```bash | |
| # Start backend | |
| uvicorn backend.main:app --reload | |
| # Test query | |
| curl -X POST http://localhost:8000/api/chat \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"message":"show me [your new data]","history":[]}' | |
| ``` | |
| Verify the AI can discover and query your dataset. | |
| --- | |
| ## Script Templates | |
| ### Basic Download Template | |
| ```python | |
| #!/usr/bin/env python3 | |
| """ | |
| Download script for [DATA SOURCE NAME] | |
| """ | |
| import geopandas as gpd | |
| import requests | |
| from pathlib import Path | |
| import logging | |
| logging.basicConfig(level=logging.INFO) | |
| logger = logging.getLogger(__name__) | |
| # Constants | |
| DATA_URL = "https://example.com/data.geojson" | |
| OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category" | |
| def download_data(): | |
| """Download and process data.""" | |
| OUTPUT_DIR.mkdir(parents=True, exist_ok=True) | |
| logger.info(f"Downloading from {DATA_URL}") | |
| # Download | |
| gdf = gpd.read_file(DATA_URL) | |
| # Process (example: project to WGS84) | |
| if gdf.crs and gdf.crs != "EPSG:4326": | |
| gdf = gdf.to_crs("EPSG:4326") | |
| # Save | |
| output_file = OUTPUT_DIR / "data.geojson" | |
| gdf.to_file(output_file, driver="GeoJSON") | |
| logger.info(f"Saved {len(gdf)} features to {output_file}") | |
| if __name__ == "__main__": | |
| download_data() | |
| ``` | |
| ### API Download Template | |
| ```python | |
import json
import requests

# Example endpoint and output path for illustration
API_URL = "https://example.com/arcgis/rest/services/layer/0/query"
OUTPUT_FILE = "data.geojson"

def download_from_api():
    """Download Panama features from a REST API as GeoJSON."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params, timeout=60)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    with open(OUTPUT_FILE, "w") as f:
        json.dump(geojson, f)
| ``` | |
| --- | |
| ## Data Processing Best Practices | |
| ### 1. Coordinate System | |
| Always save in WGS84 (EPSG:4326): | |
| ```python | |
if gdf.crs and gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
| ``` | |
| ### 2. Column Naming | |
| Use lowercase with underscores: | |
| ```python | |
| gdf.columns = gdf.columns.str.lower().str.replace(' ', '_') | |
| ``` | |
| ### 3. Null Handling | |
| Remove or fill nulls: | |
| ```python | |
| gdf['name'] = gdf['name'].fillna('Unknown') | |
gdf = gdf[gdf.geometry.notna()]  # drop features with no geometry, whatever the column is named
| ``` | |
| ### 4. Simplify Geometry (if needed) | |
| For large datasets: | |
| ```python | |
gdf.geometry = gdf.geometry.simplify(tolerance=0.001)  # tolerance in degrees for EPSG:4326
| ``` | |
| ### 5. Validate GeoJSON | |
| ```python | |
| import json | |
| # Check valid JSON | |
| with open(output_file) as f: | |
| data = json.load(f) | |
| assert data['type'] == 'FeatureCollection' | |
| assert 'features' in data | |
| ``` | |
| --- | |
| ## Data Sources Reference | |
| | Source | Script | Frequency | Size | | |
| |--------|--------|-----------|------| | |
| | Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100MB | | |
| | HDX | `download_hdx_panama.py` | Annual | ~5MB | | |
| | World Bank | `download_worldbank.py` | Annual | ~1MB | | |
| | STRI | `download_stri_data.py` | As updated | ~50MB | | |
| | Kontur | Manual | Quarterly | ~200MB | | |
| --- | |
| ## Next Steps | |
| - **Dataset Sources**: [../data/DATASET_SOURCES.md](../data/DATASET_SOURCES.md) | |
| - **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md) | |