# Data Ingestion Scripts

Documentation for scripts that download and process geographic datasets.

---

## Overview

Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources:

- OpenStreetMap via Geofabrik
- Humanitarian Data Exchange (HDX)
- World Bank Open Data
- STRI GIS Portal
- Kontur Population
- Global datasets

---

## Scripts Reference

### 1. download_geofabrik.py

Downloads OpenStreetMap data for Panama from Geofabrik.

**Usage**:
```bash
cd backend
python scripts/download_geofabrik.py
```

**What it downloads**:
- Roads network
- Buildings
- POI (points of interest)
- Natural features

**Output**: GeoJSON files in `backend/data/osm/`

**Schedule**: Run monthly for updates

---

### 2. download_hdx_panama.py

Downloads administrative boundaries from the Humanitarian Data Exchange.

**Usage**:
```bash
python scripts/download_hdx_panama.py
```

**Downloads**:
- Level 1: Provinces (10 features)
- Level 2: Districts (81 features)
- Level 3: Corregimientos (679 features)

**Output**: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson`

**Schedule**: Annual updates

---

### 3. download_worldbank.py

Downloads World Bank development indicators.

**Usage**:
```bash
python scripts/download_worldbank.py
```

**Indicators**:
- GDP per capita
- Life expectancy
- Access to electricity
- Internet usage
- And more...

**Output**: `backend/data/worldbank/indicators.geojson`

**Processing**: Joins indicator data with country geometries

---

### 4. download_stri_data.py

Downloads datasets from the STRI GIS Portal.

**Usage**:
```bash
python scripts/download_stri_data.py
```

**Downloads**:
- Protected areas
- Forest cover
- Environmental datasets

**Output**: `backend/data/stri/*.geojson`

**Note**: Uses the ArcGIS REST API

---

### 5. stri_catalog_scraper.py

Discovers and catalogs all available STRI datasets.

**Usage**:
```bash
python scripts/stri_catalog_scraper.py
```

**Output**: JSON catalog of 100+ STRI datasets with metadata

**Features**:
- Priority scoring
- Temporal dataset detection
- REST endpoint generation

---

### 6. create_province_layer.py

Creates the province-level socioeconomic data layer.

**Usage**:
```bash
python scripts/create_province_layer.py
```

**Combines**:
- INEC census data
- MPI (poverty index)
- Administrative geometries

**Output**: `backend/data/socioeconomic/province_socioeconomic.geojson`

---

### 7. download_global_datasets.py

Downloads global reference datasets.

**Usage**:
```bash
python scripts/download_global_datasets.py
```

**Downloads**:
- Natural Earth country boundaries
- Global admin boundaries
- Reference layers

**Output**: `backend/data/global/*.geojson`

---

### 8. register_global_datasets.py

Registers global datasets in `catalog.json`.

**Usage**:
```bash
python scripts/register_global_datasets.py
```

**Action**: Adds dataset entries to `backend/data/catalog.json`

---

## Adding New Data Sources

### Step-by-Step Guide

#### 1. Create Download Script

Create `backend/scripts/download_mycustom_data.py`:

```python
import requests
from pathlib import Path

def download_custom_data():
    """Download custom dataset."""
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url)
    response.raise_for_status()

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)

    print(f"Downloaded to {output_file}")

if __name__ == "__main__":
    download_custom_data()
```
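Before registering the dataset, it helps to confirm the file loads and looks reasonable. This is a minimal sanity-check sketch, assuming the output path from the example above and that you run it from the repository root:

```python
# Quick sanity check of the downloaded file (path assumed from the example above).
import geopandas as gpd

gdf = gpd.read_file("backend/data/custom/custom_data.geojson")
print(gdf.crs)               # ideally EPSG:4326 (see best practices below)
print(len(gdf), "features")
print(gdf.columns.tolist())  # columns to list in the catalog schema
```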
#### 2. Update Catalog

Add an entry to `backend/data/catalog.json`:

```json
{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what the data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}
```

**Key Fields**:
- `path`: Relative path from `backend/data/`
- `description`: Human-readable short description
- `semantic_description`: Detailed description for AI semantic search
- `categories`: Categories used to classify the dataset
- `tags`: Keywords for filtering
- `schema`: Optional column and geometry info

#### 3. Regenerate Embeddings

```bash
cd backend
rm data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"
```

This generates vector embeddings for the new dataset description.

#### 4. Test Discovery

```bash
# Start backend
uvicorn backend.main:app --reload

# Test query
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"show me [your new data]","history":[]}'
```

Verify the AI can discover and query your dataset.

---

## Script Templates

### Basic Download Template

```python
#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""
import geopandas as gpd
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"

def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    logger.info(f"Downloading from {DATA_URL}")

    # Download
    gdf = gpd.read_file(DATA_URL)

    # Process (example: project to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    logger.info(f"Saved {len(gdf)} features to {output_file}")

if __name__ == "__main__":
    download_data()
```

### API Download Template

```python
import json
import requests

# Placeholder endpoint and output path; replace with real values
API_URL = "https://example.com/arcgis/rest/services/Layer/0/query"
OUTPUT_FILE = "api_data.geojson"

def download_from_api():
    """Download from a REST API."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(geojson, f)
```

---

## Data Processing Best Practices

### 1. Coordinate System

Always save in WGS84 (EPSG:4326):

```python
if gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
```

### 2. Column Naming

Use lowercase with underscores:

```python
gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')
```

### 3. Null Handling

Remove or fill nulls:

```python
gdf['name'] = gdf['name'].fillna('Unknown')
gdf = gdf.dropna(subset=['geom'])
```

### 4. Simplify Geometry (if needed)

For large datasets:

```python
gdf['geom'] = gdf['geom'].simplify(tolerance=0.001)
```
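The `tolerance` is in the units of the layer's CRS, so with EPSG:4326 it is in decimal degrees (0.001 degrees is roughly 100 m on the ground). One way to judge whether simplification is worth it is to compare the serialized size before and after; this is a small sketch using a hypothetical helper name and assuming `gdf` is already loaded:

```python
import geopandas as gpd

def simplified_size_ratio(gdf: gpd.GeoDataFrame, tolerance: float = 0.001) -> float:
    """Ratio of simplified to original GeoJSON size (hypothetical helper)."""
    original_size = len(gdf.to_json())
    simplified = gdf.copy()
    simplified.geometry = simplified.geometry.simplify(tolerance=tolerance)
    return len(simplified.to_json()) / original_size

# e.g. a result of 0.35 means the simplified layer serializes to ~35% of the original size
# print(simplified_size_ratio(gdf))
```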
### 5. Validate GeoJSON

```python
import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)

assert data['type'] == 'FeatureCollection'
assert 'features' in data
```

---

## Data Sources Reference

| Source | Script | Update frequency | Approx. size |
|--------|--------|------------------|--------------|
| Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100 MB |
| HDX | `download_hdx_panama.py` | Annual | ~5 MB |
| World Bank | `download_worldbank.py` | Annual | ~1 MB |
| STRI | `download_stri_data.py` | As updated | ~50 MB |
| Kontur | Manual | Quarterly | ~200 MB |

---

## Next Steps

- **Dataset Sources**: [../data/DATASET_SOURCES.md](../data/DATASET_SOURCES.md)
- **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md)