# Data Ingestion Scripts
Documentation for scripts that download and process geographic datasets.
---
## Overview
Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources:
- OpenStreetMap via Geofabrik
- Humanitarian Data Exchange (HDX)
- World Bank Open Data
- STRI GIS Portal
- Kontur Population
- Global datasets
---
## Scripts Reference
### 1. download_geofabrik.py
Downloads OpenStreetMap data for Panama from Geofabrik.
**Usage**:
```bash
cd backend
python scripts/download_geofabrik.py
```
**What it downloads**:
- Road network
- Buildings
- Points of interest (POIs)
- Natural features
**Output**: GeoJSON files in `backend/data/osm/`
**Schedule**: Run monthly for updates
---
### 2. download_hdx_panama.py
Downloads administrative boundaries from Humanitarian Data Exchange.
**Usage**:
```bash
python scripts/download_hdx_panama.py
```
**Downloads**:
- Level 1: Provinces (10 features)
- Level 2: Districts (81 features)
- Level 3: Corregimientos (679 features)
**Output**: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson`
**Schedule**: Annual updates
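A quick way to sanity-check the download is to confirm the feature counts listed above (paths as given under **Output**, run from the repo root):

```python
import geopandas as gpd

# Expected feature counts per admin level (see Downloads above)
expected = {1: 10, 2: 81, 3: 679}
for level, count in expected.items():
    gdf = gpd.read_file(f"backend/data/hdx/pan_admin{level}_2021.geojson")
    assert len(gdf) == count, f"Level {level}: got {len(gdf)}, expected {count}"
```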
---
### 3. download_worldbank.py
Downloads World Bank development indicators.
**Usage**:
```bash
python scripts/download_worldbank.py
```
**Indicators**:
- GDP per capita
- Life expectancy
- Access to electricity
- Internet usage
- And more...
**Output**: `backend/data/worldbank/indicators.geojson`
**Processing**: Joins indicator data with country geometries
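The join itself is a standard attribute merge. A minimal sketch of the pattern (the file paths and the `iso_a3`/`country_code` column names are illustrative, not the script's actual schema):

```python
import geopandas as gpd
import pandas as pd

# Country geometries plus a table of World Bank indicator values,
# merged on a shared ISO-3 country code
countries = gpd.read_file("backend/data/global/countries.geojson")
indicators = pd.read_csv("indicators.csv")
joined = countries.merge(indicators, left_on="iso_a3", right_on="country_code")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")
```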
---
### 4. download_stri_data.py
Downloads datasets from the STRI GIS Portal.
**Usage**:
```bash
python scripts/download_stri_data.py
```
**Downloads**:
- Protected areas
- Forest cover
- Environmental datasets
**Output**: `backend/data/stri/*.geojson`
**Note**: Uses the ArcGIS REST API
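ArcGIS REST layers can return GeoJSON directly from their `query` endpoint. A minimal sketch of the request shape (the URL is a placeholder, not the portal's actual address):

```python
import requests

url = "https://example.org/arcgis/rest/services/ProtectedAreas/MapServer/0/query"
params = {"where": "1=1", "outFields": "*", "f": "geojson"}
response = requests.get(url, params=params)
response.raise_for_status()
geojson = response.json()
```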
---
### 5. stri_catalog_scraper.py
Discovers and catalogs all available STRI datasets.
**Usage**:
```bash
python scripts/stri_catalog_scraper.py
```
**Output**: JSON catalog of 100+ STRI datasets with metadata
**Features**:
- Priority scoring
- Temporal dataset detection
- REST endpoint generation
---
### 6. create_province_layer.py
Creates province-level socioeconomic data layer.
**Usage**:
```bash
python scripts/create_province_layer.py
```
**Combines**:
- INEC Census data
- MPI (poverty index)
- Administrative geometries
**Output**: `backend/data/socioeconomic/province_socioeconomic.geojson`
---
### 7. download_global_datasets.py
Downloads global reference datasets.
**Usage**:
```bash
python scripts/download_global_datasets.py
```
**Downloads**:
- Natural Earth country boundaries
- Global admin boundaries
- Reference layers
**Output**: `backend/data/global/*.geojson`
---
### 8. register_global_datasets.py
Registers global datasets in `catalog.json`.
**Usage**:
```bash
python scripts/register_global_datasets.py
```
**Action**: Adds dataset entries to `backend/data/catalog.json`
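Registration boils down to a read-modify-write of the catalog. A simplified sketch (the `ne_countries` entry is illustrative, not the script's actual output):

```python
import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

# Add (or overwrite) an entry keyed by dataset name
catalog["ne_countries"] = {
    "path": "global/ne_countries.geojson",
    "description": "Natural Earth country boundaries",
}
catalog_path.write_text(json.dumps(catalog, indent=2))
```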
---
## Adding New Data Sources
### Step-by-Step Guide
#### 1. Create Download Script
Create `backend/scripts/download_mycustom_data.py`:
```python
import requests
from pathlib import Path


def download_custom_data():
    """Download custom dataset."""
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url)
    response.raise_for_status()

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)

    print(f"Downloaded to {output_file}")


if __name__ == "__main__":
    download_custom_data()
```
#### 2. Update Catalog
Add entry to `backend/data/catalog.json`:
```json
{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what the data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}
```
**Key Fields**:
- `path`: Relative path from `backend/data/`
- `description`: Human-readable short description
- `semantic_description`: Detailed description for AI semantic search
- `categories`: Categories used to classify the dataset
- `tags`: Keywords for filtering
- `schema`: Optional column and geometry info
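Before regenerating embeddings, it is worth checking that every entry carries the required fields. A minimal sketch, treating everything above except `schema` as required:

```python
import json

REQUIRED = {"path", "description", "semantic_description", "categories", "tags"}

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

for name, entry in catalog.items():
    missing = REQUIRED - entry.keys()
    assert not missing, f"{name} is missing fields: {missing}"
```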
#### 3. Regenerate Embeddings
```bash
rm -f backend/data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"
```
This generates vector embeddings for the new dataset description.
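Conceptually this is a sentence-embedding pass over each `semantic_description` in the catalog. A sketch of the general pattern (the model name is illustrative; the real logic lives in `backend.core.semantic_search`):

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
texts = [entry["semantic_description"] for entry in catalog.values()]
np.save("backend/data/embeddings.npy", model.encode(texts))
```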
#### 4. Test Discovery
```bash
# Start backend
uvicorn backend.main:app --reload
# Test query
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message":"show me [your new data]","history":[]}'
```
Verify the AI can discover and query your dataset.
---
## Script Templates
### Basic Download Template
```python
#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""
import logging
from pathlib import Path

import geopandas as gpd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"


def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    logger.info(f"Downloading from {DATA_URL}")

    # Download
    gdf = gpd.read_file(DATA_URL)

    # Process (example: reproject to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    logger.info(f"Saved {len(gdf)} features to {output_file}")


if __name__ == "__main__":
    download_data()
```
### API Download Template
```python
import json

import requests

# Placeholder endpoint; point this at the target service's query URL
API_URL = "https://example.com/arcgis/rest/services/layer/0/query"


def download_from_api(output_file):
    """Download from a REST API and save the GeoJSON response."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    with open(output_file, 'w') as f:
        json.dump(geojson, f)
```
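With `API_URL` pointed at a real service, a call is just (output path illustrative):

```python
download_from_api("backend/data/custom/api_data.geojson")
```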
---
## Data Processing Best Practices
### 1. Coordinate System
Always save in WGS84 (EPSG:4326):
```python
if gdf.crs and gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
```
### 2. Column Naming
Use lowercase with underscores:
```python
gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')
```
### 3. Null Handling
Remove or fill nulls:
```python
gdf['name'] = gdf['name'].fillna('Unknown')
# Drop rows with missing geometry (use the active geometry column)
gdf = gdf.dropna(subset=[gdf.geometry.name])
```
### 4. Simplify Geometry (if needed)
For large datasets:
```python
# Tolerance is in degrees for WGS84 data (0.001 is roughly 100 m)
gdf.geometry = gdf.geometry.simplify(tolerance=0.001)
```
### 5. Validate GeoJSON
```python
import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)

assert data['type'] == 'FeatureCollection'
assert 'features' in data
```
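Beyond the structural check, geometry validity can be verified (and often repaired) with GeoPandas. A minimal sketch, reusing `output_file` from above:

```python
import geopandas as gpd

gdf = gpd.read_file(output_file)
invalid = ~gdf.is_valid
if invalid.any():
    # buffer(0) is a common trick for repairing self-intersections
    gdf.loc[invalid, gdf.geometry.name] = gdf.loc[invalid].geometry.buffer(0)
```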
---
## Data Sources Reference
| Source | Script | Frequency | Size |
|--------|--------|-----------|------|
| Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100MB |
| HDX | `download_hdx_panama.py` | Annual | ~5MB |
| World Bank | `download_worldbank.py` | Annual | ~1MB |
| STRI | `download_stri_data.py` | As updated | ~50MB |
| Kontur | Manual | Quarterly | ~200MB |
---
## Next Steps
- **Dataset Sources**: [../data/DATASET_SOURCES.md](../data/DATASET_SOURCES.md)
- **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md)