# Data Ingestion Scripts
Documentation for scripts that download and process geographic datasets.
---
## Overview
Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources:
- OpenStreetMap via Geofabrik
- Humanitarian Data Exchange (HDX)
- World Bank Open Data
- STRI GIS Portal
- Kontur Population
- Global datasets
---
## Scripts Reference
### 1. download_geofabrik.py
Downloads OpenStreetMap data for Panama from Geofabrik.
**Usage**:
```bash
cd backend
python scripts/download_geofabrik.py
```
**What it downloads**:
- Road network
- Buildings
- Points of interest (POIs)
- Natural features
**Output**: GeoJSON files in `backend/data/osm/`
**Schedule**: Run monthly for updates
---
### 2. download_hdx_panama.py
Downloads administrative boundaries from Humanitarian Data Exchange.
**Usage**:
```bash
python scripts/download_hdx_panama.py
```
**Downloads**:
- Level 1: Provinces (10 features)
- Level 2: Districts (81 features)
- Level 3: Corregimientos (679 features)
**Output**: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson`
**Schedule**: Annual updates
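A quick way to sanity-check the download is to confirm the feature counts listed above (paths as given under **Output**, run from the repo root):

```python
import geopandas as gpd

# Expected feature counts per admin level (see Downloads above)
expected = {1: 10, 2: 81, 3: 679}
for level, count in expected.items():
    gdf = gpd.read_file(f"backend/data/hdx/pan_admin{level}_2021.geojson")
    assert len(gdf) == count, f"Level {level}: got {len(gdf)}, expected {count}"
```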
---
### 3. download_worldbank.py
Downloads World Bank development indicators.
**Usage**:
```bash
python scripts/download_worldbank.py
```
**Indicators**:
- GDP per capita
- Life expectancy
- Access to electricity
- Internet usage
- And more...
**Output**: `backend/data/worldbank/indicators.geojson`
**Processing**: Joins indicator data with country geometries
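The join itself is a standard attribute merge. A minimal sketch of the pattern (the file paths and the `iso_a3`/`country_code` column names are illustrative, not the script's actual schema):

```python
import geopandas as gpd
import pandas as pd

# Country geometries plus a table of World Bank indicator values,
# merged on a shared ISO-3 country code
countries = gpd.read_file("backend/data/global/countries.geojson")
indicators = pd.read_csv("indicators.csv")
joined = countries.merge(indicators, left_on="iso_a3", right_on="country_code")
joined.to_file("backend/data/worldbank/indicators.geojson", driver="GeoJSON")
```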
---
### 4. download_stri_data.py
Downloads datasets from the STRI GIS Portal.
**Usage**:
```bash
python scripts/download_stri_data.py
```
**Downloads**:
- Protected areas
- Forest cover
- Environmental datasets
**Output**: `backend/data/stri/*.geojson`
**Note**: Uses the ArcGIS REST API
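ArcGIS REST layers can return GeoJSON directly from their `query` endpoint. A minimal sketch of the request shape (the URL is a placeholder, not the portal's actual address):

```python
import requests

url = "https://example.org/arcgis/rest/services/ProtectedAreas/MapServer/0/query"
params = {"where": "1=1", "outFields": "*", "f": "geojson"}
response = requests.get(url, params=params)
response.raise_for_status()
geojson = response.json()
```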
---
### 5. stri_catalog_scraper.py
Discovers and catalogs all available STRI datasets.
**Usage**:
```bash
python scripts/stri_catalog_scraper.py
```
**Output**: JSON catalog of 100+ STRI datasets with metadata
**Features**:
- Priority scoring
- Temporal dataset detection
- REST endpoint generation
---
### 6. create_province_layer.py
Creates province-level socioeconomic data layer.
**Usage**:
```bash
python scripts/create_province_layer.py
```
**Combines**:
- INEC Census data
- MPI (poverty index)
- Administrative geometries
**Output**: `backend/data/socioeconomic/province_socioeconomic.geojson`
---
### 7. download_global_datasets.py
Downloads global reference datasets.
**Usage**:
```bash
python scripts/download_global_datasets.py
```
**Downloads**:
- Natural Earth country boundaries
- Global admin boundaries
- Reference layers
**Output**: `backend/data/global/*.geojson`
---
### 8. register_global_datasets.py
Registers global datasets in `catalog.json`.
**Usage**:
```bash
python scripts/register_global_datasets.py
```
**Action**: Adds dataset entries to `backend/data/catalog.json`
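Registration boils down to a read-modify-write of the catalog. A simplified sketch (the `ne_countries` entry is illustrative, not the script's actual output):

```python
import json
from pathlib import Path

catalog_path = Path("backend/data/catalog.json")
catalog = json.loads(catalog_path.read_text())

# Add (or overwrite) an entry keyed by dataset name
catalog["ne_countries"] = {
    "path": "global/ne_countries.geojson",
    "description": "Natural Earth country boundaries",
}
catalog_path.write_text(json.dumps(catalog, indent=2))
```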
---
## Adding New Data Sources
### Step-by-Step Guide
#### 1. Create Download Script
Create `backend/scripts/download_mycustom_data.py`:
```python
import requests
from pathlib import Path


def download_custom_data():
    """Download custom dataset."""
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url)
    response.raise_for_status()

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)

    print(f"Downloaded to {output_file}")


if __name__ == "__main__":
    download_custom_data()
```
#### 2. Update Catalog
Add entry to `backend/data/catalog.json`:
```json
{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what the data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}
```
**Key Fields**:
- `path`: Relative path from `backend/data/`
- `description`: Human-readable short description
- `semantic_description`: Detailed description for AI semantic search
- `categories`: Categories used to classify the dataset
- `tags`: Keywords for filtering
- `schema`: Optional column and geometry info
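Before regenerating embeddings, it is worth checking that every entry carries the required fields. A minimal sketch, treating everything above except `schema` as required:

```python
import json

REQUIRED = {"path", "description", "semantic_description", "categories", "tags"}

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

for name, entry in catalog.items():
    missing = REQUIRED - entry.keys()
    assert not missing, f"{name} is missing fields: {missing}"
```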
#### 3. Regenerate Embeddings
```bash
rm -f backend/data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"
```
This generates vector embeddings for the new dataset description.
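Conceptually this is a sentence-embedding pass over each `semantic_description` in the catalog. A sketch of the general pattern (the model name is illustrative; the real logic lives in `backend.core.semantic_search`):

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer

with open("backend/data/catalog.json") as f:
    catalog = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
texts = [entry["semantic_description"] for entry in catalog.values()]
np.save("backend/data/embeddings.npy", model.encode(texts))
```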
#### 4. Test Discovery
```bash
# Start backend
uvicorn backend.main:app --reload
# Test query
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message":"show me [your new data]","history":[]}'
```
Verify the AI can discover and query your dataset.
---
## Script Templates
### Basic Download Template
```python
#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""
import logging
from pathlib import Path

import geopandas as gpd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"


def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    logger.info(f"Downloading from {DATA_URL}")

    # Download
    gdf = gpd.read_file(DATA_URL)

    # Process (example: reproject to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    logger.info(f"Saved {len(gdf)} features to {output_file}")


if __name__ == "__main__":
    download_data()
```
### API Download Template
```python
import json

import requests

# Placeholder endpoint; point this at the target service's query URL
API_URL = "https://example.com/arcgis/rest/services/layer/0/query"


def download_from_api(output_file):
    """Download from a REST API and save the GeoJSON response."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    with open(output_file, 'w') as f:
        json.dump(geojson, f)
```
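With `API_URL` pointed at a real service, a call is just (output path illustrative):

```python
download_from_api("backend/data/custom/api_data.geojson")
```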
---
## Data Processing Best Practices
### 1. Coordinate System
Always save in WGS84 (EPSG:4326):
```python
if gdf.crs and gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
```
### 2. Column Naming
Use lowercase with underscores:
```python
gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')
```
### 3. Null Handling
Remove or fill nulls:
```python
gdf['name'] = gdf['name'].fillna('Unknown')
# Drop rows with missing geometry (use the active geometry column)
gdf = gdf.dropna(subset=[gdf.geometry.name])
```
### 4. Simplify Geometry (if needed)
For large datasets:
```python
# Tolerance is in degrees for WGS84 data (0.001 is roughly 100 m)
gdf.geometry = gdf.geometry.simplify(tolerance=0.001)
```
### 5. Validate GeoJSON
```python
import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)

assert data['type'] == 'FeatureCollection'
assert 'features' in data
```
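Beyond the structural check, geometry validity can be verified (and often repaired) with GeoPandas. A minimal sketch, reusing `output_file` from above:

```python
import geopandas as gpd

gdf = gpd.read_file(output_file)
invalid = ~gdf.is_valid
if invalid.any():
    # buffer(0) is a common trick for repairing self-intersections
    gdf.loc[invalid, gdf.geometry.name] = gdf.loc[invalid].geometry.buffer(0)
```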
---
## Data Sources Reference
| Source | Script | Frequency | Size |
|--------|--------|-----------|------|
| Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100MB |
| HDX | `download_hdx_panama.py` | Annual | ~5MB |
| World Bank | `download_worldbank.py` | Annual | ~1MB |
| STRI | `download_stri_data.py` | As updated | ~50MB |
| Kontur | Manual | Quarterly | ~200MB |
---
## Next Steps
- **Dataset Sources**: [../data/DATASET_SOURCES.md](../data/DATASET_SOURCES.md)
- **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md)