# Data Ingestion Scripts

Documentation for scripts that download and process geographic datasets.

---

## Overview

Data ingestion scripts in `backend/scripts/` automate downloading and processing of various data sources:

- OpenStreetMap via Geofabrik
- Humanitarian Data Exchange (HDX)
- World Bank Open Data
- STRI GIS Portal
- Kontur Population
- Global datasets

---

## Scripts Reference

### 1. download_geofabrik.py

Downloads OpenStreetMap data for Panama from Geofabrik.

**Usage**:
```bash
cd backend
python scripts/download_geofabrik.py
```

**What it downloads**:
- Roads network
- Buildings
- POI (points of interest)
- Natural features

**Output**: GeoJSON files in `backend/data/osm/`

**Schedule**: Run monthly for updates

---

### 2. download_hdx_panama.py

Downloads administrative boundaries from the Humanitarian Data Exchange.

**Usage**:
```bash
python scripts/download_hdx_panama.py
```

**Downloads**:
- Level 1: Provinces (10 features)
- Level 2: Districts (81 features)
- Level 3: Corregimientos (679 features)

**Output**: `backend/data/hdx/pan_admin{1,2,3}_2021.geojson`

**Schedule**: Annual updates

---

### 3. download_worldbank.py

Downloads World Bank development indicators.

**Usage**:
```bash
python scripts/download_worldbank.py
```

**Indicators**:
- GDP per capita
- Life expectancy
- Access to electricity
- Internet usage
- And more...

**Output**: `backend/data/worldbank/indicators.geojson`

**Processing**: Joins indicator data with country geometries

---

### 4. download_stri_data.py

Downloads datasets from the STRI GIS Portal.

**Usage**:
```bash
python scripts/download_stri_data.py
```

**Downloads**:
- Protected areas
- Forest cover
- Environmental datasets

**Output**: `backend/data/stri/*.geojson`

**Note**: Uses the ArcGIS REST API

---

### 5. stri_catalog_scraper.py

Discovers and catalogs all available STRI datasets.

**Usage**:
```bash
python scripts/stri_catalog_scraper.py
```

**Output**: JSON catalog of 100+ STRI datasets with metadata

**Features**:
- Priority scoring
- Temporal dataset detection
- REST endpoint generation

---

### 6. create_province_layer.py

Creates the province-level socioeconomic data layer.

**Usage**:
```bash
python scripts/create_province_layer.py
```

**Combines**:
- INEC census data
- MPI (poverty index)
- Administrative geometries

**Output**: `backend/data/socioeconomic/province_socioeconomic.geojson`

---

### 7. download_global_datasets.py

Downloads global reference datasets.

**Usage**:
```bash
python scripts/download_global_datasets.py
```

**Downloads**:
- Natural Earth country boundaries
- Global admin boundaries
- Reference layers

**Output**: `backend/data/global/*.geojson`

---

### 8. register_global_datasets.py

Registers global datasets in `catalog.json`.

**Usage**:
```bash
python scripts/register_global_datasets.py
```

**Action**: Adds dataset entries to `backend/data/catalog.json`

---

## Adding New Data Sources

### Step-by-Step Guide

#### 1. Create Download Script

Create `backend/scripts/download_mycustom_data.py`:

```python
import requests
from pathlib import Path

def download_custom_data():
    """Download custom dataset."""
    # Define output path
    output_dir = Path(__file__).parent.parent / "data" / "custom"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Download data
    url = "https://example.com/data.geojson"
    response = requests.get(url)
    response.raise_for_status()

    # Save as GeoJSON
    output_file = output_dir / "custom_data.geojson"
    with open(output_file, 'w') as f:
        f.write(response.text)

    print(f"Downloaded to {output_file}")

if __name__ == "__main__":
    download_custom_data()
```
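Before registering the dataset, it helps to confirm the file loads and looks reasonable. This is a minimal sanity-check sketch, assuming the output path from the example above and that you run it from the repository root:

```python
# Quick sanity check of the downloaded file (path assumed from the example above).
import geopandas as gpd

gdf = gpd.read_file("backend/data/custom/custom_data.geojson")
print(gdf.crs)               # ideally EPSG:4326 (see best practices below)
print(len(gdf), "features")
print(gdf.columns.tolist())  # columns to list in the catalog schema
```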
#### 2. Update Catalog

Add an entry to `backend/data/catalog.json`:

```json
{
  "custom_data": {
    "path": "custom/custom_data.geojson",
    "description": "Short description for display",
    "semantic_description": "Detailed description mentioning key concepts that help AI discovery. Include what the data represents, coverage area, and typical use cases.",
    "categories": ["infrastructure"],
    "tags": ["roads", "transport", "panama"],
    "schema": {
      "columns": ["name", "type", "length_km", "geom"],
      "geometry_type": "LineString"
    }
  }
}
```

**Key Fields**:
- `path`: Relative path from `backend/data/`
- `description`: Human-readable short description
- `semantic_description`: Detailed description for AI semantic search
- `categories`: Categories used to classify the dataset
- `tags`: Keywords for filtering
- `schema`: Optional column and geometry info

#### 3. Regenerate Embeddings

```bash
cd backend
rm data/embeddings.npy
python -c "from backend.core.semantic_search import get_semantic_search; get_semantic_search()"
```

This generates vector embeddings for the new dataset description.

#### 4. Test Discovery

```bash
# Start backend
uvicorn backend.main:app --reload

# Test query
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"show me [your new data]","history":[]}'
```

Verify the AI can discover and query your dataset.

---

## Script Templates

### Basic Download Template

```python
#!/usr/bin/env python3
"""
Download script for [DATA SOURCE NAME]
"""
import geopandas as gpd
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
DATA_URL = "https://example.com/data.geojson"
OUTPUT_DIR = Path(__file__).parent.parent / "data" / "category"

def download_data():
    """Download and process data."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    logger.info(f"Downloading from {DATA_URL}")

    # Download
    gdf = gpd.read_file(DATA_URL)

    # Process (example: project to WGS84)
    if gdf.crs and gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    # Save
    output_file = OUTPUT_DIR / "data.geojson"
    gdf.to_file(output_file, driver="GeoJSON")
    logger.info(f"Saved {len(gdf)} features to {output_file}")

if __name__ == "__main__":
    download_data()
```

### API Download Template

```python
import json
import requests

# Placeholder endpoint and output path; replace with real values
API_URL = "https://example.com/arcgis/rest/services/Layer/0/query"
OUTPUT_FILE = "api_data.geojson"

def download_from_api():
    """Download from a REST API."""
    # Query API
    params = {
        "where": "country='Panama'",
        "outFields": "*",
        "f": "geojson"
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()

    # Parse and save
    geojson = response.json()
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(geojson, f)
```

---

## Data Processing Best Practices

### 1. Coordinate System

Always save in WGS84 (EPSG:4326):

```python
if gdf.crs != "EPSG:4326":
    gdf = gdf.to_crs("EPSG:4326")
```

### 2. Column Naming

Use lowercase with underscores:

```python
gdf.columns = gdf.columns.str.lower().str.replace(' ', '_')
```

### 3. Null Handling

Remove or fill nulls:

```python
gdf['name'] = gdf['name'].fillna('Unknown')
gdf = gdf.dropna(subset=['geom'])
```

### 4. Simplify Geometry (if needed)

For large datasets:

```python
gdf['geom'] = gdf['geom'].simplify(tolerance=0.001)
```
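The `tolerance` is in the units of the layer's CRS, so with EPSG:4326 it is in decimal degrees (0.001 degrees is roughly 100 m on the ground). One way to judge whether simplification is worth it is to compare the serialized size before and after; this is a small sketch using a hypothetical helper name and assuming `gdf` is already loaded:

```python
import geopandas as gpd

def simplified_size_ratio(gdf: gpd.GeoDataFrame, tolerance: float = 0.001) -> float:
    """Ratio of simplified to original GeoJSON size (hypothetical helper)."""
    original_size = len(gdf.to_json())
    simplified = gdf.copy()
    simplified.geometry = simplified.geometry.simplify(tolerance=tolerance)
    return len(simplified.to_json()) / original_size

# e.g. a result of 0.35 means the simplified layer serializes to ~35% of the original size
# print(simplified_size_ratio(gdf))
```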
### 5. Validate GeoJSON

```python
import json

# Check valid JSON
with open(output_file) as f:
    data = json.load(f)

assert data['type'] == 'FeatureCollection'
assert 'features' in data
```

---

## Data Sources Reference

| Source | Script | Update frequency | Approx. size |
|--------|--------|------------------|--------------|
| Geofabrik (OSM) | `download_geofabrik.py` | Monthly | ~100 MB |
| HDX | `download_hdx_panama.py` | Annual | ~5 MB |
| World Bank | `download_worldbank.py` | Annual | ~1 MB |
| STRI | `download_stri_data.py` | As updated | ~50 MB |
| Kontur | Manual | Quarterly | ~200 MB |

---

## Next Steps

- **Dataset Sources**: [../data/DATASET_SOURCES.md](../data/DATASET_SOURCES.md)
- **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md)