---
title: GeoQuery
emoji: 🌍
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# GeoQuery 🌍🤖

**Territorial Intelligence Platform** - Natural language interface for geospatial data analysis powered by LLMs and DuckDB Spatial.

![Status](https://img.shields.io/badge/Status-Active-success) ![Python](https://img.shields.io/badge/Python-3.11+-blue) ![Next.js](https://img.shields.io/badge/Next.js-15-black) ![License](https://img.shields.io/badge/License-MIT-green)

---

## ✨ What is GeoQuery?

GeoQuery transforms geographic data analysis by combining **Large Language Models** with **spatial databases**. Simply ask questions in natural language and get instant maps, charts, and insights.

**Example**: *"Show me hospitals in Panama City"* → Interactive map with 45 hospital locations, automatically styled with 🏥 icons.

### Key Capabilities

- 🗣️ **Conversational Queries** - Natural language instead of SQL or GIS interfaces
- 🗺️ **Auto-Visualization** - Smart choropleth maps, point markers, and heatmaps
- 📊 **Dynamic Charts** - Automatic bar, pie, and line chart generation
- 🔍 **Semantic Discovery** - Finds relevant datasets from 100+ options using AI embeddings
- 🧩 **Multi-Step Analysis** - Complex queries automatically decomposed and executed
- 💡 **Thinking Transparency** - See the LLM's reasoning process in real-time
- 🎨 **Custom Point Styles** - Icon markers for POI, circle points for large datasets

---

## 🎬 Quick Demo

### Try These Queries

| Query | What You Get |
|-------|--------------|
| "Show me all provinces colored by area" | Choropleth map with size-based gradient |
| "Where are the universities?" | Point map with 🎓 icons |
| "Compare hospital count vs school count by province" | Multi-step analysis with side-by-side bar charts |
| "Show intersections in David as circle points" | 1,288 traffic intersections as simple colored circles |
| "Population density in Veraguas" | H3 hexagon heatmap (33K cells) |

---

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    Frontend (Next.js)                    │
│   Chat Interface  │  Leaflet Maps  │  Data Explorer      │
└────────────────────────┬─────────────────────────────────┘
                         │ (SSE Streaming)
┌────────────────────────┴─────────────────────────────────┐
│                  Backend (FastAPI)                       │
│  Intent Detection → Semantic Search → SQL Generation     │
│         ↓                ↓                  ↓            │
│    Gemini LLM    DataCatalog (Embeddings) DuckDB Spatial │
└──────────────────────────────────────────────────────────┘
```

The backend discovers datasets dynamically through semantic embeddings and queries them with LLM-generated spatial SQL.
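
As a rough illustration of embedding-based discovery, the sketch below ranks catalog entries by cosine similarity to a query vector. The three-dimensional vectors, the `rank_datasets` helper, and the `hypothetical_roads` entry are made up for illustration; only `panama_healthsites_geojson` and `pan_admin2` are names that appear in this README. The actual backend uses sentence-transformers embeddings and may rank results differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_datasets(query_vec, catalog, top_k=2):
    """Return the top_k dataset names, most similar first (hypothetical helper)."""
    scored = sorted(
        catalog.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Toy 3-dimensional vectors with made-up values; real embeddings are much longer.
catalog = {
    "panama_healthsites_geojson": [0.9, 0.1, 0.0],
    "pan_admin2": [0.1, 0.9, 0.1],
    "hypothetical_roads": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of "hospitals in Panama City"
best = rank_datasets(query, catalog)
print(best)
```

Cosine similarity ignores vector length, so a short dataset description and a long query can still match when they point in a similar direction.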

📖 **[Detailed Architecture](ARCHITECTURE.md)**

---

## 🚀 Quick Start

### Prerequisites

- **Python 3.11+**
- **Node.js 18+**
- **Google AI API Key** ([Get one free](https://aistudio.google.com/app/apikey))

### Installation

```bash
# 1. Clone repository
git clone https://github.com/GerardCB/GeoQuery.git
cd GeoQuery

# 2. Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .

# 3. Configure API key
export GEMINI_API_KEY="your-api-key-here"  # On Windows (PowerShell): $env:GEMINI_API_KEY="your-api-key-here"

# 4. Start backend
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

# 5. Frontend setup (new terminal, from the repository root)
cd frontend
npm install
npm run dev
```

### 🎉 Done!

Open **http://localhost:3000** and start querying!

📘 **[Detailed Setup Guide](SETUP.md)**

---

## 📂 Project Structure

```
GeoQuery/
├── backend/
│   ├── api/                    # FastAPI endpoints
│   │   └── endpoints/          # /chat, /catalog, /schema
│   ├── core/                   # Core services
│   │   ├── llm_gateway.py      # Gemini API integration
│   │   ├── geo_engine.py       # DuckDB Spatial wrapper
│   │   ├── semantic_search.py  # Embedding-based discovery
│   │   ├── data_catalog.py     # Dataset metadata management
│   │   ├── query_planner.py    # Multi-step query orchestration
│   │   └── prompts.py          # LLM system instructions
│   ├── services/               # Business logic
│   │   ├── executor.py         # Query pipeline orchestrator
│   │   └── response_formatter.py # GeoJSON/chart formatting
│   ├── data/                   # Datasets and metadata
│   │   ├── catalog.json        # Dataset registry
│   │   ├── embeddings.npy      # Vector embeddings
│   │   ├── osm/                # OpenStreetMap data
│   │   ├── admin/              # Administrative boundaries
│   │   ├── global/             # Global datasets (Kontur, etc.)
│   │   └── socioeconomic/      # World Bank, poverty data
│   └── scripts/                # Data ingestion scripts
│       ├── download_geofabrik.py
│       ├── download_hdx_panama.py
│       └── stri_catalog_scraper.py
├── frontend/
│   └── src/
│       ├── app/                # Next.js App Router pages
│       └── components/
│           ├── ChatPanel.tsx   # Chat interface with SSE
│           ├── MapViewer.tsx   # Leaflet map with layers
│           └── DataExplorer.tsx # Tabular data view
└── docs/                       # Technical documentation
    ├── backend/                # Backend deep-dives
    ├── frontend/               # Frontend architecture
    └── data/                   # Data system docs

---

## 🔧 Technology Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **LLM** | Google Gemini 2.0 | Intent detection, SQL generation, explanations |
| **Backend** | Python 3.11 + FastAPI | Async HTTP server with SSE streaming |
| **Database** | DuckDB with Spatial | In-memory spatial analytics |
| **Frontend** | Next.js 15 + React 18 | Server-side rendering + interactive UI |
| **Maps** | Leaflet 1.9 | Interactive web maps |
| **Embeddings** | sentence-transformers | Semantic dataset search |
| **Data** | GeoJSON + Parquet | Standardized geospatial formats |

---

## 📊 Available Datasets

GeoQuery currently includes 100+ datasets across multiple categories:

### Administrative
- Panama provinces, districts, corregimientos (HDX 2021)
- Comarca boundaries
- Electoral districts

### Infrastructure
- Roads and highways (OpenStreetMap)
- Hospitals and health facilities (986 locations)
- Universities and schools (200+ institutions)
- Airports, ports, power plants

### Socioeconomic
- World Bank development indicators
- Multidimensional poverty index (MPI)
- Population density (Kontur H3 hexagons - 33K cells)

### Natural Environment
- Protected areas (STRI GIS Portal)
- Forest cover and land use
- Rivers and water bodies

📖 **[Full Dataset List](docs/data/DATASET_SOURCES.md)** | **[Adding New Data](docs/backend/SCRIPTS.md)**

---

## 💡 How It Works

1. **User Query**: "Show me hospitals in Panama City"
2. **Intent Detection**: LLM classifies as MAP_REQUEST
3. **Semantic Search**: Finds `panama_healthsites_geojson` via embeddings
4. **SQL Generation**: LLM creates: `SELECT name, geom FROM panama_healthsites_geojson WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'Panamá'))`
5. **Execution**: DuckDB Spatial runs query → 45 features
6. **Visualization**: Auto-styled map with 🏥 icons
7. **Explanation**: LLM streams natural language summary

**Streaming**: See the LLM's thinking process in real-time via Server-Sent Events.
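
To make the streaming step concrete, here is a minimal sketch of how a client might split a raw Server-Sent Events stream into individual events. The framing (`event:` and `data:` fields, blank line as terminator) follows the standard SSE format; the event names `thought` and `result` and the payloads are hypothetical and may not match what the GeoQuery backend actually emits.

```python
def parse_sse(stream_text: str) -> list[dict]:
    """Split raw SSE text into events with 'event' and 'data' fields."""
    events = []
    event_type, data_lines = "message", []  # "message" is the SSE default type
    for line in stream_text.splitlines():
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


# Hypothetical event names and payloads, for illustration only.
sample = (
    "event: thought\n"
    "data: Looking for a hospitals dataset\n"
    "\n"
    "event: result\n"
    "data: 45 features\n"
    "\n"
)
events = parse_sse(sample)
for e in events:
    print(e["event"], "->", e["data"])
```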

📖 **[Detailed Data Flow](docs/DATA_FLOW.md)** | **[LLM Integration](docs/backend/LLM_INTEGRATION.md)**

---

## 🗺️ Advanced Features

### Choropleth Maps
Automatically detects numeric columns and creates color gradients:
- **Linear scale**: For area, count
- **Logarithmic scale**: For population, density
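
The difference between the two scales can be sketched as follows. This is an illustrative example rather than GeoQuery's own styling code: the `normalize` and `to_hex` helpers, the color ramp, and the population values are all assumptions.

```python
import math


def normalize(values, scale="linear"):
    """Map values to the range [0, 1] using a linear or log10 scale."""
    if scale == "log":
        if min(values) <= 0:
            raise ValueError("log scale needs strictly positive values")
        values = [math.log10(v) for v in values]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: use a single color
    return [(v - lo) / (hi - lo) for v in values]


def to_hex(t, start=(239, 243, 255), end=(8, 81, 156)):
    """Interpolate between two RGB colors (light blue to dark blue)."""
    r, g, b = (round(s + (e - s) * t) for s, e in zip(start, end))
    return f"#{r:02x}{g:02x}{b:02x}"


populations = [1_000, 10_000, 100_000]  # made-up values spanning two orders of magnitude
linear = normalize(populations, "linear")
log = normalize(populations, "log")
colors = [to_hex(t) for t in log]
print(linear)
print(log)
print(colors)
```

On a linear scale the middle value lands very close to the smallest one, so most features would share nearly the same color; the log scale spreads skewed quantities such as population more evenly across the ramp.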

### Point Visualization Modes
- **Icon markers** 🏥🎓⛪: For categorical POI (<500 points)
- **Circle points** ⭕: For large datasets like intersections (>500 points)
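
A minimal sketch of the selection rule above, assuming an explicit user request (as in "Show intersections in David as circle points") overrides the automatic choice. The `choose_point_style` helper is hypothetical, and treating exactly 500 points as the circle case is an assumption, since the thresholds above leave it unspecified.

```python
from typing import Optional

ICON_LIMIT = 500  # threshold from the list above


def choose_point_style(feature_count: int, requested: Optional[str] = None) -> str:
    """Pick 'icon' or 'circle' for a point layer (hypothetical helper)."""
    if requested in ("icon", "circle"):
        return requested  # assumption: an explicit request wins
    # Assumption: a count of exactly 500 falls on the circle side.
    return "icon" if feature_count < ICON_LIMIT else "circle"


print(choose_point_style(45))                      # e.g. hospitals in Panama City
print(choose_point_style(1288))                    # e.g. intersections in David
print(choose_point_style(45, requested="circle"))  # user asked for circles
```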

### Spatial Operations
- Intersection: "Find hospitals within protected areas"
- Difference: "Show me areas outside national parks"
- Buffer: "Show 5km radius around hospitals"
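
The SQL that the LLM generates for these operations could look roughly like the templates below. This is a sketch, not the project's prompt output: the helper functions and the `protected_areas` and `national_parks` table names are hypothetical, while `ST_Intersects`, `ST_Difference`, `ST_Union_Agg`, and `ST_Buffer` are functions from the DuckDB Spatial extension. Note that `ST_Buffer` measures distance in the units of the geometry's coordinate system, so a 5 km buffer on longitude/latitude data would first need a projection to a metric CRS.

```python
def intersection_sql(points: str, polygons: str) -> str:
    """Points that fall inside any polygon of another table."""
    return (
        f"SELECT p.name, p.geom FROM {points} AS p "
        f"JOIN {polygons} AS a ON ST_Intersects(p.geom, a.geom)"
    )


def difference_sql(regions: str, excluded: str) -> str:
    """Region geometry with the union of the excluded polygons cut out."""
    return (
        f"SELECT r.adm2_name, ST_Difference(r.geom, "
        f"(SELECT ST_Union_Agg(geom) FROM {excluded})) AS geom "
        f"FROM {regions} AS r"
    )


def buffer_sql(points: str, distance: float) -> str:
    """Buffer each point; distance is in the units of the geometry's CRS."""
    return f"SELECT name, ST_Buffer(geom, {distance}) AS geom FROM {points}"


# Table names are interpolated directly, so pass only trusted identifiers.
q_intersection = intersection_sql("panama_healthsites_geojson", "protected_areas")
q_difference = difference_sql("pan_admin2", "national_parks")
q_buffer = buffer_sql("panama_healthsites_geojson", 5000)
for q in (q_intersection, q_difference, q_buffer):
    print(q)
```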

### Multi-Step Queries
Complex questions are automatically decomposed into steps:
- "Compare population density with hospital coverage by province"
  1. Calculate population per province
  2. Count hospitals per province
  3. Compute ratios
  4. Generate comparison chart
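
The four steps can be sketched in plain Python as below. The province figures are made up, and `hospitals_per_100k` is a hypothetical helper; in GeoQuery the first two steps would be SQL queries planned by the LLM rather than hard-coded dictionaries.

```python
# Steps 1 and 2: per-province results (made-up numbers, for illustration only)
population = {"Veraguas": 250_000, "Herrera": 460_000}
hospitals = {"Veraguas": 5, "Herrera": 23}


def hospitals_per_100k(population: dict, hospitals: dict) -> dict:
    """Step 3: hospitals per 100,000 inhabitants for each province."""
    ratios = {}
    for province, people in population.items():
        count = hospitals.get(province, 0)  # provinces without data count as 0
        ratios[province] = round(count / people * 100_000, 2)
    return ratios


ratios = hospitals_per_100k(population, hospitals)

# Step 4: rows for a bar chart, highest coverage first
chart_rows = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
print(chart_rows)
```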

---

## 📚 Documentation

| Document | Description |
|----------|-------------|
| **[ARCHITECTURE.md](ARCHITECTURE.md)** | System design, components, decisions |
| **[SETUP.md](SETUP.md)** | Development environment setup |
| **[docs/backend/CORE_SERVICES.md](docs/backend/CORE_SERVICES.md)** | Backend services reference |
| **[docs/backend/API_ENDPOINTS.md](docs/backend/API_ENDPOINTS.md)** | API endpoint documentation |
| **[docs/frontend/COMPONENTS.md](docs/frontend/COMPONENTS.md)** | React component architecture |
| **[docs/DATA_FLOW.md](docs/DATA_FLOW.md)** | End-to-end request walkthrough |

---

## 📄 License

MIT License - see **[LICENSE](LICENSE)** for details.

---

## 🙏 Acknowledgments

**Data Sources**:
- [OpenStreetMap](https://www.openstreetmap.org/) - Infrastructure and POI data
- [Humanitarian Data Exchange (HDX)](https://data.humdata.org/) - Administrative boundaries
- [World Bank Open Data](https://data.worldbank.org/) - Socioeconomic indicators
- [Kontur Population Dataset](https://data.humdata.org/organization/kontur) - H3 population grid
- [STRI GIS Portal](https://stridata-si.opendata.arcgis.com/) - Environmental datasets

**Technologies**:
- [Google Gemini](https://ai.google.dev/) - LLM API
- [DuckDB](https://duckdb.org/) - Fast in-process analytics
- [Leaflet](https://leafletjs.com/) - Interactive maps
- [Next.js](https://nextjs.org/) - React framework