GeoQuery Architecture

System Overview

GeoQuery is a Territorial Intelligence Platform that combines Large Language Models (LLMs) with geospatial analysis to enable natural language querying of geographic datasets. The system translates conversational queries into SQL, executes spatial operations, and presents results through interactive maps and data visualizations.

Design Philosophy

Natural Language First: Users interact through conversational queries, not SQL or GIS interfaces
Dynamic Data Discovery: No fixed schema—the system adapts to any GeoJSON dataset added to the catalog
Streaming Intelligence: Real-time thought processes and incremental results via Server-Sent Events
Spatial Native: PostGIS-style spatial functions (ST_Intersects, ST_Within, and similar) in DuckDB for performant geospatial analysis
Visual by Default: Automatic map visualization, choropleth generation, and data presentation

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                         Frontend                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │  ChatPanel   │  │  MapViewer   │  │ DataExplorer │     │
│  │  (React)     │  │  (Leaflet)   │  │   (Table)    │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
│         │                  │                  │             │
│         └──────────────────┴──────────────────┘             │
│                           │ (SSE/HTTP)                       │
└───────────────────────────┼─────────────────────────────────┘
                            │
┌───────────────────────────┼─────────────────────────────────┐
│                      API Layer                               │
│  ┌──────────────────────────────────────────────────┐       │
│  │         FastAPI Endpoints                         │       │
│  │  /api/chat (SSE) │ /api/catalog │ /api/schema    │       │
│  └──────────────────────────────────────────────────┘       │
│                           │                                  │
└───────────────────────────┼─────────────────────────────────┘
                            │
┌───────────────────────────┼─────────────────────────────────┐
│                     Service Layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ QueryExecutor│  │   LLMGateway │  │  GeoEngine   │      │
│  │              │  │   (Gemini)   │  │   (DuckDB)   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ DataCatalog  │  │SemanticSearch│  │ SessionStore │      │
│  │ (Embeddings) │  │  (Vectors)   │  │   (Layers)   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                            │
┌───────────────────────────┼─────────────────────────────────┐
│                      Data Layer                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ catalog.json │  │  GeoJSON     │  │ embeddings   │      │
│  │  (Metadata)  │  │  (Datasets)  │  │   (.npy)     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│  ┌──────────────────────────────────────────────────┐      │
│  │         DuckDB In-Memory Database                 │      │
│  │  (Spatial Tables, Temporary Layers, Indexes)     │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘

Core Components

1. Frontend (Next.js + React)

Location: frontend/src/

The frontend is a single-page application built with Next.js that provides:

ChatPanel: Conversational interface with streaming responses
MapViewer: Interactive Leaflet map with layer management
DataExplorer: Tabular data view with export capabilities

Key Technologies:

Next.js 14 (App Router)
React 18 with hooks
Leaflet for map rendering
Server-Sent Events (SSE) for streaming
dnd-kit for drag-and-drop layer reordering

2. API Layer (FastAPI)

Location: backend/api/

RESTful API with streaming support:

/api/chat (POST): Main query endpoint with SSE streaming
/api/catalog (GET): Returns available datasets
/api/schema (GET): Returns database schema

Key Technologies:

FastAPI for async HTTP
Starlette for SSE streaming
CORS middleware for cross-origin requests

3. Service Layer

QueryExecutor (`backend/services/executor.py`)

Orchestrates the entire query pipeline:

Intent detection
Data discovery
SQL generation
Query execution
Response formatting
Explanation generation
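
The first four stages above can be sketched as an async generator that emits one event per stage, which is what makes streaming progress to the client straightforward. This is a minimal sketch, not the actual `QueryExecutor`: `run_pipeline`, `collect`, and the injected callables are hypothetical names, and response formatting and explanation generation are left out for brevity.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def run_pipeline(
    query: str,
    detect_intent: Callable[[str], Awaitable[str]],
    discover: Callable[[str], list[str]],
    generate_sql: Callable[[str, list[str]], Awaitable[str]],
    execute: Callable[[str], dict],
) -> AsyncIterator[dict]:
    """Run the stages in order and yield one event after each stage."""
    intent = await detect_intent(query)
    yield {"stage": "intent", "value": intent}

    tables = discover(query)
    yield {"stage": "discovery", "value": tables}

    sql = await generate_sql(query, tables)
    yield {"stage": "sql", "value": sql}

    result = execute(sql)
    yield {"stage": "result", "value": result}


async def collect(events: AsyncIterator[dict]) -> list[dict]:
    """Drain an event stream into a list (handy for tests)."""
    return [event async for event in events]
```

Passing the stages in as callables keeps the orchestration testable without a live LLM or database.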

LLMGateway (`backend/core/llm_gateway.py`)

Interfaces with Gemini API:

Intent detection with thinking
Text-to-SQL generation
Natural language explanations
Layer naming and styling
Error correction
Streaming support

GeoEngine (`backend/core/geo_engine.py`)

Manages spatial database:

DuckDB connection with Spatial extension
Lazy table loading from GeoJSON
SQL query execution
Result formatting to GeoJSON
Temporary layer registration
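
Result formatting to GeoJSON can be illustrated with a small helper. The sketch below assumes each result row is a dict whose geometry has already been serialized to a GeoJSON string in SQL (for example with `ST_AsGeoJSON(geom) AS geom_json`); the function name and the `geom_json` column are hypothetical, not taken from `geo_engine.py`.

```python
import json


def rows_to_feature_collection(rows, geometry_key="geom_json"):
    """Convert query rows into a GeoJSON FeatureCollection dict."""
    features = []
    for row in rows:
        # Every column except the geometry becomes a feature property.
        properties = {k: v for k, v in row.items() if k != geometry_key}
        features.append({
            "type": "Feature",
            "geometry": json.loads(row[geometry_key]),
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}
```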

DataCatalog (`backend/core/data_catalog.py`)

Dataset discovery system:

Loads catalog.json metadata
Generates table summaries for LLM context
Provides schema information
Manages dataset metadata

SemanticSearch (`backend/core/semantic_search.py`)

Vector-based dataset discovery:

Generates embeddings for dataset descriptions
Performs cosine similarity search
Returns top-k relevant datasets
Scales to large catalogs (100+ datasets)
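
The ranking step can be sketched in a few lines. This version uses only the standard library so it runs anywhere; a real implementation would more likely vectorize the same computation with NumPy. `cosine_similarity` and `top_k_tables` are illustrative names, and the vectors are assumed to come from an embedding model that is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_tables(query_vec, table_vecs, k=3):
    """Return the k table names whose embeddings are closest to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in table_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]
```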

SessionStore (`backend/core/session_store.py`)

User session management:

Tracks created map layers per session
Enables spatial operations on user layers
Maintains layer metadata
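
A per-session layer registry can be as simple as a nested dictionary. The class below is a hypothetical sketch of that idea, not the contents of `session_store.py`; the method names and metadata shape are assumptions.

```python
class InMemorySessionStore:
    """Tracks map layers per session; everything is lost on restart."""

    def __init__(self):
        self._layers = {}  # session_id -> {layer_id: metadata}

    def add_layer(self, session_id, layer_id, metadata):
        self._layers.setdefault(session_id, {})[layer_id] = metadata

    def get_layers(self, session_id):
        # Return a copy so callers cannot mutate the store by accident.
        return dict(self._layers.get(session_id, {}))

    def clear(self, session_id):
        self._layers.pop(session_id, None)
```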

4. Data Layer

Catalog System (`backend/data/catalog.json`)

Central metadata registry:

Dataset paths and descriptions
Semantic descriptions for AI discovery
Categories and tags
Schema information
Data provenance

GeoJSON Datasets (`backend/data/`)

Organized by source:

osm/ - OpenStreetMap data (roads, buildings, POI)
admin/ - Administrative boundaries (HDX)
global/ - Global datasets (Kontur, Natural Earth)
socioeconomic/ - World Bank, MPI data
stri/ - STRI GIS Portal datasets

Vector Embeddings (`backend/data/embeddings.npy`)

Sentence transformer embeddings for semantic search

Data Flow: User Query to Response

Step 1: User Input

User: "Show me hospitals in Panama City"

Step 2: Frontend → Backend

POST /api/chat
{
  "message": "Show me hospitals in Panama City",
  "history": []
}

Step 3: Intent Detection (LLM)

# QueryExecutor calls LLMGateway.detect_intent()
intent = await llm.detect_intent(query, history)
# Returns: "MAP_REQUEST"

Step 4: Semantic Discovery

# SemanticSearch finds relevant tables
candidates = semantic_search.search_table_names(query, top_k=15)
# Returns: ["panama_healthsites_geojson", "osm_amenities", ...]

Step 5: Table Schema Retrieval

# GeoEngine loads relevant tables
geo_engine.ensure_table_loaded("panama_healthsites_geojson")
schema = geo_engine.get_table_schemas()
# Returns: "Table: panama_healthsites_geojson\nColumns: name, amenity, geom..."

Step 6: SQL Generation (LLM)

# LLMGateway generates SQL
sql = await llm.generate_analytical_sql(query, schema, history)
# Returns: "SELECT name, amenity, geom FROM panama_healthsites_geojson 
#           WHERE amenity = 'hospital' AND ST_Intersects(geom, ...)"

Step 7: Query Execution

# GeoEngine executes spatial query
geojson = geo_engine.execute_spatial_query(sql)
# Returns: GeoJSON with 45 hospital features

Step 8: Response Formatting

# Add layer metadata, generate name, configure visualization
layer_info = await llm.generate_layer_name(query, sql)
# Returns: {"name": "Hospitals in Panama City", "emoji": "🏥", "pointStyle": "icon"}

geojson = format_geojson_layer(query, geojson, features, 
                                layer_info["name"], 
                                layer_info["emoji"],
                                layer_info["pointStyle"])

Step 9: Explanation Generation (Streaming)

# LLMGateway generates explanation with streaming
async for chunk in llm.stream_explanation(query, sql, data_summary, history):
    if chunk["type"] == "thought":
        ...  # Placeholder: stream thinking process to frontend
    elif chunk["type"] == "content":
        ...  # Placeholder: stream actual response text

Step 10: Frontend Rendering

ChatPanel displays streamed explanation
MapViewer renders GeoJSON layer with hospital icons
DataExplorer shows tabular data

Key Design Decisions

1. Why DuckDB Instead of PostgreSQL?

Chosen: DuckDB with Spatial extension

Rationale:

Zero Configuration: Embedded database, no separate server
Fast Analytics: Columnar storage optimized for analytical queries
Spatial Support: PostGIS-style ST_* functions via the spatial extension (broad coverage, though not full PostGIS compatibility)
GeoJSON Native: Direct GeoJSON import/export
Lightweight: Perfect for development and small deployments

Trade-off: Limited concurrency compared to PostgreSQL (acceptable for our use case)

2. Why Semantic Search for Dataset Discovery?

Chosen: Sentence transformer embeddings + cosine similarity

Rationale:

Scalability: Works with 100+ datasets without overwhelming LLM context
Accuracy: Better matches than keyword search
Token Efficiency: Only sends relevant table schemas to LLM

Example:

Query: "Where can I find doctors?"
Semantic search finds: panama_healthsites_geojson (closest match)
LLM then generates SQL using only relevant schema

3. Why Server-Sent Events for Streaming?

Chosen: SSE instead of WebSockets

Rationale:

Simpler Protocol: One-way communication (server → client)
HTTP Compatible: Works through firewalls and proxies
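Auto Reconnect: Built into the browser's EventSource API (EventSource only issues GET requests, so a POST endpoint such as /api/chat has to be read with fetch and handle reconnection itself)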
Event Types: Named events for different message types

Trade-off: No client → server streaming (not needed for our use case)
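
To make the named-event point concrete, here is how a single SSE message is framed on the wire. The helper name `format_sse` is hypothetical, and using the chunk types `thought` and `content` as SSE event names is an assumption about how the backend maps them.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Frame one Server-Sent Events message with a named event type."""
    # json.dumps emits a single line, so one "data:" field is sufficient.
    payload = json.dumps(data, ensure_ascii=False)
    # A blank line terminates the message.
    return f"event: {event}\ndata: {payload}\n\n"
```

A streaming endpoint would yield strings like these from a generator and serve them with the `text/event-stream` media type.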

4. Why Lazy Table Loading?

Chosen: Load GeoJSON only when needed

Rationale:

Fast Startup: Don't load all datasets on initialization
Memory Efficient: Only keep active tables in memory
Flexible: Easy to add new datasets without restart

Implementation:

def ensure_table_loaded(self, table_name: str) -> bool:
    if table_name not in self.loaded_tables:
        self.load_geojson_to_table(table_name)
    return table_name in self.loaded_tables
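
A self-contained variant of the same pattern, runnable without DuckDB, might look like the following. `LazyTableRegistry`, `catalog`, and `loader` are hypothetical names introduced for illustration; the real `GeoEngine` loads GeoJSON into DuckDB tables instead of calling a generic loader.

```python
import logging


class LazyTableRegistry:
    """Loads each table at most once, on first request."""

    def __init__(self, catalog, loader):
        self.catalog = catalog      # table name -> file path
        self.loader = loader        # callable(path) -> loaded table
        self.loaded_tables = {}

    def ensure_table_loaded(self, table_name: str) -> bool:
        if table_name in self.loaded_tables:
            return True
        path = self.catalog.get(table_name)
        if path is None:
            return False
        try:
            self.loaded_tables[table_name] = self.loader(path)
        except OSError as exc:
            # Missing or unreadable file: log a warning and carry on.
            logging.warning("Could not load %s: %s", table_name, exc)
            return False
        return True
```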

5. Why Choropleth Auto-Detection?

Chosen: Automatic choropleth configuration based on data

Rationale:

User Friendly: No manual configuration needed
Intelligent: Prioritizes meaningful columns (population, area, density)
Adaptive: Works with any numeric column

Logic:

Find numeric columns
Prioritize keywords (population, area, count)
Check value variance (skip if all same)
Enable choropleth with appropriate scale (linear/log)
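
The four steps above could be sketched as follows. The function name, the keyword tuple, and especially the max/min ratio used to switch to a log scale are assumptions made for illustration; the source does not state the actual thresholds.

```python
PRIORITY_KEYWORDS = ("population", "area", "density", "count")


def pick_choropleth(rows, log_ratio=1000.0):
    """Pick a numeric column and scale for a choropleth, or None."""
    if not rows:
        return None

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    # Step 1: columns that are numeric in every row.
    numeric = [c for c in rows[0] if all(is_number(r.get(c)) for r in rows)]
    # Step 2: keyword matches first (the sort is stable otherwise).
    numeric.sort(
        key=lambda c: 0 if any(k in c.lower() for k in PRIORITY_KEYWORDS) else 1
    )
    for col in numeric:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # Step 3: no variance, nothing to color by.
        # Step 4: log scale when values span several orders of magnitude.
        scale = "log" if lo > 0 and hi / lo >= log_ratio else "linear"
        return {"column": col, "scale": scale}
    return None
```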

Error Handling & Resilience

SQL Error Correction

When a generated SQL query fails:

Extract error message
Send to LLM with original query and schema
LLM generates corrected SQL
Execute repaired query
If it still fails, return the error to the user
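
The repair loop described above can be sketched with two injected callables. `execute_with_repair`, `execute`, and `repair` are hypothetical names; in the real system the repair step would call the LLM with the original query and schema as well as the error message.

```python
def execute_with_repair(sql, execute, repair, max_repairs=1):
    """Run sql; on failure ask repair() for a fix and retry."""
    attempt_sql = sql
    for attempt in range(max_repairs + 1):
        try:
            return execute(attempt_sql)
        except Exception as exc:
            if attempt == max_repairs:
                raise  # Out of repair attempts: surface the error.
            attempt_sql = repair(attempt_sql, str(exc))
```

Capping the number of repairs keeps a persistently failing query from looping and burning LLM calls.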

Data Unavailable Handling

When requested data doesn't exist:

LLM returns special error marker: -- ERROR: DATA_UNAVAILABLE
System extracts "Requested" and "Available" from response
Returns helpful message to user with alternatives
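
Detecting the marker and pulling out the two fields might look like this. The exact layout of the "Requested" and "Available" lines is not specified in this document, so the `-- Requested: ...` and `-- Available: ...` comment format below is an assumption, as is the function name.

```python
import re

MARKER = "-- ERROR: DATA_UNAVAILABLE"


def parse_data_unavailable(llm_output):
    """Return the requested/available fields, or None if no marker."""
    if MARKER not in llm_output:
        return None
    fields = {}
    for key in ("Requested", "Available"):
        # Assumed format: one SQL comment line per field.
        match = re.search(
            rf"^--\s*{key}:\s*(.+)$", llm_output, flags=re.MULTILINE
        )
        fields[key.lower()] = match.group(1).strip() if match else ""
    return fields
```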

Missing Tables

The catalog lists all datasets, but not all of them are loaded
Lazy loading attempts to load a table on demand
If the file is missing, the system logs a warning and continues

Performance Considerations

Query Optimization

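Spatial Indexes: DuckDB does not index geometry columns automatically; recent versions of the spatial extension support explicit R-tree indexes (CREATE INDEX ... USING RTREE) where they help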
Top-K Limits: Large result sets limited to prevent memory issues
Lazy Evaluation: Stream results when possible

Embedding Cache

Embeddings pre-computed and stored in .npy file
Only regenerated when catalog changes
Fast cosine similarity via NumPy vectorization
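
One way to decide when the catalog has changed is to store a fingerprint next to the embeddings and compare it at startup. The source does not say how change detection is done, so the hashing approach and the function names below are assumptions.

```python
import hashlib
import json


def catalog_fingerprint(catalog: dict) -> str:
    """Stable hash of the catalog contents, independent of key order."""
    canonical = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_regeneration(catalog: dict, stored_fingerprint: str) -> bool:
    """True when the stored embeddings no longer match the catalog."""
    return stored_fingerprint != catalog_fingerprint(catalog)
```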

Frontend Rendering

Layer Virtualization: Large point datasets use circle markers for performance
Choropleth Colors: Pre-computed color palettes
Lazy Map Loading: Only render visible layers

Security Considerations

LLM Prompt Injection

Mitigation: Clear separation of user query and system instructions
Validation: SQL parsing and column name verification
Sandboxing: Read-only queries (no INSERT/UPDATE/DELETE)
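
A conservative read-only check could be sketched as below. It is a keyword filter rather than a real SQL parser, so it will reject some legitimate queries (for instance a string literal containing a semicolon or the word "update"); the function name and keyword list are assumptions, not the project's actual validator.

```python
import re

FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|COPY)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept a single SELECT/WITH statement with no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # More than one statement.
    if not re.match(r"(?i)^(SELECT|WITH)\b", statement):
        return False
    return FORBIDDEN.search(statement) is None
```

Erring on the side of rejection is a deliberate choice here: a false rejection costs one retry, while a false acceptance could modify data.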

API Access

CORS: Configured allowed origins
Rate Limiting: Can be added via middleware (not currently implemented)
Authentication: Not implemented (suitable for internal/demo deployments)

Data Privacy

No user data persisted to disk (queries are stateless; conversation history is sent by the client with each request)
Session layers stored in-memory only
No query logging by default

Scalability Path

Current Limitations

Single Process: No horizontal scaling
In-Memory Database: Limited by RAM
No Caching: Repeated queries re-execute

Future Enhancements

Add PostgreSQL/PostGIS: For production deployments with persistence
Redis Cache: Cache query results and embeddings
Load Balancer: Multiple FastAPI instances
Background Workers: Async data ingestion with Celery
CDN: Serve GeoJSON datasets from cloud storage

Technology Choices Summary

Component	Technology	Why?
Backend Language	Python 3.11+	Rich geospatial ecosystem, LLM SDKs
Web Framework	FastAPI	Async support, OpenAPI docs, SSE
Database	DuckDB	Embedded, fast analytics, spatial support
LLM	Google Gemini	Thinking mode, streaming, JSON output
Frontend Framework	Next.js 14	React, SSR, App Router, TypeScript
Map Library	Leaflet	Lightweight, flexible, plugin ecosystem
Embeddings	sentence-transformers	Multilingual, semantic similarity
Data Format	GeoJSON	Standard, human-readable, LLM-friendly

Next Steps

For detailed information on specific components:
