Spaces:

GerardCB
/

GeoQuery

Running

App Files Files Community

GeoQuery / docs /DATA_FLOW.md

GerardCB

Deploy to Spaces (Final Clean)

4851501 1 day ago

preview code

raw

history blame contribute delete

12.7 kB

	# Data Flow: End-to-End Request Processing

	This document provides a detailed walkthrough of how a user query flows through the GeoQuery system from input to visualization.

	---

	## Overview

	```
	User Query → Intent Detection → Semantic Search → SQL Generation →
	Query Execution → Result Formatting → Explanation → Map Rendering
	```

	Timeline: 2-8 seconds for typical queries

	---

	## Step-by-Step Walkthrough

	### Example Query

	User Input: "Show me hospitals in Panama City"

	---

	### Step 1: Frontend - User Submits Query

	Component: `ChatPanel.tsx`

	```typescript
	const handleSubmit = async (message: string) => {
	// Add user message to chat
	setMessages(prev => [...prev, { role: 'user', content: message }]);

	// Send to backend via SSE
	const response = await fetch('http://localhost:8000/api/chat', {
	method: 'POST',
	headers: {'Content-Type': 'application/json'},
	body: JSON.stringify({ message, history })
	});

	// Start streaming response
	const reader = response.body.getReader();
	...
	};
	```

	Request Payload:
	```json
	{
	"message": "Show me hospitals in Panama City",
	"history": []
	}
	```

	---

	### Step 2: Backend - API Endpoint Receives Request

	File: `backend/api/endpoints/chat.py`

	```python
	@router.post("/chat")
	async def chat(request: ChatRequest):
	# Initialize executor
	executor = QueryExecutor()

	# Process query with streaming
	async for event in executor.process_query_stream(
	request.message,
	request.history
	):
	yield sse_format(event)
	```

	Action: Routes to `QueryExecutor.process_query_stream()`

	---

	### Step 3: Intent Detection

	Service: `LLMGateway.detect_intent()`
	File: `backend/core/llm_gateway.py`

	LLM Prompt:
	```
	Analyze this user query and determine the best output type.

	User Query: "Show me hospitals in Panama City"

	THINK STEP BY STEP:
	1. What is the user asking for?
	2. Does this require geographic visualization (map)?
	...

	Respond with ONLY ONE of these exact words:
	- GENERAL_CHAT
	- DATA_QUERY
	- MAP_REQUEST
	- SPATIAL_OP
	- STAT_QUERY
	```

	Gemini Response:
	```
	Thinking: The user wants to SEE hospitals on a map, explicitly asks to "show"
	Response: MAP_REQUEST
	```

	Streaming to Frontend:
	```json
	{
	"event": "intent",
	"data": {"intent": "MAP_REQUEST"}
	}
	```

	Frontend: Displays intent badge in chat

	---

	### Step 4: Semantic Discovery

	Service: `SemanticSearch.search_table_names()`
	File: `backend/core/semantic_search.py`

	Process:
	1. Convert query to embedding vector (384 dimensions)
	```python
	query_embedding = model.encode("Show me hospitals in Panama City")
	```

	2. Calculate cosine similarity with all dataset embeddings
	```python
	similarities = cosine_similarity(query_embedding, catalog_embeddings)
	```

	3. Return top-k matches
	```python
	top_k_indices = np.argsort(similarities)[-15:][::-1]
	```

	Result:
	```python
	[
	"panama_healthsites_geojson", # similarity: 0.89
	"osm_amenities", # similarity: 0.76
	"panama_hospitals", # similarity: 0.74
	"osm_healthcare", # similarity: 0.71
	...
	]
	```

	Streaming to Frontend:
	```json
	{
	"event": "status",
	"data": {"status": "📚 Searching data catalog..."}
	}
	```

	Performance: <10ms for 100+ datasets

	---

	### Step 5: Table Schema Retrieval

	Service: `GeoEngine.ensure_table_loaded()`
	File: `backend/core/geo_engine.py`

	Process:
	1. Check if table already loaded in DuckDB
	```python
	if "panama_healthsites_geojson" not in self.loaded_tables:
	```

	2. Load GeoJSON file
	```python
	gdf = gpd.read_file("backend/data/osm/healthsites.geojson")
	```

	3. Register in DuckDB
	```python
	conn.register("panama_healthsites_geojson", gdf)
	self.loaded_tables["panama_healthsites_geojson"] = gdf
	```

	Schema Retrieved:
	```
	Table: panama_healthsites_geojson
	Columns: osm_id, name, amenity, operator, addr:city, geom
	Row count: 986
	Geometry type: Point
	```

	Streaming to Frontend:
	```json
	{
	"event": "status",
	"data": {"status": "💾 Loading tables: panama_healthsites_geojson..."}
	}
	```

	---

	### Step 6: SQL Generation

	Service: `LLMGateway.generate_analytical_sql()`
	File: `backend/core/llm_gateway.py`

	Prompt to Gemini:
	```
	You are a DuckDB SQL expert. Generate a valid DuckDB SQL query.

	Table: panama_healthsites_geojson
	Columns: osm_id, name, amenity, operator, addr:city, geom
	Row count: 986

	User Request: "Show me hospitals in Panama City"

	Rules:
	1. Return ONLY the SQL query
	2. ALWAYS include 'geom' for map visualization
	3. Use ILIKE for case-insensitive matching
	4. Use ST_Intersects for spatial filtering
	...

	Generate SQL:
	```

	Gemini Thinking (streamed):
	```
	"I need to:
	1. Filter for amenity='hospital'
	2. Filter for Panama City (district name is 'Panamá')
	3. Use spatial intersection with admin boundaries
	4. Include geometry for map rendering"
	```

	Gemini Response:
	```sql
	SELECT
	name,
	amenity,
	operator,
	geom
	FROM panama_healthsites_geojson
	WHERE amenity = 'hospital'
	AND ST_Intersects(
	geom,
	(SELECT geom FROM pan_admin2 WHERE adm2_name = 'Panamá')
	)
	```

	Streaming to Frontend:
	```json
	{
	"event": "chunk",
	"data": {"type": "thought", "content": "I need to filter for hospitals..."}
	}
	```

	Performance: ~1 second for SQL generation

	---

	### Step 7: Query Execution

	Service: `GeoEngine.execute_spatial_query()`
	File: `backend/core/geo_engine.py`

	Execution:
	```python
	try:
	result = conn.execute(sql).fetchdf()
	geojson = json.loads(result.to_json())
	except Exception as e:
	# Try SQL correction
	corrected_sql = await llm.correct_sql(query, sql, str(e), schema)
	result = conn.execute(corrected_sql).fetchdf()
	```

	Result:
	```python
	# GeoDataFrame with 45 rows
	name amenity operator geom
	0 Hospital Santo Tomás hospital MINSA POINT(...)
	1 Hospital del Niño hospital CSS POINT(...)
	...
	```

	Convert to GeoJSON:
	```json
	{
	"type": "FeatureCollection",
	"features": [
	{
	"type": "Feature",
	"geometry": {"type": "Point", "coordinates": [-79.5, 8.98]},
	"properties": {"name": "Hospital Santo Tomás", "amenity": "hospital"}
	},
	...
	]
	}
	```

	Streaming to Frontend:
	```json
	{
	"event": "status",
	"data": {"status": "✅ Found 45 results"}
	}
	```

	Performance: 100ms - 2s depending on data size

	---

	### Step 8: Result Formatting

	Service: `ResponseFormatter.format_geojson_layer()`
	File: `backend/services/response_formatter.py`

	Layer Name Generation:
	```python
	layer_info = await llm.generate_layer_name(query, sql)
	# Returns: {"name": "Hospitals in Panama City", "emoji": "🏥", "pointStyle": "icon"}
	```

	GeoJSON Enhancement:
	```python
	geojson["properties"] = {
	"layer_id": "abc123",
	"layer_name": "Hospitals in Panama City",
	"style": {
	"color": "#E63946",
	"fillColor": "#E63946",
	"opacity": 0.8,
	"fillOpacity": 0.4
	},
	"pointMarker": {
	"icon": "🏥",
	"style": "icon",
	"color": "#E63946",
	"size": 32
	},
	"choropleth": {"enabled": false}
	}
	```

	Auto-Detection:
	- Detects geometry type (Point)
	- Checks for numeric columns (none meaningful)
	- Configures point marker style based on `pointStyle: "icon"`

	---

	### Step 9: Explanation Generation

	Service: `LLMGateway.stream_explanation()`
	File: `backend/core/llm_gateway.py`

	Prompt to Gemini:
	```
	Explain the results of this data query to the user.

	User Question: "Show me hospitals in Panama City"
	SQL Query: SELECT name, amenity, geom FROM ... WHERE amenity='hospital'...
	Data Result Summary: Found 45 features (Points)

	Instructions:
	1. Keep response concise
	2. Only describe ACTUAL data returned
	3. Cite data source
	4. Speak as GeoQuery
	```

	Gemini Response (streamed):
	```
	"I have located 45 hospitals within Panama City district.

	The map displays each hospital as a 🏥 icon. You can click on any
	marker to see details including the hospital name and operator.

	Key facilities include:
	- Hospital Santo Tomás (MINSA)
	- Hospital del Niño (CSS)
	- Hospital Punta Pacifica (Private)

	Source: Healthcare facility data from OpenStreetMap via Healthsites.io"
	```

	Streaming to Frontend:
	```json
	{
	"event": "chunk",
	"data": {"type": "text", "content": "I have located 45 hospitals..."}
	}
	```

	Performance: ~1-2 seconds for explanation

	---

	### Step 10: Final Result Event

	Event Sent:
	```json
	{
	"event": "result",
	"data": {
	"response": "I have located 45 hospitals within Panama City...",
	"sql_query": "SELECT name, amenity, geom FROM ...",
	"geojson": { /* GeoJSON with 45 features */ },
	"chart_data": null,
	"raw_data": [ /* 45 rows of data */ ],
	"data_citations": [
	"Healthcare facility data from OpenStreetMap via Healthsites.io"
	]
	}
	}
	```

	---

	### Step 11: Frontend - Map Rendering

	Component: `MapViewer.tsx`

	Process:
	1. Receive GeoJSON from result event
	2. Create new MapLayer
	```typescript
	const newLayer: MapLayer = {
	id: geojson.properties.layer_id,
	name: geojson.properties.layer_name,
	data: geojson,
	visible: true,
	style: geojson.properties.style,
	pointMarker: geojson.properties.pointMarker
	};
	```

	3. Render with Leaflet
	```typescript
	<GeoJSON
	data={layer.data}
	pointToLayer={(feature, latlng) => {
	if (layer.pointMarker?.style === "icon") {
	return L.marker(latlng, {
	icon: L.divIcon({
	html: `<div style="font-size: 32px">${layer.pointMarker.icon}</div>`
	})
	});
	}
	}}
	/>
	```

	4. Auto-fit bounds to show all hospitals
	5. Display layer in legend panel

	Result: Interactive map with 45 hospital markers (🏥 icons)

	---

	## Performance Breakdown

	\| Step \| Service \| Time \| Async \|
	\|------\|---------\|------\|-------\|
	\| 1. Frontend Submit \| - \| <10ms \| - \|
	\| 2. API Routing \| FastAPI \| <5ms \| - \|
	\| 3. Intent Detection \| Gemini \| ~500ms \| ✓ \|
	\| 4. Semantic Search \| SentenceTransformer \| <10ms \| ✓ \|
	\| 5. Schema Loading \| DuckDB \| 50-200ms \| - \|
	\| 6. SQL Generation \| Gemini \| ~1s \| ✓ Streamed \|
	\| 7. Query Execution \| DuckDB \| 100ms-2s \| - \|
	\| 8. Formatting \| Python \| 10-50ms \| - \|
	\| 9. Explanation \| Gemini \| ~1s \| ✓ Streamed \|
	\| 10. Frontend Render \| Leaflet \| 50-200ms \| - \|

	Total: 2-5 seconds (perception: faster due to streaming)

	---

	## Error Handling Flow

	### SQL Execution Failure

	```
	SQL Error → Extract Error Message → Send to LLM → Generate Corrected SQL →
	Retry Execution → If Still Fails → Return Error to User
	```

	Example:
	```python
	try:
	result = execute_query(sql)
	except Exception as e:
	# Error: column "hospitals" does not exist
	corrected_sql = await llm.correct_sql(query, sql, str(e), schema)
	# LLM fixes: hospitals → panama_healthsites_geojson
	result = execute_query(corrected_sql)
	```

	### Data Unavailable

	```
	LLM Realizes Data Missing → Returns Special Marker →
	System Detects Marker → Returns Helpful Error Message
	```

	Example:
	```sql
	-- ERROR: DATA_UNAVAILABLE
	-- Requested: crime statistics
	-- Available: admin boundaries, hospitals, schools
	```

	---

	## Streaming Architecture

	Benefits of SSE (Server-Sent Events):
	1. Progressive Disclosure: User sees thinking process
	2. Faster Perceived Performance: Content streams in
	3. Transparency: Shows "why" behind answers
	4. Simple Protocol: HTTP-based, works everywhere

	Event Types:
	- `status`: Processing updates ("🔍 Searching...", "⚡ Executing...")
	- `intent`: Detected intent category
	- `chunk`: Streamed content (thought or text)
	- `result`: Final payload with all data

	---

	## Complex Query Flow

	For queries requiring multiple steps (e.g., "Compare hospital density with school density by province"):

	1. Complexity Detection: QueryPlanner identifies multi-dataset query
	2. Step Decomposition: Break into atomic steps
	- Step 1: Count hospitals per province
	- Step 2: Count schools per province
	- Step 3: Calculate ratios
	3. Parallel Execution: Execute independent steps concurrently
	4. Result Combination: Merge results for final answer
	5. Unified Explanation: LLM explains combined analysis

	See `backend/core/query_planner.py` for implementation.

	---

	## Next Steps

	- Backend Services: [backend/CORE_SERVICES.md](backend/CORE_SERVICES.md)
	- API Reference: [backend/API_ENDPOINTS.md](backend/API_ENDPOINTS.md)
	- Frontend Components: [frontend/COMPONENTS.md](frontend/COMPONENTS.md)