# Data Flow: End-to-End Request Processing This document provides a detailed walkthrough of how a user query flows through the GeoQuery system from input to visualization. --- ## Overview ``` User Query → Intent Detection → Semantic Search → SQL Generation → Query Execution → Result Formatting → Explanation → Map Rendering ``` **Timeline**: 2-8 seconds for typical queries --- ## Step-by-Step Walkthrough ### Example Query **User Input**: *"Show me hospitals in Panama City"* --- ### Step 1: Frontend - User Submits Query **Component**: `ChatPanel.tsx` ```typescript const handleSubmit = async (message: string) => { // Add user message to chat setMessages(prev => [...prev, { role: 'user', content: message }]); // Send to backend via SSE const response = await fetch('http://localhost:8000/api/chat', { method: 'POST', headers: {'Content-Type': 'application/json'}, body: JSON.stringify({ message, history }) }); // Start streaming response const reader = response.body.getReader(); ... }; ``` **Request Payload**: ```json { "message": "Show me hospitals in Panama City", "history": [] } ``` --- ### Step 2: Backend - API Endpoint Receives Request **File**: `backend/api/endpoints/chat.py` ```python @router.post("/chat") async def chat(request: ChatRequest): # Initialize executor executor = QueryExecutor() # Process query with streaming async for event in executor.process_query_stream( request.message, request.history ): yield sse_format(event) ``` **Action**: Routes to `QueryExecutor.process_query_stream()` --- ### Step 3: Intent Detection **Service**: `LLMGateway.detect_intent()` **File**: `backend/core/llm_gateway.py` **LLM Prompt**: ``` Analyze this user query and determine the best output type. User Query: "Show me hospitals in Panama City" THINK STEP BY STEP: 1. What is the user asking for? 2. Does this require geographic visualization (map)? ... Respond with ONLY ONE of these exact words: - GENERAL_CHAT - DATA_QUERY - MAP_REQUEST - SPATIAL_OP - STAT_QUERY ``` **Gemini Response**: ``` Thinking: The user wants to SEE hospitals on a map, explicitly asks to "show" Response: MAP_REQUEST ``` **Streaming to Frontend**: ```json { "event": "intent", "data": {"intent": "MAP_REQUEST"} } ``` **Frontend**: Displays intent badge in chat --- ### Step 4: Semantic Discovery **Service**: `SemanticSearch.search_table_names()` **File**: `backend/core/semantic_search.py` **Process**: 1. Convert query to embedding vector (384 dimensions) ```python query_embedding = model.encode("Show me hospitals in Panama City") ``` 2. Calculate cosine similarity with all dataset embeddings ```python similarities = cosine_similarity(query_embedding, catalog_embeddings) ``` 3. Return top-k matches ```python top_k_indices = np.argsort(similarities)[-15:][::-1] ``` **Result**: ```python [ "panama_healthsites_geojson", # similarity: 0.89 "osm_amenities", # similarity: 0.76 "panama_hospitals", # similarity: 0.74 "osm_healthcare", # similarity: 0.71 ... ] ``` **Streaming to Frontend**: ```json { "event": "status", "data": {"status": "📚 Searching data catalog..."} } ``` **Performance**: <10ms for 100+ datasets --- ### Step 5: Table Schema Retrieval **Service**: `GeoEngine.ensure_table_loaded()` **File**: `backend/core/geo_engine.py` **Process**: 1. Check if table already loaded in DuckDB ```python if "panama_healthsites_geojson" not in self.loaded_tables: ``` 2. Load GeoJSON file ```python gdf = gpd.read_file("backend/data/osm/healthsites.geojson") ``` 3. Register in DuckDB ```python conn.register("panama_healthsites_geojson", gdf) self.loaded_tables["panama_healthsites_geojson"] = gdf ``` **Schema Retrieved**: ``` Table: panama_healthsites_geojson Columns: osm_id, name, amenity, operator, addr:city, geom Row count: 986 Geometry type: Point ``` **Streaming to Frontend**: ```json { "event": "status", "data": {"status": "💾 Loading tables: panama_healthsites_geojson..."} } ``` --- ### Step 6: SQL Generation **Service**: `LLMGateway.generate_analytical_sql()` **File**: `backend/core/llm_gateway.py` **Prompt to Gemini**: ``` You are a DuckDB SQL expert. Generate a valid DuckDB SQL query. Table: panama_healthsites_geojson Columns: osm_id, name, amenity, operator, addr:city, geom Row count: 986 User Request: "Show me hospitals in Panama City" Rules: 1. Return ONLY the SQL query 2. ALWAYS include 'geom' for map visualization 3. Use ILIKE for case-insensitive matching 4. Use ST_Intersects for spatial filtering ... Generate SQL: ``` **Gemini Thinking** (streamed): ``` "I need to: 1. Filter for amenity='hospital' 2. Filter for Panama City (district name is 'Panamá') 3. Use spatial intersection with admin boundaries 4. Include geometry for map rendering" ``` **Gemini Response**: ```sql SELECT name, amenity, operator, geom FROM panama_healthsites_geojson WHERE amenity = 'hospital' AND ST_Intersects( geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'Panamá') ) ``` **Streaming to Frontend**: ```json { "event": "chunk", "data": {"type": "thought", "content": "I need to filter for hospitals..."} } ``` **Performance**: ~1 second for SQL generation --- ### Step 7: Query Execution **Service**: `GeoEngine.execute_spatial_query()` **File**: `backend/core/geo_engine.py` **Execution**: ```python try: result = conn.execute(sql).fetchdf() geojson = json.loads(result.to_json()) except Exception as e: # Try SQL correction corrected_sql = await llm.correct_sql(query, sql, str(e), schema) result = conn.execute(corrected_sql).fetchdf() ``` **Result**: ```python # GeoDataFrame with 45 rows name amenity operator geom 0 Hospital Santo Tomás hospital MINSA POINT(...) 1 Hospital del Niño hospital CSS POINT(...) ... ``` **Convert to GeoJSON**: ```json { "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": {"type": "Point", "coordinates": [-79.5, 8.98]}, "properties": {"name": "Hospital Santo Tomás", "amenity": "hospital"} }, ... ] } ``` **Streaming to Frontend**: ```json { "event": "status", "data": {"status": "✅ Found 45 results"} } ``` **Performance**: 100ms - 2s depending on data size --- ### Step 8: Result Formatting **Service**: `ResponseFormatter.format_geojson_layer()` **File**: `backend/services/response_formatter.py` **Layer Name Generation**: ```python layer_info = await llm.generate_layer_name(query, sql) # Returns: {"name": "Hospitals in Panama City", "emoji": "🏥", "pointStyle": "icon"} ``` **GeoJSON Enhancement**: ```python geojson["properties"] = { "layer_id": "abc123", "layer_name": "Hospitals in Panama City", "style": { "color": "#E63946", "fillColor": "#E63946", "opacity": 0.8, "fillOpacity": 0.4 }, "pointMarker": { "icon": "🏥", "style": "icon", "color": "#E63946", "size": 32 }, "choropleth": {"enabled": false} } ``` **Auto-Detection**: - Detects geometry type (Point) - Checks for numeric columns (none meaningful) - Configures point marker style based on `pointStyle: "icon"` --- ### Step 9: Explanation Generation **Service**: `LLMGateway.stream_explanation()` **File**: `backend/core/llm_gateway.py` **Prompt to Gemini**: ``` Explain the results of this data query to the user. User Question: "Show me hospitals in Panama City" SQL Query: SELECT name, amenity, geom FROM ... WHERE amenity='hospital'... Data Result Summary: Found 45 features (Points) Instructions: 1. Keep response concise 2. Only describe ACTUAL data returned 3. Cite data source 4. Speak as GeoQuery ``` **Gemini Response** (streamed): ``` "I have located 45 hospitals within Panama City district. The map displays each hospital as a 🏥 icon. You can click on any marker to see details including the hospital name and operator. **Key facilities include**: - Hospital Santo Tomás (MINSA) - Hospital del Niño (CSS) - Hospital Punta Pacifica (Private) Source: Healthcare facility data from OpenStreetMap via Healthsites.io" ``` **Streaming to Frontend**: ```json { "event": "chunk", "data": {"type": "text", "content": "I have located 45 hospitals..."} } ``` **Performance**: ~1-2 seconds for explanation --- ### Step 10: Final Result Event **Event Sent**: ```json { "event": "result", "data": { "response": "I have located 45 hospitals within Panama City...", "sql_query": "SELECT name, amenity, geom FROM ...", "geojson": { /* GeoJSON with 45 features */ }, "chart_data": null, "raw_data": [ /* 45 rows of data */ ], "data_citations": [ "Healthcare facility data from OpenStreetMap via Healthsites.io" ] } } ``` --- ### Step 11: Frontend - Map Rendering **Component**: `MapViewer.tsx` **Process**: 1. Receive GeoJSON from result event 2. Create new MapLayer ```typescript const newLayer: MapLayer = { id: geojson.properties.layer_id, name: geojson.properties.layer_name, data: geojson, visible: true, style: geojson.properties.style, pointMarker: geojson.properties.pointMarker }; ``` 3. Render with Leaflet ```typescript { if (layer.pointMarker?.style === "icon") { return L.marker(latlng, { icon: L.divIcon({ html: `
${layer.pointMarker.icon}
` }) }); } }} /> ``` 4. Auto-fit bounds to show all hospitals 5. Display layer in legend panel **Result**: Interactive map with 45 hospital markers (🏥 icons) --- ## Performance Breakdown | Step | Service | Time | Async | |------|---------|------|-------| | 1. Frontend Submit | - | <10ms | - | | 2. API Routing | FastAPI | <5ms | - | | 3. Intent Detection | Gemini | ~500ms | ✓ | | 4. Semantic Search | SentenceTransformer | <10ms | ✓ | | 5. Schema Loading | DuckDB | 50-200ms | - | | 6. SQL Generation | Gemini | ~1s | ✓ Streamed | | 7. Query Execution | DuckDB | 100ms-2s | - | | 8. Formatting | Python | 10-50ms | - | | 9. Explanation | Gemini | ~1s | ✓ Streamed | | 10. Frontend Render | Leaflet | 50-200ms | - | **Total**: 2-5 seconds (perception: faster due to streaming) --- ## Error Handling Flow ### SQL Execution Failure ``` SQL Error → Extract Error Message → Send to LLM → Generate Corrected SQL → Retry Execution → If Still Fails → Return Error to User ``` **Example**: ```python try: result = execute_query(sql) except Exception as e: # Error: column "hospitals" does not exist corrected_sql = await llm.correct_sql(query, sql, str(e), schema) # LLM fixes: hospitals → panama_healthsites_geojson result = execute_query(corrected_sql) ``` ### Data Unavailable ``` LLM Realizes Data Missing → Returns Special Marker → System Detects Marker → Returns Helpful Error Message ``` **Example**: ```sql -- ERROR: DATA_UNAVAILABLE -- Requested: crime statistics -- Available: admin boundaries, hospitals, schools ``` --- ## Streaming Architecture **Benefits of SSE (Server-Sent Events)**: 1. **Progressive Disclosure**: User sees thinking process 2. **Faster Perceived Performance**: Content streams in 3. **Transparency**: Shows "why" behind answers 4. **Simple Protocol**: HTTP-based, works everywhere **Event Types**: - `status`: Processing updates ("🔍 Searching...", "⚡ Executing...") - `intent`: Detected intent category - `chunk`: Streamed content (thought or text) - `result`: Final payload with all data --- ## Complex Query Flow For queries requiring multiple steps (e.g., "Compare hospital density with school density by province"): 1. **Complexity Detection**: QueryPlanner identifies multi-dataset query 2. **Step Decomposition**: Break into atomic steps - Step 1: Count hospitals per province - Step 2: Count schools per province - Step 3: Calculate ratios 3. **Parallel Execution**: Execute independent steps concurrently 4. **Result Combination**: Merge results for final answer 5. **Unified Explanation**: LLM explains combined analysis See `backend/core/query_planner.py` for implementation. --- ## Next Steps - **Backend Services**: [backend/CORE_SERVICES.md](backend/CORE_SERVICES.md) - **API Reference**: [backend/API_ENDPOINTS.md](backend/API_ENDPOINTS.md) - **Frontend Components**: [frontend/COMPONENTS.md](frontend/COMPONENTS.md)