# LLM Integration & Prompt Engineering Deep dive into GeoQuery's LLM integration, prompt system, and AI capabilities. --- ## Overview GeoQuery uses **Google Gemini 2.0 Flash** for all AI capabilities: - Intent detection - Text-to-SQL generation - Natural language explanations - Layer naming and styling decisions **Key Feature**: Thinking mode for transparency into reasoning process. --- ## LLMGateway Service **File**: `backend/core/llm_gateway.py` Central interface to Gemini API with streaming support. ### Initialization ```python from google import genai from google.genai import types class LLMGateway: def __init__(self): api_key = os.getenv("GEMINI_API_KEY") self.client = genai.Client(api_key=api_key) self.model = "gemini-2.0-flash-exp" ``` ### Configuration **Model**: `gemini-2.0-flash-exp` - **Speed**: Fast responses (~1s for SQL) - **Quality**: High accuracy for structured output - **Thinking Mode**: Shows reasoning process - **JSON Output**: Structured responses **Parameters**: ```python config = types.GenerateContentConfig( temperature=1, # Creative but consistent response_mime_type="application/json", # For structured outputs thinking_config=types.ThinkingConfig( mode=types.ThinkingMode.THINKING # Enable reasoning ) ) ``` --- ## Prompt System All prompts centralized in `backend/core/prompts.py`. ### 1. System Instruction **Prompt**: `SYSTEM_INSTRUCTION` Sets overall context and capabilities: ``` You are GeoQuery, an advanced Territorial Intelligence Agent capable of analyzing diverse geographic datasets. ## Your Capabilities - Dynamic Metadata Catalog (not fixed schema) - Spatial Analysis with PostGIS/DuckDB functions - Visual outputs (maps, charts) ## Output Guidelines 1. Be Data-Driven: Base answers on SQL query results 2. Be Visual: Use choropleth maps, point maps, charts 3. Be Transparent: Explain reasoning, cite sources ... ``` **Purpose**: Establishes AI persona and core behavior --- ### 2. Intent Detection **Prompt**: `INTENT_DETECTION_PROMPT` Classifies user queries into categories. ``` Analyze this user query and determine the best output type. User Query: "{user_query}" THINK STEP BY STEP: 1. What is the user asking for? 2. Does this require geographic visualization? 3. Does this require a chart/graph? 4. Is this a general question? Respond with ONLY ONE of these exact words: - GENERAL_CHAT - DATA_QUERY - MAP_REQUEST - SPATIAL_OP - STAT_QUERY Key rules: - "color by", "compare regions" → MAP_REQUEST - "create a chart" → STAT_QUERY - Questions about data availability → GENERAL_CHAT ``` **Examples**: | Query | Thinking | Intent | |-------|----------|--------| | "Show me hospitals" | User wants to SEE on map | `MAP_REQUEST` | | "How many provinces?" | Numerical answer, no viz needed | `DATA_QUERY` | | "Create bar chart of districts" | Explicitly requests chart | `STAT_QUERY` | | "Subtract Chiriquí from Panama" | Geometric operation | `SPATIAL_OP` | | "What data do you have?" | General question | `GENERAL_CHAT` | --- ### 3. SQL Generation **Prompt**: `SQL_GENERATION_PROMPT` Converts natural language to DuckDB SQL. ``` You are a DuckDB SQL expert for geographic data analysis. {table_schema} ### CRITICAL - Data Availability: ✅ You may ONLY query the tables listed above. ❌ Do NOT invent table names or columns. If requested data is NOT available, return: -- ERROR: DATA_UNAVAILABLE -- Requested: [what user asked for] -- Available: [list tables you DO have] ### User Request: "{user_query}" ### Rules: 1. Return ONLY the SQL query 2. Use DuckDB syntax (ILIKE for case-insensitive) 3. ALWAYS include 'geom' for map visualization 4. For "top N", use ORDER BY ... DESC LIMIT N 5. Do NOT add LIMIT unless explicitly requested 6. NEVER invent columns that don't exist ### Special Dataset - Population: - Use `kontur_population` (H3 hexagons) - Columns: population, geom - Large dataset (33K hexagons) - use LIMIT 40000 ... Generate SQL: ``` **Key Features**: - **Error Prevention**: Explicit instructions to avoid hallucinating tables - **Spatial Functions**: Guides use of ST_Intersects, ST_Within, etc. - **Data Unavailable Handling**: Returns special marker instead of invalid SQL **Examples**: Input: "Show hospitals in David" ```sql SELECT name, amenity, geom FROM panama_healthsites_geojson WHERE amenity = 'hospital' AND ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'David')) ``` Input: "Population density in Veraguas" ```sql SELECT population, geom FROM kontur_population WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin1 WHERE adm1_name = 'Veraguas')) LIMIT 5000 ``` --- ### 4. Spatial SQL **Prompt**: `SPATIAL_SQL_PROMPT` For geometric operations (difference, intersection, buffer, etc.). ``` You are a GIS expert using DuckDB Spatial. Available Data: {layer_context} User Request: "{user_query}" Rules: 1. Return ONLY SQL query 2. Use DuckDB Spatial functions: - ST_Difference, ST_Intersection, ST_Union - ST_Buffer, ST_Within, ST_Contains 3. The geometry column is named 'geom' 4. Use EXACT table names shown above 5. IMPORTANT: For aggregate geometries (ST_Union), use CTE pattern: CORRECT: WITH layer_b_union AS (SELECT ST_Union(geom) as geom FROM layer_b) SELECT a.*, ST_Difference(a.geom, b.geom) as geom FROM layer_a a, layer_b_union b WRONG: SELECT ST_Difference(geom, (SELECT ST_Union(geom) FROM layer_b)) FROM layer_a ``` **Example**: Input: "Subtract protected areas from Chiriquí province" ```sql WITH protected_union AS ( SELECT ST_Union(geom) as geom FROM stri_protected_areas_2025 ) SELECT p.adm1_name, ST_Difference(p.geom, pa.geom) as geom FROM pan_admin1 p, protected_union pa WHERE p.adm1_name = 'Chiriquí' ``` --- ### 5. Layer Naming **Prompt**: `LAYER_NAME_PROMPT` Generates descriptive name, emoji, and point style for map layers. ``` User Request: "{user_query}" SQL Query: "{sql_query}" Rules: 1. Return JSON with: name, emoji, pointStyle 2. "name": Short descriptive (1-4 words) 3. "emoji": Single emoji for data content 4. "pointStyle": How to render points - "icon": Small/medium POI (<500 points) - "circle": Large point datasets (>500 points) - null: Polygon data (use choropleth) Examples: {"name": "Hospitals in David", "emoji": "🏥", "pointStyle": "icon"} {"name": "Population Density", "emoji": "👥", "pointStyle": null} {"name": "Traffic Intersections", "emoji": "🚦", "pointStyle": "circle"} ``` **Decision Logic**: - Hospitals, schools, parks → icon - Intersections, sensors (large datasets) → circle - H3 hexagons, admin boundaries → null (polygon rendering) --- ### 6. Explanation **Prompt**: `EXPLANATION_PROMPT` Generates natural language explanation of results. ``` Explain the results of this data query to the user. User Question: "{user_query}" SQL Query: {sql_query} Data Result Summary: {data_summary} Instructions: 1. Keep response concise 2. Only describe ACTUAL data returned 3. Cite data source 4. Speak as GeoQuery Example citation: "Source: Administrative boundary data from HDX/INEC, 2021" ``` **Features**: - **Factual**: Only describes what was actually found - **Contextual**: Relates results to user's question - **Transparent**: Cites data sources --- ### 7. SQL Correction **Prompt**: `SQL_CORRECTION_PROMPT` Repairs failed SQL queries. ``` Your previous query failed. Fix it. ### Error Message: {error_message} ### Failed SQL: {incorrect_sql} ### User Request: "{user_query}" ### Database Schema: {schema_context} Rules: 1. Fix the error described in the message 2. Return ONLY the valid SQL query 3. Keep query logic consistent with User Request ``` **Common Fixes**: - Column ambiguity → Add table aliases - Missing column → Use correct column name - Syntax error → Fix DuckDB syntax --- ## Streaming Implementation ### Thinking + Content Streaming ```python async def stream_sql_generation(self, query: str, schema: str): config = types.GenerateContentConfig( thinking_config=types.ThinkingConfig( mode=types.ThinkingMode.THINKING ) ) response = await asyncio.to_thread( self.client.models.generate_content_stream, model=self.model, contents=query_prompt, config=config ) async for chunk in response: if hasattr(chunk, 'thought'): yield {"type": "thought", "text": chunk.thought.text} if hasattr(chunk, 'text'): yield {"type": "content", "text": chunk.text} ``` **Frontend receives**: ```json {"type": "thought", "text": "I need to find hospitals in the David district..."} {"type": "content", "text": "SELECT name, geom FROM ..."} ``` --- ## Error Handling ### 1. Data Unavailable ```sql -- ERROR: DATA_UNAVAILABLE -- Requested: crime statistics -- Available: hospitals, schools, admin boundaries ``` → System detects marker and returns helpful error ### 2. SQL Execution Error ``` Error: column "hospitals" does not exist ``` → Send to `correct_sql()` → LLM fixes → Retry ### 3. Rate Limiting ```python try: response = await self.client.models.generate_content(...) except Exception as e: if "rate limit" in str(e).lower(): await asyncio.sleep(1) # Retry ``` --- ## Performance Optimizations ### Caching **Not currently implemented**, but recommended: ```python from functools import lru_cache @lru_cache(maxsize=100) async def cached_sql_generation(query_hash: str): ... ``` ### Token Management - **Minimize Context**: Only send relevant table schemas - **Semantic Search**: Pre-filter to top 15 tables - **Batch Requests**: Combine multiple LLM calls where possible --- ## Prompt Engineering Best Practices ### 1. Be Explicit ❌ "Generate SQL for this query" ✅ "Generate DuckDB SQL with spatial functions. Include 'geom' column. Use ILIKE for text matching." ### 2. Provide Examples ``` Example: Input: "hospitals in Panama" Output: SELECT name, geom FROM panama_healthsites_geojson WHERE amenity='hospital' ``` ### 3. Use Constraints ``` Rules: - Return ONLY SQL (no markdown, no explanation) - Use EXACT table names from schema - DO NOT invent columns ``` ### 4. Handle Edge Cases ``` If data not available, return: -- ERROR: DATA_UNAVAILABLE ``` ### 5. Structure Output ``` Return valid JSON: {"name": "...", "emoji": "..."} ``` --- ## Testing ### Manual Testing ```python llm = LLMGateway() # Test intent detection intent = await llm.detect_intent("Show me hospitals", []) print(intent) # Should be "MAP_REQUEST" # Test SQL generation sql = await llm.generate_analytical_sql("hospitals in David", schema, []) print(sql) # Should be valid SELECT query ``` ### Prompt Iteration 1. **Test with real queries** 2. **Analyze failures** 3. **Update prompt** 4. **Re-test** --- ## Next Steps - **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md) - **Data Flow**: [../DATA_FLOW.md](../DATA_FLOW.md) - **API Reference**: [API_ENDPOINTS.md](API_ENDPOINTS.md)