| # LLM Integration & Prompt Engineering | |
| Deep dive into GeoQuery's LLM integration, prompt system, and AI capabilities. | |
| --- | |
| ## Overview | |
| GeoQuery uses **Google Gemini 2.0 Flash** for all AI capabilities: | |
| - Intent detection | |
| - Text-to-SQL generation | |
| - Natural language explanations | |
| - Layer naming and styling decisions | |
| **Key Feature**: Thinking mode for transparency into reasoning process. | |
| --- | |
| ## LLMGateway Service | |
| **File**: `backend/core/llm_gateway.py` | |
| Central interface to Gemini API with streaming support. | |
| ### Initialization | |
| ```python | |
| from google import genai | |
| from google.genai import types | |
| class LLMGateway: | |
| def __init__(self): | |
| api_key = os.getenv("GEMINI_API_KEY") | |
| self.client = genai.Client(api_key=api_key) | |
| self.model = "gemini-2.0-flash-exp" | |
| ``` | |
| ### Configuration | |
| **Model**: `gemini-2.0-flash-exp` | |
| - **Speed**: Fast responses (~1s for SQL) | |
| - **Quality**: High accuracy for structured output | |
| - **Thinking Mode**: Shows reasoning process | |
| - **JSON Output**: Structured responses | |
| **Parameters**: | |
| ```python | |
| config = types.GenerateContentConfig( | |
| temperature=1, # Creative but consistent | |
| response_mime_type="application/json", # For structured outputs | |
| thinking_config=types.ThinkingConfig( | |
| mode=types.ThinkingMode.THINKING # Enable reasoning | |
| ) | |
| ) | |
| ``` | |
| --- | |
| ## Prompt System | |
| All prompts centralized in `backend/core/prompts.py`. | |
| ### 1. System Instruction | |
| **Prompt**: `SYSTEM_INSTRUCTION` | |
| Sets overall context and capabilities: | |
| ``` | |
| You are GeoQuery, an advanced Territorial Intelligence Agent capable of | |
| analyzing diverse geographic datasets. | |
| ## Your Capabilities | |
| - Dynamic Metadata Catalog (not fixed schema) | |
| - Spatial Analysis with PostGIS/DuckDB functions | |
| - Visual outputs (maps, charts) | |
| ## Output Guidelines | |
| 1. Be Data-Driven: Base answers on SQL query results | |
| 2. Be Visual: Use choropleth maps, point maps, charts | |
| 3. Be Transparent: Explain reasoning, cite sources | |
| ... | |
| ``` | |
| **Purpose**: Establishes AI persona and core behavior | |
| --- | |
| ### 2. Intent Detection | |
| **Prompt**: `INTENT_DETECTION_PROMPT` | |
| Classifies user queries into categories. | |
| ``` | |
| Analyze this user query and determine the best output type. | |
| User Query: "{user_query}" | |
| THINK STEP BY STEP: | |
| 1. What is the user asking for? | |
| 2. Does this require geographic visualization? | |
| 3. Does this require a chart/graph? | |
| 4. Is this a general question? | |
| Respond with ONLY ONE of these exact words: | |
| - GENERAL_CHAT | |
| - DATA_QUERY | |
| - MAP_REQUEST | |
| - SPATIAL_OP | |
| - STAT_QUERY | |
| Key rules: | |
| - "color by", "compare regions" → MAP_REQUEST | |
| - "create a chart" → STAT_QUERY | |
| - Questions about data availability → GENERAL_CHAT | |
| ``` | |
| **Examples**: | |
| | Query | Thinking | Intent | | |
| |-------|----------|--------| | |
| | "Show me hospitals" | User wants to SEE on map | `MAP_REQUEST` | | |
| | "How many provinces?" | Numerical answer, no viz needed | `DATA_QUERY` | | |
| | "Create bar chart of districts" | Explicitly requests chart | `STAT_QUERY` | | |
| | "Subtract Chiriquí from Panama" | Geometric operation | `SPATIAL_OP` | | |
| | "What data do you have?" | General question | `GENERAL_CHAT` | | |
| --- | |
| ### 3. SQL Generation | |
| **Prompt**: `SQL_GENERATION_PROMPT` | |
| Converts natural language to DuckDB SQL. | |
| ``` | |
| You are a DuckDB SQL expert for geographic data analysis. | |
| {table_schema} | |
| ### CRITICAL - Data Availability: | |
| ✅ You may ONLY query the tables listed above. | |
| ❌ Do NOT invent table names or columns. | |
| If requested data is NOT available, return: | |
| -- ERROR: DATA_UNAVAILABLE | |
| -- Requested: [what user asked for] | |
| -- Available: [list tables you DO have] | |
| ### User Request: "{user_query}" | |
| ### Rules: | |
| 1. Return ONLY the SQL query | |
| 2. Use DuckDB syntax (ILIKE for case-insensitive) | |
| 3. ALWAYS include 'geom' for map visualization | |
| 4. For "top N", use ORDER BY ... DESC LIMIT N | |
| 5. Do NOT add LIMIT unless explicitly requested | |
| 6. NEVER invent columns that don't exist | |
| ### Special Dataset - Population: | |
| - Use `kontur_population` (H3 hexagons) | |
| - Columns: population, geom | |
| - Large dataset (33K hexagons) - use LIMIT 40000 | |
| ... | |
| Generate SQL: | |
| ``` | |
| **Key Features**: | |
| - **Error Prevention**: Explicit instructions to avoid hallucinating tables | |
| - **Spatial Functions**: Guides use of ST_Intersects, ST_Within, etc. | |
| - **Data Unavailable Handling**: Returns special marker instead of invalid SQL | |
| **Examples**: | |
| Input: "Show hospitals in David" | |
| ```sql | |
| SELECT name, amenity, geom | |
| FROM panama_healthsites_geojson | |
| WHERE amenity = 'hospital' | |
| AND ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'David')) | |
| ``` | |
| Input: "Population density in Veraguas" | |
| ```sql | |
| SELECT population, geom | |
| FROM kontur_population | |
| WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin1 WHERE adm1_name = 'Veraguas')) | |
| LIMIT 5000 | |
| ``` | |
| --- | |
| ### 4. Spatial SQL | |
| **Prompt**: `SPATIAL_SQL_PROMPT` | |
| For geometric operations (difference, intersection, buffer, etc.). | |
| ``` | |
| You are a GIS expert using DuckDB Spatial. | |
| Available Data: | |
| {layer_context} | |
| User Request: "{user_query}" | |
| Rules: | |
| 1. Return ONLY SQL query | |
| 2. Use DuckDB Spatial functions: | |
| - ST_Difference, ST_Intersection, ST_Union | |
| - ST_Buffer, ST_Within, ST_Contains | |
| 3. The geometry column is named 'geom' | |
| 4. Use EXACT table names shown above | |
| 5. IMPORTANT: For aggregate geometries (ST_Union), use CTE pattern: | |
| CORRECT: | |
| WITH layer_b_union AS (SELECT ST_Union(geom) as geom FROM layer_b) | |
| SELECT a.*, ST_Difference(a.geom, b.geom) as geom | |
| FROM layer_a a, layer_b_union b | |
| WRONG: | |
| SELECT ST_Difference(geom, (SELECT ST_Union(geom) FROM layer_b)) | |
| FROM layer_a | |
| ``` | |
| **Example**: | |
| Input: "Subtract protected areas from Chiriquí province" | |
| ```sql | |
| WITH protected_union AS ( | |
| SELECT ST_Union(geom) as geom FROM stri_protected_areas_2025 | |
| ) | |
| SELECT | |
| p.adm1_name, | |
| ST_Difference(p.geom, pa.geom) as geom | |
| FROM pan_admin1 p, protected_union pa | |
| WHERE p.adm1_name = 'Chiriquí' | |
| ``` | |
| --- | |
| ### 5. Layer Naming | |
| **Prompt**: `LAYER_NAME_PROMPT` | |
| Generates descriptive name, emoji, and point style for map layers. | |
| ``` | |
| User Request: "{user_query}" | |
| SQL Query: "{sql_query}" | |
| Rules: | |
| 1. Return JSON with: name, emoji, pointStyle | |
| 2. "name": Short descriptive (1-4 words) | |
| 3. "emoji": Single emoji for data content | |
| 4. "pointStyle": How to render points | |
| - "icon": Small/medium POI (<500 points) | |
| - "circle": Large point datasets (>500 points) | |
| - null: Polygon data (use choropleth) | |
| Examples: | |
| {"name": "Hospitals in David", "emoji": "🏥", "pointStyle": "icon"} | |
| {"name": "Population Density", "emoji": "👥", "pointStyle": null} | |
| {"name": "Traffic Intersections", "emoji": "🚦", "pointStyle": "circle"} | |
| ``` | |
| **Decision Logic**: | |
| - Hospitals, schools, parks → icon | |
| - Intersections, sensors (large datasets) → circle | |
| - H3 hexagons, admin boundaries → null (polygon rendering) | |
| --- | |
| ### 6. Explanation | |
| **Prompt**: `EXPLANATION_PROMPT` | |
| Generates natural language explanation of results. | |
| ``` | |
| Explain the results of this data query to the user. | |
| User Question: "{user_query}" | |
| SQL Query: {sql_query} | |
| Data Result Summary: {data_summary} | |
| Instructions: | |
| 1. Keep response concise | |
| 2. Only describe ACTUAL data returned | |
| 3. Cite data source | |
| 4. Speak as GeoQuery | |
| Example citation: | |
| "Source: Administrative boundary data from HDX/INEC, 2021" | |
| ``` | |
| **Features**: | |
| - **Factual**: Only describes what was actually found | |
| - **Contextual**: Relates results to user's question | |
| - **Transparent**: Cites data sources | |
| --- | |
| ### 7. SQL Correction | |
| **Prompt**: `SQL_CORRECTION_PROMPT` | |
| Repairs failed SQL queries. | |
| ``` | |
| Your previous query failed. Fix it. | |
| ### Error Message: | |
| {error_message} | |
| ### Failed SQL: | |
| {incorrect_sql} | |
| ### User Request: | |
| "{user_query}" | |
| ### Database Schema: | |
| {schema_context} | |
| Rules: | |
| 1. Fix the error described in the message | |
| 2. Return ONLY the valid SQL query | |
| 3. Keep query logic consistent with User Request | |
| ``` | |
| **Common Fixes**: | |
| - Column ambiguity → Add table aliases | |
| - Missing column → Use correct column name | |
| - Syntax error → Fix DuckDB syntax | |
| --- | |
| ## Streaming Implementation | |
| ### Thinking + Content Streaming | |
| ```python | |
| async def stream_sql_generation(self, query: str, schema: str): | |
| config = types.GenerateContentConfig( | |
| thinking_config=types.ThinkingConfig( | |
| mode=types.ThinkingMode.THINKING | |
| ) | |
| ) | |
| response = await asyncio.to_thread( | |
| self.client.models.generate_content_stream, | |
| model=self.model, | |
| contents=query_prompt, | |
| config=config | |
| ) | |
| async for chunk in response: | |
| if hasattr(chunk, 'thought'): | |
| yield {"type": "thought", "text": chunk.thought.text} | |
| if hasattr(chunk, 'text'): | |
| yield {"type": "content", "text": chunk.text} | |
| ``` | |
| **Frontend receives**: | |
| ```json | |
| {"type": "thought", "text": "I need to find hospitals in the David district..."} | |
| {"type": "content", "text": "SELECT name, geom FROM ..."} | |
| ``` | |
| --- | |
| ## Error Handling | |
| ### 1. Data Unavailable | |
| ```sql | |
| -- ERROR: DATA_UNAVAILABLE | |
| -- Requested: crime statistics | |
| -- Available: hospitals, schools, admin boundaries | |
| ``` | |
| → System detects marker and returns helpful error | |
| ### 2. SQL Execution Error | |
| ``` | |
| Error: column "hospitals" does not exist | |
| ``` | |
| → Send to `correct_sql()` → LLM fixes → Retry | |
| ### 3. Rate Limiting | |
| ```python | |
| try: | |
| response = await self.client.models.generate_content(...) | |
| except Exception as e: | |
| if "rate limit" in str(e).lower(): | |
| await asyncio.sleep(1) | |
| # Retry | |
| ``` | |
| --- | |
| ## Performance Optimizations | |
| ### Caching | |
| **Not currently implemented**, but recommended: | |
| ```python | |
| from functools import lru_cache | |
| @lru_cache(maxsize=100) | |
| async def cached_sql_generation(query_hash: str): | |
| ... | |
| ``` | |
| ### Token Management | |
| - **Minimize Context**: Only send relevant table schemas | |
| - **Semantic Search**: Pre-filter to top 15 tables | |
| - **Batch Requests**: Combine multiple LLM calls where possible | |
| --- | |
| ## Prompt Engineering Best Practices | |
| ### 1. Be Explicit | |
| ❌ "Generate SQL for this query" | |
| ✅ "Generate DuckDB SQL with spatial functions. Include 'geom' column. Use ILIKE for text matching." | |
| ### 2. Provide Examples | |
| ``` | |
| Example: | |
| Input: "hospitals in Panama" | |
| Output: SELECT name, geom FROM panama_healthsites_geojson WHERE amenity='hospital' | |
| ``` | |
| ### 3. Use Constraints | |
| ``` | |
| Rules: | |
| - Return ONLY SQL (no markdown, no explanation) | |
| - Use EXACT table names from schema | |
| - DO NOT invent columns | |
| ``` | |
| ### 4. Handle Edge Cases | |
| ``` | |
| If data not available, return: | |
| -- ERROR: DATA_UNAVAILABLE | |
| ``` | |
| ### 5. Structure Output | |
| ``` | |
| Return valid JSON: | |
| {"name": "...", "emoji": "..."} | |
| ``` | |
| --- | |
| ## Testing | |
| ### Manual Testing | |
| ```python | |
| llm = LLMGateway() | |
| # Test intent detection | |
| intent = await llm.detect_intent("Show me hospitals", []) | |
| print(intent) # Should be "MAP_REQUEST" | |
| # Test SQL generation | |
| sql = await llm.generate_analytical_sql("hospitals in David", schema, []) | |
| print(sql) # Should be valid SELECT query | |
| ``` | |
| ### Prompt Iteration | |
| 1. **Test with real queries** | |
| 2. **Analyze failures** | |
| 3. **Update prompt** | |
| 4. **Re-test** | |
| --- | |
| ## Next Steps | |
| - **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md) | |
| - **Data Flow**: [../DATA_FLOW.md](../DATA_FLOW.md) | |
| - **API Reference**: [API_ENDPOINTS.md](API_ENDPOINTS.md) | |