LLM Integration & Prompt Engineering
Deep dive into GeoQuery's LLM integration, prompt system, and AI capabilities.
Overview
GeoQuery uses Google Gemini 2.0 Flash for all AI capabilities:
- Intent detection
- Text-to-SQL generation
- Natural language explanations
- Layer naming and styling decisions
Key Feature: Thinking mode for transparency into reasoning process.
LLMGateway Service
File: backend/core/llm_gateway.py
Central interface to Gemini API with streaming support.
Initialization
from google import genai
from google.genai import types
class LLMGateway:
def __init__(self):
api_key = os.getenv("GEMINI_API_KEY")
self.client = genai.Client(api_key=api_key)
self.model = "gemini-2.0-flash-exp"
Configuration
Model: gemini-2.0-flash-exp
- Speed: Fast responses (~1s for SQL)
- Quality: High accuracy for structured output
- Thinking Mode: Shows reasoning process
- JSON Output: Structured responses
Parameters:
config = types.GenerateContentConfig(
temperature=1, # Creative but consistent
response_mime_type="application/json", # For structured outputs
thinking_config=types.ThinkingConfig(
mode=types.ThinkingMode.THINKING # Enable reasoning
)
)
Prompt System
All prompts centralized in backend/core/prompts.py.
1. System Instruction
Prompt: SYSTEM_INSTRUCTION
Sets overall context and capabilities:
You are GeoQuery, an advanced Territorial Intelligence Agent capable of
analyzing diverse geographic datasets.
## Your Capabilities
- Dynamic Metadata Catalog (not fixed schema)
- Spatial Analysis with PostGIS/DuckDB functions
- Visual outputs (maps, charts)
## Output Guidelines
1. Be Data-Driven: Base answers on SQL query results
2. Be Visual: Use choropleth maps, point maps, charts
3. Be Transparent: Explain reasoning, cite sources
...
Purpose: Establishes AI persona and core behavior
2. Intent Detection
Prompt: INTENT_DETECTION_PROMPT
Classifies user queries into categories.
Analyze this user query and determine the best output type.
User Query: "{user_query}"
THINK STEP BY STEP:
1. What is the user asking for?
2. Does this require geographic visualization?
3. Does this require a chart/graph?
4. Is this a general question?
Respond with ONLY ONE of these exact words:
- GENERAL_CHAT
- DATA_QUERY
- MAP_REQUEST
- SPATIAL_OP
- STAT_QUERY
Key rules:
- "color by", "compare regions" → MAP_REQUEST
- "create a chart" → STAT_QUERY
- Questions about data availability → GENERAL_CHAT
Examples:
| Query | Thinking | Intent |
|---|---|---|
| "Show me hospitals" | User wants to SEE on map | MAP_REQUEST |
| "How many provinces?" | Numerical answer, no viz needed | DATA_QUERY |
| "Create bar chart of districts" | Explicitly requests chart | STAT_QUERY |
| "Subtract Chiriquí from Panama" | Geometric operation | SPATIAL_OP |
| "What data do you have?" | General question | GENERAL_CHAT |
3. SQL Generation
Prompt: SQL_GENERATION_PROMPT
Converts natural language to DuckDB SQL.
You are a DuckDB SQL expert for geographic data analysis.
{table_schema}
### CRITICAL - Data Availability:
✅ You may ONLY query the tables listed above.
❌ Do NOT invent table names or columns.
If requested data is NOT available, return:
-- ERROR: DATA_UNAVAILABLE
-- Requested: [what user asked for]
-- Available: [list tables you DO have]
### User Request: "{user_query}"
### Rules:
1. Return ONLY the SQL query
2. Use DuckDB syntax (ILIKE for case-insensitive)
3. ALWAYS include 'geom' for map visualization
4. For "top N", use ORDER BY ... DESC LIMIT N
5. Do NOT add LIMIT unless explicitly requested
6. NEVER invent columns that don't exist
### Special Dataset - Population:
- Use `kontur_population` (H3 hexagons)
- Columns: population, geom
- Large dataset (33K hexagons) - use LIMIT 40000
...
Generate SQL:
Key Features:
- Error Prevention: Explicit instructions to avoid hallucinating tables
- Spatial Functions: Guides use of ST_Intersects, ST_Within, etc.
- Data Unavailable Handling: Returns special marker instead of invalid SQL
Examples:
Input: "Show hospitals in David"
SELECT name, amenity, geom
FROM panama_healthsites_geojson
WHERE amenity = 'hospital'
AND ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'David'))
Input: "Population density in Veraguas"
SELECT population, geom
FROM kontur_population
WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin1 WHERE adm1_name = 'Veraguas'))
LIMIT 5000
4. Spatial SQL
Prompt: SPATIAL_SQL_PROMPT
For geometric operations (difference, intersection, buffer, etc.).
You are a GIS expert using DuckDB Spatial.
Available Data:
{layer_context}
User Request: "{user_query}"
Rules:
1. Return ONLY SQL query
2. Use DuckDB Spatial functions:
- ST_Difference, ST_Intersection, ST_Union
- ST_Buffer, ST_Within, ST_Contains
3. The geometry column is named 'geom'
4. Use EXACT table names shown above
5. IMPORTANT: For aggregate geometries (ST_Union), use CTE pattern:
CORRECT:
WITH layer_b_union AS (SELECT ST_Union(geom) as geom FROM layer_b)
SELECT a.*, ST_Difference(a.geom, b.geom) as geom
FROM layer_a a, layer_b_union b
WRONG:
SELECT ST_Difference(geom, (SELECT ST_Union(geom) FROM layer_b))
FROM layer_a
Example:
Input: "Subtract protected areas from Chiriquí province"
WITH protected_union AS (
SELECT ST_Union(geom) as geom FROM stri_protected_areas_2025
)
SELECT
p.adm1_name,
ST_Difference(p.geom, pa.geom) as geom
FROM pan_admin1 p, protected_union pa
WHERE p.adm1_name = 'Chiriquí'
5. Layer Naming
Prompt: LAYER_NAME_PROMPT
Generates descriptive name, emoji, and point style for map layers.
User Request: "{user_query}"
SQL Query: "{sql_query}"
Rules:
1. Return JSON with: name, emoji, pointStyle
2. "name": Short descriptive (1-4 words)
3. "emoji": Single emoji for data content
4. "pointStyle": How to render points
- "icon": Small/medium POI (<500 points)
- "circle": Large point datasets (>500 points)
- null: Polygon data (use choropleth)
Examples:
{"name": "Hospitals in David", "emoji": "🏥", "pointStyle": "icon"}
{"name": "Population Density", "emoji": "👥", "pointStyle": null}
{"name": "Traffic Intersections", "emoji": "🚦", "pointStyle": "circle"}
Decision Logic:
- Hospitals, schools, parks → icon
- Intersections, sensors (large datasets) → circle
- H3 hexagons, admin boundaries → null (polygon rendering)
6. Explanation
Prompt: EXPLANATION_PROMPT
Generates natural language explanation of results.
Explain the results of this data query to the user.
User Question: "{user_query}"
SQL Query: {sql_query}
Data Result Summary: {data_summary}
Instructions:
1. Keep response concise
2. Only describe ACTUAL data returned
3. Cite data source
4. Speak as GeoQuery
Example citation:
"Source: Administrative boundary data from HDX/INEC, 2021"
Features:
- Factual: Only describes what was actually found
- Contextual: Relates results to user's question
- Transparent: Cites data sources
7. SQL Correction
Prompt: SQL_CORRECTION_PROMPT
Repairs failed SQL queries.
Your previous query failed. Fix it.
### Error Message:
{error_message}
### Failed SQL:
{incorrect_sql}
### User Request:
"{user_query}"
### Database Schema:
{schema_context}
Rules:
1. Fix the error described in the message
2. Return ONLY the valid SQL query
3. Keep query logic consistent with User Request
Common Fixes:
- Column ambiguity → Add table aliases
- Missing column → Use correct column name
- Syntax error → Fix DuckDB syntax
Streaming Implementation
Thinking + Content Streaming
async def stream_sql_generation(self, query: str, schema: str):
config = types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(
mode=types.ThinkingMode.THINKING
)
)
response = await asyncio.to_thread(
self.client.models.generate_content_stream,
model=self.model,
contents=query_prompt,
config=config
)
async for chunk in response:
if hasattr(chunk, 'thought'):
yield {"type": "thought", "text": chunk.thought.text}
if hasattr(chunk, 'text'):
yield {"type": "content", "text": chunk.text}
Frontend receives:
{"type": "thought", "text": "I need to find hospitals in the David district..."}
{"type": "content", "text": "SELECT name, geom FROM ..."}
Error Handling
1. Data Unavailable
-- ERROR: DATA_UNAVAILABLE
-- Requested: crime statistics
-- Available: hospitals, schools, admin boundaries
→ System detects marker and returns helpful error
2. SQL Execution Error
Error: column "hospitals" does not exist
→ Send to correct_sql() → LLM fixes → Retry
3. Rate Limiting
try:
response = await self.client.models.generate_content(...)
except Exception as e:
if "rate limit" in str(e).lower():
await asyncio.sleep(1)
# Retry
Performance Optimizations
Caching
Not currently implemented, but recommended:
from functools import lru_cache
@lru_cache(maxsize=100)
async def cached_sql_generation(query_hash: str):
...
Token Management
- Minimize Context: Only send relevant table schemas
- Semantic Search: Pre-filter to top 15 tables
- Batch Requests: Combine multiple LLM calls where possible
Prompt Engineering Best Practices
1. Be Explicit
❌ "Generate SQL for this query" ✅ "Generate DuckDB SQL with spatial functions. Include 'geom' column. Use ILIKE for text matching."
2. Provide Examples
Example:
Input: "hospitals in Panama"
Output: SELECT name, geom FROM panama_healthsites_geojson WHERE amenity='hospital'
3. Use Constraints
Rules:
- Return ONLY SQL (no markdown, no explanation)
- Use EXACT table names from schema
- DO NOT invent columns
4. Handle Edge Cases
If data not available, return:
-- ERROR: DATA_UNAVAILABLE
5. Structure Output
Return valid JSON:
{"name": "...", "emoji": "..."}
Testing
Manual Testing
llm = LLMGateway()
# Test intent detection
intent = await llm.detect_intent("Show me hospitals", [])
print(intent) # Should be "MAP_REQUEST"
# Test SQL generation
sql = await llm.generate_analytical_sql("hospitals in David", schema, [])
print(sql) # Should be valid SELECT query
Prompt Iteration
- Test with real queries
- Analyze failures
- Update prompt
- Re-test
Next Steps
- Core Services: CORE_SERVICES.md
- Data Flow: ../DATA_FLOW.md
- API Reference: API_ENDPOINTS.md