Spaces:

GerardCB
/

GeoQuery

Running

App Files Files Community

GeoQuery / docs /backend /LLM_INTEGRATION.md

GerardCB

Deploy to Spaces (Final Clean)

4851501 2 days ago

preview code

raw

history blame contribute delete

11.1 kB

	# LLM Integration & Prompt Engineering

	Deep dive into GeoQuery's LLM integration, prompt system, and AI capabilities.

	---

	## Overview

	GeoQuery uses Google Gemini 2.0 Flash for all AI capabilities:
	- Intent detection
	- Text-to-SQL generation
	- Natural language explanations
	- Layer naming and styling decisions

	Key Feature: Thinking mode for transparency into reasoning process.

	---

	## LLMGateway Service

	File: `backend/core/llm_gateway.py`

	Central interface to Gemini API with streaming support.

	### Initialization

	```python
	from google import genai
	from google.genai import types

	class LLMGateway:
	def __init__(self):
	api_key = os.getenv("GEMINI_API_KEY")
	self.client = genai.Client(api_key=api_key)
	self.model = "gemini-2.0-flash-exp"
	```

	### Configuration

	Model: `gemini-2.0-flash-exp`
	- Speed: Fast responses (~1s for SQL)
	- Quality: High accuracy for structured output
	- Thinking Mode: Shows reasoning process
	- JSON Output: Structured responses

	Parameters:
	```python
	config = types.GenerateContentConfig(
	temperature=1, # Creative but consistent
	response_mime_type="application/json", # For structured outputs
	thinking_config=types.ThinkingConfig(
	mode=types.ThinkingMode.THINKING # Enable reasoning
	)
	)
	```

	---

	## Prompt System

	All prompts centralized in `backend/core/prompts.py`.

	### 1. System Instruction

	Prompt: `SYSTEM_INSTRUCTION`

	Sets overall context and capabilities:

	```
	You are GeoQuery, an advanced Territorial Intelligence Agent capable of
	analyzing diverse geographic datasets.

	## Your Capabilities
	- Dynamic Metadata Catalog (not fixed schema)
	- Spatial Analysis with PostGIS/DuckDB functions
	- Visual outputs (maps, charts)

	## Output Guidelines
	1. Be Data-Driven: Base answers on SQL query results
	2. Be Visual: Use choropleth maps, point maps, charts
	3. Be Transparent: Explain reasoning, cite sources
	...
	```

	Purpose: Establishes AI persona and core behavior

	---

	### 2. Intent Detection

	Prompt: `INTENT_DETECTION_PROMPT`

	Classifies user queries into categories.

	```
	Analyze this user query and determine the best output type.

	User Query: "{user_query}"

	THINK STEP BY STEP:
	1. What is the user asking for?
	2. Does this require geographic visualization?
	3. Does this require a chart/graph?
	4. Is this a general question?

	Respond with ONLY ONE of these exact words:
	- GENERAL_CHAT
	- DATA_QUERY
	- MAP_REQUEST
	- SPATIAL_OP
	- STAT_QUERY

	Key rules:
	- "color by", "compare regions" → MAP_REQUEST
	- "create a chart" → STAT_QUERY
	- Questions about data availability → GENERAL_CHAT
	```

	Examples:

	\| Query \| Thinking \| Intent \|
	\|-------\|----------\|--------\|
	\| "Show me hospitals" \| User wants to SEE on map \| `MAP_REQUEST` \|
	\| "How many provinces?" \| Numerical answer, no viz needed \| `DATA_QUERY` \|
	\| "Create bar chart of districts" \| Explicitly requests chart \| `STAT_QUERY` \|
	\| "Subtract Chiriquí from Panama" \| Geometric operation \| `SPATIAL_OP` \|
	\| "What data do you have?" \| General question \| `GENERAL_CHAT` \|

	---

	### 3. SQL Generation

	Prompt: `SQL_GENERATION_PROMPT`

	Converts natural language to DuckDB SQL.

	```
	You are a DuckDB SQL expert for geographic data analysis.

	{table_schema}

	### CRITICAL - Data Availability:
	✅ You may ONLY query the tables listed above.
	❌ Do NOT invent table names or columns.

	If requested data is NOT available, return:
	-- ERROR: DATA_UNAVAILABLE
	-- Requested: [what user asked for]
	-- Available: [list tables you DO have]

	### User Request: "{user_query}"

	### Rules:
	1. Return ONLY the SQL query
	2. Use DuckDB syntax (ILIKE for case-insensitive)
	3. ALWAYS include 'geom' for map visualization
	4. For "top N", use ORDER BY ... DESC LIMIT N
	5. Do NOT add LIMIT unless explicitly requested
	6. NEVER invent columns that don't exist

	### Special Dataset - Population:
	- Use `kontur_population` (H3 hexagons)
	- Columns: population, geom
	- Large dataset (33K hexagons) - use LIMIT 40000
	...

	Generate SQL:
	```

	Key Features:
	- Error Prevention: Explicit instructions to avoid hallucinating tables
	- Spatial Functions: Guides use of ST_Intersects, ST_Within, etc.
	- Data Unavailable Handling: Returns special marker instead of invalid SQL

	Examples:

	Input: "Show hospitals in David"
	```sql
	SELECT name, amenity, geom
	FROM panama_healthsites_geojson
	WHERE amenity = 'hospital'
	AND ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'David'))
	```

	Input: "Population density in Veraguas"
	```sql
	SELECT population, geom
	FROM kontur_population
	WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin1 WHERE adm1_name = 'Veraguas'))
	LIMIT 5000
	```

	---

	### 4. Spatial SQL

	Prompt: `SPATIAL_SQL_PROMPT`

	For geometric operations (difference, intersection, buffer, etc.).

	```
	You are a GIS expert using DuckDB Spatial.

	Available Data:
	{layer_context}

	User Request: "{user_query}"

	Rules:
	1. Return ONLY SQL query
	2. Use DuckDB Spatial functions:
	- ST_Difference, ST_Intersection, ST_Union
	- ST_Buffer, ST_Within, ST_Contains
	3. The geometry column is named 'geom'
	4. Use EXACT table names shown above
	5. IMPORTANT: For aggregate geometries (ST_Union), use CTE pattern:

	CORRECT:
	WITH layer_b_union AS (SELECT ST_Union(geom) as geom FROM layer_b)
	SELECT a.*, ST_Difference(a.geom, b.geom) as geom
	FROM layer_a a, layer_b_union b

	WRONG:
	SELECT ST_Difference(geom, (SELECT ST_Union(geom) FROM layer_b))
	FROM layer_a
	```

	Example:

	Input: "Subtract protected areas from Chiriquí province"
	```sql
	WITH protected_union AS (
	SELECT ST_Union(geom) as geom FROM stri_protected_areas_2025
	)
	SELECT
	p.adm1_name,
	ST_Difference(p.geom, pa.geom) as geom
	FROM pan_admin1 p, protected_union pa
	WHERE p.adm1_name = 'Chiriquí'
	```

	---

	### 5. Layer Naming

	Prompt: `LAYER_NAME_PROMPT`

	Generates descriptive name, emoji, and point style for map layers.

	```
	User Request: "{user_query}"
	SQL Query: "{sql_query}"

	Rules:
	1. Return JSON with: name, emoji, pointStyle
	2. "name": Short descriptive (1-4 words)
	3. "emoji": Single emoji for data content
	4. "pointStyle": How to render points
	- "icon": Small/medium POI (<500 points)
	- "circle": Large point datasets (>500 points)
	- null: Polygon data (use choropleth)

	Examples:
	{"name": "Hospitals in David", "emoji": "🏥", "pointStyle": "icon"}
	{"name": "Population Density", "emoji": "👥", "pointStyle": null}
	{"name": "Traffic Intersections", "emoji": "🚦", "pointStyle": "circle"}
	```

	Decision Logic:
	- Hospitals, schools, parks → icon
	- Intersections, sensors (large datasets) → circle
	- H3 hexagons, admin boundaries → null (polygon rendering)

	---

	### 6. Explanation

	Prompt: `EXPLANATION_PROMPT`

	Generates natural language explanation of results.

	```
	Explain the results of this data query to the user.

	User Question: "{user_query}"
	SQL Query: {sql_query}
	Data Result Summary: {data_summary}

	Instructions:
	1. Keep response concise
	2. Only describe ACTUAL data returned
	3. Cite data source
	4. Speak as GeoQuery

	Example citation:
	"Source: Administrative boundary data from HDX/INEC, 2021"
	```

	Features:
	- Factual: Only describes what was actually found
	- Contextual: Relates results to user's question
	- Transparent: Cites data sources

	---

	### 7. SQL Correction

	Prompt: `SQL_CORRECTION_PROMPT`

	Repairs failed SQL queries.

	```
	Your previous query failed. Fix it.

	### Error Message:
	{error_message}

	### Failed SQL:
	{incorrect_sql}

	### User Request:
	"{user_query}"

	### Database Schema:
	{schema_context}

	Rules:
	1. Fix the error described in the message
	2. Return ONLY the valid SQL query
	3. Keep query logic consistent with User Request
	```

	Common Fixes:
	- Column ambiguity → Add table aliases
	- Missing column → Use correct column name
	- Syntax error → Fix DuckDB syntax

	---

	## Streaming Implementation

	### Thinking + Content Streaming

	```python
	async def stream_sql_generation(self, query: str, schema: str):
	config = types.GenerateContentConfig(
	thinking_config=types.ThinkingConfig(
	mode=types.ThinkingMode.THINKING
	)
	)

	response = await asyncio.to_thread(
	self.client.models.generate_content_stream,
	model=self.model,
	contents=query_prompt,
	config=config
	)

	async for chunk in response:
	if hasattr(chunk, 'thought'):
	yield {"type": "thought", "text": chunk.thought.text}
	if hasattr(chunk, 'text'):
	yield {"type": "content", "text": chunk.text}
	```

	Frontend receives:
	```json
	{"type": "thought", "text": "I need to find hospitals in the David district..."}
	{"type": "content", "text": "SELECT name, geom FROM ..."}
	```

	---

	## Error Handling

	### 1. Data Unavailable
	```sql
	-- ERROR: DATA_UNAVAILABLE
	-- Requested: crime statistics
	-- Available: hospitals, schools, admin boundaries
	```

	→ System detects marker and returns helpful error

	### 2. SQL Execution Error
	```
	Error: column "hospitals" does not exist
	```

	→ Send to `correct_sql()` → LLM fixes → Retry

	### 3. Rate Limiting
	```python
	try:
	response = await self.client.models.generate_content(...)
	except Exception as e:
	if "rate limit" in str(e).lower():
	await asyncio.sleep(1)
	# Retry
	```

	---

	## Performance Optimizations

	### Caching

	Not currently implemented, but recommended:
	```python
	from functools import lru_cache

	@lru_cache(maxsize=100)
	async def cached_sql_generation(query_hash: str):
	...
	```

	### Token Management

	- Minimize Context: Only send relevant table schemas
	- Semantic Search: Pre-filter to top 15 tables
	- Batch Requests: Combine multiple LLM calls where possible

	---

	## Prompt Engineering Best Practices

	### 1. Be Explicit
	❌ "Generate SQL for this query"
	✅ "Generate DuckDB SQL with spatial functions. Include 'geom' column. Use ILIKE for text matching."

	### 2. Provide Examples
	```
	Example:
	Input: "hospitals in Panama"
	Output: SELECT name, geom FROM panama_healthsites_geojson WHERE amenity='hospital'
	```

	### 3. Use Constraints
	```
	Rules:
	- Return ONLY SQL (no markdown, no explanation)
	- Use EXACT table names from schema
	- DO NOT invent columns
	```

	### 4. Handle Edge Cases
	```
	If data not available, return:
	-- ERROR: DATA_UNAVAILABLE
	```

	### 5. Structure Output
	```
	Return valid JSON:
	{"name": "...", "emoji": "..."}
	```

	---

	## Testing

	### Manual Testing

	```python
	llm = LLMGateway()

	# Test intent detection
	intent = await llm.detect_intent("Show me hospitals", [])
	print(intent) # Should be "MAP_REQUEST"

	# Test SQL generation
	sql = await llm.generate_analytical_sql("hospitals in David", schema, [])
	print(sql) # Should be valid SELECT query
	```

	### Prompt Iteration

	1. Test with real queries
	2. Analyze failures
	3. Update prompt
	4. Re-test

	---

	## Next Steps

	- Core Services: [CORE_SERVICES.md](CORE_SERVICES.md)
	- Data Flow: [../DATA_FLOW.md](../DATA_FLOW.md)
	- API Reference: [API_ENDPOINTS.md](API_ENDPOINTS.md)