Spaces:

GerardCB
/

GeoQuery

Running

File size: 11,050 Bytes
# LLM Integration & Prompt Engineering

Deep dive into GeoQuery's LLM integration, prompt system, and AI capabilities.

---

## Overview

GeoQuery uses **Google Gemini 2.0 Flash** for all AI capabilities:
- Intent detection
- Text-to-SQL generation
- Natural language explanations
- Layer naming and styling decisions

**Key Feature**: Thinking mode for transparency into reasoning process.

---

## LLMGateway Service

**File**: `backend/core/llm_gateway.py`

Central interface to Gemini API with streaming support.

### Initialization

```python
from google import genai
from google.genai import types

class LLMGateway:
    def __init__(self):
        api_key = os.getenv("GEMINI_API_KEY")
        self.client = genai.Client(api_key=api_key)
        self.model = "gemini-2.0-flash-exp"
```

### Configuration

**Model**: `gemini-2.0-flash-exp`
- **Speed**: Fast responses (~1s for SQL)
- **Quality**: High accuracy for structured output
- **Thinking Mode**: Shows reasoning process
- **JSON Output**: Structured responses

**Parameters**:
```python
config = types.GenerateContentConfig(
    temperature=1,  # Creative but consistent
    response_mime_type="application/json",  # For structured outputs
    thinking_config=types.ThinkingConfig(
        mode=types.ThinkingMode.THINKING  # Enable reasoning
    )
)
```

---

## Prompt System

All prompts centralized in `backend/core/prompts.py`.

### 1. System Instruction

**Prompt**: `SYSTEM_INSTRUCTION`

Sets overall context and capabilities:

```
You are GeoQuery, an advanced Territorial Intelligence Agent capable of 
analyzing diverse geographic datasets.

## Your Capabilities
- Dynamic Metadata Catalog (not fixed schema)
- Spatial Analysis with PostGIS/DuckDB functions
- Visual outputs (maps, charts)

## Output Guidelines
1. Be Data-Driven: Base answers on SQL query results
2. Be Visual: Use choropleth maps, point maps, charts
3. Be Transparent: Explain reasoning, cite sources
...
```

**Purpose**: Establishes AI persona and core behavior

---

### 2. Intent Detection

**Prompt**: `INTENT_DETECTION_PROMPT`

Classifies user queries into categories.

```
Analyze this user query and determine the best output type.

User Query: "{user_query}"

THINK STEP BY STEP:
1. What is the user asking for?
2. Does this require geographic visualization?
3. Does this require a chart/graph?
4. Is this a general question?

Respond with ONLY ONE of these exact words:
- GENERAL_CHAT
- DATA_QUERY  
- MAP_REQUEST
- SPATIAL_OP
- STAT_QUERY

Key rules:
- "color by", "compare regions" → MAP_REQUEST
- "create a chart" → STAT_QUERY
- Questions about data availability → GENERAL_CHAT
```

**Examples**:

| Query | Thinking | Intent |
|-------|----------|--------|
| "Show me hospitals" | User wants to SEE on map | `MAP_REQUEST` |
| "How many provinces?" | Numerical answer, no viz needed | `DATA_QUERY` |
| "Create bar chart of districts" | Explicitly requests chart | `STAT_QUERY` |
| "Subtract Chiriquí from Panama" | Geometric operation | `SPATIAL_OP` |
| "What data do you have?" | General question | `GENERAL_CHAT` |

---

### 3. SQL Generation

**Prompt**: `SQL_GENERATION_PROMPT`

Converts natural language to DuckDB SQL.

```
You are a DuckDB SQL expert for geographic data analysis.

{table_schema}

### CRITICAL - Data Availability:
✅ You may ONLY query the tables listed above.
❌ Do NOT invent table names or columns.

If requested data is NOT available, return:
-- ERROR: DATA_UNAVAILABLE
-- Requested: [what user asked for]
-- Available: [list tables you DO have]

### User Request: "{user_query}"

### Rules:
1. Return ONLY the SQL query
2. Use DuckDB syntax (ILIKE for case-insensitive)
3. ALWAYS include 'geom' for map visualization
4. For "top N", use ORDER BY ... DESC LIMIT N
5. Do NOT add LIMIT unless explicitly requested
6. NEVER invent columns that don't exist

### Special Dataset - Population:
- Use `kontur_population` (H3 hexagons)
- Columns: population, geom
- Large dataset (33K hexagons) - use LIMIT 40000
...

Generate SQL:
```

**Key Features**:
- **Error Prevention**: Explicit instructions to avoid hallucinating tables
- **Spatial Functions**: Guides use of ST_Intersects, ST_Within, etc.
- **Data Unavailable Handling**: Returns special marker instead of invalid SQL

**Examples**:

Input: "Show hospitals in David"
```sql
SELECT name, amenity, geom 
FROM panama_healthsites_geojson 
WHERE amenity = 'hospital' 
  AND ST_Intersects(geom, (SELECT geom FROM pan_admin2 WHERE adm2_name = 'David'))
```

Input: "Population density in Veraguas"
```sql
SELECT population, geom 
FROM kontur_population 
WHERE ST_Intersects(geom, (SELECT geom FROM pan_admin1 WHERE adm1_name = 'Veraguas'))
LIMIT 5000
```

---

### 4. Spatial SQL

**Prompt**: `SPATIAL_SQL_PROMPT`

For geometric operations (difference, intersection, buffer, etc.).

```
You are a GIS expert using DuckDB Spatial.

Available Data:
{layer_context}

User Request: "{user_query}"

Rules:
1. Return ONLY SQL query
2. Use DuckDB Spatial functions:
   - ST_Difference, ST_Intersection, ST_Union
   - ST_Buffer, ST_Within, ST_Contains
3. The geometry column is named 'geom'
4. Use EXACT table names shown above
5. IMPORTANT: For aggregate geometries (ST_Union), use CTE pattern:

CORRECT:
WITH layer_b_union AS (SELECT ST_Union(geom) as geom FROM layer_b)
SELECT a.*, ST_Difference(a.geom, b.geom) as geom 
FROM layer_a a, layer_b_union b

WRONG:
SELECT ST_Difference(geom, (SELECT ST_Union(geom) FROM layer_b)) 
FROM layer_a
```

**Example**:

Input: "Subtract protected areas from Chiriquí province"
```sql
WITH protected_union AS (
  SELECT ST_Union(geom) as geom FROM stri_protected_areas_2025
)
SELECT 
  p.adm1_name,
  ST_Difference(p.geom, pa.geom) as geom
FROM pan_admin1 p, protected_union pa
WHERE p.adm1_name = 'Chiriquí'
```

---

### 5. Layer Naming

**Prompt**: `LAYER_NAME_PROMPT`

Generates descriptive name, emoji, and point style for map layers.

```
User Request: "{user_query}"
SQL Query: "{sql_query}"

Rules:
1. Return JSON with: name, emoji, pointStyle
2. "name": Short descriptive (1-4 words)
3. "emoji": Single emoji for data content
4. "pointStyle": How to render points
   - "icon": Small/medium POI (<500 points)
   - "circle": Large point datasets (>500 points)  
   - null: Polygon data (use choropleth)

Examples:
{"name": "Hospitals in David", "emoji": "🏥", "pointStyle": "icon"}
{"name": "Population Density", "emoji": "👥", "pointStyle": null}
{"name": "Traffic Intersections", "emoji": "🚦", "pointStyle": "circle"}
```

**Decision Logic**:
- Hospitals, schools, parks → icon
- Intersections, sensors (large datasets) → circle
- H3 hexagons, admin boundaries → null (polygon rendering)

---

### 6. Explanation

**Prompt**: `EXPLANATION_PROMPT`

Generates natural language explanation of results.

```
Explain the results of this data query to the user.

User Question: "{user_query}"
SQL Query: {sql_query}
Data Result Summary: {data_summary}

Instructions:
1. Keep response concise
2. Only describe ACTUAL data returned
3. Cite data source
4. Speak as GeoQuery

Example citation:
"Source: Administrative boundary data from HDX/INEC, 2021"
```

**Features**:
- **Factual**: Only describes what was actually found
- **Contextual**: Relates results to user's question
- **Transparent**: Cites data sources

---

### 7. SQL Correction

**Prompt**: `SQL_CORRECTION_PROMPT`

Repairs failed SQL queries.

```
Your previous query failed. Fix it.

### Error Message:
{error_message}

### Failed SQL:
{incorrect_sql}

### User Request:
"{user_query}"

### Database Schema:
{schema_context}

Rules:
1. Fix the error described in the message
2. Return ONLY the valid SQL query
3. Keep query logic consistent with User Request
```

**Common Fixes**:
- Column ambiguity → Add table aliases
- Missing column → Use correct column name
- Syntax error → Fix DuckDB syntax

---

## Streaming Implementation

### Thinking + Content Streaming

```python
async def stream_sql_generation(self, query: str, schema: str):
    config = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            mode=types.ThinkingMode.THINKING
        )
    )
    
    response = await asyncio.to_thread(
        self.client.models.generate_content_stream,
        model=self.model,
        contents=query_prompt,
        config=config
    )
    
    async for chunk in response:
        if hasattr(chunk, 'thought'):
            yield {"type": "thought", "text": chunk.thought.text}
        if hasattr(chunk, 'text'):
            yield {"type": "content", "text": chunk.text}
```

**Frontend receives**:
```json
{"type": "thought", "text": "I need to find hospitals in the David district..."}
{"type": "content", "text": "SELECT name, geom FROM ..."}
```

---

## Error Handling

### 1. Data Unavailable
```sql
-- ERROR: DATA_UNAVAILABLE
-- Requested: crime statistics
-- Available: hospitals, schools, admin boundaries
```

→ System detects marker and returns helpful error

### 2. SQL Execution Error
```
Error: column "hospitals" does not exist
```

→ Send to `correct_sql()` → LLM fixes → Retry

### 3. Rate Limiting
```python
try:
    response = await self.client.models.generate_content(...)
except Exception as e:
    if "rate limit" in str(e).lower():
        await asyncio.sleep(1)
        # Retry
```

---

## Performance Optimizations

### Caching

**Not currently implemented**, but recommended:
```python
from functools import lru_cache

@lru_cache(maxsize=100)
async def cached_sql_generation(query_hash: str):
    ...
```

### Token Management

- **Minimize Context**: Only send relevant table schemas
- **Semantic Search**: Pre-filter to top 15 tables
- **Batch Requests**: Combine multiple LLM calls where possible

---

## Prompt Engineering Best Practices

### 1. Be Explicit
❌ "Generate SQL for this query"
✅ "Generate DuckDB SQL with spatial functions. Include 'geom' column. Use ILIKE for text matching."

### 2. Provide Examples
```
Example:
Input: "hospitals in Panama"
Output: SELECT name, geom FROM panama_healthsites_geojson WHERE amenity='hospital'
```

### 3. Use Constraints
```
Rules:
- Return ONLY SQL (no markdown, no explanation)
- Use EXACT table names from schema
- DO NOT invent columns
```

### 4. Handle Edge Cases
```
If data not available, return:
-- ERROR: DATA_UNAVAILABLE
```

### 5. Structure Output
```
Return valid JSON:
{"name": "...", "emoji": "..."}
```

---

## Testing

### Manual Testing

```python
llm = LLMGateway()

# Test intent detection
intent = await llm.detect_intent("Show me hospitals", [])
print(intent)  # Should be "MAP_REQUEST"

# Test SQL generation
sql = await llm.generate_analytical_sql("hospitals in David", schema, [])
print(sql)  # Should be valid SELECT query
```

### Prompt Iteration

1. **Test with real queries**
2. **Analyze failures**  
3. **Update prompt**
4. **Re-test**

---

## Next Steps

- **Core Services**: [CORE_SERVICES.md](CORE_SERVICES.md)
- **Data Flow**: [../DATA_FLOW.md](../DATA_FLOW.md)
- **API Reference**: [API_ENDPOINTS.md](API_ENDPOINTS.md)