---
title: OSINT Investigation Assistant
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: RAG-powered OSINT investigation assistant with 344+ tools
license: mit
---
# 🔍 OSINT Investigation Assistant
A RAG-powered AI assistant that helps investigators develop structured methodologies for open-source intelligence (OSINT) investigations. Built with LangChain, Supabase PGVector, and Hugging Face Inference Providers.
## ✨ Features
- **🎯 Structured Methodologies**: Generate step-by-step investigation plans tailored to your query
- **🛠️ 344+ OSINT Tools**: Access recommendations from a comprehensive database of curated OSINT tools
- **🔍 Context-Aware Retrieval**: Semantic search finds the most relevant tools for your investigation
- **🚀 API Access**: Built-in REST API for integration with external applications
- **💬 Chat Interface**: User-friendly conversational interface
- **🔌 MCP Support**: Can be extended to work with AI agents via the Model Context Protocol (MCP)
## 🏗️ Architecture
```
┌─────────────────────────────────────┐
│      Gradio UI + API Endpoints      │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│       LangChain RAG Pipeline        │
│  • Query Understanding              │
│  • Tool Retrieval (PGVector)        │
│  • Response Generation (LLM)        │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼───────────┐   ┌─────▼───────────┐
│  Supabase     │   │  HF Inference   │
│  PGVector DB  │   │  Providers      │
│  (344 tools)  │   │  (Llama 3.1)    │
└───────────────┘   └─────────────────┘
```
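The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in `src/rag_pipeline.py`: `build_prompt`, `investigate`, `fake_retrieve`, and `fake_generate` are hypothetical names, the tool entries are made up, and the stand-in retriever and generator replace the PGVector store and the Llama model so the example runs offline.

```python
from typing import Callable, Dict, List

Tool = Dict[str, str]


def build_prompt(query: str, tools: List[Tool]) -> str:
    """Format the retrieved tools into a context block ahead of the question."""
    context = "\n".join(
        f"- {t['name']} ({t['category']}): {t['url']}" for t in tools
    )
    return (
        "You are an OSINT methodology assistant.\n"
        f"Relevant tools:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer with a step-by-step methodology."
    )


def investigate(
    query: str,
    retrieve: Callable[[str, int], List[Tool]],
    generate: Callable[[str], str],
    k: int = 5,
) -> str:
    """Retrieve k tools, build the prompt, and return the model's answer."""
    tools = retrieve(query, k)
    return generate(build_prompt(query, tools))


# Stand-ins (assumptions) for the vector store and the LLM.
def fake_retrieve(query: str, k: int) -> List[Tool]:
    catalog = [
        {"name": "ExampleWhois", "category": "Domain & Network Investigation",
         "url": "https://example.com/whois"},
        {"name": "ExampleArchive", "category": "Archiving & Preservation",
         "url": "https://example.com/archive"},
    ]
    return catalog[:k]


def fake_generate(prompt: str) -> str:
    return f"[model output for a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    print(investigate("How do I investigate a suspicious domain?",
                      fake_retrieve, fake_generate))
```

Passing the retriever and generator in as callables keeps the pipeline logic testable without database or API credentials.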
## 🚀 Quick Start
### Local Development
1. **Clone the repository**
```bash
git clone <your-repo-url>
cd osint-llm
```
2. **Install dependencies**
```bash
pip install -r requirements.txt
```
3. **Set up environment variables**
```bash
cp .env.example .env
# Edit .env with your credentials
```
Required variables:
- `SUPABASE_CONNECTION_STRING`: Your Supabase PostgreSQL connection string
- `HF_TOKEN`: Your Hugging Face API token
4. **Run the application**
```bash
python app.py
```
The app will be available at `http://localhost:7860`.
### Hugging Face Spaces Deployment
1. **Create a new Space** on Hugging Face
2. **Push this repository** to your Space
3. **Set environment variables** in Space settings:
   - `SUPABASE_CONNECTION_STRING`
   - `HF_TOKEN`
4. **Deploy** - The Space will automatically build and launch
## 📚 Usage
### Chat Interface
Simply ask your investigation questions:
```
"How do I investigate a suspicious domain?"
"What tools can I use to verify an image's authenticity?"
"How can I trace the origin of a social media account?"
```
The assistant will provide:
1. Investigation overview
2. Step-by-step methodology
3. Recommended tools with descriptions and URLs
4. Best practices and safety considerations
5. Expected outcomes
### Tool Search
Use the "Tool Search" tab to directly search for OSINT tools by category or purpose.
### API Access
This app automatically exposes REST API endpoints for external integration.
**Python Client:**
```python
from gradio_client import Client

# Replace with your Space ID (e.g. "user/space-name") or its full URL
client = Client("your-space-url")
result = client.predict(
    "How do I investigate a domain?",
    api_name="/investigate",
)
print(result)
```
**JavaScript Client:**
```javascript
import { Client } from "@gradio/client";

const client = await Client.connect("your-space-url");
const result = await client.predict("/investigate", {
  message: "How do I investigate a domain?",
});
console.log(result.data);
```
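**cURL:**
```bash
# Step 1: submit the request; the JSON response contains an event_id
curl -X POST "https://your-space.hf.space/gradio_api/call/investigate" \
  -H "Content-Type: application/json" \
  -d '{"data": ["How do I investigate a domain?"]}'
# Step 2: stream the result for that event_id
curl -N "https://your-space.hf.space/gradio_api/call/investigate/<EVENT_ID>"
```
**Available Endpoints:**
- `/gradio_api/call/investigate` - Main investigation assistant
- `/gradio_api/call/search_tools` - Direct tool search
- `/gradio_api/openapi.json` - OpenAPI specification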
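When calling the endpoints over plain HTTP rather than through a client library, Gradio answers the POST with an event ID and then delivers the result as a server-sent event stream. The sketch below shows one way to pull the final payload out of such a stream. It assumes the stream consists of `event:` / `data:` line pairs ending in a `complete` event, as Gradio's curl guide describes; `parse_gradio_events` is a hypothetical helper and the sample text is invented for illustration.

```python
import json
from typing import Any, List, Optional


def parse_gradio_events(stream_text: str) -> Optional[List[Any]]:
    """Return the payload of the first `complete` event, or None if absent."""
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if event == "error":
                raise RuntimeError(f"Gradio reported an error: {payload}")
            if event == "complete":
                return json.loads(payload)
    return None


# Invented sample stream (assumption), not captured from a real Space.
sample = (
    "event: heartbeat\n"
    "data: null\n"
    "\n"
    "event: complete\n"
    'data: ["1. Start with WHOIS records..."]\n'
)
print(parse_gradio_events(sample))
```

In most cases the `gradio_client` package shown above is simpler, since it handles both requests and the stream parsing.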
## 🗄️ Database
The app uses Supabase with the PGVector extension to store and retrieve OSINT tools.
**Database Schema:**
```sql
CREATE TABLE bellingcat_tools (
    id BIGINT PRIMARY KEY,
    name TEXT,
    category TEXT,
    content TEXT,
    url TEXT,
    cost TEXT,
    details TEXT,
    embedding VECTOR,
    created_at TIMESTAMP WITH TIME ZONE
);
```
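Retrieval works by comparing the embedding of the query against the `embedding` column and returning the closest rows. In the app this ranking runs inside Postgres; the pure-Python sketch below only illustrates the idea. It assumes cosine similarity as the metric (the README does not state which one is used), and `cosine_similarity`, `top_k`, and the three-dimensional toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float], rows: List[Dict], k: int = 5
) -> List[Tuple[str, float]]:
    """Score every row against the query and keep the k most similar."""
    scored = [
        (row["name"], cosine_similarity(query_vec, row["embedding"]))
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy rows standing in for bellingcat_tools (made-up names and vectors).
rows = [
    {"name": "tool_a", "embedding": [1.0, 0.0, 0.0]},
    {"name": "tool_b", "embedding": [0.0, 1.0, 0.0]},
    {"name": "tool_c", "embedding": [0.7, 0.7, 0.0]},
]
for name, score in top_k([1.0, 0.1, 0.0], rows, k=2):
    print(name, round(score, 3))
```

The `k` argument plays the same role as the `RETRIEVAL_K` setting: it caps how many tools are passed to the model as context.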
**Tool Categories:**
- Archiving & Preservation
- Social Media Investigation
- Image & Video Analysis
- Domain & Network Investigation
- Geolocation
- Data Extraction
- Verification & Fact-Checking
- And more...
## 🛠️ Technology Stack
- **UI/API**: [Gradio](https://gradio.app/) - Automatic API generation
- **RAG Framework**: [LangChain](https://langchain.com/) - Retrieval pipeline
- **Vector Database**: [Supabase](https://supabase.com/) with the PGVector extension
- **Embeddings**: HuggingFace sentence-transformers
- **LLM**: [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/) - Llama 3.1
- **Language**: Python 3.9+
## 📁 Project Structure
```
osint-llm/
├── app.py                  # Main Gradio application
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
├── README.md               # This file
└── src/
    ├── __init__.py
    ├── vectorstore.py      # Supabase PGVector connection
    ├── rag_pipeline.py     # LangChain RAG logic
    ├── llm_client.py       # Inference Provider client
    └── prompts.py          # Investigation prompt templates
```
## ⚙️ Configuration
### Environment Variables
See `.env.example` for all available configuration options.
**Required:**
- `SUPABASE_CONNECTION_STRING` - PostgreSQL connection string
- `HF_TOKEN` - Hugging Face API token
**Optional:**
- `LLM_MODEL` - Model to use (default: meta-llama/Llama-3.1-8B-Instruct)
- `LLM_TEMPERATURE` - Generation temperature (default: 0.7)
- `LLM_MAX_TOKENS` - Max tokens to generate (default: 2000)
- `RETRIEVAL_K` - Number of tools to retrieve (default: 5)
- `EMBEDDING_MODEL` - Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
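As a rough illustration of how these variables might be read at startup, the sketch below validates the two required values and applies the documented defaults to the optional ones. `Settings` and `load_settings` are hypothetical names, not the app's actual code, and the connection string and token in the example are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    supabase_connection_string: str
    hf_token: str
    llm_model: str
    llm_temperature: float
    llm_max_tokens: int
    retrieval_k: int
    embedding_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, failing fast if a required one is missing."""
    required = ("SUPABASE_CONNECTION_STRING", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        supabase_connection_string=env["SUPABASE_CONNECTION_STRING"],
        hf_token=env["HF_TOKEN"],
        llm_model=env.get("LLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.7")),
        llm_max_tokens=int(env.get("LLM_MAX_TOKENS", "2000")),
        retrieval_k=int(env.get("RETRIEVAL_K", "5")),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        ),
    )


# Placeholder values (assumptions) so the example runs without a real .env file.
example_env = {
    "SUPABASE_CONNECTION_STRING": "postgresql://user:password@host:5432/postgres",
    "HF_TOKEN": "hf_placeholder",
    "RETRIEVAL_K": "8",
}
settings = load_settings(example_env)
print(settings.llm_model, settings.retrieval_k)
```

Failing at startup with the names of the missing variables tends to be easier to debug than a connection error surfacing on the first query.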
### Supported LLM Models
- `meta-llama/Llama-3.1-8B-Instruct` (recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct`
- `Qwen/Qwen2.5-72B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
## 💰 Cost Considerations
### Hugging Face Inference Providers
- Free tier: $0.10/month credits
- PRO tier: $2.00/month credits + pay-as-you-go
- Typical cost: ~$0.001-0.01 per query
- Recommended budget: $10-50/month for moderate usage
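To relate the per-query figure to a monthly budget, multiply it by the expected query volume. The short calculation below uses the ~$0.001-0.01 range quoted above; `monthly_cost_range` is a hypothetical helper, and the volume of 1,000 queries is only an example.

```python
from typing import Tuple


def monthly_cost_range(
    queries_per_month: int,
    low_per_query: float = 0.001,
    high_per_query: float = 0.01,
) -> Tuple[float, float]:
    """Return the (low, high) monthly inference cost in dollars."""
    return queries_per_month * low_per_query, queries_per_month * high_per_query


low, high = monthly_cost_range(1000)
print(f"${low:.2f} to ${high:.2f}")  # prints "$1.00 to $10.00"
```

By the same arithmetic, 5,000 queries per month would land between $5 and $50, which is consistent with the recommended budget.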
### Supabase
- Free tier sufficient for most use cases
- PGVector operations are standard database queries
### Hugging Face Spaces
- Free CPU hosting available
- GPU upgrade: ~$0.60/hour (optional, not required)
## 🔮 Future Enhancements
- [ ] MCP server integration for AI agent tool use
- [ ] Multi-turn conversation with memory
- [ ] User authentication and query logging
- [ ] Additional tool databases and sources
- [ ] Export methodologies as PDF/markdown
- [ ] Tool usage examples and tutorials
- [ ] Community-contributed tool reviews
## 🤝 Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
## 📄 License
MIT License - See LICENSE file for details
## 🙏 Acknowledgments
- Tool data sourced from [Bellingcat's Online Investigation Toolkit](https://www.bellingcat.com/)
- Built with support from the OSINT community
## 📞 Support
For issues or questions:
- Open an issue on GitHub
- Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces)
- Review the [Gradio documentation](https://gradio.app/docs/)
---
Built with ❤️ for the OSINT community