---
title: OSINT Investigation Assistant
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: RAG-powered OSINT investigation assistant with 344+ tools
license: mit
---

🔍 OSINT Investigation Assistant

A RAG-powered AI assistant that helps investigators develop structured methodologies for open-source intelligence (OSINT) investigations. Built with LangChain, Supabase PGVector, and Hugging Face Inference Providers.

✨ Features

  • 🎯 Structured Methodologies: Generate step-by-step investigation plans tailored to your query
  • 🛠️ 344+ OSINT Tools: Access recommendations from a comprehensive database of curated OSINT tools
  • 🔍 Context-Aware Retrieval: Semantic search finds the most relevant tools for your investigation
  • 🚀 API Access: Built-in REST API for integration with external applications
  • 💬 Chat Interface: User-friendly conversational interface
  • 🔌 MCP Support: Can be extended to work with AI agents via the MCP protocol

๐Ÿ—๏ธ Architecture

```
┌────────────────────────────────────┐
│     Gradio UI + API Endpoints      │
└─────────────────┬──────────────────┘
                  │
┌─────────────────▼──────────────────┐
│     LangChain RAG Pipeline         │
│  • Query Understanding             │
│  • Tool Retrieval (PGVector)       │
│  • Response Generation (LLM)       │
└─────────────────┬──────────────────┘
                  │
        ┌─────────┴──────────┐
        │                    │
┌───────▼────────┐   ┌───────▼──────────┐
│ Supabase       │   │ HF Inference     │
│ PGVector DB    │   │ Providers        │
│ (344 tools)    │   │ (Llama 3.1)      │
└────────────────┘   └──────────────────┘
```
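
The flow above can be sketched in a few lines of Python. This is only an illustration of the RAG pattern, not the actual contents of src/rag_pipeline.py: it assumes a hypothetical get_vectorstore() helper for the Supabase connection and calls Hugging Face Inference Providers directly through huggingface_hub.

```python
# Illustrative sketch of the retrieval-augmented flow (not the app's real code).
import os

from huggingface_hub import InferenceClient

from src.vectorstore import get_vectorstore  # hypothetical helper returning a LangChain vector store


def investigate(query: str) -> str:
    # 1. Semantic search over the bellingcat_tools embeddings
    vectorstore = get_vectorstore()
    docs = vectorstore.similarity_search(query, k=int(os.getenv("RETRIEVAL_K", "5")))
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Ask the hosted LLM for a structured methodology grounded in the retrieved tools
    client = InferenceClient(token=os.getenv("HF_TOKEN"))
    response = client.chat_completion(
        model=os.getenv("LLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        messages=[
            {"role": "system", "content": "You are an OSINT investigation assistant."},
            {"role": "user", "content": f"Available tools:\n{context}\n\nInvestigation question: {query}"},
        ],
        max_tokens=int(os.getenv("LLM_MAX_TOKENS", "2000")),
        temperature=float(os.getenv("LLM_TEMPERATURE", "0.7")),
    )
    return response.choices[0].message.content
```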

🚀 Quick Start

Local Development

  1. Clone the repository

    git clone <your-repo-url>
    cd osint-llm
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Set up environment variables

    cp .env.example .env
    # Edit .env with your credentials
    

    Required variables (a sample .env is sketched after these steps):

    • SUPABASE_CONNECTION_STRING: Your Supabase PostgreSQL connection string
    • HF_TOKEN: Your Hugging Face API token
  4. Run the application

    python app.py
    

    The app will be available at http://localhost:7860
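
For step 3, a minimal .env might look like the following; the connection string and token are placeholders, and the optional values shown are the documented defaults (see Configuration below):

```
# Required
SUPABASE_CONNECTION_STRING=postgresql://user:password@your-project.supabase.co:5432/postgres
HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Optional (defaults shown)
LLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=2000
RETRIEVAL_K=5
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```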

Hugging Face Spaces Deployment

  1. Create a new Space on Hugging Face
  2. Push this repository to your Space (see the git sketch after these steps)
  3. Set environment variables in Space settings:
    • SUPABASE_CONNECTION_STRING
    • HF_TOKEN
  4. Deploy - The Space will automatically build and launch
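
Step 2 typically amounts to adding the Space as a git remote and pushing (replace the placeholders with your Hugging Face username and Space name):

```bash
git remote add space https://huggingface.co/spaces/<your-username>/<your-space-name>
git push space main
```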

📚 Usage

Chat Interface

Simply ask your investigation questions:

"How do I investigate a suspicious domain?"
"What tools can I use to verify an image's authenticity?"
"How can I trace the origin of a social media account?"

The assistant will provide:

  1. Investigation overview
  2. Step-by-step methodology
  3. Recommended tools with descriptions and URLs
  4. Best practices and safety considerations
  5. Expected outcomes

Tool Search

Use the "Tool Search" tab to directly search for OSINT tools by category or purpose.

API Access

This app automatically exposes REST API endpoints for external integration.

Python Client:

```python
from gradio_client import Client

client = Client("your-space-url")
result = client.predict(
    "How do I investigate a domain?",
    api_name="/investigate"
)
print(result)
```

JavaScript Client:

```javascript
import { Client } from "@gradio/client";

const client = await Client.connect("your-space-url");
const result = await client.predict("/investigate", {
  message: "How do I investigate a domain?"
});
console.log(result.data);
```

cURL:

```bash
curl -X POST "https://your-space.hf.space/call/investigate" \
     -H "Content-Type: application/json" \
     -d '{"data": ["How do I investigate a domain?"]}'
```
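
With recent Gradio versions the /call endpoints are two-step: the POST above returns a JSON body containing an event_id, and the result is then streamed from a follow-up GET:

```bash
# Replace $EVENT_ID with the value returned by the POST request
curl -N "https://your-space.hf.space/call/investigate/$EVENT_ID"
```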

Available Endpoints:

  • /call/investigate - Main investigation assistant
  • /call/search_tools - Direct tool search
  • /gradio_api/openapi.json - OpenAPI specification
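
The /search_tools endpoint can be called the same way as /investigate. The exact parameter names depend on how the functions are defined in app.py, so treat the call below as a sketch and check client.view_api() or the OpenAPI specification listed above for the authoritative signature:

```python
from gradio_client import Client

client = Client("your-space-url")
client.view_api()  # prints the auto-generated signature of every endpoint

# Illustrative call; arguments must match the endpoint's actual signature
tools = client.predict("reverse image search", api_name="/search_tools")
print(tools)
```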

🗄️ Database

The app uses Supabase with the pgvector extension to store and retrieve OSINT tool records.

Database Schema:

```sql
CREATE TABLE bellingcat_tools (
  id BIGINT PRIMARY KEY,
  name TEXT,
  category TEXT,
  content TEXT,
  url TEXT,
  cost TEXT,
  details TEXT,
  embedding VECTOR,
  created_at TIMESTAMP WITH TIME ZONE
);
```
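
Retrieval boils down to a standard pgvector similarity query over the embedding column. An illustrative query using pgvector's cosine-distance operator, with the vector literal standing in for a query embedding:

```sql
-- Find the 5 tools whose embeddings are closest to the query embedding
-- (the literal below is a placeholder; real embeddings have the model's full dimensionality)
SELECT name, category, url
FROM bellingcat_tools
ORDER BY embedding <=> '[0.12, -0.03, 0.08]'::vector
LIMIT 5;
```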

Tool Categories:

  • Archiving & Preservation
  • Social Media Investigation
  • Image & Video Analysis
  • Domain & Network Investigation
  • Geolocation
  • Data Extraction
  • Verification & Fact-Checking
  • And more...

🛠️ Technology Stack

  • Gradio - chat UI and auto-generated REST API endpoints
  • LangChain - RAG pipeline (retrieval, prompting, generation)
  • Supabase PostgreSQL with pgvector - vector store for the 344-tool database
  • Hugging Face Inference Providers - hosted LLM inference (Llama 3.1 by default)
  • sentence-transformers/all-MiniLM-L6-v2 - embedding model for semantic search

📁 Project Structure

```
osint-llm/
├── app.py                   # Main Gradio application
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── README.md                # This file
└── src/
    ├── __init__.py
    ├── vectorstore.py       # Supabase PGVector connection
    ├── rag_pipeline.py      # LangChain RAG logic
    ├── llm_client.py        # Inference Provider client
    └── prompts.py           # Investigation prompt templates
```

⚙️ Configuration

Environment Variables

See .env.example for all available configuration options.

Required:

  • SUPABASE_CONNECTION_STRING - PostgreSQL connection string
  • HF_TOKEN - Hugging Face API token

Optional:

  • LLM_MODEL - Model to use (default: meta-llama/Llama-3.1-8B-Instruct)
  • LLM_TEMPERATURE - Generation temperature (default: 0.7)
  • LLM_MAX_TOKENS - Max tokens to generate (default: 2000)
  • RETRIEVAL_K - Number of tools to retrieve (default: 5)
  • EMBEDDING_MODEL - Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
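
A sketch of how these optional settings might be applied at startup (illustrative; the actual app code may read them differently):

```python
import os

# Optional settings fall back to the documented defaults
LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2000"))
RETRIEVAL_K = int(os.getenv("RETRIEVAL_K", "5"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
```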

Supported LLM Models

  • meta-llama/Llama-3.1-8B-Instruct (recommended)
  • meta-llama/Meta-Llama-3-8B-Instruct
  • Qwen/Qwen2.5-72B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3

💰 Cost Considerations

Hugging Face Inference Providers

  • Free tier: $0.10/month credits
  • PRO tier: $2.00/month credits + pay-as-you-go
  • Typical cost: ~$0.001-0.01 per query
  • Recommended budget: $10-50/month for moderate usage
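
As a rough check: at about $0.005 per query (the midpoint of the range above), a $10 monthly budget covers roughly 2,000 queries.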

Supabase

  • Free tier sufficient for most use cases
  • PGVector operations are standard database queries

Hugging Face Spaces

  • Free CPU hosting available
  • GPU upgrade: ~$0.60/hour (optional, not required)

🔮 Future Enhancements

  • MCP server integration for AI agent tool use
  • Multi-turn conversation with memory
  • User authentication and query logging
  • Additional tool databases and sources
  • Export methodologies as PDF/markdown
  • Tool usage examples and tutorials
  • Community-contributed tool reviews

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

📞 Support

For issues or questions, please open an issue on the repository (see Contributing above).


Built with ❤️ for the OSINT community