---
title: OSINT Investigation Assistant
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: RAG-powered OSINT investigation assistant with 344+ tools
license: mit
---

# 🔍 OSINT Investigation Assistant

A RAG-powered AI assistant that helps investigators develop structured methodologies for open-source intelligence (OSINT) investigations. Built with LangChain, Supabase PGVector, and Hugging Face Inference Providers.

## ✨ Features

- **🎯 Structured Methodologies**: Generate step-by-step investigation plans tailored to your query
- **🛠️ 344+ OSINT Tools**: Access recommendations from a comprehensive database of curated OSINT tools
- **🔍 Context-Aware Retrieval**: Semantic search finds the most relevant tools for your investigation
- **🚀 API Access**: Built-in REST API for integration with external applications
- **💬 Chat Interface**: User-friendly conversational interface
- **🔌 MCP Support**: Can be extended to work with AI agents via the MCP protocol

## ๐Ÿ—๏ธ Architecture

```
┌──────────────────────────────────────┐
│      Gradio UI + API Endpoints       │
└──────────────────┬───────────────────┘
                   │
┌──────────────────▼───────────────────┐
│        LangChain RAG Pipeline        │
│  • Query Understanding               │
│  • Tool Retrieval (PGVector)         │
│  • Response Generation (LLM)         │
└──────────────────┬───────────────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
┌───────▼───────┐     ┌───────▼───────┐
│   Supabase    │     │  HF Inference │
│  PGVector DB  │     │   Providers   │
│  (344 tools)  │     │  (Llama 3.1)  │
└───────────────┘     └───────────────┘
```
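
To make the diagram concrete, the sketch below shows one plausible way the retrieval and generation steps could be wired together with LangChain and the Hugging Face Hub client. The vector-store wrapper, collection name, and prompt are illustrative assumptions; the actual wiring lives in `src/vectorstore.py`, `src/rag_pipeline.py`, and `src/llm_client.py` and may differ.

```python
# Hypothetical sketch of the RAG wiring; see src/ for the real implementation.
import os

from huggingface_hub import InferenceClient
from langchain_community.vectorstores import PGVector
from langchain_huggingface import HuggingFaceEmbeddings

# Embed queries with the same model used to embed the tool database.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Connect to the Supabase Postgres instance that holds the tool vectors.
store = PGVector(
    connection_string=os.environ["SUPABASE_CONNECTION_STRING"],
    embedding_function=embeddings,
    collection_name="bellingcat_tools",  # placeholder collection name
)

def investigate(query: str, k: int = 5) -> str:
    # Retrieve the k most relevant tools for the query.
    docs = store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Ask the LLM to turn the retrieved tools into a methodology.
    client = InferenceClient(token=os.environ["HF_TOKEN"])
    response = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are an OSINT investigation assistant."},
            {"role": "user", "content": f"Tools:\n{context}\n\nQuestion: {query}"},
        ],
        max_tokens=2000,
    )
    return response.choices[0].message.content
```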

## 🚀 Quick Start

### Local Development

1. **Clone the repository**
   ```bash
   git clone <your-repo-url>
   cd osint-llm
   ```

2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**
   ```bash
   cp .env.example .env
   # Edit .env with your credentials
   ```

   Required variables:
   - `SUPABASE_CONNECTION_STRING`: Your Supabase PostgreSQL connection string
   - `HF_TOKEN`: Your Hugging Face API token

4. **Run the application**
   ```bash
   python app.py
   ```

   The app will be available at `http://localhost:7860`

### Hugging Face Spaces Deployment

1. **Create a new Space** on Hugging Face
2. **Push this repository** to your Space
3. **Set environment variables** in Space settings:
   - `SUPABASE_CONNECTION_STRING`
   - `HF_TOKEN`
4. **Deploy** - The Space will automatically build and launch

## 📚 Usage

### Chat Interface

Simply ask your investigation questions:

```
"How do I investigate a suspicious domain?"
"What tools can I use to verify an image's authenticity?"
"How can I trace the origin of a social media account?"
```

The assistant will provide:
1. Investigation overview
2. Step-by-step methodology
3. Recommended tools with descriptions and URLs
4. Best practices and safety considerations
5. Expected outcomes

### Tool Search

Use the "Tool Search" tab to directly search for OSINT tools by category or purpose.

### API Access

This app automatically exposes REST API endpoints for external integration.

**Python Client:**

```python
from gradio_client import Client

client = Client("your-space-url")
result = client.predict(
    "How do I investigate a domain?",
    api_name="/investigate"
)
print(result)
```

**JavaScript Client:**

```javascript
import { Client } from "@gradio/client";

const client = await Client.connect("your-space-url");
const result = await client.predict("/investigate", {
  message: "How do I investigate a domain?"
});
console.log(result.data);
```

**cURL:**

```bash
curl -X POST "https://your-space.hf.space/call/investigate" \
     -H "Content-Type: application/json" \
     -d '{"data": ["How do I investigate a domain?"]}'
```

**Available Endpoints:**
- `/call/investigate` - Main investigation assistant
- `/call/search_tools` - Direct tool search
- `/gradio_api/openapi.json` - OpenAPI specification
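
The `/search_tools` endpoint can be called the same way as `/investigate`. The single string argument below is an assumption about its signature; check the Space's "Use via API" page or `/gradio_api/openapi.json` for the exact parameters.

```python
from gradio_client import Client

client = Client("your-space-url")

# Assumed signature: one query string; verify against the Space's API docs.
tools = client.predict(
    "reverse image search",
    api_name="/search_tools",
)
print(tools)
```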

## 🗄️ Database

The app uses Supabase with the PGVector extension to store and retrieve OSINT tools.

**Database Schema:**
```sql
CREATE TABLE bellingcat_tools (
  id BIGINT PRIMARY KEY,
  name TEXT,
  category TEXT,
  content TEXT,
  url TEXT,
  cost TEXT,
  details TEXT,
  embedding VECTOR,
  created_at TIMESTAMP WITH TIME ZONE
);
```
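
Under the hood, retrieval is a nearest-neighbour query over the `embedding` column. The snippet below is an illustrative example, not code from the app (which goes through LangChain's vector store), of how such a lookup could be issued directly with `psycopg2` and pgvector's cosine-distance operator, assuming the connection string is a plain `postgresql://` DSN.

```python
# Illustrative raw-SQL lookup against the tools table.
import os

import psycopg2

# Placeholder query vector; in practice this comes from the embedding model
# (all-MiniLM-L6-v2 produces 384-dimensional vectors).
query_embedding = [0.0] * 384
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect(os.environ["SUPABASE_CONNECTION_STRING"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT name, category, url
        FROM bellingcat_tools
        ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
        LIMIT 5;
        """,
        (vector_literal,),
    )
    for name, category, url in cur.fetchall():
        print(f"{name} [{category}] {url}")
```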

**Tool Categories:**
- Archiving & Preservation
- Social Media Investigation
- Image & Video Analysis
- Domain & Network Investigation
- Geolocation
- Data Extraction
- Verification & Fact-Checking
- And more...

## 🛠️ Technology Stack

- **UI/API**: [Gradio](https://gradio.app/) - Automatic API generation
- **RAG Framework**: [LangChain](https://langchain.com/) - Retrieval pipeline
- **Vector Database**: [Supabase](https://supabase.com/) with PGVector extension
- **Embeddings**: HuggingFace sentence-transformers
- **LLM**: [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/) - Llama 3.1
- **Language**: Python 3.9+

## 📁 Project Structure

```
osint-llm/
├── app.py                   # Main Gradio application
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── README.md                # This file
└── src/
    ├── __init__.py
    ├── vectorstore.py       # Supabase PGVector connection
    ├── rag_pipeline.py      # LangChain RAG logic
    ├── llm_client.py        # Inference Provider client
    └── prompts.py           # Investigation prompt templates
```

## ⚙️ Configuration

### Environment Variables

See `.env.example` for all available configuration options.

**Required:**
- `SUPABASE_CONNECTION_STRING` - PostgreSQL connection string
- `HF_TOKEN` - Hugging Face API token

**Optional:**
- `LLM_MODEL` - Model to use (default: meta-llama/Llama-3.1-8B-Instruct)
- `LLM_TEMPERATURE` - Generation temperature (default: 0.7)
- `LLM_MAX_TOKENS` - Max tokens to generate (default: 2000)
- `RETRIEVAL_K` - Number of tools to retrieve (default: 5)
- `EMBEDDING_MODEL` - Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
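
As a rough illustration of how these defaults might be applied (the actual handling is in the application code and may differ):

```python
# Hypothetical config loading using the documented defaults.
import os

LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2000"))
RETRIEVAL_K = int(os.getenv("RETRIEVAL_K", "5"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
```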

### Supported LLM Models

- `meta-llama/Llama-3.1-8B-Instruct` (recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct`
- `Qwen/Qwen2.5-72B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`

## 💰 Cost Considerations

### Hugging Face Inference Providers
- Free tier: $0.10/month credits
- PRO tier: $2.00/month credits + pay-as-you-go
- Typical cost: ~$0.001-0.01 per query
- Recommended budget: $10-50/month for moderate usage

### Supabase
- Free tier sufficient for most use cases
- PGVector operations are standard database queries

### Hugging Face Spaces
- Free CPU hosting available
- GPU upgrade: ~$0.60/hour (optional, not required)

## 🔮 Future Enhancements

- [ ] MCP server integration for AI agent tool use
- [ ] Multi-turn conversation with memory
- [ ] User authentication and query logging
- [ ] Additional tool databases and sources
- [ ] Export methodologies as PDF/markdown
- [ ] Tool usage examples and tutorials
- [ ] Community-contributed tool reviews

## 🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## 📄 License

MIT License - See LICENSE file for details

## 🙏 Acknowledgments

- Tool data sourced from [Bellingcat's Online Investigation Toolkit](https://www.bellingcat.com/)
- Built with support from the OSINT community

## 📞 Support

For issues or questions:
- Open an issue on GitHub
- Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces)
- Review the [Gradio documentation](https://gradio.app/docs/)

---

Built with ❤️ for the OSINT community