---
title: OSINT Investigation Assistant
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: RAG-powered OSINT investigation assistant with 344+ tools
license: mit
---

# 🔍 OSINT Investigation Assistant

A RAG-powered AI assistant that helps investigators develop structured methodologies for open-source intelligence (OSINT) investigations. Built with LangChain, Supabase PGVector, and Hugging Face Inference Providers.

## ✨ Features

- **🎯 Structured Methodologies**: Generate step-by-step investigation plans tailored to your query
- **🛠️ 344+ OSINT Tools**: Access recommendations from a comprehensive database of curated OSINT tools
- **🔍 Context-Aware Retrieval**: Semantic search finds the most relevant tools for your investigation
- **🚀 API Access**: Built-in REST API for integration with external applications
- **💬 Chat Interface**: User-friendly conversational interface
- **🔌 MCP Support**: Can be extended to work with AI agents via the MCP protocol

## ๐Ÿ—๏ธ Architecture

```
┌──────────────────────────────────────┐
│      Gradio UI + API Endpoints       │
└──────────────────┬───────────────────┘
                   │
┌──────────────────▼───────────────────┐
│        LangChain RAG Pipeline        │
│  • Query Understanding               │
│  • Tool Retrieval (PGVector)         │
│  • Response Generation (LLM)         │
└──────────────────┬───────────────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
┌───────▼───────┐     ┌───────▼───────┐
│   Supabase    │     │  HF Inference │
│  PGVector DB  │     │   Providers   │
│  (344 tools)  │     │  (Llama 3.1)  │
└───────────────┘     └───────────────┘
```
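
To make the diagram concrete, the sketch below shows one plausible way the retrieval and generation steps could be wired together with LangChain and the Hugging Face Hub client. The vector-store wrapper, collection name, and prompt are illustrative assumptions; the actual wiring lives in `src/vectorstore.py`, `src/rag_pipeline.py`, and `src/llm_client.py` and may differ.

```python
# Hypothetical sketch of the RAG wiring; see src/ for the real implementation.
import os

from huggingface_hub import InferenceClient
from langchain_community.vectorstores import PGVector
from langchain_huggingface import HuggingFaceEmbeddings

# Embed queries with the same model used to embed the tool database.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Connect to the Supabase Postgres instance that holds the tool vectors.
store = PGVector(
    connection_string=os.environ["SUPABASE_CONNECTION_STRING"],
    embedding_function=embeddings,
    collection_name="bellingcat_tools",  # placeholder collection name
)

def investigate(query: str, k: int = 5) -> str:
    # Retrieve the k most relevant tools for the query.
    docs = store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Ask the LLM to turn the retrieved tools into a methodology.
    client = InferenceClient(token=os.environ["HF_TOKEN"])
    response = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are an OSINT investigation assistant."},
            {"role": "user", "content": f"Tools:\n{context}\n\nQuestion: {query}"},
        ],
        max_tokens=2000,
    )
    return response.choices[0].message.content
```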

## 🚀 Quick Start

### Local Development

1. **Clone the repository**
   ```bash
   git clone <your-repo-url>
   cd osint-llm
   ```

2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**
   ```bash
   cp .env.example .env
   # Edit .env with your credentials
   ```

   Required variables:
   - `SUPABASE_CONNECTION_STRING`: Your Supabase PostgreSQL connection string
   - `HF_TOKEN`: Your Hugging Face API token

4. **Run the application**
   ```bash
   python app.py
   ```

   The app will be available at `http://localhost:7860`

### Hugging Face Spaces Deployment

1. **Create a new Space** on Hugging Face
2. **Push this repository** to your Space
3. **Set environment variables** in Space settings:
   - `SUPABASE_CONNECTION_STRING`
   - `HF_TOKEN`
4. **Deploy** - The Space will automatically build and launch

## 📚 Usage

### Chat Interface

Simply ask your investigation questions:

```
"How do I investigate a suspicious domain?"
"What tools can I use to verify an image's authenticity?"
"How can I trace the origin of a social media account?"
```

The assistant will provide:
1. Investigation overview
2. Step-by-step methodology
3. Recommended tools with descriptions and URLs
4. Best practices and safety considerations
5. Expected outcomes

### Tool Search

Use the "Tool Search" tab to directly search for OSINT tools by category or purpose.

### API Access

This app automatically exposes REST API endpoints for external integration.

**Python Client:**

```python
from gradio_client import Client

client = Client("your-space-url")
result = client.predict(
    "How do I investigate a domain?",
    api_name="/investigate"
)
print(result)
```

**JavaScript Client:**

```javascript
import { Client } from "@gradio/client";

const client = await Client.connect("your-space-url");
const result = await client.predict("/investigate", {
  message: "How do I investigate a domain?"
});
console.log(result.data);
```

**cURL:**

```bash
curl -X POST "https://your-space.hf.space/call/investigate" \
     -H "Content-Type: application/json" \
     -d '{"data": ["How do I investigate a domain?"]}'
```

**Available Endpoints:**
- `/call/investigate` - Main investigation assistant
- `/call/search_tools` - Direct tool search
- `/gradio_api/openapi.json` - OpenAPI specification
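
The `/search_tools` endpoint can be called the same way as `/investigate`. The single string argument below is an assumption about its signature; check the Space's "Use via API" page or `/gradio_api/openapi.json` for the exact parameters.

```python
from gradio_client import Client

client = Client("your-space-url")

# Assumed signature: one query string; verify against the Space's API docs.
tools = client.predict(
    "reverse image search",
    api_name="/search_tools",
)
print(tools)
```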

## 🗄️ Database

The app uses Supabase with the PGVector extension to store and retrieve OSINT tools.

**Database Schema:**
```sql
CREATE TABLE bellingcat_tools (
  id BIGINT PRIMARY KEY,
  name TEXT,
  category TEXT,
  content TEXT,
  url TEXT,
  cost TEXT,
  details TEXT,
  embedding VECTOR,
  created_at TIMESTAMP WITH TIME ZONE
);
```
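
Under the hood, retrieval is a nearest-neighbour query over the `embedding` column. The snippet below is an illustrative example, not code from the app (which goes through LangChain's vector store), of how such a lookup could be issued directly with `psycopg2` and pgvector's cosine-distance operator, assuming the connection string is a plain `postgresql://` DSN.

```python
# Illustrative raw-SQL lookup against the tools table.
import os

import psycopg2

# Placeholder query vector; in practice this comes from the embedding model
# (all-MiniLM-L6-v2 produces 384-dimensional vectors).
query_embedding = [0.0] * 384
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect(os.environ["SUPABASE_CONNECTION_STRING"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT name, category, url
        FROM bellingcat_tools
        ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
        LIMIT 5;
        """,
        (vector_literal,),
    )
    for name, category, url in cur.fetchall():
        print(f"{name} [{category}] {url}")
```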

**Tool Categories:**
- Archiving & Preservation
- Social Media Investigation
- Image & Video Analysis
- Domain & Network Investigation
- Geolocation
- Data Extraction
- Verification & Fact-Checking
- And more...

## 🛠️ Technology Stack

- **UI/API**: [Gradio](https://gradio.app/) - Automatic API generation
- **RAG Framework**: [LangChain](https://langchain.com/) - Retrieval pipeline
- **Vector Database**: [Supabase](https://supabase.com/) with PGVector extension
- **Embeddings**: HuggingFace sentence-transformers
- **LLM**: [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/) - Llama 3.1
- **Language**: Python 3.9+

## 📁 Project Structure

```
osint-llm/
├── app.py                   # Main Gradio application
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── README.md                # This file
└── src/
    ├── __init__.py
    ├── vectorstore.py       # Supabase PGVector connection
    ├── rag_pipeline.py      # LangChain RAG logic
    ├── llm_client.py        # Inference Provider client
    └── prompts.py           # Investigation prompt templates
```

## ⚙️ Configuration

### Environment Variables

See `.env.example` for all available configuration options.

**Required:**
- `SUPABASE_CONNECTION_STRING` - PostgreSQL connection string
- `HF_TOKEN` - Hugging Face API token

**Optional:**
- `LLM_MODEL` - Model to use (default: meta-llama/Llama-3.1-8B-Instruct)
- `LLM_TEMPERATURE` - Generation temperature (default: 0.7)
- `LLM_MAX_TOKENS` - Max tokens to generate (default: 2000)
- `RETRIEVAL_K` - Number of tools to retrieve (default: 5)
- `EMBEDDING_MODEL` - Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
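
As a rough illustration of how these defaults might be applied (the actual handling is in the application code and may differ):

```python
# Hypothetical config loading using the documented defaults.
import os

LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2000"))
RETRIEVAL_K = int(os.getenv("RETRIEVAL_K", "5"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
```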

### Supported LLM Models

- `meta-llama/Llama-3.1-8B-Instruct` (recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct`
- `Qwen/Qwen2.5-72B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`

## 💰 Cost Considerations

### Hugging Face Inference Providers
- Free tier: $0.10/month credits
- PRO tier: $2.00/month credits + pay-as-you-go
- Typical cost: ~$0.001-0.01 per query
- Recommended budget: $10-50/month for moderate usage

### Supabase
- Free tier sufficient for most use cases
- PGVector operations are standard database queries

### Hugging Face Spaces
- Free CPU hosting available
- GPU upgrade: ~$0.60/hour (optional, not required)

## 🔮 Future Enhancements

- [ ] MCP server integration for AI agent tool use
- [ ] Multi-turn conversation with memory
- [ ] User authentication and query logging
- [ ] Additional tool databases and sources
- [ ] Export methodologies as PDF/markdown
- [ ] Tool usage examples and tutorials
- [ ] Community-contributed tool reviews

## 🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## 📄 License

MIT License - See LICENSE file for details

## 🙏 Acknowledgments

- Tool data sourced from [Bellingcat's Online Investigation Toolkit](https://www.bellingcat.com/)
- Built with support from the OSINT community

## 📞 Support

For issues or questions:
- Open an issue on GitHub
- Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces)
- Review the [Gradio documentation](https://gradio.app/docs/)

---

Built with ❤️ for the OSINT community