Spaces:

moazx
/

HBV_AI_Assistant

Sleeping

File size: 6,611 Bytes

4a17bbc

# LlamaParse Integration Summary

## Changes Made

### 1. **core/data_loaders.py** - Complete Replacement
**Status**: ✅ Complete

**Changes**:
- ❌ Removed: `PyMuPDF4LLMLoader` and `TesseractBlobParser`
- ✅ Added: `LlamaParse` and `SimpleDirectoryReader` from llama-index
- ✅ Added: `os` module for environment variable handling

**New Functions**:
1. `load_pdf_documents(pdf_path, api_key=None)` - Basic LlamaParse loader
2. `load_pdf_documents_advanced(pdf_path, api_key=None, premium_mode=False)` - Advanced loader with premium features
3. `load_multiple_pdfs(pdf_directory, api_key=None, file_pattern="*.pdf")` - Batch processing

**Key Features**:
- Medical document optimized parsing instructions
- Accurate page numbering with `split_by_page=True`
- Preserves borderless tables and complex layouts
- Enhanced metadata tracking
- Premium mode option for GPT-4o parsing

---

### 2. **core/config.py** - Configuration Updates
**Status**: ✅ Complete

**Changes**:
```python
# Added to Settings class
LLAMA_CLOUD_API_KEY: str | None = None
LLAMA_PREMIUM_MODE: bool = False
```

**Purpose**:
- Store LlamaParse API key from environment variables
- Control premium/basic parsing mode
- Centralized configuration management

---

### 3. **core/utils.py** - Pipeline Integration
**Status**: ✅ Complete

**Changes**:
1. **Import Update** (Line 12):
   ```python
   from .config import get_embedding_model, VECTOR_STORE_DIR, CHUNKS_PATH, NEW_DATA, PROCESSED_DATA, settings
   ```

2. **Function Update** `_load_documents_for_file()` (Lines 118-141):
   ```python
   def _load_documents_for_file(file_path: Path) -> List[Document]:
       try:
           if file_path.suffix.lower() == '.pdf':
               # Use advanced LlamaParse loader with settings from config
               api_key = settings.LLAMA_CLOUD_API_KEY
               premium_mode = settings.LLAMA_PREMIUM_MODE
               
               return data_loaders.load_pdf_documents_advanced(
                   file_path,
                   api_key=api_key,
                   premium_mode=premium_mode
               )
           return data_loaders.load_markdown_documents(file_path)
       except Exception as e:
           logger.error(f"Failed to load {file_path}: {e}")
           return []
   ```

**Impact**:
- All PDF processing now uses LlamaParse automatically
- Reads configuration from environment variables
- Maintains backward compatibility with markdown files

---

## New Files Created

### 1. **LLAMAPARSE_INTEGRATION.md**
Complete documentation including:
- Setup instructions
- Configuration guide
- Usage examples
- Cost considerations
- Troubleshooting
- Migration guide

### 2. **test_llamaparse.py**
Test suite with:
- Configuration checker
- Single PDF test
- Batch processing test
- Full pipeline test

### 3. **INTEGRATION_SUMMARY.md** (this file)
Quick reference for all changes

---

## Environment Variables Required

Add to your `.env` file:

```env
# Required for LlamaParse
LLAMA_CLOUD_API_KEY=llx-your-api-key-here

# Optional: Enable premium mode (default: False)
LLAMA_PREMIUM_MODE=False

# Existing (still required)
OPENAI_API_KEY=your-openai-key
```

---

## Installation Requirements

```bash
pip install llama-parse llama-index-core
```

---

## How to Use

### Automatic Processing (Recommended)
1. Set `LLAMA_CLOUD_API_KEY` in `.env`
2. Place PDFs in `data/new_data/PROVIDER/`
3. Run your application - documents are processed automatically on startup

### Manual Processing
```python
from core.utils import process_new_data_and_update_vector_store

# Process all new documents
vector_store = process_new_data_and_update_vector_store()
```

### Direct PDF Loading
```python
from pathlib import Path
from core.data_loaders import load_pdf_documents_advanced

pdf_path = Path("data/new_data/SASLT/guideline.pdf")
documents = load_pdf_documents_advanced(pdf_path)
```

---

## Testing

Run the test suite:
```bash
python test_llamaparse.py
```

This will:
1. ✅ Check configuration
2. ✅ Test single PDF loading
3. ✅ (Optional) Test batch processing
4. ✅ (Optional) Test full pipeline

---

## Backward Compatibility

✅ **Fully backward compatible**:
- Existing processed documents remain valid
- Vector store continues to work
- Markdown processing unchanged
- No breaking changes to API

---

## Benefits

| Aspect | Before (PyMuPDF4LLMLoader) | After (LlamaParse) |
|--------|---------------------------|-------------------|
| **Borderless Tables** | ❌ Poor | ✅ Excellent |
| **Complex Layouts** | ⚠️ Moderate | ✅ Excellent |
| **Medical Terminology** | ⚠️ Moderate | ✅ Excellent |
| **Page Numbering** | ✅ Good | ✅ Excellent |
| **Processing Speed** | ✅ Fast (local) | ⚠️ Slower (cloud) |
| **Cost** | ✅ Free | ⚠️ ~$0.003-0.01/page |
| **Accuracy** | ⚠️ Moderate | ✅ High |

---

## Cost Estimation

### Basic Mode (~$0.003/page)
- 50-page guideline: ~$0.15
- 100-page guideline: ~$0.30

### Premium Mode (~$0.01/page)
- 50-page guideline: ~$0.50
- 100-page guideline: ~$1.00

**Note**: LlamaParse caches results, so re-processing is free.

---

## Workflow Example

```
1. User places PDF in data/new_data/SASLT/
   └── new_guideline.pdf

2. Application startup triggers processing
   ├── Detects new PDF
   ├── Calls load_pdf_documents_advanced()
   ├── LlamaParse processes with medical optimizations
   ├── Extracts 50 pages with accurate metadata
   └── Returns Document objects

3. Pipeline continues
   ├── Splits into 245 chunks
   ├── Updates vector store
   └── Moves to data/processed_data/SASLT/new_guideline_20251111_143022.pdf

4. Ready for RAG queries
   └── Vector store contains new guideline content
```

---

## Next Steps

1. ✅ Set `LLAMA_CLOUD_API_KEY` in `.env`
2. ✅ Install dependencies: `pip install llama-parse llama-index-core`
3. ✅ Test with: `python test_llamaparse.py`
4. ✅ Place PDFs in `data/new_data/PROVIDER/`
5. ✅ Run application and verify processing

---

## Support & Troubleshooting

### Common Issues

**1. API Key Not Found**
```
ValueError: LlamaCloud API key not found
```
→ Set `LLAMA_CLOUD_API_KEY` in `.env`

**2. Import Errors**
```
ModuleNotFoundError: No module named 'llama_parse'
```
→ Run: `pip install llama-parse llama-index-core`

**3. Slow Processing**
→ Normal for cloud processing (30-60s per document)
→ Subsequent runs use cache (much faster)

### Logs
Check `logs/app.log` for detailed processing information

---

**Integration Date**: November 11, 2025  
**Status**: ✅ Production Ready  
**Version**: 1.0