Spaces:
Sleeping
Sleeping
File size: 6,611 Bytes
4a17bbc |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 |
# LlamaParse Integration Summary
## Changes Made
### 1. **core/data_loaders.py** - Complete Replacement
**Status**: β
Complete
**Changes**:
- β Removed: `PyMuPDF4LLMLoader` and `TesseractBlobParser`
- β
Added: `LlamaParse` and `SimpleDirectoryReader` from llama-index
- β
Added: `os` module for environment variable handling
**New Functions**:
1. `load_pdf_documents(pdf_path, api_key=None)` - Basic LlamaParse loader
2. `load_pdf_documents_advanced(pdf_path, api_key=None, premium_mode=False)` - Advanced loader with premium features
3. `load_multiple_pdfs(pdf_directory, api_key=None, file_pattern="*.pdf")` - Batch processing
**Key Features**:
- Medical document optimized parsing instructions
- Accurate page numbering with `split_by_page=True`
- Preserves borderless tables and complex layouts
- Enhanced metadata tracking
- Premium mode option for GPT-4o parsing
---
### 2. **core/config.py** - Configuration Updates
**Status**: β
Complete
**Changes**:
```python
# Added to Settings class
LLAMA_CLOUD_API_KEY: str | None = None
LLAMA_PREMIUM_MODE: bool = False
```
**Purpose**:
- Store LlamaParse API key from environment variables
- Control premium/basic parsing mode
- Centralized configuration management
---
### 3. **core/utils.py** - Pipeline Integration
**Status**: β
Complete
**Changes**:
1. **Import Update** (Line 12):
```python
from .config import get_embedding_model, VECTOR_STORE_DIR, CHUNKS_PATH, NEW_DATA, PROCESSED_DATA, settings
```
2. **Function Update** `_load_documents_for_file()` (Lines 118-141):
```python
def _load_documents_for_file(file_path: Path) -> List[Document]:
try:
if file_path.suffix.lower() == '.pdf':
# Use advanced LlamaParse loader with settings from config
api_key = settings.LLAMA_CLOUD_API_KEY
premium_mode = settings.LLAMA_PREMIUM_MODE
return data_loaders.load_pdf_documents_advanced(
file_path,
api_key=api_key,
premium_mode=premium_mode
)
return data_loaders.load_markdown_documents(file_path)
except Exception as e:
logger.error(f"Failed to load {file_path}: {e}")
return []
```
**Impact**:
- All PDF processing now uses LlamaParse automatically
- Reads configuration from environment variables
- Maintains backward compatibility with markdown files
---
## New Files Created
### 1. **LLAMAPARSE_INTEGRATION.md**
Complete documentation including:
- Setup instructions
- Configuration guide
- Usage examples
- Cost considerations
- Troubleshooting
- Migration guide
### 2. **test_llamaparse.py**
Test suite with:
- Configuration checker
- Single PDF test
- Batch processing test
- Full pipeline test
### 3. **INTEGRATION_SUMMARY.md** (this file)
Quick reference for all changes
---
## Environment Variables Required
Add to your `.env` file:
```env
# Required for LlamaParse
LLAMA_CLOUD_API_KEY=llx-your-api-key-here
# Optional: Enable premium mode (default: False)
LLAMA_PREMIUM_MODE=False
# Existing (still required)
OPENAI_API_KEY=your-openai-key
```
---
## Installation Requirements
```bash
pip install llama-parse llama-index-core
```
---
## How to Use
### Automatic Processing (Recommended)
1. Set `LLAMA_CLOUD_API_KEY` in `.env`
2. Place PDFs in `data/new_data/PROVIDER/`
3. Run your application - documents are processed automatically on startup
### Manual Processing
```python
from core.utils import process_new_data_and_update_vector_store
# Process all new documents
vector_store = process_new_data_and_update_vector_store()
```
### Direct PDF Loading
```python
from pathlib import Path
from core.data_loaders import load_pdf_documents_advanced
pdf_path = Path("data/new_data/SASLT/guideline.pdf")
documents = load_pdf_documents_advanced(pdf_path)
```
---
## Testing
Run the test suite:
```bash
python test_llamaparse.py
```
This will:
1. β
Check configuration
2. β
Test single PDF loading
3. β
(Optional) Test batch processing
4. β
(Optional) Test full pipeline
---
## Backward Compatibility
β
**Fully backward compatible**:
- Existing processed documents remain valid
- Vector store continues to work
- Markdown processing unchanged
- No breaking changes to API
---
## Benefits
| Aspect | Before (PyMuPDF4LLMLoader) | After (LlamaParse) |
|--------|---------------------------|-------------------|
| **Borderless Tables** | β Poor | β
Excellent |
| **Complex Layouts** | β οΈ Moderate | β
Excellent |
| **Medical Terminology** | β οΈ Moderate | β
Excellent |
| **Page Numbering** | β
Good | β
Excellent |
| **Processing Speed** | β
Fast (local) | β οΈ Slower (cloud) |
| **Cost** | β
Free | β οΈ ~$0.003-0.01/page |
| **Accuracy** | β οΈ Moderate | β
High |
---
## Cost Estimation
### Basic Mode (~$0.003/page)
- 50-page guideline: ~$0.15
- 100-page guideline: ~$0.30
### Premium Mode (~$0.01/page)
- 50-page guideline: ~$0.50
- 100-page guideline: ~$1.00
**Note**: LlamaParse caches results, so re-processing is free.
---
## Workflow Example
```
1. User places PDF in data/new_data/SASLT/
βββ new_guideline.pdf
2. Application startup triggers processing
βββ Detects new PDF
βββ Calls load_pdf_documents_advanced()
βββ LlamaParse processes with medical optimizations
βββ Extracts 50 pages with accurate metadata
βββ Returns Document objects
3. Pipeline continues
βββ Splits into 245 chunks
βββ Updates vector store
βββ Moves to data/processed_data/SASLT/new_guideline_20251111_143022.pdf
4. Ready for RAG queries
βββ Vector store contains new guideline content
```
---
## Next Steps
1. β
Set `LLAMA_CLOUD_API_KEY` in `.env`
2. β
Install dependencies: `pip install llama-parse llama-index-core`
3. β
Test with: `python test_llamaparse.py`
4. β
Place PDFs in `data/new_data/PROVIDER/`
5. β
Run application and verify processing
---
## Support & Troubleshooting
### Common Issues
**1. API Key Not Found**
```
ValueError: LlamaCloud API key not found
```
β Set `LLAMA_CLOUD_API_KEY` in `.env`
**2. Import Errors**
```
ModuleNotFoundError: No module named 'llama_parse'
```
β Run: `pip install llama-parse llama-index-core`
**3. Slow Processing**
β Normal for cloud processing (30-60s per document)
β Subsequent runs use cache (much faster)
### Logs
Check `logs/app.log` for detailed processing information
---
**Integration Date**: November 11, 2025
**Status**: β
Production Ready
**Version**: 1.0
|