# LlamaParse Integration Guide

## Overview

The HBV AI Assistant now uses **LlamaParse** for advanced PDF parsing, replacing PyMuPDF4LLMLoader. LlamaParse excels at:

- ✅ Borderless tables (common in medical guidelines)
- ✅ Complex document layouts
- ✅ Hierarchical section preservation
- ✅ Accurate page numbering
- ✅ Medical terminology and dosage tables

## Setup

### 1. Install Required Packages

```bash
pip install llama-parse llama-index-core
```

### 2. Get Your API Key

1. Visit: https://cloud.llamaindex.ai/api-key
2. Sign up or log in and generate an API key
3. Copy your API key (format: `llx-...`)

### 3. Configure Environment Variables

Add to your `.env` file:

```env
# Required: LlamaParse API Key
LLAMA_CLOUD_API_KEY=llx-your-api-key-here
# Optional: Enable premium GPT-4o mode (higher accuracy, costs more)
LLAMA_PREMIUM_MODE=False
```
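
As a rough illustration of how these two variables might be read at startup, here is a minimal sketch. The helper name `load_llama_settings` and the set of strings accepted as "true" are assumptions made for this example, not the project's actual `core/config.py`. Loading the `.env` file itself (for instance with a dotenv library) is left out so the snippet needs only the standard library.

```python
import os
from typing import Mapping, Optional, Tuple

# Hypothetical set of strings treated as "true"; the real project may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def load_llama_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, bool]:
    """Return (api_key, premium_mode) read from environment-style variables."""
    env = os.environ if env is None else env
    api_key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending an unauthenticated request later.
        raise ValueError("LlamaCloud API key not found")
    premium = env.get("LLAMA_PREMIUM_MODE", "False").strip().lower() in _TRUTHY
    return api_key, premium


if __name__ == "__main__":
    example_env = {"LLAMA_CLOUD_API_KEY": "llx-example", "LLAMA_PREMIUM_MODE": "True"}
    print(load_llama_settings(example_env))
```

Passing a plain mapping instead of relying on `os.environ` keeps the helper easy to exercise in tests.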

## How It Works

### Automatic Processing Pipeline

When you process new documents from `data/new_data/`, the system automatically:

1. **Detects PDF files** in `data/new_data/PROVIDER/` directories
2. **Uses LlamaParse** with medical document optimizations:
   - Preserves table structures (including borderless tables)
   - Maintains hierarchical headings
   - Extracts dosage information accurately
   - Keeps reference citations intact
3. **Splits by page** for accurate page numbering
4. **Extracts metadata**: provider, disease, page numbers
5. **Updates vector store** for RAG queries
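
The detection step (step 1) can be pictured with a short sketch. It assumes the provider name is simply the first-level directory under `data/new_data/`, as the `PROVIDER` placeholder suggests. The function name `find_new_pdfs` is hypothetical and is not taken from the project's code.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_new_pdfs(new_data_dir: Path) -> Dict[str, List[Path]]:
    """Group PDF files by provider, using the first-level directory name."""
    by_provider: Dict[str, List[Path]] = {}
    # Match exactly one directory level: new_data/PROVIDER/file.pdf
    for pdf in sorted(new_data_dir.glob("*/*.pdf")):
        by_provider.setdefault(pdf.parent.name, []).append(pdf)
    return by_provider


if __name__ == "__main__":
    # Build a throwaway directory tree to demonstrate the grouping.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "SASLT").mkdir()
        (root / "SASLT" / "guideline.pdf").write_bytes(b"%PDF-")
        for provider, files in find_new_pdfs(root).items():
            print(provider, [f.name for f in files])
```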

### Configuration Options

#### Basic Mode (Default)

```env
# In .env
LLAMA_CLOUD_API_KEY=llx-your-key
LLAMA_PREMIUM_MODE=False
```

- Uses standard LlamaParse parsing
- Good accuracy for most medical documents
- Lower cost

#### Premium Mode

```env
# In .env
LLAMA_CLOUD_API_KEY=llx-your-key
LLAMA_PREMIUM_MODE=True
```

- Uses GPT-4o for parsing
- Highest accuracy for complex tables
- Higher cost per page
- Recommended for critical medical guidelines

## Usage

### Processing New Documents

1. **Place PDFs** in the appropriate directory:

   ```
   data/new_data/SASLT/guideline.pdf
   data/new_data/WHO/recommendations.pdf
   ```

2. **Run the processing** (automatic on app startup, or manually):

   ```python
   from core.utils import process_new_data_and_update_vector_store

   # Process all new documents
   vector_store = process_new_data_and_update_vector_store()
   ```

3. **Files are automatically moved** to `data/processed_data/` after successful processing
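
The move in step 3 might look roughly like the sketch below. The timestamp suffix format is inferred from the sample log line in the Example Workflow section (`new_guideline_20251111_143022.pdf`). The function name `archive_processed_file` and its parameters are hypothetical, so treat this as an illustration rather than the project's implementation.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_processed_file(
    pdf_path: Path, processed_root: Path, now: Optional[datetime] = None
) -> Path:
    """Move a parsed PDF to processed_root/PROVIDER/ with a timestamp suffix."""
    now = now or datetime.now()
    provider = pdf_path.parent.name  # assumed layout: new_data/PROVIDER/file.pdf
    target_dir = processed_root / provider
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: <stem>_<YYYYMMDD>_<HHMMSS><suffix>
    target = target_dir / f"{pdf_path.stem}_{now:%Y%m%d_%H%M%S}{pdf_path.suffix}"
    shutil.move(str(pdf_path), str(target))
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "new_data" / "SASLT" / "new_guideline.pdf"
        source.parent.mkdir(parents=True)
        source.write_bytes(b"%PDF-")
        moved = archive_processed_file(source, Path(tmp) / "processed_data")
        print(moved.relative_to(Path(tmp) / "processed_data"))
```

The timestamp suffix keeps a re-uploaded file with the same name from overwriting an earlier archived copy.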

### Manual PDF Loading

You can also load PDFs manually:

```python
from pathlib import Path

from core.data_loaders import load_multiple_pdfs, load_pdf_documents_advanced

# Basic usage (reads API key from environment)
pdf_path = Path("data/new_data/SASLT/guideline.pdf")
documents = load_pdf_documents_advanced(pdf_path)

# With explicit API key
documents = load_pdf_documents_advanced(
    pdf_path,
    api_key="llx-your-key-here",
    premium_mode=True,
)

# Batch processing
pdf_dir = Path("data/new_data/SASLT")
all_documents = load_multiple_pdfs(pdf_dir)
```

## Document Metadata

Each processed document includes:

```python
{
    "source": "SASLT_2021.pdf",
    "disease": "HBV",
    "provider": "SASLT",
    "page_number": 6,
    "document_index": 5,
    "parser": "llamaparse",
    "premium_mode": False
}
```
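
To make the relationship between these fields concrete, the sketch below builds such a dictionary for one page. It assumes that `page_number` is `document_index + 1` (consistent with the example values 6 and 5) and that `provider` is the name of the PDF's parent directory. The function name `build_page_metadata` and the `"HBV"` default are illustrative assumptions.

```python
from pathlib import Path
from typing import Any, Dict


def build_page_metadata(
    pdf_path: Path,
    document_index: int,
    disease: str = "HBV",
    premium_mode: bool = False,
) -> Dict[str, Any]:
    """Assemble per-page metadata in the shape shown above."""
    return {
        "source": pdf_path.name,
        "disease": disease,
        "provider": pdf_path.parent.name,  # assumed: directory name is the provider
        "page_number": document_index + 1,  # assumed: 1-based page, 0-based index
        "document_index": document_index,
        "parser": "llamaparse",
        "premium_mode": premium_mode,
    }


if __name__ == "__main__":
    print(build_page_metadata(Path("data/new_data/SASLT/SASLT_2021.pdf"), 5))
```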

## Parsing Instructions

LlamaParse is configured with medical-specific instructions:

### Basic Mode

```
"This is a medical guideline document.
Pay special attention to tables (including borderless tables),
clinical recommendations, dosage information, and reference citations.
Preserve table structure and maintain hierarchical headings."
```

### Premium Mode

```
"Medical guideline document with complex tables. Instructions:
0. Keep the original text intact without changing anything
1. Preserve all table structures, especially borderless tables
2. Maintain hierarchical organization of sections and subsections
3. Keep dosage tables and treatment algorithms intact
4. Preserve reference numbers and citations
5. Identify and mark clinical recommendation levels
6. Extract figures and their captions accurately"
```
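
One plausible way to select between the two instruction sets is sketched below. The snippet only builds a dictionary of keyword arguments and deliberately does not import `llama_parse`, so it runs without the package or network access. The keyword names (`api_key`, `result_type`, `premium_mode`, `parsing_instruction`) reflect my understanding of the `LlamaParse` constructor and may differ between releases, so check them against your installed version. The choice of `"markdown"` output and the helper name `build_parser_kwargs` are assumptions.

```python
from typing import Any, Dict

BASIC_INSTRUCTION = (
    "This is a medical guideline document. "
    "Pay special attention to tables (including borderless tables), "
    "clinical recommendations, dosage information, and reference citations. "
    "Preserve table structure and maintain hierarchical headings."
)

PREMIUM_INSTRUCTION = "\n".join(
    [
        "Medical guideline document with complex tables. Instructions:",
        "0. Keep the original text intact without changing anything",
        "1. Preserve all table structures, especially borderless tables",
        "2. Maintain hierarchical organization of sections and subsections",
        "3. Keep dosage tables and treatment algorithms intact",
        "4. Preserve reference numbers and citations",
        "5. Identify and mark clinical recommendation levels",
        "6. Extract figures and their captions accurately",
    ]
)


def build_parser_kwargs(api_key: str, premium_mode: bool = False) -> Dict[str, Any]:
    """Pick the instruction text that matches the requested mode."""
    return {
        "api_key": api_key,
        "result_type": "markdown",  # assumed output format
        "premium_mode": premium_mode,
        "parsing_instruction": PREMIUM_INSTRUCTION if premium_mode else BASIC_INSTRUCTION,
    }


if __name__ == "__main__":
    kwargs = build_parser_kwargs("llx-example", premium_mode=True)
    # With llama-parse installed, these would presumably be passed on as:
    #     parser = LlamaParse(**kwargs)
    print(sorted(kwargs))
```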

## Cost Considerations

- **Basic Mode**: ~$0.003 per page
- **Premium Mode**: ~$0.01 per page (GPT-4o)
- **Caching**: LlamaParse caches results, so re-processing the same file is free

### Cost Estimation

For a 50-page medical guideline:

- Basic: ~$0.15
- Premium: ~$0.50
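
The arithmetic behind these estimates is just pages multiplied by the approximate per-page rate. A small helper makes it easy to repeat for other document sizes. The rates are the rough figures quoted in this section, not an authoritative price list, and the function name `estimate_cost` is made up for this example.

```python
# Approximate per-page rates from this guide, in US dollars.
RATE_PER_PAGE = {"basic": 0.003, "premium": 0.01}


def estimate_cost(pages: int, mode: str = "basic") -> float:
    """Estimate the parsing cost in dollars for a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Round to 4 decimals to hide floating-point noise in the product.
    return round(pages * RATE_PER_PAGE[mode], 4)


if __name__ == "__main__":
    for mode in ("basic", "premium"):
        print(f"{mode}: ${estimate_cost(50, mode):.2f}")
```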

## Troubleshooting

### API Key Not Found

```
ValueError: LlamaCloud API key not found
```

**Solution**: Set `LLAMA_CLOUD_API_KEY` in your `.env` file

### Import Errors

```
ModuleNotFoundError: No module named 'llama_parse'
```

**Solution**: Install required packages:

```bash
pip install llama-parse llama-index-core
```

### Slow Processing

- LlamaParse processes documents in the cloud
- First-time processing takes longer (30-60 seconds per document)
- Subsequent processing uses the cache (much faster)
- Consider using `premium_mode=False` for faster processing

### Empty Results

- Check that the PDF is not corrupted
- Verify the API key is valid
- Check the logs for detailed error messages
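
For the first check, a cheap local test can rule out obviously broken files before anything is uploaded. The sketch below only verifies that the file exists, is non-empty, and starts with the `%PDF-` header. It does not prove the rest of the file is valid, and the name `looks_like_pdf` is hypothetical.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file is non-empty and begins with the PDF header."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.pdf"
        sample.write_bytes(b"%PDF-1.7\n")
        print(looks_like_pdf(sample))
```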

## Migration from PyMuPDF4LLMLoader

The integration is **backward compatible**:

- Existing processed documents remain valid
- Vector store continues to work
- Only new documents use LlamaParse
- No changes needed to existing code

### What Changed

1. **`core/data_loaders.py`**: Replaced PyMuPDF4LLMLoader with LlamaParse
2. **`core/config.py`**: Added `LLAMA_CLOUD_API_KEY` and `LLAMA_PREMIUM_MODE` settings
3. **`core/utils.py`**: Updated `_load_documents_for_file()` to use `load_pdf_documents_advanced()`

## Benefits Over PyMuPDF4LLMLoader

| Feature | PyMuPDF4LLMLoader | LlamaParse |
|---------|-------------------|------------|
| Borderless tables | ❌ Poor | ✅ Excellent |
| Complex layouts | ⚠️ Moderate | ✅ Excellent |
| Medical terminology | ⚠️ Moderate | ✅ Excellent |
| Page numbering | ✅ Good | ✅ Excellent |
| Processing speed | ✅ Fast (local) | ⚠️ Slower (cloud) |
| Cost | ✅ Free | ⚠️ Paid API |
| Accuracy | ⚠️ Moderate | ✅ High |

## Example Workflow

```python
# 1. Set up environment
# Add to .env:
# LLAMA_CLOUD_API_KEY=llx-your-key-here
# LLAMA_PREMIUM_MODE=False

# 2. Place new PDFs
# data/new_data/SASLT/new_guideline.pdf

# 3. Process automatically (on app startup)
# Or manually:
from core.utils import process_new_data_and_update_vector_store

vector_store = process_new_data_and_update_vector_store()

# Output:
# ✅ Parsing PDF with LlamaParse (Premium: False): new_guideline.pdf
# ✅ Loaded 50 pages from PDF: new_guideline.pdf
# ✅ Split 50 documents into 245 chunks
# ✅ Added 245 new chunks to existing vector store
# 📦 Moved processed file: new_guideline.pdf -> SASLT/new_guideline_20251111_143022.pdf

# 4. Query the system
from core.agent import answer_question

response = answer_question(
    "What is the recommended treatment for HBeAg-positive chronic hepatitis B?"
)
print(response)
```

## Support

For issues or questions:

1. Check the logs in `logs/app.log`
2. Verify the API key is valid
3. Review the LlamaParse documentation: https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/
4. Check that environment variables are set correctly

---

**Last Updated**: November 11, 2025

**Integration Status**: ✅ Complete and Production Ready