# OCR Utilities

This directory contains utility modules for the Historical OCR project.

## PDF OCR Processing

The `pdf_ocr.py` module processes PDF documents with OCR.

### Features

- **Robust PDF-to-Image Conversion**: Converts PDF documents to images using optimized settings before OCR processing
- **Multi-Page Support**: Handles multi-page documents and can restrict processing to specific pages or page ranges
- **Memory-Efficient Processing**: Processes PDFs in batches to prevent memory issues with large documents
- **Fallback Mechanism**: Falls back to the internal PDF processing in `structured_ocr.py` if direct conversion fails
- **Cleanup Management**: Automatically cleans up temporary files after processing

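The batching idea from the feature list can be illustrated with a small sketch. It is not taken from `pdf_ocr.py`: the helper names `page_batches` and `convert_in_batches`, the batch size, and the DPI value are all assumptions. The sketch relies on `pdf2image.convert_from_path`, which accepts `first_page` and `last_page` arguments, so only one batch of page images needs to be held in memory at a time.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path: str, total_pages: int, batch_size: int = 5, dpi: int = 150):
    """Yield one list of page images per batch (hypothetical helper, not the module's API)."""
    # Imported lazily so the range logic above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    for first, last in page_batches(total_pages, batch_size):
        # Only the pages in [first, last] are rendered for this batch.
        yield convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)


# A 12-page document split into batches of 5 pages.
print(list(page_batches(12, batch_size=5)))  # [(1, 5), (6, 10), (11, 12)]
```

Because `convert_in_batches` is a generator, each batch of images can be passed to OCR and released before the next batch is rendered.
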
### Key Components

- **PDFOCR**: Main class for processing PDF files with OCR
- **PDFConversionResult**: Helper class that holds PDF conversion results and manages cleanup

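To show what "holds conversion results and manages cleanup" could look like in practice, here is a minimal sketch of a result object that owns a temporary directory. The class `ConversionResultSketch` and its methods are hypothetical and are not the actual `PDFConversionResult` interface, which is not documented here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class ConversionResultSketch:
    """Hypothetical stand-in for a conversion result that owns its temporary files."""

    def __init__(self) -> None:
        # Each result gets its own directory so cleanup cannot touch other files.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
        self.image_paths: List[Path] = []

    def add_page(self, page_number: int, data: bytes) -> Path:
        """Store one rendered page image and remember its path."""
        path = self.temp_dir / f"page_{page_number:04d}.jpg"
        path.write_bytes(data)
        self.image_paths.append(path)
        return path

    def cleanup(self) -> None:
        """Delete the temporary directory and forget the stored paths."""
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        self.image_paths.clear()

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even when OCR raises, so temporary files are not left behind.
        self.cleanup()


with ConversionResultSketch() as result:
    result.add_page(1, b"placeholder image bytes")
    temp_dir = result.temp_dir
    print(temp_dir.exists())  # True while the result is in use

print(temp_dir.exists())  # False once the block has exited
```

Using a context manager ties the lifetime of the temporary files to the processing block, which is one common way to make cleanup automatic.
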
### Basic Usage

```python
from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
```

### Command Line Usage

The module can also be used directly from the command line:

```bash
python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
```

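For readers who want to see how flags like these are typically wired up, the following is a sketch of an `argparse` parser that accepts the same options as the commands above. It is an illustration only, not the parser used by `pdf_ocr.py`; the function names `parse_pages` and `build_parser` and the validation rules are assumptions.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into a list of 1-based page numbers."""
    pages = [int(part) for part in value.split(",") if part.strip()]
    if any(page < 1 for page in pages):
        raise argparse.ArgumentTypeError("page numbers are 1-based and must be >= 1")
    return pages


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (sketch, not the module's own parser)."""
    parser = argparse.ArgumentParser(description="Process a PDF file with OCR (sketch)")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", help="Path of the JSON file to write")
    parser.add_argument("--max-pages", type=int, help="Maximum number of pages to process")
    parser.add_argument("--pages", type=parse_pages, help="Comma-separated page numbers")
    parser.add_argument("--prompt", help="Custom instructions for OCR processing")
    parser.add_argument("--no-vision", action="store_true", help="Disable the vision model")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
use_vision = not args.no_vision
print(args.pdf_path, args.pages, args.max_pages, use_vision)
```

The `--no-vision` flag is stored as a boolean and inverted, so it maps naturally onto a `use_vision` keyword argument.
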
### How It Works

1. The module first attempts to convert the PDF to images using `pdf2image`
2. If `use_vision` is enabled, it processes the first page with the vision model for detailed analysis
3. Additional pages are processed with the text model for efficiency
4. The text from all pages is combined into a single result with appropriate metadata
5. If the direct conversion in step 1 fails, it falls back to using `structured_ocr.py` for PDF processing

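The steps above can be sketched as a single control-flow function. This is a simplified illustration under stated assumptions, not the module's implementation: `run_ocr_pipeline` is a hypothetical name, the conversion, OCR, and fallback steps are passed in as plain callables, and the shape of the returned dictionary is invented for the example.

```python
from typing import Callable, Dict, List, Sequence


def run_ocr_pipeline(
    pdf_path: str,
    convert: Callable[[str], Sequence[object]],
    vision_ocr: Callable[[object], str],
    text_ocr: Callable[[object], str],
    fallback: Callable[[str], Dict[str, object]],
    use_vision: bool = True,
) -> Dict[str, object]:
    """Convert, OCR page by page, combine the text, and fall back if conversion fails."""
    try:
        images = convert(pdf_path)  # step 1: PDF to images
    except Exception:
        return fallback(pdf_path)  # step 5: conversion failed, use the fallback path

    texts: List[str] = []
    for index, image in enumerate(images):
        if index == 0 and use_vision:
            texts.append(vision_ocr(image))  # step 2: first page with the vision model
        else:
            texts.append(text_ocr(image))  # step 3: remaining pages with the text model

    # step 4: combine page texts and attach simple metadata
    return {"text": "\n\n".join(texts), "page_count": len(texts), "used_fallback": False}


def failing_convert(path: str) -> Sequence[object]:
    raise RuntimeError("conversion failed")


# Stub callables stand in for the real conversion and OCR steps.
stubs = dict(
    vision_ocr=lambda image: f"vision:{image}",
    text_ocr=lambda image: f"text:{image}",
    fallback=lambda path: {"text": "", "page_count": 0, "used_fallback": True},
)
ok = run_ocr_pipeline("document.pdf", convert=lambda path: ["p1", "p2"], **stubs)
failed = run_ocr_pipeline("document.pdf", convert=failing_convert, **stubs)
print(ok["text"].split("\n\n"), failed["used_fallback"])
```

Passing the steps in as callables keeps the control flow testable without a PDF file or a model endpoint.
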
### Parameters

- **pdf_path**: Path to the PDF file to process
- **use_vision**: Whether to use the vision model for improved analysis (default: True)
- **max_pages**: Maximum number of pages to process (default: all pages)
- **custom_pages**: Specific page numbers to process, using 1-based indexing (e.g., `[1, 3, 5]`)
- **custom_prompt**: Custom instructions for OCR processing
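
Since `custom_pages` is 1-based while Python sequences are 0-based, a small sketch may help show how the page-selection parameters could be resolved into list indices. The function `select_page_indices` is hypothetical. The precedence of `custom_pages` over `max_pages` and the silent dropping of out-of-range pages are assumptions, not documented behavior of `process_pdf`.

```python
from typing import List, Optional, Sequence


def select_page_indices(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 0-based indices of the pages to process."""
    if custom_pages:
        # Assumption: an explicit page list takes precedence over max_pages.
        # Pages outside 1..total_pages are dropped and duplicates are removed.
        return sorted({page - 1 for page in custom_pages if 1 <= page <= total_pages})
    if max_pages is not None:
        # First N pages, never more than the document contains.
        return list(range(max(0, min(max_pages, total_pages))))
    # Default: every page.
    return list(range(total_pages))


print(select_page_indices(10, custom_pages=[1, 3, 5]))  # [0, 2, 4]
print(select_page_indices(10, max_pages=3))  # [0, 1, 2]
```
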