	# OCR Utilities

	This directory contains utility modules for the Historical OCR project.

	## PDF OCR Processing

	The `pdf_ocr.py` module provides specialized functionality for processing PDF documents with OCR.

	### Features

	- Robust PDF-to-Image Conversion: Converts PDF documents to images using optimized settings before OCR processing
	- Multi-Page Support: Handles multi-page documents and can process all pages, the first N pages (`max_pages`), or a specific list of pages (`custom_pages`)
	- Memory-Efficient Processing: Processes PDFs in batches to prevent memory issues with large documents
	- Fallback Mechanism: Falls back to the internal PDF processing in `structured_ocr.py` if direct conversion fails
	- Cleanup Management: Automatically cleans up temporary files after processing
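
The batching idea can be sketched as follows. This is a minimal illustration, not the module's actual code: `page_batches` and `convert_in_batches` are hypothetical names, and the batch size of 5 is an assumption. The sketch relies on `pdf2image`'s `pdfinfo_from_path` to read the page count and on the `first_page`/`last_page` arguments of `convert_from_path` to load one batch at a time.

```python
from typing import Iterator, List, Tuple


def page_batches(total_pages: int, batch_size: int = 5) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive, 1-based (first, last) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


def convert_in_batches(pdf_path: str, batch_size: int = 5, dpi: int = 200) -> Iterator[list]:
    """Yield lists of PIL images, one batch at a time (hypothetical helper)."""
    # Imported lazily so page_batches() works without pdf2image/poppler installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        # Only the images of the current batch are created per iteration.
        yield convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
```

Because `convert_in_batches` is a generator, a caller that runs OCR on each batch and then discards it never holds more than `batch_size` page images at once.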

	### Key Components

	- `PDFOCR`: Main class for processing PDF files with OCR
	- `PDFConversionResult`: Helper class that holds PDF conversion results and manages cleanup
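
To illustrate what "manages cleanup" can look like, here is a hedged sketch of a result holder that owns a temporary directory. `ConversionResultSketch` and its attributes are hypothetical; the real `PDFConversionResult` may be structured differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class ConversionResultSketch:
    """Hypothetical stand-in for a conversion result that owns temp files."""

    def __init__(self) -> None:
        # Page images would be written into this directory.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
        self.image_paths: List[Path] = []

    def cleanup(self) -> None:
        # ignore_errors makes cleanup safe to call more than once.
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        self.image_paths.clear()

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and when the body raises.
        self.cleanup()
```

Used in a `with` block, the temporary directory is removed even if OCR fails partway through.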

	### Basic Usage

	```python
	from utils.pdf_ocr import PDFOCR

	# Initialize the processor
	processor = PDFOCR()

	# Process a PDF file (all pages, with vision model)
	result = processor.process_pdf('document.pdf')

	# Process a PDF file (specific pages, with vision model)
	result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

	# Process a PDF file (first N pages, without vision model)
	result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

	# Process a PDF file with custom prompt
	result = processor.process_pdf(
	    'document.pdf',
	    custom_prompt="This is a historical newspaper with multiple columns."
	)

	# Save results to JSON
	output_path = processor.save_json_output('document.pdf', 'results.json')
	```

	### Command Line Usage

	The module can also be used directly from the command line:

	```bash
	python utils/pdf_ocr.py document.pdf --output results.json
	python utils/pdf_ocr.py document.pdf --max-pages 3
	python utils/pdf_ocr.py document.pdf --pages 1,3,5
	python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
	python utils/pdf_ocr.py document.pdf --no-vision
	```
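
For orientation, the flags above could be wired up with `argparse` roughly as follows. This is a sketch inferred from the commands shown, not the module's actual parser; `build_parser` and `parse_pages` are hypothetical names.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Process a PDF with OCR")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", help="Write results to this JSON file")
    parser.add_argument("--max-pages", type=int, help="Process at most this many pages")
    parser.add_argument("--pages", type=parse_pages, help="1-based pages, e.g. 1,3,5")
    parser.add_argument("--prompt", help="Custom instructions for OCR processing")
    parser.add_argument("--no-vision", action="store_true", help="Disable the vision model")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
print(args.pages, args.max_pages, args.no_vision)  # [1, 3, 5] None True
```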

	### How It Works

	1. The module first attempts to convert the PDF to images using `pdf2image`.
	2. If `use_vision` is enabled, it processes the first page with the vision model for detailed analysis.
	3. Any remaining pages are processed with the text model for efficiency.
	4. The text from all pages is combined into a single result with appropriate metadata.
	5. If the direct conversion in step 1 fails, it falls back to using `structured_ocr.py` for PDF processing.
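
The five steps above can be sketched as a small control flow. All callables here (`convert`, `ocr_vision`, `ocr_text`, `fallback`) are hypothetical placeholders passed in as arguments, and the result keys are assumptions. The sketch shows only the ordering and fallback logic described, not the real OCR calls.

```python
from typing import Callable, Dict, List


def process_pages_sketch(
    pdf_path: str,
    convert: Callable[[str], List[str]],
    ocr_vision: Callable[[str], str],
    ocr_text: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    try:
        images = convert(pdf_path)  # step 1: PDF -> page images
    except Exception:
        return fallback(pdf_path)  # step 5: direct conversion failed

    texts = []
    for index, image in enumerate(images):
        # steps 2-3: vision model for the first page only, text model otherwise
        first_page_with_vision = use_vision and index == 0
        texts.append(ocr_vision(image) if first_page_with_vision else ocr_text(image))

    # step 4: combine page texts and attach simple metadata
    return {"text": "\n\n".join(texts), "pages_processed": len(texts)}
```

Passing the OCR functions in as parameters keeps the sketch independent of any particular model client and makes the fallback path easy to exercise with stubs.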

	### Parameters

	- `pdf_path`: Path to the PDF file to process
	- `use_vision`: Whether to use the vision model for improved analysis (default: True)
	- `max_pages`: Maximum number of pages to process (default: all pages)
	- `custom_pages`: Specific page numbers to process, using 1-based indexing (e.g., [1, 3, 5])
	- `custom_prompt`: Optional custom instructions for OCR processing
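
How `custom_pages` and `max_pages` interact is not spelled out above. One plausible resolution, shown here as an assumption rather than the module's documented behavior, is that `custom_pages` takes precedence and `max_pages` applies only otherwise. `select_pages` is a hypothetical helper; it converts 1-based page numbers to 0-based indices and drops pages outside the document.

```python
from typing import List, Optional


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[List[int]] = None,
) -> List[int]:
    """Return the 0-based indices of the pages to process."""
    if custom_pages:
        # 1-based -> 0-based; skip page numbers outside the document
        return [p - 1 for p in custom_pages if 1 <= p <= total_pages]
    if max_pages is not None:
        # First N pages, capped at the document length
        return list(range(min(max_pages, total_pages)))
    return list(range(total_pages))


print(select_pages(10, custom_pages=[1, 3, 5]))  # [0, 2, 4]
print(select_pages(10, max_pages=3))             # [0, 1, 2]
```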