# Image Preprocessing for Historical Document OCR

This document describes the preprocessing options for improving OCR quality on historical documents: deskewing, thresholding, and morphological operations.

## Overview

The preprocessing pipeline offers several options to enhance image quality before OCR:

1. **Deskewing**: Detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods, with optional pre-blur
3. **Morphological Operations**: Cleans up binary images by removing noise or filling gaps
4. **Document-Type Specific Settings**: Preprocessing configurations tailored to different document types
## Configuration

Preprocessing options are set in `config.py` and can be tuned per document type. All settings are also exposed through environment variables, which simplifies deployment configuration.
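
As a rough illustration of environment-driven settings, the sketch below reads two deskew options with fallbacks. The variable names `PREPROCESSING_DESKEW_ENABLED` and `PREPROCESSING_DESKEW_MAX_ANGLE` and the helpers `env_bool` and `env_float` are hypothetical; check `config.py` for the names the project actually reads.

```python
import os


def env_bool(name, default):
    """Read a boolean from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_float(name, default):
    """Read a float from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default  # ignore values that are not valid numbers


# Hypothetical variable names, for illustration only.
deskew = {
    "enabled": env_bool("PREPROCESSING_DESKEW_ENABLED", True),
    "max_angle": env_float("PREPROCESSING_DESKEW_MAX_ANGLE", 45.0),
}
```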
### Deskewing

```python
"deskew": {
    "enabled": True,                    # True or False: whether to apply deskewing
    "angle_threshold": 0.1,             # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,                  # Maximum correction angle
    "use_hough": True,                  # True or False: use Hough transform in addition to minAreaRect
    "consensus_method": "average",      # How to combine angle estimates
    "fallback": {"enabled": True}       # True or False: fall back to original if deskewing fails
}
```
Deskewing uses two detection methods:

- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and measures their angles

The `consensus_method` can be:

- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Angle with the smallest absolute value (most conservative)
- `"max"`: Angle with the largest absolute value (most aggressive)
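
The four consensus options can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the function name `combine_angles` is hypothetical, and keeping the sign of the selected angle for `"min"` and `"max"` is an assumption.

```python
from statistics import mean, median


def combine_angles(angles, method="average"):
    """Combine several skew estimates (in degrees) into a single angle."""
    if not angles:
        raise ValueError("at least one angle estimate is required")
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        return min(angles, key=abs)  # estimate closest to zero, sign kept
    if method == "max":
        return max(angles, key=abs)  # estimate farthest from zero, sign kept
    raise ValueError(f"unknown consensus method: {method!r}")


# One outlier (9.0) pulls the average upward but leaves the median untouched.
estimates = [1.2, 1.4, -0.2, 9.0]
for name in ("average", "median", "min", "max"):
    print(name, combine_angles(estimates, name))
```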
### Thresholding

```python
"thresholding": {
    "method": "adaptive",               # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,          # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,             # Constant subtracted from mean
    "otsu_gaussian_blur": 1,            # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True,                # True or False: whether to apply pre-blur
        "method": "gaussian",           # "gaussian" or "median"
        "kernel_size": 3                # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True}       # True or False: fall back to grayscale if thresholding fails
}
```
Thresholding methods:

- **Otsu**: Automatically determines a single global threshold (best for high-contrast documents)
- **Adaptive**: Calculates a separate threshold for each region of the image (better for uneven lighting and historical documents)
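
To show how `adaptive_block_size` and `adaptive_constant` interact, here is a dependency-free sketch of mean-based adaptive thresholding on a small image stored as a list of rows. The function name `adaptive_mean_threshold` is hypothetical, and clipping the window at the image border is a simplification; the pipeline itself may use a library routine with different border handling.

```python
def adaptive_mean_threshold(gray, block_size=11, constant=2):
    """Binarize a grayscale image given as a list of rows of 0-255 integers.

    Each pixel is compared against the mean of its block_size x block_size
    neighbourhood minus `constant`. Windows are clipped at the image border.
    """
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_threshold = sum(window) / len(window) - constant
            row.append(255 if gray[y][x] > local_threshold else 0)
        binary.append(row)
    return binary


# A dark "ink" pixel on a light background becomes 0; the background becomes 255.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
print(adaptive_mean_threshold(page, block_size=3, constant=2))
```

Because the threshold is computed per neighbourhood, a larger block size averages over more of the page, while a larger constant pushes borderline pixels toward white.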
### Morphological Operations

```python
"morphology": {
    "enabled": True,                    # True or False: whether to apply morphological operations
    "operation": "close",               # "open", "close", or "both"
    "kernel_size": 1,                   # Size of the structuring element
    "kernel_shape": "rect"              # "rect", "ellipse", or "cross"
}
```
Morphological operations:

- **Open**: Erosion followed by dilation; removes small noise and breaks thin connections
- **Close**: Dilation followed by erosion; fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
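
The definitions above can be made concrete with a small pure-Python sketch using a square (`"rect"`) structuring element on a binary image stored as a list of rows. The function names are hypothetical and the window is clipped at the border; this illustrates the operations, it is not the pipeline's implementation.

```python
def _filter(image, kernel_size, reducer):
    """Apply `reducer` (min or max) over a square window around each pixel."""
    height, width = len(image), len(image[0])
    radius = kernel_size // 2
    return [
        [
            reducer(
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def erode(image, kernel_size=3):
    return _filter(image, kernel_size, min)


def dilate(image, kernel_size=3):
    return _filter(image, kernel_size, max)


def opening(image, kernel_size=3):
    """Erosion followed by dilation: removes small foreground specks."""
    return dilate(erode(image, kernel_size), kernel_size)


def closing(image, kernel_size=3):
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(image, kernel_size), kernel_size)


# A single-pixel speck disappears under opening.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
print(opening(speck))
```

With a 1x1 element every window contains only the pixel itself, so all four functions return the image unchanged in this sketch.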
### Document Type Configurations

The system includes optimized settings for different document types:

```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
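
Since the document-type entries list only some keys, one plausible reading is that they are layered on top of the global settings. The sketch below shows such a recursive merge; `merge_settings` is a hypothetical helper, and deep merging is an assumption about how the overrides are applied, not a documented behavior.

```python
import copy


def merge_settings(base, overrides):
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Global deskew settings, using the example values from the Deskewing section.
global_settings = {
    "deskew": {
        "enabled": True,
        "angle_threshold": 0.1,
        "max_angle": 45.0,
        "use_hough": True,
    },
}

# The "handwritten" deskew override from the configuration above.
handwritten = {
    "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
}

effective = merge_settings(global_settings, handwritten)
print(effective["deskew"])
```

Under this assumption, keys that the document type omits (here `max_angle`) keep their global values, and an empty entry such as `"standard"` reproduces the global settings unchanged.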
## Performance and Logging

```python
"performance": {
    "parallel": {
        "enabled": True,                # True or False: whether to use parallel processing
        "max_workers": 4                # Maximum number of worker threads
    },
    "timeout_ms": 10000                 # Timeout for preprocessing (in milliseconds)
},
"logging": {
    "enabled": True,                    # True or False: whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
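
The following sketch shows one way `max_workers` and `timeout_ms` could be combined using the standard library's thread pool. The function `preprocess_batch` is hypothetical, and both the per-image timeout and the fallback to the unprocessed image are assumptions; the pipeline may apply the timeout differently.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def preprocess_batch(images, preprocess, max_workers=4, timeout_ms=10000):
    """Run `preprocess` on each image in worker threads, preserving order.

    If a result is not ready within the timeout, the original image is kept.
    """
    timeout_s = timeout_ms / 1000
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(preprocess, image) for image in images]
        for image, future in zip(images, futures):
            try:
                results.append(future.result(timeout=timeout_s))
            except FutureTimeoutError:
                # Only the wait is abandoned; the worker thread keeps running
                # and the pool still waits for it when the `with` block exits.
                results.append(image)
    return results


# Stand-in for a real preprocessing step.
print(preprocess_batch([1, 2, 3], lambda value: value * 2))
```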
## Usage with OCR Processing

When processing documents, specify the document type:

```python
# process_file, file_bytes, and file_ext are assumed to be defined by the application.
preprocessing_options = {
    "document_type": "newspaper",   # Use newspaper-optimized settings
    "grayscale": True,              # Legacy option: apply grayscale conversion
    "denoise": True,                # Legacy option: apply denoising
    "contrast": 10,                 # Legacy option: adjust contrast (0-100)
    "rotation": 0                   # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```
## Visual Examples

### Original Document

*[A historical newspaper or document image would be shown here]*

### After Deskewing

*[The same document, with skew corrected]*

### After Thresholding

*[The document converted to binary with clear text]*

### After Morphological Operations

*[The binary image with small noise removed and/or gaps filled]*
## Troubleshooting

### Poor Deskewing Results

- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Adjust `angle_threshold` or `max_angle`. For handwritten documents, try disabling the Hough transform (`"use_hough": False`)
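
When tuning these two values, it can help to see which detected angles would pass the gate at all. The sketch below assumes that angles smaller than `angle_threshold` are left alone and that angles larger than `max_angle` are rejected rather than clamped; `accept_skew_angle` is a hypothetical helper and the real pipeline may behave differently at the limits.

```python
def accept_skew_angle(detected_angle, angle_threshold=0.1, max_angle=45.0):
    """Return True if a detected skew (degrees) should be corrected."""
    magnitude = abs(detected_angle)
    if magnitude < angle_threshold:
        return False  # skew too small to be worth a rotation
    if magnitude > max_angle:
        return False  # assumption: implausibly large estimates are ignored
    return True


# The same detections under the default and the "newspaper" deskew settings.
for angle in (0.2, 4.0, 12.0):
    default = accept_skew_angle(angle)
    newspaper = accept_skew_angle(angle, angle_threshold=0.3, max_angle=10.0)
    print(angle, default, newspaper)
```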
### Thresholding Issues

- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Switch the thresholding method (`"otsu"` or `"adaptive"`), or adjust `adaptive_block_size` and `adaptive_constant`
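
The logged metric `binary_nonzero_pct` offers a quick way to compare thresholding settings. Its exact definition is not given here; the sketch below assumes it is the percentage of nonzero pixels in the binary image, stored as a list of rows.

```python
def binary_nonzero_pct(binary):
    """Percentage of nonzero pixels in a binary image (assumed definition)."""
    total = sum(len(row) for row in binary)
    if total == 0:
        return 0.0
    nonzero = sum(1 for row in binary for pixel in row if pixel != 0)
    return 100.0 * nonzero / total


# Two binarizations of the same tiny page produced with different settings.
mostly_white = [[255, 255, 0], [255, 255, 255]]
half_black = [[0, 0, 255], [0, 255, 255]]
print(binary_nonzero_pct(mostly_white), binary_nonzero_pct(half_black))
```

For dark text on a light page, a value close to 100 may indicate that text was lost, while an unusually low value may indicate background noise being turned into foreground.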
### Performance Concerns

- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing (`"parallel": {"enabled": True}`), reduce the image size, or disable preprocessing steps that are not needed