Spaces:

seanpedrickcase
/

document_redaction_vlm

Running on Zero

App Files Files Community

document_redaction_vlm / test /README.md

seanpedrickcase

Sync: Changed search text tab title

d864d45 13 days ago

preview code

raw

history blame

3.71 kB

	# CLI Redaction Test Suite

	This test suite provides comprehensive testing for the `cli_redact.py` script based on all the examples shown in the CLI epilog.

	## Overview

	The test suite includes tests for:

	1. PDF Redaction Examples
	- Default settings (local OCR)
	- Text extraction only (no redaction)
	- Text extraction with whole page redaction
	- Redaction with allow lists
	- Limited pages with custom fuzzy matching
	- Custom deny/allow/whole page lists
	- Image redaction

	2. Tabular Anonymisation Examples
	- CSV anonymisation with specific columns
	- Different anonymisation strategies
	- Word document anonymisation

	3. AWS Services Examples
	- Textract and Comprehend redaction
	- Signature extraction
	- Layout extraction

	4. Duplicate Detection Examples
	- Duplicate pages in OCR files
	- Line-level duplicate detection
	- Tabular duplicate detection

	5. Textract Batch Operations
	- Submit documents for analysis
	- Retrieve results by job ID
	- List recent jobs

	## Running the Tests

	### Method 1: Run the test suite directly
	```bash
	cd test
	python test.py
	```

	### Method 2: Use the convenience script
	```bash
	cd test
	python run_tests.py
	```

	### Method 3: Run with unittest
	```bash
	cd test
	python -m unittest test.test.TestCLIRedactExamples -v
	```

	## Test Behavior

	- File Dependencies: Tests will be skipped if required example files are not found in the `example_data/` directory
	- AWS Tests: AWS-related tests may fail if credentials are not configured, but this is expected
	- Temporary Output: All tests use temporary output directories that are cleaned up automatically
	- Timeout: Each test has a 10-minute timeout to prevent hanging

	## Test Structure

	The test suite uses Python's `unittest` framework with the following structure:

	- `TestCLIRedactExamples`: Main test class containing all test methods
	- `run_cli_redact()`: Helper function that executes the CLI script with specified parameters
	- `run_all_tests()`: Main function that runs all tests and provides a summary

	## Example Output

	```
	================================================================================
	DOCUMENT REDACTION CLI TEST SUITE
	================================================================================
	This test suite runs through all the examples from the CLI epilog.
	Tests will be skipped if required example files are not found.
	AWS-related tests may fail if credentials are not configured.
	================================================================================

	Test setup complete. Script: /path/to/cli_redact.py
	Example data directory: /path/to/example_data
	Temp output directory: /tmp/test_output_xyz

	=== Testing PDF redaction with default settings ===
	✅ PDF redaction with default settings passed

	=== Testing PDF text extraction only ===
	✅ PDF text extraction only passed

	...

	================================================================================
	TEST SUMMARY
	================================================================================
	Tests run: 20
	Failures: 0
	Errors: 0
	Skipped: 2

	Overall result: ✅ PASSED
	================================================================================
	```

	## Requirements

	- Python 3.6+
	- All dependencies for the main CLI script
	- Example data files in the `example_data/` directory (for full test coverage)
	- AWS credentials (for AWS-related tests)

	## Notes

	- Tests are designed to be robust and will skip gracefully if files are missing
	- AWS tests are marked as completed even if they fail due to missing credentials
	- The test suite provides detailed output for debugging
	- All temporary files are cleaned up automatically