---
language:
- en
license: mit
tags:
- document-question-answering
- ocr
- summarization
- document-ai
pipeline_tag: document-question-answering
model_name: docintel
---

# 🧾 DOCINTEL — Document AI (Donut-based)

**DOCINTEL** extracts structured insights from scanned PDFs and images using **naver-clova-ix/donut-base** (Donut). It supports OCR fallback, entity extraction, and document summarization via Donut on page images.
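
For orientation, the sketch below shows the bare Donut page-image inference this project wraps, using the standard `transformers` classes. The file name and the task prompt are illustrative assumptions, not the repo's exact code.

```python
# Minimal Donut inference sketch on one rendered page image (illustrative only).
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL_ID = "naver-clova-ix/donut-base"
processor = DonutProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

image = Image.open("page_0.png").convert("RGB")  # assumed: one page rendered from a PDF
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is steered by a task start token; "<s_synthdog>" is assumed here for the base
# checkpoint's pretraining task. Fine-tuned checkpoints expect their own prompts.
task_prompt = "<s_synthdog>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```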

> ⚠️ Install the system dependencies `poppler` and `tesseract`; they are required by `pdf2image` and `pytesseract` respectively.

## Quickstart

1. Create a venv and install dependencies:

```bash
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Run the API server:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```

3. Upload a PDF and call the endpoints (see `examples/demo_commands.txt`).
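
As a rough illustration of step 3 (the real routes are in `examples/demo_commands.txt`; the paths, field names, and response shape below are assumptions):

```python
# Hypothetical client calls; check examples/demo_commands.txt for the real routes.
import requests

BASE = "http://localhost:8000"

# Upload a PDF (endpoint path and form field name are assumptions).
with open("invoice.pdf", "rb") as f:
    upload = requests.post(
        f"{BASE}/upload",
        files={"file": ("invoice.pdf", f, "application/pdf")},
    )
upload.raise_for_status()
doc_id = upload.json().get("doc_id")  # assumed response shape

# Request a summary of the uploaded document (assumed endpoint).
summary = requests.post(f"{BASE}/summarize", json={"doc_id": doc_id})
print(summary.json())
```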

## Files

- `ocr_extractor.py` — PDF→images→OCR pipeline (see the sketch after this list)
- `pdf_loader.py` — extract embedded text from PDFs
- `entity_tagger.py` — regex-based entity extraction
- `summarize_doc.py` — Donut-based summarizer for page images
- `app.py` — FastAPI server with upload/summary endpoints
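
To make the OCR path concrete, here is a minimal PDF → images → OCR sketch with `pdf2image` and `pytesseract`; the helper name and defaults are illustrative, not necessarily what `ocr_extractor.py` exposes.

```python
# Illustrative PDF -> images -> OCR pass; function name and defaults are assumptions.
from pdf2image import convert_from_path
import pytesseract


def ocr_pdf(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to an image (needs poppler) and OCR it (needs tesseract)."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [pytesseract.image_to_string(page) for page in pages]


if __name__ == "__main__":
    for i, page_text in enumerate(ocr_pdf("sample.pdf")):
        print(f"--- page {i + 1} ---")
        print(page_text[:200])
```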

## Notes

- Donut runs vision-encoder-decoder inference, which may need a GPU for acceptable speed.
- For text-only PDFs, consider `extract_text_from_pdf` followed by a text summarizer instead of Donut (see the sketch after this list).
- This repo is a prototype/demo. Validate it on your own data before production use.
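
A minimal sketch of that text-only path, assuming `extract_text_from_pdf` lives in `pdf_loader.py` and using a generic `transformers` summarization model as a stand-in:

```python
# Text-only path: embedded text extraction + a plain text summarizer.
# The summarization model below is a generic stand-in, not something this repo ships.
from transformers import pipeline

from pdf_loader import extract_text_from_pdf  # assumed location of the helper

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = extract_text_from_pdf("report.pdf")
# Most summarization models cap input length, so chunk long documents before summarizing.
chunk = text[:3000]
print(summarizer(chunk, max_length=150, min_length=40, do_sample=False)[0]["summary_text"])
```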