---
language:
- en
license: mit
tags:
- document-question-answering
- ocr
- summarization
- document-ai
pipeline_tag: document-question-answering
model_name: docintel
---

# 🧾 DOCINTEL — Document AI (Donut-based)

**DOCINTEL** extracts structured insights from scanned PDFs and images using **naver-clova-ix/donut-base** (Donut). It supports OCR fallback, entity extraction, and document summarization via Donut on page images.
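
For orientation, the sketch below shows the bare Donut page-image inference this project wraps, using the standard `transformers` classes. The file name and the task prompt are illustrative assumptions, not the repo's exact code.

```python
# Minimal Donut inference sketch on one rendered page image (illustrative only).
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL_ID = "naver-clova-ix/donut-base"
processor = DonutProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

image = Image.open("page_0.png").convert("RGB")  # assumed: one page rendered from a PDF
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is steered by a task start token; "<s_synthdog>" is assumed here for the base
# checkpoint's pretraining task. Fine-tuned checkpoints expect their own prompts.
task_prompt = "<s_synthdog>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```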

> ⚠️ Install the system dependencies `poppler` and `tesseract`; they are required by `pdf2image` and `pytesseract` respectively.

## Quickstart

1. Create a venv and install dependencies:

```bash
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Run the API server:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```

3. Upload a PDF and call the endpoints (see `examples/demo_commands.txt`).
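
As a rough illustration of step 3 (the real routes are in `examples/demo_commands.txt`; the paths, field names, and response shape below are assumptions):

```python
# Hypothetical client calls; check examples/demo_commands.txt for the real routes.
import requests

BASE = "http://localhost:8000"

# Upload a PDF (endpoint path and form field name are assumptions).
with open("invoice.pdf", "rb") as f:
    upload = requests.post(
        f"{BASE}/upload",
        files={"file": ("invoice.pdf", f, "application/pdf")},
    )
upload.raise_for_status()
doc_id = upload.json().get("doc_id")  # assumed response shape

# Request a summary of the uploaded document (assumed endpoint).
summary = requests.post(f"{BASE}/summarize", json={"doc_id": doc_id})
print(summary.json())
```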

## Files

- `ocr_extractor.py` — PDF→images→OCR pipeline (see the sketch after this list)
- `pdf_loader.py` — extract embedded text from PDFs
- `entity_tagger.py` — regex-based entity extraction
- `summarize_doc.py` — Donut-based summarizer for page images
- `app.py` — FastAPI server with upload/summary endpoints
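
To make the OCR path concrete, here is a minimal PDF → images → OCR sketch with `pdf2image` and `pytesseract`; the helper name and defaults are illustrative, not necessarily what `ocr_extractor.py` exposes.

```python
# Illustrative PDF -> images -> OCR pass; function name and defaults are assumptions.
from pdf2image import convert_from_path
import pytesseract


def ocr_pdf(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to an image (needs poppler) and OCR it (needs tesseract)."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [pytesseract.image_to_string(page) for page in pages]


if __name__ == "__main__":
    for i, page_text in enumerate(ocr_pdf("sample.pdf")):
        print(f"--- page {i + 1} ---")
        print(page_text[:200])
```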

## Notes

- Donut runs vision-encoder-decoder inference, which may need a GPU for acceptable speed.
- For text-only PDFs, consider `extract_text_from_pdf` followed by a text summarizer instead of Donut (see the sketch after this list).
- This repo is a prototype/demo. Validate it on your own data before production use.
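
A minimal sketch of that text-only path, assuming `extract_text_from_pdf` lives in `pdf_loader.py` and using a generic `transformers` summarization model as a stand-in:

```python
# Text-only path: embedded text extraction + a plain text summarizer.
# The summarization model below is a generic stand-in, not something this repo ships.
from transformers import pipeline

from pdf_loader import extract_text_from_pdf  # assumed location of the helper

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = extract_text_from_pdf("report.pdf")
# Most summarization models cap input length, so chunk long documents before summarizing.
chunk = text[:3000]
print(summarizer(chunk, max_length=150, min_length=40, do_sample=False)[0]["summary_text"])
```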