---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language-model
- document-understanding
---

# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

OCRVerse is a holistic OCR method that unifies text-centric OCR (extracting text from documents such as books and magazines) and vision-centric OCR (parsing visual elements from information-dense sources such as charts, web pages, and scientific plots) in a single end-to-end vision-language model.

- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **GitHub Repository:** [DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)

## Usage Example

To use OCRVerse, please ensure you have the `transformers` library installed (the example below also uses `device_map`, which requires `accelerate`):

```shell
pip install "transformers>=4.57.0" accelerate
```

### Text-Centric Document Parsing

Below is a simple example of how to use OCRVerse for document parsing tasks.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and processor
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "path/to/your/image.jpg"
# We recommend using the following prompt for better performance
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Tokenize the conversation and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate deterministically (greedy decoding)
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
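
### Vision-Centric Parsing

OCRVerse also covers vision-centric OCR over information-dense visuals such as charts, web pages, and scientific plots. Below is a minimal sketch that reuses the model and processor loaded above; the chart image path and prompt wording are illustrative assumptions, not officially recommended settings.

```python
# Vision-centric parsing: reuses `model` and `processor` from the example above.
# NOTE: the image path and prompt below are illustrative assumptions.
chart_path = "path/to/your/chart.png"
chart_prompt = "Parse the chart in the image and convert its underlying data into an HTML table."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": chart_path},
            {"type": "text", "text": chart_prompt},
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```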

## Citation

If you find this project useful, please cite our paper:

```bibtex
@article{zhong2026ocrverse,
  title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
  author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
  journal={arXiv preprint arXiv:2601.21639},
  year={2026}
}
```