Update README.md
README.md
CHANGED
|
@@ -1,12 +1,12 @@
|
|
| 1 |
# Infinity-Parser-7B
|
| 2 |
|
| 3 |
-
<a><img src="assets/logo.png" height="16" width="16" style="display: inline"><b> Paper </b></a> |
|
| 4 |
<a href="https://github.com/infly-ai/INF-MLLM/tree/main/Infinity-Parser"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" height="16" width="16" style="display: inline"><b> Github </b></a> |
|
| 5 |
<a href="https://huggingface.co/spaces/infly/Infinity-Parser-Demo">💬<b> Web Demo </b></a>
|
| 6 |
|
| 7 |
# Introduction
|
| 8 |
|
| 9 |
-
We develop Infinity-Parser, an end-to-end scanned document parsing model trained with reinforcement learning. By incorporating verifiable rewards based on layout and content, Infinity-Parser maintains the original document's structure and content with high fidelity. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models.
|
| 10 |
|
| 11 |
# Architecture
|
| 12 |
|
|
@@ -14,6 +14,84 @@ Overview of Infinity-Parser training framework. Our model is optimized via reinf
|
|
| 14 |
|
| 15 |

|
| 16 |
| 17 |
# License
|
| 18 |
|
| 19 |
-
This
|
|
|
|
| 1 |
# Infinity-Parser-7B
|
| 2 |
|
| 3 |
+
<a href="https://arxiv.org/pdf/2510.15349"><img src="assets/logo.png" height="16" width="16" style="display: inline"><b> Paper </b></a> |
|
| 4 |
<a href="https://github.com/infly-ai/INF-MLLM/tree/main/Infinity-Parser"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" height="16" width="16" style="display: inline"><b> Github </b></a> |
|
| 5 |
<a href="https://huggingface.co/spaces/infly/Infinity-Parser-Demo">💬<b> Web Demo </b></a>
|
| 6 |
|
| 7 |
# Introduction
|
| 8 |
|
| 9 |
+
We develop Infinity-Parser, an end-to-end scanned document parsing model trained with reinforcement learning. By incorporating verifiable rewards based on layout and content, Infinity-Parser maintains the original document's structure and content with high fidelity. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models while preserving the model’s general multimodal understanding capability.
|
| 10 |
|
| 11 |
# Architecture
|
| 12 |
|
|
|
|
| 14 |
|
| 15 |

|
| 16 |
|
| 17 |
+
# Quick Start
|
| 18 |
+
|
| 19 |
+
## Inference
|
| 20 |
+
|
| 21 |
+
```python
|
| 22 |
+
import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
|
| 23 |
+
from qwen_vl_utils import process_vision_info
|
| 24 |
+
|
| 25 |
+
model_path = "infly/Infinity-Parser-7B"
|
| 26 |
+
prompt = "Please transform the document’s contents into Markdown format."
|
| 27 |
+
|
| 28 |
+
# default: Load the model on the available device(s)
|
| 29 |
+
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
|
| 30 |
+
    model_path, torch_dtype="auto", device_map="auto"
|
| 31 |
+
)
|
| 32 |
+
|
| 33 |
+
# Optional: enable flash_attention_2 (requires the flash-attn package) for faster inference and lower memory use, especially with multiple images.
|
| 34 |
+
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
|
| 35 |
+
#     model_path,
|
| 36 |
+
#     torch_dtype=torch.bfloat16,
|
| 37 |
+
#     attn_implementation="flash_attention_2",
|
| 38 |
+
#     device_map="auto",
|
| 39 |
+
# )
|
| 40 |
+
|
| 41 |
+
min_pixels = 256 * 28 * 28  # 448 * 448
|
| 42 |
+
max_pixels = 2304 * 28 * 28  # 1344 * 1344
|
| 43 |
+
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)
|
| 44 |
+
|
| 45 |
+
messages = [
|
| 46 |
+
    {
|
| 47 |
+
        "role": "user",
|
| 48 |
+
        "content": [
|
| 49 |
+
            {
|
| 50 |
+
                "type": "image",
|
| 51 |
+
                "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
|
| 52 |
+
            },
|
| 53 |
+
            {"type": "text", "text": prompt},
|
| 54 |
+
        ],
|
| 55 |
+
    }
|
| 56 |
+
]
|
| 57 |
+
|
| 58 |
+
text = processor.apply_chat_template(
|
| 59 |
+
    messages, tokenize=False, add_generation_prompt=True
|
| 60 |
+
)
|
| 61 |
+
image_inputs, video_inputs = process_vision_info(messages)
|
| 62 |
+
inputs = processor(
|
| 63 |
+
    text=[text],
|
| 64 |
+
    images=image_inputs,
|
| 65 |
+
    videos=video_inputs,
|
| 66 |
+
    padding=True,
|
| 67 |
+
    return_tensors="pt",
|
| 68 |
+
)
|
| 69 |
+
inputs = inputs.to(model.device)  # follow the device chosen by device_map instead of hard-coding "cuda"
|
| 70 |
+
|
| 71 |
+
generated_ids = model.generate(**inputs, max_new_tokens=4096)
|
| 72 |
+
generated_ids_trimmed = [
|
| 73 |
+
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
| 74 |
+
]
|
| 75 |
+
output_text = processor.batch_decode(
|
| 76 |
+
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
| 77 |
+
)
|
| 78 |
+
print(output_text)
|
| 79 |
+
```
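
The snippet above hard-codes the pixel bounds and a remote image URL. As a minimal sketch for adapting it to your own scans, the helpers below compute the pixel bounds from a visual-token count and build the `messages` payload for either a URL or a local file. The names `pixel_budget`, `build_messages`, and the path `scans/page_001.png` are hypothetical and not part of this repository. The 28 x 28 pixels-per-token factor is an assumption inferred from the `256 * 28 * 28` arithmetic above, and passing local images as `file://` URIs is assumed to be accepted by `process_vision_info`.

```python
from pathlib import Path

# Assumption: one visual token covers a 28 x 28 pixel area,
# as implied by the `256 * 28 * 28` arithmetic in the snippet above.
PATCH_SIDE = 28


def pixel_budget(num_tokens: int) -> int:
    """Return the pixel count corresponding to `num_tokens` visual tokens."""
    return num_tokens * PATCH_SIDE * PATCH_SIDE


def build_messages(image: str, prompt: str) -> list:
    """Build a single-turn chat payload for one image (hypothetical helper)."""
    # URLs pass through unchanged; local paths are converted to file:// URIs.
    if not image.startswith(("http://", "https://", "file://")):
        image = Path(image).resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


min_pixels = pixel_budget(256)   # same value as 256 * 28 * 28
max_pixels = pixel_budget(2304)  # same value as 2304 * 28 * 28
messages = build_messages(
    "scans/page_001.png",  # hypothetical local scan
    "Please transform the document’s contents into Markdown format.",
)
```

The resulting `min_pixels`, `max_pixels`, and `messages` can be substituted for the corresponding variables in the inference snippet. Lowering the upper token count trades recognition detail for memory and speed.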
|
| 80 |
+
|
| 81 |
+
# Citation
|
| 82 |
+
|
| 83 |
+
```bibtex
|
| 84 |
+
@misc{wang2025infinityparserlayoutaware,
|
| 85 |
+
title={Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing},
|
| 86 |
+
author={Baode Wang and Biao Wu and Weizhen Li and Meng Fang and Zuming Huang and Jun Huang and Haozhe Wang and Yanjie Liang and Ling Chen and Wei Chu and Yuan Qi},
|
| 87 |
+
year={2025},
|
| 88 |
+
eprint={2510.15349},
|
| 89 |
+
archivePrefix={arXiv},
|
| 90 |
+
primaryClass={cs.CL},
|
| 91 |
+
url={https://arxiv.org/abs/2510.15349},
|
| 92 |
+
}
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
# License
|
| 96 |
|
| 97 |
+
This model is licensed under cc-by-nc-sa-4.0.
|