<div align="center">

# HunyuanOCR

</div>

<p align="center">
<img src="./assets/hyocr-head-img.png" width="80%"/> <br>
</p>

<p align="center">
<a href="https://github.com/Tencent-Hunyuan/HunyuanOCR"><b>🐙 GitHub</b></a> |
<a href="https://huggingface.co/spaces/tencent/HunyuanOCR"><b>🎯 Demo</b></a> |
<a href="https://huggingface.co/tencent/HunyuanOCR"><b>📥 Model Download</b></a> |
<a href="./Hunyuan_OCR_Technical_Report.pdf"><b>📄 Technical Report</b></a>
</p>

## 🔥 News
- **[2025/11/25]** 📝 Inference code and model weights are publicly available.

## 📖 Introduction
**HunyuanOCR** is a leading end-to-end OCR expert VLM powered by Hunyuan's native multimodal architecture. With a remarkably lightweight 1B-parameter design, it achieves state-of-the-art results on multiple industry benchmarks. The model excels at **complex multilingual document parsing** as well as practical applications including **text spotting, open-field information extraction, video subtitle extraction, and photo translation**.

Built on Tencent's Hunyuan technology, this versatile model delivers its performance through an end-to-end architecture and single-pass inference. It significantly simplifies deployment while remaining competitive with both established cascade systems and commercial APIs.

## ✨ Key Features

- 💪 **Efficient Lightweight Architecture**: Built on Hunyuan's native multimodal architecture and training strategy, achieving SOTA performance with only 1B parameters and significantly reducing deployment costs.

- 📑 **Comprehensive OCR Capabilities**: A single model covers classic OCR tasks including text detection and recognition, complex document parsing, open-field information extraction, and video subtitle extraction, while also supporting end-to-end photo translation and document QA.

- 🚀 **Ultimate Usability**: Deeply embraces the "end-to-end" philosophy of large models, achieving SOTA results with a single instruction and a single inference pass, which is more efficient and convenient than industry cascade solutions.

- 🌏 **Extensive Language Support**: Robust support for over 100 languages, excelling in both single-language and mixed-language scenarios across various document types.

<div align="left">
<img src="./assets/hyocr-pipeline.png" alt="HunyuanOCR framework" width="80%">
</div>

## 🛠️ Dependencies and Installation

### System Requirements
- 🖥️ Operating System: Linux
- 🐍 Python: 3.12+ (recommended and tested)
- ⚡ CUDA: 12.8
- 🔥 PyTorch: 2.7.1
- 🎮 GPU: NVIDIA GPU with CUDA support
- 🧠 GPU Memory: 80 GB
- 💾 Disk Space: 6 GB

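A quick way to confirm that a machine matches the requirements above is to inspect the local PyTorch/CUDA/GPU setup. This is an optional sanity-check sketch, not part of the installation; other version combinations may work but are not covered here.

```python
# Optional check of the local environment against the requirements above.
import torch

print("PyTorch version:", torch.__version__)        # expected: 2.7.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA (build):", torch.version.cuda)      # expected: 12.8
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.0f} GiB")  # 80 GB recommended
```
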
## 🚀 Quick Start with vLLM

### Installation
```bash
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```

### Model Inference

```python
from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_path = "tencent/HunyuanOCR"
llm = LLM(model=model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0, max_tokens=16384)

img_path = "/path/to/image.jpg"
img = Image.open(img_path)
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": "Detect and recognize text in the image, and output the text coordinates in a formatted manner."}
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
output = llm.generate([inputs], sampling_params)[0]
print(output.outputs[0].text)
```

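Since `llm.generate` accepts a list of requests, several images can be processed in one call. The following batching sketch reuses `llm`, `processor`, and `sampling_params` from the example above; the image paths are placeholders and the same spotting prompt is applied to every image.

```python
# Batched inference sketch: one request per image, results decoded in order.
img_paths = ["/path/to/image1.jpg", "/path/to/image2.jpg"]  # placeholder paths
prompt_text = "Detect and recognize text in the image, and output the text coordinates in a formatted manner."

requests = []
for path in img_paths:
    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": path},
            {"type": "text", "text": prompt_text},
        ]}
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    requests.append({"prompt": prompt, "multi_modal_data": {"image": [Image.open(path)]}})

for path, output in zip(img_paths, llm.generate(requests, sampling_params)):
    print(path, "->", output.outputs[0].text)
```
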
## 🚀 Quick Start with Transformers

### Installation
```bash
pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4
```
> **Note**: HunyuanOCR support will be merged into the Transformers main branch later.

### Model Inference

```python
from transformers import AutoProcessor
from transformers import HunYuanVLForConditionalGeneration
from PIL import Image
import torch

model_name_or_path = "tencent/HunyuanOCR"
processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
img_path = "path/to/your/image.jpg"
image_inputs = Image.open(img_path)
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img_path},
            {"type": "text", "text": (
                "Extract all information from the main body of the document image "
                "and represent it in markdown format, ignoring headers and footers. "
                "Tables should be expressed in HTML format, formulas in the document "
                "should be represented using LaTeX format, and the parsing should be "
                "organized according to the reading order."
            )},
        ],
    }
]
messages = [messages1]
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
model = HunYuanVLForConditionalGeneration.from_pretrained(
    model_name_or_path,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
device = next(model.parameters()).device
inputs = inputs.to(device)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
input_ids = inputs["input_ids"]
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```

## 💬 Application-oriented Prompts

| Task | English | Chinese |
|------|---------|---------|
| **Spotting** | Detect and recognize text in the image, and output the text coordinates in a formatted manner. | 检测并识别图片中的文字,将文本坐标格式化输出。 |
| **Parsing** | • Identify the formula in the image and represent it using LaTeX format.<br><br>• Parse the table in the image into HTML.<br><br>• Parse the chart in the image; use Mermaid format for flowcharts and Markdown for other charts.<br><br>• Extract all information from the main body of the document image and represent it in markdown format, ignoring headers and footers. Tables should be expressed in HTML format, formulas in the document should be represented using LaTeX format, and the parsing should be organized according to the reading order. | • 识别图片中的公式,用 LaTeX 格式表示。<br><br>• 把图中的表格解析为 HTML。<br><br>• 解析图中的图表,对于流程图使用 Mermaid 格式表示,其他图表使用 Markdown 格式表示。<br><br>• 提取文档图片中正文的所有信息用 markdown 格式表示,其中页眉、页脚部分忽略,表格用 html 格式表达,文档中公式用 latex 格式表示,按照阅读顺序组织进行解析。 |
| **Information Extraction** | • Output the value of Key.<br><br>• Extract the content of the fields: ['key1','key2', ...] from the image and return it in JSON format.<br><br>• Extract the subtitles from the image. | • 输出 Key 的值。<br><br>• 提取图片中的: ['key1','key2', ...] 的字段内容,并按照 JSON 格式返回。<br><br>• 提取图片中的字幕。 |
| **Translation** | First extract the text, then translate the text content into English. If it is a document, ignore the header and footer. Formulas should be represented in LaTeX format, and tables should be represented in HTML format. | 先提取文字,再将文字内容翻译为英文。若是文档,则其中页眉、页脚忽略。公式用latex格式表示,表格用html格式表示。 |

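These prompts can be passed to either quick-start pipeline unchanged. As an illustration, the sketch below plugs the information-extraction prompt into the vLLM example, reusing `llm`, `processor`, and `sampling_params` from that section; the image path and field names are placeholders.

```python
# Illustrative information-extraction call using a prompt from the table above.
fields = ['name', 'date', 'total_amount']  # placeholder field names
ie_prompt = f"Extract the content of the fields: {fields} from the image and return it in JSON format."

img_path = "/path/to/receipt.jpg"  # placeholder path
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": ie_prompt},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [Image.open(img_path)]}}
print(llm.generate([inputs], sampling_params)[0].outputs[0].text)
```
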
## 📚 Citation

```bibtex
@misc{hunyuanocr2025,
  title={HunyuanOCR Technical Report},
  author={Tencent Hunyuan Vision Team},
  year={2025},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanOCR}}
}
```

## 🙏 Acknowledgements
- Thanks to all contributors who helped build HunyuanOCR.
- Special thanks to the Tencent Hunyuan Team.
- We appreciate the support from the open-source community.