---
license: other
language:
- zh
- en
pipeline_tag: image-text-to-text
library_name: transformers
---
🎯 Demo | 📥 Model Download | 📄 Technical Report | 🌟 Github
## 📖 Introduction

**HunyuanOCR** is an end-to-end OCR expert vision-language model (VLM) built on Hunyuan's native multimodal architecture. With only 1B parameters, it achieves state-of-the-art results on multiple industry benchmarks. The model handles **complex multilingual document parsing** and performs well in practical applications including **text spotting, open-field information extraction, video subtitle extraction, and photo translation**.

## 🚀 Quick Start with Transformers

### Installation

```bash
pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4
```

> **Note**: HunyuanOCR support will be merged into the Transformers main branch later; until then, install from the pinned commit above.

### Model Inference

```python
from transformers import AutoProcessor
from transformers import HunYuanVLForConditionalGeneration
from PIL import Image
import torch


def clean_repeated_substrings(text):
    """Trim a trailing substring that repeats 10+ times in very long outputs."""
    n = len(text)
    # Short outputs are returned unchanged.
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length

        # Count consecutive copies of the candidate, walking back from the end.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length

        if count >= 10:
            # Keep a single copy of the repeated substring.
            return text[:n - length * (count - 1)]

    return text


model_name_or_path = "tencent/HunyuanOCR"
processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)

img_path = "path/to/your/image.jpg"
image_inputs = Image.open(img_path)
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img_path},
            {"type": "text", "text": (
                "Extract all information from the main body of the document image "
                "and represent it in markdown format, ignoring headers and footers. "
                "Tables should be expressed in HTML format, formulas in the document "
                "should be represented using LaTeX format, and the parsing should be "
                "organized according to the reading order."
            )},
        ],
    }
]
messages = [messages1]

# Render each conversation into a prompt string with the chat template.
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)

model = HunYuanVLForConditionalGeneration.from_pretrained(
    model_name_or_path,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto"
)

with torch.no_grad():
    device = next(model.parameters()).device
    inputs = inputs.to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)

if "input_ids" in inputs:
    input_ids = inputs.input_ids
else:
    print("inputs: # fallback", inputs)
    input_ids = inputs.inputs

# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]

# batch_decode returns a list of strings, so clean each decoded string separately.
output_texts = [
    clean_repeated_substrings(text)
    for text in processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
]
print(output_texts)
```

## 🚀 Quick Start with vLLM

### Installation

```bash
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```

### Model Inference

```python
from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_path = "tencent/HunyuanOCR"
llm = LLM(model=model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path)
# temperature=0 selects greedy decoding.
sampling_params = SamplingParams(temperature=0, max_tokens=16384)

img_path = "/path/to/image.jpg"
img = Image.open(img_path)
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": "Detect and recognize text in the image, and output the text coordinates in a formatted manner."}
    ]}
]

# Build the prompt string, then pass the image alongside it as multimodal data.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
output = llm.generate([inputs], sampling_params)[0]
print(output.outputs[0].text)
```

## 💬 Application-oriented Prompts

| Task | English | Chinese |
|------|---------|---------|
| **Spotting** | Detect and recognize text in the image, and output the text coordinates in a formatted manner. | 检测并识别图片中的文字,将文本坐标格式化输出。 |
| **Parsing** | • Identify the formula in the image and represent it using LaTeX format.