---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
tags:
- ERNIE4.5
library_name: transformers
base_model: baidu/ERNIE-4.5-VL-28B-A3B-Thinking
---

# ERNIE-4.5-VL-28B-A3B-Thinking AWQ - INT8

## Model Details

### Quantization Details

- **Quantization Method:** cyankiwi AWQ v1.0
- **Bits:** 8
- **Group Size:** 32
- **Calibration Dataset:** [5CD-AI/LLaVA-CoT-o1-Instruct](https://huggingface.co/datasets/5CD-AI/LLaVA-CoT-o1-Instruct)
- **Quantization Tool:** [llm-compressor](https://github.com/vllm-project/llm-compressor)

### Memory Usage

| **Type** | **ERNIE-4.5-VL-28B-A3B-Thinking** | **ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-8bit** |
|:---------------:|:----------------:|:----------------:|
| **Memory Size** | 55.3 GB | 31.2 GB |
| **KV Cache per Token** | 56.0 kB | 28.0 kB |
| **KV Cache per Full Context (131,072 tokens)** | 7.0 GB | 3.5 GB |

### Evaluations

| **Benchmarks** | **ERNIE-4.5-VL-28B-A3B-Thinking** | **ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-8bit** |
|:---------------:|:----------------:|:----------------:|
| **Perplexity** | 1.80803 | 1.80776 |

- **Evaluation Context Length:** 16,384 tokens

## Inference

### Prerequisite

```bash
pip install uv
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
```

### Basic Usage

```bash
vllm serve cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-8bit --trust-remote-code \
    --reasoning-parser ernie45 \
    --tool-call-parser ernie45 \
    --enable-auto-tool-choice
```

Once the server is up, an example client request is sketched in the **Example Request** section below.

## Additional Information

### Changelog

- **v1.0.0**
  - Initial quantized release

### Authors

- **Name:** Ton Cao
- **Contact:** ton@cyan.kiwi
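## Example Request

The `vllm serve` command from **Basic Usage** exposes an OpenAI-compatible API. The snippet below is a minimal sketch using the `openai` Python client; the `localhost` address, default port 8000, placeholder API key, and the sample image URL (borrowed from the upstream model card) are assumptions to adjust for your deployment.

```python
# Minimal sketch: query the vLLM OpenAI-compatible endpoint started in "Basic Usage".
# Assumes the default server address http://localhost:8000/v1 and no API key configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-8bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        # Sample image from the upstream model card; replace with your own.
                        "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=1024,
)

# With --reasoning-parser enabled, the thinking trace is typically returned as a
# separate field (e.g. reasoning_content); only the final answer is printed here.
print(response.choices[0].message.content)
```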
# **Introducing ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI**

[Demo](https://huggingface.co/spaces/baidu/ERNIE-4.5-VL-28B-A3B-Thinking)

## Model Highlights

Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded **ERNIE-4.5-VL-28B-A3B-Thinking** achieves a remarkable leap forward in multimodal reasoning capabilities. Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model's representation power while deepening the semantic alignment between visual and language modalities, unlocking new capabilities in nuanced visual-textual reasoning.

The model leverages multimodal reinforcement learning on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling for exceptional learning efficiency. Responding to strong community demand, we've significantly strengthened the model's grounding performance with improved instruction following, making visual grounding functions more accessible than ever. Additionally, the "Thinking with Images" feature, when paired with tools such as image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding.

## Key Capabilities

As a lightweight model that activates only **3B parameters**, **ERNIE-4.5-VL-28B-A3B-Thinking** closely matches the performance of the industry's top flagship models across various benchmarks.

- **Visual Reasoning**: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks.
- **STEM Reasoning**: Leveraging its strong visual abilities, the model achieves a leap in performance on STEM tasks such as solving problems from photos, handling even complex questions with ease.
- **Visual Grounding**: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost.
- **Thinking with Images**: The model thinks like a human, freely zooming in and out of images to grasp every detail and uncover all relevant information.
- **Tool Utilization**: Empowered by robust tool-calling capabilities, the model can invoke functions such as image search on the fly to identify long-tail knowledge and achieve comprehensive information retrieval.
- **Video Understanding**: The model possesses outstanding temporal awareness and event-localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient.
## Quickstart

[Hugging Face app](https://huggingface.co/spaces/akhaliq/ERNIE-4.5-VL-28B-A3B-Thinking)

### Using the `transformers` Library

Here is an example of how to use the `transformers` library for inference:

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load the model and processor with the repository's custom code (trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Build the chat prompt and preprocess the vision inputs
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)
# Decode only the newly generated tokens
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
```

### vLLM Inference

Install the vLLM main branch:

```bash
pip install uv
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
```

Run vLLM:

```bash
# 80G*1 GPU: if an error occurs, add --gpu-memory-utilization 0.95 and try again
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
```

Run vLLM with `reasoning-parser` and `tool-call-parser`:

```bash
# 80G*1 GPU: if an error occurs, add --gpu-memory-utilization 0.95 and try again
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
    --reasoning-parser ernie45 \
    --tool-call-parser ernie45 \
    --enable-auto-tool-choice
```

### FastDeploy Inference

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the [FastDeploy GitHub Repository](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/get_started/ernie-4.5-vl-thinking.md).

**Note:** For single-card deployment, at least 80 GB of GPU memory is required.

```bash
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --port 8180 \
    --quantization wint8 \
    --reasoning-parser ernie-45-vl-thinking \
    --tool-call-parser ernie-45-vl-thinking \
    --mm-processor-kwargs '{"image_max_pixels": 12845056}'
```

### Finetuning with ERNIEKit

[ERNIEKit](https://github.com/PaddlePaddle/ERNIE) is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.
Usage Examples:

```bash
# Download model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking

# SFT
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml

# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml
```

For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the [ERNIEKit](https://github.com/PaddlePaddle/ERNIE) repository.

## License

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.

## Citation

If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:

```text
@misc{ernie2025technicalreport,
      title={ERNIE 4.5 Technical Report},
      author={Baidu-ERNIE-Team},
      year={2025},
      primaryClass={cs.CL},
      howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}
```