llmware/Qwen2.5-VL-3B-Instruct-ov-int4-npu

	---
	base_model:
	- Qwen/Qwen2.5-VL-3B-Instruct
	---

	This is the [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) model converted to OpenVINO, with INT4 weights for the language model and INT8 weights for the other models.
	The INT4 weights are compressed with symmetric, channel-wise quantization, with AWQ and scale estimation. The model works on CPU, GPU and NPU. See below for the model export command/properties.
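
To illustrate what "symmetric, channel-wise" INT4 quantization means, here is a minimal pure-Python sketch. It is an assumption-laden illustration, not the NNCF implementation: it uses plain round-to-nearest with one scale per output channel (row), picks the scale as `max|w| / 7`, and does not model AWQ or scale estimation. The function names are hypothetical.

```python
def quantize_row_int4_sym(row):
    """Quantize one output channel (a row of weights) to signed 4-bit integers."""
    max_abs = max(abs(w) for w in row)
    # One scale per channel; fall back to 1.0 for an all-zero row.
    # Assumption: scale = max|w| / 7. Real toolkits may use a different convention.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Symmetric scheme: no zero point, integers clamped to the int4 range [-8, 7].
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int4 values and the channel scale."""
    return [v * scale for v in q]


# Toy 2x4 weight matrix: each row gets its own scale (group size -1).
weights = [
    [0.70, -0.35, 0.10, 0.00],
    [-0.02, 0.01, 0.04, -0.03],
]
quantized = [quantize_row_int4_sym(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

for row, (q, scale), rec in zip(weights, quantized, restored):
    worst = max(abs(a - b) for a, b in zip(row, rec))
    print(f"scale={scale:.5f} q={q} max_error={worst:.5f}")
```

Because each row has its own scale, the second row with small weights keeps its resolution even though the first row has much larger values.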

	This model is subject to the Qwen Research License.

	## Download Model

	To download the model, run `pip install huggingface-hub[cli]` and then:
	```
	huggingface-cli download llmware/Qwen2.5-VL-3B-Instruct-ov-int4-npu --local-dir Qwen2.5-VL-3B-Instruct-ov-int4-npu
	```
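
The same download can also be done from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the wrapper name `download_model` is hypothetical and only there for illustration.

```python
REPO_ID = "llmware/Qwen2.5-VL-3B-Instruct-ov-int4-npu"
# Use the repository name (without the organization) as the local folder,
# matching the --local-dir value in the CLI command above.
LOCAL_DIR = REPO_ID.split("/")[-1]


def download_model(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download all files of the model repository and return the local path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example (requires network access):
# model_dir = download_model()
```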

	## Run inference with OpenVINO GenAI

	Use OpenVINO GenAI to run inference on this model. The model requires OpenVINO GenAI 2025.3 or later. For NPU inference, make sure to use the latest NPU driver ([Windows](https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html), [Linux](https://github.com/intel/linux-npu-driver)).
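
A quick way to confirm that the installed package meets the 2025.3 minimum is sketched below. It assumes the package is installed under the distribution name `openvino-genai` (as in the `pip install` command) and that its version string starts with `<year>.<release>`; the helper names are hypothetical.

```python
import re
from importlib import metadata

MINIMUM = (2025, 3)  # minimum OpenVINO GenAI release stated in this model card


def parse_major_minor(version):
    """Extract the leading (year, release) pair from a version string."""
    numbers = []
    for part in version.split(".")[:2]:
        match = re.match(r"\d+", part)
        if match is None:
            raise ValueError(f"unexpected version string: {version!r}")
        numbers.append(int(match.group()))
    while len(numbers) < 2:
        numbers.append(0)  # treat "2025" as "2025.0"
    return tuple(numbers)


def meets_minimum(version, minimum=MINIMUM):
    """Return True if `version` is at least the required release."""
    return parse_major_minor(version) >= minimum


def installed_genai_version():
    """Return the installed openvino-genai version, or None if it is not installed."""
    try:
        return metadata.version("openvino-genai")
    except metadata.PackageNotFoundError:
        return None


installed = installed_genai_version()
if installed is None:
    print("openvino-genai is not installed")
else:
    print(installed, "OK" if meets_minimum(installed) else "too old")
```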

	- Install OpenVINO GenAI and pillow:

	```
	pip install --upgrade openvino-genai pillow
	```

	- Download a test image: `curl -O "https://storage.openvinotoolkit.org/test_data/images/dog.jpg"`
	- Run inference:

	```python
	import numpy as np
	import openvino as ov
	import openvino_genai
	from PIL import Image

	# Use "GPU" instead of "NPU" to run on an Intel integrated or discrete GPU, or "CPU" to run on the CPU.
	# CACHE_DIR caches the model the first time, so subsequent model loading will be faster
	pipeline_config = {"CACHE_DIR": "model_cache"}
	pipe = openvino_genai.VLMPipeline("Qwen2.5-VL-3B-Instruct-ov-int4-npu", "NPU", **pipeline_config)

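	# Convert to RGB so the array always has the 3 channels that the reshape below expects.
	image = Image.open("dog.jpg").convert("RGB")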
	# optional: resizing to a smaller size (depending on image and prompt) is often useful to speed up inference.
	image = image.resize((128, 128))

	image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8)
	image_data = ov.Tensor(image_data)

	prompt = "Can you describe the image?"
	result = pipe.generate(prompt, image=image_data, max_new_tokens=100)
	print(result.texts[0])
	```

	See the [OpenVINO GenAI repository](https://github.com/openvinotoolkit/openvino.genai?tab=readme-ov-file#performing-visual-language-text-generation) for details on visual language text generation.
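
Since the model works on CPU, GPU and NPU, the device string can be chosen at runtime instead of being hard-coded. The helper below is a hypothetical sketch: `pick_device` is not part of OpenVINO, and the preference order NPU, GPU, CPU is an assumption. It expects a list of device names such as the one returned by `openvino.Core().available_devices`.

```python
def pick_device(available, preferred=("NPU", "GPU", "CPU")):
    """Return the first preferred device family found in `available`."""
    for family in preferred:
        for name in available:
            # Systems with several GPUs may report names such as "GPU.0" and "GPU.1".
            if name == family or name.startswith(family + "."):
                return family
    raise ValueError(f"none of {preferred} found in {list(available)}")


# Usage with OpenVINO (requires the openvino and openvino-genai packages):
# import openvino as ov
# import openvino_genai
# device = pick_device(ov.Core().available_devices)
# pipe = openvino_genai.VLMPipeline(
#     "Qwen2.5-VL-3B-Instruct-ov-int4-npu", device, CACHE_DIR="model_cache"
# )

print(pick_device(["CPU", "GPU.0", "GPU.1"]))
```

Returning the family name rather than the indexed name leaves the choice among several GPUs to OpenVINO's default.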

	## Model export properties

	Model export command:

	```
	optimum-cli export openvino -m Qwen/Qwen2.5-VL-3B-Instruct --weight-format int4 --group-size -1 --sym --awq --scale-estimation --dataset contextual Qwen2.5-VL-3B-Instruct-ov-int4-npu
	```

	### Framework versions

	```
	openvino_version : 2025.3.0-19807-44526285f24-releases/2025/3
	nncf_version : 2.17.0
	optimum_intel_version : 1.26.0.dev0+0e2ccef
	optimum_version : 1.27.0
	pytorch_version : 2.7.1
	transformers_version : 4.51.3
	```

	### LLM export properties

	```
	all_layers : False
	awq : True
	backup_mode : int8_asym
	compression_format : dequantize
	gptq : False
	group_size : -1
	ignored_scope : []
	lora_correction : False
	mode : int4_sym
	ratio : 1.0
	scale_estimation : True
	sensitivity_metric : max_activation_variance
	```