prithivMLmods committed on
Commit 414871a · verified · 1 Parent(s): 7386de3

Update README.md

Files changed (1): README.md +142 -1
README.md CHANGED
---
license: apache-2.0
base_model:
- inclusionAI/ZwZ-8B
datasets:
- inclusionAI/ZwZ-RL-VQA
- inclusionAI/ZoomBench
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- F8_E4M3
- fp8
- vllm
- llm-compressor
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/cJvpKspuxHdZNnkURe5jC.png)

# **ZwZ-8B-FP8**

> **ZwZ-8B-FP8** is an FP8-compressed variant built on top of **inclusionAI/ZwZ-8B**. It uses **BF16 · FP8 (F8_E4M3)** precision formats to significantly reduce memory footprint and improve inference efficiency while preserving the fine-grained multimodal perception strengths of the original architecture.
> The result is a highly efficient 8B vision-language model optimized for real-time, single-pass visual reasoning.

> [!important]
> FP8 (8-bit floating point) weight and activation quantization with hardware acceleration on supported GPUs – [FP8 W8A8](https://docs.vllm.ai/en/stable/features/quantization/fp8/). Quantized with the W8A8 FP8-dynamic recipe from llm-compressor – [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8).
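
The exact recipe for this checkpoint is not reproduced in the card, but the linked examples follow a common pattern: a data-free one-shot pass that rewrites the language model's Linear weights to FP8 while leaving quantization-sensitive modules untouched. A minimal sketch under those assumptions (the `ignore` patterns and use of the base checkpoint are illustrative, not the published recipe):

```python
# Hedged sketch of a W8A8 FP8-dynamic pass with llm-compressor, following the
# upstream quantization_w8a8_fp8 examples. The ignore patterns below are
# illustrative assumptions, not the exact recipe used for this checkpoint.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "inclusionAI/ZwZ-8B", torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",  # per-channel FP8 weights, dynamic per-token FP8 activations
    ignore=["re:.*lm_head", "re:visual.*"],  # assumed: keep head and vision tower unquantized
)

# FP8-dynamic quantization is data-free, so no calibration set is needed.
oneshot(model=model, recipe=recipe)
model.save_pretrained("ZwZ-8B-FP8", save_compressed=True)
```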

## About the Base Model

**ZwZ-8B** from inclusionAI is an 8B-parameter fine-grained multimodal perception vision-language model built on Qwen3-VL-8B. It is trained with **Region-to-Image Distillation (R2I)** combined with reinforcement learning to achieve state-of-the-art visual understanding in a single forward pass.

Unlike traditional VLMs that require inference-time zooming, cropping, or tool calling, ZwZ internalizes region-level perception directly into full-image reasoning.

### Key Innovations of ZwZ-8B

* **Region-to-Image Distillation (R2I)**:
  Teacher models such as Qwen3-VL-235B and GLM-4.5V generate high-fidelity VQA supervision on micro-cropped image regions with precise bounding boxes. This region-grounded supervision is distilled back into full-image context, allowing the student model to internalize fine-grained perception.

* **Single-Pass Fine-Grained Understanding**:
  Eliminates multi-step inference pipelines involving zooming, cropping, or external tool calls.

* **Strong Micro-Perception Capabilities**:

  * OCR and small-text detection
  * Object counting
  * Color and material attribute recognition
  * Structural analysis
  * Symbol and icon detection in dense scenes

* **Out-of-Distribution Generalization**:
  Demonstrates strong performance on:

  * Visual reasoning benchmarks
  * GUI agent tasks
  * AIGC detection
  * Complex real-world scenes

* **Edge-Optimized Deployment**:
  Enables real-time robotics and mobile vision applications without multi-stage inference overhead.

ZwZ is part of a broader model family spanning 4B, 7B, and 8B scales.

## What FP8 Adds

The **ZwZ-8B-FP8** variant introduces:

* **BF16 · FP8 (F8_E4M3) Compression**: FP8 quantization (applied with llm-compressor, per the note above) reduces VRAM usage while maintaining strong perception fidelity.
* **Higher Throughput**: Improved tokens per second and image-processing speed.
* **Lower Memory Footprint**: Better deployment feasibility on Hopper-class and other FP8-capable GPUs.
* **Production-Friendly Efficiency**: Ideal for real-time multimodal systems requiring compact yet powerful perception models (see the vLLM sketch below).
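
Since the checkpoint carries compressed-tensors FP8 metadata, it can in principle be served directly with vLLM, as the tags suggest. A minimal offline-inference sketch, assuming a recent vLLM build with Qwen3-VL support (`max_model_len` and the prompt are arbitrary illustrations):

```python
# Sketch: offline inference with vLLM. Assumes a vLLM build that supports
# Qwen3-VL-style models; the parameters here are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="prithivMLmods/ZwZ-8B-FP8", max_model_len=8192)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Read the small text in this image."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```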

## Quick Start with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the FP8-compressed ZwZ-8B model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/ZwZ-8B-FP8",
    torch_dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("prithivMLmods/ZwZ-8B-FP8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Analyze the fine-grained details in this image."},
        ],
    }
]

# Build the chat prompt and gather the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```
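
To sanity-check that the FP8 weights actually loaded rather than silently upcasting, something like the following can help; exact dtypes depend on your transformers and compressed-tensors versions:

```python
import torch

# Peek at a few parameter dtypes; FP8 checkpoints typically surface as
# torch.float8_e4m3fn, with ignored modules left in BF16.
for name, param in list(model.named_parameters())[:4]:
    print(name, param.dtype, tuple(param.shape))

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```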

## Intended Use

* Real-time multimodal perception systems
* Robotics and embodied AI
* GUI agents
* OCR-heavy and structured visual environments
* Edge deployment scenarios requiring single-pass inference

## Limitations & Risks

* FP8 acceleration requires a compatible GPU architecture (for example, Hopper-class hardware with native FP8 support).
* While compression maintains strong fidelity, extremely fine-grained edge cases may show minor precision differences compared to full BF16.
* Users are responsible for ethical and lawful deployment.