language:
- zh
- en
---

# VLM-FO1: Qwen2.5-VL-3B-v01

This repository contains the VLM-FO1_Qwen2.5-VL-3B-v01 model, an implementation of the [VLM-FO1](https://github.com/om-ai-lab/VLM-FO1) framework built on the Qwen2.5-VL-3B base model.
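
As a quick-start illustration, the snippet below sketches how such a checkpoint is typically loaded with Hugging Face `transformers`, assuming it keeps the standard Qwen2.5-VL interface; the model id, image path, prompt, and generation settings are placeholders, and the [VLM-FO1 repository](https://github.com/om-ai-lab/VLM-FO1) remains the authoritative reference for the intended inference flow.

```python
# Minimal loading sketch (not the official usage). It assumes the checkpoint is
# loadable through the standard Qwen2.5-VL classes in transformers; the model id
# below is a placeholder -- verify the actual Hugging Face repo name before use.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "om-ai-lab/VLM-FO1_Qwen2.5-VL-3B-v01"  # placeholder id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # path to any local test image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```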

VLM-FO1 is a novel plug-and-play framework designed to bridge the gap between the high-level reasoning of Vision-Language Models (VLMs) and the need for fine-grained visual perception.

## Model Details

### Model Description

VLM-FO1 endows pre-trained VLMs with superior fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models.

VLM-FO1 excels at a wide range of fine-grained perception tasks, including Object Grounding, Region Generative Understanding, Visual Region Reasoning, and more.
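
As a hypothetical example of an object-grounding style query, the snippet below continues from the loading sketch above, reusing `model`, `processor`, and `image`; the prompt wording follows the generic Qwen2.5-VL grounding style, while VLM-FO1's exact prompt and output formats for region-level tasks are defined in the GitHub repository.

```python
# Hypothetical grounding-style query, reusing `model`, `processor`, and `image`
# from the loading sketch above. The prompt is an illustrative assumption, not
# VLM-FO1's official region-task format.
grounding_messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Locate every person in the image and report their bounding boxes."},
    ],
}]
prompt = processor.apply_chat_template(
    grounding_messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```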

🧩 **Plug-and-Play Modularity:** Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.

🧠 **Hybrid Fine-grained Region Encoder (HFRE):** We introduce a novel Dual-Vision Encoder architecture that fuses semantic-rich features with perception-enhanced features, creating powerful region tokens that capture both high-level meaning and fine-grained spatial detail (an illustrative sketch follows this feature list).

🎯 **State-of-the-Art Performance:** VLM-FO1 achieves SOTA results across a diverse suite of fine-grained perception benchmarks.

✅ **Preserves General Abilities:** Our two-stage training strategy ensures that fine-grained perception is gained without causing catastrophic forgetting of the base model's powerful general visual understanding abilities.
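
To make the HFRE idea above concrete, here is a purely illustrative PyTorch sketch of the fusion pattern it describes: region features are pooled from two feature maps (one semantic, one perception-oriented), concatenated, and projected into the language model's token space. Every module name, dimension, and design choice below is an assumption for illustration, not the released implementation.

```python
# Illustrative sketch only: NOT the released VLM-FO1 code. It shows the general
# pattern of pooling region features from two feature maps and fusing them into
# region tokens. All names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class HybridRegionEncoderSketch(nn.Module):
    def __init__(self, sem_dim=1024, det_dim=256, llm_dim=2048, pool=7):
        super().__init__()
        self.pool = pool
        # Project the concatenated, pooled region features into LLM token space.
        self.proj = nn.Sequential(
            nn.Linear((sem_dim + det_dim) * pool * pool, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sem_feats, det_feats, boxes, sem_stride=14, det_stride=8):
        # sem_feats: [B, sem_dim, Hs, Ws] from the semantic (VLM) vision encoder
        # det_feats: [B, det_dim, Hd, Wd] from a perception-oriented encoder
        # boxes:     list of [N_i, 4] tensors with (x1, y1, x2, y2) in image pixels
        sem_roi = roi_align(sem_feats, boxes, output_size=self.pool,
                            spatial_scale=1.0 / sem_stride, aligned=True)
        det_roi = roi_align(det_feats, boxes, output_size=self.pool,
                            spatial_scale=1.0 / det_stride, aligned=True)
        fused = torch.cat([sem_roi, det_roi], dim=1).flatten(1)
        return self.proj(fused)  # one token embedding per region


# Usage with dummy tensors (one image, two candidate regions):
encoder = HybridRegionEncoderSketch()
sem = torch.randn(1, 1024, 32, 32)
det = torch.randn(1, 256, 56, 56)
boxes = [torch.tensor([[10.0, 10.0, 120.0, 200.0], [40.0, 60.0, 300.0, 280.0]])]
region_tokens = encoder(sem, det, boxes)
print(region_tokens.shape)  # torch.Size([2, 2048])
```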

### Model Sources

- **Repository:** [https://github.com/om-ai-lab/VLM-FO1](https://github.com/om-ai-lab/VLM-FO1)
- **Paper:** [https://arxiv.org/pdf/2509.25916](https://arxiv.org/pdf/2509.25916)

## Citation

```bibtex
@article{liu2025vlm,
  title={VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author={Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2509.25916},
  year={2025}
}
```