RegionRet
RegionRet is a LoRA adapter model for region-level vision-language retrieval, fine-tuned from ColQwen2.5-Base using Parameter-Efficient Fine-Tuning (PEFT).
Model Details
- Model Type: LoRA Adapter (PEFT)
- Base Model: ColQwen2.5-Base
- Task Type: Feature Extraction
- Framework: PEFT 0.14.0
LoRA Configuration
- Rank (r): 32
- LoRA Alpha: 32
- LoRA Dropout: 0.1
- Target Modules: MLP projections (down_proj, gate_proj, up_proj) and attention projections (k_proj, q_proj, v_proj, o_proj), plus custom_text_proj
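To make the settings above concrete, the following sketch writes them as a plain Python dictionary whose keys mirror the argument names of PEFT's `LoraConfig` (`r`, `lora_alpha`, `lora_dropout`, `target_modules`). It then derives two quantities: the LoRA scaling factor `lora_alpha / r`, assuming PEFT's default (non-rank-stabilized) scaling, and the number of parameters LoRA adds to a single linear layer. The helper names and the 2048 x 2048 layer size are hypothetical and chosen for illustration only; they are not the base model's actual dimensions.

```python
# Values copied from the LoRA Configuration list above.
# Keys follow the argument names of PEFT's LoraConfig.
lora_settings = {
    "r": 32,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": [
        "down_proj", "gate_proj", "up_proj",      # MLP projections
        "k_proj", "q_proj", "v_proj", "o_proj",   # attention projections
        "custom_text_proj",
    ],
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Default LoRA scaling applied to the low-rank update: alpha / r."""
    return lora_alpha / r


def lora_added_params(in_features: int, out_features: int, r: int) -> int:
    """Parameter count of the two low-rank factors A (r x in) and B (out x r)."""
    return r * in_features + out_features * r


scaling = lora_scaling(lora_settings["r"], lora_settings["lora_alpha"])

# Hypothetical 2048 x 2048 projection, for illustration only.
example_params = lora_added_params(2048, 2048, lora_settings["r"])
```

Because the rank and alpha are equal, the low-rank update is applied with a scaling of 1 under this assumption, so the adapter weights are added to the base weights without extra amplification or damping.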
Model Architecture
- Processor: ColQwen2_5_Processor
- Max Visual Tokens: 1536
- Attention: Flash Attention 2
- Precision: bfloat16
Uses
For usage instructions, please refer to the RegionRAG repository: https://github.com/Aeryn666/RegionRAG.
Training Details
Training Data
- VisRAG-Ret-Train-In-domain-data
- Visual-CoT (DocVQA, TextCap, TextVQA, InfographicsVQA)
Training Configuration
- Loss Function: RegionContraLoss (global_tau=0.02, local_tau=0.25, local_coef=0.01)
- Epochs: 5
- Batch Size: 80 per device
- Learning Rate: 2e-4
- Precision: bfloat16
- Gradient Checkpointing: Enabled
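The definition of `RegionContraLoss` is not given here, so the sketch below does not reproduce it. It only illustrates, under the assumption that `global_tau` and `local_tau` act as softmax temperatures in an InfoNCE-style contrastive loss, how a small temperature such as 0.02 behaves compared with a larger one such as 0.25. The function name `info_nce` and the similarity values are hypothetical.

```python
import math


def info_nce(similarities, positive_index, tau):
    """Cross-entropy of the positive item under a temperature-scaled softmax."""
    logits = [s / tau for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[positive_index]


# Hypothetical similarities of one query to four candidates.
# Index 0 is the positive; index 1 is a hard negative.
sims = [0.80, 0.70, 0.30, 0.10]

loss_sharp = info_nce(sims, 0, tau=0.02)  # temperature equal to global_tau
loss_soft = info_nce(sims, 0, tau=0.25)   # temperature equal to local_tau

# Same scores, but now the hard negative (index 0) outranks the positive (index 1).
loss_sharp_wrong = info_nce(sims, 1, tau=0.02)
loss_soft_wrong = info_nce(sims, 1, tau=0.25)
```

In this toy setting the lower temperature yields a smaller loss when the positive is ranked first and a much larger loss when a hard negative outranks it, which is the usual reason for choosing a small temperature in retrieval training. How `local_coef` combines the terms is not covered by this sketch.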
Limitations
- Requires the ColQwen2.5-Base model to function; the adapter cannot be used on its own
- Optimized for region-level vision-language retrieval tasks
- A GPU with bfloat16 and Flash Attention 2 support is recommended
Citation
If you use this model, please cite:
@misc{li2025regionragregionlevelretrievalaugmentedgeneration,
title={RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding},
author={Yinglu Li and Zhiying Lu and Zhihang Liu and Yiwei Sun and Chuanbin Liu and Hongtao Xie},
year={2025},
eprint={2510.27261},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.27261},
}
License
Please refer to the license of the base model ColQwen2.5.