---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- trl
- VisualUnderstanding
- text-generation-inference
- VisionLanguageAttribution
- AttributeCaptioning
- VLA
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/EUorMi4zOONUl9USQzBRp.png)

# **DeepAttriCap-VLA-3B**

> The **DeepAttriCap-VLA-3B** model is a fine-tuned version of **Qwen2.5-VL-3B-Instruct**, tailored for **Vision-Language Attribution** and **Image Captioning**. It is designed to generate precise, attribute-rich descriptions that define the visual properties of objects and scenes in detail, providing both object-level identification and contextual captioning.

# Key Highlights

1. **Vision-Language Attribution**: Produces structured captions with explicit object attributes, properties, and contextual details.
2. **High-Precision Descriptions**: Captures fine-grained visual properties (shape, color, texture, material, relations).
3. **Balanced Object-Centric and Scene-Level Captions**: Generates both holistic captions and per-object attributions.
4. **Adaptable Across Image Types**: Works well on natural, artistic, abstract, and technical imagery.
5. **Built on the Qwen2.5-VL Architecture**: Leverages the strengths of the 3B multimodal instruction-tuned variant for fine-grained reasoning.
6. **Multilingual Capability**: English is the default, with multilingual captioning enabled through prompt engineering.

> Model type: experimental

# Training Details

This model was fine-tuned on a mixture of curated image–caption datasets with an emphasis on **attribute-based captioning** and **precise object-property definition**:

* **[prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)**
* **[prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)**
* **[prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)**
* **[Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)**

The training objective emphasized **attribution-style captioning**: capturing precise object details, relationships, and scene-level semantics.

---

## SYSTEM_PROMPT

```py
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**.
   The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements.
   Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.
   - Use the syntax: `{class_name==write_the_core_theme}`
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`

4. Maintain the following strict format in your output:
   - **Caption:**
   - **Attributes:**
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
""".strip()
```

Sketches showing one way to pass this system prompt through the chat template and to parse the resulting output are included at the end of this card.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://huggingface.co/prithivMLmods/DeepAttriCap-VLA-3B/blob/main/deepattricap-vla-3b-colab-notebook-demo/DeepAttriCap_VLA_3B.ipynb)

---

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepAttriCap-VLA-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepAttriCap-VLA-3B")

# A single user turn containing one image and a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image."},
        ],
    }
]

# Build the chat prompt and vision inputs, then move the tensors to the GPU
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

# Generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

# Intended Use

* Attribute-rich object recognition and captioning.
* Vision-language research in attribution and property extraction.
* Dataset creation for fine-grained visual description tasks.
* Enabling descriptive captions for images with complex object relationships.
* Supporting creative, technical, and educational use cases requiring precise captions.

# Limitations

* May produce variable levels of granularity depending on image complexity.
* Not optimized for highly censored or safety-critical deployments.
* May over-attribute or hallucinate properties in ambiguous or abstract visuals.
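
# Using the System Prompt

The `CAPTION_SYSTEM_PROMPT` defined above is not wired into the Quick Start example. The snippet below is a minimal sketch of one way to do it, reusing the `model` and `processor` loaded above and prepending the prompt as a system turn; the image URL and instruction text are placeholders, and the snippet is illustrative rather than part of the released code.

```python
# Minimal sketch: prepend CAPTION_SYSTEM_PROMPT as a system turn so the model
# returns the Caption / Attributes / {class_name==...} format described above.
# Assumes `model`, `processor`, and `CAPTION_SYSTEM_PROMPT` from the sections above.
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": [{"type": "text", "text": CAPTION_SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Caption this image."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
caption_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(caption_text)
```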
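
# Parsing the Structured Output

Because the response follows the fixed `Caption` / `Attributes` / `{class_name==...}` layout requested by the system prompt, it can be split with a small amount of string processing. The helper below is a hypothetical sketch, not part of the model or its tooling; the regular expressions assume the model sticks to the strict format and may need loosening in practice.

```python
import re

def parse_attribution(text: str) -> dict:
    """Hypothetical helper: split a DeepAttriCap-style response into its
    caption, attributes, and class_name fields. The regexes assume the
    strict format requested by CAPTION_SYSTEM_PROMPT."""
    caption = re.search(r"\*\*Caption:\*\*\s*(.*?)(?=\n\s*\*\*Attributes:|\Z)", text, re.S)
    attributes = re.search(r"\*\*Attributes:\*\*\s*(.*?)(?=\n\s*\*\*\{class_name|\Z)", text, re.S)
    class_name = re.search(r"\{class_name==([^}]+)\}", text)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": attributes.group(1).strip() if attributes else None,
        "class_name": class_name.group(1).strip() if class_name else None,
    }

# Example with a response shaped like the format above (illustrative text only):
sample = """**Caption:** A golden retriever runs across a sandy beach at sunset.
**Attributes:** dog, golden fur, running, beach, warm light, motion blur
**{class_name==dog_running_beach}**"""
print(parse_attribution(sample))
```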