---
language:
- en
library_name: transformers
tags:
- video-captioning
- audiovisual
- qwen2.5-omni
- instruction-tuning
- attribute-structured
- quality-verified
pipeline_tag: image-text-to-text
model-index:
- name: ASID-Captioner-7B
  results: []
---

# ASID-Captioner-7B
ASID-Captioner-7B is an audiovisual captioning model (based on Qwen2.5-Omni) fine-tuned for attribute-structured and quality-verified video understanding. It is designed to generate fine-grained captions that cover both visual and audio signals, with controllable prompting over multiple attributes.

[[🏠 Homepage](https://asid-caption.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2602.13013)] [[🤗 Models & Datasets](https://huggingface.co/AudioVisual-Caption)] [[💻 Code](https://github.com/)]
## Introduction

Modern video MLLMs often describe long and complex audiovisual content with a single caption, which can be incomplete (missing audio or camera details), unstructured, and weakly controllable.

ASID-Captioner-7B is trained to follow attribute-specific instructions and produce more organized, fine-grained descriptions. It is built upon Qwen2.5-Omni and fine-tuned on ASID-1M, which provides structured supervision over multiple attributes (scene, characters, objects, actions, narrative elements, speech, camera, emotions) with quality verification and refinement.
## Key Features

- Audiovisual captioning: uses both video frames and audio (when available).
- Attribute-structured instruction following: supports prompts targeting specific attributes (e.g., speech-only, camera-only).
- High-quality supervision: trained on attribute-structured, quality-verified instructions from ASID-1M.
- Standard Transformers interface: load with `transformers` and the Qwen2.5-Omni processor/model classes.
## What’s in this repo

Typical files include:

- `config.json`
- `generation_config.json`
- `preprocessor_config.json`
- `chat_template.jinja`
- `added_tokens.json` / `special_tokens_map.json`
- `model-*.safetensors` and `model.safetensors.index.json`
## Prompting (recommended)

ASID-Captioner-7B works best with explicit attribute prompts, for example:

- Describe the scene in the video in detail. Write your answer as one coherent paragraph.
- Describe the characters in the video in detail. Write your answer as one coherent paragraph.
- Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account.
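The attribute-specific prompts above follow a regular pattern, so they can be generated programmatically. A minimal sketch (the template wording is taken from the examples above; the helper name and constant names are our own, not part of the model's API):

```python
# Hypothetical helper: build attribute-targeted prompts in the style shown above.
ATTRIBUTE_TEMPLATE = (
    "Describe the {attribute} in the video in detail. "
    "Write your answer as one coherent paragraph."
)

# The eight attributes covered by ASID-1M supervision.
ATTRIBUTES = [
    "scene", "characters", "objects", "actions",
    "narrative elements", "speech", "camera", "emotions",
]

def attribute_prompt(attribute: str) -> str:
    """Return a captioning prompt that targets a single attribute."""
    return ATTRIBUTE_TEMPLATE.format(attribute=attribute)

for attr in ATTRIBUTES:
    print(attribute_prompt(attr))
```

Any string that names the attribute works; the fixed template above simply mirrors the phrasing the model was instruction-tuned on.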

## Usage (minimal, single GPU)

### Install

```bash
pip install -U transformers accelerate
```

Optional: for faster attention, install FlashAttention2 following its official instructions.

You also need `qwen_omni_utils.process_mm_info` available in your environment; it is provided by the `qwen-omni-utils` package (`pip install qwen-omni-utils`).
### Run inference

```python
import os
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Frame-budget constants
VIDEO_MAX_PIXELS = 401408      # 512*28*28
VIDEO_TOTAL_PIXELS = 20070400  # 512*28*28*50
USE_AUDIO_IN_VIDEO = True

# Some pipelines read this env var to cap the total pixel budget
os.environ["VIDEO_MAX_PIXELS"] = str(VIDEO_TOTAL_PIXELS)

model_id = "AudioVisual-Caption/ASID-Captioner-7B"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # optional; remove if not available
    low_cpu_mem_usage=True,
)
model.disable_talker()  # text-only captioning; no speech output needed

processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

file_path = "/path/to/video.mp4"
prompt = "Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account."

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": file_path, "max_pixels": VIDEO_MAX_PIXELS},
            {"type": "text", "text": prompt},
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

# Extract audio, image, and video inputs from the conversation
audios, images, videos = process_mm_info(
    conversation,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

inputs = inputs.to("cuda").to(model.dtype)

with torch.no_grad():
    text_ids = model.generate(
        **inputs,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
        do_sample=False,
        thinker_max_new_tokens=4096,
        repetition_penalty=1.1,
        use_cache=True,
    )

decoded = processor.batch_decode(
    text_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

# The chat template places the reply after an "assistant" turn marker
answer = decoded.split("\nassistant\n")[-1].strip()
print(answer)
```

### Notes (important)

- If you do **not** use `process_mm_info`, you may get missing or incorrect audiovisual inputs in some environments.
- `use_audio_in_video=True` enables audio-conditioned captioning when your runtime supports extracting audio from the video container.
- `thinker_max_new_tokens` is used in the reference script. If your environment does not recognize it, replace it with `max_new_tokens`.
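To support both cases from the last note, the generation arguments can be assembled by a small helper. This is a sketch, not part of the model's API; the function name is our own, and the values mirror the inference script above:

```python
# Hypothetical helper: choose the right max-token argument for generate().
# Qwen2.5-Omni builds accept `thinker_max_new_tokens`; environments without
# that parameter expect the standard `max_new_tokens` instead.
def generation_kwargs(supports_thinker: bool, limit: int = 4096) -> dict:
    key = "thinker_max_new_tokens" if supports_thinker else "max_new_tokens"
    return {
        key: limit,
        "do_sample": False,          # greedy decoding, as in the script above
        "repetition_penalty": 1.1,
        "use_cache": True,
    }

# e.g. model.generate(**inputs, use_audio_in_video=True, **generation_kwargs(True))
```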

## Training Data

This model is fine-tuned on ASID-1M, a dataset of attribute-structured and quality-verified audiovisual instructions.
Dataset: `AudioVisual-Caption/ASID-1M`
## Citation

If you use our model in your research, please cite our paper:

```bibtex
@misc{asid2026,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Yunheng Li and Hengrui Zhang and Meng-Hao Guo and Wenzhao Gao and Shaoyong Jia and Shaohui Jiao and Qibin Hou and Ming-Ming Cheng},
  year={2026}
}
```
## Contact

Please open a Discussion on the Hugging Face page for usage questions or issues.