## Video RAG

In this notebook, we'll do video RAG using OmniEmbed, an any-to-any embedding model, and Qwen2.5-Omni, an any-to-any model that takes images, video, audio, and text as input and outputs text (it can also generate speech).

You could do audio RAG or visual document RAG in the same way; this notebook serves as an example of retrieving videos and running inference on them.

Unfortunately, because Qwen models are memory hungry, we run this notebook on an A100. OmniEmbed is based on the 7B Qwen model with video processing; if a 3B version becomes available, we could try running it on an L4.

## OmniEmbed Inference

Here we initialize OmniEmbed and write a simple inference function for it. We use the multivent checkpoint because it is further trained on videos.

In [1]:
!pip install -q qwen-omni-utils

In [2]:
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'Tevatron/OmniEmbed-v0.1-multivent',
    torch_dtype=torch.bfloat16
).to(device).eval()

You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}

In [3]:
@torch.no_grad()  # inference only, so skip gradient tracking to save memory
def encode_message(message):
    # Render the chat template and append the end-of-text token; its position is used for pooling below
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    # Move every input tensor to the same device as the model
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # Last-token pooling: take the final layer's hidden state at the last position, then L2-normalize
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
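
The pooling step at the end of `encode_message` can be illustrated without loading the model. The following is a minimal pure-Python sketch, assuming toy hidden states with made-up values; the helper names `last_token_pool`, `l2_normalize`, and `dot` are hypothetical and not part of OmniEmbed or transformers. It shows that after L2 normalization, a dot product between two embeddings equals their cosine similarity.

```python
import math

def last_token_pool(hidden_states):
    """Return the last token's vector for each sequence in the batch.

    hidden_states is a list of sequences; each sequence is a list of
    per-token vectors (plain Python lists of floats).
    """
    return [sequence[-1] for sequence in hidden_states]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy batch: one sequence of three tokens with hidden size 2 (made-up values)
toy_hidden = [[[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]]
pooled = last_token_pool(toy_hidden)   # keeps only the final token's vector
embedding = l2_normalize(pooled[0])    # unit-length embedding
print(pooled, embedding, dot(embedding, embedding))
```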


In [4]:
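def template(message):
    """Wrap a video URL or a plain-text string in a single-turn user message."""
    if message.startswith("https://"):
        # Treat URLs as videos, sampled at 0.5 frames per second
        message = [
            {
                "role": "user",
                "content": [
                    {"type": "video",
                     "fps": 0.5,
                     "video": message},
                ],
            },
        ]
    else:
        # Anything else is treated as a text query
        message = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": message},
                ],
            },
        ]
    return message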

## Video Retrieval

Here we have two videos, a ramen recipe and a tofu recipe. We embed a text query asking how to cook Mapo Tofu, compare it against both video embeddings, and take the most relevant video (the tofu recipe). We then pass that video to Qwen2.5-Omni for detailed generation.

In [5]:
text_query = template('Query: How to cook Mapo Tofu?')
video_1 = template("https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/mapo_tofu.mp4")
video_2 = template("https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/ramen.mp4")

sim1 = torch.cosine_similarity(encode_message(text_query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(text_query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())

Similarities: 0.5078125 0.1962890625


The first video (the Mapo Tofu recipe) scores higher, so we take that one.
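
With more than two candidates, picking the winner by eye stops scaling. Below is a minimal sketch of ranking candidates by similarity, using the two scores printed above; the helper `rank_candidates` and the dictionary keys are hypothetical names chosen for illustration.

```python
def rank_candidates(similarities):
    """Sort (name, score) pairs from most to least similar to the query."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)

# Scores copied from the notebook output above
scores = {
    "mapo_tofu.mp4": 0.5078125,
    "ramen.mp4": 0.1962890625,
}

ranked = rank_candidates(scores)
best_video, best_score = ranked[0]
print(best_video, best_score)
```

In a larger collection you would typically keep the top few entries of `ranked` rather than only the first.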

After we get our retrieval result, we have to delete the embedding model to free some VRAM for Qwen2.5-Omni. #gpupoorlife

In [5]:
del model, processor
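
`del` only removes the names; the memory is released once no references remain, and PyTorch's caching allocator may still hold on to the freed blocks. A minimal sketch of an optional cleanup step follows, assuming you want the freed memory returned to the GPU driver as well. The helper name `release_cached_memory` is hypothetical, and the `torch` import is guarded so the function also runs where PyTorch is not installed.

```python
import gc

def release_cached_memory():
    """Run garbage collection and, if CUDA is available, empty PyTorch's cache.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no CUDA cache to empty
        return collected
    if torch.cuda.is_available():
        # Hand cached, unused blocks back to the GPU driver
        torch.cuda.empty_cache()
    return collected

release_cached_memory()
```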

In [4]:
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

Qwen2_5OmniToken2WavModel does not support eager attention implementation, fall back to sdpa


We can now provide our text query and the selected video to Qwen2.5-Omni for the G (generation) part of RAG.

In [5]:

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "system",
        "content": [
            {"type": "text", "text": 'Query: How to cook Mapo Tofu? Respond in English.'}  # the video contains Chinese characters
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video",
             "fps": 0.25,
             "video": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/mapo_tofu.mp4"},
        ],
    },
]

In [6]:
from qwen_omni_utils import process_mm_info

USE_AUDIO_IN_VIDEO = True

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

The response closely matches what is said in the video, which suggests the model really listened to the video's audio track when answering!

In [5]:
text

["system\nYou are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\nsystem\nQuery: How to cook Mapo Tofu? Respond in English.\nuser\n\nassistant\nWell, first you need to season the pork with ginger, light soy sauce, and shaoxing wine. Then, sauté the garlic until fragrant. After that, add the ground pork and spicy bean sauce. Mix and toss for 4 minutes until cooked through. Add Sichuan peppercorns, five - spice, sugar, chicken bouillon powder, chili oil, and chicken stock. Bring it to a boil, thicken it with a cornstarch slurry, and gently mix in the cubed silken tofu until it's warm. Finally, finish with sesame oil and green onions. Serve with rice and enjoy. If you have any other questions about cooking or want to share your cooking experiences, feel free to let me know."]
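
The decoded string includes the system prompt and query as well as the answer. Below is a minimal sketch for keeping only the model's reply, assuming the decoded text marks the final turn with `assistant\n` as in the output above; the helper `extract_assistant_reply` is a hypothetical name, and the sample string is shortened for illustration.

```python
def extract_assistant_reply(decoded, marker="assistant\n"):
    """Return the text after the last occurrence of the assistant marker.

    If the marker is missing, the input is returned unchanged (stripped).
    """
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()

# Shortened sample in the same shape as the decoded output above
sample = "system\nYou are Qwen.\nuser\n\nassistant\nWell, first you need to season the pork."
print(extract_assistant_reply(sample))
```

In the notebook this would be called as `extract_assistant_reply(text[0])`. String splitting is a simple option but would misfire if the reply itself contained the marker; slicing the prompt tokens off `text_ids` before decoding is another common approach.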