---
license: mit
---
## Example Usage

```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

config = AutoConfig.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("revliter/internvideo_next_large_p14_res224_f16")
model = AutoModel.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", config=config, trust_remote_code=True)
model = model.cuda().half()

# A dummy clip: 16 frames of shape (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))

inputs = processor(video, return_tensors="pt")
# The processor returns pixel_values as (B, T, C, H, W); the model expects (B, C, T, H, W)
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4).half().cuda()

output_embedding = model.extract_features(**inputs)
print(output_embedding.shape)  # [1, 4096, 1024]
```
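For a real clip, you would typically decode and sample frames from a video file before calling the processor. The sketch below reuses the `processor` and `model` objects created above; it assumes the `decord` package is available (it is commonly used in the InternVideo codebases) and uses a hypothetical `example.mp4`. The uniform 16-frame sampling and the mean pooling over output tokens are illustrative choices, not prescribed by this model card.

```python
# A minimal sketch, assuming `decord` is installed and `example.mp4` is a
# hypothetical input video; sampling and pooling choices are illustrative.
import numpy as np
import torch
from decord import VideoReader, cpu

vr = VideoReader("example.mp4", ctx=cpu(0))
frame_indices = np.linspace(0, len(vr) - 1, num=16).astype(int)   # 16 uniformly spaced frames
frames = vr.get_batch(frame_indices.tolist()).asnumpy()           # (16, H, W, 3), uint8

inputs = processor(list(frames), return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4).half().cuda()

with torch.no_grad():
    tokens = model.extract_features(**inputs)                     # [1, 4096, 1024]

# One simple way to get a single clip-level embedding is to mean-pool the tokens
# (this pooling choice is an assumption, not specified by the model card).
video_embedding = tokens.mean(dim=1)                              # [1, 1024]
```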
Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/requirements.txt for package requirements.
## Citation

If this work is helpful for your research, please consider citing InternVideo-Next.

```bibtex
@article{wang2025internvideonext,
  title={InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision},
  author={Chenting Wang and Yuhan Zhu and Yicheng Xu and Jiange Yang and Ziang Yan and Yali Wang and Yi Wang and Limin Wang},
  journal={arXiv preprint arXiv:2512.01342},
  year={2025},
}
```