---
license: mit
---

### Example Usage

```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

config = AutoConfig.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("revliter/internvideo_next_large_p14_res224_f16")
model = AutoModel.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", config=config, trust_remote_code=True)
model = model.cuda().half()

# 16 random frames, each of shape (C, H, W) = (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))
inputs = processor(video, return_tensors="pt")
# The processor returns (B, T, C, H, W); the model expects (B, C, T, H, W)
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4).half().cuda()

output_embedding = model.extract_features(**inputs)
print(output_embedding.shape)  # [1, 4096, 1024]
```

Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/requirements.txt for the package requirements.

### Citation

If this work is helpful for your research, please consider citing InternVideo-Next.

```
@article{wang2025internvideonext,
  title={InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision},
  author={Chenting Wang and Yuhan Zhu and Yicheng Xu and Jiange Yang and Ziang Yan and Yali Wang and Yi Wang and Limin Wang},
  year={2025},
  journal={arXiv preprint arXiv:2512.01342},
}
```
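
The 4096 output tokens are consistent with 16 frames × (224 / 14)² = 16 × 256 patch tokens, each with a hidden size of 1024. The sketch below is an illustrative assumption rather than part of the official API: it shows one way to reshape these token features into a spatiotemporal feature map and pool them into per-frame or clip-level embeddings, assuming the tokens are ordered frame by frame.

```python
import torch

# Stand-in for the output of model.extract_features(**inputs): (B, N, C) = (1, 4096, 1024)
output_embedding = torch.randn(1, 4096, 1024)

B, N, C = output_embedding.shape
# Assumed token layout: 16 frames x 16 x 16 spatial patches (patch size 14 at 224 resolution)
T, H, W = 16, 224 // 14, 224 // 14

# Spatiotemporal feature map: (B, T, H, W, C)
feature_map = output_embedding.reshape(B, T, H, W, C)

# Per-frame embeddings via spatial mean pooling: (B, T, C)
frame_embeddings = feature_map.mean(dim=(2, 3))

# Single clip-level embedding via mean pooling over all tokens: (B, C)
clip_embedding = output_embedding.mean(dim=1)

print(frame_embeddings.shape, clip_embedding.shape)  # torch.Size([1, 16, 1024]) torch.Size([1, 1024])
```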