---
license: mit
---

### Example Usage

```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

config = AutoConfig.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("revliter/internvideo_next_large_p14_res224_f16")
model = AutoModel.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", config=config, trust_remote_code=True)
model = model.cuda().half()

# 16 random frames, each of shape (C, H, W) = (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))
inputs = processor(video, return_tensors="pt")
# The processor returns (B, T, C, H, W); the model expects (B, C, T, H, W)
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4).half().cuda()

output_embedding = model.extract_features(**inputs)
print(output_embedding.shape)  # [1, 4096, 1024]
```

Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/requirements.txt for the package requirements.

### Citation

If this work is helpful for your research, please consider citing InternVideo-Next.

```
@article{wang2025internvideonext,
  title={InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision},
  author={Chenting Wang and Yuhan Zhu and Yicheng Xu and Jiange Yang and Ziang Yan and Yali Wang and Yi Wang and Limin Wang},
  year={2025},
  journal={arXiv preprint arXiv:2512.01342},
}
```
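
The 4096 output tokens are consistent with 16 frames × (224 / 14)² = 16 × 256 patch tokens, each with a hidden size of 1024. The sketch below is an illustrative assumption rather than part of the official API: it shows one way to reshape these token features into a spatiotemporal feature map and pool them into per-frame or clip-level embeddings, assuming the tokens are ordered frame by frame.

```python
import torch

# Stand-in for the output of model.extract_features(**inputs): (B, N, C) = (1, 4096, 1024)
output_embedding = torch.randn(1, 4096, 1024)

B, N, C = output_embedding.shape
# Assumed token layout: 16 frames x 16 x 16 spatial patches (patch size 14 at 224 resolution)
T, H, W = 16, 224 // 14, 224 // 14

# Spatiotemporal feature map: (B, T, H, W, C)
feature_map = output_embedding.reshape(B, T, H, W, C)

# Per-frame embeddings via spatial mean pooling: (B, T, C)
frame_embeddings = feature_map.mean(dim=(2, 3))

# Single clip-level embedding via mean pooling over all tokens: (B, C)
clip_embedding = output_embedding.mean(dim=1)

print(frame_embeddings.shape, clip_embedding.shape)  # torch.Size([1, 16, 1024]) torch.Size([1, 1024])
```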