---
license: mit
---

### Example Usage

```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

# Load the config, processor, and model (custom modeling code is pulled from the Hub)
config = AutoConfig.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("revliter/internvideo_next_large_p14_res224_f16")
model = AutoModel.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", config=config, trust_remote_code=True)

model = model.cuda().half()

# Dummy clip: 16 frames of shape (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))

inputs = processor(video, return_tensors="pt")
# Processor returns (B, T, C, H, W); the model expects (B, C, T, H, W)
inputs['pixel_values'] = inputs['pixel_values'].permute(0, 2, 1, 3, 4).half().cuda()
output_embedding = model.extract_features(**inputs)

print(output_embedding.shape)  # [1, 4096, 1024]
```
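
The example above runs on random frames and returns token-level features of shape `[1, 4096, 1024]`, i.e. one 1024-dimensional embedding per spatiotemporal token. A common follow-up is to pool the tokens into a single clip-level vector; the sketch below is illustrative only, continues from the snippet above (it reuses `output_embedding` and the existing `torch` import), and mean pooling is an assumption rather than a documented recipe for this model.

```python
# Continuing from the snippet above: output_embedding has shape [1, 4096, 1024].
# Mean-pool over the token dimension to get one 1024-dim vector per clip
# (illustrative choice; other pooling strategies may suit your task better).
clip_embedding = output_embedding.mean(dim=1)  # [1, 1024]
# Optionally L2-normalize so clips can be compared with cosine similarity.
clip_embedding = torch.nn.functional.normalize(clip_embedding, dim=-1)
print(clip_embedding.shape)  # [1, 1024]
```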

Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/requirements.txt for package requirements.

### Citation

If this work is helpful for your research, please consider citing InternVideo-Next.

```bibtex
@article{wang2025internvideonext,
  title={InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision},
  author={Chenting Wang and Yuhan Zhu and Yicheng Xu and Jiange Yang and Ziang Yan and Yali Wang and Yi Wang and Limin Wang},
  year={2025},
  journal={arXiv preprint arXiv:2512.01342},
}
```