---
license: mit
---

### Example Usage

```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

# Load the config, processor, and model (custom modeling code is pulled from the Hub)
config = AutoConfig.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("revliter/internvideo_next_large_p14_res224_f16")
model = AutoModel.from_pretrained("revliter/internvideo_next_large_p14_res224_f16", config=config, trust_remote_code=True)

model = model.cuda().half()

# Dummy clip: 16 frames of shape (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))

inputs = processor(video, return_tensors="pt")
# Processor returns (B, T, C, H, W); the model expects (B, C, T, H, W)
inputs['pixel_values'] = inputs['pixel_values'].permute(0, 2, 1, 3, 4).half().cuda()
output_embedding = model.extract_features(**inputs)

print(output_embedding.shape)  # [1, 4096, 1024]
```
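
The example above runs on random frames and returns token-level features of shape `[1, 4096, 1024]`, i.e. one 1024-dimensional embedding per spatiotemporal token. A common follow-up is to pool the tokens into a single clip-level vector; the sketch below is illustrative only, continues from the snippet above (it reuses `output_embedding` and the existing `torch` import), and mean pooling is an assumption rather than a documented recipe for this model.

```python
# Continuing from the snippet above: output_embedding has shape [1, 4096, 1024].
# Mean-pool over the token dimension to get one 1024-dim vector per clip
# (illustrative choice; other pooling strategies may suit your task better).
clip_embedding = output_embedding.mean(dim=1)  # [1, 1024]
# Optionally L2-normalize so clips can be compared with cosine similarity.
clip_embedding = torch.nn.functional.normalize(clip_embedding, dim=-1)
print(clip_embedding.shape)  # [1, 1024]
```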

Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/requirements.txt for package requirements.

### Citation

If this work is helpful for your research, please consider citing InternVideo-Next.

```bibtex
@article{wang2025internvideonext,
  title={InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision},
  author={Chenting Wang and Yuhan Zhu and Yicheng Xu and Jiange Yang and Ziang Yan and Yali Wang and Yi Wang and Limin Wang},
  year={2025},
  journal={arXiv preprint arXiv:2512.01342},
}
```