Improve model card: Add descriptive tags
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,19 +1,22 @@
 ---
-license: apache-2.0
-pipeline_tag: image-text-to-text
-library_name: transformers
 base_model:
-- OpenGVLab/InternViT-300M-448px-V2_5
-- Qwen/Qwen3-14B
-base_model_relation: merge
+- OpenGVLab/InternViT-300M-448px-V2_5
+- Qwen/Qwen3-14B
 datasets:
-- OpenGVLab/MMPR-v1.2
-- OpenGVLab/MMPR-Tiny
+- OpenGVLab/MMPR-v1.2
+- OpenGVLab/MMPR-Tiny
 language:
-- multilingual
+- multilingual
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
-- internvl
-- custom_code
+- internvl
+- custom_code
+- multimodal-llm
+- vision-language-model
+- agent
+base_model_relation: merge
 ---
 
 # InternVL3_5-14B-Pretrained
@@ -28,7 +31,7 @@ tags:
 
 ## Introduction
 
-We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0
+We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
 
 
 
@@ -142,7 +145,7 @@ Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router
 Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
 In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
 For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
-Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50
+Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.
 
 
 
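To make the token arithmetic in this hunk concrete, here is a minimal pixel-shuffle (space-to-depth) sketch. It is not code from the InternVL repository; the function name, tensor layout, and shapes are illustrative assumptions. It only shows how a 32x32 grid of 1024 visual tokens becomes 256 tokens at a 1/4 compression rate or 64 tokens at 1/16.

```python
import torch

def pixel_shuffle_compress(x: torch.Tensor, downscale: int) -> torch.Tensor:
    """Merge each downscale x downscale neighborhood of visual tokens into one token
    (space-to-depth): the token count drops by downscale**2 while the channel
    dimension grows by the same factor. Illustrative only."""
    b, h, w, c = x.shape
    r = downscale
    x = x.view(b, h // r, r, w // r, r, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // r, w // r, c * r * r)
    return x.flatten(1, 2)  # (batch, num_tokens, channels)

tokens = torch.randn(1, 32, 32, 1024)            # 1024 visual tokens per image patch
print(pixel_shuffle_compress(tokens, 2).shape)   # torch.Size([1, 256, 4096])  -> 1/4 rate
print(pixel_shuffle_compress(tokens, 4).shape)   # torch.Size([1, 64, 16384])  -> 1/16 rate
```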
@@ -234,7 +237,7 @@ $$
 \Bigg],
 $$
 
-where \\(\mathrm{KL}
+where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).
 
 
 `Router training`:
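For intuition only, the consistency term described in this hunk could be written schematically as below. This is a reader's sketch inferred from the sentence above (reference model fixed at \\(\xi=\frac{1}{4}\\), policy evaluated at a sampled \\(\xi\\)), not the exact objective from the model card or paper.

$$
\mathcal{L}_{\mathrm{consistency}} = \mathbb{E}_{\xi \sim \mathcal{U}\{\frac{1}{4},\,\frac{1}{16}\}} \Bigg[ \mathrm{KL}\Big(\pi_{\mathrm{ref}}\big(y \mid x, I_{1/4}\big) \,\Big\|\, \pi_{\theta}\big(y \mid x, I_{\xi}\big)\Big) \Bigg]
$$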
@@ -530,40 +533,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
 # pure-text conversation (pure-text conversation / 纯文本对话)
 question = 'Hello, who are you?'
 response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 question = 'Can you tell me a story?'
 response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 # single-image single-round conversation (单图单轮对话)
-question = '<image>\nPlease describe the image shortly.'
+question = '<image>
+Please describe the image shortly.'
 response = model.chat(tokenizer, pixel_values, question, generation_config)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 # single-image multi-round conversation (单图多轮对话)
-question = '<image>\nPlease describe the image in detail.'
+question = '<image>
+Please describe the image in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 question = 'Please write a poem according to the image.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 
-question = '<image>\nDescribe the two images in detail.'
+question = '<image>
+Describe the two images in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                history=None, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 question = 'What are the similarities and differences between these two images.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                history=history, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -571,17 +584,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
 
-question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
+question = 'Image-1: <image>
+Image-2: <image>
+Describe the two images in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list,
                                history=None, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 question = 'What are the similarities and differences between these two images.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list,
                                history=history, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 # batch inference, single image per sample (单图批处理)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -589,13 +606,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 
-questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
+questions = ['<image>
+Describe the image in detail.'] * len(num_patches_list)
 responses = model.batch_chat(tokenizer, pixel_values,
                              num_patches_list=num_patches_list,
                              questions=questions,
                              generation_config=generation_config)
 for question, response in zip(questions, responses):
-    print(f'User: {question}\nAssistant: {response}')
+    print(f'User: {question}
+Assistant: {response}')
 
 # video multi-round conversation (视频多轮对话)
 def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -633,17 +652,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
 video_path = './examples/red-panda.mp4'
 pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
 pixel_values = pixel_values.to(torch.bfloat16).cuda()
-video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
+video_prefix = ''.join([f'Frame{i+1}: <image>
+' for i in range(len(num_patches_list))])
 question = video_prefix + 'What is the red panda doing?'
-# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
+# Frame1: <image>
+Frame2: <image>
+...
+Frame8: <image>
+{question}
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=None, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 
 question = 'Describe this video in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=history, return_history=True)
-print(f'User: {question}\nAssistant: {response}')
+print(f'User: {question}
+Assistant: {response}')
 ```
 
 #### Streaming Output
@@ -727,7 +753,9 @@ image_urls=[
 
 images = [load_image(img_url) for img_url in image_urls]
 # Numbering images improves multi-image conversations
-response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
+response = pipe((f'Image-1: {IMAGE_TOKEN}
+Image-2: {IMAGE_TOKEN}
+describe these two images', images))
 print(response.text)
 ```
 
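As a small aside on the "numbering images" tip quoted in this hunk, here is a hedged sketch of how such a prompt string can be assembled; the helper name and the default placeholder token are illustrative assumptions, not part of the model card or the lmdeploy API.

```python
# Hypothetical helper (illustrative, not from the card): prefix each image slot
# with an explicit "Image-N:" label before the instruction, as recommended above.
def numbered_image_prompt(num_images: int, instruction: str, image_token: str = '<image>') -> str:
    header = '\n'.join(f'Image-{i + 1}: {image_token}' for i in range(num_images))
    return f'{header}\n{instruction}'

print(numbered_image_prompt(2, 'describe these two images'))
# Image-1: <image>
# Image-2: <image>
# describe these two images
```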
@@ -830,3 +858,14 @@ If you find this project useful in your research, please consider citing:
   year={2025}
 }
 ```
+
+
+## Acknowledgement
+
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+______________________________________________________________________
+
+Scan the following QR Code, join our WeChat group.
+
+<p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>
|