Abstract
Unified multimodal models incur inference inefficiencies on certain tasks; a Mixture-of-Experts Adaptation is proposed to compress the generation components through sparse activation while maintaining performance.
Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
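To make the MoE Adaptation idea concrete, below is a minimal sketch of partitioning one generation-module FFN into experts with top-k sparse routing. The class name, dimensions, expert count, and routing scheme are illustrative assumptions, not the paper's implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumed details, not the paper's exact code): split an FFN's
# hidden width into expert slices and activate only the top-k experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=4, top_k=2):
        super().__init__()
        assert d_ff % num_experts == 0
        d_expert = d_ff // num_experts  # each expert owns one slice of the FFN width
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # token-wise gating scores
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # (num_tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)    # keep top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

In the setting described by the abstract, only the generation components would be partitioned this way, so each token activates a fraction of the generation module and roughly half of the unified model's parameters overall.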
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution (2025)
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation (2025)
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping (2025)
- Architecture Decoupling Is Not All You Need For Unified Multimodal Model (2025)
- Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer (2025)
- Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models (2025)
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models (2025)