arxiv:2602.20161

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Published on Feb 23 · Submitted by Ahmed Heakl on Feb 24

Abstract

A compact vision-language-diffusion model called Mobile-O enables efficient unified multimodal understanding and generation on mobile devices through specialized architecture design and optimized training methodology.

AI-generated summary

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
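To give a rough sense of why depthwise-separable convolutions keep the conditioning cost low, the sketch below compares weight counts for a standard convolution and a depthwise-separable one. It is only an illustration of the general technique: the channel widths and kernel size are hypothetical values picked for the arithmetic, not figures from the paper, and the function names are made up for this example.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical sizes for illustration only; not taken from the Mobile-O paper.
C_IN, C_OUT, K = 512, 512, 3

standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)
ratio = standard / separable

print(f"standard conv weights:            {standard}")
print(f"depthwise-separable conv weights: {separable}")
print(f"reduction factor:                 {ratio:.2f}x")
```

Under these assumed sizes the separable layer needs roughly one ninth of the weights, which is the usual saving for a 3x3 kernel when the channel count is large.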

Community


TL;DR: Introducing Mobile-O

  • What it is: A compact, unified multimodal model that brings both visual understanding and image generation directly to mobile devices.
  • The Breakthrough: It eliminates cloud dependency for multimodal AI. Using a novel "Mobile Conditioning Projector," it achieves high efficiency with minimal compute.
  • Real-World Speed: It can generate a 512x512 image in about 3 seconds natively on an iPhone.
  • The Benchmarks: Despite its small size, it outperforms existing unified models like Show-O and JanusFlow in both visual understanding and generation (scoring 74% on GenEval), while running 6x and 11x faster, respectively.

iOS App: https://apps.apple.com/us/app/mobile-o/id6759238106

[Figures omitted: Main Results, Architecture, Training Recipe]


More Examples

[Figures omitted: Image Generation, Image Editing]
