|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
library_name: diffusers |
|
|
pipeline_tag: text-to-image |
|
|
tags: |
|
|
- diffusion |
|
|
- text-to-image |
|
|
- ambient diffusion |
|
|
- low-quality data |
|
|
- synthetic data |
|
|
--- |
|
|
|
|
|
# Ambient Diffusion Omni (Ambient-o): Training Good Models with Bad Data |
|
|
 |
|
|
|
|
|
## Model Description |
|
|
|
|
|
Ambient Diffusion Omni (Ambient-o) is a framework for using low-quality, synthetic, and out-of-distribution images to improve the quality of diffusion models. Unlike traditional approaches that rely on highly curated datasets, Ambient-o extracts valuable signal from all available images during training, including data typically discarded as "low-quality." |
|
|
|
|
|
This model card is for a text-to-image diffusion model trained on 8 H100 GPUs for only two days. The key innovation is the use of synthetic data as "noisy" samples. |
|
|
|
|
|
## Architecture |
|
|
|
|
|
Ambient-o builds upon the [MicroDiffusion](https://github.com/SonyResearch/micro_diffusion) codebase -- we use a Mixture-of-Experts Diffusion Transformer totaling ~1.1B parameters. |
|
|
|
|
|
## Text-to-Image Results |
|
|
|
|
|
|
|
|
Ambient-o demonstrates improvements in text-to-image generation. Compared to the two baselines of 1) filtering out low-quality samples and 2) treating all data equally, Ambient-o achieves increased diversity relative to 1) and improved quality relative to 2). In other words, it achieves visual improvements without sacrificing diversity. |
|
|
|
|
|
|
|
|
### Training Data Composition |
|
|
|
|
|
The model was trained on a diverse mixture of datasets: |
|
|
- **Conceptual Captions (CC12M)**: 12M image-caption pairs |
|
|
- **Segment Anything (SA1B)**: 11.1M high-resolution images with LLaVA-generated captions |
|
|
- **JourneyDB**: 4.4M synthetic image-caption pairs from Midjourney |
|
|
- **DiffusionDB**: 10.7M quality-filtered synthetic image-caption pairs from Stable Diffusion |
|
|
|
|
|
Data from DiffusionDB were treated as noisy samples. |
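
 |
The treatment of DiffusionDB can be sketched as a per-dataset annotation that marks synthetic samples as usable only above a minimum diffusion noise level. The dataset entries and `sigma_min` values below are illustrative assumptions, not the released training configuration. |
 |
```python |
# Minimal sketch of a data-mixture annotation (illustrative values, not the released config). |
DATA_MIXTURE = [ |
    {"name": "cc12m",       "num_samples": 12_000_000, "sigma_min": 0.0},  # treated as clean |
    {"name": "sa1b",        "num_samples": 11_100_000, "sigma_min": 0.0},  # treated as clean |
    {"name": "journeydb",   "num_samples": 4_400_000,  "sigma_min": 0.0},  # treated as clean |
    {"name": "diffusiondb", "num_samples": 10_700_000, "sigma_min": 0.2},  # treated as noisy |
] |
 |
def usable_datasets(sigma: float) -> list[str]: |
    """Datasets whose samples may contribute to the loss at noise level `sigma`.""" |
    return [d["name"] for d in DATA_MIXTURE if sigma >= d["sigma_min"]] |
``` |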
|
|
|
|
|
|
|
|
### Technical Approach |
|
|
|
|
|
#### High Noise Regime |
|
|
At high diffusion times, the model leverages the theoretical insight that noise contracts distributional differences, reducing the mismatch between the high-quality target distribution and the mixed-quality training data. This creates a beneficial bias-variance trade-off: low-quality samples increase the effective sample size and reduce the variance of the estimator. |
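
 |
A minimal sketch of this idea, assuming a sigma-parameterized denoiser that predicts the clean image: samples flagged as low-quality only contribute to the loss when the sampled noise level exceeds a threshold. The threshold and the weighting are illustrative assumptions, not the exact Ambient-o objective. |
 |
```python |
import torch |
 |
def high_noise_masked_loss(denoiser, x0, sigma, is_clean, sigma_min=0.2): |
    """Denoising loss where noisy/synthetic samples count only at high noise levels. |
 |
    denoiser: callable(x_t, sigma) -> predicted clean image (assumed interface) |
    x0:       clean latents/images, shape (B, C, H, W) |
    sigma:    per-sample noise levels, shape (B,) |
    is_clean: boolean mask, True for high-quality samples, shape (B,) |
    """ |
    noise = torch.randn_like(x0) |
    x_t = x0 + sigma.view(-1, 1, 1, 1) * noise            # diffuse to noise level sigma |
    pred = denoiser(x_t, sigma)                           # predicted clean image |
    per_sample = ((pred - x0) ** 2).flatten(1).mean(dim=1) |
    use = is_clean | (sigma >= sigma_min)                 # noisy data only above the threshold |
    return (per_sample * use.float()).sum() / use.float().sum().clamp(min=1.0) |
``` |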
|
|
|
|
|
#### Low Noise Regime |
|
|
At low diffusion times, the model exploits the locality of natural images: training on small crops allows it to borrow high-frequency details from out-of-distribution or synthetic images whenever their crop-level marginal distributions match the target data. |
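
 |
A corresponding sketch for the low-noise regime, under the assumption that training at small noise levels operates on random crops: local patches from synthetic or out-of-distribution images stand in for target-data patches when their crop-level statistics match. The crop size and the placeholder tensors are illustrative choices. |
 |
```python |
import torch |
 |
def random_crop(x: torch.Tensor, crop_size: int) -> torch.Tensor: |
    """Random square crop from a batch of images/latents with shape (B, C, H, W).""" |
    _, _, h, w = x.shape |
    top = torch.randint(0, h - crop_size + 1, (1,)).item() |
    left = torch.randint(0, w - crop_size + 1, (1,)).item() |
    return x[:, :, top:top + crop_size, left:left + crop_size] |
 |
# At low sigma, crops from out-of-distribution/synthetic images can be mixed into the |
# batch, because only local high-frequency structure has to match the target data. |
low_noise_batch = torch.cat([ |
    random_crop(torch.randn(2, 4, 64, 64), 32),  # placeholder for target-data latents |
    random_crop(torch.randn(2, 4, 64, 64), 32),  # placeholder for OOD/synthetic latents |
]) |
``` |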
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from micro_diffusion.models.model import create_latent_diffusion |
|
|
from huggingface_hub import hf_hub_download |
|
|
from safetensors import safe_open |
|
|
|
|
|
# Init model |
|
|
params = { |
|
|
'latent_res': 64, |
|
|
'in_channels': 4, |
|
|
'pos_interp_scale': 2.0, |
|
|
} |
|
|
model = create_latent_diffusion(**params).to('cuda') |
|
|
|
|
|
# Download weights from HF |
|
|
model_dict_path = hf_hub_download(repo_id="giannisdaras/ambient-o", filename="model.safetensors") |
|
|
model_dict = {} |
|
|
with safe_open(model_dict_path, framework="pt", device="cpu") as f: |
|
|
for key in f.keys(): |
|
|
model_dict[key] = f.get_tensor(key) |
|
|
|
|
|
# Convert parameters to float32 + load |
|
|
float_model_params = { |
|
|
k: v.to(torch.float32) for k, v in model_dict.items() |
|
|
} |
|
|
model.dit.load_state_dict(float_model_params) |
|
|
|
|
|
# Eval mode |
|
|
model = model.eval() |
|
|
|
|
|
# Generate images |
|
|
prompts = [ |
|
|
"Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumet", |
|
|
"A illustration from a graphic novel. A bustling city street under the shine of a full moon.", |
|
|
"A giant cobra snake made from corn", |
|
|
"A fierce garden gnome warrior, clad in armor crafted from leaves and bark, brandishes a tiny sword.", |
|
|
"A capybara made of lego sitting in a realistic, natural field", |
|
|
"a close-up of a fire spitting dragon, cinematic shot.", |
|
|
"Panda mad scientist mixing sparkling chemicals, artstation" |
|
|
] |
|
|
images = model.generate(prompt=prompts, num_inference_steps=30, guidance_scale=5.0, seed=42) |
|
|
``` |
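
 |
Depending on the MicroDiffusion version, `generate` may return a float tensor in [0, 1] with shape (batch, channels, height, width); the snippet below assumes that and simply writes each sample to disk. |
 |
```python |
from torchvision.utils import save_image |
 |
# Assumes `images` is a float tensor in [0, 1]; adjust if your version returns PIL images. |
for i, img in enumerate(images): |
    save_image(img, f"sample_{i:02d}.png") |
``` |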
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{daras2025ambient, |
|
|
title={Ambient Diffusion Omni: Training Good Models with Bad Data}, |
|
|
author={Daras, Giannis and Rodriguez-Munoz, Adrian and Klivans, Adam and Torralba, Antonio and Daskalakis, Constantinos}, |
|
|
journal={arXiv preprint}, |
|
|
year={2025}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
The model follows the [license](https://github.com/SonyResearch/micro_diffusion/blob/main/LICENSE) of the MicroDiffusion repo. |
|
|
|