Enhance model card for PixNerd (#1)
- Enhance model card for PixNerd (f3f9510680b19d577e5b290905ece5ac356a0d16)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
|
@@ -1,5 +1,96 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
|
|
|
| 3 |
---
|
| 4 |
|
| 5 |
-
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
pipeline_tag: unconditional-image-generation
|
| 4 |
---
|
| 5 |
|
| 6 |
+
# PixNerd: Pixel Neural Field Diffusion
|
| 7 |
+
|
| 8 |
+
<div style="text-align: center;">
|
| 9 |
+
<a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/Paper-2507.23268-b31b1b.svg" alt="Paper"></a>
|
| 10 |
+
<a href="https://github.com/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="Code"></a>
|
| 11 |
+
<a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Demo"></a>
|
| 12 |
+
</div>
|
| 13 |
+
|
| 14 |
+
PixNerd is a novel pixel-space diffusion transformer for image generation, introduced in the paper [PixNerd: Pixel Neural Field Diffusion](https://huggingface.co/papers/2507.23268). Unlike conventional diffusion models that depend on a compressed latent space shaped by a pre-trained VAE, PixNerd proposes to model patch-wise decoding with a neural field. This results in a single-scale, single-stage, efficient, and end-to-end solution that directly operates in pixel space, avoiding accumulated errors and decoding artifacts.
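
To make "patch-wise decoding with a neural field" concrete, here is a minimal, dependency-free sketch. It is not the PixNerd implementation: it assumes that each patch gets its own small coordinate MLP whose parameters would come from the transformer, and it replaces those predicted parameters with random stand-ins. The names `decode_patch`, `random_weights`, and `decode_image`, the two-layer layout, and the hidden size are all hypothetical; the real inputs and layer layout are defined in the official repository.

```python
import math
import random


def decode_patch(weights, patch_size):
    """Evaluate one per-patch coordinate MLP at every pixel of the patch.

    `weights` is a hypothetical layout (w1, b1, w2, b2) for a 2-layer MLP
    that maps an (x, y) coordinate to a single pixel value.
    Assumes patch_size > 1.
    """
    w1, b1, w2, b2 = weights
    patch = []
    for row in range(patch_size):
        out_row = []
        for col in range(patch_size):
            # Normalise pixel coordinates to [-1, 1] inside the patch.
            y = 2.0 * row / (patch_size - 1) - 1.0
            x = 2.0 * col / (patch_size - 1) - 1.0
            hidden = [math.tanh(w[0] * x + w[1] * y + b) for w, b in zip(w1, b1)]
            out_row.append(sum(h * w for h, w in zip(hidden, w2)) + b2)
        patch.append(out_row)
    return patch


def random_weights(rng, hidden=8):
    """Stand-in for the MLP parameters a transformer token would predict."""
    w1 = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(hidden)]
    b2 = rng.gauss(0, 1)
    return w1, b1, w2, b2


def decode_image(per_patch_weights, grid, patch_size):
    """Decode each patch independently and stitch the patches into one image."""
    size = grid * patch_size
    image = [[0.0] * size for _ in range(size)]
    for index, weights in enumerate(per_patch_weights):
        top, left = divmod(index, grid)  # patch row and column in the grid
        patch = decode_patch(weights, patch_size)
        for r in range(patch_size):
            for c in range(patch_size):
                image[top * patch_size + r][left * patch_size + c] = patch[r][c]
    return image


rng = random.Random(0)
grid, patch_size = 2, 16  # a 2x2 grid of 16x16 patches gives a 32x32 image
image = decode_image(
    [random_weights(rng) for _ in range(grid * grid)], grid, patch_size
)
```

Because every pixel is produced by querying a function of its coordinate, the decoder works directly in pixel space and needs no separate VAE decoding stage, which is the property the paragraph above describes.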
|
| 15 |
+
|
| 16 |
+
<p align="center">
|
| 17 |
+
<img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/arch.png" alt="PixNerd Architecture Diagram" width="700">
|
| 18 |
+
</p>
|
| 19 |
+
|
| 20 |
+
### ✨ Key Highlights
|
| 21 |
+
|
| 22 |
+
* **Efficient Pixel-Space Diffusion**: Directly models image generation in pixel space, eliminating the need for VAEs and their associated complexities or artifacts.
|
| 23 |
+
* **Neural Field Decoding**: Employs neural fields for patch-wise decoding, improving the modeling of high-frequency details.
|
| 24 |
+
* **Single-Stage & End-to-End**: Offers a simplified, efficient training and inference paradigm without complex cascade pipelines.
|
| 25 |
+
* **High Performance**: Achieves an FID of 2.15 on ImageNet 256x256 and 2.84 on ImageNet 512x512 for class-conditional image generation.
|
| 26 |
+
* **Text-to-Image Extension**: The framework is extensible to text-to-image applications, achieving strong results on benchmarks like GenEval (0.73 overall score) and DPG (80.9 overall score).
|
| 27 |
+
|
| 28 |
+
## Visualizations
|
| 29 |
+
|
| 30 |
+
Below are sample images generated by PixNerd, showcasing its capabilities:
|
| 31 |
+
|
| 32 |
+
<p align="center">
|
| 33 |
+
<img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/pixelnerd_teaser.png" alt="PixNerd Teaser" width="700">
|
| 34 |
+
<br/>
|
| 35 |
+
<img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/pixnerd_multires.png" alt="PixNerd Multi-Resolution Examples" width="700">
|
| 36 |
+
</p>
|
| 37 |
+
|
| 38 |
+
## Checkpoints
|
| 39 |
+
|
| 40 |
+
The following checkpoints are available:
|
| 41 |
+
|
| 42 |
+
| Dataset | Model | Params | FID | HuggingFace |
|
| 43 |
+
|---------------|---------------|--------|-------|---------------------------------------|
|
| 44 |
+
| ImageNet256 | PixNerd-XL/16 | 700M | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
|
| 45 |
+
| ImageNet512 | PixNerd-XL/16 | 700M | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
|
| 46 |
+
|
| 47 |
+
| Dataset | Model | Params | GenEval | DPG | HuggingFace |
|
| 48 |
+
|---------------|---------------|--------|------|------|----------------------------------------------------------|
|
| 49 |
+
| Text-to-Image | PixNerd-XXL/16| 1.2B | 0.73 | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
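
The checkpoint repositories listed in the tables can also be fetched programmatically. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function to pull a whole repository, since the checkpoint file names inside each repository are not stated here. The `CHECKPOINT_REPOS` mapping, its `"c2i"`/`"t2i"` keys, and the `download_checkpoint` helper are hypothetical names introduced for illustration.

```python
# Repository ids taken from the checkpoint tables; the short keys are
# hypothetical labels, not names used by the PixNerd codebase.
CHECKPOINT_REPOS = {
    "c2i": "MCG-NJU/PixNerd-XL-P16-C2I",
    "t2i": "MCG-NJU/PixNerd-XXL-P16-T2I",
}


def download_checkpoint(task, local_dir=None):
    """Download every file of the chosen repository and return the local path."""
    # Imported lazily so the mapping is usable without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=CHECKPOINT_REPOS[task], local_dir=local_dir)
```

Calling `download_checkpoint("c2i")` would return a local directory; the `.ckpt` file found there is what the `--ckpt_path` option expects.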
|
| 50 |
+
|
| 51 |
+
## Online Demos
|
| 52 |
+
|
| 53 |
+
You can try out the PixNerd-XXL/16 (text-to-image) model on our Hugging Face Space demo: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd).
|
| 54 |
+
|
| 55 |
+
To host a local Gradio demo for text-to-image applications, run the following command after setting up the environment:
|
| 56 |
+
|
| 57 |
+
```bash
|
| 58 |
+
python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
## Usage
|
| 62 |
+
|
| 63 |
+
For class-conditional image generation on ImageNet (C2I), use the codebase from the [official GitHub repository](https://github.com/MCG-NJU/PixNerd). First, install the required dependencies:
|
| 64 |
+
|
| 65 |
+
```bash
|
| 66 |
+
# for installation
|
| 67 |
+
pip install -r requirements.txt
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
Then, run inference using the `main.py` script (replace `XXX.ckpt` with your checkpoint path):
|
| 71 |
+
|
| 72 |
+
```bash
|
| 73 |
+
# for inference
|
| 74 |
+
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
|
| 75 |
+
# or specify the GPU(s) to use:
|
| 76 |
+
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
|
| 77 |
+
```
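
If you prefer to launch inference from a Python script, the same invocation can be assembled with the standard library. This is a minimal sketch, not part of the PixNerd codebase: `build_predict_command` is a hypothetical helper, and the config path and `XXX.ckpt` placeholder are copied from the commands above.

```python
import os
import shlex


def build_predict_command(config, ckpt_path, gpus=None):
    """Return (argv, env) for the `main.py predict` invocation."""
    argv = ["python", "main.py", "predict", "-c", config, f"--ckpt_path={ckpt_path}"]
    env = dict(os.environ)
    if gpus is not None:
        # Comma-separated GPU indices with no trailing comma.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return argv, env


argv, env = build_predict_command(
    "configs_c2i/pix256std1_repa_pixnerd_xl.yaml", "XXX.ckpt", gpus=[0, 1]
)
print(shlex.join(argv))
# To launch from the repository root:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the checkpoint path contains spaces.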
|
| 78 |
+
|
| 79 |
+
For more details on training and evaluation for both class-to-image (C2I) and text-to-image (T2I) applications, please refer to the [official GitHub repository](https://github.com/MCG-NJU/PixNerd).
|
| 80 |
+
|
| 81 |
+
## Citation
|
| 82 |
+
|
| 83 |
+
If you find this work useful for your research, please cite our paper:
|
| 84 |
+
|
| 85 |
+
```bibtex
|
| 86 |
+
@article{2507.23268,
|
| 87 |
+
Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
|
| 88 |
+
Title = {PixNerd: Pixel Neural Field Diffusion},
|
| 89 |
+
Year = {2025},
|
| 90 |
+
Eprint = {arXiv:2507.23268},
|
| 91 |
+
}
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
## Acknowledgement
|
| 95 |
+
|
| 96 |
+
The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).
|