# TinyFlux-Deep

An expanded TinyFlux architecture that increases depth and width while preserving learned representations. TinyFlux-Deep is ported from TinyFlux with strategic layer expansion and attention head doubling.
## Model Description

TinyFlux-Deep extends the base TinyFlux model by:

- Doubling attention heads (2 → 4) with an expanded hidden dimension (256 → 512)
- 5× more double-stream layers (3 → 15)
- ~8× more single-stream layers (3 → 25)
- Preserving learned weights from TinyFlux in frozen anchor positions
## Architecture Comparison
| Component | TinyFlux | TinyFlux-Deep | Flux |
|---|---|---|---|
| Hidden size | 256 | 512 | 3072 |
| Attention heads | 2 | 4 | 24 |
| Head dimension | 128 | 128 | 128 |
| Double-stream layers | 3 | 15 | 19 |
| Single-stream layers | 3 | 25 | 38 |
| VAE channels | 16 | 16 | 16 |
| Total params | ~8M | ~85M | ~12B |
## Layer Mapping (Ported from TinyFlux)

The original TinyFlux weights are strategically distributed and frozen:

**Single blocks (3 → 25):**

| TinyFlux Layer | TinyFlux-Deep Position | Status |
|---|---|---|
| 0 | 0 | Frozen |
| 1 | 8, 12, 16 | Frozen (3 copies) |
| 2 | 24 | Frozen |
| — (new) | 1-7, 9-11, 13-15, 17-23 | Trainable |

**Double blocks (3 → 15):**

| TinyFlux Layer | TinyFlux-Deep Position | Status |
|---|---|---|
| 0 | 0 | Frozen |
| 1 | 4, 7, 10 | Frozen (3 copies) |
| 2 | 14 | Frozen |
| — (new) | 1-3, 5-6, 8-9, 11-13 | Trainable |
Trainable ratio: ~70% of parameters
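A minimal sketch of how this anchor layout could be applied, assuming the transformer blocks live in `nn.ModuleList`s; the attribute names `single_blocks` and `double_blocks` are assumptions about model.py, and the mapping dicts mirror the tables above:

```python
import torch.nn as nn

# Anchor positions from the tables above: {TinyFlux layer: TinyFlux-Deep positions}
SINGLE_ANCHORS = {0: [0], 1: [8, 12, 16], 2: [24]}
DOUBLE_ANCHORS = {0: [0], 1: [4, 7, 10], 2: [14]}

def freeze_anchor_blocks(blocks: nn.ModuleList, anchor_map: dict) -> None:
    """Freeze blocks at anchor positions; leave interpolating blocks trainable."""
    frozen = {pos for positions in anchor_map.values() for pos in positions}
    for idx, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = idx not in frozen

# Hypothetical usage, assuming these attribute names:
# freeze_anchor_blocks(model.single_blocks, SINGLE_ANCHORS)
# freeze_anchor_blocks(model.double_blocks, DOUBLE_ANCHORS)
```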
## Attention Head Expansion

The original 2 heads are copied into the same slots of the wider projection, and 2 new heads are randomly initialized:

- Old head 0 → New head 0
- Old head 1 → New head 1
- Heads 2-3 → Xavier initialized (scaled 0.02×)
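A minimal sketch of this expansion for a single projection matrix, assuming a `[heads × head_dim, in_features]` weight layout; the released port_to_deep.py may differ in detail:

```python
import torch
import torch.nn as nn

def expand_heads(old_w: torch.Tensor, new_heads: int = 4, head_dim: int = 128,
                 new_in: int = 512) -> torch.Tensor:
    """Copy old heads into slots 0..1; Xavier-init heads 2..3 scaled by 0.02."""
    new_w = torch.empty(new_heads * head_dim, new_in)
    nn.init.xavier_uniform_(new_w)
    new_w *= 0.02  # new heads start near-silent so the ported heads dominate early on
    # Old heads occupy the first rows; the input dim also grew 256 -> 512,
    # so the old weights fill only the first 256 columns.
    new_w[: old_w.shape[0], : old_w.shape[1]] = old_w
    return new_w
```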
## Text Encoders
Same as TinyFlux:
| Role | Model |
|---|---|
| Sequence encoder | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) |
## Training

### Strategy

- Port TinyFlux weights with dimension expansion
- Freeze ported layers as "anchor" knowledge
- Train new layers to interpolate between anchors
- Optional: unfreeze all layers and fine-tune at a lower learning rate (a sketch of this two-phase schedule follows below)
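A minimal sketch of the two-phase schedule, assuming `model` is an instantiated TinyFlux-Deep with anchors already frozen; the phase-2 learning rate is not specified in this card, so 1e-5 below is illustrative:

```python
import torch

# Phase 1: optimize only the new interpolating layers (frozen anchors excluded)
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=5e-5, betas=(0.9, 0.99), weight_decay=0.01)

# Optional phase 2: unfreeze everything and fine-tune at a lower learning rate
for p in model.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.99), weight_decay=0.01)
```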
### Dataset

Trained on AbstractPhil/flux-schnell-teacher-latents:

- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles
### Training Details

- Objective: Flow matching (rectified flow)
- Timestep sampling: Logit-normal with Flux shift (s=3.0)
- Loss weighting: Min-SNR-γ (γ=5.0)
- Optimizer: AdamW (lr=5e-5, β=(0.9, 0.99), wd=0.01)
- Schedule: Cosine with warmup
- Precision: bfloat16
- Batch size: 32 (16 × 2 gradient accumulation)
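A minimal sketch of a training objective consistent with the settings above. The exact Min-SNR-γ formulation for rectified flow is not given in this card, so the SNR definition below (SNR(t) = ((1−t)/t)²) and the weight min(SNR, γ)/SNR are assumptions:

```python
import torch

def flux_shift(t: torch.Tensor, s: float = 3.0) -> torch.Tensor:
    return s * t / (1 + (s - 1) * t)

def sample_timesteps(batch: int, device: str = "cuda") -> torch.Tensor:
    # Logit-normal sampling: sigmoid of a standard normal draw, then the Flux shift
    return flux_shift(torch.sigmoid(torch.randn(batch, device=device)))

def flow_matching_loss(pred_v: torch.Tensor, latents: torch.Tensor,
                       noise: torch.Tensor, t: torch.Tensor,
                       gamma: float = 5.0) -> torch.Tensor:
    # Rectified flow: x_t = (1 - t) * x0 + t * noise, so the target velocity
    # is noise - x0, and the model predicts it from (x_t, t).
    target = noise - latents
    snr = ((1 - t) / t.clamp(min=1e-5)) ** 2
    weight = torch.minimum(snr, torch.full_like(snr, gamma)) / snr.clamp(min=1e-8)
    per_sample = ((pred_v - target) ** 2).mean(dim=tuple(range(1, pred_v.ndim)))
    return (weight * per_sample).mean()
```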
## Usage

### Installation

```bash
pip install torch transformers diffusers safetensors huggingface_hub
```
### Inference

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load model (copy the TinyFlux class definition first; use TinyFluxDeepConfig)
config = TinyFluxDeepConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux-deep", "model.safetensors"))
model.load_state_dict(weights, strict=False)  # strict=False for precomputed buffers
model.eval()

# Load text encoders and VAE
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

# Encode the prompt: T5 for the sequence embedding, CLIP for the pooled embedding
prompt = "a photo of a cat sitting on a windowsill"
t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    t5_out = t5_enc(**t5_in).last_hidden_state
    clip_out = clip_enc(**clip_in).pooler_output

# Euler sampling over a Flux-shifted timestep schedule
def flux_shift(t, s=3.0):
    return s * t / (1 + (s - 1) * t)

x = torch.randn(1, 64 * 64, 16, device="cuda", dtype=torch.bfloat16)  # latent tokens
img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
timesteps = flux_shift(torch.linspace(0, 1, 21, device="cuda"))
guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    for i in range(20):
        t = timesteps[i].unsqueeze(0)
        dt = timesteps[i + 1] - timesteps[i]
        v = model(
            hidden_states=x,
            encoder_hidden_states=t5_out,
            pooled_projections=clip_out,
            timestep=t,
            img_ids=img_ids,
            guidance=guidance,
        )
        x = x + v * dt  # Euler step along the predicted velocity

    # Decode: unpack tokens to (B, C, H, W), undo VAE scaling, decode to pixels
    latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
    latents = latents / vae.config.scaling_factor
    image = vae.decode(latents.float()).sample
image = (image / 2 + 0.5).clamp(0, 1)
```
## Configuration

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TinyFluxDeepConfig:
    hidden_size: int = 512
    num_attention_heads: int = 4
    attention_head_dim: int = 128
    in_channels: int = 16
    joint_attention_dim: int = 768
    pooled_projection_dim: int = 768
    num_double_layers: int = 15
    num_single_layers: int = 25
    mlp_ratio: float = 4.0
    axes_dims_rope: Tuple[int, int, int] = (16, 56, 56)
    guidance_embeds: bool = True
```
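One quick consistency check worth noting: the hidden size must factor as heads × head_dim, matching the architecture table above:

```python
cfg = TinyFluxDeepConfig()
assert cfg.hidden_size == cfg.num_attention_heads * cfg.attention_head_dim  # 4 * 128 == 512
```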
## Files

```
AbstractPhil/tiny-flux-deep/
├── model.safetensors       # Model weights (~340MB)
├── config.json             # Model configuration
├── frozen_params.json      # List of frozen parameter names
├── README.md               # This file
├── model.py                # Model architecture (includes TinyFluxDeepConfig)
├── inference_colab.py      # Inference script
├── train_deep_colab.py     # Training script with layer freezing
├── port_to_deep.py         # Porting script from TinyFlux
├── checkpoints/            # Training checkpoints
│   └── step_*.safetensors
├── logs/                   # Tensorboard logs
└── samples/                # Generated samples during training
```
## Porting from TinyFlux

To create a new TinyFlux-Deep from scratch, run port_to_deep.py, which:

1. Downloads AbstractPhil/tiny-flux weights
2. Creates the TinyFlux-Deep model (512 hidden, 4 heads, 25 single, 15 double)
3. Expands attention heads (2 → 4) and hidden dimension (256 → 512)
4. Distributes layers to anchor positions
5. Saves to AbstractPhil/tiny-flux-deep
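A minimal sketch of the width expansion in step 3: the same top-left embedding pattern as the head-expansion sketch above, generalized to any linear weight. The scaled Xavier init for the new slice is an assumption consistent with that note, not a confirmed detail of port_to_deep.py:

```python
import torch
import torch.nn as nn

def expand_linear(old_w: torch.Tensor, new_out: int, new_in: int) -> torch.Tensor:
    """Embed an old [out, in] weight in the top-left of a larger matrix."""
    new_w = torch.empty(new_out, new_in)
    nn.init.xavier_uniform_(new_w)
    new_w *= 0.02  # new rows/columns start near-silent
    new_w[: old_w.shape[0], : old_w.shape[1]] = old_w  # preserve learned weights
    return new_w

# e.g. expanding a 256 -> 512 hidden projection:
# new_proj = expand_linear(old_proj, new_out=512, new_in=512)
```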
## Comparison with TinyFlux
| Aspect | TinyFlux | TinyFlux-Deep |
|---|---|---|
| Parameters | ~8M | ~85M |
| Memory (bf16) | ~16MB | ~170MB |
| Forward pass | ~15ms | ~60ms |
| Capacity | Limited | Moderate |
| Training | From scratch | Ported + fine-tuned |
## Limitations

- Resolution: Trained on 512×512 only
- Quality: Better than TinyFlux, still below full Flux
- Text understanding: Limited by the smaller T5 encoder (768 vs 4096 dim)
- Early training: The model is actively being trained
- Experimental: Intended for research, not production
## Intended Use
- Studying model scaling and expansion techniques
- Testing layer freezing and knowledge transfer
- Rapid prototyping with moderate capacity
- Educational purposes
- Baseline for architecture experiments
## Citation

```bibtex
@misc{tinyfluxdeep2026,
  title={TinyFlux-Deep: Expanded Flux Architecture with Knowledge Preservation},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/tiny-flux-deep}
}
```
## Related Models
- AbstractPhil/tiny-flux - Base model (8M params)
- black-forest-labs/FLUX.1-schnell - Original Flux
## Acknowledgments
- Black Forest Labs for the original Flux architecture
- Hugging Face for diffusers and transformers libraries
## License

MIT License - see the LICENSE file for details.

Note: This is an experimental research model under active development. Training is ongoing and weights may be updated frequently.