How to use with Diffusers
TBD. The model requires pipeline patching to remap how the embeddings are prepared; code is coming soon.
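Until the patched pipeline is published, the sketch below shows what standalone prompt encoding could look like, assuming the repo exposes a standard transformers model and tokenizer (if the tokenizer is missing, it can be taken from the teacher); the embedding-preparation patching mentioned above is not shown.

```python
# Minimal sketch: encode a prompt with the distilled encoder on its own.
# Assumes the repo exposes a standard transformers model and tokenizer; the
# pipeline patching mentioned above (embedding remapping) is NOT covered here.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "WaveCut/FLUX.2-TE-Trimmed-7L-Distil"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
text_encoder = AutoModel.from_pretrained(repo_id, torch_dtype=torch.bfloat16).eval()

prompt = "a watercolor painting of a lighthouse at dawn"
inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=512,  # matches the distillation sequence length
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Last hidden states are what the distillation objective targeted.
    embeddings = text_encoder(**inputs).last_hidden_state

print(embeddings.shape)  # (1, 512, hidden_size)
```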
How to use in Comfy
TBD.
I plan to provide at least an fp8 safetensors distribution of this model.
WaveCut/FLUX.2-TE-Trimmed-7L-Distil (distilled text encoder)
A smaller, 7-layer text encoder distilled from the FLUX.2 text encoder (aka Mistral-Small-3.2-24B-Instruct-2506), intended as a lighter text backbone.
Model description
- Teacher model: mistralai/Mistral-Small-3.2-24B-Instruct-2506 (subfolder: text_encoder)
- Compact init: WaveCut/FLUX2-TE-Trimmed7L-Research
- Architecture: Mistral3 text encoder, 7 transformer layers
- Max sequence length during distillation: 512
- Dataset: k-mktr/improved-flux-prompts (split: train, field: prompt; loading sketch below)
- Dtype: bfloat16
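For reference, the distillation prompts can be pulled straight from the dataset listed above; this is just a loading sketch, not the original preprocessing code.

```python
# Load the prompt field used for distillation (split "train", field "prompt").
from datasets import load_dataset

prompts = load_dataset("k-mktr/improved-flux-prompts", split="train")["prompt"]
print(len(prompts), prompts[0])
```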
Distillation setup
- Objective: token-wise MSE between the teacher's and the compact model's last hidden states (a training-step sketch follows this list)
- Optimizer: AdamW (lr=1e-05, weight_decay=0.01)
- Scheduler: cosine with warmup (warmup_ratio=0.1)
- Batch size: 8
- Gradient accumulation steps: 1
- Epochs: 16
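The hyperparameters above translate roughly into a training loop like the sketch below; `student`, `teacher`, and `dataloader` are placeholders, and this is not the released training code.

```python
import torch
import torch.nn.functional as F
from transformers import get_cosine_schedule_with_warmup

# Placeholders: `student` (the 7-layer compact model), `teacher` (frozen FLUX.2
# text encoder), and `dataloader` (batches of 8 tokenized prompts, max length 512).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5, weight_decay=0.01)
total_steps = len(dataloader) * 16  # 16 epochs, gradient accumulation of 1
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup_ratio = 0.1
    num_training_steps=total_steps,
)

for epoch in range(16):
    for batch in dataloader:
        with torch.no_grad():
            target = teacher(**batch).last_hidden_state  # frozen teacher
        pred = student(**batch).last_hidden_state        # compact model

        # Token-wise MSE between last hidden states (padding handling assumed).
        loss = F.mse_loss(pred, target)

        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```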
Distillation metrics (evaluation)
On the training split (held-out eval batches):
- token-wise MSE: 347.5225
- token-level cosine similarity: 0.9222
- pooled (last token) cosine similarity: 0.9722
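These numbers presumably come from a computation along the lines of the sketch below, where `t` and `s` are teacher and student last hidden states for an eval batch; the exact reduction and padding handling are assumptions.

```python
# Sketch of how the reported numbers are presumably computed, given teacher and
# student last hidden states `t` and `s` of shape (batch, seq_len, hidden_size).
import torch
import torch.nn.functional as F

def eval_metrics(t: torch.Tensor, s: torch.Tensor) -> dict:
    token_mse = F.mse_loss(s, t).item()                           # token-wise MSE
    token_cos = F.cosine_similarity(s, t, dim=-1).mean().item()   # per-token cosine
    pooled_cos = F.cosine_similarity(s[:, -1], t[:, -1], dim=-1).mean().item()  # last-token cosine
    return {"token_mse": token_mse, "token_cos": token_cos, "pooled_cos": pooled_cos}
```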
Base model: black-forest-labs/FLUX.2-dev