How to use with Diffusers

TBD. The model requires pipeline patching to remap the embedding-preparation step. Code coming soon.
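
Until that code lands, the sketch below shows the intended wiring: load the trimmed encoder and pass it in as the pipeline's text encoder. The pipeline class name (Flux2Pipeline), the base checkpoint id, and loading the encoder via AutoModel are assumptions about the current diffusers/transformers APIs, and the embedding-preparation patch mentioned above is not included, so treat this as a sketch rather than a working recipe.

```python
# Hedged sketch only: Flux2Pipeline, the base repo id, and AutoModel loading are
# assumptions; the embedding-preparation patch this model needs is NOT included.
import torch
from transformers import AutoModel
from diffusers import Flux2Pipeline  # assumption: FLUX.2 pipeline class name

# Load the trimmed 7-layer text encoder in bfloat16.
text_encoder = AutoModel.from_pretrained(
    "WaveCut/FLUX.2-TE-Trimmed-7L-Distil", torch_dtype=torch.bfloat16
)

# Swap it into the base pipeline; all other components come from the base repo.
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",  # assumption: base FLUX.2 checkpoint id
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a lighthouse at dusk, volumetric fog").images[0]
image.save("out.png")
```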

How to use in Comfy

TBD. I plan to provide at least an fp8 safetensors distribution of this model.


WaveCut/FLUX.2-TE-Trimmed-7L-Distil (distilled text encoder)

A smaller 7-layer text encoder distilled from the FLUX.2 text encoder (aka Mistral-Small-3.2-24B-Instruct-2506), intended as a lighter text backbone.

Model description

  • Teacher model: mistralai/Mistral-Small-3.2-24B-Instruct-2506 (subfolder: text_encoder)
  • Compact init: WaveCut/FLUX2-TE-Trimmed7L-Research
  • Architecture: Mistral3 text encoder, 7 transformer layers (~6B parameters)
  • Max sequence length during distillation: 512
  • Dataset: k-mktr/improved-flux-prompts (split: train, field: prompt)
  • Dtype: bfloat16
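
As a quick sanity check of the compact encoder on its own, the sketch below tokenizes a prompt and extracts the last hidden state, which is the tensor the distillation objective targets. Loading the checkpoint via AutoModel and pairing it with the teacher's tokenizer are assumptions.

```python
# Minimal standalone sketch; AutoModel loading and reusing the teacher's
# tokenizer are assumptions, not the author's published usage code.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506"  # assumption: teacher tokenizer
)
model = AutoModel.from_pretrained(
    "WaveCut/FLUX.2-TE-Trimmed-7L-Distil", torch_dtype=torch.bfloat16
).eval()

prompt = "a watercolor painting of a red fox in the snow"
inputs = tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# Last hidden state: (batch, seq_len, hidden_size) — what the distillation
# objective matches against the 24B teacher.
print(out.last_hidden_state.shape)
```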

Distillation setup

  • Objective: token-wise MSE between teacher and compact model last hidden states (see the training-step sketch after this list)
  • Optimizer: AdamW (lr=1e-05, weight_decay=0.01)
  • Scheduler: cosine with warmup (warmup_ratio=0.1)
  • Batch size: 8
  • Gradient accumulation steps: 1
  • Epochs: 16
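
Concretely, one training step under this setup looks roughly like the sketch below. `teacher`, `student`, `optimizer`, and `scheduler` are assumed to be constructed elsewhere (AdamW and the cosine-with-warmup schedule from the list above), and masking padding positions out of the loss is an assumption about the exact reduction used.

```python
# Sketch of one distillation step: token-wise MSE between last hidden states.
# `teacher`, `student`, `optimizer`, `scheduler` are assumed to already exist;
# masking out padding positions is an assumption about the exact reduction.
import torch
import torch.nn.functional as F

def distill_step(batch, teacher, student, optimizer, scheduler):
    input_ids, attention_mask = batch["input_ids"], batch["attention_mask"]

    with torch.no_grad():  # teacher is frozen
        t_hidden = teacher(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state

    s_hidden = student(
        input_ids=input_ids, attention_mask=attention_mask
    ).last_hidden_state

    # Token-wise MSE, averaged over non-padding positions.
    mask = attention_mask.unsqueeze(-1).to(s_hidden.dtype)
    mse = F.mse_loss(s_hidden, t_hidden.to(s_hidden.dtype), reduction="none")
    loss = (mse * mask).sum() / (mask.sum() * s_hidden.size(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```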

Distillation metrics (evaluation)

On the training split (held-out eval batches); a sketch of how these can be computed follows the list:

  • token-wise MSE: 347.5225
  • token-level cosine similarity: 0.9222
  • pooled (last token) cosine similarity: 0.9722
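
For reference, the sketch below shows one way to compute these three quantities from teacher and student last hidden states. The exact masking and pooling conventions (non-padding tokens only, last non-padding token as the pooled representation, right padding) are assumptions.

```python
# Sketch of the eval metrics; masking/pooling conventions are assumptions.
import torch
import torch.nn.functional as F

def eval_metrics(t_hidden, s_hidden, attention_mask):
    # t_hidden, s_hidden: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.bool()

    # Token-wise MSE over non-padding tokens.
    diff = (s_hidden - t_hidden)[mask]  # (n_tokens, hidden)
    mse = diff.pow(2).mean()

    # Token-level cosine similarity, averaged over non-padding tokens.
    token_cos = F.cosine_similarity(s_hidden[mask], t_hidden[mask], dim=-1).mean()

    # Pooled (last non-padding token) cosine similarity, assuming right padding.
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(t_hidden.size(0))
    pooled_cos = F.cosine_similarity(
        s_hidden[batch_idx, last_idx], t_hidden[batch_idx, last_idx], dim=-1
    ).mean()

    return mse.item(), token_cos.item(), pooled_cos.item()
```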