AbstractPhil posted an update about 11 hours ago
Meet FluxLailah (AbstractPhil/tiny-flux-deep), a 220M-parameter Flux variant currently pretraining at BF16. She is experimental and does not produce solid images yet - and yet she is producing. There is both an EMA and a raw weights pair producing different images; the EMA is particularly interesting at times.
Lailah uses flan-t5-base, clip-vit-l-14, and the Black Forest Labs FLUX.1-schnell VAE.
Sequence limit is 128 and images are 512x512 for now. Lailah's early form is based on three variants. TinyFlux's weights were carefully planted into a deeper structure and trained yet again - dubbed TinyFlux-Deep. This variant has 15 dual-stream blocks and 25 single-stream blocks, with weight code nearly identical to Flux and a similar attention mechanism - but intentionally deviant and compacted, with careful consideration of scaling and the purpose of each mechanism.
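The raw/EMA weight pair mentioned above can be sketched as a simple exponential moving average of the training weights. This is a minimal illustration, not the repo's actual code; the decay value is a common convention, not taken from the post.

```python
# Hedged sketch: maintain an EMA copy of the raw training weights.
# decay=0.999 is a typical choice, assumed here for illustration.
def ema_update(ema_params, raw_params, decay=0.999):
    """Blend the raw weights into the EMA copy in place after each step."""
    for name in ema_params:
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * raw_params[name]
    return ema_params
```

Sampling from the EMA copy often yields smoother, more stable images than the raw weights, which may explain why the two produce different results.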
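The architecture details above can be collected into a single configuration sketch. The field names here are illustrative assumptions; only the values come from the post.

```python
# Hedged sketch of the TinyFlux-Deep setup as described in the post.
# Key names are hypothetical; values are quoted from the text.
tiny_flux_deep_config = {
    "params": "~220M",
    "dtype": "bf16",
    "text_encoders": ["flan-t5-base", "clip-vit-l-14"],
    "vae": "Black Forest Labs FLUX.1-schnell VAE",
    "max_seq_len": 128,           # SEQ limit
    "image_size": (512, 512),     # current training resolution
    "dual_stream_blocks": 15,
    "single_stream_blocks": 25,
}
```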
She went through quite a few growing pains with her earlier attention mechanism, which required a reimagining today and careful consideration of the consequences. Now I present to you a preliminary look at Lailah.
The preliminary training is still heavily under way, the mechanisms are still being augmented, and her stability is currently being measured. The potential for fidelity, depth, and quality is still being assessed - so I will be shifting attention and pivoting utility based on needs over time.

The pretraining has hit an impasse.
Currently the schedule is a linear timestep warped by shift, with a random number between 1 and 5 for guidance. I have narrowed the possibilities down to two options that can be implemented today to solve this problem: a CFG expert or a TIMESTEP expert. Which is required, and which is the best candidate?
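The current sampling setup described above can be sketched roughly as follows. The shift warp shown is the form commonly used for flow-matching models; the post does not spell out its exact formula, so treat this as an assumption.

```python
import random

# Hedged sketch of the training-time sampling described in the post:
# a linear timestep warped by a shift factor, plus a uniform guidance
# value in [1, 5]. shift=3.0 is an illustrative default, not from the post.
def sample_timestep_and_guidance(shift=3.0):
    t = random.random()                              # linear timestep in [0, 1)
    t_shifted = shift * t / (1.0 + (shift - 1.0) * t)  # common flow-matching shift warp
    guidance = random.uniform(1.0, 5.0)              # random guidance between 1 and 5
    return t_shifted, guidance
```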

  1. The model WILL require a timestep expert manifold. This would let the timestep manifold be managed by something much more trained and more intelligent during training, which will require CFG guidance training controlled either by learning or by pure random chance, e.g. standard dropout to encourage CFG.
  2. OR the model WILL require a CFG expert to distill the guidance embeds. This model is simply too small. The embeds CAN learn useful information, yes, but only if they are distilled from an expert to bake the CFG into the model by default. This will likely require a third expert that can be modularly snapped off for inference; that expert will likely need to be present during training, otherwise the model will drift heavily due to its small size.
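The "standard dropout to encourage CFG" in option 1 usually means randomly replacing the text conditioning with a null embedding during training, so the model also learns an unconditional branch. A minimal sketch, with a drop rate that is a common convention rather than a value from the post:

```python
import random

# Hedged sketch of conditioning dropout for classifier-free guidance:
# with probability drop_prob, swap the conditioning for a null embedding.
# drop_prob=0.1 is a typical convention, assumed here.
def maybe_drop_condition(cond_embeds, null_embeds, drop_prob=0.1):
    if random.random() < drop_prob:
        return null_embeds   # train the unconditional branch this step
    return cond_embeds       # train the conditional branch as usual
```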

I have trained a multitude of v-pred SDXL models and a flow-matching shift SD1.5 model that can represent this necessary implication. This raises the question: which expert should be used, and should I just make a very specific TinyFlux expert distilled from ALL SD1.5 and SDXL timestep variants using David?

This leads to one CORE and IMPORTANT question: CAN THIS BE REPRESENTED WITHOUT AN EXPERT!? I think this is possible - I've run ViT experiments that used raw sinusoidal encodings with a surprisingly fair representation of encoding capacity.
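The "raw sinusoidal" encoding mentioned above is the standard transformer-style sin/cos embedding. A minimal sketch; the dimension and base period are conventional choices, not values from the post:

```python
import math

# Hedged sketch of a raw sinusoidal embedding for a scalar timestep.
# dim=256 and max_period=10000 are conventional defaults, assumed here.
def sinusoidal_embedding(t, dim=256, max_period=10000.0):
    half = dim // 2
    # geometrically spaced frequencies from 1 down to 1/max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

Because the encoding is fixed rather than learned, it needs no extra parameters - which is why it could sidestep the need for a trained expert if its capacity proves sufficient.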

The model is ALREADY responsive to CFG, but only in part. The current CFG guidance is only getting in the way at many points and, I assume, is just jittering in noise, so I'll need to either disable it or use it correctly. The further training progresses, the more retraining such a component will require, so the decision needs to happen sooner rather than later.
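For reference, "responsive to CFG" means the model's outputs change meaningfully when conditional and unconditional predictions are blended at inference. The standard classifier-free guidance combination, as a sketch with illustrative names:

```python
# Hedged sketch of the standard classifier-free guidance blend at inference:
# extrapolate from the unconditional prediction toward the conditional one.
def apply_cfg(pred_uncond, pred_cond, scale):
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

At scale 1 this reduces to the plain conditional prediction; larger scales push the output further toward the condition, which only helps if the unconditional branch was actually trained.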

Alright, I've decided: I'll train experimentally for some epochs using the expertise afforded by sd15-flow-lune's timestep and trajectory knowledge as the guidance distillation mechanism. How accurately it matches TinyFlux's interpolation requirements is to be determined.

Flow-Lune is an acceptable distillation that converted SD1.5 into a useful image synthesizer, trained on entirely synthetic data derived from SD1.5 and Schnell outputs.
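A teacher-student distillation step of the kind planned above might look like the following. The post does not specify the loss, so this mean-squared-error regression of the student toward the teacher's trajectory prediction is purely an assumption; all names are illustrative.

```python
# Hedged sketch: regress the student's (TinyFlux-Deep's) prediction toward
# the teacher's (Flow-Lune's) at the same timestep. MSE is an assumed choice;
# the post does not state the actual distillation objective.
def distillation_loss(student_pred, teacher_pred):
    """Mean squared error between two equal-length lists of floats."""
    n = len(student_pred)
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / n
```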
