Will the final model definitely be released? + captioning question
Just wanted to know whether there's a good chance the final trained model will be released when it's ready, since others sadly reserve their best models for APIs and the like. Also, could you share a little about how the dataset was captioned? Did you include a tag caption plus a natural-language caption for each image, or alternate one caption type per image? I'd like to train my LoRAs following the same approach.
Second the question about the final model being released.
The final model version is definitely being released. Nobody would care about this model at all if the weights weren't available. The plan from the beginning has always been to make everything open-weights.
For the captioning, every image has multiple caption variants and it trains on all of them: full tag list, tag list with dropout, tags followed by caption, caption followed by tags, short caption only, long caption only.
LoRAs are usually very light finetunes, so a wide variety of captioning styles will probably work. You don't have to follow exactly how the base model was trained.
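Since people will want to replicate this for their own datasets, here's a rough sketch of what the variant scheme could look like in code. The function name, field layout, and the 0.3 dropout rate are illustrative assumptions, not the actual training pipeline:

```python
import random

def caption_variants(tags, short_caption, long_caption, dropout_p=0.3):
    """Build the six caption variants described above for one image."""
    full_tags = ", ".join(tags)
    # Randomly drop tags so the model learns not to depend on every tag.
    kept = [t for t in tags if random.random() > dropout_p] or tags[:1]
    dropout_tags = ", ".join(kept)
    return [
        full_tags,                       # full tag list
        dropout_tags,                    # tag list with dropout
        f"{full_tags}\n{long_caption}",  # tags followed by caption
        f"{long_caption}\n{full_tags}",  # caption followed by tags
        short_caption,                   # short caption only
        long_caption,                    # long caption only
    ]

# e.g. sample one variant per image each time it is seen during training
variants = caption_variants(
    ["1girl", "solo", "blue_hair", "outdoors"],
    "A girl with blue hair standing outside.",
    "A girl with long blue hair stands outdoors on a sunny day, "
    "looking toward the camera.",
)
print(random.choice(variants))
```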
Knowing the weights will definitely be released is a big sigh of relief for me! As much as I love the model, I did have similar worries. Love the work y'all are doing!
Could you share what was used for creating the natural-language captions?
To add to this, a few example samples from the dataset would be really nice to use as a reference for structuring your own datasets once LoRA/finetuning takes off (I'm sure it will; this model is soaking up concepts like crazy).
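In the meantime, one plausible structure would be a JSONL file with one record per image carrying every caption piece, so a training script can assemble whichever variant it wants at load time. This schema is just my guess for illustration, not the project's actual format:

```python
import json

# Hypothetical record layout: keep all caption styles alongside the image
# path so the loader can mix caption types per epoch.
record = {
    "image": "images/0001.png",
    "tags": ["1girl", "solo", "blue_hair", "outdoors"],
    "caption_short": "A girl with blue hair standing outside.",
    "caption_long": "A girl with long blue hair stands outdoors on a "
                    "sunny day, looking toward the camera.",
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```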
Commenting here mainly to get notifications when the full version hopefully releases, but I do have feedback and suggestions for future models:
- Other data sources such as e621, R34, etc.
- More NSFW representation in dataset
- Better non-tag understanding
- Bigger text encoder
- Different base, e.g. Z Image Base
- Object-style prompting like NewBie
- Better text rendering; it can already do text, but it seems to fail at things like "text that says xyz made of cheese"
- Bigger/smaller variants
- More real-life data to cover concepts not commonly found in anime-style images; it struggles quite a bit when I want really oddly specific things
- Small prompt enhancer LLM
- Better default art style with less of a slop look; the model really comes to life with artist tags, but it should look better by default
I know a lot of these things are probably not practical for a small team; I'm just throwing out every random idea I have. Great model so far; it seemed about as good as NovelAI (not the leaked SD 1.5 model) the last time I tried it.
Really impressed with how many artists and characters it knows, including some I never expected it to handle decently well, even if not perfectly. Also, it's reasonably fast and stable on my kinda-maybe-sorta defective RX 6800, which likes to give me all sorts of weird issues, crashes/reboots, or slowness, while Z Image Base and other models really aren't.
Best of luck :-)
Good job!
And I really hope that Chinese input can be supported.