Improving WAI

#1
by BlueberryTra1n - opened

Hi, I couldn't find a way to write to you directly. I have an outdated GPU, which made me test a lot of fast-generation methods, and I found this LoRA that picks up built-in artist styles very well, unlike DMD2.
That's why I want to ask you: could you please make a pruned WAI with this LoRA, if it's possible?

Lora - https://civitai.com/models/1355945/wai-illustrious-rectified-4steps

I created a new folder 'alternate-4step' with the merge of WAI and the lora you sent. Tell me what you think...

About where to find me: https://inkbunny.net/Coercer1730

First of all, thank you for answering and doing this work - it is very important for people with limited resources like me. I must admit I did not think I would get not only an answer but also the model itself, so I had to "dig around" on this topic myself. I would like to know how you prune, or convert and reduce the size of, the checkpoint. Yesterday I came across an extension for ComfyUI called FP8 converter, passed a full-size WAI through it, and merged it with this LoRA; it turned out to be about 4.5 gigabytes. I will download and try your version though.

I also tried your bananasplitz and novafurry, but they generate with artifacts. Maybe I did something wrong, or maybe I need to use your Comfy workflow.

Also, in the metadata, as you can see, there is chucksFNSNoob_noobaiEpsilonPredV11.safetensors - are you sure you picked WAI?

Sorry for the delay. I'm in over my head right now with academic issues, and I maybe (very probably) messed up in the hurry. I've remerged everything, so let's see if it's okay now.

Merging LoRAs into models in ComfyUI is very easy - you just need the 'Save Checkpoint' node, which should be in every Comfy installation. It is identical to the Load Checkpoint node, but with three inputs instead of outputs. Simply load the full checkpoint and the LoRA you want to use, connect the MODEL and CLIP outputs of LoraLoader to the Save Checkpoint node, and connect the VAE directly from Load Checkpoint (LoraLoader has no VAE input/output).
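If you're curious what the node is doing under the hood, merging just bakes each LoRA delta into the matching base weight. Here's a minimal toy sketch in Python (random tensors, not real checkpoint keys - mapping LoRA key names onto checkpoint key names is the fiddly part the node handles for you):

import torch

# What "merging" does numerically, per weight matrix:
# W_merged = W_base + strength * (alpha / rank) * (lora_up @ lora_down)
rank, alpha, strength = 16, 16.0, 1.0
W_base    = torch.randn(320, 320)          # one base weight (toy size)
lora_down = torch.randn(rank, 320) * 0.01  # the lora's "down" / A matrix
lora_up   = torch.randn(320, rank) * 0.01  # the lora's "up" / B matrix
W_merged  = W_base + strength * (alpha / rank) * (lora_up @ lora_down)
print(W_merged.shape)  # torch.Size([320, 320]) - the lora vanishes into the base weights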

Now, the output size will depend on the flags you run ComfyUI with. Open the .bat you use to launch Comfy in Notepad, and you should see something like

.\python_embeded\python.exe -s ComfyUI\main.py (--some_flags)

Add the following flags at the end of the line:

--fp8_e5m2-unet --fp8_e5m2-text-enc (You can use the e4m3fn variants too, I found no difference).

Those are REALLY important if you have low specs - you'll run inference in fp8 instead of fp16, saving a whopping 50% of memory. And when the 'Save Checkpoint' node is run, you'll get a ~3.7 GB model as output, like mine. You need no extension if you add the flags. If your GPU can't fit even the GGUF models, or you have no GPU at all (like me), also add the --cpu flag (that's the difference between the regular run .bat and run_cpu.bat). The T4 I say I have is actually Colab's T4.
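Just so there's no confusion, the edited line (keeping whatever flags were already there) ends up looking roughly like this - add --cpu as well only if you're going GPU-less:

.\python_embeded\python.exe -s ComfyUI\main.py --fp8_e5m2-unet --fp8_e5m2-text-enc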

What the extension is probably doing is just forcing the UNet to be saved in fp8 (--fp8_e5m2-unet) while leaving the CLIP as fp16, which yields a bigger file than if you force the CLIP to fp8 too. You can see all the components inside a .safetensors by looking at any GGUF model split: the .gguf file is the UNet, which you wrote in fp8; the CLIP files (there are actually two CLIPs) are what you stored in fp16 instead of fp8; and there is a VAE.
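If you ever want to do the same cast outside Comfy, here's a rough sketch with the safetensors library (a brute-force cast, so Comfy's own flag may keep a few tensors in higher precision; the file names are placeholders, the key prefixes are the usual SDXL single-file layout, and you need a recent torch and safetensors for the fp8 dtype):

import torch
from safetensors.torch import load_file, save_file

state = load_file("wai_rectified_fp16.safetensors")  # placeholder name - your merged fp16 checkpoint

out = {}
for key, tensor in state.items():
    # Typical SDXL single-file layout:
    #   model.diffusion_model.*  -> UNet
    #   conditioner.*            -> the two CLIPs
    #   first_stage_model.*      -> VAE (left alone here)
    is_unet = key.startswith("model.diffusion_model.")
    is_clip = key.startswith("conditioner.")
    if (is_unet or is_clip) and tensor.dtype == torch.float16:
        out[key] = tensor.to(torch.float8_e5m2)  # cast UNet + CLIP to fp8
    else:
        out[key] = tensor                        # keep everything else as-is

save_file(out, "wai_rectified_fp8.safetensors")  # placeholder output name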

In short:

fp16 UNet + fp16 CLIP (G+L) + VAE = 6.6 GB (civitai)
fp8 UNet + fp16 CLIP (G+L) + VAE = 4.5 GB (yours)
fp8 UNet + fp8 CLIP (G+L) + VAE = 3.72 GB (my 'pruned' model)
Q4_K_S GGUF UNet + fp8 CLIP (G+L) + VAE = 2.4 GB (GGUF might not work well with LoRAs)

You can search how to convert SDXL models to GGUF; you'll find the notebook that does it. I simply created a variant of that notebook that automates the conversion for all quant types and uploads them to HF, nothing more (it could still be more refined, LOL).

I uploaded a workflow because, as you might have noticed, regular samplers (Euler, DPM, ...) generally yield bad results when working with DMD2 loras. To get properly formed images (although quality will be degraded), you need to select the 'lcm' sampler and the 'beta' scheduler. If you don't mind waiting, increase steps to 12-16 (8 is the bare minimum) and increase CFG from 1.0 to 1.2 or 1.5 (it depends on the model). The big Kohya node in my workflow is just a trick that makes the image composition a bit worse by generating the first half of the total steps at half resolution, further speeding up generation. It's just another node you can bypass if you feel the image is not good enough, but it works well for solos.
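For anyone who wants to reproduce those settings outside Comfy, a rough diffusers equivalent would be something like this (not my workflow - the file name is just a placeholder, and diffusers' LCMScheduler stands in for Comfy's 'lcm' sampler + 'beta' scheduler):

import torch
from diffusers import StableDiffusionXLPipeline, LCMScheduler

# Placeholder path: any SDXL checkpoint with a DMD2-style lora already merged in
pipe = StableDiffusionXLPipeline.from_single_file(
    "wai_rectified_merge.safetensors", torch_dtype=torch.float16)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # low-step sampling
pipe.enable_model_cpu_offload()  # helps on low-VRAM cards; needs accelerate

image = pipe(
    "your prompt here",
    num_inference_steps=12,   # 8 is the bare minimum, 12-16 is safer
    guidance_scale=1.2,       # keep CFG very low, 1.0-1.5
).images[0]
image.save("out.png")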

As an ending note:

Speed constrained but NOT memory constrained -> Use DMD2 models, although you can use Comfy like normal if you are patient.
Memory constrained but NOT speed constrained -> Use GGUF models.
Speed constrained AND memory constrained -> Use Q3 to Q5 GGUF models (I use Q4_K_S) + Kohya DeepShrink node (It's ComfyUI core) + Lower to 20 steps or so.
Speed constrained AND memory constrained AND time constrained -> Coercer1730. D'oh!!! (Psst, leave it genning something in your sleep lmao, mine takes up to 2 hours per image if I don't use DMD2).

Also, if memory is CRITICAL (6-8 GB of RAM, or you want to run fully on a 2 GB GPU), you can use the Q3_K_M GGUF and the modular workflow at https://huggingface.co/Coercer/ChuckFnS_GGUF (you'll need to install this custom node: https://github.com/endman100/ComfyUI-SaveAndLoadPromptCondition.git). Basically, it runs CLIP, UNet and VAE separately, one by one, to save a bit more memory.

I have just uploaded a remake of the checkpoint. Let's see if I did it well now.

Hope that helps. Any doubt, tell me ;)

Wow, thank you again for such a detailed answer, I learned a lot.
I have 16 GB of RAM and a 1050 Ti (4 GB). For some strange reason I can't use Forge, so I had to look towards ComfyUI and SwarmUI, and it's waaaay faster than Forge. I have very basic knowledge of how generation works, so forgive me if I don't understand something. Also, sorry for my bad English - I can understand everything you write, but writing back correctly myself is hard.
When Forge launches, the model immediately loads into memory - I don't know why, but this knocks out my system and the entire interface (RAM at 100%; I've tried a lot of variants/settings with no success). What's more, it takes 10 minutes to load a LoRA or switch to another one, which is terrible, with the same freezes here and there. It seems this is due to the way Forge processes inference, while ComfyUI works in blocks (or something like that) and is therefore more efficient: the LoRA loads instantly and the model loads in a reasonable time (about two minutes) with an fp16 6.9 GB checkpoint.
But the fp8 variant loads INSTANTLY - I guess this is the kind of experience people with a good GPU have.
So here's what I did the first time: I converted the model to fp8 using the Comfy extension (FP8 converter), but only the UNet and not the CLIP, because there are two separate nodes for CLIP and UNet conversion and I only picked one. Then I saw a screenshot of the extension's workflow on GitHub, so I started from scratch. I took an fp16 WAI, merged it with the rectified LoRA and a more vibrant, contrasty VAE (because the rectified LoRA generates with somewhat low contrast and brightness), and then converted both the CLIP and the UNet to fp8 - so now I have, I guess, a result similar to yours: a 3.7 GB safetensors file. But your way feels more native, so I will try that with the next checkpoint, like NoobAI EPS (v-pred and CFG 1 are enemies, as you know).
GGUF is a different story for me, way too complicated I guess.
Regarding the launch parameters - this is very useful information; it means I don't need that converter extension, which essentially does the same thing but as a separate node. I had also never heard of this node (Kohya Deep Shrink), and it is just amazing - I made several generations and did not notice any significant loss of quality.

About samplers, yeah, I've noticed that LCM gives cleaner and better-quality outputs. Usually I used euler_a with sgm_uniform because I saw that most people use it.

good luck with academic btw ;)

And thanks for your amazing project to make life for guys like me more comfy
raw gen 1024x640, 8 steps, cfg1, euler_a sgm uniform

Glad to see it works!

With 4 GB of VRAM you should be okay using normal checkpoints - no need to use GGUF at all for raw gens. Just use the fp8 flags and you're good to go. It's not the best PC in the world, but it's not bad by any means. At least compared to mine (if only!).

Don't worry about language - I'm Spanish, so I'm in the same boat.

The reason Forge only works better on high-RAM machines is, I think, that it uses StableDiffusionPipeline, which is Hugging Face's default inference method. Comfy, however, uses its own particular code, so it works differently from Forge or A1111. Forge might be faster on RTX 3xxx and 4xxx cards, but if not, you're better off with Comfy.

KohyaDeepShrink is a good invention, yeah!

And LCM stands for Latent Consistency Model. Without delving into the math, it is a way of computing with some sort of cheat sheet that helps the model solve images faster.

And thanks for your amazing project to make life for guys like me more comfy -> Pun intended?

raw gen 1024x640 -> That's not a standard resolution. Some people reported the best results with 768x1344, 832x1216, 896x1152, 1024x1024, and the same ones reversed.

good luck with academic btw ;) -> Thanks!


If I generate at 1216x832, I run out of memory and get long tiled VAE decoding. 1024x640 is an attempt to make longer, wider generations; I will try other resolutions that don't break proportions.

I also noticed that if the character is far away, the face is very poorly detailed, and even inpainting with high denoise does not change the situation much. Maybe the initial number of steps should be 16+.

Usually I generate exactly at 1024x640 with 10 steps (takes about 35-45 seconds); then, if the basic result is satisfactory, I do a refining/upscaling pass at 1.5x with Remacri, and only then do I fix the shortcomings with inpainting. Probably not the most efficient process.

Yes, the pun was intentional.
You only have a CPU - I sympathize with you. I can't even imagine how long you have to wait for a generation; my 2-5 minutes for generation plus upscaling seem insignificant. I hope you get a good GPU someday, you deserve it.

The problem with small details is the number of pixels they occupy in the image. A 1000x1000 image has a million pixels, so let's say Illustrious can draw things well with at least 200 thousand pixels of detail. Now think: the head of a faraway character is about 3% of the image. At that scale, it is just 30,000 pixels, which is clearly not enough. But if you upscale the image by 2, the number of pixels increases by 4, which now yields a 120,000-pixel head. The model might not be able to generate that detail out of nowhere, but if you hi-res the image, it can be enough. That's why hi-res fix is important.
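Same arithmetic as a tiny sketch, just to make the numbers concrete:

# Rough pixel budget for a small face, before and after a 2x hi-res pass
width, height = 1000, 1000
head_fraction = 0.03                                     # faraway head ~3% of the frame
base_head_px  = width * height * head_fraction
hires_head_px = (2 * width) * (2 * height) * head_fraction
print(base_head_px)   # 30000.0  - not enough detail
print(hires_head_px)  # 120000.0 - 4x more pixels for the same head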

I only have a CPU and Python knowledge to try to overcome possible Colab bans ;) I render 3D animations, and I've done like 100 of them, so yeah, I'm pretty used to waiting.

Oh yeah, I've heard that Google doesn't allow launching the webui anymore. I've also heard of the Qdiffusion project that lets you generate on Colab (2-4 hours a day, I think, or you can make multiple Google accounts and switch between them), so it might be useful for you (for me it's too janky a way to work around a lack of resources), but you'd definitely be able to generate with "rich" parameters.
Thank you for sharing your knowledge, Gracias.

Since you are a competent person, I have another question about the launch flags in ComfyUI: should I use --normalvram, --lowvram or --novram? There are other flags like reserve VRAM, preview method, cross-attention method, don't upcast attention - should I use some of them? Can you enlighten me?

I used Qdiffusion around 2023. It wasn't bad, but it was limited, and I think it suffered a block too, although I don't really know if it was lifted. I have several accounts I currently use for Blender rendering.

About the launch flags: --normalvram, --lowvram, etc. are used if you get OOM errors in some parts of the workflow. I don't really know the full inference process, but I think they store the parts of the model not currently being used (CLIP during KSampler, UNet during VAE decoding...) in CPU RAM instead of VRAM. This saves some GPU memory, so it can help you run at a bit higher resolution, as you seem to be very close to inferencing at 1024x1024 or 1216x832. --lowvram lowers VRAM consumption, --novram even more, although a bit of GPU is still used for some type of backend? I'm not sure. --cpu is the flag to use 0% GPU.
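If you do hit OOM at 1024x1024 or 1216x832, the line from before would just grow to something like this (flag order doesn't matter):

.\python_embeded\python.exe -s ComfyUI\main.py --lowvram --fp8_e5m2-unet --fp8_e5m2-text-enc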

Reserve VRAM is meant for high-VRAM cards: Comfy leaves the amount of VRAM you specify free for other tasks, like running an LLM in parallel. I don't recommend doing that with your specs. They're enough to run normal Comfy, but you might have problems if you try to load an LLM at the same time - you'll probably go OOM.

Preview method is used if you want to see how the image is forming, similar to Forge. But since you take just 35-40 seconds to generate a full image, it's not a big deal whether you have it on. For me, though, I need to know how the image is forming: I have it turned on so that by the time 30 of the total 120 genning minutes have passed, I can see whether the image is going to be well formed or not, so it's useful to me.

About attention methods - I'm a bit more lost there. Back in the SD 1.5 times (yes, I've been around for ages), xformers used to give a 50% speed boost, but now I think Comfy uses some sort of attention method of its own? (Not sure.) In any case, the boost isn't what it used to be. Trial and error, but I didn't really get much of an improvement, if any at all. (I'm GPU-less, so results might differ for you.)

Thank you for reading my walls of text too ;)

No, no, it's just great that I came across you and that you answer my questions so comprehensively. When I asked in comments how to speed up generation, I was constantly told "just buy yourself a new GPU" - they just don't realize that I'm poor and can't afford it. I'm glad I met you.

Besides you, I talked to another person who uses and praises NoobAI v-pred - but it's a different approach with its own nuances, like using RescaleCFG, which probably doesn't work if you are at CFG 1. I've tried that model and got weird outputs, so.

I'm not a fan of v-pred myself. They say it increases prompt adherence, but not in my case - and god knows I use these models extensively. It's such a hassle, and all my favourite models are EPS.

All because of the full 0-255 color range and a smaller usable CFG range, the output images are either overcooked or too sharp. About prompt following - I heard that NoobAI is a finetune of Illustrious 0.1, but trained not only on Danbooru tags but also on e621, so it has a better understanding of concepts and so on.

So merging the LoRA and converting (pruning?) models goes smoothly. I've tested some checkpoints on the Civitai generator with the WAI-rectified LoRA, and Hyphoria and YiffyMix are nice, so if you want you could try them for yourself - just saying.

Oh, sorry for the radio silence - I had some things going on, but I'm okay now. My favourite now is Xavier 1.0, as it is really good at Pokémon (100% of my furry interest).

Hello, I was on vacation - so yeah.
I've noticed you added a GGUF YiffyMix - I don't know about that format. Is there any particular advantage to using it, and how do you use it?

Just another model that caught my eye - similar in knowledge to Xavier, but more 2.5D-ish. Good for an alternate style without knowledge loss. At this level, it is pretty much pick whichever one you want and that's it; I don't think there's going to be a revolution in Illustrious models at this point. (Looking forward to Flux becoming mainstream.)

Yeah, but what's the difference between GGUF and usual safetensors - are there any advantages to GGUF?

The difference is memory consumption. Usual safetensors are fp16, which means 16 bits (2 bytes) of memory per weight (2.57 billion of them, plus latents). Most safetensors can be converted to fp8 (1 byte per weight), but can't be reduced further. However, GGUF lets you convert to 2, 3, 4, 5, 6, and 8 bits per weight (Q2...Q8), which results in lower memory consumption. For example, a Q3_K_S model, if run in separate stages (CLIP > KSampler > VAE Decode), can fit fully on a $50 2 GB VRAM card. They are a bit slower than their .safetensors counterparts, though.
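Back-of-the-envelope for the UNet alone (the ~2.57B figure is rough, and the K-quants don't land on exactly 4 bits per weight, so treat these as ballpark numbers):

params = 2.57e9                 # SDXL UNet weights, roughly
GB = 1e9
print(params * 2 / GB)          # fp16 UNet: ~5.1 GB
print(params * 1 / GB)          # fp8 UNet:  ~2.6 GB
print(params * 4.5 / 8 / GB)    # ~Q4 UNet:  ~1.4 GB (K-quants average a bit over 4 bits/weight)
# The two CLIPs + VAE add roughly another 1-2 GB on top, depending on their precision.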

Hello again, friend. I want to thank you again for your guides; I've already created some merges with LoRAs and mixed models just for fun.
So I have a question: have you tried v-pred models? Also, multiple DMD2 LoRAs have appeared on Civitai - do you think there's any difference between them, or are they just reuploads?

Nice to see you started investigating on your own!

I don't really get the hype around v-pred models at all. They are buggy, a hassle to make work properly, and I find no difference from epsilon models at all. Maybe it's just me, though.

About the DMD2 LoRAs: there is another low-step LoRA named Hyper-SDXL, but it requires a TCD sampler. It also has less prompt adherence, although solos can still be pulled off. That way, 4 steps are enough.
