Flux2-Klein-9B-True-V1, Qwen-Image-2512-Turbo-LoRA-2-Steps & Z-Image-Turbo-Art Released (2x fine-tunes & 1 LoRA)

Three new models were released today. I've had no time to download and test them all (apart from a quick comparison between Klein 9B and the new Klein 9B True fine-tune) as I'm off to the pub.

This isn't a comparison between the 3 models as they are totally different things.

# 1. Z-Image-Turbo-Art

"This model is a fine-tuned fusion of Z Image and Z Image Turbo . It extracts some of the stylization capabilities from the Z Image Base model and then performs a layered fusion with Z Image Turbo followed by quick fine-tuning, This is just an attempt to fully utilize the Z Image Base model currently. Compared to the official models, this model images are clearer and the stylization capability is stronger, but the model has reduced delicacy in portraits, especially on skin, while text rendering capability is largely maintained."

https://huggingface.co/wikeeyang/Z-Image-Turbo-Art

# 2. Flux2-Klein-9B-True-V1

"This model is a fine-tuned version of FLUX.2-klein-9B. Compared to the official model, it is undistilled, clearer, and more realistic, with more precise editing capabilities, greatly reducing the problem of detail collapse caused by insufficient steps in distilled models."

https://huggingface.co/wikeeyang/Flux2-Klein-9B-True-V1

https://preview.redd.it/xqja0uvywhgg1.png?width=1693&format=png&auto=webp&s=290b93d949be6570f59cf182803d2f04c8131ce7

Above: left is the original pic, the edit was to add a black dress in image 2, middle is the original Klein 9B and the right pic is the 9B True model. I think I need more tests tbh.

# 3. Qwen-Image-2512-Turbo-LoRA-2-Steps

"This is a 2-step turbo LoRA for Qwen Image 2512 trained by Wuli Team, representing an advancement over our 4-step turbo LoRA."

https://huggingface.co/Wuli-art/Qwen-Image-2512-Turbo-LoRA-2-Steps

I Finally Learned About VAE Channels (Core Concept)

With a recent upgrade to a 5090, I can start training LoRAs with hi-res images containing lots of tiny details. Reading through this LoRA training guide, I wondered whether training on high-resolution images would work for SDXL or would just be a waste of time. That led me down a rabbit hole that cost me 4 hours, but it was worth it because I found this blog post, which very clearly explains why SDXL always seems to drop the ball when it comes to "high frequency details" and why training it with high-quality images would be a waste of time if I wanted to preserve those details in its output.

The keyword I was missing was the number of channels the VAE uses. The higher the channel count, the more detail can be reconstructed during decoding. SDXL (and SD1.5) uses a 4-channel VAE, but the number can go higher. When Flux was released, I saw higher quality out of the model, but far slower generation times. Part of that is its 16-channel VAE. Flux isn't slower than SDXL for nothing; it's simply doing more work, and I couldn't properly appreciate that advantage at the time.
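
If you want to see the channel count for yourself, here's a minimal sketch using the diffusers library (the dummy tensor just stands in for a 1024x1024 RGB image):

```python
import torch
from diffusers import AutoencoderKL

# The public SDXL VAE; Flux / SD3 / Z-Image VAEs report 16 here instead
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
print(vae.config.latent_channels)   # 4

x = torch.randn(1, 3, 1024, 1024)   # stand-in for a 1024x1024 RGB image
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()
print(z.shape)  # torch.Size([1, 4, 128, 128]): 8x smaller per side, 4 channels deep
```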

Flux, SD3 (which everyone clowned on), and now the popular Z-Image all use 16-channel VAEs, which compress less aggressively than SDXL's 4-channel VAE and can therefore reconstruct higher-fidelity images. So you might be wondering: why not just use a 16-channel VAE on SDXL? The answer is that it's not compatible; the model itself will not accept the 16-channel latents such a VAE produces. You would probably need to re-train the model from the ground up to give it that ability.
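
The mismatch is visible right in the model config. A rough sketch, assuming the diffusers API and the public SDXL repo (you could also just read the config JSON on Hugging Face instead of downloading the weights):

```python
from diffusers import UNet2DConditionModel

# SDXL's denoiser has its latent channel count baked into the first conv layer
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
print(unet.config.in_channels)  # 4 — a 16-channel latent simply doesn't fit this input
```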

Higher channel count comes at a cost though, which materializes in generation time and VRAM. For some, the tradeoff is worth it, but I wanted crystal clarity before I dumped a bunch of time and energy into lora training. I will probably pick 1440x1440 resolution for SDXL loras, and 1728x1728 or higher for Z-Image.

The resolution itself isn't what the model learns, though; it learns the relationships between pixels, which can be reproduced at ANY resolution. The key is that some pixel relationships (like in text, eyelids, fingernails) are often not represented in the training data with enough pixels either for the model to learn or for the VAE to reproduce. Even if the model learned the concept of a fishing net and generated a perfect fishing net in latent space, the VAE would still destroy that fishing net before spitting it out.
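
You can test this without running any diffusion at all: just roundtrip an image through the VAE and compare. A minimal sketch, assuming diffusers and torchvision; the filenames are placeholders, and the image's width and height should be multiples of 8:

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

# to_tensor gives [0, 1]; the VAE expects inputs in [-1, 1]
img = to_tensor(load_image("fishing_net.png")).unsqueeze(0) * 2 - 1

with torch.no_grad():
    latents = vae.encode(img).latent_dist.sample()  # [1, 4, H/8, W/8]
    recon = vae.decode(latents).sample              # back to [1, 3, H, W]

to_pil_image((recon[0] / 2 + 0.5).clamp(0, 1)).save("roundtrip.png")
# Any mesh or text lost in roundtrip.png can never appear in a generated image either
```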

With all of that in mind, it's obvious why early models sucked at hands and why full-body shots had jumbled faces. The model was doing its best to draw those details in latent space, but the VAE simply discarded them upon decoding the image. And who gets blamed? Who but the star of the show, the model itself, which in retrospect did nothing wrong. This is also why closeup images express more detail than zoomed-out ones.

So why does the image need to be compressed at all? Because it would be way too computationally expensive to generate full-resolution images, so the job of the VAE is to compress the image into a more manageable size for the model to work with. For these models, the compression is always a factor of 8 in each spatial dimension, so from a LoRA training standpoint, if you want the model to learn any particular detail, that detail should still be clear when the training image is shrunk 8x, or else it will just get lost in the noise.
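
Some back-of-envelope numbers for a 1024x1024 image, plus a crude way to apply the 8x rule of thumb to your own training data (the filenames are placeholders):

```python
from PIL import Image

pixels = 1024 * 1024 * 3              # 3,145,728 raw RGB values
lat4  = (1024 // 8) ** 2 * 4          # 65,536  values -> ~48x compression (SDXL)
lat16 = (1024 // 8) ** 2 * 16         # 262,144 values -> ~12x compression (Flux/Z-Image)
print(pixels / lat4, pixels / lat16)  # 48.0 12.0

# Shrink a training image 8x per side and eyeball it: if text or netting is
# already unreadable here, the VAE will likely lose it too
img = Image.open("training_sample.png")
w, h = img.size
img.resize((w // 8, h // 8), Image.LANCZOS).save("8x_check.png")
```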

The more channels, the less information is destroyed.
