r/StableDiffusion – Telegram
To be very clear: as good as it is, Z-Image is NOT multi-modal or auto-regressive. There is NO difference whatsoever in how it uses Qwen compared to how other models use T5 / Mistral / etc. It DOES NOT "think" about your prompt and it never will. It is a standard diffusion model in every way.

A lot of people seem extremely confused about this and appear convinced that Z-Image is something it isn't and never will be. The somewhat misleadingly worded blurbs (perhaps intentionally, perhaps not) on various parts of the Z-Image HuggingFace page are mostly to blame.

TL;DR: it loads Qwen the SAME way any other model loads any other text encoder. It's pure tensor processing, with absolutely none of the typical Qwen chat-format personality being "alive". This is also why, for example, it cannot refuse prompts that Qwen certainly would if you had it loaded in a conventional chat context in Ollama or LM Studio.
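A minimal sketch of the pattern the post describes, using a toy PyTorch encoder (this is NOT the actual Z-Image or Qwen code; all names and sizes here are made up for illustration). The key point: the LLM is run as a single frozen forward pass to get hidden states, so its chat template, sampling loop, and refusal behavior are simply never invoked.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM used purely as a frozen text encoder (Qwen, T5,
# Mistral, etc. all fill this role identically in diffusion pipelines):
# one forward pass, hidden states out, no generation, no chat formatting.

VOCAB, DIM, SEQ = 1000, 64, 8  # arbitrary toy sizes

class FrozenTextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    @torch.no_grad()  # frozen: never trained, never sampled from
    def forward(self, token_ids):
        # Returns per-token hidden states, shape (batch, seq, dim)
        return self.backbone(self.embed(token_ids))

encoder = FrozenTextEncoder().eval()
prompt_ids = torch.randint(0, VOCAB, (1, SEQ))  # stand-in for a tokenized prompt
cond = encoder(prompt_ids)                      # conditioning tensor for the diffusion model

# The diffusion backbone consumes `cond` via cross-attention; the LLM's
# autoregressive generation head is never called, so there is nothing
# that could "think about" or refuse the prompt.
print(cond.shape)  # torch.Size([1, 8, 64])
```

The same structure holds whether the encoder is a toy like this or a multi-billion-parameter Qwen checkpoint: conditioning is just hidden states.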

https://redd.it/1pm5vw0
@rStableDiffusion
It turns out that weight size matters quite a lot with Kandinsky 5

[video: fp8]

[video: bf16]

Sorry for the boring video. I initially set out to do some basics with CFG on the Pro 5s T2V model, and someone asked which quant I was using, so I did this comparison while I was at it. This is the same seed/settings; the only difference is fp8 vs bf16. I'm used to most models having small accuracy issues, but this is practically a whole different result, so I thought I'd pass it along here.

Workflow: https://pastebin.com/daZdYLAv

edit: Crap! I uploaded the wrong video for bf16, this is the proper one:

[video: proper bf16]

https://redd.it/1pm4y7t
@rStableDiffusion
REALISTIC - WHERE IS WALDO? USING FLUX (test)
https://redd.it/1pm95c1
@rStableDiffusion