r/StableDiffusion – Telegram
Black Forest Labs listened to the community... Flux 3!
https://redd.it/1padev5
@rStableDiffusion
To the Flux devs: don't feel bad, and thank you for everything up to today
https://redd.it/1pads1a
@rStableDiffusion
Can we please talk about the actual groundbreaking part of Z-Image instead of just spamming?

**TL;DR**: Z-Image didn’t just release another SOTA model, they dropped an amazing training methodology for the entire open-source diffusion community. Let’s nerd out about that for a minute instead of just flexing our Z-images.

-----
I swear I love this sub and it’s usually my go-to place for real news and discussion about new models, but ever since Z-Image (ZIT) dropped, my feed is 90% “look at this Z-Image-generated waifu” and “look at my prompt engineering and ComfyUI skills” posts. Yes, the images are great. Yes, I’m also guilty of generating spicy stuff for fun (I post those on r/unstable_diffusion like a civilized degenerate), but man… I now have to scroll for five minutes to find a single post that isn’t a ZIT gallery.

So this is my ask: can we start talking about the part that actually matters long-term?

Like, what do you guys think about the paper? Because what they did with the training pipeline is revolutionary. They basically handed the open-source community a complete blueprint for training SOTA diffusion models. D-DMD + DMDR + RLHF, a set of techniques that dramatically cuts the cost and time needed to get frontier-level performance.

We’re talking about a path to:

* Actually decent open-source models that don’t require a hyperscaler budget
* The realistic possibility of seeing things like a properly distilled Flux 2, or even a “pico-banana Pro”.
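Since the whole point of this post is the methodology, here is a toy PyTorch sketch of plain DMD (distribution matching distillation), which, as I read the paper, is the foundation the D-DMD/DMDR stages build on. This is my own simplification, not the paper's code or any library API; `real_unet`, `fake_unet`, and `scheduler` are hypothetical stand-ins, and I'm assuming an x0-prediction convention:

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(student_x0, real_unet, fake_unet, scheduler, t):
    """Toy distribution-matching loss for a few-step student generator.

    student_x0: images produced by the few-step student (grad flows into the student)
    real_unet:  frozen teacher, models the real data distribution
    fake_unet:  auxiliary net continually trained on student outputs,
                models the student's *current* output distribution
    (All three are hypothetical stand-ins, not real library objects.)
    """
    noise = torch.randn_like(student_x0)
    noisy = scheduler.add_noise(student_x0, noise, t)  # diffuse the student's output to timestep t

    with torch.no_grad():
        x0_real = real_unet(noisy, t)  # teacher's denoised estimate
        x0_fake = fake_unet(noisy, t)  # fake-score model's denoised estimate

    # Where the two estimates disagree is the direction that pulls the student's
    # distribution toward the teacher's. The detached-target MSE trick makes the
    # backward pass apply exactly that direction to the student, and nothing else.
    grad = x0_fake - x0_real
    target = (student_x0 - grad).detach()
    return 0.5 * F.mse_loss(student_x0, target)
```

The real recipe also keeps training `fake_unet` on the student's outputs and layers weighting, regression, and reward terms on top; this is just the skeleton of the distribution-matching step.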

And on top of that, RL on diffusion (like what happened with Flux SRPO) is probably the next big thing. Imagine the day when someone releases open-source RL actors/checkpoints that can just… fix your fine-tune automatically. No more iterating with LoRAs, drop your dataset, let the RL agent cook overnight, wake up to a perfect model.

That’s the conversation I want to have here. Not the 50th “ZIT is scary good at hands!!!” post (we get it).

And... WTF, they spent >600k training this model and they call it budget friendly, LOL. Just imagine how many GPU hours nano banana or Flux needed.

https://preview.redd.it/iw48wciqac4g1.png?width=1144&format=png&auto=webp&s=1513fb3887d0b2e50a201487c678c51173014d72

==================

Edit: I just came across r/ZImageAI and it seems like a great dedicated spot for Z-Image generations.

https://redd.it/1pabhxl
@rStableDiffusion
My 4 stage upscale workflow to squeeze every drop from Z-Image Turbo

Workflow: [https://pastebin.com/b0FDBTGn](https://pastebin.com/b0FDBTGn)

ChatGPT Custom Instructions: [https://pastebin.com/qmeTgwt9](https://pastebin.com/qmeTgwt9)

I made [this](https://www.reddit.com/r/StableDiffusion/comments/1p80j9x/comment/nr1jak5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) comment on a separate thread a couple of days ago and noticed that some of you were interested in more details.

What I basically did is this (and before I continue, I must admit this is not my idea; I have been doing this since SD 1.5 and I don't remember where I borrowed the original idea from):

* Generate at a very low resolution, small enough that the model only draws an outline, then do a massive latent upscale with 0.7 denoise
* This adds a ton of detail, a sharper image, and the best quality (almost close to *I can jerk off to my own generated image* level)
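To make that two-pass idea concrete, here is a rough diffusers-flavored sketch (not my actual ComfyUI workflow, which is in the pastebin). I'm approximating the latent upscale with a plain pixel resize plus img2img strength because that maps cleanly onto diffusers calls, and the model path, resolutions, and step count below are placeholders:

```python
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

# "path/to/your/model" is a placeholder; use whatever checkpoint you run,
# assuming it loads in diffusers.
txt2img = AutoPipelineForText2Image.from_pretrained(
    "path/to/your/model", torch_dtype=torch.float16).to("cuda")
img2img = AutoPipelineForImage2Image.from_pipe(txt2img)

prompt = "a lighthouse on a rocky cliff at sunset, dramatic clouds"

# Pass 1: tiny generation -- just enough pixels for the model to commit to an outline.
low = txt2img(prompt, width=512, height=640, num_inference_steps=8).images[0]

# Pass 2: massive upscale, then re-denoise at 0.7 so the model redraws fine detail
# on top of the locked-in composition (in ComfyUI this is a latent upscale + KSampler
# at 0.7 denoise; strength plays the same role here).
up = low.resize((1024, 1280))
final = img2img(prompt, image=up, strength=0.7, num_inference_steps=8).images[0]
final.save("two_pass.png")
```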

I already shared that workflow with others in that same thread. I was reading through the comments and ideas that others shared here and decided to double down on this approach.

New and improved workflow:

* The one I am posting here is a 4-stage workflow. It starts by generating an image at 64x80 resolution
* Stage 1: The magic starts. We use a very low shift value here to give the model some breathing room and let it be creative; we don't want it to follow our prompt strictly yet
* Stage 2: A high shift value so it follows our prompt and draws the composition. This is where it gets interesting: what you see here is what your final image (from Stage 4) will look like, or at least a 90% resemblance. So you can stop here if you don't like the composition; it barely takes a couple of seconds
* Stage 3: If you are satisfied with the composition, run stage 3. This is where we add details. We use a low shift value again to give the model some breathing room. The composition will not change much because the denoise value is lower
* Stage 4: If you are happy with where the model is heading in terms of composition, lighting, etc., run this stage and get the final image. Here we use a shift value of 7 (a rough sketch of the whole chain is below, after the CFG note)

What about CFG?

* Stages 1 to 3 use CFG > 1. I also included an *ahem, very large* negative prompt in my workflow. It works for me and it does make a difference
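If you want to see the whole 4-stage chain outside ComfyUI, here is a rough diffusers-flavored sketch of the idea. The only numbers taken from this post are the 64x80 start, the 1456x1840 finish, shift 7 in stage 4, and CFG > 1 for stages 1 to 3; every other shift, CFG, denoise, and size value below is a made-up placeholder (the real ones are in the pastebin workflow), and the shift knob maps to the `shift` parameter of diffusers' FlowMatchEulerDiscreteScheduler:

```python
import torch
from diffusers import (AutoPipelineForText2Image, AutoPipelineForImage2Image,
                       FlowMatchEulerDiscreteScheduler)

txt2img = AutoPipelineForText2Image.from_pretrained(
    "path/to/your/model", torch_dtype=torch.float16).to("cuda")  # placeholder model path
img2img = AutoPipelineForImage2Image.from_pipe(txt2img)

prompt, negative = "your prompt here", "your (very large) negative prompt here"

def set_shift(pipe, shift):
    # Swap in a flow-matching scheduler with the requested shift value.
    pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
        pipe.scheduler.config, shift=shift)

def stage(image, *, shift, cfg, denoise, size):
    # One refinement stage: resize the previous result, then re-denoise it.
    set_shift(img2img, shift)
    return img2img(prompt, negative_prompt=negative, image=image.resize(size),
                   strength=denoise, guidance_scale=cfg,
                   num_inference_steps=8).images[0]

# Stage 1: 64x80, very low shift, CFG > 1 -- creative outline only.
set_shift(txt2img, 1.0)  # "very low" shift (placeholder value)
img = txt2img(prompt, negative_prompt=negative, width=64, height=80,
              guidance_scale=3.0, num_inference_steps=8).images[0]

# Stage 2: high shift, CFG > 1 -- locks the composition; bail out here if you hate it.
img = stage(img, shift=6.0, cfg=3.0, denoise=0.7, size=(364, 460))

# Stage 3: low shift, CFG > 1, lower denoise -- adds detail without moving the composition.
img = stage(img, shift=2.0, cfg=3.0, denoise=0.5, size=(728, 920))

# Stage 4: shift 7 -- final 1456x1840 render.
img = stage(img, shift=7.0, cfg=1.0, denoise=0.4, size=(1456, 1840))
img.save("four_stage.png")
```

The point is just the schedule: loose shift first, high shift to lock composition, then low-denoise refinement passes, ending at shift 7.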

Is it slow?

* Nope. The whole process (stages 1 to 4) still finishes in 1 minute, or at most 1 min 10 seconds (on my 4060 Ti), and you are greeted with a 1456x1840 image. You will not lose speed, and you have the flexibility to bail out early if you don't like the composition

Seed variety?

* You get good seed variety with this workflow because in stage 1 you are forcing the model to generate something random while still following your prompt. It will not generate the same 64x80 image every time, and combined with the low denoise values in each later stage, you get good variation

Important things to remember:

* Please do not use shift 7 for everything. You will kill the model's creativity and get the same boring image on every single seed. Let it breathe. Experiment with different values
* The 2nd pastebin link has the ChatGPT instructions I use to get prompts (use GPT-4o; GPT-5 refuses to name the subjects, at least in my case).
* You can use them if you like. The important thing is (whether you use them or not), the first few keywords in your prompt should describe the overall scene briefly. Why? Because we are generating at a very low resolution, so we want the model to draw an outline first. If you describe it like *"oh there is a tree, it's green, the climate is cool, bla bla bla, there is a man"*, the low-res generation will give you a tree haha

If you have issues working with this workflow, just comment and I will assist. Feedback is welcome. Enjoy

https://redd.it/1paegb2
@rStableDiffusion