r/StableDiffusion – Telegram
GLM-Image explained: why autoregressive + diffusion actually matters

Seeing some confusion about what makes GLM-Image different, so let me break it down.

How diffusion models work (Flux, SD, etc.):

You start with pure noise. The model looks at ALL pixels simultaneously and goes "this should be a little less noisy." Repeat 20-50 times until you have an image.

The entire image evolves together in parallel. There's no concept of "first this, then that."
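The parallel-denoising loop can be sketched in a few lines. This is a toy stand-in, not the real sampler: `denoise_step` just nudges every value toward a fake "clean" target, but the shape of the loop is the point — every pixel updates at once, no pixel waits on any other.

```python
import random

def denoise_step(image, step, total_steps):
    # Toy stand-in for the real model: nudge ALL pixels toward a
    # pretend "clean" value simultaneously. No ordering, no sequence.
    target = 0.5
    return [px + (target - px) / (total_steps - step) for px in image]

# Start from pure noise
image = [random.random() for _ in range(8)]

steps = 20  # real samplers run 20-50 of these
for step in range(steps):
    image = denoise_step(image, step, steps)
# After the loop, every pixel has converged to the "clean" target.
```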

How autoregressive works:

Generate one piece at a time. Each new piece looks at everything before it to decide what comes next.

This is how LLMs write text:

"The cat sat on the "
→ probably "mat"
"The cat sat on the mat and
"
→ probably "purred"

Each word is chosen based on all previous words.
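Same idea in code. A real LLM learns the next-token distribution; this toy version hard-codes it as a lookup keyed on the full context, which is enough to show the sequential loop:

```python
# Toy next-token "model": what usually comes next, given EVERYTHING so far.
# Real LLMs learn this distribution over billions of tokens.
continuations = {
    ("The", "cat", "sat", "on", "the"): "mat",
    ("The", "cat", "sat", "on", "the", "mat", "and"): "purred",
}

def next_token(context):
    # The whole context is the key -- each choice depends on all prior tokens.
    return continuations.get(tuple(context), "<unk>")

tokens = ["The", "cat", "sat", "on", "the"]
tokens.append(next_token(tokens))   # -> "mat"
tokens.append("and")
tokens.append(next_token(tokens))   # -> "purred"
```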

GLM-Image does BOTH:

1. Autoregressive stage: A 9B LLM (literally initialized from GLM-4) generates ~256-4096 semantic tokens. These tokens encode MEANING and LAYOUT, not pixels.

2. Diffusion stage: A 7B diffusion model takes those semantic tokens and renders actual pixels.

Think of it like: the LLM writes a detailed blueprint, then diffusion builds the house.
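The blueprint-then-house pipeline, as a sketch. Everything here is a made-up stand-in (the token format, the fake conditioning signal, the sizes) — it only shows how the two stages connect: stage 1 emits semantic tokens one at a time, stage 2 denoises all pixels in parallel conditioned on them.

```python
import random

def ar_stage(prompt):
    # Stand-in for the 9B LLM: emit one "semantic token" per prompt word,
    # sequentially -- each new token is appended after everything before it.
    tokens = []
    for word in prompt.split():
        tokens.append(f"<sem:{word.lower()}>")
    return tokens

def diffusion_stage(semantic_tokens, size=16, steps=20):
    # Stand-in for the 7B diffusion model: start from noise and denoise
    # all pixels in parallel, conditioned on the semantic tokens.
    image = [random.random() for _ in range(size)]
    target = (len(semantic_tokens) % 10) / 10  # fake conditioning signal
    for step in range(steps):
        image = [px + (target - px) / (steps - step) for px in image]
    return image

def generate_image(prompt):
    blueprint = ar_stage(prompt)        # the LLM writes the blueprint...
    return diffusion_stage(blueprint)   # ...then diffusion builds the house
```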


Why this matters

Prompt: "A coffee shop chalkboard menu: Espresso $3.50, Latte $4.25, Cappuccino $4.75"

Diffusion approach:
- Text encoder compresses your prompt into embeddings
- Model tries to match those embeddings while denoising
- No sequential reasoning happens
- Result: "Esperrso $3.85, Latle $4.5?2" - garbled nonsense

Autoregressive approach:
- LLM actually PARSES the prompt: "ok, three items, three prices, menu format"
- Generates tokens sequentially: menu layout → first item "Espresso" → price "$3.50" → second item...
- Each token sees full context of what came before
- Result: readable text in correct positions
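You can see why parsing first helps with a toy version of that step. The regex and token names below are invented for illustration; the point is that once the prompt is parsed into structured (item, price) pairs, emitting them in order is trivial — garbling "Espresso" into "Esperrso" would require actively breaking the sequence.

```python
import re

def parse_menu_prompt(prompt):
    # Toy stand-in for the LLM's parsing step: pull out (item, price)
    # pairs so each can be emitted as its own token span, in order.
    return re.findall(r"(\w+) \$(\d+\.\d{2})", prompt)

prompt = "A coffee shop chalkboard menu: Espresso $3.50, Latte $4.25, Cappuccino $4.75"
items = parse_menu_prompt(prompt)
# items == [('Espresso', '3.50'), ('Latte', '4.25'), ('Cappuccino', '4.75')]

tokens = ["<menu>"]
for name, price in items:   # sequential: each item lands after the previous one
    tokens += [f"<item:{name}>", f"<price:${price}>"]
```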

This is why GLM-Image hits 91% text accuracy while Flux sits around 50%.


Another example - knowledge-dense images:

Prompt: "An infographic showing the water cycle with labeled stages: evaporation, condensation, precipitation, collection"

Diffusion models struggle here because they're not actually REASONING about what an infographic should contain. They're pattern matching against training data.

Autoregressive models can leverage actual language understanding. The same architecture that knows "precipitation comes after condensation" can encode that into the image tokens.

The tradeoff:

Autoregressive is slower (sequential generation vs parallel) and the model is bigger (16B total). For pure aesthetic/vibes generation where text doesn't matter, Flux is still probably better.

But for anything where the image needs to convey actual information accurately - text, diagrams, charts, signage, documents - this architecture has a real advantage.

Will report back in a few hours with some test images.

https://redd.it/1qcegzd
@rStableDiffusion
Starting to play with LTX-2 ic-lora with pose control. Made a Pwnisher style video

https://redd.it/1qciya2
Billions of parameters just to give me 7 fingers.
https://redd.it/1qckzc1
The Dragon (VHS Style): Z-Image Turbo - Wan 2.2 FLFTV - Qwen Image Edit 2511 - RTX 2060 Super 8GB VRAM

https://redd.it/1qcosvm
Soprano 1.1-80M released: 95% fewer hallucinations and 63% preference rate over Soprano-80M

https://redd.it/1qcuuet
LTX-2 I2V synced to an MP3: Distill Lora Quality STR 1 vs .6 - New Workflow Version 2.

https://redd.it/1qd525f
I built a real-time 360 volumetric environment generator running entirely locally. Uses SD.cpp, Depth Anything V2, and LaMa, all within Unity Engine.

https://redd.it/1qde674