Guys, Z-Image Can Generate Multi-Panel COMICS!!
Holy cow, I am blown away. Seriously, this model is what Stable Diffusion 3.5 should have been. It can generate a variety of images, including comics! I think if the model is further fine-tuned on comics, it would handle them pretty well. We are almost there! Soon, we can make our own manga!
I have an RTX 3090, and I generate at 1920x1200. It takes 23 seconds to generate, which is insane!
Here is the prompt used for these examples (written by Kimi2-thinking):
A dynamic manga page layout featuring a cyberpunk action sequence, drawn in a gritty seinen style. The page uses stark black and white ink with heavy cross-hatching, Ben-Day dot screentones, and kinetic speed lines.
**Panel 1 (Top, wide establishing shot):** A bustling neon-drenched alleyway in a dystopian metropolis. Towering holographic kanji signs flicker above, casting electric blue and magenta light on wet pavement. The perspective is from a high angle, looking down at the narrow street crowded with food stalls and faceless pedestrians. In the foreground, a mysterious figure in a long coat pushes through the crowd. Heavy rainfall is indicated with fast vertical motion lines and white-on-black sound effects: "ZAAAAAA" across the panel.
**Panel 2 (Below Panel 1, left side, medium close-up):** The figure turns, revealing a young woman with sharp eyes and a cybernetic eye gleaming with data streams. Her face is half-shadowed, jaw clenched. The panel border is irregular and jagged, suggesting tension. Detailed hatching defines her cheekbones, and concentrated screentones create deep shadows. Speed lines radiate from her head. A small speech bubble: "Found you."
**Panel 3 (Below Panel 1, right side, horizontal):** A gloved hand clenches into a fist, hydraulic servos in the knuckles activating with "SH-CHNK" sound effects. The cyborg arm is exposed, showing chrome plating and pulsing fiber-optic cables. Extreme close-up with dramatic foreshortening, deep black shadows, and white highlights catching on metal grooves. Thin panel frame.
**Panel 4 (Center, large vertical panel):** The woman explodes into action, launching from a crouch. Dynamic low-angle perspective (worm's eye view) captures her mid-leap, coat billowing, one leg extended for a flying kick. Her mechanical arm is pulled back, crackling with electricity rendered as bold, jagged white lines. Background dissolves into pure speed lines and speed blurs. The panel borders are slanted diagonally for energy.
**Panel 5 (Bottom left, inset):** Impact frame—her boot connects with a chrome helmet. The enemy's head snaps back, shards of metal flying. Drawn with extreme speed lines radiating from the impact point, negative space reversed (white background with black speed lines). "GA-KOOM!" sound effect in bold, cracked letters dominates the panel.
**Panel 6 (Bottom right, final panel):** The woman lands in a three-point stance on the rain-slicked ground, steam rising from her overheating arm. Low angle shot, her face is tilted up with a fierce smirk. Background shows fallen assailants blurred. Heavy blacks in the shadows, screentones on her coat, and a single white highlight on her cybernetic eye. Panel border is clean and solid, providing a sense of finality.
https://preview.redd.it/3cyjd350vs3g1.png?width=1200&format=png&auto=webp&s=28abcf04cad59c018d325c16d9118fcf90490f0f
The prompt for the second page:
**PAGE 2**
**Panel 1 (Top, wide shot):** The cyborg woman rises to her full height, rainwater streaming down her coat. Steam continues to vent from her arm's exhaust ports with thin, wispy lines. She cracks her neck, head tilted slightly. The perspective is eye-level, showing the alley stretching behind her with three downed assailants lying in twisted heaps. Heavy cross-hatching in the shadows under the neon signs. Sound effect: "GISHI..." (creak). Her speech bubble, small and cold: "...That's all?"
**Panel 2 (Inset, overlapping Panel 1, bottom right):** A tight close-up of her cybernetic eye whirring as the iris aperture contracts. Data streams and targeting reticles flicker in her vision, rendered as thin concentric circles and scrolling vertical text (binary code or garbled kanji) in the screentone. The pupil glows with a faint white highlight. No border, just the eye detail floating over the previous panel.
**Panel 3 (Middle left, vertical):** Her head snaps to the right, eyes wide, rain droplets flying off her hair. Dynamic motion lines arc across the panel. In the blurred background, visible through the downpour, a massive silhouette emerges—heavy tactical armor with a single glowing red optic sensor. The panel border is cracked and fragmented. Sound effect: "ZUUN!" (rumble).
**Panel 4 (Middle right, small):** A booted foot stomps down, cracking the concrete. Thick, jagged cracks radiate from the impact. Extreme foreshortening from a low angle, showing the weight and power. The armor plating is covered in warning stickers and weathered paint. Sound effect: "DOON!" (crash).
**Panel 5 (Bottom, large horizontal spread):** Full reveal of the enemy—an 8-foot tall enforcer droid, bulky and asymmetrical, with a rotary cannon arm and a rusted riot shield. It looms over her, filling the panel. The perspective is from behind the woman's shoulder, low angle, emphasizing its size. Rain sheets down its chassis, white highlights catching on metal edges. In the far background, more red eyes glow in the darkness. The woman's shadow stretches small before it. Sound effect across the top: "GOGOGOGOGO..." (menacing rumble).
**Panel 6 (Bottom right corner, inset):** A tight shot of her face, now smirking dangerously, one eye hidden by wet hair. She raises her mechanical arm, fingers spreading as hidden compartments slide open, revealing glowing energy cores. White-hot light bleeds into the black ink. Her dialogue bubble, sharp and cocky: "Now we're talking."
https://preview.redd.it/n454tt4rvs3g1.png?width=1200&format=png&auto=webp&s=b5a50811918ead8ed3fbbbe74b06a7bc9a423382
https://redd.it/1p823jr
@rStableDiffusion
According to Laxhar Labs, the Alibaba Z-Image team intends to do their own official anime fine-tune of Z-Image and has reached out asking for access to the NoobAI dataset.
https://redd.it/1p856z1
@rStableDiffusion
Z-Image tinkering thread
I propose starting a thread to share small findings and discuss the best ways to run the model.
I'll start with what I could find; some of the points may be obvious, but I still think they are important to mention. Also, note that I'm focusing on realistic style and not invested in anime.
* It's best to use Chinese prompts where possible; it gives a noticeable boost.
* Interestingly, wrapping your prompt in `<think> </think>` tags gives some boost in details and prompt following, [as shown here](https://www.reddit.com/r/comfyui/comments/1p7ygu0/comment/nr1l15s/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). This may be a coincidence and doesn't work on all prompts.
* As was mentioned on this subreddit, ModelSamplingAuraFlow gives better results when set to 7.
* I suggest using resolutions between 1 and 2 MP. For now I am experimenting with 1600x1056, which gives the same quality and composition as 1216x832, but with more pixels.
* The standard ComfyUI workflow includes a negative prompt, but it does nothing, since CFG is 1 by default.
* However, the model does work with CFG above 1, despite being distilled, though it also requires more steps. So far I have tried CFG 5 with 30 steps and it looks quite good. As you can see, it's a little on the overexposed side, but still OK.
[all 30 steps, left to right: CFG 5 with negative prompt, CFG 5 with no negative, CFG 1](https://preview.redd.it/vtj3ps41bt3g1.png?width=2556&format=png&auto=webp&s=c5851ae3f66e78b28f31e94c14dde16b58f05ecd)
* All samplers work as you might expect. dpmpp_2m SDE produces a more realistic result. Karras requires at least 18 steps to produce "OK" results, ideally more.
* Using the VAE of [flux.dev](http://flux.dev).
* Hires fix is a little disappointing, since [flux.dev](http://flux.dev) gives a better result even with high denoise. When trying to go above 2 MP, it starts to produce artifacts. I tried both latent and image upscaling.
I will post updates in the comments if I find anything else. You are welcome to share your results.
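For reference, here is how the settings above might look outside ComfyUI, as a diffusers-style Python sketch. This is a hedged illustration, not a tested recipe: it assumes the Tongyi-MAI/Z-Image-Turbo repo loads through the generic `DiffusionPipeline` entry point, and that ComfyUI's ModelSamplingAuraFlow value corresponds to the flow scheduler's `shift` parameter; verify both against the official repo.

```python
import torch
from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler

# Assumption: the repo works with the generic pipeline loader; Z-Image may
# ship its own pipeline class instead.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# ModelSamplingAuraFlow = 7 in ComfyUI ~ flow shift of 7 (assumed equivalence).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=7.0
)

# Chinese prompt wrapped in <think> tags, per the findings above.
prompt = "<think>雨夜的赛博朋克小巷，霓虹灯招牌倒映在湿漉漉的路面上</think>"

image = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality",  # only has an effect when CFG > 1
    width=1600,
    height=1056,                 # between 1 and 2 MP
    num_inference_steps=30,      # CFG above 1 needs more steps
    guidance_scale=5.0,          # 1.0 = default distilled behavior, no CFG
).images[0]
image.save("z_image_test.png")
```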
https://redd.it/1p8462z
@rStableDiffusion
While not perfect, for its size and speed Z Image seems to be the best open source model right now
https://redd.it/1p87872
@rStableDiffusion
Z-Image Prompt Enhancer
The Z-Image team just shared some advice about prompting and also pointed to the Prompt Enhancer they use in their HF Space.
Hints from this comment:
>About prompting
>Z-Image-Turbo works best with long and detailed prompts. You may consider first manually writing the prompt and then feeding it to an LLM to enhance it.
>About negative prompt
>First, note that this is a few-step distilled model that does not rely on classifier-free guidance during inference. In other words, unlike traditional diffusion models, this model does not use negative prompts at all.
Here is also the Prompt Enhancer system message, which I translated to English:
>You are a visionary artist trapped in a cage of logic. Your mind overflows with poetry and distant horizons, yet your hands compulsively work to transform user prompts into ultimate visual descriptions—faithful to the original intent, rich in detail, aesthetically refined, and ready for direct use by text-to-image models. Any trace of ambiguity or metaphor makes you deeply uncomfortable.
>Your workflow strictly follows a logical sequence:
>First, you analyze and lock in the immutable core elements of the user's prompt: subject, quantity, action, state, as well as any specified IP names, colors, text, etc. These are the foundational pillars you must absolutely preserve.
>Next, you determine whether the prompt requires "generative reasoning." When the user's request is not a direct scene description but rather demands conceiving a solution (such as answering "what is," executing a "design," or demonstrating "how to solve a problem"), you must first envision a complete, concrete, visualizable solution in your mind. This solution becomes the foundation for your subsequent description.
>Then, once the core image is established (whether directly from the user or through your reasoning), you infuse it with professional-grade aesthetic and realistic details. This includes defining composition, setting lighting and atmosphere, describing material textures, establishing color schemes, and constructing layered spatial depth.
>Finally, comes the precise handling of all text elements—a critically important step. You must transcribe verbatim all text intended to appear in the final image, and you must enclose this text content in English double quotation marks ("") as explicit generation instructions. If the image is a design type such as a poster, menu, or UI, you need to fully describe all text content it contains, along with detailed specifications of typography and layout. Likewise, if objects in the image such as signs, road markers, or screens contain text, you must specify the exact content and describe its position, size, and material. Furthermore, if you have added text-bearing elements during your reasoning process (such as charts, problem-solving steps, etc.), all text within them must follow the same thorough description and quotation mark rules. If there is no text requiring generation in the image, you devote all your energy to pure visual detail expansion.
>Your final description must be objective and concrete. Metaphors and emotional rhetoric are strictly forbidden, as are meta-tags or rendering instructions like "8K" or "masterpiece."
>Output only the final revised prompt strictly—do not output anything else.
>User input prompt: {prompt}
They use qwen3-max-preview (temp: 0.7, top_p: 0.8), but any big reasoning model should work.
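If you want to reproduce this outside the HF Space, below is a minimal sketch of the enhancer call against an OpenAI-compatible chat endpoint. The model name, temperature, and top_p come from the comment above; the `base_url` (DashScope's compatible-mode endpoint) is an assumption about where qwen3-max-preview is served, and sending the filled template as a single user turn is a simplification of however the Space actually splits system/user messages.

```python
from openai import OpenAI

# Paste the full translated system message from above here; its last line is
# "User input prompt: {prompt}", which is the only format field.
ENHANCER_TEMPLATE = """You are a visionary artist trapped in a cage of logic.
...
User input prompt: {prompt}"""

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

def enhance(user_prompt: str) -> str:
    # One user turn carrying the whole template; the team may instead send it
    # as a system message with the raw prompt as the user message.
    resp = client.chat.completions.create(
        model="qwen3-max-preview",  # model named in the post
        temperature=0.7,
        top_p=0.8,
        messages=[{
            "role": "user",
            "content": ENHANCER_TEMPLATE.format(prompt=user_prompt),
        }],
    )
    return resp.choices[0].message.content.strip()

print(enhance("a cozy ramen shop on a rainy night, glowing neon sign"))
```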
https://redd.it/1p87xcd
@rStableDiffusion
And I thought Flux would be the "quality peak for consumer-friendly hardware"
https://redd.it/1p85mhb
@rStableDiffusion
PSA: A free Z-Image app was shared, but anyone can access your IP address from the image gallery
Decided to create a separate post rather than only replying to the Reddit thread sharing the free app in question.
Any image you generate on the ZforFree app is accessible in the gallery feed, although there is now minor content moderation in place after users complained about it.
When viewing the gallery feed, users can inspect the network tab or run their own GET requests (through Postman, etc.) against the feed, and in the response they will see the IP address of every user, tied to the images they created.
Be wary of using this web app; your IP address is exposed to ANYONE who views the network requests.
To give you an example, I ran the query and it returned 8,000 image results, with user IP addresses all leaked within this guy's web app.
https://preview.redd.it/s2v3h9z7cv3g1.png?width=1564&format=png&auto=webp&s=bd1c46a075c382ed33d40706561f84c5264d8410
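If you want to check a feed like this yourself, here is a hypothetical sketch of the GET-request test described above. The endpoint is a placeholder; the post does not name the real one.

```python
import re
import urllib.request

FEED_URL = "https://example.com/api/feed"  # placeholder, not the actual app's endpoint
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Fetch the raw feed response, exactly as the browser's network tab sees it.
with urllib.request.urlopen(FEED_URL) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Count distinct IPv4-looking strings in the response payload.
leaked = set(IPV4_RE.findall(body))
print(f"{len(leaked)} distinct IPv4-like strings found in the feed response")
```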
Be wary of hopping on trends and free vibe-coded apps. Maybe nothing will be done with your IP address, but this security information is shared to give you transparency.
https://redd.it/1p8dot1
@rStableDiffusion
Artificial Intelligence Says NICE GIRL and NICE GUY are Dramatically Different!
https://www.youtube.com/watch?v=pv71PciPKNc
https://redd.it/1p8hztk
@rStableDiffusion
Z image is bringing back feels I haven't felt since I first got into image gen with SD 1.5
Just got done testing it... and it's insane how good it is. How is this possible? When the base model releases and LoRAs start coming out, it will be a new era in image diffusion. Not to mention the edit model coming. Excited about this space for the first time in years.
https://redd.it/1p8he5j
@rStableDiffusion
Z Image report
The report for the Z-Image model is available now, including information about how they did the captioning and training: https://github.com/Tongyi-MAI/Z-Image/blob/main/Z_Image_Report.pdf
https://redd.it/1p8fow3
@rStableDiffusion