r/StableDiffusion – Telegram
Another ZIT praise post!

Z Image is incredible! This is what SD3 should have been! I'm still blown away by it, even after back-to-back weeks of working with WAN, QWEN, and even good ol' SDXL. Here's a still of a cat from an imaginary film, showing his worries. All made with no LoRAs or editing.

https://preview.redd.it/e9wtcthdfx4g1.png?width=1920&format=png&auto=webp&s=06bb40cc734301f135b6c6f29f1795b29addefaa



https://redd.it/1pcw2nr
@rStableDiffusion
The Secrets of Realism, Consistency and Variety with Z Image Turbo

I had been struggling to get realistic photographs of regular people with Z Image Turbo. I was used to SDXL and FLUX, which only required a few keywords (‘average’, ‘film photo aesthetic’), and I was quite disappointed when at first I kept getting plastic, lookalike, model-looking people for every prompt. I tried all the usual keywords, and Z Image seemed to ignore most of them. Then, while enhancing prompts with ChatGPT, I noticed that its descriptions were much more verbose and described the subject’s features more accurately. The vocabulary was far more precise and matched actual photography terms (I’m a photog by trade). So, to get an actual snapshot of an average man or woman that looks like a real photograph, be precise with every detail of the subject, and in particular with the type of camera (and lens, film) used to shoot the photograph. The trick is that Z Image Turbo responds to different keywords than previous models to say the same thing. Here’s an example that works very well:

“Medium shot of a realistic ordinary middle-aged French man with an average, everyday appearance. He has a long face, piercing blue eyes, a three-days beard and messy mid-length light-brown hair. He is sitting on a bar stool in an ordinary restaurant, drinking a glass of red wine, Paris, France. Shot with a point-and-shoot film camera.”

Notice the ‘realistic, ordinary, average, everyday appearance’ wording. That’s very verbose, but the model responds well. Using only ‘average’, as with SDXL, just confuses the model and produces even more uniformly plastic-looking people.

The key point is that, out of the box, Z Image Turbo pumps out perfect digital images of beautiful-looking people (if you do not specify the camera type or precise subject features). As soon as you add these details, it responds as expected, and better than any other model I’ve used. ‘Shot with a point-and-shoot camera’ is also extremely effective at guiding the model to give you actual snapshots. Many other camera phrases work well. ‘35mm film camera’ works as expected and provides beautiful, realistic fine grain and tones reminiscent of the film days.
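To make the structure of such a prompt explicit, here is a minimal Python sketch that assembles the example above from its parts. The SnapshotPrompt class and its field names are hypothetical, made up for illustration; they are not part of Z Image Turbo or ComfyUI. It only shows that the camera phrase can be swapped while the subject description stays fixed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SnapshotPrompt:
    """Hypothetical container for the parts of a detailed photo prompt."""
    framing: str   # e.g. "Medium shot"
    subject: str   # who is in the picture, with the 'ordinary' wording
    features: str  # precise physical features, as a full sentence
    setting: str   # where the subject is and what they are doing
    camera: str    # the camera phrase that steers the photographic look

    def render(self) -> str:
        # Join the parts into one prompt string, camera phrase last.
        return (
            f"{self.framing} of {self.subject}. "
            f"{self.features} {self.setting} "
            f"Shot with {self.camera}."
        )


example = SnapshotPrompt(
    framing="Medium shot",
    subject=("a realistic ordinary middle-aged French man "
             "with an average, everyday appearance"),
    features=("He has a long face, piercing blue eyes, a three-days beard "
              "and messy mid-length light-brown hair."),
    setting=("He is sitting on a bar stool in an ordinary restaurant, "
             "drinking a glass of red wine, Paris, France."),
    camera="a point-and-shoot film camera",
)

# Same person, different camera phrase.
film = replace(example, camera="a 35mm film camera")

print(example.render())
print(film.render())
```

Keeping the subject fields untouched and changing only the camera field is one way to compare how each camera phrase affects the look.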

On the other hand, if you’re looking for consistency, Z Image Turbo delivers as well, so long as the prompt is precise enough about the features of the person you want to keep across your shots. ComfyUI’s multi-prompt capabilities provide all the variety you’d ever want, using the simple curly-brace syntax ({option A|option B}).
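For readers who have not used the brace syntax, here is a rough Python sketch of what the expansion does: each {a|b|c} group is replaced by one randomly chosen option. This is not ComfyUI’s own code, only an illustration of the behavior, and the expand_prompt function is a hypothetical name. The short template is a shortened stand-in, not a prompt from this post.

```python
import random
import re

# Matches one innermost {option|option|...} group (no braces inside it).
_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_prompt(template: str, rng: random.Random) -> str:
    """Replace every {a|b|c} group with one randomly chosen option."""

    def pick(match: re.Match) -> str:
        options = match.group(1).split("|")
        return rng.choice(options)

    # Repeat until nothing changes, so nested groups resolve inside-out.
    previous = None
    while previous != template:
        previous = template
        template = _GROUP.sub(pick, template)
    return template


template = (
    "Portrait of a woman in a park, shot with "
    "{a 35mm analog film camera with visible grain|"
    "a Polaroid instant camera with creamy tones}."
)

rng = random.Random(0)  # fixed seed only so runs are repeatable
for _ in range(3):
    print(expand_prompt(template, rng))
```

Everything outside the braces is identical in every expansion, which is why the fixed subject description keeps the person consistent while the options vary the shot.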

Keep in mind that Z Image Turbo’s text encoder has a limit of 512 tokens, meaning long prompts over 300 words might get truncated, which makes multi-prompts a must for both breadth and detail in each iteration. You can use an LLM to shorten your long prompts.
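A quick way to catch overlong prompts is a word-count check against the 300-word rule of thumb above. The sketch below is only a proxy: the real limit is in tokens, and the number of tokens per word depends on the tokenizer, which is not used here. The function names are hypothetical. With multi-prompts, it presumably makes sense to check an expanded prompt rather than the template, since the template contains every option.

```python
def word_count(prompt: str) -> int:
    """Count whitespace-separated words in the prompt."""
    return len(prompt.split())


def may_be_truncated(prompt: str, word_limit: int = 300) -> bool:
    """Return True if the prompt exceeds the rough word limit.

    word_limit defaults to the 300-word rule of thumb; it is an
    approximation of the 512-token limit, not an exact conversion.
    """
    return word_count(prompt) > word_limit


short_prompt = "Medium shot of an ordinary man. Shot with a 35mm film camera."
long_prompt = "detail " * 400  # artificial 400-word prompt for the demo

print(word_count(short_prompt), may_be_truncated(short_prompt))
print(word_count(long_prompt), may_be_truncated(long_prompt))
```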

Putting it all together: for example, I wanted to reproduce realistic photographs of that very cute Korean actress from ‘I am Not a Robot’ on Netflix, and simulate a full photoshoot in a variety of settings, poses and moods. To accomplish this, I divided the prompt into sections and used curly braces and pipes ({ } and |) to increase the variety of each shot. Here’s the final result:



“Full-body portrait of a petite adult Korean woman with a delicate frame, long sleek straight dark brown hair falling past her chest with full blunt bangs, smooth fair skin, and soft refined facial features, photographed in a park with natural surroundings, shot with {a 35mm analog film camera with visible grain|an iPhone snapshot with handheld imperfections|a Polaroid instant camera with creamy tones|a professional 35mm film SLR with crisp lenses|a compact point-and-shoot film camera with gentle flash falloff|a digital point-and-shoot camera with clean edges}.

Camera angle is {eye-level for a natural look|slightly low angle giving subtle height|slightly high angle for gentle emphasis on her face|three-quarter angle revealing depth in the background|parallel frontal angle for documentary realism|angled from the side capturing mid-motion|classic portrait framing with centered geometry}.

Time of day is {golden hour with warm soft