LoRA Training for Z Image Turbo on 12GB VRAM
Shoutout to Ostris for getting Z Image supported for lora training so quickly.
[https://github.com/ostris/ai-toolkit](https://github.com/ostris/ai-toolkit)
[https://huggingface.co/ostris/zimage\_turbo\_training\_adapter](https://huggingface.co/ostris/zimage_turbo_training_adapter)
Wanted to share that it looks like you'll be able to train this on GPUs with 12GB of VRAM. I'm currently running it on his RunPod template:
[https://console.runpod.io/hub/template/ai-toolkit-ostris-ui-official?id=0fqzfjy6f3](https://console.runpod.io/hub/template/ai-toolkit-ostris-ui-official?id=0fqzfjy6f3)
`MODEL OPTIONS`
* `Low VRAM: ON`
* `LAYER OFFLOADING: OFF`
`QUANTIZATION`
* `Transformer: float8 (default)`
* `Text Encoder: float8 (default)`
`TARGET`
* `Target Type: LoRA`
* `Linear Rank: 32`
`SAVE`
* `Data Type: BF16`
* `Save Every: 500`
* `Max Step Saves to Keep: 4`
`TRAINING`
* `Batch Size: 1`
* `Gradient Accumulation: 1`
* `Steps: 3000`
* `Optimizer: AdamW8Bit`
* `Learning Rate: 0.0001`
* `Weight Decay: 0.0001`
* `Timestep Type: Sigmoid`
* `Timestep Bias: Balanced`
* `EMA (Exponential Moving Average):`
  * `Use EMA: OFF`
* `Text Encoder Optimizations:`
  * `Unload TE: OFF`
  * `Cache Text Embeddings: ON`
* `Regularization:`
  * `Differential Output Preservation: OFF`
  * `Blank Prompt Preservation: OFF`
17-image dataset - Resolution settings: 512, 768, 1024 (ON)
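For anyone running ai-toolkit headless instead of through the UI, here is a rough sketch of how the settings above might map onto a job config. Every key name below is an assumption borrowed from the example configs the repo ships for other models, and the model path is a placeholder; diff it against the YAML the UI actually writes out rather than treating it as the official Z Image config.

```python
# Hypothetical mapping of the UI settings above onto an ai-toolkit-style job
# config. Key names follow the repo's published example configs for other
# models and may differ for Z Image Turbo -- verify against the YAML the UI
# generates before relying on it.
import yaml

job_config = {
    "job": "extension",
    "config": {
        "name": "z_image_turbo_lora",
        "process": [{
            "type": "sd_trainer",
            "device": "cuda:0",
            "network": {"type": "lora", "linear": 32, "linear_alpha": 32},  # alpha = rank is an assumption
            "save": {"dtype": "bf16", "save_every": 500, "max_step_saves_to_keep": 4},
            "datasets": [{
                "folder_path": "/workspace/dataset",   # the 17 images + captions
                "caption_ext": "txt",
                "resolution": [512, 768, 1024],        # matches the dataset resolutions above
                "cache_latents_to_disk": True,
            }],
            "train": {
                "batch_size": 1,
                "gradient_accumulation_steps": 1,
                "steps": 3000,
                "optimizer": "adamw8bit",
                "lr": 1e-4,
                "weight_decay": 1e-4,
                "timestep_type": "sigmoid",
                "cache_text_embeddings": True,
            },
            "model": {
                "name_or_path": "<z-image-turbo model path>",  # placeholder, check the UI
                "quantize": True,       # float8 transformer
                "quantize_te": True,    # float8 text encoder
                "low_vram": True,
            },
        }],
    },
}

print(yaml.safe_dump(job_config, sort_keys=False))
```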
RTX 5090: `1.30s/it, lr: 1.0e-04, loss: 3.742e-01`
Halfway through my training and it's already looking fantastic. Estimating about 1.5 hours to train 3000 steps, including samples and saves.
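The back-of-envelope math checks out. The step count, batch size, dataset size, and iteration speed below just restate the numbers above; the extra time for sampling and checkpoint saves is my own rough assumption.

```python
# Rough sanity check on the run above. Per-iteration speed, step count, and
# dataset size come from the post; the sampling/saving overhead is a guess.
steps = 3000
sec_per_it = 1.30
dataset_size = 17
batch_size = 1

train_seconds = steps * sec_per_it              # ~3900 s of pure training
epochs = steps * batch_size / dataset_size      # ~176 passes over the 17 images

print(f"pure training time : {train_seconds / 3600:.2f} h")  # ~1.08 h
print(f"effective epochs   : {epochs:.0f}")
# Add periodic samples and checkpoint saves on top and ~1.5 h total is plausible.
```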
CivitAI is about to be flooded with LoRAs. Give this dude some money: [https://www.patreon.com/ostris](https://www.patreon.com/ostris)
https://redd.it/1p957k2
@rStableDiffusion
Z-Image: Best Practices for Maximum Detail, Clarity, and Quality?
Z-Image pics tend to be a *little* blurry, a *little* grainy, and a *little* compressed-looking.
Here's what I know (or think I know) so far that can help clear things up a bit.
- Don't render at 1024x1024. Go higher: 1440x1440, 1920x1088, or 2048x2048. 3840x2160 is too high for this model natively.
- Change the shift (ModelSamplingAuraFlow) from 3 (default) to 7. If the node is off, it defaults to 3. (See the sketch after these notes for what shift actually does to the schedule.)
- Using more than 9 steps doesn't help, it hurts. 20 or 30 steps just results in blotchy skin.
EDIT - The combination of euler and sgm_uniform solves the problem of skin getting blotchy at higher steps. But after some testing I can't find any reason to go higher than 9 steps: the image isn't any sharper, there aren't any more details, text accuracy doesn't increase, and anatomy is equal at 9 or 25 steps. But maybe there is SOME reason to increase steps? IDK.
- From my testing, res2 and bong_tangent also result in worse-looking, blotchy skin. Euler/Beta or Euler/linear_quadratic seem to produce the cleanest images (I have NOT tried all combinations).
- Lowering cfg from 1 to 0.8 will mute colors a bit, which you may like.
Raising cfg from 1 to 2 or 3 will saturate colors and make them pop while still remaining balanced. Any higher than 3 and your images burn. Honestly, I prefer the look of cfg 2 to cfg 1, BUT raising cfg above 1 will also nearly double your render time, since the sampler has to run a second, unconditional pass per step.
- Upscaling with Topaz produces *very* nice results, but if you know of an in-Comfy solution that is better, I'd love to hear about it.
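Since shift and cfg are the least obvious knobs in the list above, here is a small sketch of the math I believe is behind them. The shift remap is the standard flow-matching time shift that, to my understanding, ComfyUI's ModelSamplingAuraFlow node applies, and the cfg function shows why anything above 1.0 roughly doubles render time: the sampler needs a second, unconditional prediction every step. Treat both as illustrations under those assumptions, not a dump of ComfyUI's actual code.

```python
# Sketch of (what I understand to be) the two knobs discussed above.
# 1) AuraFlow-style time shift: remaps the sampler's sigmas so more of the
#    step budget is spent at high noise. Higher shift = stronger remap.
# 2) Classifier-free guidance: cfg > 1 needs a conditional AND an
#    unconditional model call per step, hence the near-doubled render time.
import numpy as np

def shift_sigma(sigma: np.ndarray, shift: float) -> np.ndarray:
    """Flow-matching time shift (assumed ModelSamplingAuraFlow behaviour)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# 9 evenly spaced sigmas in (0, 1], before and after shifting.
base = np.linspace(1.0, 1.0 / 9, 9)
print("shift=3:", np.round(shift_sigma(base, 3.0), 3))
print("shift=7:", np.round(shift_sigma(base, 7.0), 3))
# With shift=7 the schedule stays at high-noise values longer, which is the
# mechanism behind the "raise shift for cleaner high-res output" advice above.

def cfg_mix(uncond: np.ndarray, cond: np.ndarray, cfg: float) -> np.ndarray:
    """Standard CFG combine: cfg=1 collapses to the conditional prediction."""
    return uncond + cfg * (cond - uncond)
```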
What have you found produces the best results from Z-Image?
https://redd.it/1p8xtln
@rStableDiffusion
Z Image flaws...
There's been a huge amount of hype about Z Image, so I was excited to finally get to use it and see if it all stacks up. I'm seeing some aspects I'd call "flaws", and perhaps you guys can offer your insight:
* Images often have exactly the same composition: I type in "A cyberpunk bedroom" and every shot is from the same direction with the same proportions. Bed in the same position. Wall in the same place. Window in the same place. Despite being able to fulfill quite complex prompts, it also seems incapable of being imaginative beyond that core prompt fulfillment. It gives you one solution, and then every other image follows the same layout.
* For example, I did the prompt "An elephant on a ball", and the ball was always a ball with a globe printed on it. I could think of a hundred different types of ball that elephant could be on, but this model cannot.
* I also did "an elephant in a jungle, dense jungle vegetation", and every single image has a similarly shaped tree in the top right. You can watch it build the image, and it goes so far as to drop that tree in at the second step. Kinda bizarre. Surely it must have enough knowledge of jungles to mix it up a bit, or simply let the random seed trigger that diversity. Apparently not, though.
* It struggles to break away from what it thinks an image should look like: I typed in "A Margarita in a beer glass" and "A Margarita in a whisky glass" and it fails on both. Apparently every single Margarita in existence is served in the same identically shaped glass.
* It feels clear to me that whatever clever stuff they've done to make this model shine is also the thing that reduces its diversity. As others have pointed out, people often look incredibly similar. Again, it just loses diversity.
* Position/viewer handling: I find it can often be quite hard to get it to follow prompts about how to position people. "From the side" often does nothing, and it follows the same image layout with or without it. It can get the composition you want, but sometimes you need to hit a specific description to achieve it. Whereas previous models would offer up quite some diversity every time, at the cost of also sometimes giving you horrors.
* I agree the model is worth gushing over. The hype is big and deserved, but it does come at a price: it's not perfect, and it feels like we've gained some things but lost in other areas.
https://redd.it/1p94upi
@rStableDiffusion
Z Image Character LoRA on 29 real photos - trained on 4090 in ~5 hours.
https://redd.it/1p9e8g3
@rStableDiffusion