Ref2Font: Generate full font atlases from just two letters (FLUX.2 Klein 9B LoRA)
https://redd.it/1qw83f5
@rStableDiffusion
Z-Image workflow to combine two character loras using SAM segmentation
https://redd.it/1qwdl2b
@rStableDiffusion
Z-image lora training news
Many people have reported that LoRA training performs poorly on Z-Image base. Less than 12 hours ago, someone on Bilibili claimed to have found the cause: the 8-bit (uint8) state used by the AdamW8bit optimizer. According to the author, you have to use an FP8 optimizer for Z-Image base. The author posted some comparisons; see https://b23.tv/g7gUFIZ for more info.
https://redd.it/1qw05vn
@rStableDiffusion
Thoughts and Solutions on Z-IMAGE Training Issues [Machine Translation]
After the launch of ZIB (Z-IMAGE), I spent a lot of time training on it and ran into quite a few weird issues. After many experiments, I’ve gathered some experience and solutions that I wanted to share with the community.
# 1. General Configuration (The Basics)
First off, regarding the format: **Use FULL RANK LoKR with a factor of 8-12.** In my testing, full-rank LoKR is a superior format compared to LoRA and significantly improves training results.
* **Optimizers/LR:** I don't think the optimizer or learning rate is the biggest bottleneck here. As long as your settings aren't wildly off, it should train fine. If you are unsure, just stick to **Prodigy_ADV with LR 1 and a Cosine scheduler**.
* **Warning:** Be careful with **BNB 8bit** processing, as it might cause precision loss. (Reference discussion: [Reddit Link](https://www.reddit.com/r/StableDiffusion/comments/1qw05vn/zimage_lora_training_news/))
* **Captioning:** My experience here is very similar to SD and subsequent models. The logic remains the same: do not over-describe the inherent features of your subject, but *do* describe the distractions/elements you want to separate from the subject.
* **Short vs. Long Tags:** If you want to use short tags for prompting, you must train with short tags. However, this often leads to structural errors. A mix of long/short caption wildcards—or just sticking to long prompting—seems to avoid this structural instability.
Most of the above aligns with what we know from previous model training. However, let's talk about the **new problems specific to ZIB**.
# 2. The Core Problems with ZIB
Currently, I've identified two major hurdles:
# (1) Precision
Based on my runs and other research, ZIB is extremely sensitive to precision.
[https://www.reddit.com/r/StableDiffusion/comments/1qw05vn/zimage_lora_training_news/](https://www.reddit.com/r/StableDiffusion/comments/1qw05vn/zimage_lora_training_news/)
I switched my setup to: **BF16 + Kahan summation + OneTrainer SVD Quant BF16 + Rank 16.**
[https://github.com/kohya-ss/sd-noscripts/pull/2187](https://github.com/kohya-ss/sd-noscripts/pull/2187)
The magic result? **I can run this on 12GB VRAM in OneTrainer.** This change significantly improved both the training quality and learning speed. Precision seems to be the learning bottleneck here. Using Kahan summation (or stochastic rounding) provides a noticeable improvement, similar to how it helps with older models.
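For readers unfamiliar with Kahan summation in this context: when weights are stored in BF16, updates much smaller than the weight's ULP get rounded away entirely, and a full-precision compensation buffer rescues them. Here is a minimal pure-Python sketch simulating BF16 storage by truncating the low 16 bits of a float32; `to_bf16` and `kahan_step` are illustrative names, not OneTrainer's actual API:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate bfloat16 storage by truncating a float32's low 16 bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def kahan_step(w: float, comp: float, update: float):
    """Apply one optimizer update to a BF16-stored weight with Kahan compensation."""
    y = update + comp        # re-inject the error left over from earlier steps
    new_w = to_bf16(w + y)   # low-precision store rounds part of y away...
    comp = y - (new_w - w)   # ...but we keep exactly what was lost, at full precision
    return new_w, comp

# 1000 tiny updates of 1e-4: naive BF16 loses all of them, Kahan keeps ~0.1
naive = 1.0
w, comp = 1.0, 0.0
for _ in range(1000):
    naive = to_bf16(naive + 1e-4)       # 1e-4 is below BF16's ULP at 1.0 (~0.0078)
    w, comp = kahan_step(w, comp, 1e-4)

print(naive)     # stuck at 1.0: every update vanished
print(w + comp)  # close to the true 1.1
```

This is the same failure mode the post describes: a BF16 weight plus a sub-ULP gradient step rounds straight back to the old value, so training silently stalls unless the lost remainder is carried forward.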
# (2) The Timestep Problem
Even after fixing precision, ZIB can still be hard to train. I noticed instability even when using FP32. So, I dug deeper.
Looking at the Z-IMAGE report, it uses a **logit-normal noise sampler** (similar to SD3) and **Dynamic Timestep Shift** (similar to FLUX). It shifts sampling towards high noise based on resolution.
>Following SD3 \[18\], we employ the logit-normal noise sampler to concentrate the training process on intermediate timesteps. Additionally, to account for the variations in Signal-to-Noise Ratio (SNR) arising from our multi-resolution training setup, we adopt the dynamic time shifting strategy as used in Flux \[34\]. This ensures that the noise level is appropriately scaled for different image resolutions.
If you look at the timestep distribution at 512px resolution:
https://preview.redd.it/gj2326nvylhg1.png?width=506&format=png&auto=webp&s=5964a026a3522ef0d99fd32d0382e3b953120585
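To see why the tails end up so sparse, here is a hedged sketch of the sampling scheme as described in the report: a logit-normal draw concentrated on mid timesteps, followed by a FLUX-style shift that pushes mass toward high noise. The `shift=3.0` default is an illustrative guess, not the value ZIB actually uses at 512px:

```python
import math, random

def sample_timestep(mean=0.0, std=1.0, shift=3.0):
    """Logit-normal timestep in (0,1), then a FLUX-style dynamic shift toward high noise."""
    u = random.gauss(mean, std)
    t = 1.0 / (1.0 + math.exp(-u))                # sigmoid: concentrates mass mid-range
    return shift * t / (1.0 + (shift - 1.0) * t)  # shift > 1 pushes mass toward t ~ 1

random.seed(0)
draws = [sample_timestep() for _ in range(100_000)]
low_tail = sum(t < 0.05 for t in draws) / len(draws)
high_mass = sum(t > 0.5 for t in draws) / len(draws)
print(f"fraction with t < 0.05: {low_tail:.5f}")  # essentially never sampled
print(f"fraction with t > 0.5:  {high_mass:.3f}")  # the bulk of training
```

Under these assumed parameters the low-noise tail below t = 0.05 receives on the order of a few draws per hundred thousand, which is exactly the sparsity the loss-spike observation below points at.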
To align with this, I explicitly used **Logit Normal** and **Dynamic Timestep Shift** in **OneTrainer**.
**My Observation:** When training on just a single image, I noticed abnormal **LOSS SPIKES** at both low timesteps (0-50) and high timesteps (950-1000).
https://preview.redd.it/90fy67o3zlhg1.png?width=323&format=png&auto=webp&s=825c741345001f769e3a0db824f0ac667ba5ffd3
Inspired by Chroma ([https://huggingface.co/lodestones/Chroma](https://huggingface.co/lodestones/Chroma)), I suspect sparse sampling probabilities at certain steps might be the culprit behind the loss spikes.
>the tails—where high-noise and low-noise regions exist—are trained super sparsely. If you
>train for a looong time (say, 1000 steps), the likelihood of hitting those tail regions is almost zero. The problem? When the model finally does see them, the loss spikes hard, throwing training out of whack—even with a huge batch size.
At high batch sizes (BS), this instability is likely diluted. At small BS, there is a small but real probability that most samples in a batch fall into these "**sparse timestep**" zones—an anomaly the model hasn't seen much—causing instability.
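The batch-size intuition can be made concrete with a quick binomial calculation. Assuming, hypothetically, a 3% chance that any single sample lands in a sparse tail zone, the probability that at least half of a batch does so collapses as the batch grows:

```python
import math

def prob_half_batch_in_tail(batch_size: int, p_tail: float) -> float:
    """P(at least half the batch falls in the sparse-timestep tail), binomial model."""
    k_min = (batch_size + 1) // 2
    return sum(math.comb(batch_size, k) * p_tail**k * (1 - p_tail)**(batch_size - k)
               for k in range(k_min, batch_size + 1))

for bs in (2, 4, 64):
    print(bs, prob_half_batch_in_tail(bs, 0.03))
```

At BS 2 a majority-tail batch happens roughly once every 17 steps, at BS 4 roughly once every 200, and at BS 64 essentially never, which matches the "diluted at high BS" intuition above.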
**The Solution:** I manually modified the configuration to set **Min SNR Gamma = 5**.
* This drastically reduced the loss at low timesteps.
* Surprisingly, it also alleviated the loss spikes at the 950-1000 range. The high-step instability might actually be a ripple effect of the low-step spikes.
https://preview.redd.it/bc29t9aoylhg1.png?width=323&format=png&auto=webp&s=296f6f9c0359f20b143d959cddcb16683d82a8c9
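For intuition about what **Min SNR Gamma = 5** is doing: it clamps the effective per-timestep loss weight so very low-noise steps (huge SNR) can no longer dominate. A sketch using the standard min-SNR weighting min(SNR, γ)/SNR; the SNR formula here assumes a rectified-flow schedule `x_t = (1-t)·x0 + t·noise`, which may differ from your trainer's exact convention:

```python
def snr(t: float) -> float:
    """SNR for an assumed rectified-flow schedule x_t = (1-t)*x0 + t*noise."""
    return ((1.0 - t) / t) ** 2

def min_snr_weight(t: float, gamma: float = 5.0) -> float:
    """Min-SNR loss weight: min(SNR, gamma) / SNR down-weights low-noise timesteps."""
    s = snr(t)
    return min(s, gamma) / s

print(min_snr_weight(0.01))  # low-noise tail: tiny weight, spikes suppressed
print(min_snr_weight(0.50))  # mid timestep: SNR = 1 < gamma, full weight 1.0
print(min_snr_weight(0.99))  # high-noise tail: SNR already below gamma, full weight
```

Note that under this formulation the high-noise tail keeps a weight of 1.0, i.e. min-SNR does not directly touch the 950-1000 range, which is consistent with the observation above that the high-step improvement is a ripple effect of fixing the low-step spikes.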
# 3. How to Implement
If you are using unmodified OneTrainer or AI Toolkit, Z-IMAGE might not support the Min SNR option directly yet. You can try **limiting the minimum timestep** to achieve a similar effect, and use logit-normal sampling plus dynamic timestep shift in OneTrainer.
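One rough way to approximate that "limit the minimum timestep" workaround, if your trainer lets you hook the timestep sampler, is to resample logit-normal draws into a safe band so the extreme low-noise tail is simply never trained. The function and parameter names below are hypothetical, not an actual OneTrainer or AI Toolkit option:

```python
import math, random

def sample_timestep_banded(t_min=0.05, t_max=1.0, mean=0.0, std=1.0):
    """Logit-normal timestep, resampled until it lies inside [t_min, t_max]."""
    while True:
        t = 1.0 / (1.0 + math.exp(-random.gauss(mean, std)))
        if t_min <= t <= t_max:
            return t

random.seed(0)
draws = [sample_timestep_banded() for _ in range(10_000)]
print(min(draws) >= 0.05)  # True: the unstable low-timestep tail is excluded
```

Unlike min-SNR weighting, this removes the tail entirely rather than down-weighting it, so it is a blunter instrument, but it needs no loss-weighting support from the trainer.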
Alternatively, you can use my fork of OneTrainer:
**GitHub:** [https://github.com/gesen2egee/OneTrainer](https://github.com/gesen2egee/OneTrainer)
My fork includes support for:
* LoKR
* Min SNR Gamma
* A modified optimizer: `automagic_sinkgd` (which already includes Kahan summation).
**(If you want to stay on upstream OneTrainer, all optimizers ending with \_ADV are versions that already include Stochastic Rounding, which largely solves the precision problem.)**
Hope this helps anyone else struggling with ZIB training!
https://redd.it/1qwc4t0
@rStableDiffusion
Z-Image LoRA training is solved! A new Ztuner trainer is coming soon!
Finally, the day we have all been waiting for has arrived. On X we got the answer:
https://x.com/bdsqlsz/status/2019349964602982494
The problem was that adam8bit performs very poorly (and even AdamW struggles somewhat); this was first identified by the user "None9527". But now we have the answer: "prodigy_adv + Stochastic rounding". This optimizer gets the job done, and that's not all.
Soon we will get a new trainer called "Ztuner".
As of now, OneTrainer exposes Prodigy_Adv as an optimizer option and explicitly lists Stochastic Rounding as a toggleable feature for BF16/FP16 training.
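Stochastic rounding, for those wondering why it keeps coming up here: instead of always rounding to the same BF16 neighbour (which silently drops sub-ULP updates), you round up or down at random with probability proportional to the remainder, so small updates survive in expectation. A pure-Python illustration simulating BF16 via bit masking; this is a sketch of the idea, not OneTrainer's actual implementation:

```python
import struct, random

def bf16_stochastic_round(x: float) -> float:
    """Round a float32 to bfloat16 stochastically: unbiased in expectation."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    lower = bits & 0xFFFF0000         # truncated BF16 neighbour of x
    frac = (bits & 0xFFFF) / 65536.0  # how far x sits between the two neighbours
    if random.random() < frac:
        lower += 0x00010000           # occasionally take the other neighbour
    return struct.unpack('<f', struct.pack('<I', lower))[0]

random.seed(0)
x = 1.0001  # below BF16's ULP at 1.0 (~0.0078): deterministic rounding loses it
avg = sum(bf16_stochastic_round(x) for _ in range(100_000)) / 100_000
print(avg)  # close to 1.0001 on average, whereas plain truncation always gives 1.0
```

Averaged over many optimizer steps, the weight therefore drifts by the true update amount even though each individual stored value is only BF16, which is why it addresses the same stalling problem as Kahan summation.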
Hopefully we will get this implementation soon in other trainers too.
https://redd.it/1qwj4hu
@rStableDiffusion
青龍聖者 (@bdsqlsz) on X:
>Through ablation experiments and collaboration with the official team, the training problem was finally solved. Recommended configuration now: prodigy_adv + Stochastic rounding. It has been confirmed that adam8bit performs very poorly, and adamw seems to do…