r/StableDiffusion – Telegram
Update to Repo for my AI Toolkit Fork + New Yaml Settings for I2V motion training

Hi, a PR has already been submitted to Ostris, but yeah... my last one hasn't even been looked at. So here is my fork repo:
[https://github.com/relaxis/ai-toolkit](https://github.com/relaxis/ai-toolkit)

Changes:

1. Automagic now trains a separate LR per LoRA (high and low noise) when it detects MoE training - LR outputs now print to the log and terminal. You can also train each LoRA with its own optimizer parameters:

optimizer_params:
  lr_bump: 0.000005 # old
  min_lr: 0.000008 # old
  max_lr: 0.0003 # old
  beta2: 0.999
  weight_decay: 0.0001
  clip_threshold: 1
  high_noise_lr_bump: 0.00001 # new
  high_noise_min_lr: 0.00001 # new
  high_noise_max_lr: 0.0003 # new
  low_noise_lr_bump: 0.000005 # new
  low_noise_min_lr: 0.00001 # new
  low_noise_max_lr: 0.0003 # new

2. Changed the resolution bucket logic - previously this used SDXL-style bucket logic, but now you can specify a per-frame pixel count. Higher-dimension videos and images can be trained as long as they fit within that pixel budget (which allows higher-resolution, low-VRAM videos below your cut-off resolution). A rough sketch of the idea follows the config below.

resolution:
  - 512
max_pixels_per_frame: 262144
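For intuition, here is a rough sketch of the kind of per-frame budget check this enables; it is not the fork's actual implementation, and the snap-to-a-multiple-of-16 step is an assumption:

```python
import math

def fit_to_pixel_budget(width, height, max_pixels_per_frame=262144, multiple=16):
    """Scale (width, height) down so width * height stays under the budget,
    preserving aspect ratio. 262144 = 512 * 512, i.e. the same pixel count
    as a 512x512 frame. Snapping to a multiple of 16 is an assumption here,
    not necessarily what the fork does."""
    pixels = width * height
    if pixels > max_pixels_per_frame:
        scale = math.sqrt(max_pixels_per_frame / pixels)
        width, height = int(width * scale), int(height * scale)
    # Snap down so the dimensions stay bucket/latent friendly.
    width -= width % multiple
    height -= height % multiple
    return width, height

# e.g. a 720x1280 portrait clip is scaled to stay under the 512x512 pixel
# budget while keeping its aspect ratio instead of being forced to 512x512.
print(fit_to_pixel_budget(720, 1280))
```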

https://redd.it/1oiyuzr
@rStableDiffusion
How do people use WAN for image generation?

I've read plenty of comments mentioning how good WAN is supposed to be at image gen, but nobody shares any specifics or details about it.

Do they use the default workflow and modify settings? Is there a custom workflow for it? If it's apparently so good, how come there's no detailed guide for it? Couldn't be better than Qwen, could it?

https://redd.it/1oj8ubq
@rStableDiffusion
Your Hunyuan 3D 2.1 preferred workflow, settings, techniques?

Local only, always. Thanks.

They say start with a joke so..
How do 3D modelers say they're sorry?
They Topologize.

I realize Hunyuan 3D 2.1 won't produce as good a result as nonlocal options but I want to get the output as good as I can with local.

What do you folks do to improve your output?

My model and textures always come out very bad, like a Play-Doh model with textures worse than an NES game.

Anyway, I have tried a few different workflows, such as Pixel Artistry's 3D 2.1 workflow, and I've tried:

Increasing the octree resolution to 1300 and the steps to 100. (The octree resolution seems to have the most impact on model quality but I can only go so high before OOM).

Using a higher resolution square source image from 1024 to 4096.

Also, is there a way to increase the Octree Resolution far beyond the GPU VRAM limits but have the generation take longer? For example, it only takes a couple minutes to generate a model (pre texturing) but I wouldn't mind letting it run overnight or longer if it could generate a much higher quality model. Is there a way to do this?

Thanks fam

Disclaimer: (5090, 64GB RAM)

https://redd.it/1ojcfti
@rStableDiffusion
Texturing using StableGen with SDXL on a more complex scene + experimenting with FLUX.1-dev

https://redd.it/1ojfsvv
@rStableDiffusion
Variational Autoencoder (VAE): How to train and inference (with code)

Hey,

I have been exploring Variational Autoencoders (VAEs) recently, and I wanted to share a concise explanation of their architecture, training process, and inference mechanism.

You can check out the code [here](https://gist.github.com/nik-55/a5fcfaec90a01a3190abf0ba125e1796)

A Variational Autoencoder (VAE) is a type of **generative neural network** that learns to compress data into a probabilistic, low-dimensional "latent space" and then generate new data from it. Unlike a standard autoencoder, its **encoder** doesn't output a single compressed vector; instead, it outputs the parameters (a **mean** and **variance**) of a probability distribution. A sample is then drawn from this distribution and passed to the **decoder**, which attempts to reconstruct the original input. This probabilistic approach, combined with a unique loss function that balances **reconstruction accuracy** (how well it rebuilds the input) and **KL divergence** (how organized and "normal" the latent space is), forces the VAE to learn the underlying structure of the data, allowing it to generate new, realistic variations by sampling different points from that learned latent space.
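A minimal sketch of that loss, assuming the decoder outputs logits (as in the architecture below) and using the closed-form KL between a diagonal Gaussian and N(0, I); the `kl_weight` knob is an assumption, not a value from the gist:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target, mu, logvar, kl_weight=1e-4):
    # Reconstruction term: how well the decoder rebuilds the input.
    # BCE-with-logits assumes pixel values in [0, 1]; MSE is a common alternative.
    recon = F.binary_cross_entropy_with_logits(recon_logits, target)
    # KL term: closed form for KL(N(mu, sigma^2) || N(0, I)), averaged over the
    # batch, which keeps the latent space organized and "normal".
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```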

There are plenty of resources on how to perform inference with a VAE, but far fewer on how to train one, or on how, for example, Stable Diffusion came up with its magic number, 0.18215.
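That factor is usually explained as 1 / (standard deviation of the encoder's latents over a sample of training data), so that scaled latents have roughly unit variance. Here is a sketch of how you could estimate such a factor for your own VAE; `vae.encode` returning `(mu, logvar)` and a dataloader yielding image tensors are assumptions about the interface:

```python
import torch

@torch.no_grad()
def estimate_latent_scale(vae, dataloader, n_batches=64):
    stds = []
    for i, images in enumerate(dataloader):   # assumed to yield image tensors
        if i >= n_batches:
            break
        mu, logvar = vae.encode(images)        # assumed interface
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        stds.append(z.flatten().std())
    # Latents are multiplied by this factor before diffusion training
    # and divided by it again before decoding.
    return 1.0 / torch.stack(stds).mean()
```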

# Architecture
It is loosely inspired by the architecture of the [Wan 2.1 VAE](https://github.com/Wan-Video/Wan2.1/blob/main/wan/modules/vae.py), which is a video generative model.

### Key Components

- `ResidualBlock`: A standard ResNet-style block using SiLU activations: (Norm -> SiLU -> Conv -> Norm -> SiLU -> Conv) + Shortcut (a sketch follows this list). This allows for building deeper networks by improving gradient flow.
- `AttentionBlock`: A scaled_dot_product_attention block is used in the bottleneck of the encoder and decoder. This allows the model to weigh the importance of different spatial locations and capture long-range dependencies.
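A minimal sketch of the ResidualBlock described above; GroupNorm with 32 groups and the 1x1 shortcut convolution are assumptions here, so check the gist for the exact choices:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """(Norm -> SiLU -> Conv -> Norm -> SiLU -> Conv) + shortcut."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(32, in_channels),   # assumed norm; channels must be divisible by 32
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.GroupNorm(32, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        # Identity shortcut when channels match, 1x1 conv otherwise.
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x):
        return self.block(x) + self.shortcut(x)
```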

### Encoder

The encoder compresses the input image into a statistical representation (a mean and variance) in the latent space.
- A preliminary Conv2d projects the image into a higher dimensional space.
- The data flows through several ResidualBlocks, progressively increasing the number of channels.
- A Downsample layer (a strided convolution) halves the spatial dimensions.
- At this lower resolution, more ResidualBlocks and an AttentionBlock are applied to process the features.
- Finally, a Conv2d maps the features to latent_dim * 2 channels. This output is split down the middle: one half becomes the mu (mean) vector, and the other half becomes the logvar (log-variance) vector.
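The final split, sketched with hypothetical module names (`backbone`, `to_latent`):

```python
import torch

def encode(backbone, to_latent, x):
    h = backbone(x)                        # residual blocks, downsample, attention
    h = to_latent(h)                       # Conv2d -> (B, latent_dim * 2, H', W')
    mu, logvar = torch.chunk(h, 2, dim=1)  # split down the middle
    return mu, logvar
```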

### Decoder

The decoder takes a single vector z sampled from the latent space and attempts to reconstruct the image.
- It begins with a Conv2d to project the input latent_dim vector into a high-dimensional feature space.
- It roughly mirrors the encoder's architecture, using ResidualBlocks and an AttentionBlock to process the features.
- An Upsample block (Nearest-Exact + Conv, sketched after this list) doubles the spatial dimensions back to the original size.
- More ResidualBlocks are applied, progressively reducing the channel count.
- A final Conv2d layer maps the features back to the input image's channel count, producing the reconstructed image (as logits).
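A sketch of the Upsample block mentioned above (Nearest-Exact + Conv); the 3x3 kernel size is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class Upsample(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Double the spatial dimensions, then refine with a convolution.
        x = F.interpolate(x, scale_factor=2.0, mode="nearest-exact")
        return self.conv(x)
```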

# Training

### The Reparameterization Trick
A core problem in training VAEs is that the sampling step (z is randomly drawn from a Gaussian with mean mu and variance exp(logvar)) is not differentiable, so gradients cannot flow back to the encoder.
- Problem: We can't backpropagate through a random node.
- Solution: We re-parameterize the sampling. Instead of sampling z directly, we sample a random noise vector eps from a standard normal distribution N(0, I). We then deterministically compute z using our encoder's outputs: `std = torch.exp(0.5 * logvar)` `z = mu + eps * std`
- Result: The randomness is now an input to the computation rather than a step within it. This creates a differentiable path, allowing gradients to flow back through mu and logvar.
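Putting the two lines from the Solution step into a helper, the way most VAE implementations do:

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # randomness enters here, outside the gradient path
    return mu + eps * std         # differentiable with respect to mu and logvar
```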