logvar to update the encoder.

### Loss Function

The total loss for the VAE is `loss = recon_loss + kl_weight * kl_loss` (a code sketch follows the list below).

- Reconstruction Loss (recon_loss): It forces the encoder to capture all the important information about the input image and pack it into the latent vector z. If the information isn't in z, the decoder can't possibly recreate the image, and this loss will be high.
- KL Divergence Loss (kl_loss): Without this, the encoder would just learn to "memorize" the images, assigning each image a far-flung, specific point in the latent space. The kl_loss prevents this by forcing all the encoded distributions to be "pulled" toward the origin and to have unit variance. This organizes the latent space, packing all the encoded images into a smooth, continuous "cloud." This smoothness is what allows us to generate new, unseen images.
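
Here's a minimal sketch of how these two terms are typically combined in PyTorch, assuming the encoder outputs `mu` and `logvar` and the decoder outputs logits (the function and variable names are illustrative, not taken from the linked gist):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, logvar, kl_weight=1.0):
    # Reconstruction term: how well the decoder rebuilds x from z
    recon_loss = F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="sum"
    )
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_weight * kl_loss
```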

Simply adding the reconstruction and KL losses together often causes VAE training to fail due to a problem known as posterior collapse. This occurs when the KL loss is too strong at the beginning, incentivizing the encoder to find a trivial solution: it learns to ignore the input image entirely and just outputs a standard normal distribution (μ=0, σ=1) for every image, making the KL loss zero. As a result, the latent vector z contains no information, and the decoder, in turn, only learns to output a single, blurry, "average" image.

The solution is **KL annealing**, where the KL loss is "warmed up." For the first several epochs, its weight is set to 0, forcing the loss to be purely reconstruction-based; this compels the model to first get good at autoencoding and storing useful information in z. After this warm-up, the KL weight is gradually increased from 0 up to its target value, slowly introducing the regularizing pressure. This allows the model to organize the already-informative latent space into a smooth, continuous cloud without "forgetting" how to encode the image data.
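
A simple way to implement this warm-up is a linear schedule on the KL weight; the epoch counts and target value below are illustrative assumptions, not the settings from the linked code:

```python
def kl_weight_schedule(epoch, warmup_epochs=10, anneal_epochs=20, target=1.0):
    # Phase 1: pure reconstruction, no KL pressure
    if epoch < warmup_epochs:
        return 0.0
    # Phase 2: ramp the KL weight linearly from 0 up to its target value
    progress = (epoch - warmup_epochs) / anneal_epochs
    return min(progress, 1.0) * target
```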

Note: With a logits-based loss function (like binary cross-entropy with logits), the output layer does not use an activation function like sigmoid. The loss function applies the sigmoid internally, which is more numerically stable.
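
For example, with PyTorch's `nn.BCEWithLogitsLoss` the decoder's last layer stays linear and the sigmoid is folded into the loss (the layer sizes below are placeholders):

```python
import torch
import torch.nn as nn

decoder_head = nn.Linear(16, 784)        # final decoder layer: no sigmoid here
criterion = nn.BCEWithLogitsLoss()       # sigmoid is applied inside the loss

h = torch.randn(8, 16)                   # dummy decoder features
x_target = torch.rand(8, 784)            # pixel targets in [0, 1]

logits = decoder_head(h)                 # raw, unbounded scores
loss = criterion(logits, x_target)       # numerically stable sigmoid + BCE
probs = torch.sigmoid(logits)            # apply sigmoid only when you need pixels
```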

# Inference

Once trained, we throw away the encoder. To generate new images, we only use the decoder. We just need to feed it plausible latent vectors z. How we get those z vectors is the key.

### Method 1: Sample from the Aggregate Posterior
This method produces the highest-quality and most representative samples.
- The Concept: The KL loss pushes the average of all encoded distributions to be near N(0, I), but the actual, combined distribution of all z vectors (the "aggregate posterior" q(z)) is not a perfect bell curve. It's a complex "cloud" or "pancake" shape that reflects the true structure of your data.
- The Problem: If we just sample from N(0, I) (Method 2), we might pick a z vector that is in an "empty" region of the latent space where no training data ever got mapped. The decoder, having never seen a z from this region, will produce a poor or nonsensical image.
- The Solution: We sample from a distribution that better approximates this true latent cloud (a code sketch follows this list):
  - Pass the entire training dataset through the trained encoder once.
  - Collect all the output mu and var values.
  - Calculate the global mean (agg_mean) and global variance (agg_var) of this entire latent dataset, using the Law of Total Variance: `Var(Z) = E[Var(Z|X)] + Var(E[Z|X])`.
  - Instead of sampling from N(0, I), we now sample from `N(agg_mean, agg_var)`.
- The Result: Samples from this distribution are much more likely to fall "on-distribution," in dense areas of the latent space. This results in generated images that are much clearer, more varied, and more faithful to the training data.
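
Here is a sketch of that procedure, assuming an `encoder` that returns `(mu, logvar)` for a batch and a standard `(image, label)` dataloader; the helper names are illustrative:

```python
import torch

@torch.no_grad()
def fit_aggregate_posterior(encoder, dataloader, device="cpu"):
    mus, variances = [], []
    for x, _ in dataloader:
        mu, logvar = encoder(x.to(device))     # per-image posterior parameters
        mus.append(mu)
        variances.append(logvar.exp())
    mu_all = torch.cat(mus)                    # E[Z|X] for every training image
    var_all = torch.cat(variances)             # Var(Z|X) for every training image

    # Law of total variance: Var(Z) = E[Var(Z|X)] + Var(E[Z|X])
    agg_mean = mu_all.mean(dim=0)
    agg_var = var_all.mean(dim=0) + mu_all.var(dim=0)
    return agg_mean, agg_var

@torch.no_grad()
def sample_from_aggregate(decoder, agg_mean, agg_var, n_samples=16):
    noise = torch.randn(n_samples, agg_mean.shape[0], device=agg_mean.device)
    z = agg_mean + agg_var.sqrt() * noise      # on-distribution latents
    return decoder(z)
```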

### Method 2: Sample from the Prior N(0, I)
- The Concept: This method assumes the training was perfectly successful and the latent cloud q(z) is identical to the prior p(z) = N(0, I).
- The Solution: Simply generate a random vector z from a standard normal distribution (`z = torch.randn(...)`) and feed it to the decoder.
- The Result: This often produces lower-quality, blurrier, or less representative images that miss some variations seen in the training data.

### Method 3: Latent Space Interpolation
This method isn't for generating random images, but for visualizing the structure and smoothness of the latent space.
- The Concept: A well-trained VAE has a smooth latent space. This means the path between any two encoded images should also be meaningful.
- The Solution (sketched in code after this list):
  - Encode image_A to get its latent vector z1.
  - Encode image_B to get its latent vector z2.
  - Create a series of intermediate vectors by walking in a straight line: `z_interp = (1 - alpha) * z1 + alpha * z2`, for alpha stepping from 0 to 1.
  - Decode each z_interp vector.
- The Result: A smooth animation of image_A seamlessly "morphing" into image_B. This is a great sanity check that your model has learned a continuous and meaningful representation, not just a disjointed "lookup table."
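
A minimal sketch of the interpolation loop, again assuming the `encoder` returns `(mu, logvar)` and interpolating between the means:

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, image_a, image_b, steps=10):
    z1, _ = encoder(image_a.unsqueeze(0))          # use the mean as the latent code
    z2, _ = encoder(image_b.unsqueeze(0))
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z_interp = (1 - alpha) * z1 + alpha * z2   # straight line in latent space
        frames.append(decoder(z_interp))           # decode each point along the path
    return frames
```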


Thanks for reading.
Check out the [code](https://gist.github.com/nik-55/a5fcfaec90a01a3190abf0ba125e1796) to dig further into the details and experiment.

Happy Hacking!


https://redd.it/1ojhzgf
@rStableDiffusion
RTX 5080 + SageAttention 3 — 2K Video in 5.7 Minutes (WSL2, CUDA 13.0)

**Repository:** [github.com/k1n0F/sageattention3-blackwell-wsl2](https://github.com/k1n0F/sageattention3-blackwell-wsl2)

I’ve completed the full **SageAttention 3 Blackwell build** under **WSL2 + Ubuntu 22.04**, using **CUDA 13.0 / PyTorch 2.10.0-dev**.
The build runs stably inside **ComfyUI + WAN Video Wrapper** and fully detects the **FP4 quantization API** compiled for Blackwell (SM_120).

**Results:**

* 125 frames @ 1984×1120
* Runtime: 341 seconds (\~5.7 minutes)
* VRAM usage: 9.95 GB (max), 10.65 GB (reserved)
* FP4 API detected: `scale_and_quant_fp4`, `blockscaled_fp4_attn`, `fp4quant_cuda`
* Device: RTX 5080 (Blackwell SM_120)
* Platform: WSL2 Ubuntu 22.04 + CUDA 13.0

# Summary

* Built **PyTorch 2.10.0-dev + CUDA 13.0** from source
* Compiled SageAttention3 with `TORCH_CUDA_ARCH_LIST="12.0+PTX"`
* Fixed all major issues: `-lcuda`, `allocator mismatch`, `checkPoolLiveAllocations`, `CUDA_HOME`, `Python.h`, missing module imports
* Verified presence of FP4 quantization and attention kernels (not yet used in inference)
* Achieved stable runtime under ComfyUI with full CUDA graph support

# Proof of Successful Build

attention mode override: sageattn3
tensor out (1, 8, 128, 64) torch.bfloat16 cuda:0
Max allocated memory: 9.953 GB
Comfy-VFI done — 125 frames generated
Prompt executed in 341.08 seconds


# Conclusion

This is a **fully documented and stable SageAttention3 build for Blackwell (SM_120)**, compiled and executed entirely inside **WSL2**, **without official support**.
The FP4 infrastructure is fully present and verified, ready for future activation and testing.

https://redd.it/1ojosl5
@rStableDiffusion
What's the most technically advanced local model out there?

Just curious: which of the models, architectures, etc. that can be run on a PC is the most advanced from a technical point of view? I'm not asking for better images or more optimizations, but for a model that, say, uses something more powerful than CLIP encoders to associate prompts with images, or that incorporates multimodality, or any other trick that holds more promise than just perfecting the training dataset for a checkpoint.

https://redd.it/1ojgek3
@rStableDiffusion
Has anyone tried a new model FIBO?

https://huggingface.co/briaai/FIBO

https://huggingface.co/spaces/briaai/FIBO

The following is the official introduction, forwarded here:

# What's FIBO?

Most text-to-image models excel at imagination—but not control. FIBO is built for professional workflows, not casual use. Trained on structured JSON captions up to 1,000+ words, FIBO enables precise, reproducible control over lighting, composition, color, and camera settings. The structured captions foster native disentanglement, allowing targeted, iterative refinement without prompt drift. With only 8B parameters, FIBO delivers high image quality, strong prompt adherence, and professional-grade control—trained exclusively on licensed data.

https://redd.it/1ojsdji
@rStableDiffusion
UDIO just got nuked by UMG.

I know this is not an open source tool, but there are some serious implications for the whole AI generative community. Basically:

UDIO settled with UMG and ninja rolled out a new TOS that PROHIBITS you from:

1. Downloading generated songs.
2. Owning a copy of any generated song on ANY of your devices.

The TOS applies retroactively: you can no longer download songs generated under the old TOS, which allowed free personal and commercial use.

Worth noting: Udio was not purely a generative tool. Many musicians uploaded their own music to modify and enhance it, given its ability to separate stems. People lost months of work overnight.

https://redd.it/1ojvjh3
@rStableDiffusion