r/StableDiffusion – Telegram
There still doesn't seem to be a robust way of creating extended videos with Wan 2.2

With 2.1 and InfiniteTalk, we can create long-running videos with very little quality loss.

It seems strange to me that nothing in 2.2 seems to offer this capability. Wan Animate does a decent job, but it's limited to fixed pose references, which struggle with any complex movement between multiple characters.

All extend-from-last-frame techniques look extremely questionable because the quality degrades after decoding. VACE 2.2 does nothing to help here; even when it does provide continuous movement between segments (using frames for context), it 'smooths the transition' rather than keeping it consistent.

Without something like InfiniteTalk in 2.2, I'm finding it difficult to make any good extended video, which is a shame given all of 2.2's motion capabilities and LoRAs.

https://redd.it/1p187pr
@rStableDiffusion
Is it just me who gets this impression? Is SDXL better than Flux and Qwen for generating art like this? Is the problem the text encoder?
https://redd.it/1p1dsce
@rStableDiffusion
SDXL simple basic shapes prompt help.
https://redd.it/1p1g8i8
@rStableDiffusion
A 2-million-parameter denoiser model is everything you need! (Source code + model + poison detector) Anti-Nightshade, Anti-Glaze

Wassup
Today, I’m going to show you a new model designed for image **depoisoning**.

I decided to build something fresh, and this time the focus was on efficiency: the new model is incredibly lightweight, clocking in at just **2 million parameters**.

In addition to the denoiser, I’ve also trained a separate AI "Detector" that can tell you whether an image has been poisoned or if it's clean.

A quick heads-up: neither model is magic. They can (and likely will) make mistakes, but I have done my best to minimize errors. Regarding the denoiser specifically, I feel the architecture is a solid improvement over my previous version.

# 1. The Denoiser Architecture

Unlike standard heavy U-Nets, this architecture is designed to be Bias-Free and highly responsive to the specific noise level of the image.

Here is how this works:

* **Gaussian Prior Extraction:** Before the network even starts processing, the model uses a `ResidualPriorExtractor`. It runs fixed Gaussian kernels over the image to separate high-frequency details (edges/noise) from the smooth background. This gives the model a "head start" by highlighting areas where poison usually hides.
* **Noise Conditioning:** The model isn't static. It uses a `NoiseConditioner` that takes a noise level (sigma) and a content descriptor. It projects these into an embedding that modulates the network layers. Essentially, the model adjusts its "aggression" based on how noisy the image is.
* **Bias-Free Design:** All convolutions in the network have `bias=False`. This forces the network to rely purely on the feature data and normalization (`LayerNorm2d`), which often leads to better generalization in restoration tasks.
* **Gated Residual Blocks:** The core building blocks use **Global Gating**. The network calculates a gating value (0 to 1) based on the global mean of the features, allowing it to selectively let information pass through or be suppressed (see the sketch after this list).

# 2. The "Predictor" (Detector) Architecture

Why I call it the "Predictor" (for fun and giggles):

I named this model the Predictor because it doesn't just classify an image as "Bad" or "Good"—it simultaneously predicts the noise mask (where the poison is located).

This is a much more complex beast called **GhostResidualDecomposition-Net**. Here is how it achieves high accuracy:

* **The Backbone (ResNet + SE):** The encoder uses Residual Blocks enhanced with **Squeeze-and-Excitation (SE) Blocks**. SE blocks allow the network to perform "channel attention"—learning which feature maps are important and weighing them higher.
* **ASPP (Atrous Spatial Pyramid Pooling):** Located at the bottleneck, this module looks at the image with different "zoom levels" (dilated convolutions). This captures context at multiple scales, ensuring the model sees both fine noise patterns and the global image structure.
* **Attention Gates in the Decoder:** When the network upsamples the image to reconstruct it, it uses **Attention Gates** on the skip connections. Instead of blindly copying features from the encoder, these gates filter the features to focus only on relevant regions (the poisoned pixels); both the SE block and the attention gate are sketched after this list.
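
Here are compact, standard-form sketches of the two attention mechanisms named in the list. Channel counts and the reduction ratio are placeholders; the actual GhostResidualDecomposition-Net, including its ASPP bottleneck (omitted here), will differ in the details.

```python
# Standard SE block and attention gate; placeholder sizes, not the repo's values.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weight channels by their global importance."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool
        return x * w[:, :, None, None]    # excite: per-channel scaling


class AttentionGate(nn.Module):
    """Filter encoder skip features using the decoder's gating signal."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, 1)
        self.phi = nn.Conv2d(gate_ch, inter_ch, 1)
        self.psi = nn.Conv2d(inter_ch, 1, 1)

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Assumes gate already matches skip's spatial size (upsample beforehand if not).
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + self.phi(gate))))
        return skip * attn                # keep only the regions the gate deems relevant


# Example: 64-channel skip features gated by 128-channel decoder features.
skip = torch.randn(1, 64, 64, 64)
gate = torch.randn(1, 128, 64, 64)
refined = SEBlock(64)(AttentionGate(64, 128, 32)(skip, gate))
```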

# 3. The "Ghost Loss" Function

[yeah, only 7 epochs. ](https://preview.redd.it/58zzm7qyd92g1.png?width=331&format=png&auto=webp&s=66692cfdf6f459c25024dd86a2a0a3456dbc2038)

To train the Predictor, I used a custom loss function I call **Ghost Loss** (very original). It ensures the model isn't just hallucinating a clean image. It combines four specific penalties, sketched after this list:

1. **Pixel-wise Noise Match:** Does the predicted noise mask match the real poison?
2. **Restoration Match (MSE):** If we subtract the mask, does the result look like the original clean image?
3. **Binary Classification (BCE):** Did it correctly flag the image as Poisoned/Safe?
4. **Semantic Anchor (Perceptual Loss):** This is the "Ghost" part. It runs the restored image through a frozen **VGG16** network to ensure the *features* (not just pixels) match.
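
For reference, here is a hedged reconstruction of what a four-term loss like this could look like in PyTorch. The term weights, the VGG16 feature slice, and the mask convention (predicted noise subtracted from the poisoned image) are my assumptions; only the four components come from the description above.

```python
# Hedged reconstruction of a four-term "Ghost Loss"; weights and conventions are assumed.
import torch
import torch.nn as nn
import torchvision.models as models


class GhostLoss(nn.Module):
    def __init__(self, w_noise=1.0, w_restore=1.0, w_cls=0.5, w_ghost=0.1):
        super().__init__()
        self.w = (w_noise, w_restore, w_cls, w_ghost)
        self.mse = nn.MSELoss()
        self.bce = nn.BCEWithLogitsLoss()
        # Frozen VGG16 features (up to relu3_3) as the semantic anchor.
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg  # note: ImageNet normalization omitted for brevity

    def forward(self, pred_mask, true_mask, poisoned, clean, cls_logit, is_poisoned):
        restored = poisoned - pred_mask                          # remove predicted poison
        l_noise = self.mse(pred_mask, true_mask)                 # 1. pixel-wise noise match
        l_restore = self.mse(restored, clean)                    # 2. restoration match (MSE)
        l_cls = self.bce(cls_logit, is_poisoned)                 # 3. poisoned/safe flag (BCE)
        l_ghost = self.mse(self.vgg(restored), self.vgg(clean))  # 4. semantic (perceptual) anchor
        w1, w2, w3, w4 = self.w
        return w1 * l_noise + w2 * l_restore + w3 * l_cls + w4 * l_ghost


# Smoke test with random tensors (batch of 2 RGB images).
poisoned = torch.rand(2, 3, 224, 224)
clean = torch.rand(2, 3, 224, 224)
pred_mask = torch.rand(2, 3, 224, 224, requires_grad=True)
cls_logit = torch.zeros(2, 1, requires_grad=True)
loss = GhostLoss()(pred_mask, poisoned - clean, poisoned, clean, cls_logit, torch.ones(2, 1))
loss.backward()
```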

Nvidia sells an H100 for 10 times its manufacturing cost. Nvidia is the big villain company; it's because of them that large models like GPT-4 aren't available to run on consumer hardware. AI development will only advance when this company is dethroned.

Nvidia's profit margin on data center GPUs is really very high: 7 to 10 times the manufacturing cost.

Without Nvidia's inflated monopoly pricing, it would hypothetically be possible for this GPU to be available to home consumers!

This company is delaying the development of AI.

https://redd.it/1p1m5gl
@rStableDiffusion
Version 1.0: The Easiest Way to Train Wan 2.2 LoRAs (Under $5)

https://github.com/obsxrver/wan22-lora-training
If you’ve been wanting to train your own Wan 2.2 Video LoRAs but are intimidated by the hardware requirements, parameter tweaking insanity, or the installation nightmare—I built a solution that handles it all for you.

https://preview.redd.it/8avncmwwbb2g1.png?width=875&format=png&auto=webp&s=71f66d615d269a03af89744285543476c7ab880e

This is currently the easiest, fastest, and cheapest way to get a high-quality training run done.

Why this method?

* **Zero Setup:** No installing Python or CUDA, no hunting for dependencies. You launch a pre-built [Vast.AI](http://Vast.AI) template, and it's ready in minutes.
* **Full WebUI:** Drag-and-drop your videos/images, edit captions, and click "Start." No terminal commands required.
* **Extremely Cheap:** You can rent a dual RTX 5090 node, train a full LoRA in 2-3 hours, and auto-shutdown. Total cost is usually under $5.
* **Auto-Save:** It automatically uploads your finished LoRA to your Cloud Storage (Google Drive/S3/Dropbox) and kills the instance so you don't pay for a second longer than necessary.

How it works:

1. Click the Vast.AI template link (in the repo).
2. Open the WebUI in your browser.
3. Upload your dataset and press Train.
4. Come back in an hour to find your LoRA in your Google Drive.

It supports both Text-to-Video and Image-to-Video, and it optimizes for dual-GPU setups (training the High-noise and Low-noise models simultaneously) to cut training time in half.
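
The dual-GPU trick works because Wan 2.2 uses separate high-noise and low-noise expert models, so their LoRAs can be trained independently. As a purely hypothetical illustration of that pattern (the script name, flags, and file names below are placeholders, not the repo's actual interface), the parallel launch looks roughly like this:

```python
# Hypothetical illustration of training the two Wan 2.2 experts in parallel,
# one per GPU. "train_lora.py" and its flags are placeholders.
import os
import subprocess


def launch(expert: str, gpu: int) -> subprocess.Popen:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))   # pin the job to one GPU
    return subprocess.Popen(
        ["python", "train_lora.py",                          # placeholder trainer script
         "--model", f"wan2.2-{expert}-noise",                # placeholder model id
         "--dataset", "dataset/",
         "--output", f"wan22-{expert}-noise-lora.safetensors"],
        env=env,
    )


# High-noise expert on GPU 0, low-noise expert on GPU 1, in parallel.
jobs = [launch("high", 0), launch("low", 1)]
for job in jobs:
    job.wait()   # both runs finish independently, roughly halving wall-clock time
```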

Repo + Template Link:

https://github.com/obsxrver/wan22-lora-training

Let me know if you have questions.

https://redd.it/1p1puml
@rStableDiffusion