r/StableDiffusion – Telegram
Still doesn't seem to be a robust way of creating extended videos with Wan 2.2

With 2.1 and InfiniteTalk, we can create long-running videos with very little quality loss.

It seems strange to me that nothing in 2.2 offers this capability. Wan Animate does a decent job, but it's limited to fixed pose references, which struggle with any complex movement between multiple characters.

All the extend-from-last-frame techniques look extremely questionable because quality degrades after decoding. VACE 2.2 does nothing to help here, and even when it does provide continuous movement between segments (using context frames), it tends to 'smooth the transition' rather than keep it consistent.
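
For context, a minimal sketch of what that extend-from-last-frame loop looks like; `generate_clip` and `vae` are hypothetical stand-ins for a Wan 2.2 I2V pipeline and its VAE, not real node or library names:

```python
import torch

def extend_video(first_frame: torch.Tensor, prompt: str, segments: int,
                 generate_clip, vae) -> torch.Tensor:
    """Chain I2V generations, seeding each segment with the previous last frame."""
    frames = [first_frame]
    seed_frame = first_frame
    for _ in range(segments):
        latents = generate_clip(image=seed_frame, prompt=prompt)  # latent video clip
        clip = vae.decode(latents)                                # (T, C, H, W) pixels
        frames.extend(clip.unbind(0))
        # The next segment is conditioned on a frame that has already been through
        # a VAE decode; re-encoding it compounds reconstruction error, which is
        # why quality drifts a little further with every segment.
        seed_frame = clip[-1]
    return torch.stack(frames)
```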

Without something like InfiniteTalk in 2.2, I'm finding it difficult to make any good extended video, which is a shame given all of 2.2's motion capabilities and LoRAs.

https://redd.it/1p187pr
@rStableDiffusion
Is it just me who gets this impression? Is SDXL better than Flux and Qwen for generating art like this? Is the problem the text encoder?
https://redd.it/1p1dsce
@rStableDiffusion
SDXL simple basic shapes prompt help.
https://redd.it/1p1g8i8
@rStableDiffusion
A 2-million-parameter denoiser model is all you need! (Source code + model + poison detector) Anti-Nightshade, Anti-Glaze

Wassup
Today, I’m going to show you a new model designed for image **depoisoning**.

I decided to build something fresh, and this time the focus was on efficiency: the new model is incredibly lightweight, clocking in at just **2 million parameters**.

In addition to the denoiser, I’ve also trained a separate AI "Detector" that can tell you whether an image has been poisoned or if it's clean.

A quick heads-up: neither model is magic. They can (and likely will) make mistakes, but I have done my best to minimize errors. Regarding the denoiser specifically, I feel the architecture is a solid improvement over my previous version.

# 1. The Denoiser Architecture

Unlike standard heavy U-Nets, this architecture is designed to be Bias-Free and highly responsive to the specific noise level of the image.

Here is how this works (a rough code sketch follows the list):

* **Gaussian Prior Extraction:** Before the network even starts processing, the model uses a `ResidualPriorExtractor`. It runs fixed Gaussian kernels over the image to separate high-frequency details (edges/noise) from the smooth background. This gives the model a "head start" by highlighting areas where poison usually hides.
* **Noise Conditioning:** The model isn't static. It uses a `NoiseConditioner` that takes a noise level (sigma) and a content descriptor. It projects these into an embedding that modulates the network layers. Essentially, the model adjusts its "aggression" based on how noisy the image is.
* **Bias-Free Design:** All convolutions in the network have `bias=False`. This forces the network to rely purely on the feature data and normalization (`LayerNorm2d`), which often leads to better generalization in restoration tasks.
* **Gated Residual Blocks:** The core building blocks use **Global Gating**. The network calculates a gating value (0 to 1) based on the global mean of the features, allowing it to selectively let information pass through or be suppressed.
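
To make the description concrete, here is a rough PyTorch sketch of the bias-free convolutions, `LayerNorm2d`, sigma-conditioned modulation, and global gating described above. The layer sizes and wiring are my guesses, not the released code, and the `ResidualPriorExtractor` is omitted for brevity:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.GroupNorm):
    """GroupNorm with a single group, a common stand-in for 2-D LayerNorm."""
    def __init__(self, channels: int):
        super().__init__(1, channels)

class GatedResidualBlock(nn.Module):
    """Bias-free residual block with sigma conditioning and global gating."""
    def __init__(self, ch: int, cond_dim: int):
        super().__init__()
        self.norm = LayerNorm2d(ch)
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)  # bias-free
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.cond = nn.Linear(cond_dim, ch)   # noise-level modulation
        self.gate = nn.Linear(ch, ch)         # global gate from feature means

    def forward(self, x, cond_emb):
        h = self.conv1(torch.relu(self.norm(x)))
        h = h * (1 + self.cond(cond_emb)[:, :, None, None])  # scale by sigma embedding
        h = self.conv2(torch.relu(h))
        g = torch.sigmoid(self.gate(h.mean(dim=(2, 3))))      # 0..1 per-channel gate
        return x + h * g[:, :, None, None]

class TinyDenoiser(nn.Module):
    def __init__(self, ch: int = 48, cond_dim: int = 64, depth: int = 6):
        super().__init__()
        self.inp = nn.Conv2d(3, ch, 3, padding=1, bias=False)
        self.noise_cond = nn.Sequential(nn.Linear(1, cond_dim), nn.SiLU(),
                                        nn.Linear(cond_dim, cond_dim))
        self.blocks = nn.ModuleList([GatedResidualBlock(ch, cond_dim) for _ in range(depth)])
        self.out = nn.Conv2d(ch, 3, 3, padding=1, bias=False)

    def forward(self, x, sigma):
        cond = self.noise_cond(sigma.view(-1, 1))  # embed the noise level
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h, cond)
        return x - self.out(h)  # predict the perturbation and subtract it
```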

# 2. The "Predictor" (Detector) Architecture

Why I call it the "Predictor" (for fun and giggles):

I named this model the Predictor because it doesn't just classify an image as "Bad" or "Good"—it simultaneously predicts the noise mask (where the poison is located).

This is a much more complex beast called **GhostResidualDecomposition-Net**. Here is how it achieves high accuracy (see the sketches after the list):

* **The Backbone (ResNet + SE):** The encoder uses Residual Blocks enhanced with **Squeeze-and-Excitation (SE) Blocks**. SE blocks allow the network to perform "channel attention"—learning which feature maps are important and weighing them higher.
* **ASPP (Atrous Spatial Pyramid Pooling):** Located at the bottleneck, this module looks at the image with different "zoom levels" (dilated convolutions). This captures context at multiple scales, ensuring the model sees both fine noise patterns and the global image structure.
* **Attention Gates in the Decoder:** When the network upsamples the image to reconstruct it, it uses **Attention Gates** on the skip connections. Instead of blindly copying features from the encoder, these gates filter the features to focus only on relevant regions (the poisoned pixels).
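
For readers unfamiliar with these components, here are compact, generic PyTorch versions of the three pieces named above (SE channel attention, ASPP, and a decoder attention gate). They are textbook implementations, not the author's GhostResidualDecomposition-Net code, and the attention gate assumes the decoder features have already been upsampled to the skip resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels using a learned global summary."""
    def __init__(self, ch: int, r: int = 8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze -> per-channel weights
        return x * w[:, :, None, None]       # excite

class ASPP(nn.Module):
    """Parallel dilated convolutions = context at several 'zoom levels'."""
    def __init__(self, ch: int, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class AttentionGate(nn.Module):
    """Filter encoder skip features so the decoder copies only relevant regions."""
    def __init__(self, ch: int):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch, 1)  # transforms the skip features
        self.phi = nn.Conv2d(ch, ch, 1)    # transforms the decoder (gating) features
        self.psi = nn.Conv2d(ch, 1, 1)     # collapses to a spatial attention map

    def forward(self, skip, gate):
        a = torch.sigmoid(self.psi(F.relu(self.theta(skip) + self.phi(gate))))
        return skip * a                    # suppress irrelevant pixels
```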

# 3. The "Ghost Loss" Function

[yeah, only 7 epochs. ](https://preview.redd.it/58zzm7qyd92g1.png?width=331&format=png&auto=webp&s=66692cfdf6f459c25024dd86a2a0a3456dbc2038)

To train the Predictor, I used a custom loss function I call **Ghost Loss** (very original). It ensures the model isn't just hallucinating a clean image. It combines four specific penalties (a sketch follows the list):

1. **Pixel-wise Noise Match:** Does the predicted noise mask match the real poison?
2. **Restoration Match (MSE):** If we subtract the mask, does the result look like the original clean image?
3. **Binary Classification (BCE):** Did it correctly flag the image as Poisoned/Safe?
4. **Semantic Anchor (Perceptual Loss):** This is the "Ghost" part. It runs the restored image through a frozen **VGG16** network to ensure the *features* (not just pixels) match.
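
A minimal sketch of how such a four-term loss could be wired up; the weights, the VGG16 layer cut-off, and the tensor names are placeholders rather than the author's actual values, and ImageNet normalization for the VGG input is omitted for brevity:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor for the perceptual ("ghost") term.
_vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def ghost_loss(pred_mask, pred_logit, noisy, clean, true_mask, is_poisoned,
               w=(1.0, 1.0, 0.5, 0.1)):
    """Combine the four penalties; `w` are placeholder weights."""
    restored = noisy - pred_mask
    l_noise   = F.mse_loss(pred_mask, true_mask)                         # 1. noise match
    l_restore = F.mse_loss(restored, clean)                              # 2. restoration match
    l_cls     = F.binary_cross_entropy_with_logits(pred_logit,
                                                   is_poisoned.float())  # 3. poisoned/safe flag
    l_percep  = F.mse_loss(_vgg(restored), _vgg(clean))                  # 4. semantic anchor
    return w[0] * l_noise + w[1] * l_restore + w[2] * l_cls + w[3] * l_percep
```
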
Nvidia sells an H100 for 10 times its manufacturing cost. Nvidia is the big villain company; it's because of them that large models like GPT-4 aren't available to run on consumer hardware. AI development will only advance when this company is dethroned.

Nvidia's profit margin on data center GPUs is really very high; the selling price is 7 to 10 times the manufacturing cost.

It would hypothetically be possible for these GPUs to be affordable for home consumers without Nvidia's inflated monopoly pricing!

This company is delaying the development of AI.

https://redd.it/1p1m5gl
@rStableDiffusion
Version 1.0: The Easiest Way to Train Wan 2.2 LoRAs (Under $5)

https://github.com/obsxrver/wan22-lora-training
If you’ve been wanting to train your own Wan 2.2 Video LoRAs but are intimidated by the hardware requirements, parameter tweaking insanity, or the installation nightmare—I built a solution that handles it all for you.

https://preview.redd.it/8avncmwwbb2g1.png?width=875&format=png&auto=webp&s=71f66d615d269a03af89744285543476c7ab880e

This is currently the easiest, fastest, and cheapest way to get a high-quality training run done.

Why this method?

* **Zero Setup:** No installing Python, CUDA, or hunting for dependencies. You launch a pre-built [Vast.AI](http://Vast.AI) template, and it's ready in minutes.
* **Full WebUI:** Drag-and-drop your videos/images, edit captions, and click "Start." No terminal commands required.
* **Extremely Cheap:** You can rent a dual RTX 5090 node, train a full LoRA in 2-3 hours, and auto-shutdown. Total cost is usually under $5.
* **Auto-Save:** It automatically uploads your finished LoRA to your cloud storage (Google Drive/S3/Dropbox) and kills the instance so you don't pay for a second longer than necessary.

How it works:

1. Click the Vast.AI template link (in the repo).
2. Open the WebUI in your browser.
3. Upload your dataset and press Train.
4. Come back in an hour to find your LoRA in your Google Drive.

It supports both Text-to-Video and Image-to-Video, and optimizes for dual-GPU setups (training High/Low noise simultaneously) to cut training time in half.
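
As a rough illustration of the dual-GPU idea (not the template's actual entry point): Wan 2.2 splits into high-noise and low-noise experts, so each training run can be pinned to its own GPU and executed in parallel. The `train_lora.py` script and its flags below are hypothetical:

```python
import os
import subprocess

jobs = []
for gpu, expert in ((0, "high_noise"), (1, "low_noise")):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))   # pin each run to one GPU
    jobs.append(subprocess.Popen(
        ["python", "train_lora.py", "--expert", expert, "--config", "wan22.toml"],
        env=env))
for job in jobs:
    job.wait()  # both experts finish in roughly the wall time of one
```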

Repo + Template Link:

https://github.com/obsxrver/wan22-lora-training

Let me know if you have questions.

https://redd.it/1p1puml
@rStableDiffusion
Brand NEW Meta SAM3 - now for ComfyUI!
https://redd.it/1p1xu20
@rStableDiffusion
A LoRA for transferring characters into scenes

https://preview.redd.it/csium62eye2g1.png?width=2217&format=png&auto=webp&s=f768ad1c26423cb63435f42aa904494aa8dcfe53

https://preview.redd.it/hq5g80ifye2g1.png?width=6509&format=png&auto=webp&s=d306d61880fb3ad31ee28656502938097a3dc20d

https://preview.redd.it/8bmhpf5gye2g1.png?width=6134&format=png&auto=webp&s=69629ea3f65beb4d59e4ab1532b9024de1b7213f

https://preview.redd.it/0lixjergye2g1.png?width=5727&format=png&auto=webp&s=b1cd9df101639a61bf93ce0a696fca11c28cd2b0

https://preview.redd.it/f3b8bhrgye2g1.png?width=2450&format=png&auto=webp&s=d84fdb2028527b833834a2d933e221203ae5ac20

https://preview.redd.it/wcwolqfhye2g1.png?width=3848&format=png&auto=webp&s=67704d46a0fc69706298d6a26426cc61f37387c4

I used Qwen image editing 2509 + the RoleScene Blend LoRA, and a 5090 completed the migration of these characters into the scene in about 30 seconds.

You can download the model here: https://civitai.com/models/2142049/rolescene-blend

Use the workflow I built here: https://www.runninghub.ai/post/1991385798813790209

You can register using my invitation link: https://www.runninghub.ai/?inviteCode=t0lfdxyz

Here is my teaching video, currently only in Chinese: https://www.bilibili.com/video/BV1afCfBFEJG/?spm_id_from=333.1387.homepage.video_card.click&vd_source=ae85ec1de21e4084d40c5d4eec667b8f

https://redd.it/1p233zo
@rStableDiffusion
Is InstantID + Canny still the best method in 2025 for generating consistent LoRA reference images?

Hey everyone,
I’m building a LoRA for a custom female character and I need around 10–20 consistent face images (different angles, lighting, expressions, etc.). I’m planning to use the InstantID + Canny ControlNet workflow in ComfyUI.

Before I finalize my setup, I want to ask:

1. Is InstantID + Canny still the most reliable method in 2025 for producing identity-consistent images for LoRA training?

2. Are there any improved workflows (InstantID + Depth, FaceID, or new consistency nodes) that give better results?

3. Does anyone have a ComfyUI graph or recommended settings they can share?

4. Anything I should avoid when generating reference shots (lighting, resolution, negative prompts, etc.)?

I’m aiming for high identity consistency (90%+), so any updated advice from 2025 users would really help.

Thanks!

https://redd.it/1p22zbb
@rStableDiffusion
How do I stop female characters from dancing and bouncing their boobs in WAN 2.2 video?

Every time I include a reference character of a woman, she just starts dancing and her boobs start bouncing for literally no reason. The prompt I used for one of the videos was "the woman pulls out a gun and aims at the man", but while aiming the gun she just started doing TikTok dances and furiously shaking her hips.

I included in the negative prompts "dancing, tiktok dances, shaking hips" etc... but it doesn't seem to be having any effect.

Edit: I'm using the Wan smooth mix checkpoint. Does that affect the motion that much? The characters only bounce and dance when they are 3D models; real women just follow the prompt.

https://redd.it/1p26ebl
@rStableDiffusion