Pony 7 weights released. Yet this image tells you everything about it
https://preview.redd.it/cpi9frjr0bxf1.png?width=1280&format=png&auto=webp&s=2ee6f038b91dd912cba295e024c7e21a65f46943
n3ko, 2girls, (yamato_\\(one piece\\)), (yae_miko), cat ears, pink makeup, tall, mature, seductive, standing, medium_hair, pink green glitter glossy sheer neck striped jumpsuit, lace-up straps, green_eyes, highres, absurdres, (flat colors:1.1), flat background
https://redd.it/1ofzf8n
@rStableDiffusion
FlashPack: High-throughput tensor loading for PyTorch
https://github.com/fal-ai/flashpack
FlashPack — a new, high-throughput file format and loading mechanism for PyTorch that makes model checkpoint I/O blazingly fast, even on systems without access to GPU Direct Storage (GDS).
With FlashPack, loading any model can be 3–6× faster than with current state-of-the-art methods like accelerate or the standard load_state_dict() and to() flow — all wrapped in a lightweight, pure-Python package that works anywhere.
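For context, here is a minimal sketch of the baseline flow the post refers to, i.e. the standard load_state_dict() and to() pattern that FlashPack claims to outperform. The model class and checkpoint path are illustrative placeholders, and FlashPack's own API is not shown here.

```python
# Baseline PyTorch checkpoint-loading flow (the "load_state_dict() and to()" path).
# "checkpoint.pt" and TinyModel are placeholders for your own checkpoint and architecture.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    def forward(self, x):
        return self.net(x)

model = TinyModel()

# 1) Read the serialized tensors from disk into CPU memory.
state_dict = torch.load("checkpoint.pt", map_location="cpu")

# 2) Copy the tensors into the model's parameters.
model.load_state_dict(state_dict)

# 3) Move the weights to the GPU (requires a CUDA device).
model = model.to("cuda")
```

FlashPack is described as replacing this disk-to-CPU-to-GPU path with its own file format and loader.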
https://redd.it/1og1toy
@rStableDiffusion
Automatically texturing a character with SDXL & ControlNet in Blender
https://redd.it/1og3u26
@rStableDiffusion
Transform Your Videos Using Wan 2.1 Ditto (Low Vram Workflow)
https://youtu.be/iuakm3YQYY8
https://redd.it/1oge209
@rStableDiffusion
In this tutorial I show how to edit your video using the new Ditto model, which lets you change the style of a video while keeping the poses and motion of the original consistent, without using any ControlNet such as depth, canny, or openpose…
What's the big deal about Chroma?
I am trying to understand why people are excited about Chroma. For photorealistic images I get malformed faces, generation takes too long, and the quality is only okay.
I use ComfyUI.
What is the use case of Chroma? Am I using it wrong?
https://redd.it/1ogbkm1
@rStableDiffusion
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing (a new open dataset by Apple)
https://github.com/apple/pico-banana-400k
https://redd.it/1ogg414
@rStableDiffusion
Genuine question, why is no one using Hunyuan video?
I'm seeing most people using WAN only. Also, LoRA support for Hunyuan I2V seems to be nonexistent?
I really would have tested both of them, but I doubt my PC can handle it. So are there specific reasons why WAN is so much more widely used and why there is barely any support for Hunyuan (I2V)?
https://redd.it/1oge14v
@rStableDiffusion
Beginner of a few weeks here. I always have trouble loading other users' workflows: there's always something missing, and I often have a hard time finding the missing nodes myself (I find some with ComfyUI Manager or a Google search, but sometimes not). Any tips from long-time users?
https://redd.it/1oghw3m
@rStableDiffusion
DGX Spark Benchmarks (Stable Diffusion edition)
tl;dr: The DGX Spark is around 3.1 times slower than an RTX 5090 for diffusion tasks.
I happened to procure a DGX Spark (Asus Ascent GX10 variant). This is a cheaper variant of the DGX Spark, costing \~US$3k; the price reduction was achieved by swapping the PCIe 5.0 4TB NVMe disk for a PCIe 4.0 1TB one.
Profiling this variant with llama.cpp shows that, despite the cost reduction, the GPU and memory bandwidth performance appears comparable [to the regular DGX Spark baseline](https://github.com/ggml-org/llama.cpp/discussions/16578).
./llama-bench -m ./gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 3639.61 ± 9.49 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 81.04 ± 0.49 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 3382.30 ± 6.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 74.66 ± 0.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 3140.84 ± 15.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 69.63 ± 2.31 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 2657.65 ± 6.55 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 65.39 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 2032.37 ± 9.45 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 57.06 ± 0.08 |
Now on to the benchmarks focusing on diffusion models. Because the DGX Spark is more compute-oriented, this is one of the few cases where it can have an advantage over competitors such as AMD's Strix Halo and Apple Silicon.
Involved systems:
* DGX Spark, 128GB coherent unified memory, Phison NVMe 1TB, DGX OS (6.11.0-1016-nvidia)
* AMD 5800X3D, 96GB DDR4, RTX5090, Samsung 870 QVO 4TB, Windows 11 24H2
Benchmarks were conducted using ComfyUI against the following models:
* Qwen Image Edit 2509 with 4-step LoRA (fp8\_e4m3n)
* Illustrious model (SDXL)
* SD3.5 Large (fp8\_scaled)
* WAN 2.2 T2V with 4-step LoRA (fp8\_scaled)
All tests were done using the workflow templates available directly from ComfyUI, except for the Illustrious model which was a random model I took from civitai for "research" purposes.
**ComfyUI Setup**
* DGX Spark: Using v0.3.66. Flags: --use-flash-attention --highvram
* RTX 5090: Using v0.3.66, Windows build. Default settings.
**Render Duration (First Run)**
During the first execution, the model is not yet cached in memory, so it needs to be loaded from disk. Here the Asus Ascent's significantly slower disk may affect model load time, so the actual retail DGX Spark is expected to be faster in this regard.
The following chart illustrates the time taken in seconds to complete a batch size of 1.
[Render duration in seconds \(lower is better\)](https://preview.redd.it/jvg50yhy3gxf1.png?width=600&format=png&auto=webp&s=0fb3bf71073a362921b1ec1ef3eace36950f9412)
For first-time renders, the gap between the systems is also influenced by disk speed. For the particular systems I have, the disks are not especially fast, and I'm sure other enthusiasts can load models a lot faster.
**Render Duration (Subsequent Runs)**
After the model is cached in memory, subsequent passes are significantly faster. Note that for the DGX Spark, \`--highvram\` should be set to maximize the use of the coherent memory and to increase the likelihood of keeping the model resident. For some models (especially Qwen Image Edit), omitting this flag on the DGX Spark can result in significantly poorer performance on subsequent runs.
The following chart illustrates the time taken in seconds to complete a batch size of 1. Multiple passes were conducted until a steady state was reached.
[Render duration in seconds \(lower is better\)](https://preview.redd.it/llc7b0h84gxf1.png?width=600&format=png&auto=webp&s=65fbe1ae55cc7917d87b02fb9b4c41bbe25c69c1)
We can also infer the relative GPU compute performance of the two systems from the iteration speed:
[Iterations per second \(higher is better\)](https://preview.redd.it/7vn0vz4g4gxf1.png?width=600&format=png&auto=webp&s=904264194ced1f87cb4c152797c595d4e92bbbf0)
Overall we can infer that:
* The DGX Spark render duration is around 3.06 times slower, and the gap widens when using larger models
* The RTX 5090 compute performance is around 3.18 times faster
While the DGX Spark is not as fast as the Blackwell desktop GPU, its diffusion performance is close to that of an RTX 3090, while having access to a much larger amount of memory.
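As a rough illustration of how the two ratios above are computed (the duration ratio divides the DGX Spark's render time by the RTX 5090's, while the compute ratio divides the RTX 5090's iteration rate by the DGX Spark's), here is a minimal sketch using made-up placeholder numbers rather than the measured values from the charts:

```python
# Hypothetical numbers for illustration only; the real measurements are in the charts above.
spark_seconds, rtx5090_seconds = 61.2, 20.0   # render duration per image (s)
spark_its, rtx5090_its = 1.1, 3.5             # sampler iterations per second

duration_ratio = spark_seconds / rtx5090_seconds   # "DGX Spark is ~X times slower"
compute_ratio = rtx5090_its / spark_its            # "RTX 5090 compute is ~Y times faster"

print(f"DGX Spark render duration: {duration_ratio:.2f}x slower")
print(f"RTX 5090 compute: {compute_ratio:.2f}x faster")
```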
**Notes**
* This is not a sponsored review; I paid for it with my own money.
* I do not have a second DGX Spark to try NCCL with, because the shop where I bought the DGX Spark no longer has any in stock. Otherwise I would probably be toying with Hunyuan Image 3.0.
* I do not have access to a Strix Halo machine, so don't ask me to compare it with that.
* I do have an M4 Max MacBook, but I gave up after waiting 10 minutes for some of the larger models.
https://redd.it/1ogjjlj
@rStableDiffusion
Wan2.1 Mocha Video Character One-Click Replacement
https://reddit.com/link/1ogkacm/video/5banxduzggxf1/player
Workflow download:
https://civitai.com/models/2075972?modelVersionId=2348984
Project address: https://orange-3dv-team.github.io/MoCha/
Controllable video character replacement with a user-provided one remains a challenging problem due to the lack of qualified paired-video data. Prior works have predominantly adopted a reconstruction-based paradigm reliant on per-frame masks and explicit structural guidance (e.g., pose, depth). This reliance, however, renders them fragile in complex scenarios involving occlusions, rare poses, character-object interactions, or complex illumination, often resulting in visual artifacts and temporal discontinuities. In this paper, we propose MoCha, a novel framework that bypasses these limitations, which requires only a single first-frame mask and re-renders the character by unifying different conditions into a single token stream. Further, MoCha adopts a condition-aware RoPE to support multi-reference images and variable-length video generation. To overcome the data bottleneck, we construct a comprehensive data synthesis pipeline to collect qualified paired-training videos. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches.
https://redd.it/1ogkacm
@rStableDiffusion