[Release] New ComfyUI node – Step Audio EditX TTS
🎙️ ComfyUI-Step\_Audio\_EditX\_TTS: Zero-Shot Voice Cloning + Advanced Audio Editing
**TL;DR:** Clone any voice from 3-30 seconds of audio, then edit emotion, style, speed, and add effects—all while preserving voice identity. State-of-the-art quality, now in ComfyUI.
Currently recommended: 10-18 GB VRAM
[GitHub](https://github.com/Saganaki22/ComfyUI-Step_Audio_EditX_TTS) | [HF Model](https://huggingface.co/stepfun-ai/Step-Audio-EditX) | [Demo](https://stepaudiollm.github.io/step-audio-editx/) | [HF Spaces](https://huggingface.co/spaces/stepfun-ai/Step-Audio-EditX)
\---
This one brings Step Audio EditX to ComfyUI – state-of-the-art zero-shot voice cloning and audio editing. Unlike typical TTS nodes, this gives you two specialized nodes for different workflows:
[Clone on the left, Edit on the right](https://preview.redd.it/p33fzzhrzh0g1.png?width=1331&format=png&auto=webp&s=c5db8c5950bacd3b1ae91050bb26de52bb29b30c)
# What it does:
**🎤 Clone Node** – Zero-shot voice cloning from just 3-30 seconds of reference audio
* Feed it any voice sample + text transcript
* Generate unlimited new speech in that exact voice
* Smart longform chunking for texts over 2000 words (auto-splits and stitches seamlessly)
* Perfect for character voices, narration, voiceovers
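For a rough idea of what longform chunking can look like, here is a minimal sketch that splits text at sentence boundaries under a word budget. It is not the node's actual implementation: the function name `split_into_chunks`, the `max_words` parameter, and the sentence-splitting rule are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_words: int = 2000) -> list[str]:
    """Split text at sentence boundaries so each chunk stays within max_words.

    A single sentence longer than max_words is kept whole rather than cut
    mid-sentence. Hypothetical helper, not part of the node's API.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be synthesized separately and the audio segments concatenated in order. Splitting on sentence boundaries keeps the joins at natural pauses.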
**🎭 Edit Node** – Advanced audio editing while preserving voice identity
* **Emotions**: happy, sad, angry, excited, calm, fearful, surprised, disgusted
* **Styles**: whisper, gentle, serious, casual, formal, friendly
* **Speed control**: faster/slower with multiple levels
* **Paralinguistic effects**: `[Laughter]`, `[Breathing]`, `[Sigh]`, `[Gasp]`, `[Cough]`
* **Denoising**: clean up background noise or remove silence
* Multi-iteration editing for stronger effects (1=subtle, 5=extreme)
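If you script your prompts, a small pre-check can catch misspelled effect tags before a run. The sketch below assumes the tags are case-sensitive and limited to the five listed above; `SUPPORTED_TAGS` and `find_unsupported_tags` are hypothetical names, and the node itself may accept more tags than this.

```python
import re

# Tags as listed in the post; the node may accept others (assumption).
SUPPORTED_TAGS = {"Laughter", "Breathing", "Sigh", "Gasp", "Cough"}


def find_unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags not in SUPPORTED_TAGS, in order of appearance."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in tags if tag not in SUPPORTED_TAGS]
```

An empty result means every bracketed tag in the text matches one of the listed effects.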
[voice clone + denoise & edit style exaggerated 1 iteration / float32](https://reddit.com/link/1otsbfb/video/m1c8m1nd5i0g1/player)
[voice clone + edit emotion admiration 1 iteration / float32](https://reddit.com/link/1otsbfb/video/dczqvi6vai0g1/player)
# Performance notes:
* Getting solid results on RTX 4090 with bfloat16 (\~11-14GB VRAM for clone, \~14-18GB for edit)
* Current quantization support (int8/int4) available but with quality trade-offs
* **Note: We're waiting on the Step AI research team to release official optimized quantized models for better lower-VRAM performance – will implement them as soon as they drop!**
* Multiple attention mechanisms (SDPA, Eager, Flash Attention, Sage Attention)
* Optional VRAM management – keeps model loaded for speed or unloads to free memory
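To see why precision and quantization matter for VRAM, a back-of-the-envelope estimate of the weight memory alone can help. This is only a sketch: the parameter count in the usage note is a placeholder rather than a figure from the post, and real usage (like the numbers above) is higher because activations, caches, and any auxiliary models also take memory.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB (excludes activations and caches)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3
```

For example, `weight_memory_gib(3e9, "bfloat16")` estimates the weights of a hypothetical 3-billion-parameter model; switching to `"int8"` halves that figure, which is the appeal of quantization despite the quality trade-offs.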
# Quick setup:
* Install via ComfyUI Manager (search "Step Audio EditX TTS") or manually clone the repo
* Download **both** Step-Audio-EditX and Step-Audio-Tokenizer from HuggingFace
* Place them in `ComfyUI/models/Step-Audio-EditX/`
* Full folder structure and troubleshooting in the README
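A quick way to confirm the models landed in the right place is a small folder check. This sketch assumes each download sits in a subfolder named after its Hugging Face repository; that layout and the helper name `missing_model_dirs` are assumptions, so treat the README as the source of truth for the exact structure.

```python
from pathlib import Path

# Folder names assumed to match the Hugging Face repository names.
REQUIRED_MODEL_DIRS = ("Step-Audio-EditX", "Step-Audio-Tokenizer")


def missing_model_dirs(comfyui_root: str) -> list[str]:
    """Return required model folders not found under models/Step-Audio-EditX/."""
    base = Path(comfyui_root) / "models" / "Step-Audio-EditX"
    return [name for name in REQUIRED_MODEL_DIRS if not (base / name).is_dir()]
```

Calling `missing_model_dirs("ComfyUI")` from the directory that contains your ComfyUI install returns an empty list when both folders are present.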
# Workflow ideas:
* Clone any voice → edit emotion/style for character variations
* Clean up noisy recordings with denoise mode
* Speed up/slow down existing audio without pitch shift
* Add natural-sounding paralinguistic effects to generated speech
[Advanced workflow with Whisper / transcription, clone + edit](https://preview.redd.it/wkc39r900i0g1.png?width=1379&format=png&auto=webp&s=557b8a0893fcbbb58dd957c299d8a3f8d6bed8e9)
The README has full parameter guides, VRAM recommendations, example settings, and troubleshooting tips. Works with all ComfyUI audio nodes.
If you find it useful, drop a ⭐ on GitHub
https://redd.it/1otsbfb
@rStableDiffusion
Best service to rent a GPU and run ComfyUI and other stuff for making LoRAs and image/video generation?
I’m looking for recommendations on the best GPU rental services. Ideally, I need something that charges only for actual compute time, not for every minute the GPU is connected.
Here’s my situation: I work on two PCs, and often I’ll set up a generation task, leave it running for a while, and come back later. So if the generation itself takes 1 hour and then the GPU sits idle for another hour, I don’t want to get billed for 2 hours of usage — just the 1 hour of actual compute time.
Does anyone know of any GPU rental services that work this way? Or at least something close to that model?
https://redd.it/1ou3g8v
@rStableDiffusion
Why are there no 4-step LoRAs for Chroma?
Schnell (which Chroma is based on) is a fast 4-step model, and Flux Dev has multiple 4-8 step LoRAs available. Wan and Qwen also have 4-step LoRAs. The currently available flash LoRAs for Chroma are made by one person and, as far as I know, are just extractions from Chroma Flash models (although there is barely any info on this). So how come nobody else has made a faster lightning LoRA for Chroma?
Both the Chroma Flash model and the flash LoRAs barely speed up generation: they need at least 16 steps and work best with 20-24 steps (or sometimes more), which at that point is just regular generation time. However, for some reason they usually make outputs more stable and better (very good for art specifically).
So is there some kind of architectural difficulty with Chroma that makes it impossible to speed it up more? That would be weird since it is basically Flux.
https://redd.it/1ou4ynv
@rStableDiffusion
"Nowhere to go" Short Film (Wan22 I2V ComfyUI)
https://youtu.be/2CACps38HQI
https://redd.it/1oua5v2
@rStableDiffusion
YouTube
174 | "Nowhere to go" | Short Film (Wan22 I2V ComfyUI) [4K]
"Nowhere to go"
Inputs - SDXL
Video - Wan 2.2 14b I2V (First-to-last frame interpolation) via ComfyUI
100% AI Generated with local open source models
____________________________________________
Let me know your feedback in the comments, also consider…
@ Heavy users, professionals and others w/ a focus on consistent generation: How do you deal with the high frequency of new model releases?
* Do you test every supposedly ‘better’ model to see if it works for your purposes?
* If so, how much time do you invest in testing/evaluating?
* Or do you stick to a model and get the best out of it?
https://redd.it/1ouajdf
@rStableDiffusion
"Nowhere to go" Short Film (Wan22 I2V ComfyUI)
https://youtu.be/2CACps38HQI
https://redd.it/1oua616
@rStableDiffusion
Is an RTX 5090 necessary for the newest and most advanced AI video models? Is it normal for RTX GPUs to be so expensive in Europe? If video models continue to advance, will more GB of VRAM be needed? What will happen if GPU prices continue to rise? Is AMD behind NVIDIA?
https://redd.it/1oufag3
@rStableDiffusion
ComfyUI on new AMD GPU - today and future
Hi, I want to get more invested in AI generation and also lora training. I have some experience with comfy from work, but would like to dig deeper at home.
Since NVidia GPUs with 24GB are above my budget, I am curious about the AMD Radeon AI PRO R9700.
I know that AMD was said to be no good for ComfyUI. Has this changed? I read about PyTorch support and things like ROCm etc., but to be honest I don't know how that affects workflows in practice. Does this mean that I will be able to do everything that I would be able to do with NVidia? I have no background in engineering whatsoever, so I would have a hard time finding workarounds and stuff. But is this even the case with the new GPUs from AMD?
Would be grateful for any help!
https://redd.it/1ouhneo
@rStableDiffusion
Sharing the winners of the first Arca Gidan Prize. All made with open models + most shared the workflows and LoRAs they used. Amazing to see what a solo artist can do in a week (but we'll give more time for the next edition!)
Link here. Congrats to prize recipients and all who participated! I'll share details on the next one here + on our discord if you're interested.
https://redd.it/1oujqlj
@rStableDiffusion
The Arca Gidan Prize
The Arca Gidan Prize - Nov 2025 Submissions
An award for those who push open source AI art models to their artistic limits.
What's the best Wan checkpoint/LoRA/finetune to animate cartoon and anime?
https://redd.it/1oukz3d
@rStableDiffusion
FIBO by BRIAAI: a text-to-image model trained on long structured captions. Allows iterative editing of images.
https://redd.it/1oumkt0
@rStableDiffusion