Update — FP4 Infrastructure Verified (Oct 31 2025)

Quick follow-up to my previous post about running SageAttention 3 on an RTX 5080 (Blackwell) under WSL2 + CUDA 13.0 + PyTorch 2.10 nightly.

After digging into the internal API, I confirmed that the hidden FP4 quantization hooks (`scale_and_quant_fp4`, `enable_blockscaled_fp4_attn`, etc.) are fully wired up at the Python level, even though the low-level CUDA kernels are not yet active.
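For anyone who wants to check their own build, a quick probe along these lines is enough (the module name and attribute spellings are assumptions on my part; adjust them to whatever your install actually exports):

```python
import importlib

# Candidate FP4 entry points to look for.  The spellings below are my
# assumption; change them to match the symbols your build exposes.
CANDIDATE_HOOKS = ("scale_and_quant_fp4", "enable_blockscaled_fp4_attn")

def probe_fp4_hooks(module_name: str = "sageattention") -> dict:
    """Return the candidate FP4 hooks actually exposed by the module."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return {}
    return {name: getattr(mod, name) for name in CANDIDATE_HOOKS if hasattr(mod, name)}

if __name__ == "__main__":
    hooks = probe_fp4_hooks()
    print("FP4 hooks found:", ", ".join(hooks) if hooks else "none")
```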

I built an experimental FP4 quantization layer and integrated it directly into `nodes_model_loading.py`.
The system initializes correctly, runs on Blackwell, and logs the tensor output and VRAM profile with the FP4 hooks active.
However, true FP4 compute isn’t yet functional, as the CUDA backend still defaults to FP8/FP16 paths.
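Conceptually the experimental layer is just per-block scale-and-quantize onto the FP4 (E2M1) grid, emulated in full precision because the real kernels can't be reached yet. A rough sketch of the idea, not the exact code from the fork (block size and names are placeholders):

```python
import torch

# E2M1 (FP4) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.  The grid below is
# the symmetric set of representable values; quantization is emulated in
# full precision since the Blackwell FP4 kernels are not yet reachable.
_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-_POS.flip(0)[:-1], _POS])

def fake_quant_fp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulate block-scaled FP4: scale each block onto the grid, snap, rescale."""
    assert x.numel() % block == 0, "pad the tensor so it divides into blocks"
    grid = FP4_GRID.to(x.device)
    flat = x.float().reshape(-1, block)
    # One scale per block, mapping the block max onto the largest FP4 value (6).
    scale = flat.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12
    # Snap every element to the nearest grid point (memory-heavy but simple).
    idx = (flat / scale).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    return (grid[idx] * scale).reshape(x.shape).to(x.dtype)
```

In the experimental node the output of this step still feeds the existing BF16/FP8 attention path, which is why it counts as a fallback rather than true FP4 compute.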


---

Proof of Execution

```
attention mode override: sageattn3
FP4 quantization applied to transformer
FP4 API fallback to BF16/FP8 pipeline
Max allocated memory: 9.95 GB
Prompt executed in 341.08 seconds
```
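For reference, the memory and timing figures come from PyTorch's standard CUDA counters, gathered roughly like this (a hypothetical helper, not the exact logging code):

```python
import time
import torch

def run_and_profile(fn, *args, **kwargs):
    """Time a callable and report peak CUDA memory for the run."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"Max allocated memory: {peak_gb:.2f} GB")
    print(f"Prompt executed in {elapsed:.2f} seconds")
    return result
```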


---

Next Steps

- Wait for full NV-FP4 exposure in future CUDA / PyTorch releases
- Continue testing with non-quantized WAN 2.2 models
- Publish an FP4-ready fork once reproducibility is verified


Full build logs and technical details are on GitHub:
Repository: github.com/k1n0F/sageattention3-blackwell-wsl2

https://redd.it/1oktwaz