Last week in Image & Video Generation
I curate a weekly newsletter on multimodal AI. Here are the image & video generation highlights from this week:
**One Attention Layer is Enough (Apple)**
* Apple shows that a single attention layer is enough to adapt pretrained vision features into a SOTA image generator.
* Dramatically simplifies diffusion architecture without sacrificing quality.
* [Paper](https://arxiv.org/abs/2512.07829)
https://preview.redd.it/ggv1v459qb7g1.jpg?width=2294&format=pjpg&auto=webp&s=7c830bb9a64cfeddf7442910e7eef6c6dff935e1
**DMVAE - Reference-Matching VAE**
* Matches latent distributions to any reference for controlled generation.
* Achieves state-of-the-art synthesis with fewer training epochs.
* [Paper](https://huggingface.co/papers/2512.07778) | [Model](https://huggingface.co/sen-ye/dmvae/tree/main)
https://preview.redd.it/ve5tk92aqb7g1.jpg?width=692&format=pjpg&auto=webp&s=6e1edf72b4f45677759b78d7d9e73cd59aef20d2
**Qwen-Image-i2L - Image to Custom LoRA**
* First open-source tool converting single images into custom LoRAs.
* Enables personalized generation from minimal input.
* [ModelScope](https://modelscope.cn/models/DiffSynth-Studio/Qwen-Image-i2L/summary) | [Code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/qwen_image/model_inference_low_vram/Qwen-Image-i2L.py)
https://preview.redd.it/or5kkkhgqb7g1.jpg?width=1640&format=pjpg&auto=webp&s=dc88bd866947cf89a3a564832dfbae4253e5638b
**RealGen - Photorealistic Generation**
* Uses detector-guided rewards to improve text-to-image photorealism.
* Optimizes for perceptual realism beyond standard training.
* [Website](https://yejy53.github.io/RealGen/) | [Paper](https://huggingface.co/papers/2512.00473) | [GitHub](https://github.com/yejy53/RealGen?tab=readme-ov-file) | [Models](https://huggingface.co/lokiz666/Realgen-detection-models)
https://preview.redd.it/wpnnvh6iqb7g1.jpg?width=1200&format=pjpg&auto=webp&s=ae33b572b90d969db7655bb4dc948117149867a4
**Qwen 360 Diffusion - 360° Text-to-Image**
* State-of-the-art text-to-360° image generation.
* Best-in-class immersive content creation.
* [Hugging Face](https://huggingface.co/ProGamerGov/qwen-360-diffusion) | [Viewer](https://progamergov.github.io/html-360-viewer/)
**Shots - Cinematic Multi-Angle Generation**
* Generates 9 cinematic camera angles from one image with consistency.
* Perfect visual coherence across different viewpoints.
* [Post](https://x.com/higgsfield_ai/status/1998895357707825503?s=20)
https://reddit.com/link/1pn1xym/video/2floylaoqb7g1/player
**Nano Banana Pro Solution (ComfyUI)**
* Efficient workflow generating 9 distinct 1K images from 1 prompt.
* ~3 cents per image with improved speed.
* [Post](https://x.com/hellorob/status/1999537115168636963?s=42)
https://reddit.com/link/1pn1xym/video/g8hk35mpqb7g1/player
Check out the [full newsletter](https://open.substack.com/pub/thelivingedge/p/last-week-in-multimodal-ai-37-less?utm_campaign=post-expanded-share&utm_medium=web) for more demos, papers, and resources (I couldn't include all the images/videos due to Reddit's limit).
https://redd.it/1pn1xym
@rStableDiffusion
My LoRA "PONGO" is available on CivitAI - Link in the first comment
https://redd.it/1pmzw3x
@rStableDiffusion
🚀 ⚡ Z-Image-Turbo-Boosted 🔥 — One-Click Ultra-Clean Images (SeedVR2 + FlashVSR + Face Upscale + Qwen-VL)
https://redd.it/1pn4ztg
@rStableDiffusion