✨Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
📝 Summary:
Kandinsky 5.0 is a family of state-of-the-art foundation models for high-resolution image and video generation. It includes Lite and Pro versions with varying parameter counts and uses advanced training techniques for superior quality and speed. This publicly available framework aims to advance generat...
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14993
• PDF: https://arxiv.org/pdf/2511.14993
• Project Page: https://kandinskylab.ai/
• Github: https://github.com/kandinskylab/kandinsky-5
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#FoundationModels #ImageGeneration #VideoGeneration #AI #DeepLearning
✨Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
📝 Summary:
Researchers introduce Instruction-Guided Lesion Segmentation (ILS) for chest X-rays (CXRs), enabling segmentation of diverse lesions from simple text instructions. They developed MIMIC-ILS, a large-scale automatically generated dataset, and ROSALIA, a vision-language model. ROSALIA accurately segments various lesions and provides textual explanations.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15186
• PDF: https://arxiv.org/pdf/2511.15186
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#MedicalAI #LesionSegmentation #ChestXray #VisionLanguageModel #DeepLearning
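💡 A minimal sketch of what an instruction-guided segmentation interface could look like; the ILSOutput fields and the call signature are assumptions for illustration, not ROSALIA's actual API:
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ILSOutput:
    mask: np.ndarray   # binary lesion mask, same spatial size as the CXR
    rationale: str     # textual explanation of the finding

def segment_with_instruction(model, cxr: np.ndarray, instruction: str) -> ILSOutput:
    """Instruction-guided inference: one model, many lesion types,
    selected purely by the text instruction (hypothetical interface)."""
    return model(cxr, instruction)

# Stub model so the sketch runs end to end.
dummy = lambda img, text: ILSOutput(np.zeros(img.shape[:2], bool), f"no {text} found")
out = segment_with_instruction(dummy, np.zeros((512, 512)), "pleural effusion")
print(out.rationale)
```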
✨VisPlay: Self-Evolving Vision-Language Models from Images
📝 Summary:
VisPlay is a self-evolving RL framework that improves Vision-Language Models using unlabeled images. It employs interacting Questioner and Reasoner roles, trained with GRPO (Group Relative Policy Optimization), to enhance reasoning and generalization and to reduce hallucination. This scalable method achieves consistent improvements.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15661
• PDF: https://arxiv.org/pdf/2511.15661
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VisionLanguageModels #ReinforcementLearning #ArtificialIntelligence #MachineLearning #SelfEvolvingAI
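💡 Since VisPlay trains its Questioner and Reasoner with GRPO, here is a minimal sketch of the group-relative advantage at GRPO's core; the toy rewards stand in for whatever scoring the rollouts actually receive:
```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO normalizes each rollout's reward against its sampling
    group, removing the need for a learned value critic."""
    # rewards: (num_prompts, group_size), one group of rollouts per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Toy example: 2 Questioner-generated prompts, 4 Reasoner rollouts each.
rewards = torch.tensor([[0.1, 0.7, 0.3, 0.9],
                        [0.0, 0.2, 0.8, 0.4]])
print(grpo_advantages(rewards))
```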
✨ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
📝 Summary:
ARC-Chapter is a large-scale video chaptering model trained on millions of long video chapters, using a new bilingual and hierarchical dataset. It introduces a novel evaluation metric, GRACE, to better reflect real-world chaptering. The model achieves state-of-the-art performance and demonstrates...
🔹 Publication Date: Published on Nov 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14349
• PDF: https://arxiv.org/pdf/2511.14349
• Project Page: https://arcchapter.github.io/index_en.html
• Github: https://github.com/TencentARC/ARC-Chapter
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoChaptering #AI #MachineLearning #VideoSummarization #ComputerVision
✨Aligning Generative Music AI with Human Preferences: Methods and Challenges
📝 Summary:
This paper proposes applying preference-alignment techniques to music AI so outputs better match human preferences. It discusses methods such as MusicRL and DiffRhythm+ that address challenges unique to music, like temporal coherence and harmonic consistency, aiming for improved interactive composition and personali...
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15038
• PDF: https://arxiv.org/pdf/2511.15038
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#GenerativeAI #MusicAI #PreferenceAlignment #AIResearch #ComputationalMusic
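💡 As a taste of the preference-alignment family this paper surveys, here is a generic DPO-style loss on chosen/rejected pairs; this is illustrative, not the actual objective of MusicRL or DiffRhythm+:
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization: push the policy to prefer the
    chosen sample over the rejected one, relative to a frozen reference."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy log-probabilities for one preference pair (e.g., two music clips).
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss)
```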
✨Medal S: Spatio-Textual Prompt Model for Medical Segmentation
📝 Summary:
Medal S is a medical segmentation foundation model using spatio-textual prompts for efficient, high-accuracy multi-class segmentation across diverse modalities. It uniquely aligns volumetric prompts with text embeddings and processes masks in parallel, significantly outperforming prior methods.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13001
• PDF: https://arxiv.org/pdf/2511.13001
• Github: https://github.com/yinghemedical/Medal-S
🔹 Models citing this paper:
• https://huggingface.co/spc819/Medal-S-V1.0
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#MedicalSegmentation #FoundationModels #AI #DeepLearning #ComputerVision
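💡 A rough sketch of how a volumetric spatial prompt might be fused with a text embedding to score voxels; the shapes and the additive-prior fusion rule are assumptions, not the released architecture:
```python
import torch

def spatio_textual_logits(voxel_feats, text_emb, spatial_prompt):
    """Score each voxel against a class text embedding, biased by a
    volumetric spatial prompt (e.g., a coarse region-of-interest mask)."""
    # voxel_feats: (D, H, W, C), text_emb: (C,), spatial_prompt: (D, H, W)
    sim = torch.einsum('dhwc,c->dhw', voxel_feats, text_emb)
    return sim + spatial_prompt  # prompt acts as an additive spatial prior

D, H, W, C = 4, 8, 8, 16
logits = spatio_textual_logits(torch.randn(D, H, W, C),
                               torch.randn(C),
                               torch.zeros(D, H, W))
mask = logits.sigmoid() > 0.5  # one class; parallel classes would batch this
print(mask.shape)
```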
✨OmniParser for Pure Vision Based GUI Agent
📝 Summary:
OmniParser enhances GPT-4V's ability to act as a GUI agent by improving screen parsing. It identifies interactable icons and understands element semantics using specialized models. This significantly boosts GPT-4V's performance on benchmarks like ScreenSpot, Mind2Web, and AITW.
🔹 Publication Date: Published on Aug 1, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2408.00203
• PDF: https://arxiv.org/pdf/2408.00203
• Github: https://github.com/microsoft/omniparser
🔹 Models citing this paper:
• https://huggingface.co/microsoft/OmniParser
• https://huggingface.co/microsoft/OmniParser-v2.0
• https://huggingface.co/banao-tech/OmniParser
✨ Datasets citing this paper:
• https://huggingface.co/datasets/mlfoundations/Click-100k
✨ Spaces citing this paper:
• https://huggingface.co/spaces/callmeumer/OmniParser-v2
• https://huggingface.co/spaces/nofl/OmniParser-v2
• https://huggingface.co/spaces/SheldonLe/OmniParser-v2
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#GUIagents #ComputerVision #GPT4V #AIagents #DeepLearning
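💡 The core idea is a two-stage parser: detect interactable regions, caption their function, then hand the agent a structured element list instead of raw pixels. A hypothetical sketch (the detector and captioner outputs are stubbed):
```python
from dataclasses import dataclass

@dataclass
class UIElement:
    box: tuple          # (x1, y1, x2, y2) in pixels
    interactable: bool  # from an icon-detection model
    caption: str        # from a functional-semantics captioner

def parse_screen(detections) -> str:
    """Serialize detected UI elements into structured text a
    multimodal agent such as GPT-4V can reason over."""
    lines = []
    for i, el in enumerate(detections):
        kind = "button" if el.interactable else "static"
        lines.append(f"[{i}] {kind} at {el.box}: {el.caption}")
    return "\n".join(lines)

demo = [UIElement((10, 10, 90, 40), True, "search field"),
        UIElement((10, 60, 200, 80), False, "results header")]
print(parse_screen(demo))  # this string goes into the agent prompt
```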
✨Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
📝 Summary:
MoS is a novel multimodal diffusion model that uses a learnable token-wise router for flexible state-based modality interactions. This achieves state-of-the-art text-to-image generation and editing with minimal parameters and computational overhead.
🔹 Publication Date: Published on Nov 15
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.12207
• PDF: https://arxiv.org/pdf/2511.12207
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#GenerativeAI #MultimodalAI #DiffusionModels #TextToImage #DeepLearning
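💡 A minimal sketch of a learnable token-wise router mixing candidate states per token; the dimensions and softmax-mixing rule are illustrative assumptions rather than MoS's exact design:
```python
import torch
import torch.nn as nn

class TokenStateRouter(nn.Module):
    """Per-token router: mixes a small set of candidate hidden states
    (e.g., text vs. image pathway states) with learned weights."""
    def __init__(self, dim, num_states):
        super().__init__()
        self.gate = nn.Linear(dim, num_states)

    def forward(self, tokens, states):
        # tokens: (B, T, C); states: (B, T, S, C) candidate states per token
        weights = self.gate(tokens).softmax(dim=-1)           # (B, T, S)
        return torch.einsum('bts,btsc->btc', weights, states)

router = TokenStateRouter(dim=32, num_states=3)
out = router(torch.randn(2, 5, 32), torch.randn(2, 5, 3, 32))
print(out.shape)  # (2, 5, 32)
```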
✨What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
📝 Summary:
Ideation diversity significantly enhances AI research agent performance. Higher ideation diversity leads to stronger results on the MLE-bench benchmark across different models and scaffolds. This finding holds across various performance metrics.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15593
• PDF: https://arxiv.org/pdf/2511.15593
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#AIResearch #IdeationDiversity #MachineLearning #AIagents #AIPerformance
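💡 One common way to quantify ideation diversity is the mean pairwise distance between idea embeddings; this metric is an assumption for illustration, not necessarily the one used in the paper:
```python
import numpy as np

def ideation_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between idea embeddings;
    higher means the agent's proposed ideas are more spread out."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarity
    return float(1.0 - off_diag.mean())

ideas = np.random.randn(6, 128)  # e.g., sentence embeddings of 6 ideas
print(ideation_diversity(ideas))
```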
✨V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
📝 Summary:
V-ReasonBench is a new benchmark for evaluating generative video models' reasoning across structured problem-solving, spatial cognition, pattern inference, and physical dynamics. It uses diverse tasks to reveal dimension-wise differences between models, aiming to support the development of human-aligned reasoning.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16668
• PDF: https://arxiv.org/pdf/2511.16668
• Project Page: https://oahzxl.github.io/VReasonBench/
• Github: https://github.com/yangluo7/V-ReasonBench
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoGeneration #AIReasoning #GenerativeAI #Benchmarking #MachineLearning
✨Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning (Joint-GRPO) to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
✨Scaling Spatial Intelligence with Multimodal Foundation Models
📝 Summary:
SenseNova-SI is a newly scaled multimodal foundation model that achieves superior spatial intelligence. Trained on 8 million diverse data samples, it delivers unprecedented performance on a range of spatial benchmarks. The models are publicly released to foster further research.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13719
• PDF: https://arxiv.org/pdf/2511.13719
• Project Page: https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B
• Github: https://github.com/OpenSenseNova/SenseNova-SI
🔹 Models citing this paper:
• https://huggingface.co/sensenova/SenseNova-SI-InternVL3-8B
• https://huggingface.co/sensenova/SenseNova-SI-InternVL3-2B
• https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#MultimodalAI #FoundationModels #SpatialIntelligence #ComputerVision #AI
✨Step-Audio-R1 Technical Report
📝 Summary:
Step-Audio-R1 is the first audio reasoning model. It uses Modality-Grounded Reasoning Distillation to achieve strong audio reasoning, outperforming previous models. This demonstrates that reasoning capabilities are transferable across different modalities.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15848
• PDF: https://arxiv.org/pdf/2511.15848
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#AudioReasoning #MultimodalAI #AIResearch #MachineLearning #AudioAI
✨First Frame Is the Place to Go for Video Content Customization
📝 Summary:
The first frame in video generation models functions as a conceptual memory buffer, storing visual elements for later reuse. This enables robust video content customization with minimal training examples, without major model changes.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15700
• PDF: https://arxiv.org/pdf/2511.15700
• Project Page: https://firstframego.github.io
• Github: https://firstframego.github.io
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoGeneration #GenerativeAI #ComputerVision #DeepLearning #AICustomization
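💡 A toy illustration of the "first frame as memory buffer" idea: pack reference crops into frame 0 so later frames can reuse their visual elements. The packing scheme here is hypothetical:
```python
import numpy as np

def first_frame_memory(references, frame_hw=(256, 256)):
    """Tile reference crops left-to-right into the first frame so the
    video model can copy their visual elements in later frames."""
    H, W = frame_hw
    frame = np.zeros((H, W, 3), dtype=np.uint8)
    x = 0
    for ref in references:  # each ref: (h, w, 3) with h <= H
        h, w = ref.shape[:2]
        if x + w > W:
            break  # out of space in the buffer
        frame[:h, x:x + w] = ref
        x += w
    return frame

refs = [np.full((64, 64, 3), c, np.uint8) for c in (60, 120, 180)]
print(first_frame_memory(refs).shape)  # use as frame 0, then generate
```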
✨MiMo-Embodied: X-Embodied Foundation Model Technical Report
📝 Summary:
MiMo-Embodied is the first cross-embodied foundation model. It achieves state-of-the-art performance in both autonomous driving and embodied AI, demonstrating positive transfer through multi-stage learning and fine-tuning.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16518
• PDF: https://arxiv.org/pdf/2511.16518
• Github: https://github.com/XiaomiMiMo/MiMo-Embodied
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#FoundationModels #EmbodiedAI #AutonomousDriving #AI #Robotics
✨SAM 3D: 3Dfy Anything in Images
📝 Summary:
SAM 3D reconstructs 3D objects from single images, predicting geometry, texture, and layout. It uses a multi-stage training framework with synthetic pretraining and real-world alignment, breaking the 3D data barrier and achieving high human preference.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16624
• PDF: https://arxiv.org/pdf/2511.16624
• Project Page: https://ai.meta.com/sam3d/
• Github: https://github.com/facebookresearch/sam-3d-objects
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#3DReconstruction #ComputerVision #AI #DeepLearning #SingleImage3D
✨Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
📝 Summary:
Thinking-while-Generating (TwiG) interleaves textual reasoning throughout the visual generation process. This on-the-fly multimodal interaction guides and reflects on visual content as it is created, yielding more context-aware and semantically rich outputs.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16671
• PDF: https://arxiv.org/pdf/2511.16671
• Project Page: https://think-while-gen.github.io/
• Github: https://github.com/ZiyuGuo99/Thinking-while-Generating
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#GenerativeAI #MultimodalAI #ComputerVision #NLP #AIResearch
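💡 A schematic of the interleaving loop, with stubbed reason/generate calls; the control flow is the point, not the exact interfaces:
```python
def thinking_while_generating(prompt, steps, reason, generate):
    """Alternate textual reasoning and partial visual generation:
    each thought conditions the next generation step, and each
    partial canvas is reflected on before continuing."""
    canvas, thoughts = None, []
    for _ in range(steps):
        thought = reason(prompt, canvas, thoughts)   # critique/plan so far
        canvas = generate(prompt, canvas, thought)   # extend with the plan
        thoughts.append(thought)
    return canvas, thoughts

# Stub calls so the sketch runs end to end.
canvas, thoughts = thinking_while_generating(
    "a red cube on a table", steps=3,
    reason=lambda p, c, t: f"plan step {len(t)}",
    generate=lambda p, c, th: (c or []) + [th])
print(thoughts)
```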
✨Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
📝 Summary:
Nemotron Elastic embeds multiple submodels within a single large language model, reducing training costs by 360x compared to training separate models. The framework allows zero-shot extraction of optimized submodels for various deployment budgets without additional training or fine-tuning.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16664
• PDF: https://arxiv.org/pdf/2511.16664
• Project Page: https://huggingface.co/nvidia/Nemotron-Elastic-12B
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LLM #AI #MachineLearning #DeepLearning #EfficientAI
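💡 A toy, shape-level illustration of zero-shot submodel extraction by slicing shared weights down to a smaller width; elastic training actually learns which channels and layers to keep, so treat this as a sketch only:
```python
import torch
import torch.nn as nn

def extract_submodel(linear: nn.Linear, keep_out: int, keep_in: int) -> nn.Linear:
    """Slice a trained layer down to a nested, smaller layer without
    retraining -- the essence of many-in-one elastic weight sharing."""
    sub = nn.Linear(keep_in, keep_out)
    with torch.no_grad():
        sub.weight.copy_(linear.weight[:keep_out, :keep_in])
        sub.bias.copy_(linear.bias[:keep_out])
    return sub

full = nn.Linear(1024, 1024)              # stands in for a parent-model layer
small = extract_submodel(full, 512, 512)  # nested budget-friendly variant
print(small(torch.randn(2, 512)).shape)
```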
✨TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
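💡 A rough sketch of compressing a long vision-token sequence into a few instruction tokens via cross-attention; the layer sizes and single-block design are assumptions, not the actual TransV module:
```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Instruction tokens attend over many vision tokens, absorbing
    their information so the long vision sequence can be dropped."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, instr_tokens, vision_tokens):
        # instr_tokens: (B, Ti, C); vision_tokens: (B, Tv, C), Tv >> Ti
        fused, _ = self.attn(instr_tokens, vision_tokens, vision_tokens)
        return instr_tokens + fused  # keep only the compact sequence

comp = TokenCompressor(dim=64)
out = comp(torch.randn(1, 8, 64), torch.randn(1, 10_000, 64))
print(out.shape)  # (1, 8, 64): 10k frame tokens folded into 8
```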