✨Video Generation Models Are Good Latent Reward Models
📝 Summary:
Traditional video reward models operate in pixel space, which is inefficient. PRFL instead uses pre-trained video generation models as latent reward models and optimizes preferences entirely in latent space, significantly improving human alignment while reducing memory use and training time.
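A minimal sketch of the core idea, assuming a generic PyTorch setup (the backbone, shapes, and preference head below are illustrative stand-ins, not PRFL's actual API): the reward is scored directly on the generator's latents, so no VAE decode to pixels is needed during preference training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRewardModel(nn.Module):
    """Illustrative latent reward head: scores video latents directly,
    avoiding the decode to pixel space during preference optimization."""
    def __init__(self, backbone: nn.Module, latent_dim: int = 16):
        super().__init__()
        self.backbone = backbone              # frozen pre-trained video-generation trunk
        self.head = nn.Linear(latent_dim, 1)  # lightweight preference head

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, C, H, W) video latents; assume the backbone keeps this shape
        feats = self.backbone(latents)
        return self.head(feats.mean(dim=(1, 3, 4)))  # pool to one scalar reward per video

def preference_loss(r_win: torch.Tensor, r_lose: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss on (preferred, rejected) latent pairs
    return -F.logsigmoid(r_win - r_lose).mean()
```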
🔹 Publication Date: Published on Nov 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21541
• PDF: https://arxiv.org/pdf/2511.21541
• Project Page: https://kululumi.github.io/PRFL/
• Github: https://kululumi.github.io/PRFL/
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoGeneration #ReinforcementLearning #LatentSpace #AIResearch #MachineLearning
✨Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
📝 Summary:
Applying a head-specific sigmoid gate after scaled dot-product attention (SDPA) in large language models significantly improves performance, stability, and scaling. This simple modification mitigates attention sinks and enhances long-context extrapolation by introducing non-linearity and sparse gating.
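A condensed sketch of the mechanism, assuming the variant the summary describes (a gate computed from the layer input and applied elementwise per head after SDPA); see the repo for the exact placement:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Sketch of a head-specific sigmoid gate applied to the SDPA output."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)   # per-head, per-channel gate from the input
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        # Sigmoid gate after SDPA: adds non-linearity and lets heads emit
        # near-zero outputs, removing the pressure to form attention sinks.
        return self.proj(out * torch.sigmoid(self.gate(x)))
```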
🔹 Publication Date: Published on May 10, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.06708
• PDF: https://arxiv.org/pdf/2505.06708
• Github: https://github.com/qiuzh20/gated_attention
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LLM #AttentionMechanism #DeepLearning #NLP #AIResearch
✨Paper2Video: Automatic Video Generation from Scientific Papers
📝 Summary:
PaperTalker is a multi-agent framework for automatic academic video production, integrating slides, subtitles, speech, and talking heads. It produces more faithful, informative videos than existing methods, easing a labor-intensive part of research communication.
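A toy orchestration sketch of such a pipeline; every function here is a hypothetical stand-in for the framework's agents, not PaperTalker's actual API:

```python
# Hypothetical stand-ins for the framework's agents (illustrative only).
def make_slides(paper_text: str) -> list[str]:
    # slide/layout agent: one slide per paper section in this toy version
    return [s.strip() for s in paper_text.split("\n\n") if s.strip()]

def make_subtitles(slides: list[str]) -> list[str]:
    # subtitling agent: a short narration line per slide
    return [f"Slide {i + 1}: {s[:80]}" for i, s in enumerate(slides)]

def synthesize_speech(subtitles: list[str]) -> list[bytes]:
    # speech agent: placeholder bytes standing in for real TTS audio
    return [s.encode() for s in subtitles]

def paper_to_video(paper_text: str):
    slides = make_slides(paper_text)
    subs = make_subtitles(slides)
    audio = synthesize_speech(subs)
    return slides, subs, audio  # a talking-head renderer would consume these
```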
🔹 Publication Date: Published on Oct 6, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.05096
• PDF: https://arxiv.org/pdf/2510.05096
• Project Page: https://showlab.github.io/Paper2Video/
• Github: https://showlab.github.io/Paper2Video/
✨ Datasets citing this paper:
• https://huggingface.co/datasets/ZaynZhu/Paper2Video
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoGeneration #AI #AcademicCommunication #MachineLearning #MultimodalAI
✨What does it mean to understand language?
📝 Summary:
Deep language understanding involves more than just surface meaning. It requires transferring information from the core language system to other brain regions for mental models, world knowledge, and memories. This offers a new strategy to study language comprehension.
🔹 Publication Date: Published on Nov 24, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19757
• PDF: https://arxiv.org/pdf/2511.19757
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LanguageUnderstanding #CognitiveScience #Neuroscience #MentalModels #NLP
✨NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
📝 Summary:
NAF upsamples Vision Foundation Model (VFM) features zero-shot by learning adaptive spatial-and-content weights. It outperforms VFM-specific upsamplers without retraining, achieving state-of-the-art performance efficiently across a range of tasks.
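A non-learned simplification of the spatial-and-content filtering idea (NAF learns these weights; this fixed-kernel sketch only illustrates the neighborhood-filtering step, and all shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def neighborhood_filter_upsample(feats, guide, k=3, tau=0.1):
    """Illustrative NAF-style upsampling (not the official code): per-pixel
    weights come from content similarity in a high-res guidance image over a
    k x k neighborhood, and low-res VFM features are blended with them.
    feats: (B, C, h, w) low-res features; guide: (B, 3, H, W) high-res image."""
    B, C, h, w = feats.shape
    H, W = guide.shape[-2:]
    up = F.interpolate(feats, size=(H, W), mode="bilinear", align_corners=False)
    patches = F.unfold(up, k, padding=k // 2).view(B, C, k * k, H, W)
    g_patches = F.unfold(guide, k, padding=k // 2).view(B, 3, k * k, H, W)
    # Content affinity: similarity between each pixel and its neighbors in the guide
    sim = -((g_patches - guide.unsqueeze(2)) ** 2).sum(1) / tau   # (B, k*k, H, W)
    weights = sim.softmax(dim=1).unsqueeze(1)                     # normalized filter
    return (patches * weights).sum(dim=2)                         # (B, C, H, W)
```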
🔹 Publication Date: Published on Nov 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.18452
• PDF: https://arxiv.org/pdf/2511.18452
• Github: https://github.com/valeoai/NAF?tab=readme-ov-file
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#ZeroShotLearning #ComputerVision #FeatureUpsampling #DeepLearning #AIResearch
✨FastVLM: Efficient Vision Encoding for Vision Language Models
📝 Summary:
FastVLM optimizes Vision Language Models for high-resolution images using the FastViTHD encoder. It reduces encoding latency and visual token count by scaling the input, improving time-to-first-token by up to 85x while maintaining performance.
🔹 Publication Date: Published on Dec 17, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2412.13303
• PDF: https://arxiv.org/pdf/2412.13303
• Github: https://github.com/apple/ml-fastvlm
🔹 Models citing this paper:
• https://huggingface.co/apple/FastVLM-0.5B
• https://huggingface.co/apple/FastVLM-7B
• https://huggingface.co/onnx-community/FastVLM-0.5B-ONNX
✨ Spaces citing this paper:
• https://huggingface.co/spaces/jairwaal/image
• https://huggingface.co/spaces/akhaliq/FastVLM-0.5B-gradio
• https://huggingface.co/spaces/akhaliq/FastVLM-7B
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#FastVLM #VLM #AI #MachineLearning #ComputerVision
✨ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
📝 Summary:
ENACT is a benchmark that evaluates embodied cognition in vision-language models through egocentric world-modeling tasks. It reveals a performance gap between VLMs and humans that widens with interaction, and shows that models exhibit anthropocentric biases.
🔹 Publication Date: Published on Nov 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20937
• PDF: https://arxiv.org/pdf/2511.20937
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#EmbodiedCognition #VisionLanguageModels #AIResearch #WorldModeling #CognitiveScience
✨GigaBrain-0: A World Model-Powered Vision-Language-Action Model
📝 Summary:
GigaBrain-0 is a VLA model that uses world model-generated data to overcome limitations of real robot data, improving cross-task generalization and policy robustness. This boosts real-world performance on complex manipulation tasks.
🔹 Publication Date: Published on Oct 22, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.19430
• PDF: https://arxiv.org/pdf/2510.19430
• Project Page: https://gigabrain0.github.io/
• Github: https://github.com/open-gigaai/giga-brain-0
🔹 Models citing this paper:
• https://huggingface.co/open-gigaai/GigaBrain-0-3.5B-Base
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VLAModels #WorldModels #Robotics #AI #MachineLearning
✨DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
📝 Summary:
DocETL is an agent-based system that optimizes complex document processing pipelines to significantly improve LLM accuracy. It uses logical rewriting and agent-guided evaluation to achieve 1.34 to 4.6 times higher quality outputs than current baselines.
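A toy illustration of the kind of logical rewrite such an agent might propose, with a stubbed LLM call (none of this is DocETL's real API): a single map over a long document becomes a chunked map followed by a reduce.

```python
def run_llm(prompt: str) -> str:
    # Stub so the sketch runs; swap in a real LLM client here.
    return f"<llm output for {len(prompt)}-char prompt>"

def naive_extract(doc: str) -> str:
    # One call over the whole document: simple, but inaccurate on long inputs
    return run_llm(f"List every safety clause in:\n{doc}")

def rewritten_extract(doc: str, chunk_size: int = 4000) -> str:
    # Agent-proposed rewrite: chunked map, then a reduce merging partial results
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    partials = [run_llm(f"List every safety clause in:\n{c}") for c in chunks]
    return run_llm("Merge and deduplicate these clause lists:\n" + "\n".join(partials))
```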
🔹 Publication Date: Published on Oct 16, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.12189
• PDF: https://arxiv.org/pdf/2410.12189
• Github: https://github.com/ucbepic/docetl
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LLM #AI #DocumentProcessing #AgentSystems #NaturalLanguageProcessing
✨Vidi: Large Multimodal Models for Video Understanding and Editing
📝 Summary:
Vidi is a family of Large Multimodal Models for video understanding and editing, excelling at temporal retrieval in long, multimodal videos. It significantly outperforms proprietary models like GPT-4o on the new VUE-TR benchmark, which supports hour-long videos and audio queries.
🔹 Publication Date: Published on Apr 22, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2504.15681
• PDF: https://arxiv.org/pdf/2504.15681
• Project Page: https://bytedance.github.io/vidi-website/
• Github: https://github.com/bytedance/vidi
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LMMs #VideoAI #MultimodalAI #AIResearch #DeepLearning
✨PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
📝 Summary:
PPTAgent improves presentation generation with a two-stage approach that analyzes reference presentations to ensure structural and content consistency. It outperforms traditional methods across content, design, and coherence.
🔹 Publication Date: Published on Jan 7, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2501.03936
• PDF: https://arxiv.org/pdf/2501.03936
• Github: https://github.com/icip-cas/PPTAgent
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Forceless/Zenodo10K
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#AIPresentations #GenerativeAI #MachineLearning #NLP #TechResearch
✨WorldVLA: Towards Autoregressive Action World Model
📝 Summary:
WorldVLA unifies vision-language-action (VLA) models and world models, showing that the two mutually enhance image understanding and action generation. It addresses autoregressive action-prediction errors with an attention-mask strategy that significantly improves performance.
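A sketch of one plausible reading of that mask strategy (the token layout and semantics here are assumptions, not the paper's exact design): action tokens may attend to the multimodal context but not to previously generated actions, so an early action error cannot compound.

```python
import torch

def action_attention_mask(n_ctx: int, n_actions: int) -> torch.Tensor:
    """Sequence layout assumed: [context tokens | action tokens]; True = may
    attend. Actions see the full context causally but NOT earlier generated
    actions, so a bad early action cannot corrupt later ones."""
    n = n_ctx + n_actions
    mask = torch.ones(n, n).tril().bool()                     # standard causal mask
    act = slice(n_ctx, n)
    mask[act, act] = torch.eye(n_actions, dtype=torch.bool)   # block action-to-action
    return mask

mask = action_attention_mask(n_ctx=8, n_actions=4)            # usable as an SDPA attn_mask
```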
🔹 Publication Date: Published on Jun 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2506.21539
• PDF: https://arxiv.org/pdf/2506.21539
• Project Page: https://github.com/alibaba-damo-academy/WorldVLA
• Github: https://github.com/alibaba-damo-academy/WorldVLA
🔹 Models citing this paper:
• https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA
• https://huggingface.co/jcenaa/WorldVLA-ActionModel-LIBERO-Goal-256
• https://huggingface.co/jcenaa/WorldVLA-ActionModel-LIBERO-10-256
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#AI #MachineLearning #Robotics #ComputerVision #WorldModels
✨Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
📝 Summary:
Z-Image is an efficient 6B-parameter diffusion transformer achieving state-of-the-art image generation with significantly reduced computational cost. It enables sub-second inference and consumer hardware compatibility, challenging the scale-at-all-costs paradigm.
🔹 Publication Date: Published on Nov 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22699
• PDF: https://arxiv.org/pdf/2511.22699
• Project Page: https://tongyi-mai.github.io/Z-Image-blog/
• Github: https://github.com/Tongyi-MAI/Z-Image
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#ImageGeneration #DiffusionModels #EfficientAI #FoundationModels #MachineLearning
✨DiP: Taming Diffusion Models in Pixel Space
📝 Summary:
DiP is an efficient pixel-space diffusion framework that addresses the quality-efficiency trade-off without VAEs. It combines a Diffusion Transformer for global structure with a Patch Detailer Head for local details, producing high-quality images up to 10x faster.
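A rough sketch of that two-part split, assuming illustrative sizes and wiring (not the paper's exact architecture): a transformer denoises a coarse token grid for global structure, and a small head re-expands each token into a pixel patch for local detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiPSketch(nn.Module):
    """Toy pixel-space denoiser: transformer trunk + per-patch detail head."""
    def __init__(self, dim: int = 256, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)   # patchify noisy pixels
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.dit = nn.TransformerEncoder(layer, num_layers=2)  # "DiT" trunk (global)
        self.detail_head = nn.Linear(dim, 3 * patch * patch)   # patch detailer (local)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) with H, W divisible by `patch`
        B, _, H, W = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        tokens = self.dit(tokens)                              # global structure
        patches = self.detail_head(tokens).transpose(1, 2)     # (B, 3*p*p, N)
        return F.fold(patches, (H, W), self.patch, stride=self.patch)  # back to pixels
```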
🔹 Publication Date: Published on Nov 24, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.18822
• PDF: https://arxiv.org/pdf/2511.18822
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#DiffusionModels #GenerativeAI #ImageGeneration #DeepLearning #ComputerVision
✨Architecture Decoupling Is Not All You Need For Unified Multimodal Model
📝 Summary:
Unified multimodal models struggle with task conflicts. This paper introduces an Attention Interaction Alignment (AIA) loss that learns task-specific cross-modal attention patterns, improving both generation and understanding performance without decoupling the model architecture.
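The summary doesn't spell out the exact objective, so here is only a hedged sketch of an attention-alignment loss in that spirit (the KL form and the notion of reference patterns are assumptions, not the paper's formulation):

```python
import torch
import torch.nn.functional as F

def aia_style_loss(attn: torch.Tensor, target_attn: torch.Tensor) -> torch.Tensor:
    """attn, target_attn: (B, heads, Q, K) post-softmax cross-modal attention
    maps. Pull the unified model's attention toward task-specific reference
    patterns instead of decoupling the architecture per task."""
    return F.kl_div(attn.clamp_min(1e-8).log(), target_attn, reduction="batchmean")
```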
🔹 Publication Date: Published on Nov 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22663
• PDF: https://arxiv.org/pdf/2511.22663
• Project Page: https://zhengdian1.github.io/AIA-project/
• Github: https://github.com/zhengdian1/AIA
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#MultimodalAI #DeepLearning #AttentionMechanisms #AIResearch #ArtificialIntelligence
✨DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
📝 Summary:
DualVLA tackles action degeneration in VLAs, boosting action performance while retaining reasoning. It uses dual-layer data pruning and dual-teacher adaptive distillation to balance precise action execution with multimodal understanding, leading to high success rates.
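An illustrative dual-teacher distillation objective in the spirit of the summary (the loss terms, names, and mixing weight are assumptions): one teacher supervises low-level actions, the other preserves reasoning ability.

```python
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_actions, action_teacher_actions,
                      student_logits, reasoning_teacher_logits, w: float):
    """Blend supervision from an action teacher and a reasoning teacher."""
    action_loss = F.mse_loss(student_actions, action_teacher_actions)
    reason_loss = F.kl_div(student_logits.log_softmax(-1),
                           reasoning_teacher_logits.softmax(-1),
                           reduction="batchmean")
    # w could be adapted per sample; a fixed scalar keeps the sketch simple
    return w * action_loss + (1 - w) * reason_loss
```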
🔹 Publication Date: Published on Nov 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22134
• PDF: https://arxiv.org/pdf/2511.22134
• Project Page: https://costaliya.github.io/DualVLA/
• Github: https://costaliya.github.io/DualVLA/
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#EmbodiedAI #VLAs #AIagents #DeepLearning #AIResearch
✨AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
📝 Summary:
AnyTalker generates scalable multi-person talking videos with an identity-aware Diffusion Transformer. It trains mostly on single-person videos and refines interactivity with minimal multi-person data, achieving accurate lip sync and natural motion.
🔹 Publication Date: Published on Nov 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.23475
• PDF: https://arxiv.org/pdf/2511.23475
• Project Page: https://hkust-c4g.github.io/AnyTalker-homepage/
• Github: https://github.com/HKUST-C4G/AnyTalker
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#VideoGeneration #GenerativeAI #DiffusionModels #ComputerVision #DeepLearning
✨Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
📝 Summary:
This paper introduces Hierarchical Sparse Attention (HSA) to let Transformers handle ultra-long contexts efficiently. The HSA-UltraLong model achieves over 90 percent accuracy on 16M-token retrieval tasks while matching full attention on shorter contexts, laying a foundation for future long-context research.
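A toy single-head sketch of the hierarchical idea (not the paper's implementation or kernel): score coarse chunk summaries first, then attend only inside each query's top-k chunks, so per-query cost scales with topk * chunk rather than sequence length.

```python
import torch
import torch.nn.functional as F

def hsa_sketch(q, k, v, chunk=64, topk=4):
    """q, k, v: (T, d) with T a multiple of `chunk`. Causal masking omitted
    for brevity; a real implementation would also select chunks causally."""
    T, d = k.shape
    k_chunks = k.view(T // chunk, chunk, d)
    v_chunks = v.view(T // chunk, chunk, d)
    summaries = k_chunks.mean(dim=1)                    # coarse level: (n_chunks, d)
    sel = (q @ summaries.T).topk(topk, dim=-1).indices  # top-k chunks per query
    out = torch.zeros_like(q)
    for i in range(q.shape[0]):                         # fine level, per query
        ks = k_chunks[sel[i]].reshape(-1, d)            # (topk*chunk, d)
        vs = v_chunks[sel[i]].reshape(-1, d)
        attn = F.softmax(ks @ q[i] / d ** 0.5, dim=0)
        out[i] = attn @ vs
    return out

# e.g.: hsa_sketch(torch.randn(256, 32), torch.randn(256, 32), torch.randn(256, 32))
```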
🔹 Publication Date: Published on Nov 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.23319
• PDF: https://arxiv.org/pdf/2511.23319
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#LLM #LongContext #SparseAttention #Transformers #AIResearch
✨Captain Safari: A World Engine
📝 Summary:
Captain Safari is a pose-conditioned world engine that generates high-quality, 3D-consistent long videos with precise camera paths. It uses a dynamic memory and retriever of pose-aligned world tokens to outperform existing methods in quality, consistency, and trajectory following.
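A hedged sketch of the pose-conditioned retrieval step as the summary describes it (the pose representation and distance metric are assumptions): fetch the stored world tokens whose camera poses are closest to the current query pose.

```python
import torch

def retrieve_world_tokens(mem_poses, mem_tokens, query_pose, topk=8):
    """mem_poses: (N, 7), e.g. xyz + quaternion; mem_tokens: (N, d).
    Returns the tokens of the topk memory entries nearest to query_pose."""
    dist = (mem_poses - query_pose).norm(dim=-1)      # naive pose distance
    idx = dist.topk(topk, largest=False).indices      # nearest stored views
    return mem_tokens[idx]                            # condition the generator on these
```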
🔹 Publication Date: Published on Nov 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22815
• PDF: https://arxiv.org/pdf/2511.22815
• Project Page: https://johnson111788.github.io/open-safari/
==================================
For more data science resources:
✓ https://news.1rj.ru/str/DataScienceT
#GenerativeAI #3DVideo #ComputerVision #WorldEngine #AIResearch