ML Research Hub – Telegram
ML Research Hub
32.7K subscribers
4.01K photos
229 videos
23 files
4.32K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

📝 Summary:
ARC-Chapter is a large-scale video chaptering model trained on millions of long-video chapters, using a new bilingual, hierarchical dataset. It introduces a novel evaluation metric, GRACE, to better reflect real-world chaptering, and achieves state-of-the-art performance.

🔹 Publication Date: Published on Nov 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14349
• PDF: https://arxiv.org/pdf/2511.14349
• Project Page: https://arcchapter.github.io/index_en.html
• Github: https://github.com/TencentARC/ARC-Chapter

==================================

For more data science resources:
https://news.1rj.ru/str/DataScienceT

#VideoChaptering #AI #MachineLearning #VideoSummarization #ComputerVision
Aligning Generative Music AI with Human Preferences: Methods and Challenges

📝 Summary:
This paper proposes applying preference alignment techniques to music AI to better match human preferences. It discusses methods like MusicRL and DiffRhythm+ to address unique challenges such as temporal coherence and harmonic consistency, aiming for improved interactive composition and personalization.

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15038
• PDF: https://arxiv.org/pdf/2511.15038

==================================


#GenerativeAI #MusicAI #PreferenceAlignment #AIResearch #ComputationalMusic
Medal S: Spatio-Textual Prompt Model for Medical Segmentation

📝 Summary:
Medal S is a medical segmentation foundation model using spatio-textual prompts for efficient, high-accuracy multi-class segmentation across diverse modalities. It uniquely aligns volumetric prompts with text embeddings and processes masks in parallel, significantly outperforming prior methods.
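The spatio-textual idea of scoring a spatial prompt against class text embeddings can be illustrated with a minimal cosine-similarity sketch (function names and shapes are hypothetical, not Medal S's actual interface):

```python
import numpy as np

def align_prompt_to_classes(volumetric_emb, text_embs):
    """Hypothetical sketch: score a spatial (volumetric) prompt
    embedding against candidate class text embeddings by cosine
    similarity and return the best-matching class index."""
    v = volumetric_emb / np.linalg.norm(volumetric_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v                       # one cosine score per class
    return int(np.argmax(sims)), sims

rng = np.random.default_rng(0)
cls_idx, scores = align_prompt_to_classes(
    rng.normal(size=32),               # prompt embedding (assumed dim)
    rng.normal(size=(5, 32)),          # 5 candidate class embeddings
)
```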

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13001
• PDF: https://arxiv.org/pdf/2511.13001
• Github: https://github.com/yinghemedical/Medal-S

🔹 Models citing this paper:
https://huggingface.co/spc819/Medal-S-V1.0

==================================


#MedicalSegmentation #FoundationModels #AI #DeepLearning #ComputerVision
OmniParser for Pure Vision Based GUI Agent

📝 Summary:
OmniParser enhances GPT-4V's ability to act as a GUI agent by improving screen parsing. It identifies interactable icons and understands element semantics using specialized models. This significantly boosts GPT-4V's performance on benchmarks like ScreenSpot, Mind2Web, and AITW.
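Conceptually, the parsing stage turns raw detections into a structured, text-serializable description of the screen that the VLM can reference by index. A minimal sketch of such a representation (names are hypothetical, not OmniParser's actual API):

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    # One interactable element found by the icon-detection model.
    element_id: int
    bbox: tuple          # (x1, y1, x2, y2) in pixels
    caption: str         # semantics from a captioning model

def serialize_screen(elements):
    """Render detected elements as numbered lines a VLM can reference."""
    return "\n".join(
        f"[{e.element_id}] {e.caption} @ {e.bbox}" for e in elements
    )

elements = [
    ScreenElement(0, (10, 10, 90, 40), "Search button"),
    ScreenElement(1, (10, 60, 200, 90), "Username text field"),
]
print(serialize_screen(elements))
```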

🔹 Publication Date: Published on Aug 1, 2024

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2408.00203
• PDF: https://arxiv.org/pdf/2408.00203
• Github: https://github.com/microsoft/omniparser

🔹 Models citing this paper:
https://huggingface.co/microsoft/OmniParser
https://huggingface.co/microsoft/OmniParser-v2.0
https://huggingface.co/banao-tech/OmniParser

Datasets citing this paper:
https://huggingface.co/datasets/mlfoundations/Click-100k

Spaces citing this paper:
https://huggingface.co/spaces/callmeumer/OmniParser-v2
https://huggingface.co/spaces/nofl/OmniParser-v2
https://huggingface.co/spaces/SheldonLe/OmniParser-v2

==================================


#GUIagents #ComputerVision #GPT4V #AIagents #DeepLearning
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

📝 Summary:
MoS is a novel multimodal diffusion model that uses a learnable token-wise router for flexible state-based modality interactions. This achieves state-of-the-art text-to-image generation and editing with minimal parameters and computational overhead.
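A token-wise router of this kind can be sketched as a per-token softmax over candidate states; the code below is a generic illustration under assumed shapes, not the paper's exact module:

```python
import numpy as np

def token_router(token_states, router_w, top_k=1):
    """Token-wise routing sketch: for each token, score K candidate
    states, keep the top-k, and mix them with softmax weights.
    token_states: (T, K, D) candidate states per token
    router_w:     (D, K)    routing projection (random stand-in here)
    """
    T, K, D = token_states.shape
    logits = token_states.mean(axis=1) @ router_w          # (T, K)
    top = np.argsort(logits, axis=1)[:, -top_k:]           # (T, top_k)
    mixed = np.zeros((T, D))
    for t in range(T):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()          # softmax
        mixed[t] = (w[:, None] * token_states[t, top[t]]).sum(axis=0)
    return mixed

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3, 8))    # 4 tokens, 3 candidate states
out = token_router(states, rng.normal(size=(8, 3)), top_k=2)
print(out.shape)  # (4, 8)
```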

🔹 Publication Date: Published on Nov 15

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.12207
• PDF: https://arxiv.org/pdf/2511.12207

==================================


#GenerativeAI #MultimodalAI #DiffusionModels #TextToImage #DeepLearning
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

📝 Summary:
Ideation diversity significantly enhances AI research agent performance. Higher ideation diversity leads to stronger results on the MLE-bench benchmark across different models and scaffolds. This finding holds across various performance metrics.
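One simple proxy for ideation diversity is the mean pairwise lexical distance across an agent's proposed ideas; the sketch below uses Jaccard distance and is purely illustrative, not the paper's measure:

```python
def jaccard_distance(a, b):
    """1 - Jaccard similarity between the word sets of two idea strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def ideation_diversity(ideas):
    """Mean pairwise lexical distance over a batch of idea strings."""
    pairs = [(i, j) for i in range(len(ideas)) for j in range(i + 1, len(ideas))]
    return sum(jaccard_distance(ideas[i], ideas[j]) for i, j in pairs) / len(pairs)

varied = ["prune features aggressively",
          "ensemble gradient boosted trees",
          "augment images with mixup"]
similar = ["tune learning rate",
           "tune the learning rate",
           "tune learning rate again"]
assert ideation_diversity(varied) > ideation_diversity(similar)
```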

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15593
• PDF: https://arxiv.org/pdf/2511.15593

==================================


#AIResearch #IdeationDiversity #MachineLearning #AIagents #AIPerformance
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

📝 Summary:
V-ReasonBench is a new benchmark for evaluating generative video models' reasoning across structured problem-solving, spatial cognition, pattern inference, and physical dynamics. It uses diverse tasks to reveal dimension-wise differences between models, aiming to support the development of human-aligned reasoning.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16668
• PDF: https://arxiv.org/pdf/2511.16668
• Project Page: https://oahzxl.github.io/VReasonBench/
• Github: https://github.com/yangluo7/V-ReasonBench

==================================


#VideoGeneration #AIReasoning #GenerativeAI #Benchmarking #MachineLearning
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
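Joint-GRPO presumably builds on GRPO-style reinforcement learning, whose core ingredient is a group-relative advantage: each rollout's reward is normalized against its own sampling group. A sketch of that standard step (not the paper's joint variant):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and std of its own group (G samples for one prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
# above-mean rollouts get positive advantage, below-mean negative
```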

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS

==================================


#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
Scaling Spatial Intelligence with Multimodal Foundation Models

📝 Summary:
SenseNova-SI is a new scaled multimodal foundation model that achieves superior spatial intelligence. Trained on 8 million diverse data samples, it delivers unprecedented performance on various spatial benchmarks. The models are publicly released to foster further research.

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13719
• PDF: https://arxiv.org/pdf/2511.13719
• Project Page: https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B
• Github: https://github.com/OpenSenseNova/SenseNova-SI

🔹 Models citing this paper:
https://huggingface.co/sensenova/SenseNova-SI-InternVL3-8B
https://huggingface.co/sensenova/SenseNova-SI-InternVL3-2B
https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B

==================================


#MultimodalAI #FoundationModels #SpatialIntelligence #ComputerVision #AI
Step-Audio-R1 Technical Report

📝 Summary:
Step-Audio-R1 is the first audio reasoning model. It uses Modality-Grounded Reasoning Distillation to achieve strong audio reasoning, outperforming previous models. This demonstrates that reasoning capabilities are transferable across different modalities.

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15848
• PDF: https://arxiv.org/pdf/2511.15848

==================================


#AudioReasoning #MultimodalAI #AIResearch #MachineLearning #AudioAI
First Frame Is the Place to Go for Video Content Customization

📝 Summary:
The first frame in video generation models functions as a conceptual memory buffer, storing visual elements for later reuse. This enables robust video content customization with minimal training examples, without major model changes.

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15700
• PDF: https://arxiv.org/pdf/2511.15700
• Project Page: https://firstframego.github.io

==================================


#VideoGeneration #GenerativeAI #ComputerVision #DeepLearning #AICustomization
MiMo-Embodied: X-Embodied Foundation Model Technical Report

📝 Summary:
MiMo-Embodied is the first cross-embodied foundation model. It achieves state-of-the-art performance in both autonomous driving and embodied AI, demonstrating positive transfer through multi-stage learning and fine-tuning.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16518
• PDF: https://arxiv.org/pdf/2511.16518
• Github: https://github.com/XiaomiMiMo/MiMo-Embodied

==================================


#FoundationModels #EmbodiedAI #AutonomousDriving #AI #Robotics
SAM 3D: 3Dfy Anything in Images

📝 Summary:
SAM 3D reconstructs 3D objects from single images, predicting geometry, texture, and layout. It uses a multi-stage training framework with synthetic pretraining and real-world alignment, breaking the 3D data barrier and achieving high human preference.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16624
• PDF: https://arxiv.org/pdf/2511.16624
• Project Page: https://ai.meta.com/sam3d/
• Github: https://github.com/facebookresearch/sam-3d-objects

==================================


#3DReconstruction #ComputerVision #AI #DeepLearning #SingleImage3D
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

📝 Summary:
Thinking-while-Generating (TwiG) interleaves textual reasoning throughout the visual generation process. This on-the-fly multimodal interaction guides and reflects on visual content as it is created, resulting in more context-aware and semantically rich outputs.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16671
• PDF: https://arxiv.org/pdf/2511.16671
• Project Page: https://think-while-gen.github.io/
• Github: https://github.com/ZiyuGuo99/Thinking-while-Generating

==================================


#GenerativeAI #MultimodalAI #ComputerVision #NLP #AIResearch
Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

📝 Summary:
Nemotron Elastic embeds multiple submodels within a single large language model, reducing training costs by 360x compared to training separate models. The framework allows zero-shot extraction of optimized submodels for various deployment budgets without additional training or fine-tuning.
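Zero-shot extraction of a nested submodel can be pictured as slicing prefix rows and columns of each weight matrix, so every smaller model lives inside the larger one. The sketch below is a hypothetical illustration of the idea, not Nemotron Elastic's actual procedure:

```python
import numpy as np

def extract_submodel(weights, width_fraction):
    """Hypothetical nested-submodel extraction: keep the leading
    fraction of rows and columns of each weight matrix, so every
    smaller model is a prefix slice of the full one (no retraining)."""
    sub = {}
    for name, w in weights.items():
        r = max(1, int(w.shape[0] * width_fraction))
        c = max(1, int(w.shape[1] * width_fraction))
        sub[name] = w[:r, :c]
    return sub

full = {"ffn_up": np.ones((8, 32)), "ffn_down": np.ones((32, 8))}
half = extract_submodel(full, 0.5)
print({k: v.shape for k, v in half.items()})  # halved dimensions
```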

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16664
• PDF: https://arxiv.org/pdf/2511.16664
• Project Page: https://huggingface.co/nvidia/Nemotron-Elastic-12B

==================================


#LLM #AI #MachineLearning #DeepLearning #EfficientAI
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long-video understanding. It introduces a TransV module that compresses redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames while achieving state-of-the-art performance.
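Compressing many vision tokens into a few instruction tokens can be sketched as cross-attention pooling, with the instruction tokens acting as queries; this is a generic illustration, not the TransV module itself:

```python
import numpy as np

def compress_tokens(vision, instruction):
    """Cross-attention pooling sketch: instruction tokens query and
    absorb the vision tokens.
    vision:      (Nv, D) many redundant frame tokens
    instruction: (Ni, D) few text tokens, Ni << Nv
    Returns (Ni, D): vision content folded into instruction slots."""
    D = vision.shape[1]
    scores = instruction @ vision.T / np.sqrt(D)       # (Ni, Nv)
    scores -= scores.max(axis=1, keepdims=True)        # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return instruction + attn @ vision                 # residual update

rng = np.random.default_rng(0)
out = compress_tokens(rng.normal(size=(10000, 16)), rng.normal(size=(8, 16)))
print(out.shape)  # (8, 16): 10k vision tokens folded into 8 slots
```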

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================


#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

📝 Summary:
SAM2S is a foundation model that enhances interactive video object segmentation in surgery. It leverages a new large-scale benchmark, robust memory, and temporal learning to achieve superior accuracy (80.42 J&F) and real-time performance in surgical video analysis.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16618
• PDF: https://arxiv.org/pdf/2511.16618
• Project Page: https://jinlab-imvr.github.io/SAM2S
• Github: https://github.com/jinlab-imvr/SAM2S

==================================


#SurgicalAI #MedicalImaging #ComputerVision #FoundationModels #DeepLearning
NaTex: Seamless Texture Generation as Latent Color Diffusion

📝 Summary:
NaTex directly generates 3D textures using latent color diffusion and geometry-aware models. It predicts texture color in 3D space, outperforming prior methods in coherence and alignment by avoiding 2D multi-view limitations.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16317
• PDF: https://arxiv.org/pdf/2511.16317
• Project Page: https://natex-ldm.github.io/

==================================


#TextureGeneration #DiffusionModels #3DGraphics #ComputerVision #DeepLearning
PartUV: Part-Based UV Unwrapping of 3D Meshes

📝 Summary:
PartUV is a novel UV unwrapping pipeline for noisy AI-generated 3D meshes. It uses part decomposition and geometric heuristics to generate significantly fewer, part-aligned charts with low distortion. PartUV outperforms existing methods in chart count and seam length on diverse datasets.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16659
• PDF: https://arxiv.org/pdf/2511.16659
• Project Page: https://www.zhaoningwang.com/PartUV/

==================================


#UVUnwrapping #3DMeshes #ComputerGraphics #GeometricProcessing #AI