ML Research Hub

🌟 Mixture-of-Mamba: A Method to Increase the Efficiency of Multimodal Models.

Mixture-of-Mamba is an experimental architecture that makes multimodal models (those that work on different types of data, such as text, images, and speech) more efficient and faster. It uses the idea of sparsity to reduce the amount of computation while maintaining high model performance.

Mixture-of-Mamba adds modality-aware sparsity to Mamba blocks and dynamically selects modality-specific weights in each input processing component of Mamba blocks.

Unlike MoE-Mamba, where sparsity is applied only to MLP layers, Mixture-of-Mamba modifies the Mamba block structure directly. Modality-specific parameterization is applied to the input projection, intermediate and output projections. Convolutional layers and state transitions remain shared.
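To make the idea of modality-specific parameterization concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class names, dimensions, and random weights are all assumptions for illustration, and `SharedProjection` is only a linear stand-in for the components the post describes as shared.

```python
import random

def make_matrix(rows, cols, rng):
    """Random (rows x cols) weight matrix stored as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class ModalityAwareProjection:
    """Hypothetical projection holding one weight matrix per modality.

    Each token is projected with the weights that match its modality tag,
    so only one set of weights is used per token.
    """
    def __init__(self, modalities, d_in, d_out, seed=0):
        rng = random.Random(seed)
        self.weights = {m: make_matrix(d_out, d_in, rng) for m in modalities}

    def __call__(self, tokens, tags):
        return [matvec(self.weights[tag], tok) for tok, tag in zip(tokens, tags)]

class SharedProjection:
    """Hypothetical shared component: one weight matrix for every modality."""
    def __init__(self, d_in, d_out, seed=1):
        self.weight = make_matrix(d_out, d_in, random.Random(seed))

    def __call__(self, tokens, tags=None):
        return [matvec(self.weight, tok) for tok in tokens]

# The same input vector, tagged once as text and once as image.
token = [1.0, 2.0, 3.0, 4.0]
tokens = [token, token]
tags = ["text", "image"]

in_proj = ModalityAwareProjection(["text", "image", "speech"], d_in=4, d_out=2)
shared = SharedProjection(d_in=4, d_out=2)

routed = in_proj(tokens, tags)   # modality-specific weights per token
same = shared(tokens)            # identical weights for every token
print("modality-aware outputs differ:", routed[0] != routed[1])
print("shared outputs match:", same[0] == same[1])
```

The routing here is decided by the modality tag alone, which is what distinguishes this kind of sparsity from a learned MoE router.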

Mixture-of-Mamba is evaluated in three multimodal settings: Transfusion (interleaved text and continuous image tokens with a diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality setting that adds speech.

In the Transfusion setting, Mixture-of-Mamba reaches equivalent image loss while using only 34.76% of the compute (FLOPs) at the 1.4B model scale. In the Chameleon setting, it reaches equivalent image loss with 42.50% of the FLOPs and equivalent text loss with 65.40% of the FLOPs. In the three-modality setting, it reaches equivalent speech loss with 24.80% of the FLOPs at the 1.4B model scale.
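The percentages above are easier to compare when converted into compute savings. The short calculation below assumes each figure is the fraction of FLOPs needed relative to a baseline that reaches the same loss; the dictionary layout and function name are illustrative only.

```python
# Fraction of FLOPs needed to reach equivalent loss, as reported in the post.
flop_fraction = {
    ("Transfusion", "image"): 0.3476,
    ("Chameleon", "image"): 0.4250,
    ("Chameleon", "text"): 0.6540,
    ("Three-modality", "speech"): 0.2480,
}

def savings(fraction):
    """Return (percent of FLOPs saved, baseline-to-model FLOPs ratio)."""
    saved_pct = (1.0 - fraction) * 100.0
    baseline_ratio = 1.0 / fraction
    return saved_pct, baseline_ratio

for (setting, loss), fraction in flop_fraction.items():
    saved_pct, ratio = savings(fraction)
    print(f"{setting:15s} {loss:6s} saves {saved_pct:5.2f}% "
          f"(baseline needs {ratio:.2f}x the FLOPs)")
```

For example, 34.76% of the FLOPs corresponds to a saving of 65.24%, or a baseline that needs roughly 2.88 times as much compute.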

▶️ A practical implementation of the architecture is available in the project's GitHub repository.

📌 Licensing: MIT License.

🟡

Arxiv

🖥

GitHub

❤4👍2

1.66K views · 11:03

ML Research Hub

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

28 Jan 2025 · Li Yin, Zhangyang Wang ·

Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi-hop retrieval and question answering to autonomous agent workflows. Yet, prompt engineering -- the task of crafting textual inputs to effectively direct LLMs -- remains difficult and labor-intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback -- akin to textual gradients -- that guide iterative prompt updates. Unlike prior single-node approaches, LLM-AutoDiff inherently accommodates functional nodes, preserves time-sequential behavior in repeated calls (e.g., multi-hop loops), and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts (instructions, formats, or few-shot examples). It further boosts training efficiency by focusing on error-prone samples through selective gradient computation. Across diverse tasks, including single-step classification, multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph-centric lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows -- mirroring the transformative role that automatic differentiation libraries have long played in neural network research.

Paper: https://arxiv.org/pdf/2501.16673v2.pdf

Code: https://github.com/sylphai-inc/adalflow

Dataset: HotpotQA

❤2🔥2👍1

1.73K views · 12:12

ML Research Hub

⭐️

The first open-source analogue of OpenAI's Deep Research.

Implementation of an AI researcher that continuously searches for information based on a user's request until the system is sure that it has collected all the necessary data.

To do this, it uses several services:

- SERPAPI: performs Google searches.
- Jina: retrieves and extracts the contents of web pages.
- OpenRouter (default model: anthropic/claude-3.5-haiku): calls an LLM to generate search queries, evaluate page relevance, and understand context.

🟢

Features
- Iterative research cycle: The system iteratively refines its search queries.
- Asynchronous processing: Searching, web parsing and context evaluation are all performed in parallel for increased speed.
- Duplicate filtering: Aggregates and deduplicates links on each cycle, ensuring that the same information is not processed twice.
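The three features above can be sketched as a small asyncio loop. This is a sketch under stated assumptions, not the project's code: `search`, `fetch`, `is_relevant`, and `propose_queries` are hypothetical in-memory stand-ins for the search, page-extraction, and LLM services, and the URLs and page texts are made up.

```python
import asyncio

# Hypothetical in-memory stand-ins for the external services.
FAKE_INDEX = {
    "topic": ["https://example.com/a", "https://example.com/b"],
    "topic benchmarks": ["https://example.com/b", "https://example.com/c"],
}
FAKE_PAGES = {
    "https://example.com/a": "Overview of the topic with useful detail.",
    "https://example.com/b": "Unrelated cooking recipe.",
    "https://example.com/c": "Benchmark results with useful detail.",
}

async def search(query):
    """Stand-in for a web search call; returns a list of URLs."""
    await asyncio.sleep(0)
    return FAKE_INDEX.get(query, [])

async def fetch(url):
    """Stand-in for page retrieval and text extraction."""
    await asyncio.sleep(0)
    return FAKE_PAGES[url]

async def is_relevant(text):
    """Stand-in for the LLM relevance check."""
    await asyncio.sleep(0)
    return "useful" in text

async def propose_queries(user_query, contexts, asked):
    """Stand-in for the LLM deciding whether more searching is needed."""
    follow_up = user_query + " benchmarks"
    return [] if follow_up in asked else [follow_up]

async def process(url):
    """Fetch one page and judge its relevance."""
    text = await fetch(url)
    return url, text, await is_relevant(text)

async def research(user_query, max_iterations=5):
    seen_urls, asked, contexts = set(), set(), []
    queries = [user_query]
    for _ in range(max_iterations):
        if not queries:          # nothing left to ask: stop
            break
        asked.update(queries)
        # Run all searches in parallel.
        results = await asyncio.gather(*(search(q) for q in queries))
        # Aggregate and deduplicate links across queries and cycles.
        new_urls = []
        for urls in results:
            for url in urls:
                if url not in seen_urls:
                    seen_urls.add(url)
                    new_urls.append(url)
        # Fetch and evaluate the new pages in parallel.
        pages = await asyncio.gather(*(process(u) for u in new_urls))
        contexts.extend((url, text) for url, text, ok in pages if ok)
        # Refine the queries for the next cycle.
        queries = await propose_queries(user_query, contexts, asked)
    return contexts, seen_urls

contexts, seen = asyncio.run(research("topic"))
print([url for url, _ in contexts])
```

In this toy run the second cycle returns one link that was already seen, so only the new page is fetched, and the loop ends once the query generator has nothing further to ask.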

▪️ Github
▪️ Google Colab

👍2❤1

1.82K views · edited 14:38

ML Research Hub

CoSTI: Consistency Models for (a faster) Spatio-Temporal Imputation

31 Jan 2025 · Javier Solís-García, Belén Vega-Márquez, Juan A. Nepomuceno, Isabel A. Nepomuceno-Chamorro ·

Multivariate Time Series Imputation (MTSI) is crucial for many applications, such as healthcare monitoring and traffic management, where incomplete data can compromise decision-making. Existing state-of-the-art methods, like Denoising Diffusion Probabilistic Models (DDPMs), achieve high imputation accuracy; however, they suffer from significant computational costs and are notably time-consuming due to their iterative nature. In this work, we propose CoSTI, an innovative adaptation of Consistency Models (CMs) for the MTSI domain. CoSTI employs Consistency Training to achieve comparable imputation quality to DDPMs while drastically reducing inference times, making it more suitable for real-time applications. We evaluate CoSTI across multiple datasets and missing data scenarios, demonstrating up to a 98% reduction in imputation time with performance on par with diffusion-based models. This work bridges the gap between efficiency and accuracy in generative imputation tasks, providing a scalable solution for handling missing data in critical spatio-temporal systems.

Paper: https://arxiv.org/pdf/2501.19364v1.pdf

Code: https://github.com/javiersgjavi/costi

👍1🔥1

2.09K views · 16:13

ML Research Hub

RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains

31 Jan 2025 · Sepehr Mousavi, Shizheng Wen, Levi Lingsch, Maximilian Herde, Bogdan Raonić, Siddhartha Mishra ·

Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. Many novel elements are also incorporated to ensure resolution invariance and temporal continuity. Our model, termed RIGNO, is tested on a challenging suite of benchmarks, composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen spatial resolutions and time instances.

Paper: https://arxiv.org/pdf/2501.19205v1.pdf

Code: https://github.com/camlab-ethz/rigno

❤4👍3

2.57K views · 07:21

ML Research Hub

Demystifying Long Chain-of-Thought Reasoning in LLMs

🖥

paper
🧠 code

👍4❤2

1.93K views · edited 14:44

ML Research Hub

⭐️

New release from #Deepseek: DeepSeek-VL2-small (16B MoE) for vision-language tasks.

A demo of the new #model is now available on #huggingface

🚀

Excellent model for #OCR tasks, text extraction, image recognition and chat use.

🤗 HF: https://huggingface.co/spaces/deepseek-ai/deepseek-vl2-small

👍3

2.02K views · edited 15:06

ML Research Hub

s1: Simple test-time scaling

31 Jan 2025 · Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto ·

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
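A toy sketch of budget forcing as the abstract describes it: the thinking phase is cut off at a maximum budget, and "Wait" is appended when the model tries to stop too early. Everything here is an assumption for illustration, not the s1 code: the `END_OF_THINKING` marker, the scripted stand-in for a language model, and counting list entries as tokens.

```python
END_OF_THINKING = "<end-of-thinking>"  # hypothetical delimiter

def scripted_model(script):
    """Stand-in for an LLM: replays a fixed token script one step at a time."""
    tokens = iter(script)
    def step(_context):
        # Once the script is exhausted, keep signalling end of thinking.
        return next(tokens, END_OF_THINKING)
    return step

def generate_with_budget(step, min_tokens, max_tokens, max_waits=2):
    """Generate a thinking trace under a minimum and maximum token budget."""
    thinking = []
    waits_used = 0
    while len(thinking) < max_tokens:       # hard cap: forced termination
        token = step(thinking)
        if token == END_OF_THINKING:
            if len(thinking) < min_tokens and waits_used < max_waits:
                # The model tried to stop early: suppress the stop and
                # append "Wait" so that it keeps reasoning.
                thinking.append("Wait")
                waits_used += 1
                continue
            break
        thinking.append(token)
    return thinking

SCRIPT = ["2", "+", "2", "=", "5", END_OF_THINKING, "so", "4", END_OF_THINKING]

no_forcing = generate_with_budget(scripted_model(SCRIPT), min_tokens=0, max_tokens=20)
lengthened = generate_with_budget(scripted_model(SCRIPT), min_tokens=6, max_tokens=20)
truncated = generate_with_budget(scripted_model(SCRIPT), min_tokens=0, max_tokens=3)

print(no_forcing)
print(lengthened)
print(truncated)
```

With a minimum budget the scripted model is pushed past its first stopping point and revises its answer, while a small maximum budget ends the trace early.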

Paper: https://arxiv.org/pdf/2501.19393v2.pdf

Code: https://github.com/simplescaling/s1

Datasets: MATH - GPQA

👍1

1.64K views · edited 07:22

ML Research Hub

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

13 Dec 2024 · Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ·

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage #DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, #DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.

Paper: https://arxiv.org/pdf/2412.10302v1.pdf

Code: https://github.com/deepseek-ai/deepseek-vl2

Datasets: RefCOCO, TextVQA, MMBench, DocVQA

❤1👍1

1.69K views · edited 07:24

ML Research Hub

CycleGuardian: A Framework for Automatic Respiratory Sound Classification Based on Improved Deep Clustering and Contrastive Learning

🖥

Github: https://github.com/chumingqian/CycleGuardian

📕

Paper: https://arxiv.org/abs/2502.00734v1

🌟 Dataset: https://paperswithcode.com/dataset/icbhi-respiratory-sound-database

❤1

1.91K views · 08:28

ML Research Hub

🔥

VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

VideoLLaMA is a series of multimodal large language models (MLLMs) designed for a range of image and video understanding tasks!

🌟 The models support text, image and video processing capabilities.

The models are suitable for creating universal applications capable of solving a wide range of problems related to the analysis of visual information.

🖐️ Results of 7B model: DocVQA: 94.9, MathVision: 26.2, VideoMME: 66.2/70.3, MLVU: 73.0

🤏

2B-model results for mobile devices: MMMU: 45.3, VideoMME: 59.6/63.4

🔐 Licensing: Apache-2.0

🔳

Github: https://github.com/DAMO-NLP-SG/VideoLLaMA3

🔳

Image Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3-Image

🔳

Video Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3

👍2🙏1

1.86K views · 05:20

ML Research Hub

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

ICASSP 2025 · Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau ·

Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction on word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.
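To illustrate why variable-length inputs make standard batching wasteful, the sketch below counts padded frames when every utterance is padded to the longest one, and compares that with splitting utterances into fixed-size chunks. It is a simplified illustration, not ChunkFormer's implementation: the function names, the utterance lengths, and the chunk size of 64 frames are assumptions, and the utterance id kept with each chunk only hints at the information a mask would need to keep utterances separate.

```python
def standard_padding(lengths):
    """Padded frames when every utterance is padded to the longest one."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)

def chunk_utterances(lengths, chunk_size):
    """Split each utterance into fixed-size chunks.

    Returns (utterance_id, start_frame, valid_frames) per chunk. Only the
    last chunk of an utterance can hold fewer valid frames than chunk_size.
    """
    chunks = []
    for utt_id, n in enumerate(lengths):
        for start in range(0, n, chunk_size):
            chunks.append((utt_id, start, min(chunk_size, n - start)))
    return chunks

def chunked_padding(lengths, chunk_size):
    """Padded frames when only incomplete final chunks are padded."""
    return sum(chunk_size - valid
               for _, _, valid in chunk_utterances(lengths, chunk_size))

lengths = [1000, 120, 64]   # hypothetical utterance lengths in frames
chunk_size = 64             # hypothetical chunk size in frames

chunks = chunk_utterances(lengths, chunk_size)
print("chunks in the packed batch:", len(chunks))
print("padding, standard batching:", standard_padding(lengths))
print("padding, chunked batching: ", chunked_padding(lengths, chunk_size))
```

In this example, padding to the longest utterance adds 1816 frames, while chunking adds 32, because only the final chunk of each utterance needs filling.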

Paper: https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf

Code: https://github.com/khanld/chunkformer

Datasets: Common Voice, VIVOS

Notes: Ranked #1 on Speech Recognition on VIVOS

👍2

2.15K views · edited 07:19
