
ML Research Hub

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

28 Jan 2025 · Li Yin, Zhangyang Wang ·

Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi-hop retrieval and question answering to autonomous agent workflows. Yet, prompt engineering -- the task of crafting textual inputs to effectively direct LLMs -- remains difficult and labor-intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback -- akin to textual gradients -- that guides iterative prompt updates. Unlike prior single-node approaches, LLM-AutoDiff inherently accommodates functional nodes, preserves time-sequential behavior in repeated calls (e.g., multi-hop loops), and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts (instructions, formats, or few-shot examples). It further boosts training efficiency by focusing on error-prone samples through selective gradient computation. Across diverse tasks, including single-step classification, multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph-centric lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows -- mirroring the transformative role that automatic differentiation libraries have long played in neural network research.

Paper: https://arxiv.org/pdf/2501.16673v2.pdf

Code: https://github.com/sylphai-inc/adalflow

Dataset: HotpotQA


⭐️ The first open-source analogue of Deep Research from OpenAI.

Implementation of an AI researcher that continuously searches for information based on a user's request until the system is sure that it has collected all the necessary data.

To do this, it uses several services:

- SERPAPI: performs Google searches.
- Jina: retrieves and extracts the contents of web pages.
- OpenRouter (default model: anthropic/claude-3.5-haiku): calls an LLM to generate search queries, evaluate page relevance, and understand context.

🟢 Functions
- Iterative research cycle: the system iteratively refines its search queries.
- Asynchronous processing: searching, web parsing, and context evaluation are all performed in parallel for increased speed.
- Duplicate filtering: links are aggregated and deduplicated on each cycle, so the same information is not processed twice.

▪️ Github
▪️ Google Colab
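
The cycle described in this post (search, deduplicate links, judge pages in parallel, refine the query, repeat) can be sketched roughly as below. This is only an illustration under stated assumptions, not the project's code: `search`, `fetch_and_judge`, `refine`, and the `FAKE_INDEX` data are hypothetical stubs standing in for the search API, the page extractor, and the LLM calls.

```python
import asyncio

# Hypothetical stand-in data: maps a query to the links a search would return.
FAKE_INDEX = {
    "llm agents": ["https://a.example/1", "https://b.example/2"],
    "llm agents evaluation": ["https://b.example/2", "https://c.example/3"],
}

async def search(query: str) -> list[str]:
    """Stub for a search API call (assumption, not a real client)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_INDEX.get(query, [])

async def fetch_and_judge(url: str) -> tuple[str, bool]:
    """Stub for page extraction plus an LLM relevance check."""
    await asyncio.sleep(0)
    return url, not url.endswith("/1")  # pretend pages ending in /1 are irrelevant

def refine(query: str, round_no: int) -> list[str]:
    """Stub for LLM query refinement; an empty list means 'enough data collected'."""
    return [query + " evaluation"] if round_no == 0 else []

async def research(query: str, max_rounds: int = 5) -> list[str]:
    seen: set[str] = set()      # every link ever returned, for deduplication
    relevant: list[str] = []    # links judged relevant so far
    queries = [query]
    for round_no in range(max_rounds):
        if not queries:
            break
        # Run all searches of this round concurrently.
        results = await asyncio.gather(*(search(q) for q in queries))
        # Aggregate the links and drop anything already processed.
        new_links = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    new_links.append(link)
        # Fetch and judge the new pages concurrently.
        judged = await asyncio.gather(*(fetch_and_judge(u) for u in new_links))
        relevant.extend(u for u, ok in judged if ok)
        queries = refine(query, round_no)
    return relevant

if __name__ == "__main__":
    print(asyncio.run(research("llm agents")))
```

With the stub data, the second round returns one link that was already seen in the first round, so only the unseen link is fetched and judged.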


CoSTI: Consistency Models for (a faster) Spatio-Temporal Imputation

31 Jan 2025 · Javier Solís-García, Belén Vega-Márquez, Juan A. Nepomuceno, Isabel A. Nepomuceno-Chamorro ·

Multivariate Time Series Imputation (MTSI) is crucial for many applications, such as healthcare monitoring and traffic management, where incomplete data can compromise decision-making. Existing state-of-the-art methods, like Denoising Diffusion Probabilistic Models (DDPMs), achieve high imputation accuracy; however, they suffer from significant computational costs and are notably time-consuming due to their iterative nature. In this work, we propose CoSTI, an innovative adaptation of Consistency Models (CMs) for the MTSI domain. CoSTI employs Consistency Training to achieve comparable imputation quality to DDPMs while drastically reducing inference times, making it more suitable for real-time applications. We evaluate CoSTI across multiple datasets and missing data scenarios, demonstrating up to a 98% reduction in imputation time with performance on par with diffusion-based models. This work bridges the gap between efficiency and accuracy in generative imputation tasks, providing a scalable solution for handling missing data in critical spatio-temporal systems.

Paper: https://arxiv.org/pdf/2501.19364v1.pdf

Code: https://github.com/javiersgjavi/costi


RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains

31 Jan 2025 · Sepehr Mousavi, Shizheng Wen, Levi Lingsch, Maximilian Herde, Bogdan Raonić, Siddhartha Mishra ·

Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. Many novel elements are also incorporated to ensure resolution invariance and temporal continuity. Our model, termed RIGNO, is tested on a challenging suite of benchmarks, composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen spatial resolutions and time instances.

Paper: https://arxiv.org/pdf/2501.19205v1.pdf

Code: https://github.com/camlab-ethz/rigno


Demystifying Long Chain-of-Thought Reasoning in LLMs

🖥 paper
🧠 code


⭐️ New release from #DeepSeek: DeepSeek-VL2-small (16B MoE) for vision-language tasks.

A demo of the new #model is now available on #huggingface.

🚀 An excellent model for #OCR tasks, text extraction, image recognition, and chat use.

🤗 HF: https://huggingface.co/spaces/deepseek-ai/deepseek-vl2-small


s1: Simple test-time scaling

31 Jan 2025 · Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto ·

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Paper: https://arxiv.org/pdf/2501.19393v2.pdf

Code: https://github.com/simplescaling/s1

Datasets: MATH - GPQA
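
The budget forcing idea in the abstract above (cut the thinking off at a token budget, or append "Wait" when the model tries to stop early) can be illustrated with a toy loop. This is a sketch under assumptions, not the s1 implementation: `toy_model`, the `END` marker, and the `num_waits` parameter are hypothetical names, and a real setup would call a language model's decoder instead of the stub.

```python
from typing import Callable

END = "<end-think>"  # hypothetical end-of-thinking marker

def budget_forced_thinking(
    next_token: Callable[[list[str]], str],
    max_tokens: int,
    num_waits: int = 0,
) -> list[str]:
    """Generate thinking tokens under a budget.

    - If the model emits END while waits remain, the END is suppressed
      and "Wait" is appended instead, so the model keeps thinking.
    - Once max_tokens is reached, generation is cut off regardless.
    """
    tokens: list[str] = []
    waits_left = num_waits
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok == END:
            if waits_left > 0:
                tokens.append("Wait")
                waits_left -= 1
                continue
            break
        tokens.append(tok)
    return tokens

def toy_model(tokens: list[str]) -> str:
    """Hypothetical stub: two reasoning steps, then tries to stop;
    after each "Wait" it produces one "recheck" and tries to stop again."""
    if not tokens:
        return "step1"
    if tokens[-1] == "step1":
        return "step2"
    if tokens[-1] == "Wait":
        return "recheck"
    return END

if __name__ == "__main__":
    print(budget_forced_thinking(toy_model, max_tokens=10, num_waits=2))
    print(budget_forced_thinking(toy_model, max_tokens=3, num_waits=2))
```

The first call lengthens the trace with two forced rechecks; the second shows the hard cutoff, where the budget ends generation before the model gets to finish.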


DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

13 Dec 2024 · Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ·

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage #DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, #DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.

Paper: https://arxiv.org/pdf/2412.10302v1.pdf

Code: https://github.com/deepseek-ai/deepseek-vl2

Datasets: RefCOCO - TextVQA - MMBench - DocVQA


CycleGuardian: A Framework for Automatic Respiratory Sound Classification Based on Improved Deep Clustering and Contrastive Learning

🖥 Github: https://github.com/chumingqian/CycleGuardian

📕 Paper: https://arxiv.org/abs/2502.00734v1

🌟 Dataset: https://paperswithcode.com/dataset/icbhi-respiratory-sound-database


🔥 VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

VideoLLaMA 3 is a series of multimodal large language models (MLLMs) designed for a range of image and video understanding tasks.

🌟 The models support text, image and video processing capabilities.

The models are suitable for creating universal applications capable of solving a wide range of problems related to the analysis of visual information.

🖐️ Results of 7B model: DocVQA: 94.9, MathVision: 26.2, VideoMME: 66.2/70.3, MLVU: 73.0

🤏 2B-model results for mobile devices: MMMU: 45.3, VideoMME: 59.6/63.4

🔐 Licensing: Apache-2.0

🔳 Github: https://github.com/DAMO-NLP-SG/VideoLLaMA3

🔳 Image Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3-Image

🔳 Video Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3


ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

ICASSP 2025 · Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau ·

Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction in word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.

Paper: https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf

Code: https://github.com/khanld/chunkformer

Datasets: Common Voice - VIVOS

Notes: Ranked #1 on Speech Recognition on VIVOS
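
To see why padding matters for batches of very different lengths, as the abstract above argues, here is a small back-of-the-envelope sketch. It is a simplified illustration and not ChunkFormer's masked batching: it only compares the frames processed when every utterance is padded to the longest one against the frames processed when utterances are split into fixed-size chunks and packed together. The function names, the chunk size of 64 frames, and the example lengths are all assumptions.

```python
def padded_cost(lengths: list[int]) -> int:
    """Frames processed when every utterance is padded to the longest one."""
    return max(lengths) * len(lengths)

def chunk_pack(lengths: list[int], chunk_size: int) -> list[tuple[int, int, int]]:
    """Split each utterance into chunks of at most chunk_size frames.

    Returns (utterance_id, start, end) triples; the utterance id is what a
    mask would use to keep chunks of different utterances from mixing.
    """
    chunks = []
    for utt_id, n in enumerate(lengths):
        for start in range(0, n, chunk_size):
            chunks.append((utt_id, start, min(start + chunk_size, n)))
    return chunks

def packed_cost(chunks: list[tuple[int, int, int]], chunk_size: int) -> int:
    """Frames processed when each chunk is padded only up to chunk_size."""
    return len(chunks) * chunk_size

if __name__ == "__main__":
    lengths = [1000, 120, 64, 300]  # utterance lengths in frames (made-up values)
    chunks = chunk_pack(lengths, chunk_size=64)
    print("real frames:  ", sum(lengths))
    print("padded batch: ", padded_cost(lengths))
    print("chunk-packed: ", packed_cost(chunks, 64))
```

In this toy batch, padding to the longest utterance processes far more frames than actually exist, while chunk packing wastes at most one partial chunk per utterance.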


Please add your friends and your teachers.


Detecting Backdoor Samples in Contrastive Language Image Pretraining

3 Feb 2025 · Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey ·

Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns about the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than those of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs.

Paper: https://arxiv.org/pdf/2502.01385v1.pdf

Code: https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples

Datasets: Conceptual Captions - CC12M - RedCaps
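
As a rough illustration of what a density ratio-based local outlier detector does, the sketch below scores each point by its k-nearest-neighbour distance divided by the average k-nearest-neighbour distance of its neighbours, a simplified variant of the local outlier factor. It is not the detector from the paper's repository: the function names, `k=5`, and the synthetic 8-dimensional data (a tight cluster plus one isolated point) are assumptions, and real use would run on CLIP embeddings.

```python
import math
import random

def knn_info(points: list[list[float]], k: int) -> list[tuple[list[int], float]]:
    """For each point: (indices of its k nearest neighbours, distance to the k-th)."""
    info = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        info.append(([j for _, j in dists[:k]], dists[k - 1][0]))
    return info

def simplified_lof(points: list[list[float]], k: int = 5) -> list[float]:
    """Score = own k-NN distance / mean k-NN distance of the k neighbours.

    Scores near 1 mean the point is about as dense as its neighbourhood;
    scores far above 1 mean its neighbourhood is comparatively sparse.
    """
    info = knn_info(points, k)
    scores = []
    for neighbours, kdist in info:
        mean_neighbour_kdist = sum(info[j][1] for j in neighbours) / k
        scores.append(kdist / mean_neighbour_kdist)
    return scores

# Synthetic data: 100 tightly clustered "clean" points and one isolated point.
random.seed(0)
clean = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(100)]
suspect = [[3.0] * 8]
scores = simplified_lof(clean + suspect, k=5)

# Index of the highest-scoring point; the isolated point sits at index 100.
print(max(range(len(scores)), key=scores.__getitem__))
```

This brute-force version is quadratic in the number of points, so it only suits small demonstrations.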


Efficient Reasoning with Hidden Thinking

31 Jan 2025 · Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu ·

Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs at hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design a corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.

Paper: https://arxiv.org/pdf/2501.19201v1.pdf

Code: https://github.com/shawnricecake/heima

Datasets: MMBench - MM-Vet - MathVista - MMStar - HallusionBench
