DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
13 Dec 2024 · Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ·
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Paper: https://arxiv.org/pdf/2412.10302v1.pdf
Code: https://github.com/deepseek-ai/deepseek-vl2
Datasets: RefCOCO, TextVQA, MMBench, DocVQA
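The dynamic tiling idea can be pictured with a minimal sketch (illustrative only, not the DeepSeek-VL2 implementation; the 384-pixel tile size, the tile budget, and the grid-selection rule are assumptions): the image's aspect ratio picks a rows×cols grid of fixed-size tiles, and a global thumbnail is kept alongside the local tiles.

```python
# Hedged sketch of dynamic tiling for high-resolution images.
from PIL import Image

TILE = 384          # assumed tile resolution
MAX_TILES = 9       # assumed budget of local tiles

def best_grid(width, height, max_tiles=MAX_TILES):
    """Choose (rows, cols) whose aspect ratio is closest to the image's; prefer more tiles on ties."""
    target = width / height
    candidates = [(r, c) for r in range(1, max_tiles + 1)
                  for c in range(1, max_tiles + 1) if r * c <= max_tiles]
    return min(candidates, key=lambda rc: (abs(rc[1] / rc[0] - target), -rc[0] * rc[1]))

def tile_image(img: Image.Image):
    rows, cols = best_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    thumbnail = img.resize((TILE, TILE))  # global view fed alongside the local tiles
    return tiles, thumbnail

tiles, thumb = tile_image(Image.new("RGB", (1920, 1080)))
print(len(tiles))  # 8 local tiles for a 16:9 image under a budget of 9
```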
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
CycleGuardian: A Framework for Automatic Respiratory Sound Classification Based on Improved Deep Clustering and Contrastive Learning
🖥 Github: https://github.com/chumingqian/CycleGuardian
📕 Paper: https://arxiv.org/abs/2502.00734v1
🌟 Dataset: https://paperswithcode.com/dataset/icbhi-respiratory-sound-database
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
VideoLLaMA is a series of multimodal large language models (MLLMs) designed for a wide range of image and video understanding tasks.
The models are suited to building general-purpose applications that analyze visual information across many problem settings.
🖐️ Results of the 7B model: DocVQA 94.9, MathVision 26.2, VideoMME 66.2/70.3, MLVU 73.0
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
ICASSP 2025 · Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau ·
Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction in word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.
Paper: https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf
Code: https://github.com/khanld/chunkformer
Datasets: Common Voice, VIVOS
Notes: Ranked #1 on Speech Recognition on VIVOS
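To make the chunk-wise idea concrete, here is a minimal sketch (illustrative only, not the authors' ChunkFormer code; the chunk size, right-context size, and the stand-in encoder are assumptions): a long feature sequence is encoded chunk by chunk, each chunk seeing only a small amount of look-ahead, so memory stays bounded regardless of audio length.

```python
# Hedged sketch of chunk-wise encoding with a limited right context.
import torch
import torch.nn as nn

def encode_in_chunks(features, encoder, chunk=64, right_context=16):
    """features: (T, D) acoustic features; encoder: any nn.Module on (1, L, D)."""
    T, D = features.shape
    outputs = []
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        ctx_end = min(end + right_context, T)          # limited look-ahead
        window = features[start:ctx_end].unsqueeze(0)  # (1, L, D)
        encoded = encoder(window)
        outputs.append(encoded[0, : end - start])      # keep only the chunk itself
    return torch.cat(outputs, dim=0)                   # (T, D_out)

# toy usage with a stand-in encoder (a real system would use a Conformer block)
enc = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
feats = torch.randn(10_000, 80)                        # a very long utterance
out = encode_in_chunks(feats, enc)
print(out.shape)                                       # torch.Size([10000, 256])
```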
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Detecting Backdoor Samples in Contrastive Language Image Pretraining
3 Feb 2025 · Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey ·
Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks, where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns about the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than those of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs.
Paper: https://arxiv.org/pdf/2502.01385v1.pdf
Code: https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples
Datasets: Conceptual Captions, CC12M, RedCaps
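The detection recipe can be sketched with an off-the-shelf density-based local outlier detector on CLIP image embeddings (LOF here is a simple stand-in for the density-ratio detectors studied in the paper; the neighborhood size and flagging quantile are assumptions):

```python
# Hedged sketch: flag training samples whose CLIP-embedding neighborhoods are unusually sparse.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def flag_backdoor_candidates(embeddings: np.ndarray, k: int = 16, quantile: float = 0.01):
    """embeddings: (N, D) L2-normalized CLIP image features."""
    lof = LocalOutlierFactor(n_neighbors=k, metric="cosine")
    lof.fit(embeddings)
    scores = -lof.negative_outlier_factor_          # higher = sparser local neighborhood
    threshold = np.quantile(scores, 1 - quantile)   # flag the top `quantile` fraction
    return np.where(scores >= threshold)[0], scores

# toy usage on random features standing in for real CLIP embeddings
emb = np.random.randn(10_000, 512).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
suspects, scores = flag_backdoor_candidates(emb)
print(len(suspects))  # roughly 1% of the dataset flagged for inspection
```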
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Efficient Reasoning with Hidden Thinking
31 Jan 2025 · Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu ·
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that performs CoT reasoning in a hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design the corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.
Paper: https://arxiv.org/pdf/2501.19201v1.pdf
Code: https://github.com/shawnricecake/heima
Datasets: MMBench, MM-Vet, MathVista, MMStar, HallusionBench
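Conceptually, the compress-then-expand step looks like the sketch below (module names, shapes, and the pooling/attention choices are illustrative assumptions, not the Heima implementation): a verbose CoT is pooled into a single thinking-token embedding, and a small decoder expands it back into a variable-length sequence of states.

```python
# Hedged conceptual sketch of a "thinking token" bottleneck.
import torch
import torch.nn as nn

class ThoughtCompressor(nn.Module):
    """Pools the hidden states of a verbose CoT into one latent vector."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, cot_hidden):                   # (T_cot, d_model)
        pooled = cot_hidden.mean(dim=0)              # condense the whole CoT
        return self.proj(pooled)                     # single thinking-token embedding

class ThoughtExpander(nn.Module):
    """Maps the thinking token back to a variable-length sequence of states."""
    def __init__(self, d_model=1024, max_len=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, thinking_token):               # (d_model,)
        kv = thinking_token.view(1, 1, -1)           # length-1 memory
        out, _ = self.attn(self.queries, kv, kv)     # (1, max_len, d_model)
        return out.squeeze(0)                        # (max_len, d_model)

cot = torch.randn(200, 1024)                         # hidden states of a long CoT
token = ThoughtCompressor()(cot)
reconstructed = ThoughtExpander()(token)
print(token.shape, reconstructed.shape)              # torch.Size([1024]) torch.Size([64, 1024])
```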
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Data Formulator 2: Iteratively Creating Rich Visualizations with AI
28 Aug 2024 · Chenglong Wang, Bongshin Lee, Steven Drucker, Dan Marshall, Jianfeng Gao ·
To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification to achieve their goals. To do so, analysts need not only proficiency in data transformation and visualization tools but also effort to manage a branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs' code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic for both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system that addresses these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformations are delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don't need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.
Paper: https://arxiv.org/pdf/2408.16119v1.pdf
Code: https://github.com/microsoft/data-formulator
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
5 Aug 2024 · Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak ·
Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference, and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets.
Paper: https://arxiv.org/pdf/2408.02545v1.pdf
Code: https://github.com/intellabs/ragfoundry
Datasets: TriviaQA, PubMedQA
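The data-augmentation step at the core of such a workflow can be sketched as follows (a toy lexical retriever and prompt template chosen for illustration; RAG Foundry's own configuration files and APIs may differ): each question gets its top-k retrieved passages folded into the training prompt.

```python
# Hedged sketch of building a retrieval-augmented training example.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    answer: str

def retrieve(question, corpus, k=3):
    """Toy lexical retriever: rank passages by word overlap with the question."""
    q_tokens = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return scored[:k]

def augment(example, corpus, k=3):
    """Build a retrieval-augmented prompt suitable for fine-tuning or inference."""
    context = "\n".join(f"- {p}" for p in retrieve(example.question, corpus, k))
    prompt = f"Context:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + example.answer}

corpus = ["Paris is the capital of France.", "The Nile is a river in Africa."]
print(augment(Example("What is the capital of France?", "Paris"), corpus)["prompt"])
```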
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
🔬MedRAX: A groundbreaking AI agent designed for medical tasks!
What is MedRAX?
MedRAX is the first general-purpose AI agent that combines state-of-the-art chest X-ray analysis tools and multimodal large language models into a single framework that can dynamically reason about complex medical queries without additional training.
🎯 What is so good about MedRAX?
While specialized AI models excel at specific chest X-ray tasks, they often struggle with complex analysis and can produce inaccurate recommendations. Many healthcare professionals want a single, robust system that can handle complex queries while maintaining accuracy. MedRAX aims to be that tool.
🛠 Integrated tools:
- Visual QA: CheXagent and LLaVA-Med
- Segmentation: MedSAM and ChestX-Det
- Report generation: CheXpert Plus
- Classification: TorchXRayVision
- Grounding: Maira-2
- Synthetic data: RoentGen
💡 Key Features:
- Seamless integration of specialized medical tools with multimodal reasoning based on large language models.
- Dynamic Orchestration: Intelligently selects and coordinates tools for complex queries (see the sketch after the links below).
- Clinical Focus: Designed for real medical processes.
📊 ChestAgentBench:
The developers also released ChestAgentBench, a comprehensive medical agent benchmark built on 675 expert-reviewed clinical cases and including 2,500 complex medical queries across 7 categories.
🎉 The results speak for themselves:
- 63.1% accuracy on ChestAgentBench
- State-of-the-art performance on CheXbench
- Outperforms both general-purpose and specialized medical models
▪️ Paper : https://arxiv.org/abs/2502.02673
▪️ Github : https://github.com/bowang-lab/MedRAX
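The orchestration idea mentioned above can be pictured with a tiny routing sketch (tool names and keyword rules are purely illustrative placeholders, not MedRAX internals; the real agent lets the LLM choose and chain tools dynamically):

```python
# Hedged sketch of dispatching a medical query to a specialist tool.
from typing import Callable, Dict

def segment_lungs(image_path: str) -> str:
    return f"segmentation mask for {image_path}"        # placeholder tool

def generate_report(image_path: str) -> str:
    return f"draft radiology report for {image_path}"   # placeholder tool

def classify_findings(image_path: str) -> str:
    return f"finding probabilities for {image_path}"    # placeholder tool

TOOLS: Dict[str, Callable[[str], str]] = {
    "segment": segment_lungs,
    "report": generate_report,
    "classify": classify_findings,
}

def route(query: str, image_path: str) -> str:
    """Very small keyword router; a real agent would let the LLM pick and chain tools."""
    for keyword, tool in TOOLS.items():
        if keyword in query.lower():
            return tool(image_path)
    return generate_report(image_path)                  # sensible default

print(route("Please segment the lungs", "cxr_001.png"))
```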
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
RT-DETRv2 is a new version of RT-DETR, an alternative to YOLO. RT-DETRv2 has received a number of improvements: increased flexibility, usability and performance.
The key change is a modification of the deformable attention module in the decoder. RT-DETRv2 proposes setting a different number of sampling points for features of different scales, which allows more efficient extraction of multi-scale features and makes the model more adaptive to diverse detection scenarios.
To make the model more practical, the authors replaced the DETR-specific grid_sample operator with an optional discrete_sample operator that rounds the predicted sampling offsets, speeding up inference without significant loss of accuracy. RT-DETRv2 is trained with a dynamic data augmentation strategy: in the early stages, more intensive augmentation helps the model generalize better, while in the later stages the level of augmentation is reduced, allowing the model to adapt to the target domain.
The new version also adjusts hyperparameters to the scale of the model: for example, the learning rate is increased for ResNet18 and decreased for larger backbones such as ResNet101.
RT-DETRv2 was tested on the COCO dataset, where the model showed an improvement in the AP metric by 0.3–1.4 points compared to RT-DETR, while maintaining high performance. For example, RT-DETRv2-S with the ResNet18 architecture achieved an AP of 47.9, which is 1.4 points higher than RT-DETR-S.
Scripts for fine-tuning RT-DETRv2 with Trainer or Accelerate are hosted in the Hugging Face repository on GitHub, and a simple inference notebook can be run locally or in Google Colab.
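The grid_sample vs. discrete sampling trade-off can be illustrated with a small sketch (a generic reimplementation of the idea, not the official RT-DETRv2 operator): bilinear sampling interpolates at fractional locations, while the discrete variant rounds the predicted locations to integer pixels and gathers directly.

```python
# Hedged sketch: bilinear grid_sample vs. rounding offsets to integer pixels.
import torch
import torch.nn.functional as F

def bilinear_sample(feat, points):
    """feat: (1, C, H, W); points: (N, 2) normalized xy in [-1, 1]."""
    grid = points.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    return F.grid_sample(feat, grid, align_corners=False).squeeze(-1)  # (1, C, N)

def discrete_sample(feat, points):
    """Round normalized points to integer pixel indices, then index directly."""
    _, C, H, W = feat.shape
    xs = ((points[:, 0] + 1) / 2 * W - 0.5).round().clamp(0, W - 1).long()
    ys = ((points[:, 1] + 1) / 2 * H - 0.5).round().clamp(0, H - 1).long()
    return feat[0, :, ys, xs].unsqueeze(0)                # (1, C, N)

feat = torch.randn(1, 256, 32, 32)
pts = torch.rand(8, 2) * 2 - 1                            # random normalized offsets
print(bilinear_sample(feat, pts).shape, discrete_sample(feat, pts).shape)
```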
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
23 Jan 2025 · Tao Liu, Kai Wang, Senmao Li, Joost Van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng ·
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments.
Paper: https://arxiv.org/pdf/2501.13554v2.pdf
Code: https://github.com/byliutao/1prompt1story
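The singular-value reweighting step can be sketched in isolation (illustrative linear algebra only; the scaling schedule and which components get boosted are assumptions, not the released code): decompose the concatenated prompt embedding with an SVD and rescale its singular values before feeding it to the diffusion model.

```python
# Hedged sketch of singular-value reweighting on a prompt embedding matrix.
import torch

def reweight_prompt_embedding(prompt_emb, boost=1.2, damp=0.8, keep=4):
    """prompt_emb: (T, D) text-encoder output for the concatenated story prompt."""
    U, S, Vh = torch.linalg.svd(prompt_emb, full_matrices=False)
    scale = torch.full_like(S, damp)
    scale[:keep] = boost                       # emphasize the leading components
    return U @ torch.diag(S * scale) @ Vh      # reweighted (T, D) embedding

emb = torch.randn(77, 768)                     # e.g. a CLIP-style prompt embedding
print(reweight_prompt_embedding(emb).shape)    # torch.Size([77, 768])
```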
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT