
🔥 VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

VideoLLaMA is a series of multimodal large language models (MLLMs) designed for a range of image and video understanding tasks.

🌟 The models can process text, images, and video.

They are suited to general-purpose applications that analyze visual information.

🖐️ Results of 7B model: DocVQA: 94.9, MathVision: 26.2, VideoMME: 66.2/70.3, MLVU: 73.0

🤏 Results of 2B model (for mobile devices): MMMU: 45.3, VideoMME: 59.6/63.4

🔐 Licensing: Apache-2.0

🔳 Github: https://github.com/DAMO-NLP-SG/VideoLLaMA3

🔳 Image Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3-Image

🔳 Video Demo: https://huggingface.co/spaces/lixin4ever/VideoLLaMA3


👍2🙏1

1.84K views05:20

ML Research Hub

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

ICASSP 2025 · Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau ·

Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction in word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.
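The padding overhead mentioned in the abstract can be made concrete with a small calculation. The sketch below compares how many frames a batch processes when every utterance is padded to the longest one versus when each utterance is cut into fixed-size chunks. It is illustrative only: the utterance lengths, the chunk size, and the function names are assumptions, and this is not ChunkFormer's masked batching implementation.

```python
import math

def padded_frames(lengths):
    """Frames processed when every utterance is padded to the longest one."""
    return max(lengths) * len(lengths)

def chunked_frames(lengths, chunk_size):
    """Frames processed when each utterance is split into fixed-size chunks.

    Only the last chunk of each utterance can contain padding.
    """
    return sum(math.ceil(n / chunk_size) * chunk_size for n in lengths)

# Hypothetical utterance lengths in frames: one long recording, three short ones.
lengths = [6000, 400, 250, 90]
chunk_size = 64  # hypothetical chunk size

real = sum(lengths)
padded = padded_frames(lengths)
chunked = chunked_frames(lengths, chunk_size)

print(f"real frames:   {real}")
print(f"padded batch:  {padded} ({padded / real:.2f}x real)")
print(f"chunked batch: {chunked} ({chunked / real:.2f}x real)")
```

With one long recording in the batch, padding to the maximum length dominates the cost, while chunking keeps the processed frame count close to the real one.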

Paper: https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf

Code: https://github.com/khanld/chunkformer

Datasets: Common Voice - VIVOS

Notes: Ranked #1 on Speech Recognition on VIVOS

👍2

2.12K viewsedited 07:19

ML Research Hub

Please add your friends and your teachers.

❤1

1.79K views07:26

ML Research Hub

Detecting Backdoor Samples in Contrastive Language Image Pretraining

3 Feb 2025 · Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey ·

Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than those of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs.
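The idea of flagging samples whose local neighborhood is unusually sparse can be illustrated with a simplified density-ratio score. The sketch below is a toy, pure-Python variant of a local outlier score on hypothetical 2-D points. It is not the detector from the paper or its repository, and the data, `k`, and function names are assumptions.

```python
import math

def knn(points, i, k):
    """Return (indices, mean distance) of the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )[:k]
    idx = [j for _, j in dists]
    mean_d = sum(d for d, _ in dists) / k
    return idx, mean_d

def local_outlier_scores(points, k=3):
    """Ratio of each point's mean kNN distance to that of its neighbours.

    Scores near 1 mean the point is about as densely surrounded as its
    neighbours; scores well above 1 mean its neighbourhood is sparser.
    """
    neigh = [knn(points, i, k) for i in range(len(points))]
    scores = []
    for idx, mean_d in neigh:
        neighbour_mean = sum(neigh[j][1] for j in idx) / k
        scores.append(mean_d / neighbour_mean)
    return scores

# Toy "embeddings": a tight cluster plus one isolated point (hypothetical data).
clean = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
suspect = [(2.0, 2.0)]
scores = local_outlier_scores(clean + suspect, k=3)
flagged = max(range(len(scores)), key=scores.__getitem__)

print("scores:", [round(s, 2) for s in scores])
print("sparsest neighbourhood: index", flagged)
```

In practice such a score would be computed on learned embeddings with an approximate nearest-neighbour index. The brute-force search here is only suitable for a handful of points.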

Paper: https://arxiv.org/pdf/2502.01385v1.pdf

Code: https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples

Datasets: Conceptual Captions - CC12M - RedCaps

👍1

1.8K views15:34

ML Research Hub

Efficient Reasoning with Hidden Thinking

31 Jan 2025 · Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu ·

Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs at hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design a corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.

Paper: https://arxiv.org/pdf/2501.19201v1.pdf

Code: https://github.com/shawnricecake/heima

Datasets: MMBench - MM-Vet - MathVista - MMStar - HallusionBench

❤2👍1

1.97K views18:13

ML Research Hub

Data Formulator 2: Iteratively Creating Rich Visualizations with AI

28 Aug 2024 · Chenglong Wang, Bongshin Lee, Steven Drucker, Dan Marshall, Jianfeng Gao ·

To create rich visualizations, data analysts often need to iterate back and forth among data processing and chart specification to achieve their goals. To achieve this, analysts need not only proficiency in data transformation and visualization tools but also efforts to manage the branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs' code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic for both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system to address these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformations are delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don't need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.

Paper: https://arxiv.org/pdf/2408.16119v1.pdf

Code: https://github.com/microsoft/data-formulator

👍3

1.7K viewsedited 08:05

ML Research Hub

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

5 Aug 2024 · Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak ·

Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets.

Paper: https://arxiv.org/pdf/2408.02545v1.pdf

Code: https://github.com/intellabs/ragfoundry

Datasets: TriviaQA - PubMedQA

👍4

2.02K viewsedited 08:08

ML Research Hub


🔬 MedRAX: A groundbreaking AI agent designed for medical tasks!

What is MedRAX?

MedRAX is the first general-purpose AI agent that combines state-of-the-art chest X-ray analysis tools and multimodal large language models into a single framework that can dynamically reason about complex medical queries without additional training.

🎯 What is so good about MedRAX?

While specialized AI models excel at specific chest X-ray tasks, they often struggle with complex analysis and can produce inaccurate recommendations. Many healthcare professionals want a single, robust system that can handle complex queries while maintaining accuracy. MedRAX aims to be that tool.

🛠 Integrated tools:

- Visual question answering (VQA): CheXagent and LLaVA-Med
- Segmentation: MedSAM and ChestX-Det
- Report generation: CheXpert Plus
- Classification: TorchXRayVision
- Grounding: Maira-2
- Synthetic data: RoentGen

💡 Key Features:

- Seamless integration of specialized medical tools with multimodal reasoning based on large language models.
- Dynamic Orchestration: Intelligently select and coordinate tools for complex queries.
- Clinical Focus: Designed for real medical processes.

📊 ChestAgentBench:

The developers also released ChestAgentBench, a comprehensive medical agent benchmark built on 675 expert-reviewed clinical cases and including 2,500 complex medical queries across 7 categories.

🎉 The results speak for themselves:
- 63.1% accuracy on ChestAgentBench
- State-of-the-art performance on CheXbench
- Outperforms both general-purpose and specialized medical models

▪️ Paper : https://arxiv.org/abs/2502.02673
▪️ Github : https://github.com/bowang-lab/MedRAX

👍2❤1🙏1

2K views15:19

ML Research Hub


🌟 RT-DETRv2: An Improved CV Model for Real-Time Object Detection.

RT-DETRv2 is a new version of RT-DETR, an alternative to YOLO. RT-DETRv2 has received a number of improvements: increased flexibility, usability and performance.

The key change is the modification of the deformable attention module in the decoder. RT-DETRv2 proposes to set a different number of sampling points for features of different scales. This allows for more efficient extraction of multi-scale features, making it more adaptive to multiple detection scenarios.

To make the model more practical to deploy, the authors replaced the DETR-specific grid_sample operator with an optional discrete_sample operator, which rounds the predicted sampling offsets instead of interpolating, speeding up the process without significant loss of accuracy.
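The difference between interpolated and rounded sampling can be shown on a tiny feature map. The sketch below is a scalar toy on a plain 2-D list with pixel coordinates, assuming in-bounds sampling locations. The function names mirror the operators mentioned above for readability, but this is not the RT-DETRv2 code.

```python
import math

def bilinear_sample(feat, x, y):
    """Interpolate feat at fractional (x, y) from its four neighbouring cells."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def discrete_sample(feat, x, y):
    """Round (x, y) to the nearest cell and read it: one lookup, no interpolation."""
    h, w = len(feat), len(feat[0])
    xi = min(max(round(x), 0), w - 1)
    yi = min(max(round(y), 0), h - 1)
    return feat[yi][xi]

# Hypothetical 2x2 feature map and a fractional sampling location.
feat = [[0.0, 1.0],
        [2.0, 3.0]]
x, y = 0.25, 0.75

interpolated = bilinear_sample(feat, x, y)
rounded = discrete_sample(feat, x, y)
print("bilinear:", interpolated)
print("discrete:", rounded)
```

The rounded version reads one cell instead of four and needs no interpolation weights, which is where the speed-up comes from. The price is a small quantization error in the sampled value.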

RT-DETRv2 is trained using a dynamic data augmentation strategy. In the early stages, more intensive augmentation methods are used to help the model generalize better to the data. In the later stages, the level of augmentation is reduced, allowing the model to adapt to the target domain.
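A staged augmentation policy of this kind can be expressed as a simple function of the epoch. The sketch below is a minimal illustration. The transform names, the number of light epochs, and the function name are all hypothetical and are not taken from the RT-DETRv2 training configuration.

```python
# Hypothetical transform names, used only as labels in this sketch.
STRONG_AUGMENTATIONS = ["photometric_distort", "zoom_out", "random_crop", "horizontal_flip"]
LIGHT_AUGMENTATIONS = ["horizontal_flip"]

def augmentations_for_epoch(epoch, total_epochs, light_epochs=2):
    """Return the augmentation list for a 0-indexed epoch.

    Strong augmentation is used for most of training. During the final
    `light_epochs` epochs only light augmentation is applied, so the model
    sees data closer to the target distribution.
    """
    if epoch >= total_epochs - light_epochs:
        return LIGHT_AUGMENTATIONS
    return STRONG_AUGMENTATIONS

total_epochs = 10  # hypothetical training length
for epoch in range(total_epochs):
    print(epoch, augmentations_for_epoch(epoch, total_epochs))
```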

The new version tunes hyperparameters according to the scale of the model. For example, the learning rate is increased for ResNet18, while it is decreased for larger models such as ResNet101.

RT-DETRv2 was tested on the COCO dataset, where the model showed an improvement in the AP metric of 0.3–1.4 points compared to RT-DETR, while maintaining high performance. For example, RT-DETRv2-S with the ResNet18 backbone achieved an AP of 47.9, which is 1.4 points higher than RT-DETR-S.

Scripts for fine-tuning RT-DETRv2 with Trainer or Accelerate are hosted in the Hugging Face repository on GitHub, and a simple inference notebook can be run locally or in Google Colab.

📌 Licensing: Apache 2.0

🟡 Article

🟡 Arxiv

🟡 Google Colab Inference

🖥 Github


👍1

2.12K views07:11

ML Research Hub


One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

23 Jan 2025 · Tao Liu, Kai Wang, Senmao Li, Joost Van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng ·

Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments.

Paper: https://arxiv.org/pdf/2501.13554v2.pdf

Code: https://github.com/byliutao/1prompt1story

👍2

1.89K viewsedited 08:37

ML Research Hub

One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation

4 Feb 2025 · Jianze Li, JieZhang Cao, Yong Guo, Wenbo Li, Yulun Zhang ·

Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in one sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods.

Paper: https://arxiv.org/pdf/2502.01993v1.pdf

Code: https://github.com/jianzeli-114/fluxsr

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT

👍4❤1

1.91K views10:12

ML Research Hub

SGLang: Efficient Execution of Structured Language Model Programs

12 Dec 2023 · Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng ·

Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.
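The benefit of reusing the KV cache across requests that share a prefix can be sketched with a small prefix tree. The toy below only counts how many leading tokens of each request were already seen. It stores no attention state, uses whitespace-split words as stand-in tokens, and is not SGLang's RadixAttention implementation. The class and method names are hypothetical.

```python
class PrefixCache:
    """Toy prefix tree over token sequences (one token per edge)."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, reused, on_cached_path = self.root, 0, True
        for tok in tokens:
            if on_cached_path and tok in node:
                reused += 1  # this token's KV entries could be reused
            else:
                on_cached_path = False
                node.setdefault(tok, {})  # new branch: would need fresh computation
            node = node[tok]
        return reused

# Hypothetical requests sharing a system prompt.
system = "You are a helpful assistant .".split()
requests = [
    system + "Translate hello to French".split(),
    system + "Translate hello to German".split(),
    system + "Summarize this paragraph".split(),
]

cache = PrefixCache()
for req in requests:
    reused = cache.match_and_insert(req)
    print(f"{reused}/{len(req)} tokens matched a cached prefix")
```

The first request finds nothing cached, while later requests skip recomputation for the shared system prompt and any further common tokens.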

Paper: https://arxiv.org/pdf/2312.07104v2.pdf

Code: https://github.com/sgl-project/sglang

Datasets: MMLU - HellaSwag - LLaVA-Bench


👍2❤1

2K views11:09

ML Research Hub

Forwarded from Machine Learning with Python

Some people asked me about a resource for learning about Transformers.

Here's a good one I am sharing again -- it covers just about everything you need to know.

brandonrohrer.com/transformers

Amazing stuff. It's totally worth your weekend.

https://news.1rj.ru/str/CodeProgrammer

👍5

1.71K views13:13
