DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
13 Dec 2024 · Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ·
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Paper: https://arxiv.org/pdf/2412.10302v1.pdf
Code: https://github.com/deepseek-ai/deepseek-vl2
Datasets: RefCOCO, TextVQA, MMBench, DocVQA
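The dynamic tiling idea can be pictured with a minimal sketch (illustrative only, not the DeepSeek-VL2 implementation; the 384-pixel tile size, the tile budget, and the grid-selection rule are assumptions): the image's aspect ratio picks a rows×cols grid of fixed-size tiles, and a global thumbnail is kept alongside the local tiles.

```python
# Hedged sketch of dynamic tiling for high-resolution images.
from PIL import Image

TILE = 384          # assumed tile resolution
MAX_TILES = 9       # assumed budget of local tiles

def best_grid(width, height, max_tiles=MAX_TILES):
    """Choose (rows, cols) whose aspect ratio is closest to the image's; prefer more tiles on ties."""
    target = width / height
    candidates = [(r, c) for r in range(1, max_tiles + 1)
                  for c in range(1, max_tiles + 1) if r * c <= max_tiles]
    return min(candidates, key=lambda rc: (abs(rc[1] / rc[0] - target), -rc[0] * rc[1]))

def tile_image(img: Image.Image):
    rows, cols = best_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    thumbnail = img.resize((TILE, TILE))  # global view fed alongside the local tiles
    return tiles, thumbnail

tiles, thumb = tile_image(Image.new("RGB", (1920, 1080)))
print(len(tiles))  # 8 local tiles for a 16:9 image under a budget of 9
```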
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
CycleGuardian: A Framework for Automatic Respiratory Sound Classification Based on Improved Deep Clustering and Contrastive Learning
🖥 Github: https://github.com/chumingqian/CycleGuardian
📕 Paper: https://arxiv.org/abs/2502.00734v1
🌟 Dataset: https://paperswithcode.com/dataset/icbhi-respiratory-sound-database
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
VideoLLaMA is a series of multimodal large language models (MLLMs) designed for a wide range of image and video understanding tasks.
The models are suited to building general-purpose applications that analyze visual information across many problem settings.
🖐️ Results of the 7B model: DocVQA 94.9, MathVision 26.2, VideoMME 66.2/70.3, MLVU 73.0
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
ICASSP 2025 · Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau ·
Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction in word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.
Paper: https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf
Code: https://github.com/khanld/chunkformer
Datasets: Common Voice, VIVOS
Notes: Ranked #1 on Speech Recognition on VIVOS
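To make the chunk-wise idea concrete, here is a minimal sketch (illustrative only, not the authors' ChunkFormer code; the chunk size, right-context size, and the stand-in encoder are assumptions): a long feature sequence is encoded chunk by chunk, each chunk seeing only a small amount of look-ahead, so memory stays bounded regardless of audio length.

```python
# Hedged sketch of chunk-wise encoding with a limited right context.
import torch
import torch.nn as nn

def encode_in_chunks(features, encoder, chunk=64, right_context=16):
    """features: (T, D) acoustic features; encoder: any nn.Module on (1, L, D)."""
    T, D = features.shape
    outputs = []
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        ctx_end = min(end + right_context, T)          # limited look-ahead
        window = features[start:ctx_end].unsqueeze(0)  # (1, L, D)
        encoded = encoder(window)
        outputs.append(encoded[0, : end - start])      # keep only the chunk itself
    return torch.cat(outputs, dim=0)                   # (T, D_out)

# toy usage with a stand-in encoder (a real system would use a Conformer block)
enc = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
feats = torch.randn(10_000, 80)                        # a very long utterance
out = encode_in_chunks(feats, enc)
print(out.shape)                                       # torch.Size([10000, 256])
```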
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Detecting Backdoor Samples in Contrastive Language Image Pretraining
3 Feb 2025 · Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey ·
Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks, where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns about the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than those of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs.
Paper: https://arxiv.org/pdf/2502.01385v1.pdf
Code: https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples
Datasets: Conceptual Captions, CC12M, RedCaps
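The detection recipe can be sketched with an off-the-shelf density-based local outlier detector on CLIP image embeddings (LOF here is a simple stand-in for the density-ratio detectors studied in the paper; the neighborhood size and flagging quantile are assumptions):

```python
# Hedged sketch: flag training samples whose CLIP-embedding neighborhoods are unusually sparse.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def flag_backdoor_candidates(embeddings: np.ndarray, k: int = 16, quantile: float = 0.01):
    """embeddings: (N, D) L2-normalized CLIP image features."""
    lof = LocalOutlierFactor(n_neighbors=k, metric="cosine")
    lof.fit(embeddings)
    scores = -lof.negative_outlier_factor_          # higher = sparser local neighborhood
    threshold = np.quantile(scores, 1 - quantile)   # flag the top `quantile` fraction
    return np.where(scores >= threshold)[0], scores

# toy usage on random features standing in for real CLIP embeddings
emb = np.random.randn(10_000, 512).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
suspects, scores = flag_backdoor_candidates(emb)
print(len(suspects))  # roughly 1% of the dataset flagged for inspection
```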
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Efficient Reasoning with Hidden Thinking
31 Jan 2025 · Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu ·
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that performs CoT reasoning in a hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design the corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.
Paper: https://arxiv.org/pdf/2501.19201v1.pdf
Code: https://github.com/shawnricecake/heima
Datasets: MMBench, MM-Vet, MathVista, MMStar, HallusionBench
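Conceptually, the compress-then-expand step looks like the sketch below (module names, shapes, and the pooling/attention choices are illustrative assumptions, not the Heima implementation): a verbose CoT is pooled into a single thinking-token embedding, and a small decoder expands it back into a variable-length sequence of states.

```python
# Hedged conceptual sketch of a "thinking token" bottleneck.
import torch
import torch.nn as nn

class ThoughtCompressor(nn.Module):
    """Pools the hidden states of a verbose CoT into one latent vector."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, cot_hidden):                   # (T_cot, d_model)
        pooled = cot_hidden.mean(dim=0)              # condense the whole CoT
        return self.proj(pooled)                     # single thinking-token embedding

class ThoughtExpander(nn.Module):
    """Maps the thinking token back to a variable-length sequence of states."""
    def __init__(self, d_model=1024, max_len=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, thinking_token):               # (d_model,)
        kv = thinking_token.view(1, 1, -1)           # length-1 memory
        out, _ = self.attn(self.queries, kv, kv)     # (1, max_len, d_model)
        return out.squeeze(0)                        # (max_len, d_model)

cot = torch.randn(200, 1024)                         # hidden states of a long CoT
token = ThoughtCompressor()(cot)
reconstructed = ThoughtExpander()(token)
print(token.shape, reconstructed.shape)              # torch.Size([1024]) torch.Size([64, 1024])
```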
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
Data Formulator 2: Iteratively Creating Rich Visualizations with AI
28 Aug 2024 · Chenglong Wang, Bongshin Lee, Steven Drucker, Dan Marshall, Jianfeng Gao ·
To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification to achieve their goals. To do so, analysts need not only proficiency in data transformation and visualization tools but also effort to manage a branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs' code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic for both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system that addresses these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformations are delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don't need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.
Paper: https://arxiv.org/pdf/2408.16119v1.pdf
Code: https://github.com/microsoft/data-formulator
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
5 Aug 2024 · Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak ·
Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference, and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets.
Paper: https://arxiv.org/pdf/2408.02545v1.pdf
Code: https://github.com/intellabs/ragfoundry
Datasets: TriviaQA, PubMedQA
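The data-augmentation step at the core of such a workflow can be sketched as follows (a toy lexical retriever and prompt template chosen for illustration; RAG Foundry's own configuration files and APIs may differ): each question gets its top-k retrieved passages folded into the training prompt.

```python
# Hedged sketch of building a retrieval-augmented training example.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    answer: str

def retrieve(question, corpus, k=3):
    """Toy lexical retriever: rank passages by word overlap with the question."""
    q_tokens = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return scored[:k]

def augment(example, corpus, k=3):
    """Build a retrieval-augmented prompt suitable for fine-tuning or inference."""
    context = "\n".join(f"- {p}" for p in retrieve(example.question, corpus, k))
    prompt = f"Context:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + example.answer}

corpus = ["Paris is the capital of France.", "The Nile is a river in Africa."]
print(augment(Example("What is the capital of France?", "Paris"), corpus)["prompt"])
```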
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
🔬MedRAX: A groundbreaking AI agent designed for medical tasks!
What is MedRAX?
MedRAX is the first general-purpose AI agent that combines state-of-the-art chest X-ray analysis tools and multimodal large language models into a single framework that can dynamically reason about complex medical queries without additional training.
🎯 What is so good about MedRAX?
While specialized AI models excel at specific chest X-ray tasks, they often struggle with complex analysis and can produce inaccurate recommendations. Many healthcare professionals want a single, robust system that can handle complex queries while maintaining accuracy. MedRAX aims to be that tool.
🛠 Integrated tools:
- Visual QA: CheXagent and LLaVA-Med
- Segmentation: MedSAM and ChestX-Det
- Report generation: CheXpert Plus
- Classification: TorchXRayVision
- Grounding: Maira-2
- Synthetic data: RoentGen
💡 Key Features:
- Seamless integration of specialized medical tools with multimodal reasoning based on large language models.
- Dynamic Orchestration: Intelligently selects and coordinates tools for complex queries (see the sketch after the links below).
- Clinical Focus: Designed for real medical processes.
📊 ChestAgentBench:
The developers also released ChestAgentBench, a comprehensive medical agent benchmark built on 675 expert-reviewed clinical cases and including 2,500 complex medical queries across 7 categories.
🎉 The results speak for themselves:
- 63.1% accuracy on ChestAgentBench
- State-of-the-art performance on CheXbench
- Outperforms both general-purpose and specialized medical models
▪️ Paper : https://arxiv.org/abs/2502.02673
▪️ Github : https://github.com/bowang-lab/MedRAX
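The orchestration idea mentioned above can be pictured with a tiny routing sketch (tool names and keyword rules are purely illustrative placeholders, not MedRAX internals; the real agent lets the LLM choose and chain tools dynamically):

```python
# Hedged sketch of dispatching a medical query to a specialist tool.
from typing import Callable, Dict

def segment_lungs(image_path: str) -> str:
    return f"segmentation mask for {image_path}"        # placeholder tool

def generate_report(image_path: str) -> str:
    return f"draft radiology report for {image_path}"   # placeholder tool

def classify_findings(image_path: str) -> str:
    return f"finding probabilities for {image_path}"    # placeholder tool

TOOLS: Dict[str, Callable[[str], str]] = {
    "segment": segment_lungs,
    "report": generate_report,
    "classify": classify_findings,
}

def route(query: str, image_path: str) -> str:
    """Very small keyword router; a real agent would let the LLM pick and chain tools."""
    for keyword, tool in TOOLS.items():
        if keyword in query.lower():
            return tool(image_path)
    return generate_report(image_path)                  # sensible default

print(route("Please segment the lungs", "cxr_001.png"))
```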
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
RT-DETRv2 is a new version of RT-DETR, an alternative to YOLO. RT-DETRv2 has received a number of improvements: increased flexibility, usability and performance.
The key change is a modification of the deformable attention module in the decoder. RT-DETRv2 proposes setting a different number of sampling points for features of different scales, which allows more efficient extraction of multi-scale features and makes the model more adaptive to diverse detection scenarios.
To make the model more practical, the authors replaced the DETR-specific grid_sample operator with an optional discrete_sample operator that rounds the predicted sampling offsets, speeding up inference without significant loss of accuracy. RT-DETRv2 is trained with a dynamic data augmentation strategy: in the early stages, more intensive augmentation helps the model generalize better, while in the later stages the level of augmentation is reduced, allowing the model to adapt to the target domain.
The new version also adjusts hyperparameters to the scale of the model: for example, the learning rate is increased for ResNet18 and decreased for larger backbones such as ResNet101.
RT-DETRv2 was tested on the COCO dataset, where the model showed an improvement in the AP metric by 0.3–1.4 points compared to RT-DETR, while maintaining high performance. For example, RT-DETRv2-S with the ResNet18 architecture achieved an AP of 47.9, which is 1.4 points higher than RT-DETR-S.
Scripts for fine-tuning RT-DETRv2 with Trainer or Accelerate are hosted in the Hugging Face repository on GitHub, and a simple inference notebook can be run locally or in Google Colab.
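The grid_sample vs. discrete sampling trade-off can be illustrated with a small sketch (a generic reimplementation of the idea, not the official RT-DETRv2 operator): bilinear sampling interpolates at fractional locations, while the discrete variant rounds the predicted locations to integer pixels and gathers directly.

```python
# Hedged sketch: bilinear grid_sample vs. rounding offsets to integer pixels.
import torch
import torch.nn.functional as F

def bilinear_sample(feat, points):
    """feat: (1, C, H, W); points: (N, 2) normalized xy in [-1, 1]."""
    grid = points.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    return F.grid_sample(feat, grid, align_corners=False).squeeze(-1)  # (1, C, N)

def discrete_sample(feat, points):
    """Round normalized points to integer pixel indices, then index directly."""
    _, C, H, W = feat.shape
    xs = ((points[:, 0] + 1) / 2 * W - 0.5).round().clamp(0, W - 1).long()
    ys = ((points[:, 1] + 1) / 2 * H - 0.5).round().clamp(0, H - 1).long()
    return feat[0, :, ys, xs].unsqueeze(0)                # (1, C, N)

feat = torch.randn(1, 256, 32, 32)
pts = torch.rand(8, 2) * 2 - 1                            # random normalized offsets
print(bilinear_sample(feat, pts).shape, discrete_sample(feat, pts).shape)
```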
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
23 Jan 2025 · Tao Liu, Kai Wang, Senmao Li, Joost Van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng ·
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments.
Paper: https://arxiv.org/pdf/2501.13554v2.pdf
Code: https://github.com/byliutao/1prompt1story
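The singular-value reweighting step can be sketched in isolation (illustrative linear algebra only; the scaling schedule and which components get boosted are assumptions, not the released code): decompose the concatenated prompt embedding with an SVD and rescale its singular values before feeding it to the diffusion model.

```python
# Hedged sketch of singular-value reweighting on a prompt embedding matrix.
import torch

def reweight_prompt_embedding(prompt_emb, boost=1.2, damp=0.8, keep=4):
    """prompt_emb: (T, D) text-encoder output for the concatenated story prompt."""
    U, S, Vh = torch.linalg.svd(prompt_emb, full_matrices=False)
    scale = torch.full_like(S, damp)
    scale[:keep] = boost                       # emphasize the leading components
    return U @ torch.diag(S * scale) @ Vh      # reweighted (T, D) embedding

emb = torch.randn(77, 768)                     # e.g. a CLIP-style prompt embedding
print(reweight_prompt_embedding(emb).shape)    # torch.Size([77, 768])
```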
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT