ML Research Hub – Telegram
ML Research Hub
32.7K subscribers
3.97K photos
225 videos
23 files
4.27K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Download Telegram
Accelerating Data Processing and Benchmarking of AI Models for Pathology

10 Feb 2025 · Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, Faisal Mahmood ·

Advances in foundation modeling have reshaped computational pathology. However, the increasing number of available #models and lack of standardized benchmarks make it increasingly complex to assess their strengths, limitations, and potential for further development. To address these challenges, we introduce a new suite of software tools for whole-slide image processing, foundation model benchmarking, and curated publicly available tasks. We anticipate that these resources will promote transparency, reproducibility, and continued progress in the field.


Paper: https://arxiv.org/pdf/2502.06750v1.pdf

Codes:
https://github.com/mahmoodlab/trident
https://github.com/mahmoodlab/patho-bench

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍1
LIMO: Less is More for Reasoning

5 Feb 2025 · Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, PengFei Liu ·

We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on #MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (#LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning


Paper: https://arxiv.org/pdf/2502.03387v1.pdf

Codes:
https://github.com/gair-nlp/limo
https://github.com/zhaoolee/garss

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍3
FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

7 Feb 2025 · Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo ·

DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, #FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability .


Paper: https://arxiv.org/pdf/2502.05179v1.pdf

Code: https://github.com/foundationvision/flashvideo

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍21
OmniParser for Pure Vision Based GUI Agent

1 Aug 2024 · Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen. To fill these gaps, we introduce \textsc{OmniParser}, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of #GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon denoscription dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. \textsc{#OmniParser} significantly improves GPT-4V's performance on ScreenSpot benchmark. And on #Mind2Web and AITW benchmark, \textsc{OmniParser} with screenshot only input #outperforms the GPT-4V baselines requiring additional information outside of screenshot.


Paper: https://arxiv.org/pdf/2408.00203v1.pdf

Code: https://github.com/microsoft/omniparser

Dataset: ScreenSpot

Note: Ranked #10 on Natural Language Visual Grounding on #ScreenSpot

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍1
PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

20 Jan 2025 · Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, Jiang Bian

Despite notable advancements in Retrieval-Augmented Generation (#RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge performing in logical reasoning from specialized corpora. To address this, we introduces PecIalized KnowledgE and Rationale Augmentation Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.


Paper: https://arxiv.org/pdf/2501.11551v2.pdf

Code: https://github.com/microsoft/pike-rag

Datasets: HotpotQA - 2WikiMultiHopQA

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍31
Enhance-A-Video: Better Generated Video for Free

11 Feb 2025 · Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.


Paper: https://arxiv.org/pdf/2502.07508v1.pdf

Code: https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
1👍1
Accelerating Data Processing and Benchmarking of AI Models for Pathology

10 Feb 2025 · Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, Faisal Mahmood

Advances in foundation modeling have reshaped computational pathology. However, the increasing number of available models and lack of standardized benchmarks make it increasingly complex to assess their strengths, limitations, and potential for further development. To address these challenges, we introduce a new suite of software tools for whole-slide image processing, foundation model benchmarking, and curated publicly available tasks. We anticipate that these resources will promote transparency, reproducibility, and continued progress in the field.

Paper: https://arxiv.org/pdf/2502.06750v1.pdf

Codes:
https://github.com/mahmoodlab/trident
https://github.com/mahmoodlab/patho-bench

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍1
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.


Paper: https://arxiv.org/pdf/2502.10248v1.pdf

Codes:
https://github.com/phixion/phixion
https://github.com/stepfun-ai/step-video-t2v

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍3
KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG

13 Feb 2025 · Yiqian Huang, Shiqi Zhang, Xiaokui Xiao ·

Graph-RAG constructs a knowledge graph from text chunks to improve retrieval in Large Language Model (LLM)-based question answering. It is particularly useful in domains such as biomedicine, law, and political science, where retrieval often requires multi-hop reasoning over proprietary documents. Some existing Graph-RAG systems construct #KNN graphs based on text chunk relevance, but this coarse-grained approach fails to capture entity relationships within texts, leading to sub-par retrieval and generation quality. To address this, recent solutions leverage LLMs to extract entities and relationships from text chunks, constructing triplet-based knowledge graphs. However, this approach incurs significant indexing costs, especially for large document collections. To ensure a good result accuracy while reducing the indexing cost, we propose KET-RAG, a multi-granular indexing framework. KET-RAG first identifies a small set of key text chunks and leverages an #LLM to construct a knowledge graph skeleton. It then builds a text-keyword bipartite graph from all text chunks, serving as a lightweight alternative to a full knowledge graph. During retrieval, KET-RAG searches both structures: it follows the local search strategy of existing Graph-RAG systems on the skeleton while mimicking this search on the bipartite graph to improve retrieval quality. We evaluate eight solutions on two real-world datasets, demonstrating that KET-RAG outperforms all competitors in indexing cost, retrieval effectiveness, and generation quality. Notably, it achieves comparable or superior retrieval quality to Microsoft's Graph-RAG while reducing indexing costs by over an order of magnitude. Additionally, it improves the generation quality by up to 32.4% while lowering indexing costs by around 20%.


Paper: https://arxiv.org/pdf/2502.09304v1.pdf

Code: https://github.com/waetr/KET-RAG

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍2
1 Year Perplexity Pro on Your Mail
💰 Price:$20 or ₹1500

1 Year You.com Pro on Your Mail
💰 Price:$25 or ₹1800

💰Combo Offer:40$

Original Price:200$

How I activate ?
I activate account through voucher codes on your mail for 1 year.

💡 Features Included
Advanced AI Models:
• DeepResearch
•GPT-4o, o1, o3 mini(High)
• Deepseek r1[USA Hosted Uncensored]
• Llama 3.1
•Claude 3.5 Sonnet, Claude 3.5 Haiku
•Grok-2(Grok 3 coming too confirmed by its CEO)
•FILE ANALYSIS
•PRO SEARCH

Image Generation 🎥
•Flux, DALL-E 3
•Playground v3, Stable Diffusion XL

✔️ What You Get
•1 year of full access.
•A 12-month warranty is included.

💨 This post will be deleted/removed after 24 hours so save my username or contact immediately.

💰 Payment Method: Crypto[LTC or USDT] or UPI
For Inquiry/Purchase DM: @AiChatBoss
2👍2🔥1
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.


Paper: https://arxiv.org/pdf/2501.13306v2.pdf

Code: https://github.com/aslp-lab/osum

Datasets: LibriSpeech - IEMOCAP

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek

https://news.1rj.ru/str/DataScienceT
👍4
This media is not supported in your browser
VIEW IN TELEGRAM
Zep: A Temporal Knowledge Graph Architecture for Agent Memory

20 Jan 2025 · Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef ·

We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (#RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti -- a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the #DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, #Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.


Paper: https://arxiv.org/pdf/2501.13956v1.pdf

Code: https://github.com/getzep/graphiti

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents

https://news.1rj.ru/str/DataScienceT
👍3
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

14 Feb 2025 · Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi ·

We present #HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (#LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks.


Paper: https://github.com/dcdmllm/healthgpt

Code: https://github.com/dcdmllm/healthgpt

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents

https://news.1rj.ru/str/DataScienceT
👍62