The Hundred-Page Language Models Book
Read it:
https://github.com/aburkov/theLMbook
#LLM #NLP #ML #AI #PYTHON #PYTORCH
https://news.1rj.ru/str/DataScienceM
👍4
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Paper: https://arxiv.org/pdf/2502.10248v1.pdf
Codes:
https://github.com/phixion/phixion
https://github.com/stepfun-ai/step-video-t2v
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can also be accessed at https://yuewen.cn/videos. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
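A minimal PyTorch sketch of two ideas from the abstract: the latent shape implied by the 16x16 spatial / 8x temporal Video-VAE compression, and one Flow Matching training step that regresses the noise-to-data velocity. The tiny MLP stands in for the 30B-parameter DiT; all shapes and hyperparameters are illustrative assumptions, not the released code.
```python
import torch
import torch.nn as nn

# (1) Latent shape implied by 16x16 spatial / 8x temporal compression.
frames, height, width = 204, 544, 992            # hypothetical input video size
print("latent (T, H, W):", (frames // 8, height // 16, width // 16))

# (2) One Flow Matching step: regress the velocity (x1 - x0) along the straight
# path x_t = (1 - t) * x0 + t * x1 between noise x0 and clean latents x1.
model = nn.Sequential(nn.Linear(65, 256), nn.SiLU(), nn.Linear(256, 64))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x1 = torch.randn(16, 64)                         # stand-in for clean VAE latents
x0 = torch.randn_like(x1)                        # Gaussian noise
t = torch.rand(16, 1)                            # random timesteps in [0, 1]
xt = (1 - t) * x0 + t * x1                       # point on the interpolation path
pred = model(torch.cat([xt, t], dim=1))          # predict velocity, conditioned on t
loss = ((pred - (x1 - x0)) ** 2).mean()          # velocity-matching objective
loss.backward()
opt.step()
opt.zero_grad()
print("flow matching loss:", float(loss))
```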
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
👍3
Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition
🖥 Github: https://github.com/nuozimiaowu/Text4VPR
📕 Paper: https://arxiv.org/abs/2502.14195v1
🌟 Dataset: https://paperswithcode.com/task/cross-modal-place-recognition
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
👍1
KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG
13 Feb 2025 · Yiqian Huang, Shiqi Zhang, Xiaokui Xiao ·
Paper: https://arxiv.org/pdf/2502.09304v1.pdf
Code: https://github.com/waetr/KET-RAG
Graph-RAG constructs a knowledge graph from text chunks to improve retrieval in Large Language Model (LLM)-based question answering. It is particularly useful in domains such as biomedicine, law, and political science, where retrieval often requires multi-hop reasoning over proprietary documents. Some existing Graph-RAG systems construct #KNN graphs based on text chunk relevance, but this coarse-grained approach fails to capture entity relationships within texts, leading to sub-par retrieval and generation quality. To address this, recent solutions leverage LLMs to extract entities and relationships from text chunks, constructing triplet-based knowledge graphs. However, this approach incurs significant indexing costs, especially for large document collections. To ensure good result accuracy while reducing indexing cost, we propose KET-RAG, a multi-granular indexing framework. KET-RAG first identifies a small set of key text chunks and leverages an #LLM to construct a knowledge graph skeleton. It then builds a text-keyword bipartite graph from all text chunks, serving as a lightweight alternative to a full knowledge graph. During retrieval, KET-RAG searches both structures: it follows the local search strategy of existing Graph-RAG systems on the skeleton while mimicking this search on the bipartite graph to improve retrieval quality. We evaluate eight solutions on two real-world datasets, demonstrating that KET-RAG outperforms all competitors in indexing cost, retrieval effectiveness, and generation quality. Notably, it achieves comparable or superior retrieval quality to Microsoft's Graph-RAG while reducing indexing costs by over an order of magnitude. Additionally, it improves the generation quality by up to 32.4% while lowering indexing costs by around 20%.
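A toy Python sketch of KET-RAG's two-tier index as described above: a triplet skeleton over a few key chunks (the expensive, LLM-extracted part), plus a cheap keyword-to-chunk bipartite graph over all chunks searched alongside it. The data, triplets, and scoring are illustrative assumptions, not the paper's implementation.
```python
from collections import defaultdict

chunks = {
    "c1": "KET-RAG builds a knowledge graph skeleton from key chunks.",
    "c2": "A bipartite graph links keywords to every text chunk.",
    "c3": "Local search follows entities on the skeleton graph.",
}
key_chunks = {"c1"}                      # small set sent to the (expensive) LLM

# Tier 1: skeleton triplets, hand-written here in place of LLM extraction.
skeleton = [("KET-RAG", "builds", "knowledge graph skeleton")]

# Tier 2: keyword -> chunk bipartite index over ALL chunks (no LLM cost).
keyword_to_chunks = defaultdict(set)
for cid, text in chunks.items():
    for word in text.lower().replace(".", "").split():
        keyword_to_chunks[word].add(cid)

def retrieve(query: str):
    terms = query.lower().split()
    # Skeleton hits: triplets mentioning any query term.
    triplets = [t for t in skeleton
                if any(w in " ".join(t).lower() for w in terms)]
    # Bipartite hits: chunks sharing the most keywords with the query.
    scores = defaultdict(int)
    for w in terms:
        for cid in keyword_to_chunks.get(w, ()):
            scores[cid] += 1
    return triplets, sorted(scores, key=scores.get, reverse=True)

print(retrieve("knowledge graph skeleton"))
```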
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
👍2
Forwarded from ML - DS/DA/DE - AI [Jobs, InterviewPrep]
✅ 1 Year Perplexity Pro on Your Email
💰 Price: $20 or ₹1500
✅ 1 Year You.com Pro on Your Email
💰 Price: $25 or ₹1800
💰 Combo Offer: $40
Original Price: $200
How do I activate?
I activate the account through voucher codes sent to your email, valid for 1 year.
💡 Features Included
✅ Advanced AI Models:
• DeepResearch
• GPT-4o, o1, o3-mini (High)
• DeepSeek R1 [USA-hosted, uncensored]
• Llama 3.1
• Claude 3.5 Sonnet, Claude 3.5 Haiku
• Grok-2 (Grok 3 coming too, confirmed by its CEO)
• File Analysis
• Pro Search
✅ Image Generation 🎥
• Flux, DALL-E 3
• Playground v3, Stable Diffusion XL
✔️ What You Get
• 1 year of full access.
• A 12-month warranty is included.
💨 This post will be deleted after 24 hours, so save my username or contact me immediately.
💰 Payment Method: Crypto [LTC or USDT] or UPI
✅ For Inquiry/Purchase DM: @AiChatBoss
❤2👍2🔥1
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Paper: https://arxiv.org/pdf/2501.13306v2.pdf
Code: https://github.com/aslp-lab/osum
Datasets: LibriSpeech - IEMOCAP
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
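A schematic PyTorch sketch of the ASR+X strategy as described: each batch jointly optimizes the ASR loss and one target task's loss. The linear layers stand in for the Whisper encoder and Qwen2 LLM; dimensions, task heads, and weighting are illustrative assumptions, not OSUM's actual architecture.
```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 128)            # stand-in for the Whisper encoder
asr_head = nn.Linear(128, 32)           # token logits for ASR
task_heads = {"SER": nn.Linear(128, 7),     # speech emotion recognition
              "SGC": nn.Linear(128, 2)}     # speaker gender classification

def asr_plus_x_step(features, asr_targets, x_name, x_targets):
    h = encoder(features).mean(dim=1)                # pooled speech features
    ce = nn.functional.cross_entropy
    loss_asr = ce(asr_head(h), asr_targets)          # always-on ASR loss
    loss_x = ce(task_heads[x_name](h), x_targets)    # the "X" task of this batch
    return loss_asr + loss_x                         # joint multi-task objective

feats = torch.randn(4, 100, 80)                      # (batch, time, mel bins)
loss = asr_plus_x_step(feats, torch.randint(0, 32, (4,)),
                       "SER", torch.randint(0, 7, (4,)))
print(float(loss))
```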
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek
https://news.1rj.ru/str/DataScienceT
👍4
Zep: A Temporal Knowledge Graph Architecture for Agent Memory
20 Jan 2025 · Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef ·
Paper: https://arxiv.org/pdf/2501.13956v1.pdf
Code: https://github.com/getzep/graphiti
We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (#RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti -- a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the #DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, #Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.
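A minimal Python sketch of the temporal knowledge-graph idea behind Graphiti as described above: fact edges carry validity intervals, so the graph can answer "what was true at time t" rather than only "what is true now". The Edge type and query below are my own construction, not Graphiti's API.
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None    # None means still valid

graph = [
    Edge("alice", "works_at", "AcmeCo",
         datetime(2023, 1, 1), datetime(2024, 6, 1)),
    Edge("alice", "works_at", "Initech", datetime(2024, 6, 1)),
]

def facts_at(graph, t: datetime):
    """Edges whose validity interval covers time t."""
    return [e for e in graph
            if e.valid_from <= t and (e.valid_to is None or t < e.valid_to)]

print(facts_at(graph, datetime(2023, 7, 1)))   # Alice still at AcmeCo
print(facts_at(graph, datetime(2025, 1, 1)))   # Alice now at Initech
```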
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍3
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
14 Feb 2025 · Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi ·
Paper & Code: https://github.com/dcdmllm/healthgpt
We present #HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (#LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively train HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks.
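An illustrative PyTorch sketch of heterogeneous low-rank adaptation as the abstract describes it: separate low-rank adapter branches for comprehension and generation sit on a shared frozen base weight, with the task selecting the branch. Ranks, names, and routing are assumptions, not the paper's exact H-LoRA design.
```python
import torch
import torch.nn as nn

class HLoRALinear(nn.Module):
    def __init__(self, dim=64, rank=8, tasks=("comprehension", "generation")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)        # frozen pretrained weight
        self.adapters = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(dim, rank, bias=False),
                             nn.Linear(rank, dim, bias=False))
            for t in tasks                            # one low-rank branch per task
        })

    def forward(self, x, task):
        return self.base(x) + self.adapters[task](x)  # base + task-specific delta

layer = HLoRALinear()
x = torch.randn(2, 64)
print(layer(x, "comprehension").shape, layer(x, "generation").shape)
```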
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍6❤2
Fractal Generative Models
24 Feb 2025 · Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He ·
Paper: https://arxiv.org/pdf/2502.17437v1.pdf
Code: https://github.com/LTH14/fractalgen
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
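A toy Python sketch of the fractal idea: a generator recursively invokes smaller copies of itself, bottoming out in an "atomic" module, and each child is conditioned on what was generated before it. The paper instantiates the atomic modules as autoregressive models; the stand-in below only illustrates the self-similar recursion.
```python
import random

def atomic_module(context):
    """Leaf generator: emit one value conditioned on its context (toy)."""
    return (sum(context) + random.random()) % 1.0

def fractal_generate(depth, branching=2, context=(0.5,)):
    if depth == 0:
        return [atomic_module(context)]                  # base case: atomic module
    output = []
    for _ in range(branching):                           # recurse: each child is a
        part = fractal_generate(depth - 1, branching,    # smaller generator seeded
                                context + tuple(output[-1:]))  # by prior output
        output.extend(part)
    return output

print(fractal_generate(depth=3))                         # 2^3 = 8 generated values
```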
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍7❤3
Slamming: Training a Speech Language Model on One GPU in a Day
19 Feb 2025 · Gallil Maimon, Avishai Elmakies, Yossi Adi ·
Paper: https://arxiv.org/pdf/2502.15814v1.pdf
Code: https://github.com/slp-rl/slamkit
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaks to all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute-optimal performance, giving an optimistic view of #SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
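A hypothetical config sketch of the single-GPU, 24-hour budget the recipe targets. Every field value below is illustrative, not the paper's actual settings; see the slamkit repo for the real recipe.
```python
from dataclasses import dataclass

@dataclass
class SlamRecipe:
    gpu: str = "single academic GPU"        # one-GPU constraint
    wall_clock_hours: int = 24              # hard compute budget
    init_from: str = "pretrained-text-LM"   # warm-start initialisation
    synthetic_data: bool = True             # synthetic training data
    preference_opt: bool = True             # preference optimisation on synthetic pairs

    def token_budget(self, tokens_per_second: int) -> int:
        """Rough number of tokens trainable inside the wall-clock budget."""
        return tokens_per_second * 3600 * self.wall_clock_hours

recipe = SlamRecipe()
print(f"{recipe.token_budget(tokens_per_second=20_000):,} tokens in 24h")
```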
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍1
🎨 Can AI design truly novel concepts like humans? Check SYNTHIA, a breakthrough in T2I generation!
🤖 SYNTHIA composes affordances to create visually novel & functionally coherent designs.
📄 https://arxiv.org/pdf/2502.17793
💻 https://github.com/HyeonjeongHa/SYNTHIA
🎥 https://youtube.com/watch?v=KvsOx44WdzM
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍3
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
🖥 Github: https://github.com/thu-coai/AISafetyLab
📕 Paper: https://arxiv.org/abs/2502.16776v1
🌟 Dataset: https://paperswithcode.com/dataset/gptfuzzer
https://news.1rj.ru/str/DataScienceT
❤1
Forwarded from ENG. Hussein Sheikho
This channel is for programmers, coders, and software engineers.
0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ programming Languages
✅ https://news.1rj.ru/str/addlist/8_rRW2scgfRhOTc0
✅ https://news.1rj.ru/str/codeprogrammer
👍1
Magma: A Foundation Model for Multimodal AI Agents
18 Feb 2025 · Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao ·
Paper: https://arxiv.org/pdf/2502.13130v1.pdf
Code: https://github.com/microsoft/Magma
Datasets: Something-Something V2 - EPIC-KITCHENS-100 - Open-X-Embodiment - Ego4D
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
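A small data-structure sketch of the two labeling schemes described above: Set-of-Mark tags actionable objects with numbered marks for action grounding, and Trace-of-Mark records a mark's positions across frames for action planning. Field names and values are my own shorthand, not Magma's data format.
```python
from dataclasses import dataclass, field

@dataclass
class SetOfMark:
    mark_id: int
    label: str                           # e.g. "Submit button"
    bbox: tuple                          # (x, y, w, h) in the image

@dataclass
class TraceOfMark:
    mark_id: int
    positions: list = field(default_factory=list)  # (frame, x, y) samples

# SoM: ground an action to a numbered on-screen object.
som = [SetOfMark(1, "Search box", (40, 10, 200, 24)),
       SetOfMark(2, "Submit button", (250, 10, 60, 24))]

# ToM: the planned motion of mark 2 (e.g. a robot gripper) across frames.
tom = TraceOfMark(2, [(0, 250, 10), (5, 180, 60), (10, 120, 110)])

action = f"click mark {som[1].mark_id}"   # action grounded in a SoM id
print(action, "| trace:", tom.positions)
```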
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍7
From System 1 to System 2: A Survey of Reasoning Large Language Models
24 Feb 2025 · Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu ·
Paper: https://arxiv.org/pdf/2502.17419v1.pdf
Code: https://github.com/zzli2022/awesome-slow-reason-system
Datasets: GSM8K - MedQA - MathVista - GPQA - MMLU-Pro - PGPS9K
Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning #LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub repository (https://github.com/zzli2022/Awesome-Slow-Reason-System) to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://news.1rj.ru/str/DataScienceT
👍4