Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
🖥 Github: https://github.com/opengvlab/piip
📕 Paper: https://arxiv.org/abs/2501.07783v1
⭐️ Dataset: https://paperswithcode.com/dataset/gqa
https://news.1rj.ru/str/DataScienceT🧠
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
Please open Telegram to view this post
VIEW IN TELEGRAM
👍4❤1
FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Paper: https://arxiv.org/pdf/2501.08225v1.pdf
Code: https://github.com/ybybzhang/framepainter
https://news.1rj.ru/str/DataScienceT✈️
Paper: https://arxiv.org/pdf/2501.08225v1.pdf
Code: https://github.com/ybybzhang/framepainter
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Paper: https://arxiv.org/pdf/2407.15811v1.pdf
Code: https://github.com/sonyresearch/micro_diffusion
Datasets: MS COCO
https://news.1rj.ru/str/DataScienceT🧠
Paper: https://arxiv.org/pdf/2407.15811v1.pdf
code: https://github.com/sonyresearch/micro_diffusion
Datasets: MS COCO
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤1
MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
Paper: https://arxiv.org/pdf/2501.06713v2.pdf
Code: https://github.com/hkuds/minirag
https://news.1rj.ru/str/DataScienceT🧠
Paper: https://arxiv.org/pdf/2501.06713v2.pdf
Code: https://github.com/hkuds/minirag
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤3👍2
Continual Forgetting for Pre-trained Vision Models (CVPR 2024)
🖥 Github: https://github.com/bjzhb666/GS-LoRA
📕 Paper: https://arxiv.org/abs/2501.09705v1
🧠 Dataset: https://paperswithcode.com/dataset/coco
https://news.1rj.ru/str/DataScienceT🧠
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤1👍1
Tensor Product Attention Is All You Need
Paper: https://arxiv.org/pdf/2501.06425v1.pdf
Code: https://github.com/tensorgi/t6
Dataset: MMLU
https://news.1rj.ru/str/DataScienceT✅
Paper: https://arxiv.org/pdf/2501.06425v1.pdf
Code: https://github.com/tensorgi/t6
Dataset: MMLU
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍1
UnCommon Objects in 3D
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360 coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.
Paper: https://arxiv.org/pdf/2501.07574v1.pdf
Code: https://github.com/facebookresearch/uco3d
Dataset: MS COCO
https://news.1rj.ru/str/DataScienceT🐻❄️
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360 coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.
Paper: https://arxiv.org/pdf/2501.07574v1.pdf
Code: https://github.com/facebookresearch/uco3d
DataSet: MS COCO
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤3👍1
The GAN is dead; long live the GAN! A Modern GAN Baseline
There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
Paper: https://arxiv.org/pdf/2501.05441v1.pdf
Code: https://github.com/brownvc/r3gan
Dataset: CIFAR-10
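For intuition, here is a minimal PyTorch sketch of the loss family the abstract describes: a relativistic pairing objective with zero-centered R1/R2 gradient penalties. The function names and the gamma weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a regularized relativistic GAN loss (RpGAN + R1 + R2).
# Assumes D maps NCHW images to per-sample scores.
import torch
import torch.nn.functional as F

def relativistic_d_loss(D, real, fake, gamma=1.0):
    """Discriminator loss: softplus of the relativistic pair difference,
    plus zero-centered gradient penalties on real (R1) and fake (R2) samples."""
    real = real.detach().requires_grad_(True)
    fake = fake.detach().requires_grad_(True)
    d_real, d_fake = D(real), D(fake)
    # Relativistic pairing: compare critic scores of paired real/fake samples.
    loss = F.softplus(d_fake - d_real).mean()
    # R1: penalize the gradient norm of D at real samples.
    (r1_grad,) = torch.autograd.grad(d_real.sum(), real, create_graph=True)
    # R2: the same penalty at fake samples.
    (r2_grad,) = torch.autograd.grad(d_fake.sum(), fake, create_graph=True)
    penalty = (r1_grad.square().sum([1, 2, 3]).mean()
               + r2_grad.square().sum([1, 2, 3]).mean())
    return loss + 0.5 * gamma * penalty

def relativistic_g_loss(D, real, fake):
    """Generator loss: the mirrored relativistic objective."""
    return F.softplus(D(real) - D(fake)).mean()
```

Pairing real and fake scores inside the softplus is what targets mode dropping, while the two penalties keep the discriminator smooth around both data and model samples.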
https://news.1rj.ru/str/DataScienceT😵💫
There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
Paper: https://arxiv.org/pdf/2501.05441v1.pdf
Code: https://github.com/brownvc/r3gan
Dataset: CIFAR-10
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤4👍2
Cold-Start Recommendation towards the Era of Large Language Models (LLMs): A Comprehensive Survey and Roadmap
The cold-start problem is one of the long-standing challenges in recommender systems: accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities for modeling user and item information, offering new potential for cold-start recommendation. However, the research community still lacks a comprehensive review of CSR. This paper therefore provides a comprehensive review and discussion of the roadmap, related literature, and future directions of CSR in the era of large language models. Specifically, it traces how existing CSR methods utilize information, from content features, graph relations, and domain information to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities. Related resources for cold-start recommendation are collected and continuously updated at https://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.
Paper: https://arxiv.org/pdf/2501.01945v2.pdf
Code: https://github.com/yuanchenbei/awesome-cold-start-recommendation
https://news.1rj.ru/str/DataScienceT🩷
Cold-start problem is one of the long-standing challenges in recommender systems, focusing on accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities in modeling user and item information, providing new potential for cold-start recommendations. However, the research community on CSR still lacks a comprehensive review and reflection in this field. Based on this, in this paper, we stand in the context of the era of large language models and provide a comprehensive review and discussion on the roadmap, related literature, and future directions of CSR. Specifically, we have conducted an exploration of the development path of how existing CSR utilizes information, from content features, graph relations, and domain information, to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities on CSR. Related resources of cold-start recommendations are collected and continuously updated for the community in https://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.
Paper: https://arxiv.org/pdf/2501.01945v2.pdf
Code: https://github.com/yuanchenbei/awesome-cold-start-recommendation
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍1
Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap
Large language models (LLMs) have not only revolutionized natural language processing but also extended their prowess to various domains, marking a significant stride toward artificial general intelligence. LLMs and evolutionary algorithms (EAs), despite differing in objectives and methodologies, share a common pursuit of applicability to complex problems. EAs can provide an optimization framework for further enhancing LLMs under black-box settings, empowering LLMs with flexible global-search capacities; conversely, the abundant domain knowledge inherent in LLMs enables EAs to conduct more intelligent searches, and the text processing and generative capabilities of LLMs aid in deploying EAs across a wide range of tasks. Based on these complementary advantages, this paper provides a thorough review and a forward-looking roadmap, categorizing the reciprocal inspiration into two main avenues: LLM-enhanced EA and EA-enhanced #LLM. Integrated synergy methods are further introduced to exemplify the complementarity between LLMs and EAs in diverse scenarios, including code generation, software engineering, neural architecture search, and various generation tasks. As the first comprehensive review focused on EA research in the era of #LLMs, this paper provides a foundational stepping stone for understanding the collaborative potential of LLMs and EAs. The identified challenges and future directions offer guidance for researchers and practitioners seeking to unlock the full potential of this innovative collaboration in propelling advancements in optimization and artificial intelligence.
Paper: https://arxiv.org/pdf/2401.10034v3.pdf
Code: https://github.com/wuxingyu-ai/llm4ec
https://news.1rj.ru/str/DataScienceT⭐️
Large language models (LLMs) have not only revolutionized natural language processing but also extended their prowess to various domains, marking a significant stride towards artificial general intelligence. The interplay between LLMs and evolutionary algorithms (EAs), despite differing in objectives and methodologies, share a common pursuit of applicability in complex problems. Meanwhile, EA can provide an optimization framework for LLM's further enhancement under black-box settings, empowering LLM with flexible global search capacities. On the other hand, the abundant domain knowledge inherent in LLMs could enable EA to conduct more intelligent searches. Furthermore, the text processing and generative capabilities of LLMs would aid in deploying EAs across a wide range of tasks. Based on these complementary advantages, this paper provides a thorough review and a forward-looking roadmap, categorizing the reciprocal inspiration into two main avenues: LLM-enhanced EA and EA-enhanced #LLM. Some integrated synergy methods are further introduced to exemplify the complementarity between LLMs and EAs in diverse scenarios, including code generation, software engineering, neural architecture search, and various generation tasks. As the first comprehensive review focused on the EA research in the era of #LLMs, this paper provides a foundational stepping stone for understanding the collaborative potential of LLMs and EAs. The identified challenges and future directions offer guidance for researchers and practitioners to unlock the full potential of this innovative collaboration in propelling advancements in optimization and artificial intelligence.
Paper: https://arxiv.org/pdf/2401.10034v3.pdf
Code: https://github.com/wuxingyu-ai/llm4ec
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍2
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
Paper: https://arxiv.org/pdf/2501.04001v1.pdf
Code: https://github.com/magic-research/Sa2VA
Dataset: Visual Question Answering (VQA)
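To make the mechanism concrete, here is an illustrative sketch of the pattern the abstract describes: the LLM emits special instruction tokens, and their hidden states are projected into the prompt space of a SAM-2-style mask decoder. Every name below is a stand-in for illustration, not the actual Sa2VA API.

```python
import torch
import torch.nn as nn

class SegTokenProjector(nn.Module):
    """Maps LLM hidden states at [SEG] token positions into mask-decoder prompts."""
    def __init__(self, llm_dim=4096, prompt_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, prompt_dim)

    def forward(self, hidden_states, seg_positions):
        # hidden_states: (seq_len, llm_dim); seg_positions: indices of emitted [SEG] tokens.
        return self.proj(hidden_states[seg_positions])  # (num_objects, prompt_dim)

# Dummy usage: two [SEG] tokens in a 10-token sequence yield two prompt embeddings,
# which a SAM-2-style decoder would then turn into two segmentation masks.
projector = SegTokenProjector()
prompts = projector(torch.randn(10, 4096), torch.tensor([3, 7]))
print(prompts.shape)  # torch.Size([2, 256])
```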
https://news.1rj.ru/str/DataScienceT❤️
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
Paper: https://arxiv.org/pdf/2501.04001v1.pdf
Code: https://github.com/magic-research/Sa2VA
Dataset: Visual Question Answering (VQA)
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤2👍2
3DGS-to-PC: Convert a 3D Gaussian Splatting Scene into a Dense Point Cloud or Mesh
3D Gaussian Splatting (3DGS) excels at producing highly detailed 3D reconstructions, but these scenes often require specialised renderers for effective visualisation. In contrast, point clouds are a widely used 3D representation and are compatible with most popular 3D processing software, yet converting 3DGS scenes into point clouds is a complex challenge. In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds. We sample points probabilistically from each Gaussian as a 3D density function. We additionally threshold new points using the Mahalanobis distance to the Gaussian centre, preventing extreme outliers. The result is a point cloud that closely represents the shape encoded into the 3D Gaussian scene. Individual Gaussians use spherical harmonics to adapt colours depending on view, and each point may contribute only subtle colour hints to the resulting rendered scene. To avoid spurious or incorrect colours that do not fit with the final point cloud, we recalculate Gaussian colours via a customised image rendering approach, assigning each Gaussian the colour of the pixel to which it contributes most across all views. 3DGS-to-PC also supports mesh generation through Poisson Surface Reconstruction, applied to points sampled from predicted surface Gaussians. This allows coloured meshes to be generated from 3DGS scenes without the need for re-training. The package is highly customisable and capable of simple integration into existing 3DGS pipelines. 3DGS-to-PC provides a powerful tool for converting 3DGS data into point cloud and surface-based formats.
Paper: https://arxiv.org/pdf/2501.07478v1.pdf
Code: https://github.com/lewis-stuart-11/3dgs-to-pc
Dataset: NeRF
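A minimal NumPy sketch of the core sampling step described above, assuming each Gaussian is given by a mean and covariance: draw points from the Gaussian as a density, then reject samples whose Mahalanobis distance to the centre exceeds a threshold. Parameter names and the threshold value are illustrative.

```python
import numpy as np

def sample_gaussian_points(mean, cov, n_points, max_mahalanobis=2.0, rng=None):
    """Sample points from one 3D Gaussian, discarding extreme outliers."""
    if rng is None:
        rng = np.random.default_rng()
    pts = rng.multivariate_normal(mean, cov, size=n_points)
    # Mahalanobis distance of each sample to the Gaussian centre.
    diff = pts - mean
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return pts[np.sqrt(d2) <= max_mahalanobis]

# Example: an anisotropic Gaussian stretched along the x axis.
cloud = sample_gaussian_points(np.zeros(3), np.diag([0.04, 0.01, 0.01]), 1000)
```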
https://news.1rj.ru/str/DataScienceT💚
3D Gaussian Splatting (3DGS) excels at producing highly detailed 3D reconstructions, but these scenes often require specialised renderers for effective visualisation. In contrast, point clouds are a widely used 3D representation and are compatible with most popular 3D processing software, yet converting 3DGS scenes into point clouds is a complex challenge. In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds. We sample points probabilistically from each Gaussian as a 3D density function. We additionally threshold new points using the Mahalanobis distance to the Gaussian centre, preventing extreme outliers. The result is a point cloud that closely represents the shape encoded into the 3D Gaussian scene. Individual Gaussians use spherical harmonics to adapt colours depending on view, and each point may contribute only subtle colour hints to the resulting rendered scene. To avoid spurious or incorrect colours that do not fit with the final point cloud, we recalculate Gaussian colours via a customised image rendering approach, assigning each Gaussian the colour of the pixel to which it contributes most across all views. 3DGS-to-PC also supports mesh generation through Poisson Surface Reconstruction, applied to points sampled from predicted surface Gaussians. This allows coloured meshes to be generated from 3DGS scenes without the need for re-training. This package is highly customisable and capability of simple integration into existing 3DGS pipelines. 3DGS-to-PC provides a powerful tool for converting 3DGS data into point cloud and surface-based formats.
Paper: https://arxiv.org/pdf/2501.07478v1.pdf
Code: https://github.com/lewis-stuart-11/3dgs-to-pc
Dataset: NeRF
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍1
DeepSeek-V3 Technical Report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in #DeepSeek V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Paper: https://arxiv.org/pdf/2412.19437v1.pdf
Code: https://github.com/deepseek-ai/deepseek-v3
#aiagents #ai #llm #ml #machinelearning #python
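As a toy illustration of the auxiliary-loss-free load balancing mentioned in the abstract, the sketch below adds a per-expert bias to the routing scores used for top-k expert selection and nudges it up or down according to expert load, while the gating weights stay unbiased. All names and the update rule's details are assumptions for illustration, not DeepSeek-V3's exact algorithm.

```python
import torch

def route_tokens(scores, bias, k, update_rate=1e-3):
    """scores: (num_tokens, num_experts) affinities; bias: (num_experts,)."""
    # The bias influences which experts are *selected*, not the gating weights.
    topk = torch.topk(scores + bias, k, dim=-1).indices          # (num_tokens, k)
    # Count how many tokens each expert received in this batch.
    load = torch.zeros_like(bias).scatter_add_(
        0, topk.flatten(), torch.ones(topk.numel()))
    # Nudge the bias down for overloaded experts and up for underloaded ones.
    new_bias = bias - update_rate * torch.sign(load - load.mean())
    # Gating weights come from the raw, unbiased scores of the chosen experts.
    gates = torch.softmax(scores.gather(-1, topk), dim=-1)
    return topk, gates, new_bias

# Example: route 8 tokens over 4 experts, keeping the top 2 experts per token.
topk, gates, bias = route_tokens(torch.randn(8, 4), torch.zeros(4), k=2)
```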
https://news.1rj.ru/str/DataScienceT💚
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in #DeepSeek V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Paper: https://arxiv.org/pdf/2412.19437v1.pdf
Code: https://github.com/deepseek-ai/deepseek-v3
#aiagents #ai #llm #ml #machinelearning #python
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍2❤1
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of #AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient #MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong #OCR capability and 1.8M pixel high-resolution #image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
Paper: https://arxiv.org/pdf/2408.01800v1.pdf
Codes:
https://github.com/OpenBMB/MiniCPM-o
https://github.com/openbmb/minicpm-v
Datasets: Video-MME
#MachineLearning #DeepLearning #BigData #Datascience #ML #HealthTech #DataVisualization #ArtificialIntelligence #SoftwareEngineering #GenAI #deeplearning #ChatGPT #OpenAI #python #AI #keras #SQL #Statistics
https://news.1rj.ru/str/DataScienceT❤️
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of #AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient #MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong #OCR capability and 1.8M pixel high-resolution #image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
Paper: https://arxiv.org/pdf/2408.01800v1.pdf
Codes:
https://github.com/OpenBMB/MiniCPM-o
https://github.com/openbmb/minicpm-v
Datasets: Video-MME
#MachineLearning #DeepLearning #BigData #Datascience #ML #HealthTech #DataVisualization #ArtificialInteligence #SoftwareEngineering #GenAI #deeplearning #ChatGPT #OpenAI #python #AI #keras #SQL #Statistics
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍3
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models.
Paper: https://arxiv.org/pdf/2501.08331v2.pdf
Code:
https://github.com/gowiththeflowpaper/gowiththeflowpaper.github.io
https://github.com/vgenai-netflix-eyeline-research/go-with-the-flow
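A rough PyTorch sketch of the general idea, assuming dense optical flow is available: warp the previous frame's noise along the flow with grid_sample. The paper's algorithm additionally preserves exact spatial Gaussianity, which this toy version only approximates by re-standardising; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def warp_noise(noise, flow):
    """noise: (1, C, H, W) previous-frame noise; flow: (1, 2, H, W) pixel offsets."""
    _, _, h, w = noise.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Build a sampling grid in [-1, 1] normalised coordinates, displaced by the flow.
    grid = torch.stack(((xs + flow[0, 0]) / (w - 1) * 2 - 1,
                        (ys + flow[0, 1]) / (h - 1) * 2 - 1), dim=-1).unsqueeze(0)
    warped = F.grid_sample(noise, grid, align_corners=True)
    # Re-standardise so the warped frame's noise stays roughly unit-variance.
    return (warped - warped.mean()) / (warped.std() + 1e-6)

# Example: warp unit Gaussian noise by a zero flow field (identity warp).
out = warp_noise(torch.randn(1, 4, 64, 64), torch.zeros(1, 2, 64, 64))
```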
https://news.1rj.ru/str/DataScienceT🌟
Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models.
Paper: https://arxiv.org/pdf/2501.08331v2.pdf
Code:
https://github.com/gowiththeflowpaper/gowiththeflowpaper.github.io
https://github.com/vgenai-netflix-eyeline-research/go-with-the-flow
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
❤1👍1
Transformer²: Self-adaptive LLMs
Paper: https://arxiv.org/pdf/2501.06252v2.pdf
Code:
https://github.com/SakanaAI/self-adaptive-llms
https://github.com/codelion/adaptive-classifier
Datasets: GSM8K, HumanEval, MATH, MBPP, TextVQA, OK-VQA, ARC (AI2 Reasoning Challenge)
https://news.1rj.ru/str/DataScienceT❤️
Paper: https://arxiv.org/pdf/2501.06252v2.pdf
Code:
https://github.com/SakanaAI/self-adaptive-llms
https://github.com/codelion/adaptive-classifier
Datasets: GSM8K - HumanEval - MATH
MBPP - TextVQA - OK-VQA - ARC (AI2 Reasoning Challenge)
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍3
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
Paper: https://arxiv.org/pdf/2412.00733v3.pdf
Code: https://github.com/fudan-generative-vision/hallo3
https://news.1rj.ru/str/DataScienceT😮
paper: https://arxiv.org/pdf/2412.00733v3.pdf
Code: https://github.com/fudan-generative-vision/hallo3
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍5
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
Paper: https://arxiv.org/pdf/2412.15838v2.pdf
Code: https://github.com/pku-alignment/align-anything
Dataset: LLaVA-Bench
https://news.1rj.ru/str/DataScienceT😱
Paper: https://arxiv.org/pdf/2412.15838v2.pdf
Code: https://github.com/pku-alignment/align-anything
Dataset: LLaVA-Bench
https://news.1rj.ru/str/DataScienceT
Please open Telegram to view this post
VIEW IN TELEGRAM
👍3