👩🦱Physical-Hair Diffusion👩🦱
👉CONTROLHAIR is a novel hybrid framework that couples a physics simulator with conditional video diffusion to enable controllable dynamic hair rendering. Repo announced💙 Toy sketch below👇
👉Review https://t.ly/78LHr
👉Paper https://lnkd.in/epm-A9Fq
👉Project https://lnkd.in/evsjz298
👉Repo TBA
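A minimal sketch of the two-stage idea, assuming a toy mass-spring simulator and a stubbed diffusion call; none of these names come from the CONTROLHAIR codebase:

```python
# Toy two-stage pipeline: (1) a crude mass-spring step yields per-frame hair
# geometry, (2) that geometry conditions a (stubbed) video diffusion render.
# All names are illustrative, not the CONTROLHAIR API.
import numpy as np

def simulate_hair(strands, wind, gravity=9.81, dt=1 / 30, steps=48):
    """Damped Euler integration returning (T, N, 3) strand positions."""
    frames, vel = [], np.zeros_like(strands)
    for _ in range(steps):
        force = wind - np.array([0.0, gravity, 0.0])
        vel = 0.98 * vel + dt * force
        strands = strands + dt * vel
        frames.append(strands.copy())
    return np.stack(frames)  # physics output used as the diffusion condition

def render_with_diffusion(condition, prompt):
    """Stand-in for the conditional video-diffusion renderer."""
    return f"video conditioned on {condition.shape} + '{prompt}'"

strands = np.random.rand(256, 3)  # 256 hair control points
motion = simulate_hair(strands, wind=np.array([0.5, 0.0, 0.0]))
print(render_with_diffusion(motion, "silky brown hair in the wind"))
```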
🔩Code-Agentic Education🔩
👉Show Lab unveils Code2Video: an agentic, code-centric framework that generates high-quality educational videos from knowledge points, targeting clarity, coherence & reproducibility. Repo under MIT💙 Toy sketch below👇
👉Review https://t.ly/Fv4LJ
👉Paper https://arxiv.org/pdf/2510.01174
👉Repo https://github.com/showlab/Code2Video/
👉Project https://showlab.github.io/Code2Video/
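A hedged sketch of an agentic, code-centric loop, assuming a generic LLM callable and a Manim-style renderer; function names are illustrative, not the Code2Video API:

```python
# Illustrative agent loop: an LLM drafts renderable animation code from a
# knowledge point; render failures are fed back as critique until a video
# compiles. Function names are assumptions, not the Code2Video API.
def generate_lesson_video(knowledge_point, llm, render, max_rounds=3):
    feedback = ""
    for _ in range(max_rounds):
        code = llm(f"Write animation code teaching: {knowledge_point}\n{feedback}")
        try:
            return render(code)   # e.g. a Manim-style renderer
        except Exception as err:  # render errors drive the next revision
            feedback = f"Previous attempt failed with: {err}. Fix and retry."
    raise RuntimeError("no valid video after retries")

fake_llm = lambda prompt: "Scene: plot y = x**2, annotate the vertex"
fake_render = lambda code: f"video built from a {len(code)}-char script"
print(generate_lesson_video("Bayes' theorem", fake_llm, fake_render))
```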
🎷🎷 Clink! Chop! Thud! 🎷🎷
👉Sounding Object Detection: an environment may contain many objects, but only a few are directly involved in producing sound during an interaction. This model detects the sounding object in a video. Code/Data announced💙 Toy sketch below👇
👉Review https://t.ly/VK_1h
👉Paper https://lnkd.in/depNjVXm
👉Project https://lnkd.in/dF63EZFG
👉Repo TBA
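A toy sketch of one plausible formulation, assuming per-object visual embeddings scored against an audio embedding; this is the intuition, not the paper's actual architecture:

```python
# Toy audio-visual matching: score each object proposal by cosine similarity
# between its visual embedding and the clip's audio embedding, then pick the
# argmax as the sounding object. Purely illustrative.
import numpy as np

def sounding_object(object_embs, audio_emb):
    object_embs = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    scores = object_embs @ audio_emb  # one similarity per proposal
    return int(np.argmax(scores)), scores

objs = np.random.rand(5, 128)   # 5 object-proposal embeddings
audio = np.random.rand(128)     # clip-level audio embedding
idx, _ = sounding_object(objs, audio)
print(f"sounding object = proposal {idx}")
```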
👉 A proof I'm not a bot...
My (short) interview with one of the biggest Italian media outlets: AI in 2016, HPC/Quantum, and how I created my startup: https://www.linkedin.com/posts/visionarynet_ai-itw25-ai-activity-7381215486115643392-t7an
Thanks for the support (and of course a new paper coming in a few hours)
🎺Visual Grounding RVOS🎺
👉ReferDINO is a strong referring video object segmentation (RVOS) model that inherits region-level vision-language alignment from foundational visual grounding models and is further endowed with pixel-level dense perception & cross-modal spatio-temporal reasoning. Code, demo & checkpoints💙
👉Review https://t.ly/rOdkP
👉Paper https://lnkd.in/efuAFQdE
👉Project https://lnkd.in/dK3wMZqv
👉Repo https://lnkd.in/d3i2PsNF
💄Pixel-Perfect Depth (SOTA)💄
👉Pixel-Perfect Depth is a monocular depth estimation model built on pixel-space diffusion transformers. New SOTA. Repo under Apache 2.0💙 Toy sketch below👇
👉Review https://t.ly/75PGo
👉Paper https://lnkd.in/d8wxFpyY
👉Project https://lnkd.in/dV5HhsqH
👉Repo https://lnkd.in/d9JKFBJq
👉Demo https://lnkd.in/d3wBkKJ9
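A minimal sketch of pixel-space diffusion sampling for depth, assuming a stub denoiser and a generic DDPM-style loop rather than the paper's exact schedule:

```python
# Pixel-space diffusion sampling, minimally: start a full-resolution depth
# map from noise and iteratively denoise it conditioned on the RGB image.
# The denoiser is a stub and the update rule is deliberately crude.
import numpy as np

def sample_depth(rgb, denoise_fn, steps=10):
    depth = np.random.randn(*rgb.shape[:2])    # pure-noise init, no latents
    for t in reversed(range(steps)):
        noise_hat = denoise_fn(depth, rgb, t)  # predicted noise at step t
        depth = depth - noise_hat / (t + 1)
    return depth

dummy_denoiser = lambda d, img, t: 0.1 * d     # stand-in network
rgb = np.zeros((48, 64, 3))
print(sample_depth(rgb, dummy_denoiser).shape)  # (48, 64)
```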
↗️ TrackVLA++ Visual Tracking↘️
👉TrackVLA++ is a novel Vision-Language-Action model that incorporates spatial reasoning and a target-identification memory, enabling SOTA performance in both long-horizon and highly crowded tracking scenarios. Model announced💙 Toy sketch below👇
👉Review https://t.ly/ruYzc
👉Paper https://arxiv.org/pdf/2510.07134
👉Project pku-epic.github.io/TrackVLA-plus-plus-Web/
👉Repo TBA
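A toy sketch of a target-identification memory, assuming an EMA-updated target embedding matched against per-frame candidates; purely illustrative, not the TrackVLA++ design:

```python
# Toy target-identification memory: an exponential moving average of the
# target embedding, re-matched against candidate detections every frame.
import numpy as np

class TargetMemory:
    def __init__(self, momentum=0.9):
        self.emb, self.m = None, momentum

    def update(self, emb):
        """Blend the newest confirmed target embedding into memory."""
        self.emb = emb if self.emb is None else self.m * self.emb + (1 - self.m) * emb

    def match(self, candidates):
        """Return the index of the candidate most similar to the memory."""
        sims = candidates @ self.emb / (
            np.linalg.norm(candidates, axis=1) * np.linalg.norm(self.emb) + 1e-8)
        return int(np.argmax(sims))

mem = TargetMemory()
mem.update(np.random.rand(64))
print(mem.match(np.random.rand(10, 64)))
```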
🫧 Detect Anything via MLLM 🫧
👉Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks (object detection, OCR, pointing, keypointing & visual prompting) into a single next-point-prediction framework. Impressive results. Repo under IDEA License 1.0💙 Toy sketch below👇
👉Review https://t.ly/DCTk_
👉Paper https://lnkd.in/d4VDD-9j
👉Project https://lnkd.in/d6unEyvq
👉Repo https://lnkd.in/dkYJFe-x
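A hedged sketch of the "everything as next-point prediction" encoding, assuming a quantized (x, y) token scheme; the actual Rex-Omni vocabulary may differ:

```python
# Every task reduced to point sequences: boxes, keypoints and polygons all
# become flat lists of quantized (x, y) tokens that an autoregressive model
# can emit one by one. The 1000-bin scheme is an assumption.
def points_to_tokens(points, img_w, img_h, bins=1000):
    tokens = []
    for x, y in points:
        tokens += [int(x / img_w * (bins - 1)), int(y / img_h * (bins - 1))]
    return tokens

box = [(37.0, 50.0), (200.0, 180.0)]        # top-left, bottom-right corners
keypoints = [(120.0, 64.0), (130.0, 90.0)]  # e.g. two body joints
print(points_to_tokens(box, 640, 480))      # [57, 104, 312, 374]
print(points_to_tokens(keypoints, 640, 480))
```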
🫙Universal Feature Up-Sampling🫙
👉AnyUp is a novel method for feature up-sampling that can be applied to ANY vision feature at ANY resolution, without encoder-specific training: an inference-time, feature-agnostic up-sampling architecture that improves up-sampling quality. Repo under CC-4.0💙 Baseline sketch below👇
👉Review https://t.ly/HvEw9
👉Paper https://arxiv.org/pdf/2510.12764
👉Project https://wimmerth.github.io/anyup/
👉Repo https://github.com/wimmerth/anyup
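For contrast, the naive baseline such a method improves on is a plain bilinear resize of encoder features to image resolution; a minimal PyTorch illustration:

```python
# Naive baseline: bilinear up-sampling of a ViT feature map to image
# resolution, agnostic to channel count. AnyUp's learned, feature-agnostic
# up-sampler is a drop-in replacement for this step.
import torch
import torch.nn.functional as F

feats = torch.randn(1, 384, 16, 16)  # (B, C, h, w) patch features
up = F.interpolate(feats, size=(224, 224), mode="bilinear", align_corners=False)
print(up.shape)  # torch.Size([1, 384, 224, 224])
```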
🦄 City-Tour -> Simulation 🦄
👉UrbanVerse is a novel system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization. Repo & Data announced💙
👉Review https://t.ly/UvXNS
👉Paper https://arxiv.org/pdf/2510.15018
👉Project https://urbanverseproject.github.io/
👉Repo TBA
🌵All-in-One Dense Keypoints🌵
👉DeepDetect is a novel all-in-one dense keypoint detector that unifies the strengths of SIFT, ORB, BRISK, FAST, AGAST, Harris, Shi-Tomasi, Canny & Sobel into a single neural net. DAMN ROMANTIC. Repo under MIT💙 Toy sketch below👇
👉Review https://t.ly/VKGct
👉Paper https://arxiv.org/pdf/2510.17422
👉Repo https://github.com/saktx/DeepDetect
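A hedged sketch of the label-generation idea such a unification suggests, assuming OpenCV's classical detectors vote into one dense heatmap; the paper's actual recipe may differ, and "example.jpg" is a hypothetical path:

```python
# Classical detectors vote into one dense heatmap, a plausible training
# target for a unifying network. Detector set and normalization are guesses.
import cv2
import numpy as np

def keypoint_heatmap(gray):
    heat = np.zeros(gray.shape, np.float32)
    detectors = [cv2.ORB_create(), cv2.FastFeatureDetector_create(),
                 cv2.AgastFeatureDetector_create(), cv2.BRISK_create()]
    for det in detectors:
        for kp in det.detect(gray, None):
            x, y = map(int, kp.pt)
            heat[y, x] += 1.0           # one vote per detector hit
    return heat / max(heat.max(), 1.0)  # normalized dense target

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
if img is not None:
    print(keypoint_heatmap(img).shape)
```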
🔥 SAM 2++: Track Anything 🔥
👉SAM 2++ is a novel unified model for tracking at any granularity: masks, boxes, and points. Impressive results, but no code announced😢 Toy sketch below👇
👉Review https://t.ly/I392_
👉Paper arxiv.org/pdf/2510.18822
👉Project tracking-any-granularity.github.io/
👉Repo :(
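A toy illustration of what "any granularity" prompting can look like, assuming masks, boxes and points share one prompt container; field names are assumptions, not the SAM 2++ interface:

```python
# One prompt container for masks, boxes and points, collapsed to a common
# representation so a single tracker can consume any granularity.
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackPrompt:
    kind: str          # "mask" | "box" | "point"
    data: np.ndarray   # (H, W) mask, (4,) box, or (2,) point

def to_box(p: TrackPrompt) -> np.ndarray:
    """Reduce every granularity to an xyxy box for uniform downstream code."""
    if p.kind == "box":
        return p.data
    if p.kind == "point":
        x, y = p.data
        return np.array([x - 4, y - 4, x + 4, y + 4])  # tiny box around point
    ys, xs = np.nonzero(p.data)                        # mask -> tight box
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

print(to_box(TrackPrompt("point", np.array([50, 60]))))  # [46 56 54 64]
```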
UrbanVerse repo (pretty empty) now online: https://github.com/OatmealLiu/UrbanVerse
🏜️Omni Driving Models🏜️
👉OmniNWM is a unified panoramic navigation world model that advances autonomous driving by jointly generating multi-modal states (RGB, semantics, depth, 3D occupancy), enabling precise action control, and facilitating closed-loop evaluation through occupancy-based dense rewards. Repo under Apache 2.0💙 Toy sketch below👇
👉Review https://t.ly/ktXvz
👉Paper https://lnkd.in/eFKSZnrc
👉Project https://lnkd.in/eSDfccv8
👉Repo https://lnkd.in/efCSvjtp
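A toy version of an occupancy-based dense reward, assuming a voxel grid and penalizing waypoints that land in occupied cells; OmniNWM's actual reward is richer:

```python
# Dense reward from predicted occupancy: penalize planned waypoints that
# fall into occupied voxels of the (X, Y, Z) grid.
import numpy as np

def occupancy_reward(occ, traj, cell=0.5):
    """occ: (X, Y, Z) grid in [0, 1]; traj: (T, 3) metric waypoints."""
    idx = np.clip((traj / cell).astype(int), 0, np.array(occ.shape) - 1)
    collision = occ[idx[:, 0], idx[:, 1], idx[:, 2]].sum()
    return -collision  # denser predicted occupancy -> lower reward

grid = np.random.rand(40, 40, 8)          # toy occupancy prediction
path = np.cumsum(np.random.rand(20, 3), axis=0)
print(occupancy_reward(grid, path))
```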
🐠ITTO: Protocol for Dynamic Tracking🐠
👉ITTO, by Caltech, is a novel benchmark suite for evaluating and diagnosing tracking methods on complex, long-range motions. Repo under CC BY-NC 4.0💙 Toy sketch below👇
👉Review https://t.ly/tN84a
👉Paper https://arxiv.org/pdf/2510.19819
👉Project https://glab-caltech.github.io/ITTO/
👉Repo https://github.com/ilonadem/itto
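A hedged sketch of a long-range tracking evaluation in this spirit, assuming mean position error over visible frames; the exact ITTO protocol may differ:

```python
# Long-range evaluation skeleton: mean position error computed only on
# frames where the ground-truth point is visible.
import numpy as np

def track_error(pred_xy, gt_xy, gt_visible):
    vis = gt_visible.astype(bool)
    return np.linalg.norm(pred_xy[vis] - gt_xy[vis], axis=1).mean()

T = 300                                       # hundreds of frames per track
gt = np.cumsum(np.random.randn(T, 2), axis=0)
pred = gt + 2.0 * np.random.randn(T, 2)
visible = np.random.rand(T) > 0.2             # ~20% occluded frames
print(f"mean visible-frame error: {track_error(pred, gt, visible):.2f}px")
```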
🦗Character Mixing Generation🦗
👉MBZUAI unveils the first-ever video-gen system able to preserve character identity, behavior & original style while generating plausible interactions between characters that have never coexisted, from cartoons (We Bare Bears, Tom & Jerry) to realistic humans (Mr. Bean, Young Sheldon)
👉Review https://t.ly/tN84a
👉Paper https://lnkd.in/dhKMwukv
👉Project https://lnkd.in/dBkJs48h
👉Repo https://lnkd.in/dw_uzgAk
🧷Generative Point Tracking w/ FM🧷
👉Generative Point Tracker (GenPT) is a novel generative framework for modelling multi-modal point trajectories, able to capture the inherent multi-modality of point tracks. Repo under MIT💙 Toy sketch below👇
👉Review https://t.ly/MMFrt
👉Paper https://arxiv.org/pdf/2510.20951
👉Project mtesfaldet.net/genpt_projpage/
👉Repo https://github.com/tesfaldet/genpt
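A minimal flow-matching sketch for trajectory sampling, assuming a stub velocity field integrated with Euler steps; GenPT's conditioning and architecture differ:

```python
# Flow-matching sampling, minimally: integrate a (stub) velocity field from
# Gaussian noise to a trajectory with Euler steps; different noise seeds
# yield different plausible tracks, i.e. multi-modality.
import numpy as np

def sample_trajectory(velocity_fn, T=32, steps=20):
    x = np.random.randn(T, 2)                      # noisy (T, 2) point track
    for i in range(steps):
        t = i / steps
        x = x + (1.0 / steps) * velocity_fn(x, t)  # Euler ODE step
    return x

toy_velocity = lambda x, t: -x * (1.0 - t)     # stand-in for the learned field
print(sample_trajectory(toy_velocity).shape)   # (32, 2)
```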
🦄Unified Region-Level MLLM🦄
👉PixelRefer is a unified multimodal LLM framework that supports precise, region-specific understanding in both static images and dynamic videos, overcoming the holistic, scene-level bias of prior MLLMs. SOTA results. Demo, Repo & Dataset available💙 Toy sketch below👇
👉Review https://t.ly/WH4dQ
👉Paper arxiv.org/pdf/2510.23603
👉Project circleradon.github.io/PixelRefer
👉Repo https://github.com/alibaba-damo-academy/PixelRefer
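A hedged illustration of region-level prompting, assuming a message schema that ties a box reference into the text query; the schema is an assumption, not PixelRefer's API:

```python
# Region-grounded query: the text refers to an explicit region token whose
# geometry travels alongside the prompt. Schema is illustrative only.
query = {
    "video": "kitchen_clip.mp4",  # hypothetical input clip
    "regions": [
        {"id": "<region0>", "type": "box", "frame": 0,
         "xyxy": [120, 80, 260, 300]},
    ],
    "text": "What is the person holding in <region0>?",
}
print(query["text"])
```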
🌱PlanarTrack: Large Planar Tracking🌱
👉PlanarTrack is a large-scale, high-quality, and challenging benchmark for planar tracking: 1,150 sequences with 733K+ frames, including 1,000 short-term & 150 long-term videos. Repo & Dataset available💙 Toy sketch below👇
👉Review https://t.ly/mYNi7
👉Paper arxiv.org/pdf/2510.23368
👉Repo https://lnkd.in/edb3GMyT
👉Project https://lnkd.in/eC-hVB-U
👉Data https://lnkd.in/eew2j4tM
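A sketch of the corner-error metric commonly used for planar tracking, assuming predicted vs. ground-truth homographies; the benchmark's exact protocol is not confirmed here:

```python
# Corner-error metric: warp the four target corners with predicted vs.
# ground-truth homographies and average the corner distances.
import numpy as np

def corner_error(H_pred, H_gt, corners):
    def warp(H, pts):
        pts_h = np.c_[pts, np.ones(len(pts))] @ H.T
        return pts_h[:, :2] / pts_h[:, 2:3]
    return np.linalg.norm(warp(H_pred, corners) - warp(H_gt, corners), axis=1).mean()

corners = np.array([[0, 0], [100, 0], [100, 60], [0, 60]], float)
H_gt = np.eye(3)
H_pred = np.eye(3) + 0.01 * np.random.randn(3, 3)
print(f"mean corner error: {corner_error(H_pred, H_gt, corners):.2f}px")
```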