A team from NYU, NVIDIA, UC Berkeley, and Stanford released H*Bench, a benchmark that moves visual AI out of curated household scenes into actual complexity: transportation hubs, retail spaces, and urban streets across 12 countries.
The task is deceptively simple: given a 360° panorama and limited field of view, rotate your head, find an object or identify a navigable path. This mimics how humans actually search—not passively processing a full scene, but actively exploring with strategic head and eye movements.
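To make the protocol concrete, here is a minimal sketch of that search loop; `render_fov` and `detect` are hypothetical stand-ins rather than the benchmark's actual API.

```python
# Minimal sketch of the active-search protocol as described: rotate a limited
# field of view through a 360° panorama until the target is grounded.
# `render_fov` and `detect` are placeholders, not the benchmark's API.
def active_search(panorama, target, render_fov, detect,
                  fov_deg=90, step_deg=30, max_steps=12):
    yaw = 0.0
    for _ in range(max_steps):
        view = render_fov(panorama, yaw, fov_deg)  # what the agent currently sees
        box = detect(view, target)                 # VLM grounding call on the viewport
        if box is not None:
            return yaw, box                        # success: heading + bounding box
        yaw = (yaw + step_deg) % 360.0             # rotate the head and look again
    return None                                    # step budget exhausted
```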
The results expose where current models break down.
The researchers built HVS-3B by fine-tuning Qwen2.5-VL-3B with supervised learning and RL. Performance jumped significantly:
Object search: 14.83% → 47.38% success rate
Path search: 6.44% → 24.94% success rate
The asymmetry is revealing. Object search—essentially visual grounding with rotation control—responds well to post-training. Path search—requiring physical commonsense, spatial reasoning, and social conventions—remains stubbornly difficult.
humanoid-vstar.github.io
Thinking in 360°
Humanoid Visual Search in the Wild
NeurIPS 2025: LLMs can solve RL tasks without any external component.
Researchers introduce Prompted Policy Search (ProPS), an RL method based only on LLMs and in-context learning (sketched after the tutorial links below).
The tutorials are all implemented in Colab and can be run online with a free Gemini account:
Tutorial 1.
Tutorial 2.
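Under stated assumptions, the core loop fits in a few lines: the LLM is the optimizer, sees past (parameters, return) pairs in-context, and proposes the next candidate. `query_llm` and the Gym-style `env` are placeholders; the Colab tutorials above are the reference implementation.

```python
import json

def evaluate(params, env, horizon=200):
    """Roll out a linear threshold policy and return the episode return."""
    obs, _ = env.reset(seed=0)
    total = 0.0
    for _ in range(horizon):
        action = int(sum(w * o for w, o in zip(params, obs)) > 0)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total

def prompted_policy_search(env, query_llm, rounds=20, dim=4):
    """LLM-as-optimizer: the entire search state lives in the prompt."""
    history = []  # (params, return) pairs shown to the model each round
    for _ in range(rounds):
        prompt = (
            "You are tuning a linear control policy.\n"
            f"Past trials as (weights, return): {json.dumps(history)}\n"
            f"Reply with only a JSON list of {dim} new weights that "
            "you expect to increase the return."
        )
        params = json.loads(query_llm(prompt))        # in-context proposal step
        history.append([params, evaluate(params, env)])
    return max(history, key=lambda t: t[1])           # best trial found
```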
Google
In_Context_Learning_Demo_PromptedPolicySearch(ProPS).ipynb
Colab notebook
#DeepSeek released DeepSeek-Math-V2: Towards Self-Verifiable Mathematical Reasoning
It shows LLMs can now self-verify proofs, not just output solutions. DeepSeekMath-V2 achieves gold-medal level on IMO 2025 and CMO 2024 and scores 118/120 on Putnam 2024, pointing to a future of deep, trustworthy mathematical reasoning.
GitHub.
huggingface.co
deepseek-ai/DeepSeek-Math-V2 · Hugging Face
ByteDance: what if video generation could follow any semantic instruction without retraining or task-specific hacks?
Enter Video-As-Prompt (VAP).
By treating a reference video as an in-context semantic prompt and steering a frozen Video DiT with a plug-and-play MoT expert plus temporally biased position embeddings, it avoids artifacts, prevents forgetting, and delivers strong zero-shot control.
Trained on the new 100K-pair VAP-Data, it reaches a 38.7% user preference rate, rivaling specialized commercial models.
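One of VAP's named ingredients, the temporally biased position embeddings, can at least be gestured at in code. The offset scheme below is an assumption for illustration, not ByteDance's exact formulation: the reference clip is placed earlier on the time axis so the frozen DiT reads it as preceding context rather than as frames to denoise.

```python
import torch

def temporal_position_ids(ref_frames: int, gen_frames: int, bias: int = 16):
    """Hypothetical temporal-bias scheme: reference clip occupies early time
    indices; the clip being generated is pushed `bias` steps later."""
    ref_t = torch.arange(ref_frames)                      # 0 .. R-1
    gen_t = torch.arange(gen_frames) + ref_frames + bias  # R+bias .. R+bias+G-1
    return torch.cat([ref_t, gen_t])                      # fed to the temporal PE/RoPE
```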
NeurIPS 2025 Best Paper Awards
Here are the winners that are already making waves:
1. Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
They built Infinity-Chat and showed that LLMs are converging into an “artificial hivemind”: same-y answers, collapsed diversity, and reward models that are completely miscalibrated for real human preferences. Scary and fascinating.
2. Gated Attention for Large Language Models
Simple idea: add a learned sigmoid gate after each attention head. Result? Better performance, no attention sink issues, rock-solid long-context extrapolation on 15B MoE and 1.7B dense models. Sometimes the simplest tricks win (a minimal sketch follows this list).
3. 1000 Layer Networks for Self-Supervised RL
Deep networks (literally 1024 layers!) + contrastive self-supervised RL = new goal-reaching abilities that shallow nets never discover. Depth is underrated in RL.
4. Why Diffusion Models Don’t Memorize
Tony Bonnaire, Giulio Biroli et al.
Elegant theory: two time scales in diffusion training → implicit dynamical regularization prevents memorization even in massively over-parameterized models. Explains why they generalize so well.
Runner-ups that are also fire:
- RL doesn’t actually teach LLMs new reasoning skills beyond the base model (sorry, RLHF believers)
- Optimal bounds for transductive online learning finally settled after 30 years
- Superposition is the reason scaling laws work so cleanly
Full list
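On item 2, the gate is small enough to show in full. A minimal PyTorch sketch under the simplest reading of the idea, an input-conditioned sigmoid gate applied elementwise to the attention output before the output projection (causal mask and KV caching omitted; dims are illustrative):

```python
import torch
import torch.nn as nn

class GatedMultiheadAttention(nn.Module):
    """Sketch of head-output gating: standard multi-head attention plus one
    learned sigmoid gate on the concatenated head outputs."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)  # gate values, one per channel
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(t):  # (B, T, D) -> (B, heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(B, T, D)
        # the one-line trick: elementwise sigmoid gate on the head outputs
        heads = heads * torch.sigmoid(self.gate(x))
        return self.out(heads)
```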
StepFun released GELab-Zero-4B-preview — a 4B multimodal GUI agent fine-tuned for Android.
It understands taps, swipes, typing & waits, and can perform complex, multi-app tasks.
Built on Qwen3-VL-4B-Instruct.
HuggingFace.
GitHub.
huggingface.co
stepfun-ai/GELab-Zero-4B-preview · Hugging Face
#DeepSeek just launched DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents
1. DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
2. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.
Thinking in Tool-Use:
- Introduced a new massive agent training data synthesis method covering 1,800+ environments & 85k+ complex instructions.
- DeepSeek-V3.2 integrates thinking directly into tool-use, and also supports tool-use in both thinking and non-thinking modes (see the example sketch below).
API update:
- V3.2: Same usage pattern as V3.2-Exp.
- V3.2-Speciale: Served via a temporary endpoint: base_url="
Same pricing as V3.2, no tool calls, available until Dec 15th, 2025, 15:59 (UTC Time).
V3.2 now supports Thinking in Tool-Use — details
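For orientation, a hedged sketch of tool-use through DeepSeek's OpenAI-compatible API. The tool definition is hypothetical, and model names and endpoints should be checked against the docs (the Speciale base_url is truncated in the post, so it is not reproduced here):

```python
# Sketch: tool-use via the OpenAI-compatible client. The weather tool is
# purely illustrative; verify model names and endpoints in DeepSeek's docs.
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-chat",  # V3.2; a reasoner model name enables thinking mode
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message)
```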
Google introduced Budget Tracker for smarter AI agents
Current LLM agents waste tool-call budgets.
This work unveils Budget Tracker and BATS, enabling agents to dynamically adapt planning based on remaining resources.
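The announcement doesn't include code, so the following is only a plausible sketch of the idea: surface the remaining tool-call budget to the model at every step so its plan can adapt. All names are illustrative, not Google's API.

```python
# Hypothetical budget-aware agent loop: the remaining tool-call budget is
# injected into the prompt each step. `llm` is assumed to return a dict with
# either a "final_answer" or a "tool"/"args" pair.
def run_agent(task, llm, tools, budget=10):
    transcript = [f"Task: {task}"]
    while budget > 0:
        prompt = "\n".join(transcript) + f"\n[Budget tracker] {budget} tool calls left."
        step = llm(prompt)                      # model decides: tool call or answer
        if step.get("final_answer"):
            return step["final_answer"]
        result = tools[step["tool"]](**step["args"])
        transcript.append(f"{step['tool']}({step['args']}) -> {result}")
        budget -= 1                             # spend one unit of the budget
    return llm("\n".join(transcript) + "\nBudget exhausted; answer now.")
```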
We have a new best text-to-video model that beats Google's Veo. Runway Gen-4.5, also known as Whisper Thunder, scores +20 Elo over Veo 3 on preference data, roughly the same gap as between Veo 3 and Sora 2 Pro.
It does text-to-video, image-to-video, and keyframes, with 5-10s of output and no audio.
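For scale, an Elo gap maps to a head-to-head preference rate via the standard logistic formula; +20 Elo is only a slightly-better-than-coin-flip edge:

```python
# Expected head-to-head win rate implied by an Elo gap: 1 / (1 + 10^(-diff/400)).
diff = 20
p = 1 / (1 + 10 ** (-diff / 400))
print(round(p, 3))  # 0.529 -> Gen-4.5 preferred over Veo 3 about 53% of the time
```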
Runwayml
Runway Research | Introducing Runway Gen-4.5
A new frontier for video generation. State-of-the-art motion quality, prompt adherence and visual fidelity.
Sam Altman told staff today that he was declaring a “code red” as ChatGPT faces growing threats from Google and other AI makers.
He wrote that he’s marshaling more resources to improve model behavior and other features in the chatbot.
In an internal Slack memo, Sam said he's directing more employees to work on improving ChatGPT for over 800 million weekly users. Key code-red priorities include personalizing the chatbot so each person can customize how it interacts with them, improving ImageGen, improving model behavior, boosting speed and reliability, and minimizing overrefusals.
As a result, OpenAI is delaying ads (which the company is testing but hasn't publicly acknowledged, according to a person with knowledge of the plans), AI agents (which aim to automate tasks related to shopping and health), and Pulse. It also plans to release a new reasoning model next week that Sam said beats Google's Gemini 3 in OpenAI's internal tests.
The Information
OpenAI CEO Declares ‘Code Red’ to Combat Threats to ChatGPT, Delays Ads Effort
OpenAI CEO Sam Altman on Monday told employees he was declaring a “code red” to marshal more resources to improve ChatGPT as threats rise from Google and other artificial intelligence competitors, according to an internal memo. As a result, OpenAI plans to…
The world's first Co-Scientist integrating AI and XR. Meet LabOS.
It uses multimodal perception, self-evolving agents, and XR tools to see what researchers see, grasp experimental context, and assist in real time.
From cancer immunotherapy target discovery to stem-cell engineering, it turns labs into collaborative spaces where human insight and machine smarts evolve together, proving modern science moves fastest when thought and action team up.
Paper
arXiv.org
LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal...
Mistral released the Mistral 3 family of models
Small models Ministral 3 (14B, 8B, 3B), each released with base, instruct and reasoning versions.
And Mistral Large 3, a frontier-class open-source MoE. Apache 2.0.
mistral.ai
Introducing Mistral 3 | Mistral AI
A family of frontier open-source multimodal models
Shopify just shipped Tangle, the first open-source experimentation platform with content-based caching and a visual editor that's actually pleasant to use.
The CPU time savings alone are ridiculous (seeing 1+ year saved at Shopify).
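Shopify's post doesn't expose internals, but content-based caching, the mechanism credited for those savings, generally looks like this hypothetical sketch: key each step's result by a hash of its code and inputs so unchanged work is never recomputed.

```python
# Illustrative content-addressed cache for experiment steps; not Tangle's API.
import hashlib, inspect, json, os, pickle

CACHE_DIR = ".cache"

def cached_step(fn):
    def wrapper(*args, **kwargs):
        # cache key = hash of the step's source code plus its inputs
        key = hashlib.sha256(
            (inspect.getsource(fn) + json.dumps([args, kwargs], default=str)).encode()
        ).hexdigest()
        path = os.path.join(CACHE_DIR, key)
        if os.path.exists(path):              # content hash hit: reuse the result
            with open(path, "rb") as f:
                return pickle.load(f)
        result = fn(*args, **kwargs)          # miss: compute and store
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper
```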
Shopify
Tangle: An open-source ML experimentation platform built for scale (2025) - Shopify
Tangle saves months of compute time, makes every experiment automatically reproducible, and allows teammates to share computation without coordination.
Diffusion Language Models are hyped lately, but hard to reproduce due to missing frameworks and high training costs.
Berkeley and UIUC show a surprisingly simple path: using their dLLM toolkit, they teach BERT to chat via discrete diffusion.
No generative pretraining, about 50 GPU hours, and ModernBERT-large-chat-v0 reaches near Qwen1.5-0.5B quality with only lightweight SFT.
Even better, they open sourced the full training and inference pipeline plus a Hello World example, along with the extensible dllm framework. Efficient, cheap, and beginner friendly.
Models.
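To see why a masked LM can chat at all, here is a simplified sketch of discrete-diffusion decoding with any Hugging Face masked-LM. The linear unmasking schedule and confidence rule are common defaults assumed here; the dllm repo is the reference implementation.

```python
# Sketch: start from all [MASK] tokens after the prompt and iteratively
# commit the most confident predictions until nothing is masked.
import torch

@torch.no_grad()
def diffusion_decode(model, tokenizer, prompt, gen_len=64, steps=16):
    device = next(model.parameters()).device
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    mask_id = tokenizer.mask_token_id
    ids = torch.cat([prompt_ids,
                     torch.full((1, gen_len), mask_id, device=device)], dim=1)
    for _ in range(steps):
        still_masked = ids == mask_id
        if not still_masked.any():
            break
        logits = model(input_ids=ids).logits
        probs, preds = logits.softmax(-1).max(-1)
        # consider only still-masked positions when ranking confidence
        conf = torch.where(still_masked, probs, torch.tensor(-1.0, device=device))
        k = min(max(1, gen_len // steps), int(still_masked.sum()))
        top = conf.topk(k, dim=-1).indices    # unmask k most confident positions
        ids[0, top[0]] = preds[0, top[0]]
    return tokenizer.decode(ids[0, prompt_ids.shape[1]:], skip_special_tokens=True)
```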
GitHub
GitHub - ZHZisZZ/dllm: dLLM: Simple Diffusion Language Modeling
dLLM: Simple Diffusion Language Modeling.
UK passes law officially recognizing crypto as third kind of property
The Block
UK passes law officially recognizing crypto as third kind of property
Local industry body CryptoUK said this gives crypto a "clearer legal footing" in related crimes or litigation.
A promising step toward practical, efficient compute-in-memory systems
A new memristor-based ADC with adaptive quantization shows the possibility: analog AI hardware could unlock its full potential without bulky converters in the way.
It delivers strong CIFAR-10 and ImageNet performance at just 5 bits, achieves up to 15.1x better energy efficiency and a 12.9x smaller area, and cuts CIM system overhead by more than half.
Nature
Memristor-based adaptive analog-to-digital conversion for efficient and accurate compute-in-memory
Nature Communications - Hong et al. report an adaptive memristor-based analog-to-digital converter which leverages the programmable nature of memristors to implement optimal, data-aware...
OpenAI published a blog post arguing that confessions can keep language models honest.
It's a proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
Even when models learn to cheat, they’ll still admit it...
Openai
How confessions can keep language models honest
We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
Google introduced the Massive Sound Embedding Benchmark (MSEB).
This new open-source framework evaluates universal sound understanding across 8 core tasks, from retrieval to reconstruction, in order to accelerate progress in multimodal AI.
research.google
From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence
Best Paper (Datasets & Benchmarks track) Award at #NeurIPS2025 for Artificial Hivemind
Researchers from the University of Washington, CMU, and the Allen Institute have identified a fundamental problem in modern language models: the "Artificial Hivemind effect". HuggingFace.
Different models independently generate identical responses to open-ended questions. GPT-4, Qwen, Llama, Mixtral - all write "time is a river" when asked for a metaphor about time.
Average semantic similarity across different model families: 71-82%. This isn't a bug in one model. It's a systemic property of current LLM training paradigms.
The study covers 70+ models using the INFINITY-CHAT dataset:
- 26K real-world open-ended queries from WildChat
- 17 categories (from creative writing to philosophical questions)
- 31,250 human annotations (25 independent annotators per example)
Two forms of collapse:
• Intra-model: a single model repeats itself with pairwise similarity >0.8 in 79% of cases (even at temperature=1.0); see the measurement sketch after this list
• Inter-model: different models produce identical phrases and structures.
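The intra-model number is straightforward to reproduce in spirit: sample several completions for one prompt, embed them, and average pairwise cosine similarity. A sketch, with a common default embedder that is not necessarily the paper's choice:

```python
# Mean pairwise semantic similarity of N responses to the same prompt.
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def mean_pairwise_similarity(responses):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed default embedder
    embs = embedder.encode(responses, convert_to_tensor=True)
    sims = [cos_sim(embs[i], embs[j]).item()
            for i, j in combinations(range(len(responses)), 2)]
    return sum(sims) / len(sims)  # >0.8 would indicate hivemind-style collapse
```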
Critical finding: LLM judges and reward models systematically fail when evaluating alternative responses of similar quality. Correlation with humans drops from 0.4 to 0.05 on examples with diverse content.
For business:
This creates an "AI feedback loop" - models are trained based on evaluations from other models that are themselves poorly calibrated for diversity.
Implications:
- Reduced innovation potential in AI assistants
- Standardization of creative content
- Loss of alternative perspectives in strategic analysis
- Risk of homogenizing user thinking patterns
The future of AI should not be echoes of one voice, but a chorus of many.
arXiv.org
Artificial Hivemind: The Open-Ended Homogeneity of Language Models...
Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar...
The official announcement is pending, but Google is signing a multi-year partnership with Replit.
CNBC
Google partners with Replit, in vibe-coding push
Replit will expand usage of Google Cloud services, add more of Google's models onto its platform, and support AI coding use cases for enterprise customers.
Anthropic released Interviewer, a tool that lets you interview people at scale using Claude.
This helps expand the kind of research you can do.
Anthropic
Introducing Anthropic Interviewer
What 1,250 professionals told us about working with AI