All about AI, Web 3.0, BCI – Telegram
3.22K subscribers
724 photos
26 videos
161 files
3.08K links
This channel is about AI, Web 3.0, and brain–computer interfaces (BCI)

owner @Aniaslanyan
A team from NYU, NVIDIA, UC Berkeley, and Stanford released H*Bench—a benchmark that moves visual AI out of curated household scenes into actual complexity: transportation hubs, retail spaces, urban streets across 12 countries.

The task is deceptively simple: given a 360° panorama and limited field of view, rotate your head, find an object or identify a navigable path. This mimics how humans actually search—not passively processing a full scene, but actively exploring with strategic head and eye movements.
The results expose where current models break down.

The researchers built HVS-3B by post-training Qwen2.5-VL-3B with supervised fine-tuning and RL. Performance jumped significantly:

Object search: 14.83% → 47.38% success rate
Path search: 6.44% → 24.94% success rate
The asymmetry is revealing. Object search—essentially visual grounding with rotation control—responds well to post-training. Path search—requiring physical commonsense, spatial reasoning, and social conventions—remains stubbornly difficult.
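The active-search setup can be sketched in a few lines. This is an illustrative toy, not H*Bench's actual interface: the 30° step size, the equirectangular crop, and the stub detector are all my assumptions. The loop extracts a limited field of view at the current yaw, checks it, rotates, and repeats.

```python
import numpy as np

def fov_crop(panorama: np.ndarray, yaw_deg: float, fov_deg: float = 90.0) -> np.ndarray:
    """Extract the horizontal slice of a 360° equirectangular panorama
    centered on the agent's current yaw (pitch ignored for simplicity)."""
    h, w = panorama.shape[:2]
    center = int((yaw_deg % 360.0) / 360.0 * w)            # yaw -> pixel column
    half = int(fov_deg / 360.0 * w / 2)
    cols = [(center + d) % w for d in range(-half, half)]  # wrap around the seam
    return panorama[:, cols]

# Toy search loop: rotate in 30° steps until a (stubbed) detector fires.
pano = np.zeros((64, 256, 3), dtype=np.uint8)
pano[:, 200:210] = 255                                     # "object" at ~281°

def sees_object(view: np.ndarray) -> bool:
    return view.max() > 0

yaw = 0.0
for _ in range(12):
    if sees_object(fov_crop(pano, yaw)):
        break
    yaw += 30.0
print(f"object found at yaw {yaw:.0f}")  # prints "object found at yaw 240"
```

A real agent would replace `sees_object` with a VLM grounding call; the point is that the model only ever sees the narrow crop, never the full panorama.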
NeurIPS 2025: LLMs can solve RL tasks without any external component.

Researchers introduce Prompted Policy Search (ProPS), an RL method based only on LLMs and in-context learning.

The tutorials are all implemented in Colab and can be run online with a free Gemini account:

Tutorial 1.
Tutorial 2.
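A ProPS-style loop is easy to sketch. The shape below is my reading of the idea, not the paper's exact prompt format: the LLM (stubbed here as a hill-climbing `propose` function) conditions on the history of (policy, return) pairs and proposes new parameters, with no gradients and no value function.

```python
import random

def prompted_policy_search(evaluate, propose, iters=20):
    """In-context policy search: keep a history of (params, return) pairs
    and ask the proposer for the next candidate. No external RL machinery."""
    history = []
    for _ in range(iters):
        params = evaluate_params = propose(history)  # LLM conditions on past trials
        ret = evaluate(params)                       # roll out the policy in the env
        history.append((params, ret))
    return max(history, key=lambda pr: pr[1])

# Stand-in "LLM": nudge toward the best params seen so far. A real run would
# format `history` into a prompt and parse the model's reply instead.
def propose(history):
    if not history:
        return random.uniform(-5, 5)
    best, _ = max(history, key=lambda pr: pr[1])
    return best + random.gauss(0, 1)

evaluate = lambda x: -(x - 2.0) ** 2                 # 1-D toy task, optimum at x=2
random.seed(0)
best_params, best_ret = prompted_policy_search(evaluate, propose)
print(round(best_params, 2))
```

Swapping the stub for a Gemini call is exactly what the Colab tutorials walk through.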
#DeepSeek released DeepSeek-Math-V2: Towards Self-Verifiable Mathematical Reasoning

It shows LLMs can now self-verify proofs, not just output solutions. DeepSeekMath-V2 achieves gold-medal level on IMO 2025 and CMO 2024, and 118/120 on Putnam 2024, pointing to a future of deep, trustworthy mathematical reasoning.

GitHub.
ByteDance: what if video generation could follow any semantic instruction without retraining or task-specific hacks?

Enter Video-As-Prompt (VAP).

By treating a reference video as an in-context semantic prompt and steering a frozen Video DiT with a plug-and-play MoT expert plus temporally biased position embeddings, it avoids artifacts, prevents forgetting, and delivers strong zero-shot control.

Trained on the new 100K-pair VAP-Data, it reaches a 38.7% user preference rate, rivaling specialized commercial models.
NeurIPS 2025 Best Paper Awards

Here are the winners that are already making waves:

1. Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

They built Infinity-Chat and showed that LLMs are converging into an “artificial hivemind”: same-y answers, collapsed diversity, and reward models that are completely miscalibrated for real human preferences. Scary and fascinating.

2. Gated Attention for Large Language Models

Simple idea: add a learned sigmoid gate after each attention head. Result? Better performance, no attention sink issues, rock-solid long-context extrapolation on 15B MoE and 1.7B dense models. Sometimes the simplest tricks win.

3. 1000 Layer Networks for Self-Supervised RL

Deep networks (literally 1024 layers!) + contrastive self-supervised RL = new goal-reaching abilities that shallow nets never discover. Depth is underrated in RL.

4. Why Diffusion Models Don’t Memorize

Tony Bonnaire, Giulio Biroli et al.
Elegant theory: two time scales in diffusion training → implicit dynamical regularization prevents memorization even in massively over-parameterized models. Explains why they generalize so well.

Runner-ups that are also fire:
- RL doesn’t actually teach LLMs new reasoning skills beyond the base model (sorry, RLHF believers)
- Optimal bounds for transductive online learning finally settled after 30 years
- Superposition is the reason scaling laws work so cleanly
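The gated-attention trick from paper 2 fits in a few lines of numpy. Illustrative sketch only: the exact gate placement in the paper may differ, and the query-conditioned sigmoid applied to each head's output below is my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_head(x, Wq, Wk, Wv, Wg):
    """Standard scaled dot-product attention for one head, followed by a
    learned elementwise sigmoid gate on the head output."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    out = softmax(scores) @ v                  # (seq, d_head)
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))     # per-token, per-channel gate in (0, 1)
    return gate * out                          # lets a head switch itself off

seq, d_model, d_head = 4, 8, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wg = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(4))
y = gated_attention_head(x, Wq, Wk, Wv, Wg)
print(y.shape)  # (4, 4)
```

Because the gate can drive a head's output to near zero for a given token, the model no longer needs an "attention sink" token to dump unwanted attention mass on, which is the intuition behind the long-context gains.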

Full list.
StepFun released GELab-Zero-4B-preview — a 4B multimodal GUI agent fine-tuned for Android.

It understands taps, swipes, typing & waits, and can perform complex, multi-app tasks.
Built on Qwen3-VL-4B-Instruct.

HuggingFace.
GitHub.
#DeepSeek just launched DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents

1. DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.

2. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.

Thinking in Tool-Use:

- Introduced a new massive agent training data synthesis method covering 1,800+ environments & 85k+ complex instructions.

- DeepSeek-V3.2 integrates thinking directly into tool-use, and supports tool calls in both thinking and non-thinking modes.

API update:

- V3.2: Same usage pattern as V3.2-Exp.
- V3.2-Speciale: Served via a temporary endpoint: base_url="
Same pricing as V3.2, no tool calls, available until Dec 15th, 2025, 15:59 (UTC Time).

V3.2 now supports Thinking in Tool-Use — details
Google introduced Budget Tracker for smarter AI agents

Current LLM agents waste tool-call budgets.

This work unveils Budget Tracker and BATS, enabling agents to dynamically adapt planning based on remaining resources.
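A minimal sketch of the idea, with class and method names invented for illustration (not Google's API): track the remaining tool-call budget and let the planner switch strategy as it shrinks, instead of spending calls as if the budget were infinite.

```python
class BudgetTracker:
    """Track a tool-call budget and expose a planning hint that
    tightens as the budget runs down."""
    def __init__(self, total_calls: int):
        self.total = total_calls
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.total - self.used

    def charge(self, n: int = 1) -> None:
        if self.used + n > self.total:
            raise RuntimeError("tool-call budget exhausted")
        self.used += n

    def plan_hint(self) -> str:
        frac = self.remaining / self.total
        if frac > 0.5:
            return "explore"      # plenty of budget: broad searches OK
        if frac > 0.2:
            return "exploit"      # narrow to the most promising branch
        return "answer_now"       # nearly out: stop calling tools

tracker = BudgetTracker(total_calls=10)
for _ in range(6):
    tracker.charge()
print(tracker.remaining, tracker.plan_hint())  # prints "4 exploit"
```

In an agent loop, `plan_hint()` (or the raw `remaining` count) would be injected into the prompt before each planning step, which is the budget-aware behavior the work is after.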
We have a new best text-to-video model that beats Google's Veo. Runway Gen-4.5, or Whisper Thunder, has a +20 Elo edge on preference data over Veo 3, roughly the same gap as between Veo 3 and Sora 2 Pro.

Does text-to-vid, image-to-vid, keyframes. 5-10s of output. No audio.
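For intuition on what +20 Elo means, the standard logistic Elo formula converts a rating gap into an expected head-to-head preference rate:

```python
def elo_win_prob(delta: float) -> float:
    """Expected preference rate implied by an Elo gap
    (standard logistic Elo formula: base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(f"{elo_win_prob(20):.3f}")  # 0.529: +20 Elo ≈ 52.9% preference rate
```

So "beats Veo 3" here means raters prefer Gen-4.5 in roughly 53 of 100 matchups, a small but consistent edge.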
Sam Altman told staff today that he was declaring a “code red” as ChatGPT faces growing threats from Google and other AI makers.

He wrote that he’s marshaling more resources to improve model behavior and other features in the chatbot.

In an internal Slack memo, Sam said he's directing more employees to work on improving ChatGPT for its more than 800 million weekly users. Key code-red priorities include personalizing the chatbot so each person can customize how it interacts with them, improving ImageGen, improving model behavior, boosting speed and reliability, and minimizing overrefusals.

OpenAI is delaying ads (which the company is testing but hasn't publicly acknowledged, according to a person with knowledge of the plans), AI agents (which aim to automate tasks related to shopping and health), and Pulse, and plans to release a new reasoning model next week that Sam said beats Google's Gemini 3 in OpenAI's internal tests.
The world's first Co-Scientist integrating AI and XR. Meet LabOS.

It uses multimodal perception, self-evolving agents, and XR tools to see what researchers see, grasp experimental context, and assist in real time.

From cancer immunotherapy target discovery to stem-cell engineering, it turns labs into collaborative spaces where human insight and machine smarts evolve together, proving modern science moves fastest when thought and action team up.

Paper
Mistral released the Mistral 3 family of models

Small models Ministral 3 (14B, 8B, 3B), each released with base, instruct and reasoning versions.

And Mistral Large 3, a frontier-class open-source MoE. Apache 2.0.
Shopify just shipped Tangle - the first open-source experimentation platform with content-based caching and a visual editor that's actually pleasant to use.

The CPU time savings alone are ridiculous (seeing 1+ year saved at Shopify).
Diffusion Language Models are hyped lately, but hard to reproduce due to missing frameworks and high training costs.

Berkeley and UIUC show a surprisingly simple path: using their dLLM toolkit, they teach BERT to chat via discrete diffusion.

No generative pretraining, about 50 GPU-hours, and ModernBERT-large-chat-v0 reaches near-Qwen1.5-0.5B quality with only lightweight SFT.

Even better, they open sourced the full training and inference pipeline plus a Hello World example, along with the extensible dllm framework. Efficient, cheap, and beginner friendly.
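The core decoding loop of a discrete-diffusion LM is simple to sketch. Illustrative only (the real dllm toolkit's API and unmasking schedule will differ): start from an all-mask sequence and, at each step, unmask a fraction of positions with the model's current predictions.

```python
import random

MASK = "[MASK]"

def diffusion_decode(predict, length=8, steps=4, seed=0):
    """Iterative unmasking: the discrete-diffusion analogue of sampling.
    `predict` maps a partially masked sequence to a token per position."""
    rng = random.Random(seed)
    seq = [MASK] * length
    masked = list(range(length))
    per_step = max(1, length // steps)
    while masked:
        rng.shuffle(masked)
        reveal, masked = masked[:per_step], masked[per_step:]
        preds = predict(seq)          # model sees the current partial sequence
        for i in reveal:
            seq[i] = preds[i]         # commit predictions at revealed positions
    return seq

# Stub "model" that predicts the same token everywhere; a trained BERT-style
# encoder would fill each position from its masked-LM head instead.
out = diffusion_decode(lambda seq: ["hi"] * len(seq))
print(out)  # all 8 positions end up unmasked
```

This is why a bidirectional encoder like BERT is a natural fit: every step is just masked-token prediction, the objective it was pretrained on.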

Models.
A promising step toward practical, efficient compute-in-memory systems

A new memristor-based ADC with adaptive quantization shows that analog AI hardware could unlock its full potential without bulky converters in the way.

It delivers strong CIFAR10 and ImageNet performance at just 5 bits, achieves up to 15.1x better energy efficiency and 12.9x smaller area, and cuts CIM system overhead by more than half.
OpenAI published a blog post arguing that confessions can keep language models honest.

A proof-of-concept method trains models to report when they break instructions or take unintended shortcuts.

Even when models learn to cheat, they’ll still admit it...
Google introduced the Massive Sound Embedding Benchmark (MSEB).

This new open-source framework evaluates universal sound understanding across 8 core tasks, from retrieval to reconstruction, in order to accelerate progress in multimodal AI.
Best Paper (DB track) Award at #NeurIPS2025 for Artificial Hivemind

Researchers from University of Washington, CMU, and Allen Institute have identified a fundamental problem in modern language models - the "Artificial Hivemind effect". HuggingFace.

Different models independently generate identical responses to open-ended questions. GPT-4, Qwen, Llama, Mixtral - all write "time is a river" when asked for a metaphor about time.

Average semantic similarity across different model families: 71-82%. This isn't a bug in one model. It's a systemic property of current LLM training paradigms.

The study covers 70+ models using the INFINITY-CHAT dataset:
- 26K real-world open-ended queries from WildChat
- 17 categories (from creative writing to philosophical questions)
- 31,250 human annotations (25 independent annotators per example)

Two forms of collapse:

Intra-model: a single model repeats itself with pairwise similarity >0.8 in 79% of cases (even at temperature=1.0)

Inter-model: different models produce identical phrases and structures.
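The intra-model collapse statistic is easy to reproduce in miniature. A sketch, assuming responses are embedded with any sentence encoder (the embeddings below are synthetic stand-ins):

```python
import numpy as np

def pairwise_cosine(embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of response embeddings."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def collapse_rate(embs: np.ndarray, threshold: float = 0.8) -> float:
    """Fraction of distinct response pairs above the similarity threshold —
    the paper's >0.8-similarity statistic, in miniature."""
    sims = pairwise_cosine(embs)
    iu = np.triu_indices(len(embs), k=1)   # distinct pairs only
    return float((sims[iu] > threshold).mean())

# Toy example: four near-identical "time is a river" responses + one outlier.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
embs = np.stack([base + 0.05 * rng.normal(size=8) for _ in range(4)]
                + [rng.normal(size=8)])
print(collapse_rate(embs))
```

Run over 25 samples per prompt from one model, a high `collapse_rate` is exactly the intra-model hivemind effect; run across models on the same prompt, it measures the inter-model version.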

Critical finding: LLM judges and reward models systematically fail when evaluating alternative responses of similar quality. Correlation with humans drops from 0.4 to 0.05 on examples with diverse content.

For business:
This creates an "AI feedback loop" - models are trained based on evaluations from other models that are themselves poorly calibrated for diversity.
Implications:
→ Reduced innovation potential in AI assistants
→ Standardization of creative content
→ Loss of alternative perspectives in strategic analysis
→ Risk of homogenizing user thinking patterns.

The future of AI should not be echoes of one voice, but a chorus of many.
Anthropic released Interviewer, which lets you interview people at scale using Claude.

This helps expand the kind of research you can do.