Engineer Readings – Telegram
[debugging] Hash-Based Bisect Debugging in Compilers and Runtimes

https://research.swtch.com/bisect
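The gist of the technique, for context: every candidate change site in the compiler or runtime gets a stable hash, and a driver re-runs the failing test while enabling only the changes whose hash ends in a given bit pattern, binary-searching down to a culprit. A minimal sketch of that loop, assuming a caller-supplied `fails(subset)` test harness (all names here are illustrative, not the article's actual tooling):

```python
import hashlib

def site_hash(site: str) -> str:
    # stable per-change-site hash, e.g. of "pkg/file.go:123"
    return hashlib.sha256(site.encode()).hexdigest()

def suffix_match(site: str, suffix: str) -> bool:
    bits = bin(int(site_hash(site), 16))[2:].zfill(256)
    return bits.endswith(suffix)

def bisect(sites, fails):
    """Narrow a failure down to one change site by growing a hash bit-suffix."""
    suffix = ""  # empty suffix matches every site at first
    while True:
        matching = [s for s in sites if suffix_match(s, suffix)]
        if len(matching) <= 1:
            return matching
        for bit in "01":  # try each half of the current match set
            half = [s for s in sites if suffix_match(s, bit + suffix)]
            if half and fails(half):
                suffix = bit + suffix  # failure reproduces here; keep narrowing
                break
        else:
            # neither half fails alone: the bug needs several changes at once;
            # the article's bisect-reduce algorithm handles that case
            return matching
```

A toy harness like `fails = lambda subset: "mod/foo.go:42" in subset` converges on that one site in O(log n) runs.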
Happy New Year everyone! 🎄
[ai chips] Nvidia's latest move in the AI hardware race: specialized chips for inference

Nvidia just announced the Rubin CPX - a GPU specifically optimized for the prefill phase of inference. This is fascinating because it challenges the "one chip fits all" approach we've seen dominating AI infrastructure.
The core insight: prefill (processing the whole prompt to produce the first token) is compute-bound and barely stresses memory bandwidth, while decode (generating each subsequent token) is the opposite: memory-bandwidth-bound with underutilized compute. Running both phases on the same high-end GPU with expensive HBM wastes resources.
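A back-of-envelope way to see the asymmetry (illustrative numbers only, not Rubin specs): per weight byte read, a prefill GEMM over thousands of prompt tokens does thousands of FLOPs, while decode's matrix-vector step does only a couple.

```python
# Rough arithmetic intensity (FLOPs per weight byte) for one weight matrix
# in a transformer layer, ignoring activation traffic. Numbers are made up.
d = 8192           # hypothetical model width
bytes_per_w = 1    # low-precision weight storage, order of magnitude

def intensity(tokens: int) -> float:
    flops = 2 * tokens * d * d           # GEMM: (tokens x d) @ (d x d)
    weight_bytes = d * d * bytes_per_w   # weights are read once either way
    return flops / weight_bytes

print(intensity(4096))  # prefill over a 4K prompt: ~8192 FLOPs/byte -> compute-bound
print(intensity(1))     # decode, one token at a time: ~2 FLOPs/byte -> memory-bound
```

Decode only crosses into compute-bound territory with heavy batching, which is exactly why the two phases want different silicon.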
Rubin CPX uses cheaper GDDR7 instead of HBM (cutting memory cost by 50%+), drops NVLink for simple PCIe, but maintains strong FP4 compute - 20 PFLOPS dense. It's designed to be drastically cheaper per unit while being better suited for its specific workload.
The competitive angle is brutal: AMD and others were just catching up with rack-scale designs, and now they need to develop specialized prefill chips too, pushing their roadmaps back another cycle.
This disaggregated approach (separate hardware for prefill/decode) hints at where inference infrastructure is heading - not just software optimization, but purpose-built silicon for different phases of the same task.
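In code terms, disaggregation is just a handoff: run the prompt through a compute-optimized pool, then ship the resulting KV cache to a bandwidth-optimized pool for token-by-token generation. A toy sketch (all class and method names invented for illustration):

```python
# Toy model of disaggregated serving: prefill and decode run on different
# worker pools and hand off the KV cache between phases.
class PrefillWorker:              # stands in for a CPX-class, compute-heavy box
    def prefill(self, prompt):
        kv_cache = [f"kv({tok})" for tok in prompt.split()]  # fake KV cache
        return kv_cache, "<tok0>"                            # first token

class DecodeWorker:               # stands in for an HBM-class, bandwidth-heavy box
    def decode(self, kv_cache, n):
        return [f"<tok{i}>" for i in range(1, n + 1)]        # fake decode steps

def serve(prompt, prefill_pool, decode_pool, max_new=3):
    kv, first = prefill_pool.prefill(prompt)   # compute-bound phase
    rest = decode_pool.decode(kv, max_new)     # memory-bound phase
    return [first, *rest]

print(serve("the quick brown fox", PrefillWorker(), DecodeWorker()))
```

The real engineering cost is step two: moving the KV cache between boxes fast enough that the handoff doesn't eat the savings.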

https://newsletter.semianalysis.com/p/another-giant-leap-the-rubin-cpx-specialized-accelerator-rack
[#AI #LLM #MachineLearning #EdgeComputing]
Running LLMs with Just SQL? New Research Makes It Possible

Researchers dropped TranSQL+ - a wild approach to running Large Language Models using only SQL queries in an ordinary database. No GPUs, no fancy ML frameworks needed.
The idea: Convert the entire LLM (like Llama3-8B) into SQL queries that run in DuckDB or similar databases. Sounds crazy, but it actually works!
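To get a feel for the trick (a minimal sketch, not the paper's actual schema): store a weight matrix as (row, col, val) triples, and a matrix-vector product becomes a join plus aggregation; relational engines are already very good at exactly that.

```python
# Matrix-vector product expressed purely in SQL. Requires `pip install duckdb`.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE w (row INTEGER, col INTEGER, val DOUBLE)")
con.execute("CREATE TABLE x (idx INTEGER, val DOUBLE)")
# tiny 2x2 example: W = [[1, 2], [3, 4]], x = [10, 20]
con.execute("INSERT INTO w VALUES (0,0,1),(0,1,2),(1,0,3),(1,1,4)")
con.execute("INSERT INTO x VALUES (0,10),(1,20)")

y = con.execute("""
    SELECT w.row, SUM(w.val * x.val) AS val
    FROM w JOIN x ON w.col = x.idx
    GROUP BY w.row ORDER BY w.row
""").fetchall()
print(y)  # [(0, 50.0), (1, 110.0)]
```

Stack enough of these (plus nonlinearities) and you have a forward pass, with the database's buffer manager doing the memory juggling for you.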
Results on a modest laptop (4 cores, 16GB RAM):

20× faster than DeepSpeed for first token
4× faster for generating subsequent tokens
Works entirely on CPU with limited memory

Why this matters:
• Runs on ANY device with a database (phones, laptops, IoT)
• No need to compile for different hardware
• Databases already know how to manage memory efficiently
Perfect for privacy-focused edge deployments where you can't rely on cloud GPUs.

Paper: https://arxiv.org/pdf/2502.02818
[#AI #AICoding #AIAgents #AutonomousAgents #MultiAgentSystems #Cursor]

Scaling AI Coding: Lessons From Running Hundreds of Agents

Cursor shares how they pushed the limits of AI by running hundreds of autonomous coding agents at the same time on real software projects.

Instead of short tasks, these agents worked for weeks, edited shared codebases, and even helped build complex products like a web browser.

The biggest lesson?
Uncoordinated agents create chaos, but a planner + worker system keeps them aligned, focused, and productive over long periods.
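The pattern itself is simple to state (a minimal sketch; the task splitting and "agent" call below are stand-ins, not Cursor's actual system): one planner owns the backlog, and each worker stays scoped to a single task.

```python
from dataclasses import dataclass, field

@dataclass
class Planner:
    backlog: list = field(default_factory=list)

    def plan(self, goal: str):
        # a real planner would use an LLM to decompose the goal;
        # here we just fake three subtasks
        self.backlog = [f"{goal}: step {i}" for i in range(3)]

    def next_task(self):
        return self.backlog.pop(0) if self.backlog else None

def worker(task: str) -> str:
    return f"done({task})"  # stand-in for an agent editing the codebase

planner = Planner()
planner.plan("implement feature X")
results = []
while (task := planner.next_task()) is not None:
    results.append(worker(task))  # workers never coordinate with each other
print(results)
```

The coordination all lives in the planner; workers never negotiate with each other, which is what keeps hundreds of them from stepping on the same files.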

The article shows that, with the right structure, teams of AI agents can tackle massive engineering challenges much like human teams, and we're just getting started.

🔗 Read more: https://cursor.com/blog/scaling-agents