[nvidia][gpu][architecture]
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
https://t.co/4jLomyexEu
Aleksagordic
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels - Aleksa Gordić
From GPU architecture and PTX/SASS to warp-tiling and deep asynchronous tensor core pipelines.
🔥3
[oracle][ai][db]
Oracle released Oracle AI Database 26ai.
Notes:
https://www.oracle.com/news/announcement/ai-world-database-26ai-powers-the-ai-for-data-revolution-2025-10-14/
Oracle
Oracle AI Database 26ai Powers the AI for Data Revolution
Oracle AI Database 26ai architects AI into the core of data management, furthering Oracle’s commitment to help customers securely bring AI to all their data, everywhere.
[ai chips] Nvidia's latest move in the AI hardware race: specialized chips for inference
Nvidia just announced the Rubin CPX - a GPU specifically optimized for the prefill phase of inference. This is fascinating because it challenges the "one chip fits all" approach we've seen dominating AI infrastructure.
The core insight: prefill (processing the whole prompt to produce the first token) is compute-bound and barely touches memory bandwidth, while decode (generating each subsequent token) is the opposite - memory-bandwidth-bound with underutilized compute. Running both phases on the same high-end GPU with expensive HBM wastes resources.
Rubin CPX uses cheaper GDDR7 instead of HBM (cutting memory cost by 50%+), drops NVLink for simple PCIe, but maintains strong FP4 compute - 20 PFLOPS dense. It's designed to be drastically cheaper per unit while being better suited for its specific workload.
The competitive angle is brutal: AMD and others were just catching up with rack-scale designs, and now they need to develop specialized prefill chips too, pushing their roadmaps back another cycle.
This disaggregated approach (separate hardware for prefill/decode) hints at where inference infrastructure is heading - not just software optimization, but purpose-built silicon for different phases of the same task.
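A rough roofline-style sketch of that prefill/decode asymmetry (a back-of-the-envelope model; the FLOP and bandwidth numbers below are illustrative assumptions, not Rubin CPX specs):

```python
# Per forward pass, a dense model with P parameters does ~2*P FLOPs per token,
# but streams its P weights from memory only once. So arithmetic intensity
# (FLOPs per byte of weights read) scales with how many tokens share that pass.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 1.0) -> float:
    return 2 * tokens_per_pass / bytes_per_param

# Hypothetical accelerator: 20e15 FLOP/s of compute, 2e12 B/s of memory bandwidth.
machine_balance = 20e15 / 2e12  # need ~10,000 FLOPs per byte to stay compute-bound

for phase, tokens in [("prefill", 8192), ("decode", 1)]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > machine_balance else "memory-bandwidth-bound"
    print(f"{phase:7s}: {ai:>7.0f} FLOPs/byte vs balance {machine_balance:.0f} -> {bound}")
```

Prefill amortizes each weight fetch over thousands of prompt tokens, so cheaper GDDR7 bandwidth is enough; decode amortizes it over a single token, which is exactly the workload you'd leave on HBM parts.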
https://newsletter.semianalysis.com/p/another-giant-leap-the-rubin-cpx-specialized-accelerator-rack
Semianalysis
Another Giant Leap: The Rubin CPX Specialized Accelerator & Rack
New Prefill Specialized GPU, Rack Architecture, BOM, Disaggregated PD, Higher Perf per TCO, Lower TCO, GDDR7 & HBM Market Trends
[#AI #LLM #MachineLearning #EdgeComputing]
Running LLMs with Just SQL? New Research Makes It Possible
Researchers dropped TranSQL+ - a wild approach to run Large Language Models using only SQL queries in a regular database. No GPUs, no fancy ML frameworks needed.
The idea: Convert the entire LLM (like Llama3-8B) into SQL queries that run in DuckDB or similar databases. Sounds crazy, but it actually works!
Results on a modest laptop (4 cores, 16GB RAM):
• 20× faster than DeepSpeed for the first token
• 4× faster for generating subsequent tokens
• Works entirely on CPU with limited memory
Why this matters:
• Runs on ANY device with a database (phones, laptops, IoT)
• No need to compile for different hardware
• Databases already know how to manage memory efficiently
Perfect for privacy-focused edge deployments where you can't rely on cloud GPUs.
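To get a feel for the trick, here's a minimal sketch (a hypothetical schema, not the paper's actual TranSQL+ encoding) of one linear layer y = W·x written as a join + aggregation in DuckDB:

```python
# One matrix-vector product expressed relationally and run through DuckDB.
# Table names and layout are made up for illustration; TranSQL+ generates its
# own encoding for full transformer layers (embeddings, attention, MLP, etc.).
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE weights(row_id INT, col_id INT, val DOUBLE)")  # W as (row, col, value)
con.execute("CREATE TABLE x(col_id INT, val DOUBLE)")                    # input vector
con.execute("INSERT INTO weights VALUES (0,0,0.5),(0,1,-1.0),(1,0,2.0),(1,1,0.25)")
con.execute("INSERT INTO x VALUES (0,3.0),(1,4.0)")

# y[row] = SUM over col of W[row, col] * x[col]
rows = con.execute("""
    SELECT w.row_id, SUM(w.val * x.val) AS y
    FROM weights w JOIN x USING (col_id)
    GROUP BY w.row_id
    ORDER BY w.row_id
""").fetchall()
print(rows)  # [(0, -2.5), (1, 7.0)]
```

The database's own vectorized execution, spilling, and buffer management then do the work an ML runtime would normally have to do.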
Paper: https://arxiv.org/pdf/2502.02818
🔥2🤯2
[#AI #AICoding #AIAgents #AutonomousAgents #MultiAgentSystems #Cursor]
Scaling AI Coding: Lessons From Running Hundreds of Agents
Cursor shares how they pushed the limits of AI by running hundreds of autonomous coding agents at the same time on real software projects.
Instead of short tasks, these agents worked for weeks, edited shared codebases, and even helped build complex products like a web browser.
The biggest lesson?
Uncoordinated agents create chaos — but a planner + worker system keeps them aligned, focused, and productive over long periods.
The article shows that with the right structure, AI teams can tackle massive engineering challenges, similar to real human teams — and we’re just getting started.
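A toy sketch of that planner + worker split (hypothetical structure, not Cursor's actual system): one planner owns the task list and workers only claim work from it, which is what keeps many agents from editing toward conflicting plans.

```python
# Toy planner/worker loop; the coordination structure is the point, the "work" is stubbed.
import queue
import threading
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    done: bool = False

class Planner:
    """Owns the global plan. Workers never invent their own tasks."""
    def __init__(self, goal: str):
        self.goal = goal
        self.tasks: queue.Queue = queue.Queue()

    def plan(self) -> None:
        # A real planner would have an LLM decompose the goal; hard-coded here.
        for step in ("scaffold module", "implement feature", "write tests"):
            self.tasks.put(Task(f"{self.goal}: {step}"))

def worker(name: str, planner: Planner) -> None:
    while True:
        try:
            task = planner.tasks.get_nowait()   # claim work instead of inventing it
        except queue.Empty:
            return
        # A real worker would call an LLM, edit code, run tests, open a PR.
        task.done = True
        print(f"{name} finished: {task.description}")

planner = Planner("add dark mode")
planner.plan()
threads = [threading.Thread(target=worker, args=(f"agent-{i}", planner)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```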
🔗 Read more: https://cursor.com/blog/scaling-agents
Cursor
Scaling long-running autonomous coding
We've been experimenting with running coding agents autonomously for weeks at a time.
🔥2
[#AI #context #Manus]
Context Engineering for AI Agents: Lessons from Building Manus
The Manus team shares key insights from building their AI agent system, focusing on context engineering rather than training custom models. The article covers critical strategies like designing around KV-cache for better performance, using the filesystem as unlimited context storage, and keeping error traces to help agents learn from mistakes.
Key takeaways: maximize cache hit rates by keeping prompts stable, mask tools instead of removing them to maintain context integrity, and leverage the filesystem for persistent memory beyond token limits.
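A minimal sketch of two of those ideas, a byte-stable prompt prefix for KV-cache hits and masking tools instead of deleting them (hypothetical helper and field names, not Manus's code; real APIs bias token ids rather than tool names):

```python
# Sketch of KV-cache-friendly context building plus tool masking.
SYSTEM_PROMPT = "You are a general-purpose agent."        # never edited mid-session
TOOL_DEFS = ["browser_open", "shell_run", "file_write"]   # never removed mid-session

def build_context(history: list[dict], allowed_tools: set[str]) -> dict:
    # 1) Cache hits: the prefix (system prompt + full tool list) is byte-identical
    #    on every call, and history is append-only, so the KV-cache keeps hitting.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history

    # 2) Tool masking: rather than dropping tool definitions (which would change
    #    the prefix and invalidate the cache), bias decoding away from banned tools.
    logit_bias = {t: (0.0 if t in allowed_tools else -100.0) for t in TOOL_DEFS}

    return {"messages": messages, "tools": TOOL_DEFS, "logit_bias": logit_bias}

ctx = build_context(
    history=[{"role": "user", "content": "Summarize report.pdf"}],
    allowed_tools={"file_write"},   # browser/shell masked for this step, not removed
)
print(ctx["logit_bias"])  # {'browser_open': -100.0, 'shell_run': -100.0, 'file_write': 0.0}
```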
🔗 Read more: https://manus.im/de/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
manus.im
Context Engineering for AI Agents: Lessons from Building Manus
This post shares the local optima Manus arrived at through our own "SGD". If you are building your own AI agent, we hope these principles help you converge faster.
[#AI #MachineLearning #LLM #AIInference #Hardware #Groq #LPU #AIAccelerators #DeepLearning #TechInnovation #ComputerArchitecture #AIHardware]
How Groq's LPU Achieves Blazing AI Inference Speed
Ever wondered how Groq runs a 1-trillion-parameter model like Kimi K2 in real-time? Their Language Processing Unit (LPU) is rewriting the rules of AI inference.
Key Innovations:
TruePoint Numerics – Strategic precision where it matters. 100 bits of intermediate accumulation enable 2-4× speedup over BF16 with zero accuracy loss. FP32 for critical operations, FP8 for error-tolerant layers.
SRAM-First Architecture – Hundreds of megabytes of on-chip SRAM as primary storage (not cache). Traditional GPUs suffer from HBM latency (hundreds of nanoseconds); LPU eliminates the wait with instant weight access.
Static Scheduling – The compiler pre-computes the entire execution graph down to individual clock cycles. No cache coherency protocols, no runtime delays. Deterministic execution enables tensor parallelism without tail latency.
Tensor Parallelism – Unlike GPUs that scale throughput via data parallelism, LPUs distribute single operations across chips to reduce latency. This is why trillion-parameter models generate tokens in real-time.
RealScale Interconnect – Plesiosynchronous chip-to-chip protocol aligns hundreds of LPUs to act as a single core. The compiler schedules both compute AND network timing.
The Results? First-gen LPU on 14nm process delivers 40× performance improvements. MMLU benchmarks show strong accuracy with no quality degradation.
Groq isn't optimizing around the edges—they rebuilt inference from the ground up for speed, scale, and efficiency.
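A tiny numerical illustration of the accumulation-width point above (plain NumPy, nothing Groq-specific; TruePoint's wide accumulators are a hardware feature, this just shows the failure mode they avoid):

```python
# Summing many small dot-product terms: a narrow float16 running sum stalls once
# each new term falls below half an ulp of the accumulator, a wide one does not.
import numpy as np

terms = np.full(5_000, 1e-4, dtype=np.float16)  # pretend these are partial products

acc16 = np.float16(0.0)
for t in terms:
    acc16 = np.float16(acc16 + t)               # narrow accumulator: stalls near 0.25

acc64 = terms.astype(np.float64).sum()          # wide accumulator

print("float16 accumulator:", float(acc16))     # ~0.25, half the value has vanished
print("float64 accumulator:", float(acc64))     # ~0.50
```

Scale that up to the millions of accumulations inside every matmul and the "cheap precision where it's safe, wide precision where it isn't" split stops being optional.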
🔗 Read the full technical breakdown: https://groq.com/blog/inside-the-lpu-deconstructing-groq-speed
Groq
Inside the LPU: Deconstructing Groq’s Speed
Discover how Groq's Language Processing Units (LPUs) achieve breakthrough AI inference speeds with 4 key architectural innovations: SRAM-centric design for instant weight access, statically scheduled networks for predictable performance, tensor parallelism…
[#ai #engineering]
I think it's important to watch this short video from a Netflix Staff engineer about how we use AI/LLMs and how society perceives that use. There are some lessons in it I wish more people could understand.
Thank you and I wish you all a great day!
https://youtu.be/eIoohUmYpGI?si=9A2q5kxelLZy7L5N
YouTube
"I shipped code I don't understand and I bet you have too" – Jake Nations, Netflix
In 1968, the term "Software Crisis" emerged when systems grew beyond what developers could manage. Every generation since has "solved" it with more powerful tools, only to create even bigger problems.
Today, AI accelerates the pattern into the Infinite…
👍2