Engineer Readings – Telegram
[debugging] Hash-Based Bisect Debugging in Compilers and Runtimes

https://research.swtch.com/bisect
Happy New Year everyone! 🎄
[ai chips] Nvidia's latest move in the AI hardware race: specialized chips for inference

Nvidia just announced the Rubin CPX - a GPU specifically optimized for the prefill phase of inference. This is fascinating because it challenges the "one chip fits all" approach we've seen dominating AI infrastructure.
The core insight: prefill (processing the prompt up to the first token) is compute-bound but barely touches memory bandwidth, while decode (generating each subsequent token) is the opposite: memory-bandwidth-bound with underutilized compute. Running both on the same high-end GPU with expensive HBM wastes resources.
Rubin CPX uses cheaper GDDR7 instead of HBM (cutting memory cost by 50%+), drops NVLink for simple PCIe, but maintains strong FP4 compute - 20 PFLOPS dense. It's designed to be drastically cheaper per unit while being better suited for its specific workload.
The competitive angle is brutal: AMD and others were just catching up with rack-scale designs, and now they need to develop specialized prefill chips too, pushing their roadmaps back another cycle.
This disaggregated approach (separate hardware for prefill/decode) hints at where inference infrastructure is heading - not just software optimization, but purpose-built silicon for different phases of the same task.
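A minimal sketch of what that disaggregation can look like at the serving layer, assuming a hypothetical router in front of two pools (the class and pool names here are illustrative, not Nvidia's or any vendor's API):

```python
# Illustrative sketch of disaggregated prefill/decode serving.
# Pool names and Request fields are hypothetical; real serving stacks
# have far more moving parts (KV-cache transfer, batching, scheduling).
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int        # long prompts -> heavy prefill compute
    max_new_tokens: int       # long generations -> heavy decode bandwidth

class Pool:
    def __init__(self, name: str):
        self.name = name
    def run(self, phase: str, req: Request):
        print(f"{self.name}: {phase} for {req}")

prefill_pool = Pool("cpx-like pool (cheap GDDR7, FLOP-rich)")
decode_pool  = Pool("hbm-gpu pool (high memory bandwidth)")

def serve(req: Request):
    # Phase 1: prefill is compute-bound -> route to the FLOP-rich pool.
    prefill_pool.run("prefill", req)
    # The KV cache produced by prefill is handed over to the decode pool.
    # Phase 2: decode is memory-bound -> route to the HBM-class pool.
    decode_pool.run("decode", req)

serve(Request(prompt_tokens=8000, max_new_tokens=256))
```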

https://newsletter.semianalysis.com/p/another-giant-leap-the-rubin-cpx-specialized-accelerator-rack
[#AI #LLM #MachineLearning #EdgeComputing]
Running LLMs with Just SQL? New Research Makes It Possible

Researchers dropped TranSQL+ - a wild approach to run Large Language Models using only SQL queries in a regular database. No GPUs, no fancy ML frameworks needed.
The idea: convert the entire LLM (e.g. Llama3-8B) into SQL queries that run in DuckDB or similar databases. Sounds crazy, but it actually works! (A rough sketch of the core trick is below.)
Results on a modest laptop (4 cores, 16GB RAM):

• 20× faster than DeepSpeed for the first token
• 4× faster for generating subsequent tokens
• Works entirely on CPU with limited memory

Why this matters:
• Runs on ANY device with a database (phones, laptops, IoT)
• No need to compile for different hardware
• Databases already know how to manage memory efficiently
Perfect for privacy-focused edge deployments where you can't rely on cloud GPUs.
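To make the idea concrete, here is a toy, hand-rolled version of the core trick (not the paper's actual schema): store a weight matrix as a (row, col, value) table and express a matrix-vector product as a join plus aggregation in DuckDB.

```python
# Toy illustration of "LLM layers as SQL": a matrix-vector product in DuckDB.
# Table names and schema are made up for this example; TranSQL+'s real
# representation covers whole transformer blocks, not just one matmul.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE weights(row INT, col INT, val DOUBLE)")
con.execute("CREATE TABLE activations(col INT, val DOUBLE)")

# A 2x3 weight matrix and a 3-element input vector.
con.executemany("INSERT INTO weights VALUES (?, ?, ?)",
                [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0),
                 (1, 0, 4.0), (1, 1, 5.0), (1, 2, 6.0)])
con.executemany("INSERT INTO activations VALUES (?, ?)",
                [(0, 1.0), (1, 1.0), (2, 1.0)])

# y[row] = sum over col of W[row, col] * x[col]  -- one join + group-by.
result = con.execute("""
    SELECT w.row, SUM(w.val * a.val) AS val
    FROM weights w JOIN activations a ON w.col = a.col
    GROUP BY w.row ORDER BY w.row
""").fetchall()
print(result)  # [(0, 6.0), (1, 15.0)]
```

The database's own operators handle memory management and spilling, which is exactly why this runs on modest hardware.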

Paper: https://arxiv.org/pdf/2502.02818
[#AI #AICoding #AIAgents #AutonomousAgents #MultiAgentSystems #Cursor]

Scaling AI Coding: Lessons From Running Hundreds of Agents

Cursor shares how they pushed the limits of AI by running hundreds of autonomous coding agents at the same time on real software projects.

Instead of short tasks, these agents worked for weeks, edited shared codebases, and even helped build complex products like a web browser.

The biggest lesson?
Uncoordinated agents create chaos — but a planner + worker system keeps them aligned, focused, and productive over long periods.
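A minimal sketch of the planner + worker split, with hypothetical plan_tasks() and do_task() functions standing in for LLM-backed agents (this is not Cursor's actual harness):

```python
# Toy planner/worker loop: one planner decomposes work, many workers execute.
# plan_tasks() and do_task() are placeholders for LLM-backed agents.
from concurrent.futures import ThreadPoolExecutor

def plan_tasks(goal: str) -> list[str]:
    # A real planner agent would read the repo and emit scoped, non-overlapping tasks.
    return [f"{goal}: subtask {i}" for i in range(4)]

def do_task(task: str) -> str:
    # A real worker agent would edit files, run tests, and report back.
    return f"done: {task}"

def run(goal: str) -> list[str]:
    tasks = plan_tasks(goal)                    # planner keeps the big picture
    with ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(do_task, tasks))  # workers run in parallel, each scoped
    # The planner reviews results and re-plans the next iteration if needed.
    return results

print(run("implement tab management in the browser"))
```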

The article shows that with the right structure, AI teams can tackle massive engineering challenges, similar to real human teams — and we’re just getting started.

🔗 Read more: https://cursor.com/blog/scaling-agents
[#AI #context #Manus]

Context Engineering for AI Agents: Lessons from Building Manus

The Manus team shares key insights from building their AI agent system, focusing on context engineering rather than training custom models. The article covers critical strategies like designing around KV-cache for better performance, using the filesystem as unlimited context storage, and keeping error traces to help agents learn from mistakes.
Key takeaways: maximize cache hit rates by keeping the prompt prefix stable and append-only, mask tool availability instead of removing tool definitions so the cached prefix stays valid, and lean on the filesystem as persistent memory beyond the token limit.
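A small sketch of the "stable prefix" rule, assuming a hypothetical context builder; the point is just that anything volatile (like a timestamp) placed early in the prompt invalidates the cached prefix for every token after it.

```python
# Why prompt stability matters for KV-cache hit rates (illustrative only).
# The cache is reusable only up to the first token that differs between calls.
import time

SYSTEM_PROMPT = "You are a coding agent. Tools: ..."  # keep byte-identical across calls

def build_context_bad(history: list[str]) -> str:
    # A timestamp at the top changes every call -> the whole prefix misses the cache.
    return f"Now: {time.time()}\n{SYSTEM_PROMPT}\n" + "\n".join(history)

def build_context_good(history: list[str]) -> str:
    # Stable prefix first, append-only history after it -> long shared cache prefix.
    return f"{SYSTEM_PROMPT}\n" + "\n".join(history)

history = ["user: fix the failing test", "tool: pytest output ..."]
print(build_context_good(history))
```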

🔗 Read more: https://manus.im/de/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
[#AI #MachineLearning #LLM #AIInference #Hardware #Groq #LPU #AIAccelerators #DeepLearning #TechInnovation #ComputerArchitecture #AIHardware]

How Groq's LPU Achieves Blazing AI Inference Speed

Ever wondered how Groq runs a 1-trillion-parameter model like Kimi K2 in real-time? Their Language Processing Unit (LPU) is rewriting the rules of AI inference.

Key Innovations:
TruePoint Numerics – Strategic precision where it matters. 100 bits of intermediate accumulation enable 2-4× speedup over BF16 with zero accuracy loss. FP32 for critical operations, FP8 for error-tolerant layers.
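A rough numerical illustration of the general principle (wide accumulators for low-precision products); the 100-bit format is Groq's own, so this just contrasts narrow vs wide accumulation in NumPy:

```python
# Why accumulation precision matters: sum many low-precision products.
# Generic mixed-precision illustration, not Groq's TruePoint format.
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.9, 1.1, 10_000).astype(np.float16)
b = rng.uniform(0.9, 1.1, 10_000).astype(np.float16)

prod = a * b                              # low-precision products (fp16) in both paths
acc_narrow = prod.sum(dtype=np.float16)   # accumulate in fp16: rounding piles up
acc_wide   = prod.astype(np.float64).sum()  # accumulate in fp64: reference value

print(f"narrow accumulator: {float(acc_narrow):.1f}")
print(f"wide accumulator:   {acc_wide:.1f}")  # only the accumulator width differs
```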

SRAM-First Architecture – Hundreds of megabytes of on-chip SRAM as primary storage (not cache). Traditional GPUs suffer from HBM latency (hundreds of nanoseconds); LPU eliminates the wait with instant weight access.

Static Scheduling – The compiler pre-computes the entire execution graph down to individual clock cycles. No cache coherency protocols, no runtime delays. Deterministic execution enables tensor parallelism without tail latency.

Tensor Parallelism – Unlike GPUs that scale throughput via data parallelism, LPUs distribute single operations across chips to reduce latency. This is why trillion-parameter models generate tokens in real-time.
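A toy illustration of the difference, using NumPy arrays in place of chips (purely conceptual, no Groq APIs involved): data parallelism gives each device its own batch, while tensor parallelism splits one matmul's weight columns across devices and concatenates the partial outputs.

```python
# Data parallelism vs tensor parallelism for a single linear layer (toy sketch).
# NumPy slices stand in for per-chip memory; this is not Groq's API.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 1024))   # weight matrix of one layer
x = rng.standard_normal((8, 512))      # a batch of 8 activations

# Data parallelism: each "chip" holds the FULL W and its own slice of the batch.
# Throughput scales, but a single token's latency is still bounded by one chip.
out_dp = np.concatenate([xi @ W for xi in np.split(x, 4)], axis=0)

# Tensor parallelism: each "chip" holds a COLUMN slice of W and computes a
# partial result for the SAME tokens, so per-token work (and latency) shrinks.
out_tp = np.concatenate([x @ Wi for Wi in np.split(W, 4, axis=1)], axis=1)

assert np.allclose(out_dp, x @ W) and np.allclose(out_tp, x @ W)
```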

RealScale Interconnect – Plesiochronous chip-to-chip protocol aligns hundreds of LPUs to act as a single core. The compiler schedules both compute AND network timing.

The Results? First-gen LPU on 14nm process delivers 40× performance improvements. MMLU benchmarks show strong accuracy with no quality degradation.
Groq isn't optimizing around the edges—they rebuilt inference from the ground up for speed, scale, and efficiency.

🔗 Read the full technical breakdown: https://groq.com/blog/inside-the-lpu-deconstructing-groq-speed