[nvidia][gpu][architecture]
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
https://t.co/4jLomyexEu
Aleksagordic
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels - Aleksa Gordić
From GPU architecture and PTX/SASS to warp-tiling and deep asynchronous tensor core pipelines.
🔥3
[oracle][ai][db]
Oracle released Oracle AI Database 26ai.
Notes:
https://www.oracle.com/news/announcement/ai-world-database-26ai-powers-the-ai-for-data-revolution-2025-10-14/
Oracle
Oracle AI Database 26ai Powers the AI for Data Revolution
Oracle AI Database 26ai architects AI into the core of data management, furthering Oracle’s commitment to help customers securely bring AI to all their data, everywhere.
[ai chips] Nvidia's latest move in the AI hardware race: specialized chips for inference
Nvidia just announced the Rubin CPX - a GPU specifically optimized for the prefill phase of inference. This is fascinating because it challenges the "one chip fits all" approach we've seen dominating AI infrastructure.
The core insight: prefill (processing the whole prompt to produce the first token) is compute-bound and barely touches memory bandwidth, while decode (generating each subsequent token) is the opposite - memory-bandwidth-bound with underutilized compute. Running both phases on the same high-end GPU with expensive HBM wastes resources.
Rubin CPX uses cheaper GDDR7 instead of HBM (cutting memory cost by 50%+), drops NVLink for simple PCIe, but maintains strong FP4 compute - 20 PFLOPS dense. It's designed to be drastically cheaper per unit while being better suited for its specific workload.
The competitive angle is brutal: AMD and others were just catching up with rack-scale designs, and now they need to develop specialized prefill chips too, pushing their roadmaps back another cycle.
This disaggregated approach (separate hardware for prefill/decode) hints at where inference infrastructure is heading - not just software optimization, but purpose-built silicon for different phases of the same task.
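A rough roofline-style sketch of that prefill/decode asymmetry (a back-of-the-envelope model; the FLOP and bandwidth numbers below are illustrative assumptions, not Rubin CPX specs):

```python
# Per forward pass, a dense model with P parameters does ~2*P FLOPs per token,
# but streams its P weights from memory only once. So arithmetic intensity
# (FLOPs per byte of weights read) scales with how many tokens share that pass.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 1.0) -> float:
    return 2 * tokens_per_pass / bytes_per_param

# Hypothetical accelerator: 20e15 FLOP/s of compute, 2e12 B/s of memory bandwidth.
machine_balance = 20e15 / 2e12  # need ~10,000 FLOPs per byte to stay compute-bound

for phase, tokens in [("prefill", 8192), ("decode", 1)]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > machine_balance else "memory-bandwidth-bound"
    print(f"{phase:7s}: {ai:>7.0f} FLOPs/byte vs balance {machine_balance:.0f} -> {bound}")
```

Prefill amortizes each weight fetch over thousands of prompt tokens, so cheaper GDDR7 bandwidth is enough; decode amortizes it over a single token, which is exactly the workload you'd leave on HBM parts.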
https://newsletter.semianalysis.com/p/another-giant-leap-the-rubin-cpx-specialized-accelerator-rack
Semianalysis
Another Giant Leap: The Rubin CPX Specialized Accelerator & Rack
New Prefill Specialized GPU, Rack Architecture, BOM, Disaggregated PD, Higher Perf per TCO, Lower TCO, GDDR7 & HBM Market Trends
[#AI #LLM #MachineLearning #EdgeComputing]
Running LLMs with Just SQL? New Research Makes It Possible
Researchers dropped TranSQL+ - a wild approach to run Large Language Models using only SQL queries in a regular database. No GPUs, no fancy ML frameworks needed.
The idea: Convert the entire LLM (like Llama3-8B) into SQL queries that run in DuckDB or similar databases. Sounds crazy, but it actually works!
Results on a modest laptop (4 cores, 16GB RAM):
• 20× faster than DeepSpeed for the first token
• 4× faster for generating subsequent tokens
• Works entirely on CPU with limited memory
Why this matters:
• Runs on ANY device with a database (phones, laptops, IoT)
• No need to compile for different hardware
• Databases already know how to manage memory efficiently
Perfect for privacy-focused edge deployments where you can't rely on cloud GPUs.
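To get a feel for the trick, here's a minimal sketch (a hypothetical schema, not the paper's actual TranSQL+ encoding) of one linear layer y = W·x written as a join + aggregation in DuckDB:

```python
# One matrix-vector product expressed relationally and run through DuckDB.
# Table names and layout are made up for illustration; TranSQL+ generates its
# own encoding for full transformer layers (embeddings, attention, MLP, etc.).
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE weights(row_id INT, col_id INT, val DOUBLE)")  # W as (row, col, value)
con.execute("CREATE TABLE x(col_id INT, val DOUBLE)")                    # input vector
con.execute("INSERT INTO weights VALUES (0,0,0.5),(0,1,-1.0),(1,0,2.0),(1,1,0.25)")
con.execute("INSERT INTO x VALUES (0,3.0),(1,4.0)")

# y[row] = SUM over col of W[row, col] * x[col]
rows = con.execute("""
    SELECT w.row_id, SUM(w.val * x.val) AS y
    FROM weights w JOIN x USING (col_id)
    GROUP BY w.row_id
    ORDER BY w.row_id
""").fetchall()
print(rows)  # [(0, -2.5), (1, 7.0)]
```

The database's own vectorized execution, spilling, and buffer management then do the work an ML runtime would normally have to do.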
Paper: https://arxiv.org/pdf/2502.02818
🔥2🤯2
[#AI #AICoding #AIAgents #AutonomousAgents #MultiAgentSystems #Cursor]
Scaling AI Coding: Lessons From Running Hundreds of Agents
Cursor shares how they pushed the limits of AI by running hundreds of autonomous coding agents at the same time on real software projects.
Instead of short tasks, these agents worked for weeks, edited shared codebases, and even helped build complex products like a web browser.
The biggest lesson?
Uncoordinated agents create chaos — but a planner + worker system keeps them aligned, focused, and productive over long periods.
The article shows that with the right structure, AI teams can tackle massive engineering challenges, similar to real human teams — and we’re just getting started.
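A toy sketch of that planner + worker split (hypothetical structure, not Cursor's actual system): one planner owns the task list and workers only claim work from it, which is what keeps many agents from editing toward conflicting plans.

```python
# Toy planner/worker loop; the coordination structure is the point, the "work" is stubbed.
import queue
import threading
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    done: bool = False

class Planner:
    """Owns the global plan. Workers never invent their own tasks."""
    def __init__(self, goal: str):
        self.goal = goal
        self.tasks: queue.Queue = queue.Queue()

    def plan(self) -> None:
        # A real planner would have an LLM decompose the goal; hard-coded here.
        for step in ("scaffold module", "implement feature", "write tests"):
            self.tasks.put(Task(f"{self.goal}: {step}"))

def worker(name: str, planner: Planner) -> None:
    while True:
        try:
            task = planner.tasks.get_nowait()   # claim work instead of inventing it
        except queue.Empty:
            return
        # A real worker would call an LLM, edit code, run tests, open a PR.
        task.done = True
        print(f"{name} finished: {task.description}")

planner = Planner("add dark mode")
planner.plan()
threads = [threading.Thread(target=worker, args=(f"agent-{i}", planner)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```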
🔗 Read more: https://cursor.com/blog/scaling-agents
Cursor
Scaling long-running autonomous coding
We've been experimenting with running coding agents autonomously for weeks at a time.
🔥2
[#AI #context #Manus]
Context Engineering for AI Agents: Lessons from Building Manus
The Manus team shares key insights from building their AI agent system, focusing on context engineering rather than training custom models. The article covers critical strategies like designing around KV-cache for better performance, using the filesystem as unlimited context storage, and keeping error traces to help agents learn from mistakes.
Key takeaways: maximize cache hit rates by keeping prompts stable, mask tools instead of removing them to maintain context integrity, and leverage the filesystem for persistent memory beyond token limits.
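A minimal sketch of two of those ideas, a byte-stable prompt prefix for KV-cache hits and masking tools instead of deleting them (hypothetical helper and field names, not Manus's code; real APIs bias token ids rather than tool names):

```python
# Sketch of KV-cache-friendly context building plus tool masking.
SYSTEM_PROMPT = "You are a general-purpose agent."        # never edited mid-session
TOOL_DEFS = ["browser_open", "shell_run", "file_write"]   # never removed mid-session

def build_context(history: list[dict], allowed_tools: set[str]) -> dict:
    # 1) Cache hits: the prefix (system prompt + full tool list) is byte-identical
    #    on every call, and history is append-only, so the KV-cache keeps hitting.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history

    # 2) Tool masking: rather than dropping tool definitions (which would change
    #    the prefix and invalidate the cache), bias decoding away from banned tools.
    logit_bias = {t: (0.0 if t in allowed_tools else -100.0) for t in TOOL_DEFS}

    return {"messages": messages, "tools": TOOL_DEFS, "logit_bias": logit_bias}

ctx = build_context(
    history=[{"role": "user", "content": "Summarize report.pdf"}],
    allowed_tools={"file_write"},   # browser/shell masked for this step, not removed
)
print(ctx["logit_bias"])  # {'browser_open': -100.0, 'shell_run': -100.0, 'file_write': 0.0}
```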
🔗 Read more: https://manus.im/de/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
manus.im
Context Engineering for AI Agents: Lessons from Building Manus
This post shares the local optima Manus arrived at through our own "SGD". If you are building your own AI agent, we hope these principles help you converge faster.
[#AI #MachineLearning #LLM #AIInference #Hardware #Groq #LPU #AIAccelerators #DeepLearning #TechInnovation #ComputerArchitecture #AIHardware]
How Groq's LPU Achieves Blazing AI Inference Speed
Ever wondered how Groq runs a 1-trillion-parameter model like Kimi K2 in real-time? Their Language Processing Unit (LPU) is rewriting the rules of AI inference.
Key Innovations:
TruePoint Numerics – Strategic precision where it matters. 100 bits of intermediate accumulation enable 2-4× speedup over BF16 with zero accuracy loss. FP32 for critical operations, FP8 for error-tolerant layers.
SRAM-First Architecture – Hundreds of megabytes of on-chip SRAM as primary storage (not cache). Traditional GPUs suffer from HBM latency (hundreds of nanoseconds); LPU eliminates the wait with instant weight access.
Static Scheduling – The compiler pre-computes the entire execution graph down to individual clock cycles. No cache coherency protocols, no runtime delays. Deterministic execution enables tensor parallelism without tail latency.
Tensor Parallelism – Unlike GPUs that scale throughput via data parallelism, LPUs distribute single operations across chips to reduce latency. This is why trillion-parameter models generate tokens in real-time.
RealScale Interconnect – Plesiosynchronous chip-to-chip protocol aligns hundreds of LPUs to act as a single core. The compiler schedules both compute AND network timing.
The Results? First-gen LPU on 14nm process delivers 40× performance improvements. MMLU benchmarks show strong accuracy with no quality degradation.
Groq isn't optimizing around the edges—they rebuilt inference from the ground up for speed, scale, and efficiency.
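A tiny numerical illustration of the accumulation-width point above (plain NumPy, nothing Groq-specific; TruePoint's wide accumulators are a hardware feature, this just shows the failure mode they avoid):

```python
# Summing many small dot-product terms: a narrow float16 running sum stalls once
# each new term falls below half an ulp of the accumulator, a wide one does not.
import numpy as np

terms = np.full(5_000, 1e-4, dtype=np.float16)  # pretend these are partial products

acc16 = np.float16(0.0)
for t in terms:
    acc16 = np.float16(acc16 + t)               # narrow accumulator: stalls near 0.25

acc64 = terms.astype(np.float64).sum()          # wide accumulator

print("float16 accumulator:", float(acc16))     # ~0.25, half the value has vanished
print("float64 accumulator:", float(acc64))     # ~0.50
```

Scale that up to the millions of accumulations inside every matmul and the "cheap precision where it's safe, wide precision where it isn't" split stops being optional.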
🔗 Read the full technical breakdown: https://groq.com/blog/inside-the-lpu-deconstructing-groq-speed
Groq
Inside the LPU: Deconstructing Groq’s Speed
Discover how Groq's Language Processing Units (LPUs) achieve breakthrough AI inference speeds with 4 key architectural innovations: SRAM-centric design for instant weight access, statically scheduled networks for predictable performance, tensor parallelism…
[#ai #engineering]
I think it's important to watch this short video from a Netflix Staff engineer about how we use AI/LLMs and how society perceives that use. There are some lessons in it I wish more people could understand.
Thank you and I wish you all a great day!
https://youtu.be/eIoohUmYpGI?si=9A2q5kxelLZy7L5N
YouTube
"I shipped code I don't understand and I bet you have too" – Jake Nations, Netflix
In 1968, the term "Software Crisis" emerged when systems grew beyond what developers could manage. Every generation since has "solved" it with more powerful tools, only to create even bigger problems.
Today, AI accelerates the pattern into the Infinite…
👍2