This article, “Making your system observability predictable,” outlines practical techniques for evolving from scattered logs to coherent observability across services. It argues that a signals-first mindset matters more than bolting on dashboards late in the game.
https://www.architecture-weekly.com/p/making-your-system-observability
Architecture-Weekly
Making your system observability predictable
Everyone claims that observability is key to production readiness. Yet most of us just add auto-instrumentation right before going to production and call it a day. That's fine, but not enough. Inspired by Martin Thwaites' take, I showed how to add predictable…
❤3
This blog post by Yandex SRE Dmitry Ziablov recounts a late-night incident in which a seemingly harmless retry loop caused a production outage. He dissects the cascade of failures and offers a framework for spotting bad retry patterns before they bite.
https://medium.com/yandex/good-retry-bad-retry-an-incident-story-648072d3cee6
Medium
Good Retry, Bad Retry: An Incident Story
Sometimes, a seemingly simple and obvious solution can lead to a series of problems later on. This is especially true when adding retries.
💩2
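A minimal sketch of one widely used "good retry" shape: a bounded number of attempts with capped exponential backoff and full jitter. This is an illustration only, not the framework from the linked story. The names `backoff_delays` and `call_with_retries`, the default delays, and the choice of `ConnectionError` as the retryable error are all assumptions made for this example.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield the sleep before each retry: capped exponential backoff with full jitter."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay in [0, ceiling) so clients do not retry in lockstep.
        yield rng() * ceiling


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Run operation(), retrying on ConnectionError for at most max_attempts total tries."""
    delays = backoff_delays(max_attempts, **backoff)
    for _ in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the failure instead of looping forever
            sleep(delay)
```

The hard cap on attempts and the jitter are the two parts that keep a retry loop from amplifying load on a dependency that is already struggling. Passing `sleep` as a parameter is only there to make the function easy to test.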
The piece argues that traces beat metrics when you need to pinpoint latency spikes and hidden dependencies. It walks through three concrete debugging scenarios that show why span data can surface root causes in seconds.
https://jaywhy13.hashnode.dev/3-reasons-traces-better-than-metrics-for-debugging-your-application
❤1👍1
In Slack’s detailed write-up, engineers share how the Unified Grid architecture split a monolithic workspace into isolated “cells” to serve enterprises with hundreds of thousands of users. The narrative dives into sharding strategy, migration challenges, and the performance wins that followed.
https://slack.engineering/unified-grid-how-we-re-architected-slack-for-our-largest-customers/
slack.engineering
Unified Grid: How We Re-Architected Slack for Our Largest Customers
All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions—which…
❤1
This post explains how Sharyash Agrawal tamed CPU throttling in Go services running under Kubernetes limits. From choosing the right GC knob to tuning Go’s runtime scheduler, the guide helps teams avoid sudden latency spikes.
https://medium.com/@sharyash81/solving-cpu-throttling-issue-in-golang-applications-before-hitting-the-cpu-limit-in-kubernetes-7d8f40da6477
Medium
Solving CPU throttling issue in Golang applications before hitting the CPU limit in Kubernetes.
We faced an issue within our Kubernetes cluster wherein certain multi-threaded Golang applications, for which CPU limit has been set, are…
👍3❤1
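A rough sketch of the arithmetic behind CPU limits, written in Python to match the other examples added to this digest. Under cgroup v2, a container's limit appears in a `cpu.max` file as `<quota> <period>` (or `max <period>` when unlimited), and quota divided by period gives the number of CPUs the container may use. A common mitigation for throttling in multi-threaded runtimes is to size the worker pool from that number rather than from the host's core count; whether the linked article takes exactly this route is an assumption. The function names below are hypothetical.

```python
import math


def cpu_limit_from_cgroup_v2(cpu_max_text):
    """Parse the contents of a cgroup v2 cpu.max file.

    Returns the number of CPUs allowed (possibly fractional), or None if unlimited.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def suggested_worker_count(cpu_max_text, host_cpus):
    """Pick a thread/worker count that does not exceed the container's CPU limit."""
    limit = cpu_limit_from_cgroup_v2(cpu_max_text)
    if limit is None:
        return host_cpus
    # Rounding up is a judgment call; rounding down is the more conservative choice.
    return max(1, min(host_cpus, math.ceil(limit)))
```

For example, a limit of two CPUs on a 64-core node is written as `200000 100000`; a runtime that starts 64 busy threads there can burn through its quota early in each period and then sit throttled until the next one.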
The essay walks through a hands-on pipeline that signs Kubernetes container images with Cosign, enforces them with Kyverno, and stores keys in HashiCorp Vault—all wired together in GitLab CI. You’ll leave with a reproducible template for securing your software supply chain.
https://angapov.medium.com/kubernetes-container-images-signing-using-cosign-kyverno-hashicorp-vault-and-gitlab-ci-c4e2041d1310
Medium
Kubernetes container images signing using Cosign, Kyverno, HashiCorp Vault and GitLab CI
Container images are the crucial part of the applications running in Kubernetes. But how can we make sure that the container images that we…
👍4❤2
A batteries-included Python client library for Kubernetes that feels familiar for folks who already know how to use kubectl
https://github.com/kr8s-org/kr8s
GitHub
GitHub - kr8s-org/kr8s: A batteries-included Python client library for Kubernetes that feels familiar for folks who already know…
👍5
KubeBlocks is open-source control plane software that runs and manages databases, message queues, and other stateful applications on K8s.
https://github.com/apecloud/kubeblocks
GitHub
GitHub - apecloud/kubeblocks: KubeBlocks is a Kubernetes Operator designed to manage a variety of databases and streaming systems…
KubeBlocks is a Kubernetes Operator designed to manage a variety of databases and streaming systems, including MySQL, PostgreSQL, MongoDB, Redis, RabbitMQ, RocketMQ, and more, within Kubernetes env...
This article explains how containers inside a single Kubernetes pod communicate, covering the mechanisms available in their shared environment and the considerations that come with each.
https://medium.com/@sumuduliyan/container-communication-inside-a-kubernetes-pod-a5e84d607ef2
Medium
Container Communication Inside a Kubernetes Pod
A Pod is the smallest unit in Kubernetes. Every pod in a Kubernetes cluster is assigned a unique IP address, which is used for communication…
❤3
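Because containers in the same pod share one network namespace (and therefore the pod's single IP), one container can reach another on `127.0.0.1` as long as their ports do not collide. The sketch below imitates that on one machine: two threads stand in for two containers, which is an assumption made so the example runs without a cluster. The function names are hypothetical.

```python
import socket
import threading


def serve_once(server_sock):
    """Stand-in for one container: accept a single connection and echo upper-cased bytes."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024).upper())


def talk_over_localhost(message):
    """Stand-in for a second container in the same pod calling the first via 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server_sock,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)
        worker.join()
        return reply
```

In a real pod the "server" side would be a separate container (for example a sidecar) listening on a fixed, agreed port, and the caller would connect to `localhost:<that port>` in the same way.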
This blog post covers the challenges of running stateful applications in Kubernetes and how Operators help manage persistent data and complex deployments.
https://blog.palark.com/stateful-in-kubernetes-and-operators/
Palark
Stateful apps in Kubernetes. From history and fundamentals to operators | Tech blog | Palark
Learn what you should consider before running stateful components apps in Kubernetes, how these apps work in K8s, and which operators we use for ClickHouse, Redis, Kafka, PostgreSQL, and MySQL.
👍6
Forwarded from Python notes
This piece provides a guide to building a Retrieval-Augmented Generation (RAG) system using Anthropic's Claude, PostgreSQL, and Python on AWS. The tutorial walks through setting up the necessary PostgreSQL extensions and using Amazon Bedrock to create an application that generates more accurate AI responses.
https://www.tigerdata.com/blog/building-a-rag-system-with-claude-postgresql-python-on-aws
Tiger Data Blog
Building a RAG System With Claude, PostgreSQL & Python on AWS
A walkthrough of building a RAG system using Anthropic Claude and PostgreSQL on Amazon Bedrock to make your AI app responses more accurate and context-aware.
👍2
This article shows how to build a serverless ACID-compliant database, covering the techniques and trade-offs involved in achieving transactional consistency in a serverless environment.
https://notes.eatonphil.com/2024-09-29-build-a-serverless-acid-database-with-this-one-neat-trick.html
JET Pilot is an open-source Kubernetes desktop client that focuses on less clutter, speed and good looks.
https://github.com/unxsist/jet-pilot
GitHub
GitHub - unxsist/jet-pilot: JET Pilot is an open-source Kubernetes desktop client that focuses on less clutter, speed and good…
🔥3
This blog post describes the strategies and techniques Cloudflare uses to improve the resilience of its platform and keep its global network highly available.
https://blog.cloudflare.com/nl-nl/improving-platform-resilience-at-cloudflare/
❤4
Bare metal host provisioning integration for Kubernetes
https://github.com/metal3-io/baremetal-operator
GitHub
GitHub - metal3-io/baremetal-operator: Bare metal host provisioning integration for Kubernetes
👍3
Discover what platform engineering meant for the SREs at Adidas in this insightful report. It examines the cultural and technical shifts that occurred within the organization.
https://thenewstack.io/what-platform-engineering-meant-for-adidass-sres/
The New Stack
What Platform Engineering Meant for Adidas’s SREs
Moving from monolithic to microservices architecture demands platform engineering and observability, but brought new challenges to Adidas’s site reliability engineering team.
❤2👍2🔥1
KubeSnapIt – A PowerShell tool for managing Kubernetes snapshots, restorations, and comparisons with ease. Capture snapshots of your Kubernetes resources, restore them when needed, and compare snapshots or live cluster states to track changes over time.
https://github.com/KubeDeckio/KubeSnapIt
GitHub
GitHub - KubeDeckio/KubeSnapIt: KubeSnapIt – A PowerShell tool for managing Kubernetes snapshots, restorations, and comparisons…
Examine the concept of implicit Service Level Objectives (SLOs) and the potential risks they pose to system reliability and performance. This article highlights the importance of defining explicit SLOs for better service management.
https://blog.relyabilit.ie/implicit-slos-and-their-dangers/
RelyAbility Blog
Implicit SLOs and their dangers
This is a topic of intermediate complexity in SLOs. If you are coming to this cold, we recommend you read a few other pieces about SLOs first, then this will make a fair bit more sense to you.
SLOs, as you may know, have a dual nature: they have both