DevOps & SRE notes
How did you start your morning? Cloudflare decided that you’d had too much of the internet.
A change made to how Cloudflare's Web Application Firewall parses requests caused Cloudflare's network to be unavailable for several minutes this morning. This was not an attack; the change was deployed by our team to help mitigate the industry-wide vulnerability disclosed this week in React Server Components. We will share more information as we have it today.
https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q
Cloudflarestatus
Cloudflare Service Issues
Cloudflare's Status Page - Cloudflare Service Issues.
👍4
Cloudflare has had two major outages in less than 30 days. Are big tech companies broken? Can’t they set a good example anymore? Or is it just that shit happens?
Final Results
35%
Yes, everything is broken. They’re no longer a good example of solid engineering practices
65%
No, it’s fine. Shit happens, come on.
👍3💯1
Will Sulzer's report details the process of deploying self-hosted GitHub Action Runners on Google Kubernetes Engine (GKE) using a rootless Docker-in-Docker setup. The instructions focus on achieving this with minimal privileges for enhanced security.
https://medium.com/google-cloud/github-action-runners-on-gke-with-dind-rootless-bd54e23516c9
Medium
Deploying GitHub Action Runners on GKE with dind-rootless
TLDR: This article describes the steps to configure and deploy self-hosted GitHub Action Runners using docker:dind-rootless to Google…
👍2🔥2
This analysis explores how eBPF (extended Berkeley Packet Filter) can be used to gain insights into real-time SSL/TLS encrypted traffic. The author, TJ. Podobnik, discusses how this technology allows for monitoring without compromising security.
https://medium.com/all-things-ebpf/what-insights-can-ebpf-provide-into-real-time-ssl-tls-encrypted-traffic-and-how-435c8ad33efc
Medium
What Insights Can eBPF Provide into Real-Time SSL/TLS Encrypted Traffic and How?
Anteon: Monitoring Encrypted Traffic on Kubernetes using Alaz eBPF Agent
👍5
This post by Brian Chambers reflects on the lessons learned from launching an edge compute platform at Chick-fil-A. It discusses the challenges and successes of developing and scaling the platform from within the Enterprise Architecture team.
https://medium.com/chick-fil-atech/what-we-learned-from-launching-edge-compute-from-enterprise-architecture-1dc34e49482f
Medium
What We Learned from Launching Edge Compute from Enterprise Architecture
by Brian Chambers
👍1
Intelligence for Kubernetes. World's most promising Kubernetes Visualization Tool for Developer and Platform Engineering teams.
https://github.com/KusionStack/karpor
❤5
Mark Tinderholt's post demonstrates how to use Terraform's testing features, from examples to assertions, in real-world scenarios. It walks through setting up a test structure that gets the most out of these tools.
https://www.marktinderholt.com/infrastructure-as-code/terraform/azure/cloud/2024/10/30/test-vwan.html
Mark Tinderholt’s Blog.
Terraform Testing Unleashed: From Examples to Assertions in Real-World Scenarios
Terraform’s test command, introduced in version 1.6.0, opens up the possibility to integrate testing directly within Terraform configurations. This feature allows module developers to verify functionality, ensure compatibility, and simulate different scenarios…
👍3
GitHub
GitHub - hashicorp/terraform-cdk: Define infrastructure resources using programming constructs and provision them using HashiCorp…
Define infrastructure resources using programming constructs and provision them using HashiCorp Terraform - hashicorp/terraform-cdk
❤1😢1
This article discusses the importance of the "what went well" section in incident write-ups, arguing that it's more than just a morale booster. Lorin Hochstein suggests that detailing successful improvisations and diagnostic work can be a powerful learning tool for future incident responders.
https://surfingcomplexity.blog/2025/06/14/what-went-well-is-more-than-just-a-pat-on-the-back/
Surfing Complexity
“What went well” is more than just a pat on the back
When writing up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it, how the responders brought the overloaded s…
👍3
Forwarded from DevOps & SRE notes (tutunak)
Looking for a hosting platform to practice with Linux, Kubernetes, etc.? Register using my referral link on DigitalOcean and get $200 in credit for 60 days. By registering through my referral link, you also support this Telegram channel.
👉 Register
🔥4❤3👍3👏1
This piece, "The MTTI Manifesto," argues for the importance of a new metric in incident response: Mean Time to Isolate. The author contends that the majority of outage time is spent identifying the problem's source, not fixing it, and that focusing on MTTI can drive significant improvements in system architecture and observability.
https://www.oldschoolburke.com/the-mtti-manifesto/
Old School Burke
012: The MTTI Manifesto
Mean Time to Isolate
👍5
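To make the MTTI idea above concrete, here is a minimal sketch of how the metric could be computed from incident timestamps. It assumes MTTI is measured from detection to the moment the faulty component is isolated; the field names (detected_at, isolated_at, resolved_at) and the two sample incidents are hypothetical and do not come from the article.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records; timestamps are invented for illustration.
INCIDENTS = [
    {
        "detected_at": datetime(2025, 1, 10, 10, 0),
        "isolated_at": datetime(2025, 1, 10, 10, 40),
        "resolved_at": datetime(2025, 1, 10, 10, 50),
    },
    {
        "detected_at": datetime(2025, 1, 17, 14, 0),
        "isolated_at": datetime(2025, 1, 17, 14, 20),
        "resolved_at": datetime(2025, 1, 17, 14, 30),
    },
]


def mean_time_to_isolate(incidents):
    """Average time between detecting an incident and isolating its source."""
    seconds = [
        (inc["isolated_at"] - inc["detected_at"]).total_seconds()
        for inc in incidents
    ]
    return timedelta(seconds=mean(seconds))


def isolation_share(incidents):
    """Fraction of total outage time that was spent on isolation."""
    isolating = sum(
        (inc["isolated_at"] - inc["detected_at"]).total_seconds()
        for inc in incidents
    )
    total = sum(
        (inc["resolved_at"] - inc["detected_at"]).total_seconds()
        for inc in incidents
    )
    return isolating / total


if __name__ == "__main__":
    print("MTTI:", mean_time_to_isolate(INCIDENTS))
    print("Share of outage spent isolating:", isolation_share(INCIDENTS))
```

Tracking the isolation share next to MTTI shows whether the time really goes to finding the source rather than fixing it, which is the claim the manifesto makes.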
AWSDoor is a red team automation tool designed to simulate advanced attacker behavior in AWS environments.
https://github.com/OtterHacker/AWSDoor
❤2
This write-up explores the emerging discipline of AI Reliability Engineering (AIRe) as the "Third Age of SRE." It argues that the unique challenges of AI workloads, such as their probabilistic nature and new failure modes like model decay, require an evolution of traditional Site Reliability Engineering principles.
https://thenewstack.io/ai-reliability-engineering-welcome-to-the-third-age-of-sre/
The New Stack
AI Reliability Engineering: Welcome to the Third Age of SRE
SREs must build AI we can trust, leveraging the emerging ecosystem of tools and standards.
This post offers a detailed walkthrough for backend engineers on creating a Kubernetes Operator using Go and Kubebuilder. The author, Amr Elhewy, simplifies complex DevOps concepts by building a practical "PodTracker" operator that sends Slack notifications for new pod creations.
https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-making-a-kubernetes-operator-with-go
🔥3
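The article builds its operator in Go with Kubebuilder. The sketch below is not that code. It is a plain-Python illustration of the core idea only: compare the pods already seen with the current list, and build a notification for each new one. The function names and message wording are hypothetical. The payload shape assumes a Slack incoming webhook, which accepts a JSON body with a "text" field.

```python
import json


def new_pods(seen, current):
    """Return pod names present in `current` but not in `seen`, sorted."""
    return sorted(set(current) - set(seen))


def slack_payload(pod_name, namespace):
    """Build the JSON body for a Slack incoming-webhook message."""
    # The message wording is an assumption made for this illustration.
    return json.dumps({"text": f"New pod created: {namespace}/{pod_name}"})


if __name__ == "__main__":
    seen = {"api-1"}
    current = ["api-1", "api-2", "worker-1"]
    for name in new_pods(seen, current):
        # A real operator would POST this body to the webhook URL.
        print(slack_payload(name, "default"))
```

A real operator gets the "current" side from watch events delivered to its reconcile loop rather than from polling a list, so treat the set difference as a stand-in for that mechanism.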