Did you have to leetcode to get your DevOps role and was it worth it (i.e. financially)?
I have never had to leetcode for my DevOps jobs in the past 10 years. However, none of what I’ve ever done has been more than 30% scripting/coding. I have learnt TypeScript and Go just to stay competitive, but no one ever tested me on it. That said, I’m working in a LCOL region of the US and I’m in the top percentile of this region. It’s not bad. I get envious of the FAANG income-earners from time to time, but I largely can’t complain. Anybody else see benefits from learning leetcode for this field in particular?
https://redd.it/1ohk7dn
@r_devops
Monitoring Jenkins Nodes with Datadog
Hi Community,
We have a Jenkins controller connected to multiple build nodes.
I’d like to monitor the health and performance of these nodes using Datadog.
I’ve explored the available Jenkins metrics and events, but haven’t been able to find a clear way to capture node-level metrics (such as connectivity, availability, or job execution health) through Datadog.
Has anyone implemented Datadog monitoring for Jenkins nodes successfully?
If so, could you please share how you achieved it or point me toward relevant configuration steps or documentation?
Appreciate any guidance or best practices you can provide!
Thanks,
https://redd.it/1ohl2v1
@r_devops
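For the node-level piece asked about above, one low-lift pattern (separate from the official Datadog Jenkins plugin/integration) is to poll Jenkins' computer API yourself and ship per-node gauges to the Datadog Agent. A minimal sketch, assuming an unauthenticated Jenkins URL and a hypothetical metric name — not Datadog-documented configuration:

```python
# Sketch: poll Jenkins' node list and derive a per-node online/offline gauge.
# JENKINS_URL and the metric name are assumptions for illustration.
import json
from urllib import request

JENKINS_URL = "https://jenkins.example.com"  # hypothetical

def parse_node_health(payload: dict) -> dict:
    """Map each node name to 1 (online) or 0 (offline)."""
    return {
        node["displayName"]: 0 if node.get("offline") else 1
        for node in payload.get("computer", [])
    }

def fetch_nodes() -> dict:
    # Jenkins' computer API exposes connectivity state per node as JSON.
    with request.urlopen(f"{JENKINS_URL}/computer/api/json") as resp:
        return json.load(resp)

def report(health: dict) -> None:
    # With the datadog package installed, each value could be shipped as e.g.
    #   statsd.gauge("jenkins.node.online", up, tags=[f"node:{name}"])
    for name, up in health.items():
        print(name, up)
```

Run on a schedule (cron or a sidecar), this gives you the connectivity/availability signal per node; job execution health would still come from the Jenkins plugin's events.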
AWS App Runner - impossible to deploy with - how do you use it??
Trying to develop on App Runner (CDK, Python, etc.) with a React/Next.js web app, a Node server, and Docker.
I keep running into "An error occurred (InvalidRequestException) when calling the StartDeployment operation: Can't start a deployment on the specified service, because it isn't in RUNNING state."
You would think you could just cancel the deployment, but the option is fully greyed out - I can't do anything, and it's just hanging with very limited logging.
how do you properly develop on this thing?
https://redd.it/1ohnrse
@r_devops
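The InvalidRequestException above is App Runner refusing StartDeployment while the service is mid-operation. A common workaround is to poll `describe_service` and only deploy once the status settles on RUNNING. A sketch using boto3 — the ARN, timeout, and poll interval are assumptions:

```python
# Sketch: App Runner rejects StartDeployment unless the service is RUNNING,
# so wait for the status to settle before deploying.
import time

DEPLOYABLE = {"RUNNING"}
TERMINAL_FAILURES = {"CREATE_FAILED", "DELETE_FAILED"}

def can_start_deployment(status: str) -> bool:
    """StartDeployment only succeeds when the service is in RUNNING state."""
    return status in DEPLOYABLE

def wait_until_settled(service_arn: str, timeout_s: int = 600) -> str:
    import boto3  # imported here so the pure helper above has no SDK dependency
    client = boto3.client("apprunner")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = client.describe_service(ServiceArn=service_arn)["Service"]["Status"]
        if can_start_deployment(status) or status in TERMINAL_FAILURES:
            return status
        time.sleep(15)  # e.g. OPERATION_IN_PROGRESS: keep waiting
    raise TimeoutError(f"service never left a non-deployable state: {service_arn}")
```

Wiring this into the deploy script (only calling StartDeployment when the wait returns RUNNING) avoids the console's greyed-out limbo entirely.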
Outage Tracker | Updog By Datadog
https://updog.ai/ - found a new(?) thing from them. Nothing groundbreaking, but I like the feel of it better than the alternatives.
https://redd.it/1ohsixj
@r_devops
Playwright tests failing on Windows but fine on macOS
Running the same Playwright suite locally on macOS and CI on Windows runners - works perfectly on Mac, randomly fails on Windows. Tried disabling video recording and headless mode, no luck. Anyone else seen platform-specific instability like this?
https://redd.it/1ohrw95
@r_devops
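Not a fix for the root cause, but a common stopgap for this kind of platform-specific flakiness is to widen timeouts and allow retries only on the slow platform, instead of tweaking each flaky test. A minimal sketch of the idea — the multipliers are assumptions, not recommended values:

```python
# Sketch: derive platform-conditional test settings (e.g. to feed into a
# Playwright config) so Windows CI gets more headroom than local macOS.
import sys

def platform_settings(platform: str = sys.platform) -> dict:
    base_timeout_ms = 30_000
    if platform.startswith("win"):
        # Windows runners are often slower and schedule I/O differently,
        # so give each action more headroom and allow retries there.
        return {"timeout_ms": base_timeout_ms * 2, "retries": 2}
    return {"timeout_ms": base_timeout_ms, "retries": 0}
```

If failures persist even with generous timeouts, the usual suspects are path separators, file locking, and line endings rather than timing.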
Self-hosting mysql on a Hetzner server
With all the managed databases out there, it's an 'easy' choice to go managed, as we did years ago. We're currently paying 130 for 8 GB RAM and 4 vCPUs, but I was wondering how hard it would actually be to self-host this MySQL DB on a Hetzner server. The DB is mainly used by 8-9 integration/middleware applications, so there is constant throughput, but no application data (passwords etc.) is stored.
What are the things I should think about, and would running this DB on a dedicated server, next to some Docker applications (the Laravel apps), be fine? Of course we would set up automatic backups.
The reason I am looking into this is mainly cost.
https://redd.it/1ohu7yj
@r_devops
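One of the "things to think about" when leaving a managed DB: backups and their retention become your job. A cron'd `mysqldump` piped to gzip is the usual baseline; alongside it you need a retention sweep. A minimal sketch of the retention logic — the file-naming scheme and retention count are assumptions:

```python
# Sketch: given dated dump filenames, decide which to prune.
# Assumes names like 'db-2025-11-01.sql.gz' (ISO dates sort chronologically).
from pathlib import Path

def backups_to_delete(files: list[str], keep: int = 7) -> list[str]:
    """Return filenames beyond the newest `keep` dumps."""
    return sorted(files, reverse=True)[keep:]

def prune(backup_dir: str, keep: int = 7) -> None:
    names = [p.name for p in Path(backup_dir).glob("db-*.sql.gz")]
    for name in backups_to_delete(names, keep):
        (Path(backup_dir) / name).unlink()
```

Also worth testing before committing: actually restoring one of those dumps onto a scratch server, since an unverified backup is the classic self-hosting trap.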
How to stay up to date?
I’m a DevOps Engineer focused on building, improving and maintaining AWS infrastructure, so basically my stack is AWS, Terraform, GitHub Actions, a bit of Ansible (and Linux, of course). Those are my daily tools; however, I want to apply to Big Tech companies, and I realize they require multiple DevOps tools… As you might know, DevOps implies multiple tools, so how do you keep up to date with all of them? It is frustrating.
https://redd.it/1ohwcqk
@r_devops
How do you deal with stagnation when everything else about your job is great?
Hi everyone,
I’m a 13-year IT professional with experience mainly across DevOps, Cloud, and a bit of Data Engineering. I recently joined a service-based company about six months ago. The pay is decent, work-life balance is great, and the office is close by. I only need to go in a few days a month — so overall, it’s a very comfortable setup.
But the project and tech stack are extremely outdated. I was hired to help modernize things through DevOps, but most of the challenges are people- and process-related, not technical. The team is still learning very basic stuff, and there’s hardly any opportunity to work on modern tooling or architecture.
For the last few years, my learning curve was steep and exciting, but ever since joining this project, it’s almost flat. I’m starting to worry that staying in such an environment for too long could make me technologically handicapped in the long run.
I really don’t want to get stuck in a comfort zone and then realize years later that I’ve fallen behind. Because if, at some point, I want to switch jobs — whether for growth or monetary reasons — I might struggle to stay relevant.
So, I wanted to ask:
👉 How do you handle situations like this?
👉 How do you keep your skills sharp and your career moving forward when your current role offers comfort but little learning?
Would love to hear how others have navigated this phase without losing momentum.
https://redd.it/1ohwpk7
@r_devops
Experiment - bridging the gap between traditional networking and modern automation/API-driven approaches with AI
I work as a network admin; the only time you hear about our team is when something breaks. We spend the vast majority of our time auditing the network, doing enhancements, verifying redundancies — all the boring things that need to be done. I've been thinking a lot about bridging the gap between traditional networking and modern automation/API-driven approaches to create tools and ultimately have proactive alarming and troubleshooting. Here’s a project I am starting to document that I’ve been working on: https://youtu.be/rRZvta53QzI
There are a lot of videos of people showing a proof of concept of what AI can do for different applications, but nothing in-depth is out there. I spent the last six months really pushing the limits, relative to the work I do, to create something that is scalable, secure, restrictive and practical. Coding-wise, I did support for an Adobe ColdFusion application a lifetime ago, plus PowerShell scripting, so I understand the programming concepts, but I am a network admin first.
I would be curious to see if there are any actual developers exploring this space at this depth.
https://redd.it/1ohvdif
@r_devops
DNS Rebinding: Making Your Browser Attack Your Local Network 🌐
https://instatunnel.my/blog/dns-rebinding-making-your-browser-attack-your-local-network
https://redd.it/1ohzy30
@r_devops
Guide: How to add Basic Auth to Prometheus (or any app) on Kubernetes with AWS ALB Ingress (using Nginx sidecar)
I recently tackled a common challenge that many of us face: securing internal dashboards like Prometheus when exposed via an AWS ALB Ingress. While ALBs are powerful, they don't offer native Basic Auth, often pushing you towards more complex OIDC solutions when a simple password gate is all that's needed.
I've put together a comprehensive guide on how to implement this using an Nginx sidecar pattern directly within your Prometheus (or any) application pod. This allows Nginx to act as the authentication layer, proxying requests to your app only after successful authentication.
What the guide covers:
The fundamental problem of ALB & Basic Auth.
Step-by-step setup of the Nginx sidecar with custom nginx.conf, 401.html, and health.html.
Detailed `values.yaml` configurations for `kube-prometheus-stack` to include the sidecar, volume mounts, and service/ingress adjustments.
Crucially, how to implement a "smart" health check that validates the entire application's health, not just Nginx's.
This is a real-world, production-tested approach that avoids over-complication. I'm keen to hear your thoughts and experiences!
Read the full article here: https://www.dheeth.blog/enabling-basic-auth-kubernetes-alb-ingress/
Happy to answer any questions in the comments!
https://redd.it/1oi0ztc
@r_devops
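For readers who want the shape of the pattern before clicking through, a sidecar nginx.conf along these lines captures the idea — the ports, file paths, and htpasswd location here are assumptions, not the article's exact config:

```nginx
# Sketch of the sidecar: Nginx owns the pod's exposed port, demands Basic
# Auth, and proxies authenticated requests to Prometheus on localhost.
server {
    listen 8080;

    # Serve the health page without auth so ALB health checks and
    # kubelet probes don't need credentials.
    location /health.html {
        auth_basic off;
        root /usr/share/nginx/html;
    }

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/auth/.htpasswd;
        proxy_pass http://127.0.0.1:9090;  # Prometheus in the same pod
        proxy_set_header Host $host;
    }
}
```

The Service and Ingress then target the Nginx port (8080 here) instead of Prometheus directly, which is why the values.yaml adjustments in the guide matter.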
what's a "best practice" you actually disagree with?
We hear a lot of dogma about the "right" way to do things in DevOps. But sometimes, strict adherence to a best practice can create more complexity than it solves.
What's one commonly held "best practice" you've chosen to ignore in a specific context, and what was the result? Did it backfire or did it actually work better for your team?
https://redd.it/1oi1daa
@r_devops
Observability Sessions at KubeCon Atlanta (Nov 10-13)
Here's what's on the observability track that's relevant to day-to-day ops work:
OpenTelemetry sessions:
[Taming Telemetry at Scale](https://sched.co/27FUv) - standardizing observability across teams (Tue 11:15 AM)
Just Do It: OpAMP - Nike's production agent management setup (Tue 3:15 PM)
[Instrumentation Score](https://sched.co/27FWx) - figuring out if your traces are useful or just noise (Tue 4:15 PM)
Tracing LLM apps - observability for non-deterministic workloads (Wed 5:41 PM)
CI/CD + deployment observability:
[End-to-end CI/CD observability with OTel](https://colocatedeventsna2025.sched.com/event/28D4A) - instrumenting your entire pipeline, not just prod (Wed 2:05 PM)
Automated rollbacks using telemetry signals - feature flags that roll back based on metrics (Wed 4:35 PM)
[Making ML pipelines traceable](https://colocatedeventsna2025.sched.com/event/28D7e) - KitOps + Argo for MLOps observability (Wed 3:20 PM)
Observability for AI agents in K8s - platform design for agentic workloads (Wed 4:00 PM)
Observability Day on Nov 10 is worth hitting if you have an All-Access pass. Smaller rooms, better Q&A, less chaos.
Full breakdown with first-timer tips: https://signoz.io/blog/kubecon-atlanta-2025-observability-guide/
Disclaimer: I work at SigNoz. We'll be at Booth 1372 if anyone wants to talk shop about observability costs or self-hosting.
https://redd.it/1oi1vw6
@r_devops
CI/CD pipelines are starting to feel like products we need to maintain
I remember when setting up CI/CD was supposed to simplify releases. Build, test, deploy, done.
Now it feels like maintaining the pipeline is a full-time job on its own.
Every team wants a slightly different workflow. Every dependency update breaks a step.
Secrets expire, runners go missing, and self-hosted agents crash right before release.
And somehow, fixing the pipeline always takes priority over fixing the app.
At this point, it feels like we’re running two products: the one we ship to customers, and the one that ships the product.
anyone else feel like their CI/CD setup has become its own mini ecosystem?
How do you keep it lean and reliable without turning into a build engineer 24/7?
https://redd.it/1oi3clf
@r_devops
From CSI to ESO
Is anyone else struggling with migrating from the CSI driver to ESO using Azure Key Vault, for Spring Boot and Angular microservices on Kubernetes?
I feel like the Maven tests and the volumes are giving me the finger 🤣🤣.
Looking forward to hearing some other stories — maybe we can share experiences and learn 🤝
https://redd.it/1oi4ngn
@r_devops
My Analysis of the AWS US-EAST-1 Outage
I know I’m very late to this, but I spent some time digging into what actually happened during the AWS US-EAST-1 outage on October 19–20, 2025.
This wasn’t a typical “AWS had issues” situation. It was a complete control plane failure that revealed just how fragile large-scale cloud systems can be.
The outage originated in AWS’s **us-east-1 (Northern Virginia)** region, their oldest and most critical one.
Nearly every major online service touches this region in some capacity: Netflix, Zoom, Reddit, Coinbase, and even Amazon.com itself.
When us-east-1 fails, the internet feels it.
At around **11:49 PM PST**, AWS began seeing widespread errors with **DynamoDB**, a service that underpins several other AWS systems like EC2, Lambda, and IAM.
This time, it wasn’t due to hardware or a DDoS attack; it was a **software race condition** inside DynamoDB’s internal DNS automation.
# The Root Cause
AWS’s internal DNS management for DynamoDB works through two components:
* A **Planner**, which generates routing and DNS update plans.
* An **Enactor**, which applies those updates.
On that night, two Enactors ran simultaneously on different versions of a DNS plan.
The older one was delayed but eventually overwrote the newer one.
Then, an automated cleanup process deleted the valid DNS record.
Result: DynamoDB’s DNS entries were gone. Without DNS, no system including AWS’s own could locate DynamoDB endpoints.
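The Planner/Enactor race above is a classic last-writer-wins conflict, and the textbook guard is to version each plan and reject stale applies. An illustrative sketch of both behaviors — this is my own toy model, not AWS's actual code:

```python
# Illustration (not AWS's code): a delayed Enactor holding an old plan
# clobbers newer DNS state unless writes are guarded by a version check.
class DnsStore:
    def __init__(self):
        self.plan_version = 0
        self.records = {}

    def apply_unguarded(self, version: int, records: dict) -> None:
        # Last writer wins: the failure mode described above.
        self.plan_version = version
        self.records = records

    def apply_guarded(self, version: int, records: dict) -> bool:
        # Compare-and-set on the plan version: stale plans are rejected.
        if version <= self.plan_version:
            return False
        self.plan_version = version
        self.records = records
        return True
```

With the guarded variant, the slow Enactor's stale plan is a no-op instead of wiping the live records, and the cleanup step never sees a "valid" empty state to act on.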
# When AWS Lost Access to Itself
Once DynamoDB’s DNS disappeared, all services that depended on it started failing.
Internal control planes couldn’t find state data or connect to back-end resources.
In effect, AWS lost access to its own infrastructure.
Automation failed silently because the cleanup process “succeeded” from a system perspective.
There was no alert, no rollback, no safeguard. Manual recovery was the only option.
# The Cascade Effect
Here’s how the failure spread:
* **EC2** control plane failed first, halting new instance launches.
* **Autoscaling** stopped working.
* **Network Load Balancers** began marking healthy instances as unhealthy, triggering false failovers.
* **Lambda**, **SQS**, and **IAM** started failing, breaking authentication and workflows globally.
* Even AWS engineers struggled to access internal consoles to begin recovery.
What started as a DNS error in DynamoDB quickly became a multi-service cascade failure.
# Congestive Collapse During Recovery
When DynamoDB was restored, millions of clients attempted to reconnect simultaneously.
This caused a phenomenon known as **congestive collapse**: recovery traffic overwhelmed the control plane again.
AWS had to throttle API calls and disable automation loops to let systems stabilize.
Fixing the bug took a few hours, but restoring full service stability took much longer.
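On the client side, the standard defense against this reconnect stampede is exponential backoff with jitter, so retries spread out instead of arriving in synchronized waves. A minimal "full jitter" sketch — the base and cap values are assumptions:

```python
# Sketch: full-jitter exponential backoff, so a fleet of reconnecting
# clients doesn't hammer a freshly recovered service in lockstep.
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A retry loop would sleep for `backoff_delay(attempt)` between attempts; the randomness is what prevents the synchronized thundering herd, and the cap keeps worst-case waits bounded.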
# The Global Impact:
Over 17 million outage reports were recorded across more than 60 countries.
Major services including Snapchat, Reddit, Coinbase, Netflix, and Amazon.com were affected.
Banking portals, government services, and educational platforms experienced downtime — all due to a single regional failure.
# AWS Recovery Process:
AWS engineers manually restored DNS records using Route 53, disabled faulty automation processes, and slowly re-enabled systems.
The root issue was fixed in about three hours, but full recovery took over twelve hours because of the cascade effects.
# Key Lessons
1. **A region is a failure domain.** Multi-AZ designs alone don’t protect against regional collapse.
2. **Keep critical control systems (like CI/CD and IAM)** outside your main region.
3. **Managed services aren’t immune to failure.** Design for graceful degradation.
4. **Multi-region architecture should be the baseline**, not a luxury.
5. **Test for cascading failures** — not just isolated ones.
Even the most sophisticated cloud systems can fail if the fundamentals aren’t protected.
How would you design around a region-wide failure like this?
Would you go
multi-region, multi-cloud, or focus on reducing blast radius within AWS itself?
https://redd.it/1oi5o2g
@r_devops
Bifrost: An LLM Gateway built for enterprise-grade reliability, governance, and scale(50x Faster than LiteLLM)
If you’re building LLM applications at scale, your gateway can’t be the bottleneck. That’s why we built **Bifrost**, a high-performance, fully self-hosted LLM gateway in Go. It’s 50× faster than LiteLLM, built for speed, reliability, and full control across multiple providers.
The project is **fully open-source**. Try it, star it, or contribute directly: [https://github.com/maximhq/bifrost](https://github.com/maximhq/bifrost)
**Key Highlights:**
* **Ultra-low overhead:** \~11µs per request at 5K RPS, scales linearly under high load.
* **Adaptive load balancing:** Distributes requests across providers and keys based on latency, errors, and throughput limits.
* **Cluster mode resilience:** Nodes synchronize in a peer-to-peer network, so failures don’t disrupt routing or lose data.
* **Drop-in OpenAI-compatible API:** Works with existing LLM projects, one endpoint for 250+ models.
* **Full multi-provider support:** OpenAI, Anthropic, AWS Bedrock, Google Vertex, Azure, and more.
* **Automatic failover:** Handles provider failures gracefully with retries and multi-tier fallbacks.
* **Semantic caching:** Deduplicates similar requests to reduce repeated inference costs.
* **Multimodal support:** Text, images, audio, speech, transcription; all through a single API.
* **Observability:** Out-of-the-box OpenTelemetry support for observability. Built-in dashboard for quick glances without any complex setup.
* **Extensible & configurable:** Plugin based architecture, Web UI or file-based config.
* **Governance:** SAML support for SSO and Role-based access control and policy enforcement for team collaboration.
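To illustrate the semantic-caching idea from the list above: a toy Python sketch that collapses trivially-equivalent prompts by normalized hash. A real semantic cache (Bifrost's included) matches *similar* prompts, typically via embeddings; `PromptCache` here is purely illustrative.

```python
import hashlib

class PromptCache:
    """Toy response cache keyed on a normalized prompt.

    Only collapses prompts that differ in case or whitespace; a real
    semantic cache would also match paraphrases via embedding similarity.
    """
    def __init__(self):
        self._store = {}

    def _key(self, prompt):
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt):
        return self._store.get(self._key(prompt))

    def put(self, prompt, response):
        self._store[self._key(prompt)] = response
```

Even this exact-match form avoids paying for inference twice when two callers send the same question with cosmetic differences.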
**Benchmarks (identical hardware vs LiteLLM).** Setup: a single t3.medium instance with a mock LLM at 1.5 s latency.
|Metric|LiteLLM|Bifrost|Improvement|
|:-|:-|:-|:-|
|**p99 Latency**|90.72s|1.68s|\~54× faster|
|**Throughput**|44.84 req/sec|424 req/sec|\~9.4× higher|
|**Memory Usage**|372MB|120MB|\~3× lighter|
|**Mean Overhead**|\~500µs|**11µs @ 5K RPS**|\~45× lower|
**Why it matters:**
Bifrost behaves like core infrastructure: minimal overhead, high throughput, multi-provider routing, built-in reliability, and total control. It’s designed for teams building production-grade AI systems who need performance, failover, and observability out of the box.
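The multi-tier fallback behavior described above can be sketched provider-agnostically; here plain callables stand in for real provider clients (this is not Bifrost's code, just the pattern):

```python
def call_with_fallback(providers, request):
    """Try providers in order, falling back on failure.

    providers: list of callables taking the request; each raises on
    failure. Returns the first successful response, or re-raises the
    last error once every tier has failed.
    """
    last_err = None
    for call in providers:
        try:
            return call(request)
        except Exception as err:
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

A gateway adds retries, latency-aware ordering, and key rotation on top, but the contract is the same: the caller sees one endpoint and never handles per-provider failures itself.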
https://redd.it/1oi5xtk
@r_devops
What's cheaper than AWS Fargate for container deploys?
What's cheaper than AWS Fargate?
We use Fargate at work and it's convenient, but I'm getting annoyed by containers being shut down overnight for cost savings, causing a bunch of problems (for me as a dev).
I just want to deploy containers to some cheaper non-AWS platform so they run 24/7. Does OVH/Hetzner have something like this?
Or others that are NOT Azure/Google?
What do you guys use?
https://redd.it/1oi5j5m
@r_devops
Kubernetes homelab
Hello guys
I’ve just finished my internship in the DevOps/cloud field, working with GKE, Terraform, Terragrunt and many more tools. I’m now curious to deepen my foundation: do you recommend investing money to build a homelab setup? Is it worth it?
And if so, how much do you think it would cost?
https://redd.it/1oi7lab
@r_devops
Playwright vs Selenium alternatives: spent 6 months with flaky tests before finding something stable
Our pipeline has maybe 80 end to end tests and probably 15 of them are flaky. They'll pass locally every time, pass in CI most of the time, but fail randomly maybe 1 in 10 runs. Usually timing issues or something with how the test environment loads.
The problem is now nobody trusts the CI results. If the build fails, first instinct is to just rerun it instead of actually investigating. I've tried increasing wait times, adding retry logic, all the standard stuff. It helps but doesn't solve it.
I know the real answer is probably to rewrite the tests to be more resilient but nobody has time for that. We're a small team and rewriting tests doesn't ship features.
Wondering if anyone's found tools that just handle this better out of the box. We use playwright currently. I tested spur a bit and it seemed more stable but haven't fully migrated anything yet. Would rather not spend three months rewriting our entire test suite if there's a better approach.
What's actually worked for other teams dealing with this?
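For the "adding retry logic" part: centralizing it in one helper at least keeps the behavior consistent across the suite. A hedged Python sketch (this is a band-aid that masks timing bugs rather than fixing them, as the post itself suspects):

```python
import functools
import time

def retry_flaky(times=3, delay=0.0):
    """Decorator: rerun a flaky test/step up to `times` attempts.

    Only retries on AssertionError, so genuine crashes still fail
    immediately; the final attempt's failure propagates unchanged.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except AssertionError:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay)
        return inner
    return wrap
```

The real payoff of a single helper is visibility: log every retried attempt, and you get a ranked list of which tests are flakiest, which is the list to spend the limited rewrite time on.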
https://redd.it/1oi8z4m
@r_devops