Reddit DevOps – Telegram
what's a "best practice" you actually disagree with?

We hear a lot of dogma about the "right" way to do things in DevOps. But sometimes, strict adherence to a best practice can create more complexity than it solves.

What's one commonly held "best practice" you've chosen to ignore in a specific context, and what was the result? Did it backfire or did it actually work better for your team?

https://redd.it/1oi1daa
@r_devops
Observability Sessions at KubeCon Atlanta (Nov 10-13)

Here's what's on the observability track that's relevant to day-to-day ops work:

OpenTelemetry sessions:

* [Taming Telemetry at Scale](https://sched.co/27FUv) - standardizing observability across teams (Tue 11:15 AM)
* Just Do It: OpAMP - Nike's production agent management setup (Tue 3:15 PM)
* [Instrumentation Score](https://sched.co/27FWx) - figuring out if your traces are useful or just noise (Tue 4:15 PM)
* Tracing LLM apps - observability for non-deterministic workloads (Wed 5:41 PM)

CI/CD + deployment observability:

* [End-to-end CI/CD observability with OTel](https://colocatedeventsna2025.sched.com/event/28D4A) - instrumenting your entire pipeline, not just prod (Wed 2:05 PM)
* Automated rollbacks using telemetry signals - feature flags that roll back based on metrics (Wed 4:35 PM)
* [Making ML pipelines traceable](https://colocatedeventsna2025.sched.com/event/28D7e) - KitOps + Argo for MLOps observability (Wed 3:20 PM)
* Observability for AI agents in K8s - platform design for agentic workloads (Wed 4:00 PM)

Observability Day on Nov 10 is worth hitting if you have an All-Access pass. Smaller rooms, better Q&A, less chaos.

Full breakdown with first-timer tips: https://signoz.io/blog/kubecon-atlanta-2025-observability-guide/

Disclaimer: I work at SigNoz. We'll be at Booth 1372 if anyone wants to talk shop about observability costs or self-hosting.

https://redd.it/1oi1vw6
@r_devops
CI/CD pipelines are starting to feel like products we need to maintain

I remember when setting up CI/CD was supposed to simplify releases. Build, test, deploy, done.
Now it feels like maintaining the pipeline is a full-time job on its own.

Every team wants a slightly different workflow. Every dependency update breaks a step.
Secrets expire, runners go missing, and self-hosted agents crash right before release.
And somehow, fixing the pipeline always takes priority over fixing the app.

At this point, it feels like we’re running two products: the one we ship to customers, and the one that ships the product.

anyone else feel like their CI/CD setup has become its own mini ecosystem?
How do you keep it lean and reliable without turning into a build engineer 24/7?

https://redd.it/1oi3clf
@r_devops
From CSI to ESO

Is anyone else struggling with migrating from the Secrets Store CSI driver to ESO (External Secrets Operator) using Azure Key Vault, for Spring Boot and Angular microservices on Kubernetes?

I feel like the Maven tests and the volumes are giving me the finger 🤣🤣.

Looking forward to hearing some other stories; maybe we can share experiences and learn 🤝

https://redd.it/1oi4ngn
@r_devops
My analysis of the AWS US-EAST-1 outage!

I know I’m very late to this, but I spent some time digging into what actually happened during the AWS US-EAST-1 outage on October 19–20, 2025.
This wasn’t a typical “AWS had issues” situation. It was a complete control plane failure that revealed just how fragile large-scale cloud systems can be.

The outage originated in AWS's **us-east-1 (Northern Virginia)** region, its oldest and most critical.
Nearly every major online service touches this region in some capacity: Netflix, Zoom, Reddit, Coinbase, and even Amazon.com itself.
When us-east-1 fails, the internet feels it.

At around **11:49 PM PST**, AWS began seeing widespread errors with **DynamoDB**, a service that underpins several other AWS systems like EC2, Lambda, and IAM.
This time, it wasn't due to hardware or a DDoS attack; it was a **software race condition** inside DynamoDB's internal DNS automation.

# The Root Cause

AWS’s internal DNS management for DynamoDB works through two components:

* A **Planner**, which generates routing and DNS update plans.
* An **Enactor**, which applies those updates.

On that night, two Enactors ran simultaneously on different versions of a DNS plan.
The older one was delayed but eventually overwrote the newer one.
Then, an automated cleanup process deleted the valid DNS record.
Result: DynamoDB’s DNS entries were gone. Without DNS, no system including AWS’s own could locate DynamoDB endpoints.
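
To make that sequence concrete, here's a purely illustrative Python sketch (not AWS's actual code, and all names are made up) of how an Enactor that doesn't check plan versions, followed by a cleanup pass, can leave zero valid DNS records:

```python
import threading
import time

# Toy "DNS table" and two enactors applying plans without comparing plan
# versions -- the missing safeguard described above.
dns_records = {}          # name -> (plan_version, endpoint)
lock = threading.Lock()

def enact(plan_version, endpoint, delay=0.0):
    """Apply a DNS plan. Note: no check that plan_version is newer than
    what's already applied -- that's the race."""
    time.sleep(delay)
    with lock:
        dns_records["dynamodb.us-east-1"] = (plan_version, endpoint)

def cleanup(current_version):
    """Delete records belonging to plans older than current_version.
    If a stale enactor just overwrote the record, this removes the only
    entry clients could still resolve."""
    with lock:
        for name, (version, _) in list(dns_records.items()):
            if version < current_version:
                del dns_records[name]

# The newer plan (v2) is applied first; the delayed, older plan (v1)
# then overwrites it...
t_new = threading.Thread(target=enact, args=(2, "10.0.0.2"))
t_old = threading.Thread(target=enact, args=(1, "10.0.0.1", 0.1))
t_new.start()
t_old.start()
t_new.join()
t_old.join()

# ...and cleanup of "anything older than v2" deletes the record entirely.
cleanup(current_version=2)
print(dns_records)   # {} -> no DNS entry left for DynamoDB
```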

# When AWS Lost Access to Itself

Once DynamoDB’s DNS disappeared, all services that depended on it started failing.
Internal control planes couldn’t find state data or connect to back-end resources.
In effect, AWS lost access to its own infrastructure.

Automation failed silently because the cleanup process “succeeded” from a system perspective.
There was no alert, no rollback, no safeguard. Manual recovery was the only option.

# The Cascade Effect

Here’s how the failure spread:

* **EC2** control plane failed first, halting new instance launches.
* **Autoscaling** stopped working.
* **Network Load Balancers** began marking healthy instances as unhealthy, triggering false failovers.
* **Lambda**, **SQS**, and **IAM** started failing, breaking authentication and workflows globally.
* Even AWS engineers struggled to access internal consoles to begin recovery.

What started as a DNS error in DynamoDB quickly became a multi-service cascade failure.

# Congestive Collapse During Recovery

When DynamoDB was restored, millions of clients attempted to reconnect simultaneously.
This caused a phenomenon known as **congestive collapse**: recovery traffic overwhelmed the control plane again.
AWS had to throttle API calls and disable automation loops to let systems stabilize.
Fixing the bug took a few hours, but restoring full service stability took much longer.
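
The post doesn't cover the client side, but the standard defence against this kind of reconnect stampede is retrying with capped exponential backoff and full jitter, so clients spread out instead of hammering the recovering service in lockstep. A minimal sketch, with made-up parameter values:

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base=0.5, cap=30.0):
    """Retry a failing call with capped exponential backoff and full jitter,
    so that millions of clients don't reconnect at the same instant after
    an outage."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            sleep_for = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(sleep_for)

# Usage (hypothetical client call):
# item = call_with_backoff(lambda: dynamodb_client.get_item(TableName="t", Key=key))
```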

# The Global Impact

Over 17 million outage reports were recorded across more than 60 countries.
Major services including Snapchat, Reddit, Coinbase, Netflix, and Amazon.com were affected.
Banking portals, government services, and educational platforms experienced downtime — all due to a single regional failure.

# AWS Recovery Process

AWS engineers manually restored DNS records using Route 53, disabled faulty automation processes, and slowly re-enabled systems.
The root issue was fixed in about three hours, but full recovery took over twelve hours because of the cascade effects.



# Key Lessons

1. **A region is a failure domain.** Multi-AZ designs alone don’t protect against regional collapse.
2. **Keep critical control systems (like CI/CD and IAM)** outside your main region.
3. **Managed services aren’t immune to failure.** Design for graceful degradation.
4. **Multi-region architecture should be the baseline**, not a luxury.
5. **Test for cascading failures** — not just isolated ones.

Even the most sophisticated cloud systems can fail if the fundamentals aren’t protected.

How would you design around a region-wide failure like this?
Would you go multi-region by default?

Bifrost: An LLM Gateway built for enterprise-grade reliability, governance, and scale (50× faster than LiteLLM)

If you’re building LLM applications at scale, your gateway can’t be the bottleneck. That’s why we built **Bifrost**, a high-performance, fully self-hosted LLM gateway in Go. It’s 50× faster than LiteLLM, built for speed, reliability, and full control across multiple providers.

The project is **fully open-source**. Try it, star it, or contribute directly: [https://github.com/maximhq/bifrost](https://github.com/maximhq/bifrost)

**Key Highlights:**

* **Ultra-low overhead:** \~11µs per request at 5K RPS, scales linearly under high load.
* **Adaptive load balancing:** Distributes requests across providers and keys based on latency, errors, and throughput limits.
* **Cluster mode resilience:** Nodes synchronize in a peer-to-peer network, so failures don’t disrupt routing or lose data.
* **Drop-in OpenAI-compatible API:** Works with existing LLM projects, one endpoint for 250+ models (see the sketch after this list).
* **Full multi-provider support:** OpenAI, Anthropic, AWS Bedrock, Google Vertex, Azure, and more.
* **Automatic failover:** Handles provider failures gracefully with retries and multi-tier fallbacks.
* **Semantic caching:** deduplicates similar requests to reduce repeated inference costs.
* **Multimodal support:** Text, images, audio, speech, transcription; all through a single API.
* **Observability:** Out-of-the-box OpenTelemetry support for observability. Built-in dashboard for quick glances without any complex setup.
* **Extensible & configurable:** Plugin based architecture, Web UI or file-based config.
* **Governance:** SAML support for SSO and Role-based access control and policy enforcement for team collaboration.
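
For anyone wondering what "drop-in OpenAI-compatible" means in practice (referenced in the API bullet above), here's a minimal sketch using the official `openai` Python client pointed at a self-hosted gateway; the base URL, port, and API-key handling are assumptions for illustration, not taken from the Bifrost docs:

```python
from openai import OpenAI

# Hypothetical gateway endpoint; adjust to wherever your Bifrost
# (or any OpenAI-compatible gateway) instance is actually listening.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed address, not from the docs
    api_key="gateway-managed",            # gateways often hold provider keys themselves
)

# The gateway decides which upstream provider serves this model name.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```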

**Benchmarks (identical hardware vs LiteLLM):** Setup: single t3.medium instance, mock LLM with 1.5 seconds of latency.

|Metric|LiteLLM|Bifrost|Improvement|
|:-|:-|:-|:-|
|**p99 Latency**|90.72s|1.68s|\~54× faster|
|**Throughput**|44.84 req/sec|424 req/sec|\~9.4× higher|
|**Memory Usage**|372MB|120MB|\~3× lighter|
|**Mean Overhead**|\~500µs|**11µs @ 5K RPS**|\~45× lower|

**Why it matters:**

Bifrost behaves like core infrastructure: minimal overhead, high throughput, multi-provider routing, built-in reliability, and total control. It’s designed for teams building production-grade AI systems who need performance, failover, and observability out of the box.

https://redd.it/1oi5xtk
@r_devops
What's cheaper than AWS Fargate for container deploys?

What's cheaper than AWS Fargate?

We use Fargate at work and it's convenient, but I'm getting annoyed by containers being shut down overnight for cost savings, causing a bunch of problems (for me as a dev).

I just want to deploy containers to some cheaper, non-AWS platform so they run 24/7. Does OVH or Hetzner have something like this?

Or others that are NOT Azure/Google?

What do you guys use?

https://redd.it/1oi5j5m
@r_devops
Kubernetes homelab

Hello guys,
I’ve just finished my internship in the DevOps/cloud field, working with GKE, Terraform, Terragrunt, and many other tools. I’m now curious to deepen my foundation: do you recommend investing money in building a homelab setup? Is it worth it?
And if so, how much do you think it would cost?

https://redd.it/1oi7lab
@r_devops
playwright vs selenium alternatives: spent 6 months with flaky tests before finding something stable

Our pipeline has maybe 80 end-to-end tests and probably 15 of them are flaky. They'll pass locally every time, pass in CI most of the time, but fail randomly maybe 1 in 10 runs. Usually it's timing issues or something with how the test environment loads.

The problem is now nobody trusts the CI results. If the build fails, first instinct is to just rerun it instead of actually investigating. I've tried increasing wait times, adding retry logic, all the standard stuff. It helps but doesn't solve it.

I know the real answer is probably to rewrite the tests to be more resilient but nobody has time for that. We're a small team and rewriting tests doesn't ship features.

Wondering if anyone's found tools that just handle this better out of the box. We use playwright currently. I tested spur a bit and it seemed more stable but haven't fully migrated anything yet. Would rather not spend three months rewriting our entire test suite if there's a better approach.
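
Not a silver bullet, but one pattern that tends to cut this class of flake: lean on Playwright's auto-retrying locator assertions instead of fixed waits, so the test waits on the condition rather than a timer. A minimal Python sketch; the URL and selectors are invented for illustration:

```python
from playwright.sync_api import sync_playwright, expect

def test_checkout_banner():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://staging.example.com/checkout")  # hypothetical URL

        # Instead of page.wait_for_timeout(5000), assert on the condition.
        # expect() retries until the element matches or the timeout expires.
        page.get_by_role("button", name="Place order").click()
        expect(page.get_by_test_id("confirmation-banner")).to_be_visible(timeout=15_000)

        browser.close()
```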

What's actually worked for other teams dealing with this?

https://redd.it/1oi8z4m
@r_devops
Practicing interviews taught me more about my job than any cert

I didn't expect mock interviews to change how I handle emergencies. I've done AWS certifications, Jenkins pipelines, and Prometheus dashboards. All useful, sure. But none of them taught me how to work in the real world.

While prepping for a role switch, I started running scenario drills from the iqb interview question bank and recording myself with my beyz coding assistant. GPT would also randomly throw out mock interview questions like "Pipeline rollback error" or "Alarms surge at 2 a.m."

Replaying my own answers, I realized my thinking was scattered. There was a huge gap between what I thought in my head and what I actually said. I'd jump straight to a Terraform or Kubernetes fix, skipping the rollback logic and even forgetting who was responsible for what. I began to wonder if I was easily disrupted by the backlog of tasks at work, too.

Many weeks passed in this chaotic state... with no clear idea of what I'd actually done, whether I'd made any progress, or whether I'd documented anything. So, when faced with many interview questions, I couldn't use STAR or other methods to describe the challenges I encountered and the final results of my projects.

So now, I've started taking notes again... I write down my thoughts before I start. Then I list to-do items. For example, I check Grafana trends, connect with PagerDuty, and review recent merges in GitHub, and then take action. This helps me slow down and avoid making stupid mistakes that waste time re-analyzing bugs.

https://redd.it/1oiabuh
@r_devops
How do you all feel about Wiz?

Curious who’s used the DSO tool/platform Wiz, what your experiences were, and your opinions on it… is it widely used in the industry and I’ve just somehow managed to not be exposed to it to this point?

I’m being asked to review our org’s proposal to use it as part of our DSO implementation plan, which I just found out exists, and I'm slightly annoyed there are a bunch of vendor products in here I’ve not been exposed to, which is really saying something tbh haha.

https://redd.it/1oie7ji
@r_devops
Amazon layoffs, any infra engineers impacted?

Today, Amazon announced 30k layoffs; most posts on LinkedIn I’ve seen were from HR/Recruiting. Curious to know if they laid off any DevOps/SRE folks, as that would imply a lot of Amazon engineers will be coming onto the market. Anyone hear anything?

https://redd.it/1oigvwn
@r_devops
Intel SGX alternative migration - moved to Intel TDX and AMD SEV with better results

Built our entire privacy stack around Intel SGX. Then Intel announced they're discontinuing the attestation service in 2025.

Spent two months in panic mode migrating everything. Painful process but honestly ended up in a better place than before.

New setup uses Intel TDX and AMD SEV with a universal API layer so we're not locked into one vendor anymore. Performance is actually better than SGX was and we have proper redundancy now. If one TEE vendor has issues we can failover to another.

If you're still on SGX, start planning your migration now. The deadline is closer than you think and these projects always take longer than estimated.

https://redd.it/1oiaznw
@r_devops
Looking to learn more about authentication

Hey there,

For some background, I started as a dev 10+ years ago, always did some infra on the side, and switched to mainly infra ~6 years ago.

My specialty is Kubernetes, including bare-metal clusters and a lot of observability on the Grafana stack at interesting scale (a few dozen TB of logs a day).

Thing is, I'm behind on authentication / authorization subjects, as it was often already in place or managed by someone else.

I'm currently trying to redo the auth system for a personal project, and I'm taking a lot of time to learn about all the ways to solve my issues (centralizing auth/permissions, authenticating APIs via a gateway, following zero trust more closely, maybe with some service mesh).

I'd be happy to share the knowledge I have, and receive some in return in subjects I'm weaker at.

If anyone is interested in a conversation, hit me up!

Cheers

https://redd.it/1oiemho
@r_devops
Would you let devs do this?

In our organization, we have a team that is responsible for 'devops'. They connect the security, dev, and infra teams' needs to deliver a product. The development team has recently decided that they want to be completely in charge of building their artifacts on their own systems (local laptops, etc.) and that the folks with devops responsibilities only need to take their artifacts and run them. We've expressed our concerns with this process to management, but it appears to be a losing battle of attrition. The current pipeline has many security processes built in that can notify devs of issues early and let them fix things before even getting to a test or deployment stage. Am I crazy for thinking we shouldn't shift those checks to deployment time, and that we should keep the roles/responsibilities separated as they are? What do you all think, and what do you do in your orgs?

This isn't a time issue as the time to run the current pipeline with the security features in place takes less than 4 minutes.

https://redd.it/1oisa32
@r_devops
rolling back to bare metal kubernetes on top of Linux?

Since Broadcom is raising our license cost by 300% (after negotiation and discount), we're looking for options to reduce our license footprint.

Our existing k8s is just running on Linux VMs in our vSphere with Rancher. We have some workloads in Tanzu but nothing critical.

Have I just been out of the game in running OSes on bare-metal servers, or is there a good reason why we don't just convert a chunk of our ESX servers to Debian and run Kubernetes on there? It'll make a couple hundred thousand dollars' difference annually...

https://redd.it/1oit22p
@r_devops
Suggestion

Honestly, Linode’s fine but it feels kinda outdated. The support’s okay, but the UI and performance can be inconsistent. I know there’s GCP, Azure, and AWS out there; which one’s the best to learn that’s modern, flexible, and still affordable?

https://redd.it/1oiv2d5
@r_devops
Suggestion about learning active directory

Hello all,
I am learning DevOps from scratch on YouTube. I started with AWS; I recently learned IAM, and after that there is a topic on Active Directory setup. The use case the YouTuber gave: if there are many users (e.g., 2,000), it becomes difficult to set up each user, create IAM roles, and handle role switching individually. While learning this topic I can follow what he is doing and how he is doing it, but it is hard to relate to since I do not have a networking background. Should I learn this topic? Is it important for DevOps? Please share your inputs.

https://redd.it/1oiw1rp
@r_devops
Self-hosted alternatives to Jira that don't require a PhD to set up?

We want to move away from Atlassian but every self-hosted alternative seems to require days of configuration or is missing critical features. What are people actually using that works out of the box?

https://redd.it/1oijtow
@r_devops