Reddit DevOps – Telegram
Stop trusting your Terraform State file. It’s lying to you.

I've been in a bit of a debate with my platform team this week and wanted to sanity check this with you guys.

We’re doing a massive migration for a Sovereign Cloud environment, so compliance is tight. During the audit, I realized something that scared the hell out of me: we treat our Terraform State file like it's the gospel truth. But it's not. It's just a cached memory of what infrastructure used to look like.

The moment a Junior Admin hotfixes a Security Group in the AWS Console at 2 AM because "prod is down," that State file is technically corrupt. It doesn't match Reality (the Cloud API) anymore.

Most of our pipelines were just running terraform plan, which compares Git vs State. It assumes State is accurate. It completely ignores the fact that someone might have clicked around in the console three days ago.

So, I forced a change: The Hard Drift Gate.

We added terraform plan -refresh-only -detailed-exitcode before the regular plan. If it returns Exit Code 2 (Drift Detected), the pipeline dies. Hard stop. No deploying new code until you acknowledge or import the manual changes.
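The core of the gate is just exit-code plumbing. A minimal sketch (function and message names are illustrative, not our exact pipeline code): `-detailed-exitcode` makes `terraform plan` return 0 for "no changes", 1 for errors, and 2 for "changes present", which in a `-refresh-only` plan means drift.

```shell
#!/usr/bin/env bash
# Hard Drift Gate: run before the regular plan.
# With -detailed-exitcode: 0 = no changes, 1 = error,
# 2 = changes present (for -refresh-only, that means drift).
set -uo pipefail   # no -e: we need to inspect the exit code ourselves

check_drift() {
  case "$1" in
    0) echo "no-drift" ;;                                    # state matches the cloud API
    2) echo "DRIFT DETECTED: reconcile or import before deploying" >&2
       return 1 ;;                                           # hard stop
    *) echo "terraform plan failed (exit $1)" >&2
       return "$1" ;;
  esac
}

# In the pipeline:
#   terraform plan -refresh-only -detailed-exitcode -input=false
#   check_drift "$?" || exit 1
```

The filtering of "noise" vs "actual risks" happens after this gate fires; the gate itself only decides whether the pipeline is allowed to proceed at all.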

The pushback has been real. Half my team hates it. They say it kills velocity because they can't just "blast out a fix" if there's existing drift from a previous hotfix. They have to clean up the mess first.

My argument: Deploying on top of unknown manual changes isn't "velocity," it's negligence. Especially when a manual change might have exposed a private bucket to the public internet, and your standard apply might just silently overwrite it (or worse, ignore it).

I wrote up the exact bash logic we used to trap the exit codes and how we filter out "noise" vs "actual risks" (like data residency violations). To avoid spamming the sub, I pinned the full write-up to my profile if anyone wants to steal the script.

Am I being too strict here? How do you guys handle the "ClickOps" gap? Do you block the pipeline, or just let Terraform bulldoze over the manual changes and hope for the best?

https://redd.it/1qk60ll
@r_devops
PM question: what to do when automation becomes just another project?

I sit between product and QA, and lately automation is feeling like a whole project all on its own.

manual regression is slow and frustrating, but every time we try to automate more it seems to come with a load of headaches: months of setup, new tools to learn, and only one or two people on the team actually knowing how it works.

it’s making automation hard to justify when timelines are already tight.

for teams that actually made the transition to automated testing what made it click?

trying to figure it out before we invest more time into this.

https://redd.it/1qk2h48
@r_devops
Have you used adviser.sh - if so, what were your experiences?

Someone told me about this at work today. Looking at the blurb for https://github.com/adviserlabs/docs/tree/main, it seems to promise a way to run large-scale compute and data workflows without having to know how infrastructure, cloud configuration, or orchestration details work...

I’m generally skeptical of “magic” abstractions, but I’ve spent a fair amount of time dealing with HPC clusters, cloud schedulers, and workflow tooling, so I can see how something like this could be useful.

What’s it actually like in practice?

https://redd.it/1qke9uz
@r_devops
How do you version independent Reusable Workflows in a single repo?

I'm trying to set up a centralized repository for my organization's GitHub Actions Reusable Workflows. I want to use Release Please to automate semantic versioning and changelog generation.

The problem:

I have multiple workflows that serve different purposes (e.g., ci.yml, deploy-aws.yml). Ideally, I want to version them independently (monorepo style) so a breaking change in "Deploy" doesn't force a major version bump for "CI".

However, I'm hitting a wall:

1. GitHub requires all reusable workflows to reside in .github/workflows/ (a flat file structure).

2. Release Please (and most semantic-release tools) relies on folder separation to detect independent packages and manage separate versions.

Because all the YAML files sit in one folder, the tooling treats the repo as a single package.

How do other organizations manage this? Shared workflows are pretty common, so I assume someone has solved it.
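One direction I've been considering (a hypothetical workaround, not something Release Please supports for flat folders out of the box): keep the repo as a single package but cut per-workflow git tags, so callers pin each workflow independently even though the files share one folder. The `<name>-v<semver>` scheme below is my own assumption, not a GitHub convention:

```shell
# workflow_tag <component> <version>  ->  e.g. "ci-v1.2.0"
workflow_tag() {
  printf '%s-v%s\n' "$1" "$2"
}

# Cut and push a tag for one workflow (run from the workflows repo):
#   git tag "$(workflow_tag ci 1.2.0)" && git push origin "ci-v1.2.0"
#
# Callers then pin per workflow despite the flat folder:
#   uses: your-org/workflows/.github/workflows/ci.yml@ci-v1.2.0
#   uses: your-org/workflows/.github/workflows/deploy-aws.yml@deploy-aws-v2.0.0
```

A breaking change in deploy-aws then only bumps the deploy-aws tag line, not ci's.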

https://redd.it/1qk63vg
@r_devops
I built an open source AI agent for incident response

I worked on database infra at a big company and spent a lot of time on call. We had a ton of alerts and dashboards, and I hated jumping between a million tabs just to understand what was going on.

So I built an open source AI agent to help with that.

It runs alongside an incident and:

- reads alerts, logs, metrics, and Slack
- keeps a running summary of what’s happening
- tracks what’s been tried and what hasn’t
- suggests mitigations (like rolling back a deploy or drafting a fix PR), but a human has to approve anything before it runs

I used earlier versions during real incidents and it was useful enough that I kept working on it. This is the first open source release.

Repo: https://github.com/incidentfox/incidentfox
README has setup instructions and a demo you can run locally.

https://redd.it/1qkjqqf
@r_devops
Terraform AWS Infrastructure Framework (Multi-Env, Name-Based, Scales by Config)

🚀 Excited to share my latest open-source project: a Terraform framework for AWS focused on multi-environment infrastructure management.

After building and refining patterns across multiple environments, I open-sourced a framework that helps teams keep deployments consistent across dev / qe / prod.

The problem:
Managing AWS infra across dev / qe / prod usually leads to:
- Configuration drift between environments
- Hardcoded resource IDs everywhere
- Repetitive boilerplate when adding “one more” resource
- Complex dependency management across modules

The solution:
A workspace-based framework with automation:

- Automatic resource linking — reference resources by name, not IDs. The framework resolves and injects IDs automatically across modules.
- DRY architecture — one codebase for dev / qe / prod using Terraform workspaces.
- Scale by configuration, not code — create unlimited resources WITHOUT re-calling modules. Just add entries in a .tfvars file using plain-English names (e.g., “prod_vpc”, “private_subnet_az1”, “eks_cluster_sg”).

What’s included:
- VPC networking (multi-AZ, public/private subnets)
- Internet gateway, NAT gateway, route tables, EIPs
- Security groups + SG-to-SG references
- VPC endpoints (Gateway & Interface)
- EKS cluster + managed node groups

Real example:
# terraform.tfvars (add more entries, no new module blocks)
eks_clusters = {
  prod = {
    my_cluster = {
      cluster_version = "1.34"
      vpc_name        = "prod_vpc"                 # name, not ID
      subnet_name     = ["pri_sub1", "pri_sub2"]   # names, not IDs
      sg_name         = ["eks_cluster_sg"]         # name, not ID
    }
  }
}
# Framework injects vpc_id, subnet_ids, sg_ids automatically

GitHub:
https://github.com/rajarshigit2441139/terraform-aws-infrastructure-framework

Looking for:
- Feedback from the community
- Contributors interested in IaC patterns
- Teams standardizing AWS deployments

Question:
What are your biggest challenges with multi-environment Terraform? How do you handle cross-module references today?

#Terraform #AWS #InfrastructureAsCode #DevOps #CloudEngineering #EKS #Kubernetes #OpenSource #CloudArchitecture #SRE

https://redd.it/1qkjko7
@r_devops
Shall we introduce Rule against AI Generated Content?

We’ve been seeing an increase in AI generated content, especially from new accounts.

We’re considering adding a Low-effort / Low-quality rule that would include AI-generated posts.

We want your input before making changes. Please share your thoughts below.

https://redd.it/1qkliqo
@r_devops
Is specialising in GCP good for my career or should I move?

Hey,


Looking for advice.

I have spent nearly 5 years at my current DevOps job because it's ideal for me in terms of team chemistry, learning, and WLB. The only "issue" is that we use Google Cloud, which I like using, but I'm not sure if that matters.

I know AWS is the dominant cloud provider. Am I sabotaging my career development by staying longer at this place? Obviously you can say cloud skills transfer over, but loads of job descriptions say "2/3/4+ years' experience in AWS/Azure", so there are a lot of roles I might just be screened out of.

Everyone is different, but I wondered what other people's opinions would be on this. I would probably have to move to a similar mid or junior level; should I move just to improve career prospects? Could I still get hired for other cloud roles with extensive experience in GCP if I showed I could learn?


Also, I want to add that I have already built personal projects in AWS, but I feel they only have value up to a certain point. Employers want production management and org-level administration experience, of which I have very little.

https://redd.it/1qkm8w5
@r_devops
When to use Ansible vs Terraform, and where does Argo CD fit?

I’m trying to clearly understand where Ansible, Terraform, and Argo CD fit in a modern Kubernetes/GitOps setup, and I’d like to sanity-check my understanding with the community.

From what I understand so far:

Terraform is used for infrastructure provisioning (VMs, networks, cloud resources, managed K8s, etc.)
Ansible is used for server configuration (OS packages, files, services), usually before or outside Kubernetes

This part makes sense to me.

Where I get confused is Argo CD.

Let’s say:

A Kubernetes cluster (EKS / k3s / etc.) is created using Terraform
Now I want to install Argo CD on that cluster

Questions:

1. What is the industry-standard way to install Argo CD?
Terraform Kubernetes provider?
Ansible?
Or just a simple `kubectl apply` / bash script?
2. Is the common pattern:
Terraform → infra + cluster
One-time bootstrap (`kubectl apply`) → Argo CD
Argo CD → manages everything else in the cluster?
3. In my case, I plan to:
Install a base Argo CD
Then use Argo CD itself to install and manage the Argo CD Vault Plugin

Basically, I want to avoid tool overlap and follow what’s actually used in production today, not just what’s technically possible.
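To make question 2 concrete, here's the kind of one-time bootstrap I have in mind (a sketch; the manifest URL is the stable one from the Argo CD getting-started guide, and the function name is mine):

```shell
# One-time bootstrap after `terraform apply` has created the cluster.
ARGOCD_MANIFEST="https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml"

bootstrap_argocd() {
  # idempotent namespace creation
  kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
  kubectl apply -n argocd -f "$ARGOCD_MANIFEST"
  # from here on, an app-of-apps Application would hand control to Argo CD itself
}

# bootstrap_argocd   # run once, then stop touching the cluster by hand
```

After this, Argo CD syncing an app-of-apps repo (including the Vault Plugin) would be the only write path into the cluster.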

Would appreciate hearing how others are doing this in real setups.

---
Disclaimer:
Used AI to help write and format this post for grammar and readability.

https://redd.it/1qkn8vd
@r_devops
As an SWE, for your next greenfield project, would you choose Pulumi over OpenTofu/Terraform/Ansible for the infra part?

I'm curious about the long-term alive-ness and future-proofing of investing time into Pulumi. As someone currently looking at a fresh start, is it worth the pivot for a new project?

https://redd.it/1qkp531
@r_devops
ARM build server for hosting Gitlab runners

I'm in academia where we don't have the most sophisticated DevOps setup. Hope it's acceptable to ask a basic question here.

I want to deploy docker images from our Gitlab's CI/CD to ARM-based linux systems and am looking for a cost-efficient solution to do so. Using our x86 build server to build for ARM via QEMU wasn't a good solution - it takes forever and the result differ from native builds. So I'm looking to set up a small ARM server specific to this task.

A Mac Mini appears to be an inexpensive yet relatively powerful solution to me. Any reason why this would be a bad idea? Would love to hear opinions!

https://redd.it/1qkqbsw
@r_devops
59,000,000 People Watched at the Same Time: Here’s How This Company’s Backend Didn’t Go Down

During the Cricket World Cup, **Hotstar** (an Indian OTT platform) handled **\~59 million concurrent live streams**.

That number sounds fake until you think about what it really means:

* Millions of open TCP connections
* Sudden traffic spikes within seconds
* Kubernetes clusters scaling under pressure
* NAT Gateways, IP exhaustion, autoscaling limits
* One misconfiguration → total outage

I made a breakdown video explaining **how Hotstar’s backend survived this scale**, focusing on **real engineering problems**, not marketing slides.

Topics I covered:

* Kubernetes / EKS behavior during traffic bursts
* Why NAT Gateways and IPs become silent killers at scale
* Load balancing + horizontal autoscaling under live traffic
* Lessons applicable to any high-traffic system (not just OTT)

Netflix's Mike Tyson vs Jake Paul fight had 65 million concurrent viewers, and Jake Paul's iconic statement was "We crashed the site." So even a company like Netflix has a hard time handling big loads.

If you’ve ever worked on:

* High-traffic systems
* Live streaming
* Kubernetes at scale
* Incident response during peak load

You’ll probably enjoy this.

[https://www.youtube.com/watch?v=rgljdkngjpc](https://www.youtube.com/watch?v=rgljdkngjpc)

Happy to answer questions or go deeper into any part.

https://redd.it/1qksl00
@r_devops
Our enterprise cloud security budget is under scrutiny. We’re paying $250K for current CNAPP, Orca came in 40% cheaper. Would you consider switching?

Our CFO questioned our current CNAPP (Wiz) spend of $250K+ annually in the last cost review. I had to find ways to get it down. Got a quote from Orca that's 40% less for similar coverage.

For those who've evaluated both platforms: is the price gap justified for enterprise deployments? We're heavy on AWS/Azure with about 2K workloads. The current tool works, but the cost scrutiny is real.

Our main concerns are detection quality, false positive rates, and how well each integrates with our existing CI/CD pipeline. Any experiences would help.

https://redd.it/1qkwfrx
@r_devops
Incident management across teams is an absolute disaster

We have a decent setup for tracking our own infrastructure incidents, but when something affects multiple teams it becomes total chaos. When a major incident happens, we're literally updating three different places and nobody has a single source of truth. Post-mortems take forever because we're piecing together timelines from different tools. Our on-call rotation also doesn't sync well with who actually needs to respond. How are you successfully handling cross-functional incident tracking without creating more overhead?

https://redd.it/1qkzwlf
@r_devops
Advice: Failed SC

So I wanted to get some advice from anyone who's had this happen or been through anything similar.

For context, today I failed my required SC, which was a conditional part of the job offer.

Without divulging much info: it wasn't due to me or anything I did; it was just due to an association with someone (though I haven't spoken to them in years). So I was/am a bit blindsided by this, as I'm very likely to be terminated and left without a job.

Nothing has been fully confirmed yet, and my current lead/manager has expressed that he does not want to lose me and will try his best to keep me, but it's not fully his decision and termination has not been taken off the table.

Any advice/guidance?

https://redd.it/1ql1oim
@r_devops
Kubernetes IDE options

Hey everyone, I am currently using Lens as a k8s IDE, but it seems to consume too many resources. I want to change it, so I wonder what Kubernetes IDE you are using.

https://redd.it/1ql1ncy
@r_devops
I have tons of commits in my hands-on project just to verify the CI pipeline. How do professionals solve this problem?

I have a pipeline that tests my app and, if it passes, pushes the new image of the app to GitHub. GitHub Actions requires my secret key for a specific feature, and since I want to run the app in a Kubernetes StatefulSet, I deactivated the feature that requires the secret key. But for every change I make in my YAML files or in the webapp code, I have to push it to the GitHub repo so it triggers Actions; if the test step passes, it moves on to pushing the new image, and only then can my StatefulSet pull the latest image so I can see the change take effect.
So if I want to add a feature to my webapp, I have to think about running it locally, then about whether it will be a problem in GitHub Actions and in the StatefulSet.
I'm just too tired of this cycle. Is there any way to test my GitHub Actions before I push to the repo? How do you test your YAML files?

Here are my solutions:
1 - Instead of pulling the image from the repo, I can build the image locally and try it, but then I won't know whether it would pass the test step of the pipeline
2 - I can create a fork of the main repo and push as many commits as I want; when I merge it into main, it will look like 1 commit
3 - I found an app named "act" to run GitHub Actions locally, but it doesn't pull variables from the GitHub repo
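On option 3: as far as I know, act can't fetch secrets or variables from GitHub, but you can feed it local stand-ins. A sketch (the file name and secret name are made up; check `act --help` for the exact flags in your version):

```shell
# Local stand-ins for repo secrets; never commit this file.
cat > .secrets <<'EOF'
MY_SECRET_KEY=local-dev-value
EOF
echo ".secrets" >> .gitignore

# Then run the workflow locally instead of pushing:
#   act push -W .github/workflows/ci.yml --secret-file .secrets
# Individual values also work:
#   act push -s MY_SECRET_KEY=local-dev-value
```

That covers the "test before pushing" loop; the image-build and StatefulSet steps still need a real push, which is where option 2 (squash-merging a scratch branch) helps.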

https://redd.it/1ql4fq1
@r_devops
How are you actually handling observability in 2026? (Beyond the marketing fluff)

Every year observability gets pitched as simpler and basically solved. Unified platforms, clean dashboards, smarter alerts.

In reality, when something breaks it still feels messy.

I am curious how people are actually handling this in 2026. What does observability look like for you in practice right now?

https://redd.it/1qlj4h7
@r_devops
DevOps Vouchers Extension

Hi

I bought a DevOps foundation and SRE exam voucher from the DevOps institute back in 2022.
A few life events happened and I wasn't able to take the exams. I'd like to attempt them now.

The platform was Webassessor back then. Now I think it's PeopleCert.

I emailed their customer support, and the PeopleCert team replied stating they have no record of my purchase.

I can provide the receipt emails, voucher codes, and my email ID as proof of payment.

Has anyone encountered such an issue before, or does anyone know how to resolve it?

I would really appreciate it, because it's around $400 of hard-earned money.



https://redd.it/1qllumj
@r_devops
From DevOps Engineer to Consultant

Has anyone in Europe gone from a DevOps engineer role to working self-employed? How easy or difficult is it? Any tips on how to make the change?

https://redd.it/1qlmufo
@r_devops
curl killed their bug bounty because of AI slop. So what’s your org’s “rate limit” for human attention?

curl just shut down their bug bounty program because they were getting buried in low-quality AI “vuln reports.”

This feels like alert fatigue, but for security intake. If it’s basically free to generate noise, the humans become the bottleneck, everyone stops trusting the channel, and the one real report gets lost in the pile.

How are you handling this in your org? Security side or ops side. Any filters/gating that actually work?

Source: https://github.com/curl/curl/pull/20312

https://redd.it/1qlqgnt
@r_devops