Reddit DevOps – Telegram
Needs genuine suggestions!!

I passed my AWS Solutions Architect Associate (SAA) exam last week after preparing for two months.

A bit about me and what I've been doing and learning while preparing for the AWS SAA:

- Working knowledge of Linux

- Python: not a pro, but I understand the basics and can read/write scripts

- Built a small AWS cloud project focused on automation, plus some basic Python projects

- Basics of Jenkins

- Not currently working, but I do have 1+ year of experience as an L1 Compute Engineer at a well-known company that works with servers

Right now I’m a bit confused about the next steps.

- What should I be focusing on next to break into a cloud role?

- Should I go deeper into AWS (projects, services), improve my Python, or start learning DevOps tools like Docker/Terraform? What should my immediate focus be?

- And most importantly: should I start applying for cloud roles now (cloud support and the like), or wait until I skill up more?

Any advice, roadmap suggestions, or personal experiences would really help.

https://redd.it/1qjw8vc
@r_devops
DevOps conference

Hello! Genuinely curious: are you guys tired of seeing the Star Wars theme at industry conferences?

I work for a major tech software company in the QA space, and I'm thinking of switching the theme of our swag and booth. Can anyone suggest themes that would actually draw interest and feel a bit more novel? What would you like to get when it comes to swag? What kind of theme would stand out and catch your attention?

I’m pondering retro games, or games as a whole (Nintendo and the like), or maybe even board games or fairground games.

Thank you in advance!

https://redd.it/1qjvjp9
@r_devops
Built a skill for Opsy that answers "WTF is costing me money on AWS?"

I've been running a few side projects on AWS and got tired of the monthly ritual of opening Cost Explorer, seeing random charges, and thinking "wtf is this?"

So I built aws-wtf, a skill for Opsy (a CLI DevOps agent) that:

1. Pulls your cost breakdown via the Cost Explorer API
2. Maps charges to actual resources, so no more guessing what eipalloc-07fa453a5acbb5651 is
3. Exports everything to CSV with resource names, ARNs, regions, and human-readable explanations
4. Identifies cost offsets like credits and free tier

Example output:

|Resource|Category|Charge|Monthly Cost|
|:-|:-|:-|:-|
|my-app-backend|Container|ECS Fargate vCPU (0.5 vCPU)|$18.51|
|my-app-prod|Networking|Application Load Balancer hourly|$16.42|
|my-app-prod|Database|RDS db.t3.micro PostgreSQL|$12.82|



Run it monthly before your bill arrives, or when onboarding to a new account to understand what's running.
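For the curious, step 1 boils down to a single Cost Explorer call. Here's a hedged sketch, not the skill's actual code: the `get_cost_and_usage` call is the real boto3 API, but the grouping-by-SERVICE choice and the `service_costs` helper are illustrative.

```python
def service_costs(resp):
    """Flatten a get_cost_and_usage response into {service: dollars}."""
    totals = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            name = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[name] = totals.get(name, 0.0) + amount
    return totals

def fetch_monthly_costs(start, end):
    """Pull one month of costs grouped by service, e.g. fetch_monthly_costs("2025-01-01", "2025-02-01")."""
    import boto3  # third-party; needs AWS credentials configured
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return service_costs(resp)
```

Mapping those service-level totals down to individual resources (step 2) is the harder part, since Cost Explorer's resource-level dimension has limited lookback.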

Link: https://github.com/opsyhq/opsy/tree/main/skills/aws-wtf

Would love feedback. What other AWS mysteries would be useful to decode?

https://redd.it/1qjzpbl
@r_devops
What we actually alert on vs what we just log after years of alert fatigue

Spent the last few weeks documenting our monitoring setup and realized the most important thing isn't the tools. It's knowing what deserves a page, what should just be a Slack message, and what should just be logged.

Our rule is simple. Alert on symptoms, not causes. High CPU doesn't always mean a problem. Users getting 5xx errors is always a problem.

We break it into three tiers. Page someone when users are affected right now. Slack notification when something needs attention today like a cert expiring in 14 days. Just log it when it's interesting but not urgent.
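To make the symptom-vs-cause split concrete, here's roughly what it might look like as Prometheus alerting rules. This assumes a Prometheus/Alertmanager stack, which the post doesn't specify; the metric names depend entirely on your instrumentation, and the `severity` labels would map to page/Slack routing in Alertmanager.

```yaml
groups:
  - name: symptom-alerts
    rules:
      # Page: users are affected right now (symptom, not cause)
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
      # Slack: needs attention today, nobody gets woken up
      - alert: CertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: slack
```

Note there is deliberately no rule for high CPU on its own; per the rule above, that would just be logged or graphed.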

The other thing that took us years to learn is that if an alert fires and we consistently do nothing about it, we delete the alert. Alert fatigue is real and it makes you ignore the alerts that actually matter.

Wrote up the full 10-layer framework we use covering everything from infrastructure to log patterns. Each layer exists because we missed it once and got burned.

https://tasrieit.com/blog/10-layer-monitoring-framework-production-kubernetes-2026

What's your approach to deciding what gets a page vs a notification?

https://redd.it/1qk1qsn
@r_devops
Questions when hiring Juniors

Hey guys,

I am going to hire 2 juniors for the team and was wondering what kind of questions you all ask. I am more into vetting their mindset, as experience, even though preferred, is not required. I am also more interested in someone who transitioned from development, especially backend, rather than from sysadmin. Not sure if I am being fair or not, but instead of supporters I am looking for engineers. How do you guys approach this?

Thanks

EDIT: Thanks a lot for the answers. I see that I am thinking the same way as most of you guys. The post may have been misleading, but I am also more interested in their mindset, curiosity, etc. I am not trying to be harsh towards juniors or anything, I am just a mid who is forced to be lead lol

https://redd.it/1qjz4t0
@r_devops
What’s the worst production outage you’ve seen caused by env/config issues?

I’ve seen multiple production issues caused by environment variables:

- missing keys

- wrong formats

- prod using dev values

- CI passing but prod breaking at runtime

In one case, everything looked green until deployment.

How do teams here actually prevent env/config-related failures?

Do you validate configs in CI, or rely on conventions and docs?
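To make the question concrete, a fail-fast validation step in CI might look something like this minimal sketch. The variable names and format patterns below are invented for illustration, not from any particular team's setup.

```python
import re

# Hypothetical spec: each required variable with a regex its value must match.
REQUIRED = {
    "DATABASE_URL": r"^postgres(ql)?://",    # wrong-format guard
    "API_KEY": r"^\S{20,}$",                 # missing/short key guard
    "ENVIRONMENT": r"^(dev|staging|prod)$",  # prod-using-dev-values guard
}

def validate_env(env):
    """Return a list of problems; an empty list means the env passes."""
    problems = []
    for name, pattern in REQUIRED.items():
        value = env.get(name)
        if value is None:
            problems.append(f"missing: {name}")
        elif not re.match(pattern, value):
            problems.append(f"bad format: {name}")
    return problems

# In CI: run validate_env(os.environ) against the same env you deploy with,
# and fail the build if it returns any problems.
```

The idea is that "CI passing but prod breaking at runtime" usually means the check ran against a different environment than the one that shipped, so the same validator should run again at process startup.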



https://redd.it/1qk4zol
@r_devops
Resume review request (7+ YOE, Staff Platform Engineering)

This is my current resume : https://imgur.com/a/H9ztGeD

I've recently been laid off due to company wide restructuring.

I took a break and have started rewriting my resume to target Platform Engineering / DevEx roles.

Is there anything that screams red flags on my resume? (I definitely want to rewrite the service discovery bullet point; it comes across as low-impact BS compared to the actual work done, and I want to be concise to keep it to one page.)

I have been getting interview calls and recruiters reaching out, but most of them fall far below my comp range (ideally $200k+ and remote as a baseline, which as it stands is still a sizable pay cut from my previous role). I've restarted the leetcode grind (which hopefully I won't need to grind hard at for serious Platform/DevEx roles) for some of the FAANG-tier postings, but I don't think I'll apply to them for a few more weeks.

Edit: Definitely need to fix grammar in quite a few places

https://redd.it/1qk5b9i
@r_devops
Server setup - suggestions

We have a beefy server with two 64-core AMD EPYC CPUs, 1+ TB RAM, multiple Nvidia data center GPUs, etc.

The plan is to use it to train AI models on images, videos, lidar data, etc., and maybe host some LLMs as well, with more GPUs and/or servers possible later.

Currently, I've started the setup with Proxmox, configuring everything with Ansible so it all lives in a git repo. The plan is to run Kubernetes on Talos VMs and use Kubeflow Pipelines to schedule GPUs where required (with Nvidia MIG if needed) and to run ML pipelines easily.

Is this a bad way of doing this?
Any recommendations for this kind of use case?


https://redd.it/1qka15z
@r_devops
Stop trusting your Terraform State file. It’s lying to you.

I've been in a bit of a debate with my platform team this week and wanted to sanity check this with you guys.

We’re doing a massive migration for a Sovereign Cloud environment, so compliance is tight. During the audit, I realized something that scared the hell out of me: we treat our Terraform State file like it's the gospel truth. But it's not. It's just a cached memory of what infrastructure used to look like.

The moment a Junior Admin hotfixes a Security Group in the AWS Console at 2 AM because "prod is down," that State file is technically corrupt. It doesn't match Reality (the Cloud API) anymore.

Most of our pipelines were just running terraform plan, which compares Git vs State. It assumes State is accurate. It completely ignores the fact that someone might have clicked around in the console three days ago.

So, I forced a change: The Hard Drift Gate.

We added terraform plan -refresh-only -detailed-exitcode before the regular plan. If it returns Exit Code 2 (Drift Detected), the pipeline dies. Hard stop. No deploying new code until you acknowledge or import the manual changes.
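A minimal sketch of that gate in bash. The terraform flags are as described above; the `gate_on_drift` helper name and the pipeline wiring are illustrative, not the post's exact script.

```shell
# Interpret the exit code of `terraform plan -refresh-only -detailed-exitcode`:
# 0 = state matches reality, 1 = plan error, 2 = drift detected.
gate_on_drift() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift"; return 1 ;;   # hard stop: reconcile or import first
    *) echo "error"; return "$1" ;;
  esac
}

# Assumed pipeline wiring: run the refresh-only plan, then gate on it.
#   terraform plan -refresh-only -detailed-exitcode -input=false
#   gate_on_drift "$?" || { echo "Drift gate tripped" >&2; exit 1; }
```

Note that plain `terraform plan` also returns exit code 2 for pending changes when `-detailed-exitcode` is set, which is why the refresh-only pass has to run as a separate, earlier step.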

The pushback has been real. Half my team hates it. They say it kills velocity because they can't just "blast out a fix" if there's existing drift from a previous hotfix. They have to clean up the mess first.

My argument: Deploying on top of unknown manual changes isn't "velocity," it's negligence. Especially when a manual change might have exposed a private bucket to the public internet, and your standard apply might just silently overwrite it (or worse, ignore it).

I wrote up the exact bash logic we used to trap the exit codes and how we filter out "noise" vs "actual risks" (like data residency violations). I pinned the full write-up to my profile if anyone wants to steal the script, to avoid spamming the sub.

Am I being too strict here? How do you guys handle the "ClickOps" gap? Do you block the pipeline, or just let Terraform bulldoze over the manual changes and hope for the best?

https://redd.it/1qk60ll
@r_devops
PM question: what to do when automation becomes just another project?

I sit between product and QA, and lately automation is feeling like a whole project all on its own.

Manual regression is slow and frustrating, but every time we try to automate more it comes with a load of headaches: months of setup, new tools to learn, and only one or two people on the team actually knowing how it works.

It's making automation hard to justify when timelines are already tight.

For teams that actually made the transition to automated testing: what made it click?

Trying to figure it out before we invest more time into this.

https://redd.it/1qk2h48
@r_devops
Have you used adviser.sh - if so, what were your experiences?

Someone told me about this at work today. Looking at the blurb for https://github.com/adviserlabs/docs/tree/main, it seems to promise a way to run large-scale compute and data workflows without having to know how infrastructure, cloud configuration, or orchestration details work...

I’m generally skeptical of “magic” abstractions, but I’ve spent a fair amount of time dealing with HPC clusters, cloud schedulers, and workflow tooling, so I can see how something like this could be useful.

What’s it actually like in practice?

https://redd.it/1qke9uz
@r_devops
How do you version independent Reusable Workflows in a single repo?

I'm trying to set up a centralized repository for my organization's GitHub Actions Reusable Workflows. I want to use Release Please to automate semantic versioning and changelog generation.

The problem:

I have multiple workflows that serve different purposes (e.g., ci.yml, deploy-aws.yml). Ideally, I want to version them independently (monorepo style) so a breaking change in "Deploy" doesn't force a major version bump for "CI".

However, I'm hitting a wall:

1. GitHub requires all reusable workflows to reside in .github/workflows/ (a flat file structure).

2. Release Please (and most semantic release tools) relies on folder separation to detect independent packages and manage separate versions.

Because all the YAML files sit in one folder, the tooling treats the repo as a single package.

How do other organizations manage this? I'd guess shared workflows are pretty common, so I can't be the first to hit this wall.

https://redd.it/1qk63vg
@r_devops
I built an open source AI agent for incident response

I worked on database infra at a big company and spent a lot of time on call. We had a ton of alerts and dashboards, and I hated jumping between a million tabs just to understand what was going on.

So I built an open source AI agent to help with that.

It runs alongside an incident and:

- reads alerts, logs, metrics, and Slack
- keeps a running summary of what’s happening
- tracks what’s been tried and what hasn’t
- suggests mitigations (like rolling back a deploy or drafting a fix PR), but a human has to approve anything before it runs

I used earlier versions during real incidents and it was useful enough that I kept working on it. This is the first open source release.

Repo: https://github.com/incidentfox/incidentfox
README has setup instructions and a demo you can run locally.

https://redd.it/1qkjqqf
@r_devops
Terraform AWS Infrastructure Framework (Multi-Env, Name-Based, Scales by Config)

🚀 Excited to share my latest open-source project: a Terraform framework for AWS focused on multi-environment infrastructure management.

After building and refining patterns across multiple environments, I open-sourced a framework that helps teams keep deployments consistent across dev / qe / prod.

The problem:
Managing AWS infra across dev / qe / prod usually leads to:
- Configuration drift between environments
- Hardcoded resource IDs everywhere
- Repetitive boilerplate when adding “one more” resource
- Complex dependency management across modules

The solution:
A workspace-based framework with automation:

- Automatic resource linking — reference resources by name, not IDs. The framework resolves and injects IDs automatically across modules.
- DRY architecture — one codebase for dev / qe / prod using Terraform workspaces.
- Scale by configuration, not code — create unlimited resources WITHOUT re-calling modules. Just add entries in a .tfvars file using plain-English names (e.g., “prod_vpc”, “private_subnet_az1”, “eks_cluster_sg”).

What’s included:
- VPC networking (multi-AZ, public/private subnets)
- Internet gateway, NAT gateway, route tables, EIPs
- Security groups + SG-to-SG references
- VPC endpoints (Gateway & Interface)
- EKS cluster + managed node groups

Real example:

# terraform.tfvars (add more entries, no new module blocks)
eks_clusters = {
  prod = {
    my_cluster = {
      cluster_version = "1.34"
      vpc_name        = "prod_vpc"               # name, not ID
      subnet_name     = ["pri_sub1", "pri_sub2"] # names, not IDs
      sg_name         = ["eks_cluster_sg"]       # name, not ID
    }
  }
}
# Framework injects vpc_id, subnet_ids, sg_ids automatically

GitHub:
https://github.com/rajarshigit2441139/terraform-aws-infrastructure-framework

Looking for:
- Feedback from the community
- Contributors interested in IaC patterns
- Teams standardizing AWS deployments

Question:
What are your biggest challenges with multi-environment Terraform? How do you handle cross-module references today?


https://redd.it/1qkjko7
@r_devops
Shall we introduce a rule against AI-generated content?

We’ve been seeing an increase in AI generated content, especially from new accounts.

We’re considering adding a Low-effort / Low-quality rule that would include AI-generated posts.

We want your input before making changes. Please share your thoughts below.

https://redd.it/1qkliqo
@r_devops
Is specialising in GCP good for my career or should I move?

Hey,


Looking for advice.

I have spent nearly 5 years at my current devops job because it's ideal for me in terms of team chemistry, learning, and WLB. The only "issue" is that we use Google Cloud, which I like using, but I'm not sure if that matters.

I know AWS is the dominant cloud provider. Am I sabotaging my career development by staying longer at this place? Obviously you can say cloud skills transfer over, but loads of job descriptions say "2/3/4+ years experience in AWS/Azure", so there are a lot of roles I might just be screened out of.

Everyone is different, but I wondered what other people's opinions would be on this. I would probably have to move at a similar mid or junior level. Should I move just to improve career prospects? Could I still get hired for other cloud roles with extensive GCP experience if I showed I could learn?


Also want to add that I have already built personal projects in AWS, but I feel they only have value up to a certain point. Employers want production management and org-level administration experience, of which I have very little.

https://redd.it/1qkm8w5
@r_devops
When to use Ansible vs Terraform, and where does Argo CD fit?

I’m trying to clearly understand where Ansible, Terraform, and Argo CD fit in a modern Kubernetes/GitOps setup, and I’d like to sanity-check my understanding with the community.

From what I understand so far:

- Terraform is used for infrastructure provisioning (VMs, networks, cloud resources, managed K8s, etc.)
- Ansible is used for server configuration (OS packages, files, services), usually before or outside Kubernetes

This part makes sense to me.

Where I get confused is Argo CD.

Let’s say:

- A Kubernetes cluster (EKS / k3s / etc.) is created using Terraform
- Now I want to install Argo CD on that cluster

Questions:

1. What is the industry-standard way to install Argo CD?
   - Terraform Kubernetes provider?
   - Ansible?
   - Or just a simple `kubectl apply` / bash script?
2. Is the common pattern:
   - Terraform → infra + cluster
   - One-time bootstrap (`kubectl apply`) → Argo CD
   - Argo CD → manages everything else in the cluster?
3. In my case, I plan to:
   - Install a base Argo CD
   - Then use Argo CD itself to install and manage the Argo CD Vault Plugin

Basically, I want to avoid tool overlap and follow what’s actually used in production today, not just what’s technically possible.
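For reference, the one-time bootstrap in question 2 is commonly just Argo CD's upstream install manifest. A hedged sketch follows: the `bootstrap_argocd` function name is mine, and in practice you'd pin a release tag instead of `stable`.

```shell
# One-time Argo CD bootstrap against an existing cluster (requires kubectl
# access). After this, Argo CD itself can manage everything else via GitOps,
# including, as in question 3, its own plugins.
bootstrap_argocd() {
  local version="${1:-stable}"  # pin e.g. v2.12.3 in real use
  kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
  kubectl apply -n argocd \
    -f "https://raw.githubusercontent.com/argoproj/argo-cd/${version}/manifests/install.yaml"
}

# usage: bootstrap_argocd          # or: bootstrap_argocd v2.12.3
```

The `--dry-run=client | kubectl apply` idiom just makes the namespace creation idempotent, so the script can be rerun safely.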

Would appreciate hearing how others are doing this in real setups.

---
Disclaimer:
Used AI to help write and format this post for grammar and readability.

https://redd.it/1qkn8vd
@r_devops
As an SWE, for your next greenfield project, would you choose Pulumi over OpenTofu/Terraform/Ansible for the infra part?

I'm curious about the long-term viability and future-proofing of investing time into Pulumi. As someone currently looking at a fresh start, is it worth the pivot for a new project?

https://redd.it/1qkp531
@r_devops
ARM build server for hosting Gitlab runners

I'm in academia where we don't have the most sophisticated DevOps setup. Hope it's acceptable to ask a basic question here.

I want to deploy Docker images from our GitLab CI/CD to ARM-based Linux systems and am looking for a cost-efficient way to do so. Using our x86 build server to build for ARM via QEMU wasn't a good solution: it takes forever, and the results differ from native builds. So I'm looking to set up a small ARM server dedicated to this task.

A Mac Mini appears to be an inexpensive yet relatively powerful solution to me. Any reason why this would be a bad idea? Would love to hear opinions!

https://redd.it/1qkqbsw
@r_devops
59,000,000 People Watched at the Same Time: Here’s How This Company’s Backend Didn’t Go Down

During the Cricket World Cup, **Hotstar** (an Indian OTT platform) handled **~59 million concurrent live streams**.

That number sounds fake until you think about what it really means:

* Millions of open TCP connections
* Sudden traffic spikes within seconds
* Kubernetes clusters scaling under pressure
* NAT Gateways, IP exhaustion, autoscaling limits
* One misconfiguration → total outage

I made a breakdown video explaining **how Hotstar’s backend survived this scale**, focusing on **real engineering problems**, not marketing slides.

Topics I covered:

* Kubernetes / EKS behavior during traffic bursts
* Why NAT Gateways and IPs become silent killers at scale
* Load balancing + horizontal autoscaling under live traffic
* Lessons applicable to any high-traffic system (not just OTT)

Netflix's Mike Tyson vs Jake Paul fight drew 65 million concurrent viewers, and Jake Paul's iconic statement was "We crashed the site". So even a company like Netflix has a hard time handling big loads.

If you’ve ever worked on:

* High-traffic systems
* Live streaming
* Kubernetes at scale
* Incident response during peak load

You’ll probably enjoy this.

[https://www.youtube.com/watch?v=rgljdkngjpc](https://www.youtube.com/watch?v=rgljdkngjpc)

Happy to answer questions or go deeper into any part.

https://redd.it/1qksl00
@r_devops
Our enterprise cloud security budget is under scrutiny. We’re paying $250K for current CNAPP, Orca came in 40% cheaper. Would you consider switching?

Our CFO questioned our current CNAPP (Wiz) spend of $250K+ annually in the last cost review, and I had to find ways to get it down. I got a quote from Orca that's 40% less for similar coverage.

For those who've evaluated both platforms: is the price gap justified for enterprise deployments? We're heavy on AWS/Azure with about 2K workloads. The current tool works, but the cost scrutiny is real.

Our main concerns are detection quality, false positive rates, and how well each integrates with our existing CI/CD pipeline. Any experiences would help.

https://redd.it/1qkwfrx
@r_devops