Reddit DevOps – Telegram
How would you define proactive AWS Hygiene and Ownership process

We currently lack a standardized way to track ownership, lifespan, and relevance of AWS resources, especially in non-prod accounts. This leads to unused resources, unnecessary cost, and ambiguity during alerts or incidents. We need a proactive process to keep AWS environments clean and accountable.

I'll share some thoughts of my own, but first I want to ask others here: how would you define such a process? What steps would work well? What requirements do you think we as DevOps need here?
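
As one possible starting point for the ownership and lifespan tracking described above, here is a minimal sketch of a tag audit. The tag keys `owner` and `expires-on`, the function name, and the sample resource IDs are all assumptions for illustration, not an established convention; in practice the tags would come from the AWS APIs rather than a hard-coded dict.

```python
from datetime import date

# Hypothetical tag keys; use whatever convention your organization agrees on.
REQUIRED_TAGS = ("owner", "expires-on")


def audit_resource(tags: dict[str, str], today: date) -> list[str]:
    """Return a list of hygiene findings for one resource's tags."""
    # Flag every required tag that is absent or empty.
    findings = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    expires = tags.get("expires-on")
    if expires:
        try:
            # Expect ISO dates such as 2025-12-31.
            if date.fromisoformat(expires) < today:
                findings.append(f"expired on {expires}")
        except ValueError:
            findings.append(f"unparseable expires-on: {expires!r}")
    return findings


# Made-up sample data standing in for real resources.
resources = {
    "i-0abc": {"owner": "team-payments", "expires-on": "2030-01-01"},
    "vol-0def": {"owner": "alice"},
    "i-0ghi": {"owner": "bob", "expires-on": "2020-06-30"},
}
report = {rid: audit_resource(tags, date(2025, 1, 1)) for rid, tags in resources.items()}
for rid, findings in report.items():
    print(rid, findings or "ok")
```

Running something like this on a schedule and routing findings to the tagged owner would cover the "proactive" part; untagged resources have no owner to notify, so they need a default escalation path.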

https://redd.it/1pzlj8c
@r_devops
Holiday hack: EKS with your own machines

Hey folks, I’m hacking on a side project over the holidays and would love a sanity check from folks running EKS at scale.

Problem: EKS/EC2 is still a big chunk of my AWS bills even after the “usual” optimizations. I’m exploring a way to cut EKS costs even further without dropping EKS and rewriting everything from scratch.

Most advice (and what I’ve done before) clusters around:

- Spot + smart autoscaling (Karpenter, consolidation, mixed instance types)
- Rightsizing requests/limits, bin packing, node shapes, and deleting idle workloads
- Graviton/ARM where possible
- Reduce cross-AZ spend (or even go single AZ if you can)
- FinOps visibility (Kubecost, etc.) to find the real culprits (eg, unallocated requests)
- “Kubernetes tax” avoidance: move some workloads to ECS/Fargate when you can

But even after doing all this, EC2 is just… expensive.

So I'm playing around with a hybrid EKS cluster:

- Keep the managed EKS control plane in AWS
- Run worker nodes on much cheaper compute outside AWS (e.g. bare metal servers on Hetzner)
- Burst to EC2 for spikes using labels/taints + Karpenter on the AWS node pools
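
A rough sketch of how the labels/taints part of that setup could look from the workload side: pods prefer the bare-metal nodes and tolerate a taint on the EC2 burst pool. The label key `example.com/pool`, the taint `example.com/burst=ec2:NoSchedule`, and the function name are hypothetical; only the Kubernetes field names are real.

```python
import json


def burstable_pod_spec(image: str) -> dict:
    """Build a pod spec that prefers bare-metal nodes but may land on EC2."""
    return {
        "containers": [{"name": "app", "image": image}],
        "affinity": {
            "nodeAffinity": {
                # Soft preference: schedule on bare metal when capacity exists.
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": 100,
                        "preference": {
                            "matchExpressions": [
                                {
                                    "key": "example.com/pool",  # hypothetical label
                                    "operator": "In",
                                    "values": ["baremetal"],
                                }
                            ]
                        },
                    }
                ]
            }
        },
        # Allow scheduling onto the tainted EC2 burst pool.
        "tolerations": [
            {
                "key": "example.com/burst",  # hypothetical taint key
                "operator": "Equal",
                "value": "ec2",
                "effect": "NoSchedule",
            }
        ],
    }


spec = burstable_pod_spec("nginx:1.27")
print(json.dumps(spec, indent=2))
```

Workloads that must never leave AWS (or never leave bare metal) would use a required affinity instead of a preferred one.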

AWS now offers “EKS Hybrid Nodes” for this, but the pricing is even more expensive than EC2 itself (why?), so I’m experimenting with a hybrid setup without that managed layer.

Questions for the crowd:

- Would you ever run production workloads on off-AWS worker nodes while keeping EKS control plane in AWS? Why/why not?
- What’s the biggest deal-breaker: networking latency, security boundaries, ops overhead, supportability, something else?

If this resonates, I’m happy to share more details (or a small writeup) once I’ve cleaned it up a bit.

https://redd.it/1pzom7p
@r_devops
I made a CLI game to learn Kubernetes by fixing broken clusters (50 levels, runs locally on kind)

Hey,


I built this thing called K8sQuest because I was tired of paying for cloud sandboxes and wanted to practice debugging broken clusters.


## What it is


It's basically a game that intentionally breaks things in your local kind cluster and makes you fix them. 50 levels total, going from "why is this pod crashing" to "here's 9 broken things in a production scenario, good luck."


Runs entirely on Docker Desktop with kind. No cloud costs.


## How it works


1. Run ./play.sh - game starts, breaks something in k8s
2. Open another terminal and debug with kubectl
3. Fix it however you want
4. Run validate in the game to check
5. Get a debrief explaining what was wrong and why


The game has hints, progress tracking, and step-by-step guides if you get stuck.


## What you'll debug


- World 1: CrashLoopBackOff, ImagePullBackOff, pending pods, labels, ports
- World 2: Deployments, HPA, liveness/readiness probes, rollbacks
- World 3: Services, DNS, Ingress, NetworkPolicies
- World 4: PVs, PVCs, StatefulSets, ConfigMaps, Secrets
- World 5: RBAC, SecurityContext, node scheduling, resource quotas


Level 50 is intentionally chaotic - multiple failures at once.


## Install


    git clone https://github.com/Aryan4266/k8squest.git
    cd k8squest
    ./install.sh
    ./play.sh



Needs: Docker Desktop, kubectl, kind, python3


## Why I made this


Reading docs didn't really stick for me. I learn better when things are broken and I have to figure out why. This simulates the actual debugging you do in prod, but locally and with hints.


Also has safety guards so you can't accidentally nuke your whole cluster (learned that the hard way).


Feedback welcome. If it helps you learn, cool. If you find bugs or have ideas for more levels, let me know.


GitHub: https://github.com/Aryan4266/k8squest

https://redd.it/1pzr4jh
@r_devops
Docker's hardened images, just Bitnami panic marketing or useful?

Our team's been burned by vendor rug pulls before. Docker drops these hardened images right after Bitnami licensing drama. Feels suspicious.

Limited to Alpine/Debian only, CVE scanning still inconsistent between tools, and suppressed vulns worry me.

Anyone moving prod workloads to these? What's your take?

https://redd.it/1pzrz1p
@r_devops
How do you integrate identity verification into CI/CD without slowing pipelines?

Hey folks, DevOps teams need identity verification that plugs straight into pipelines without blocking deployments or creating security gaps. Most solutions either slow everything down or leave staging environments exposed. We're looking for clean API handoffs that deliver reliable signals at real scale.

Does anyone know what works seamlessly for CI/CD flows?

https://redd.it/1pzuoy1
@r_devops
I got tired of the GitHub runner scare, so I moved my CI/CD to a self-hosted Gitea runner.

With the recent uncertainty around GitHub runner pricing and data privacy, I finally moved my personal projects to a self-hosted Gitea instance running on Docker.

The biggest finding: Gitea Actions is compatible with existing GitHub Actions .yaml files. I didn't have to rewrite my pipelines; I just spun up a local runner container, pointed it to my Gitea instance, and the existing scripts worked immediately.

It’s now running on my home server (Portainer) with $0 cost, zero cold-starts, and total data privacy.

Full walkthrough of the docker-compose setup and runner registration: https://youtu.be/-tCRlfaOMjM

Is anyone else running Gitea Actions for actual production workloads yet? Curious how it scales.

https://redd.it/1pzvjv0
@r_devops
How do u know a CloudFormation CHANGE won’t break something subtle?

You change one resource.
The stack deploys successfully.
Nothing errors.

But something downstream breaks.

How do you catch that before deploy?
Or do you just accept the risk?

Curious how people think about this in practice.
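
One partial check before deploying is to create a change set and look for replacements and removals, since those are the changes most likely to affect things downstream. The sketch below filters the `Changes` list in the shape returned by CloudFormation's DescribeChangeSet; the function name and the sample data are made up, and this will not catch behavioral breakage that CloudFormation itself cannot see.

```python
def risky_changes(changes: list[dict]) -> list[str]:
    """Flag resources that a change set would remove or (possibly) replace."""
    flagged = []
    for change in changes:
        rc = change.get("ResourceChange", {})
        action = rc.get("Action")
        # Replacement is "True", "False", or "Conditional" for Modify actions.
        replacement = rc.get("Replacement", "False")
        if action == "Remove" or replacement in ("True", "Conditional"):
            flagged.append(
                f"{rc.get('LogicalResourceId')} ({rc.get('ResourceType')}): "
                f"action={action}, replacement={replacement}"
            )
    return flagged


# Made-up sample in the DescribeChangeSet response shape.
sample_changes = [
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "AppQueue",
        "ResourceType": "AWS::SQS::Queue", "Replacement": "True"}},
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "AppRole",
        "ResourceType": "AWS::IAM::Role", "Replacement": "False"}},
    {"Type": "Resource", "ResourceChange": {
        "Action": "Remove", "LogicalResourceId": "OldBucket",
        "ResourceType": "AWS::S3::Bucket"}},
]
flagged = risky_changes(sample_changes)
for line in flagged:
    print(line)
```

A replaced resource gets a new physical ID, so anything that references the old ARN or name outside the stack is a candidate for the subtle downstream break.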


https://redd.it/1pzu7dl
@r_devops
Does anyone here use rapidapi? Having issues making a payment

I'm trying to add my card to purchase a subscription, but my card keeps getting declined. I then tried Klarna as a pay-later option and that was declined too. Then I used Affirm: the loan was charged, but the payment was blocked by RapidAPI. The only explanation I can think of is that I was making API calls from my laptop over a hotspot, so maybe RapidAPI treated this as a proxy and decided to block me from making payments?

https://redd.it/1q00zkz
@r_devops
I built a browser extension for managing multiple AWS accounts

I wanted to share this browser extension I built a few days ago. I built it to solve my own problem while working with different clients’ AWS environments. My password manager was not very helpful, as it struggled to keep credentials organized in one place and quickly became messy.

So I decided to build a solution for myself, and I thought I would share it here in case others are dealing with a similar issue.

The extension is very simple and does the following:

- Stores AWS accounts with nicknames and color coding
- Displays a colored banner in the AWS console to identify the current account
- Supports one-click account switching
- Provides keyboard shortcuts (Cmd or Ctrl + Shift + 1 to 5) for frequently used accounts
- Allows importing accounts from CSV or `~/.aws/config`
- Groups accounts by project or client
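
To illustrate what importing from `~/.aws/config` involves, here is a minimal sketch of reading profiles from that INI-style file. It is not the extension's actual code; the function name, the output fields, and the sample profile are assumptions.

```python
import configparser


def load_profiles(config_text: str) -> list[dict[str, str]]:
    """Extract profile name, region, and SSO account ID from AWS config text."""
    # Disable interpolation so '%' characters in values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    profiles = []
    for section in parser.sections():
        if section == "default":
            name = "default"
        elif section.startswith("profile "):
            name = section[len("profile "):].strip()
        else:
            continue  # skip non-profile sections such as [sso-session ...]
        profiles.append({
            "name": name,
            "region": parser.get(section, "region", fallback=""),
            "account_id": parser.get(section, "sso_account_id", fallback=""),
        })
    return profiles


# Made-up config content for illustration.
sample = """\
[default]
region = eu-west-1

[profile client-a]
sso_account_id = 111122223333
region = us-east-1

[sso-session corp]
sso_region = us-east-1
"""
profiles = load_profiles(sample)
for p in profiles:
    print(p)
```

Profiles that assume a role carry the account ID inside `role_arn` rather than `sso_account_id`, so a real importer would need to handle both.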

I have currently published it on the Firefox Store:
https://addons.mozilla.org/en-US/firefox/addon/aws-omniconsole/

The source code is also available on GitHub:
https://github.com/mraza007/aws-omni

https://redd.it/1q02rc4
@r_devops
Is it just me or are some KodKloud course materials AI-generated?

Been using KodeKloud for a while now — love the hands-on labs and sandbox environments, they're genuinely useful for practical learning.

But I've started noticing some of the written course content has all the hallmarks of AI-generated text:

- Forced analogies every other paragraph ("think of it like a VIP list...")
- Formulaic transitions ("First things first," "Next up," "Time for a test run")
- Repeated phrases/typos that suggest no human reviewed it ("violations and violations," "real-world world scenario")
- Generic safety disclaimers at the end

Combined with other production issues I've noticed — choppy video edits, inconsistent audio quality, pixelated graphics, cropped screenshots cutting off text — it feels like they're prioritizing quantity over quality.

Anyone else noticing this? For what we pay, I'd expect better QA on the content. The practical stuff is solid but the courseware itself feels rushed.

EDIT: Typo in the title, oops, KodeKloud.

https://redd.it/1q04riy
@r_devops
Artifactory nginx replacement

I am hosting Artifactory on EKS with the nginx ingress controller for URL rewrites. Since the nginx ingress controller will be retired, what should I use instead? My first thought is to use ALB because it now supports URL rewrites. Any other options?

Please let me know your opinions and experience.

Thank you.

https://redd.it/1q071hx
@r_devops
Stuck on the Java 8 / Spring Boot 2 upgrade. Do you need a "Map" or a "Driver"?

We are currently debating how to handle a massive legacy migration (Java/Spring) that has been postponed for years. The team is paralyzed because nobody knows the blast radius or the exact effort involved.

We are trying to validate what would actually unblock teams in this situation.

The Hypothetical Solution:
Imagine a "Risk Intelligence Service" where you grant read-access to the repo, and you get back a comprehensive Upgrade Strategy Report.
It identifies exactly what breaks, where the test gaps are, and provides a step-by-step migration plan (e.g., "Fix these 3 libs first, then upgrade module X").

My question to Engineering Managers / Tech Leads:
If you had budget ($3k-$10k range) to solve this headache, which option would you actually buy?
- Option A (The Map): "Just give us the deep-dive analysis and the plan. We have the devs, we just need to know exactly what to do so we don't waste weeks on research."
- Option B (The Driver): "I don't want a report. I want you to come in, do the grunt work (refactoring/upgrading), and hand me a clean PR."
- Option C (Status Quo): "We wouldn't pay for either. We just accept the pain and do it manually in-house."

Trying to figure out if the bottleneck is knowledge (risk assessment) or capacity (doing the work).

https://redd.it/1q07zt7
@r_devops
The cognitive overhead of cloud infra choices feels under-discussed


Curious how people here think about this from an ops perspective.

We started on AWS (like most teams), and functionally it does everything we need. That said, once you move past basic usage, the combination of IAM complexity, cost attribution, and compliance-related questions adds a non-trivial amount of cognitive overhead. For context, our requirements are fairly standard: VMs, networking, backups, and some basic automation. Nothing particularly exotic.

Because we’re EU-focused, I’ve been benchmarking a few non-hyperscaler setups in parallel, mostly as a sanity check to understand tradeoffs rather than as a migration plan. One of the environments I tested was a Swiss-based IaaS (Xelon), primarily to look at API completeness, snapshot semantics, and what day-2 operations actually feel like compared to AWS.

The experience was mixed in predictable ways: fewer abstractions and less surface area, but also a smaller ecosystem and less polish overall. It did, however, make it easier to reason about certain operational behaviors.

Idk what the “right” long-term answer is, but I’m interested in how others approach this in practice: Do you default to hyperscalers until scale demands otherwise, or do you intentionally optimize for simplicity earlier on?


https://redd.it/1q07j39
@r_devops
How did you get into DevOps and what actually mattered early on?

I’m learning DevOps right now and trying to be smart about where I spend my time.

For people already working in DevOps:

- What actually helped you get your first role?

- What did you stress about early on that didn’t really matter later?

- When did you personally feel “ready” for a job versus just learning tools?

One thing I keep thinking about is commands. I understand concepts pretty well, but I don’t always remember exact syntax. In real work, do you mostly rely on memory, or is it normal to lean on docs, old scripts, and Google as long as you understand what you’re doing?
I’m more interested in real experiences than generic advice. Would love to hear how it was for you.

https://redd.it/1q09vxa
@r_devops