Reddit DevOps – Telegram
Is 40% infrastructure waste just the industry standard?

Posted yesterday in r/kubernetes about how every cluster I audit seems to have 40-50% memory waste, and the thread turned into a massive debate about fear-based provisioning.

The pattern I'm seeing everywhere is developers requesting huge limits (e.g., 8Gi) for apps that sit at 500Mi of usage. When asked why, the answer is always "we're terrified of OOMKills."

We are basically paying a fear tax to AWS just to soothe anxiety.

Wanted to get the r/devops perspective on this since you guys deal with the process side more: is this a tooling failure (we need better VPA/autoscaling) or a culture failure (devs have zero incentive to care about costs)?

I wrote a bash script to quantify this gap and found ~$40k/yr of fear waste on a single medium-sized cluster.

Curious if you guys fight this battle or just accept the 40% waste as the cost of doing business?

The script I used to find the waste is here if you want to check your own ratios: https://github.com/WozzHQ/wozz
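If you'd rather eyeball the idea before running someone else's script, here's a rough Python sketch of the same measurement (not the wozz tool itself): compare memory requests against live usage from metrics-server. It assumes the official `kubernetes` client, a working kubeconfig, and metrics-server installed; the quantity parsing is deliberately simplified.

```python
# Rough sketch, not the wozz script: sum requested-but-unused memory per pod.
from kubernetes import client, config

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(qty: str) -> float:
    # Handles only the common binary suffixes; "500Mi" -> bytes.
    for suffix, factor in UNITS.items():
        if qty.endswith(suffix):
            return float(qty[: -len(suffix)]) * factor
    return float(qty)

config.load_kube_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

# Memory requested per pod (summed over containers).
requested = {}
for pod in core.list_pod_for_all_namespaces().items:
    key = (pod.metadata.namespace, pod.metadata.name)
    reqs = [
        c.resources.requests.get("memory")
        for c in pod.spec.containers
        if c.resources and c.resources.requests
    ]
    if reqs:
        requested[key] = sum(to_bytes(r) for r in reqs if r)

# Actual usage per pod from metrics-server.
usage = {}
pod_metrics = custom.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
for item in pod_metrics["items"]:
    key = (item["metadata"]["namespace"], item["metadata"]["name"])
    usage[key] = sum(to_bytes(c["usage"]["memory"]) for c in item["containers"])

gap = sum(max(req - usage.get(key, 0.0), 0.0) for key, req in requested.items())
print(f"Requested but unused memory: {gap / 2**30:.1f} Gi")
```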

https://redd.it/1pib0u7
@r_devops
What's a "don't do this" lesson that took you years to learn?

After years of writing code, I've got a mental list of things I wish I'd known earlier. Not architecture patterns or frameworks — just practical stuff like:

* Don't refactor and add features in the same PR
* Don't skip writing tests "just this once"
* Don't review code when you're tired

Simple things. But I learned most of them by screwing up first.

What's on your list? What's something that seems obvious now but took you years (or a painful incident) to actually follow?

https://redd.it/1pic0a4
@r_devops
Non-UNIX administration?

Hey! I'm interested in some of the less popular OSes. For example, right now I'm interested in FreeBSD: trying to learn jails, playing around with ZFS, and stuff like that.

My question: is it actually a useful skill? As I understand the field, non-UNIX administration is really not something that companies look for when hiring DevOps engineers. Maybe I'm wrong and there's an area where (for example) FreeBSD is thriving and can't be replaced?

https://redd.it/1piakms
@r_devops
I built envsgen: generate docker-compose files, dotenvs, JSON, and YAML from a single TOML config (with imports, variables, and shell command expansion)

Managing multiple services for my self-hosted projects meant rewriting the same env vars in a dozen places. Eventually I snapped and wrote **envsgen**, a small Go CLI that makes one TOML file the “master config” for everything.

Keep in mind it may have bugs, as it's my first release, but it works.

Repo: [https://github.com/mcisback/envsgen](https://github.com/mcisback/envsgen)

Medium: [https://marcocaggiano.medium.com/awesome-devops-share-data-between-docker-dotenvs-secrets-and-apps-b909ff346cd3](https://marcocaggiano.medium.com/awesome-devops-share-data-between-docker-dotenvs-secrets-and-apps-b909ff346cd3)

Features:

* Imports (`#!import`)
* `${path.to.value}` references
* `${envs.MY_VAR}` for environment lookups
* `` ${`shell command`} `` if you enable `--allow-shell`
* Inheritance (e.g. `backend.local` inherits `backend`)
* Output to dotenv, JSON, YAML, or **docker-compose.yaml**
* `--expand` flattens nested sections for .env formats

Now I can generate docker-compose + backend.env + production.env from the same file, no more duplication.
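Not envsgen's own syntax, just a minimal Python illustration of the underlying idea (one TOML source of truth flattened into a dotenv file plus a JSON file). It assumes Python 3.11+ for the stdlib `tomllib`, and the `config.toml` / `[backend]` / output file names are placeholders:

```python
# Minimal sketch of "one TOML master config -> multiple outputs".
import json
import tomllib

def flatten(section: dict, prefix: str = "") -> dict:
    """Flatten nested tables into UPPER_SNAKE keys for .env output."""
    flat = {}
    for key, value in section.items():
        name = f"{prefix}{key}".upper()
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

# dotenv output for one service section
with open("backend.env", "w") as env:
    for key, value in flatten(cfg.get("backend", {})).items():
        env.write(f"{key}={value}\n")

# JSON output for anything that prefers structured config
with open("config.json", "w") as out:
    json.dump(cfg, out, indent=2)
```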

Happy to hear ideas or improvements!

https://redd.it/1pib7qz
@r_devops
Is there a good way to route requests to a specific instance of an API?

I am setting up a service that will be consumed exclusively through a client library. We will have multiple instances of the service, with some instances shared by multiple customers and some dedicated to a specific customer. In our database, we have a table that maps the customer ID to the specific instance IP their requests are supposed to go to. I am now trying to figure out how to route requests to the correct instance. Note: we already have an authentication mechanism set up that will reject requests if they are sent to the wrong instance, so here I am just figuring out how to route requests assuming the service is being used as intended.

My first thought was to send all requests to one load balancer or API gateway, include a header with the customer ID, and have the load balancer route the request to the correct instance based on that ID. We would want to use one of GCP's or AWS's managed load balancers for this, though, and I was not able to find a good way to manually specify fine-grained routing rules like this for those services. They allow you to specify URL maps with routing conditions, but this seems intended for routing requests to different APIs rather than to specific instances of the same API.

My next thought was to have our client library make an initial request to a shared service that holds the customer ID / instance IP map, get the IP of the customer's service, and then make requests directly to that service (which will have its own load balancer in front of it) from there. This would work, but it feels a little hacky and has a fair number of edge cases that would need to be handled in the client library.
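For what it's worth, a minimal sketch of that second option as a client wrapper: resolve the customer's deployment once from a lookup endpoint, cache it with a TTL, and re-resolve if the instance rejects the request. The `/route` endpoint, the `base_url` field, and the rejection status codes here are assumptions for illustration, not an existing API:

```python
# Sketch of the "resolve, cache, then call directly" client approach.
import time
import requests

ROUTER_URL = "https://router.example.com"   # shared service holding the customer -> instance map
WRONG_INSTANCE_CODES = {421, 403}           # whatever your auth layer returns for a misrouted request

class ServiceClient:
    def __init__(self, customer_id: str, token: str, ttl_seconds: int = 300):
        self.customer_id = customer_id
        self.token = token
        self.ttl = ttl_seconds
        self._base_url = None
        self._resolved_at = 0.0

    def _base(self) -> str:
        # Re-resolve periodically so customers can be migrated between deployments.
        if self._base_url is None or time.time() - self._resolved_at > self.ttl:
            resp = requests.get(
                f"{ROUTER_URL}/route",
                params={"customer_id": self.customer_id},
                headers={"Authorization": f"Bearer {self.token}"},
                timeout=5,
            )
            resp.raise_for_status()
            self._base_url = resp.json()["base_url"]
            self._resolved_at = time.time()
        return self._base_url

    def get(self, path: str):
        headers = {"Authorization": f"Bearer {self.token}"}
        resp = requests.get(f"{self._base()}{path}", headers=headers, timeout=10)
        if resp.status_code in WRONG_INSTANCE_CODES:
            # The map may have changed under us: drop the cache and retry once.
            self._base_url = None
            resp = requests.get(f"{self._base()}{path}", headers=headers, timeout=10)
        return resp

# client = ServiceClient("cust-123", "api-token")
# client.get("/v1/orders")
```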

Anyone have ideas on how you would handle this kind of routing?

Edit: by "instance" here I really mean a standalone, scalable deployment. Due to some stateful dependencies we need all of the requests from a single customer to go to one deployment.

https://redd.it/1pigr8t
@r_devops
Join the on-call roster, it’ll change your life

Joining an on-call rotation might change the future of your career and maybe even you as a person. Joining the roster 9 years ago has definitely changed me. In this article, I shared my experience being on-call.

Link: https://serce.me/posts/2025-12-09-join-oncall-it-will-change-your-life

https://redd.it/1pijp5t
@r_devops
Need brutally honest feedback: Am I employable as an internal tools/automation engineer with my background?

I'd really appreciate candid, unbiased feedback.

I’m based in Toronto and trying to understand where I realistically fit into the tech job market. My background is non-traditional, and I’ve developed a fear that I’m underqualified for most software roles despite being able to build a lot of things.

My background:

I was the main tech person at a small hedge fund that launched in 2021.

I built all the internal trading and operations tools from scratch:

* PnL/exposure dashboards
* efficient trade executors
* signal engines built on the PM's insights, deployed on EC2 and talking to client-side (traders') scripts over sockets
* automated margin checks
* reconciliation pipelines
* Excel/Python hybrid tools for ops

Basically: if the team needed something automated or streamlined, I designed and built it.


Where I feel confident:

I'm very comfortable:

* understanding messy business processes
* abstracting them into clean systems
* building reliable automations
* shipping internal tools quickly
* integrating APIs
* automating workflows for non-technical users
* designing guardrails so people don't make mistakes

Across domains, I feel I could pick up any internal bottleneck and automate it.

Where I feel unprepared / insecure:

Because I was the only technical person:

* I never learned Agile/Scrum
* never used Jira or any formal ticketing
* barely used SQL (everything was Python + Excel)
* never worked with other engineers
* didn't learn proper software development patterns
* no pull requests, no code reviews
* no experience building public products or services

I worry that I'm mostly a "script kiddie" who built robust systems by intuition, but not a "proper software engineer."

The fund manager was a trained software engineer but gave me full freedom as long as the tools worked — which I loved, but now I'm worried I skipped important foundational learning.

My questions for people working in tech today:

1. Is someone with my background employable for internal tools or automation engineering roles in Canada?

2. If not, what specific skills should I prioritize learning to become employable? SQL? TypeScript/React? DevOps? Software architecture?

3. What kinds of roles would someone like me realistically be competitive for? Internal tools engineer? Automation engineer? Operations engineer? AI automation roles?

4. Is it realistic for someone with mostly Python + automation experience (but little formal SWE experience) to land roles in the ~80–110k range in Canada?

5. If you were in my position, what would you do next to fix the gaps and move forward?

I’m not looking for comfort — I genuinely want realistic, even harsh feedback from people who understand the current job market.

Thanks in advance to anyone who takes the time to answer.

https://redd.it/1piihs4
@r_devops
Malware on application server

I’m a 3rd year IT student on the ops team for a DevOps class where the devs are building a .NET application.

Earlier today I noticed a suspicious process called b0s running from `/var/tmp/.b0s` and eating a ridiculous amount of CPU. After digging into it I realized the application server was actually compromised. There were:

* strange binaries dropped in `/var/tmp` and `/tmp`
* a fake sshd running from `/var/tmp/sshd`
* cronjobs that kept recreating themselves every minute
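A rough sketch of the kind of check that surfaces this (processes whose binary lives under /tmp, /var/tmp, or /dev/shm); it needs root to see other users' processes, and it's no substitute for the rebuild:

```python
# Quick-and-dirty scan for processes executing from temp directories.
import os

SUSPECT_PREFIXES = ("/tmp/", "/var/tmp/", "/dev/shm/")

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        exe = os.readlink(f"/proc/{pid}/exe")
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        continue  # process exited, or we lack permission
    if exe.startswith(SUSPECT_PREFIXES):
        print(f"PID {pid}: {exe} ({cmdline})")
```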

With some AI help I cleaned everything up. I killed the active malware processes, and removed all the persistence so the server is stable again.

I built the application server with Ansible so rebuilding it tomorrow will be easy… still mad embarrassing though ngl.

https://redd.it/1piol32
@r_devops
Feel so hopeless and directionless

Just some backstory: I started off in DevOps without any SWE background. I was working minimum-wage jobs and spent hours on tutorials around my day job. A friend referred me and helped me get a support engineer job, and I know how lucky I got - I had take-home assignments that I finished perfectly and got the job (the manager was leaving the company and I think he just wanted to fill the position). But I struggle so much every day. The team does not help me - not a single person is interested in helping a junior learn or unblocking them. They don't even want to be DM'ed; they want you to ask any question in the company Slack in front of 500 people, which keeps me from asking at all at times. This was a couple of years ago and I still have not learned or made any progress. Every day is a struggle - I switch from one problem to the next so fast that I never learn anything (that's support eng for you).

I feel like a complete newb in meetings or any discussions. I really really want to learn and find a direction for my learning. I have a few weeks off and I want to get somewhere in this time.

Here is my game plan:

Take the CKA course and pass the exam: this will help me learn K8s (my job needs K8s knowledge). I'm working through the KodeKloud course.

AWS Solutions Architect course and exam

Sysadmin handbook to get good at fundamentals: https://www.amazon.com/UNIX-Linux-System-Administration-Handbook/dp/0134277554 (if you're familiar with this book and know what can be skipped to save time, please let me know)

I think these three cover:

Container / orchestration (K8s)
Cloud / automation concepts (K8s / AWS)
Observability (K8s)
Troubleshooting (book)
IaC (K8s)
Security (AWS)
Operating system fundamentals (book)
Shell / scripting (book)

My goal is 3 hours on the CKA, one hour on the book, and 2 hours on the AWS course daily.

If you think I should prioritize one above another or this looks good, let me know. Eager for some direction and advice.

https://redd.it/1piqj5d
@r_devops
Is anyone here using MLflow for GenAI?

Heyyy. I'm sorry if my question is too naive or sounds under-researched, but I swear I read the whole internet :)

Is anyone here using MLflow for GenAI? I started learning MLOps coming from a pure R&D NLP engineer background. I'm working at a startup, and the evaluation pipeline right now is too vague and has gotten a lot of criticism for its poor quality. I want to set up a CI/CD pipeline integrated with MLflow to make the evaluation process clear and transparent, and build a quality gate that checks quality and decides whether something should go to production or not.


While exploring MLflow, I found it quite difficult to organize the different stages: dev/staging/prod. Does it all go into Experiments? I also have difficulty distinguishing between experiments in dev (different configs, model prompts) and the evaluation results that go to production. (Something like the champion model in traditional ML would be quite useful, but we don't have a "champion config"?)
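In case it helps, a minimal sketch of the CI quality-gate idea with plain MLflow tracking: one experiment per environment, metrics plus an env tag per run, and a hard exit code to block promotion. `run_genai_eval()`, the thresholds, and the tracking URI are placeholders, not a recommended setup:

```python
# CI quality gate sketch: log GenAI eval metrics to MLflow, fail the pipeline on regressions.
import sys
import mlflow

ENV = "staging"                      # dev / staging / prod, passed in from CI
THRESHOLDS = {"faithfulness": 0.85, "answer_relevance": 0.80}

def run_genai_eval() -> dict:
    """Placeholder: run your evaluation set and return metric -> score."""
    return {"faithfulness": 0.91, "answer_relevance": 0.78}

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # assumed tracking server
mlflow.set_experiment(f"genai-eval-{ENV}")               # one experiment per environment

with mlflow.start_run(run_name="ci-eval"):
    mlflow.set_tags({"env": ENV, "git_sha": "abc123"})   # git SHA would come from CI in practice
    scores = run_genai_eval()
    mlflow.log_metrics(scores)

    failures = {m: s for m, s in scores.items() if s < THRESHOLDS[m]}
    mlflow.set_tag("quality_gate", "failed" if failures else "passed")
    if failures:
        print(f"Quality gate failed: {failures}")
        sys.exit(1)                                      # blocks the CI/CD promotion step
```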


Aww thank you so much for reading this:) (this is my 3rd post T.T)

https://redd.it/1pit7nd
@r_devops
Hi guys, been looking into building a

price discovery platform for comparing various FinOps platforms, applying the optimal combination from a lookup to an individual, and/or renegotiating rates.

I also have a couple of internal tools I was thinking about open sourcing that use boto3 to map resource dependencies and VPCs/networks between resources.
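For context, a rough idea of what that boto3 mapping piece could look like (grouping EC2 instances and their security groups by VPC); credentials/region come from the normal boto3 configuration, and this only covers one resource type:

```python
# Sketch: group EC2 instances (and their security groups) by VpcId.
from collections import defaultdict
import boto3

ec2 = boto3.client("ec2")
by_vpc = defaultdict(list)

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            by_vpc[instance.get("VpcId", "no-vpc")].append({
                "id": instance["InstanceId"],
                "security_groups": [sg["GroupId"] for sg in instance.get("SecurityGroups", [])],
            })

for vpc_id, instances in by_vpc.items():
    print(vpc_id, len(instances), "instances")
```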

Thoughts on what you'd like to see in something like this?

https://redd.it/1piti87
@r_devops
Using PSI + cgroups to debug noisy neighbors on Kubernetes nodes

I got tired of “CPU > 90% for N seconds → evict pods” style rules. They’re noisy and turn into musical chairs during deploys, JVM warmup, image builds, cron bursts, etc.

The mental model I use now:

* CPU% = how busy the cores are
* PSI = how much time things are actually *stalled*

On Linux, PSI shows up under `/proc/pressure/*`. On Kubernetes, a lot of clusters now expose the same signal via cAdvisor as metrics like `container_pressure_cpu_waiting_seconds_total` at the container level.
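For anyone who hasn't poked at PSI directly, it's just a text file. A minimal Python sketch of reading it (Linux 4.20+ with PSI enabled); the 20% threshold below is an arbitrary example, not a recommendation:

```python
# Minimal sketch: parse /proc/pressure/cpu. Lines look like
#   some avg10=1.23 avg60=0.80 avg300=0.45 total=123456
def read_cpu_pressure(path: str = "/proc/pressure/cpu") -> dict:
    pressure = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            kind, *fields = line.split()          # "some" (plus "full" on newer kernels)
            pressure[kind] = {
                key: float(value)
                for key, value in (field.split("=") for field in fields)
            }
    return pressure

psi = read_cpu_pressure()
# avg10 = % of the last 10s in which at least one task was stalled waiting on CPU
if psi["some"]["avg10"] > 20.0:
    print("node is under real CPU pressure, not just busy")
```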

The pattern that’s worked for me:

1. Use PSI to confirm the node is actually under pressure, not just busy.
2. Walk cgroup paths to map PIDs → pod UID → {namespace, pod_name, QoS}.
3. Aggregate per pod and split into:
* “Victims” – high stall, low run
* “Bullies” – high run while others stall

That gives a much cleaner “who is hurting whom” picture than just sorting by CPU%.

I wrapped this into a small OSS node agent I’m hacking on (Rust + eBPF):

* `/processes` – per-PID CPU/mem + namespace/pod/QoS (basically `top` but pod-aware).
* `/attribution` – you give it `{namespace, pod}`, it tells you which neighbors were loud while that pod was active in the last N seconds.

Code: [https://github.com/linnix-os/linnix](https://github.com/linnix-os/linnix)
Write-up + examples: [https://getlinnix.substack.com/p/psi-tells-you-what-cgroups-tell-you](https://getlinnix.substack.com/p/psi-tells-you-what-cgroups-tell-you)

This isn’t an auto-eviction controller; I use it on the “detection + attribution” side to answer the “who is hurting whom” question before touching PDBs / StatefulSets / scheduler settings.

Curious what others are doing:

* Are you using PSI or similar saturation signals for noisy neighbors?
* Or mostly app-level metrics + scheduler knobs (requests/limits, PodPriority, etc.)?
* Has anyone wired something like this into automatic actions without it turning into musical chairs?

https://redd.it/1pitfnt
@r_devops
How much better is AI at coding than you really?

If you’ve been writing code for years, what’s it actually been like using AI day to day? People hype up models like Claude as if they’re on the level of someone with decades of experience, but I’m not sure how true that feels once you’re in the trenches.

I’ve been using Claude and Cosine a lot lately, and some days it feels amazing, like having a super fast coworker who just gets things. Other days it spits out code that leaves me staring at my screen wondering what alternate universe it learned this from.

So I’m curious, if you had to go back to coding without any AI help at all, would it feel tiring?

https://redd.it/1piy5c1
@r_devops
Built a visual debugger for my local agents because I was lost in JSON, would you use this?


I run local LLM agents with tools / RAG.

When a run broke, my workflow was basically: rerun with more logging, diff JSON, and guess which step actually screwed things up. Slow and easy to miss.

So I hacked a small tool for myself: it takes a JSON trace and shows the run as a graph + timeline.

Each step is a node with the prompt / tool / result, and there’s a basic check that highlights obvious logic issues (like using empty tool results as if they were valid).
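Not the OP's tool, but a sketch of that kind of check against an assumed trace schema (a JSON list of steps with `id`, `type`, `tool`, `output`, and `inputs_from`); real traces will differ, the point is flagging empty tool results that later steps consumed:

```python
# Flag tool calls whose empty output was nevertheless used by a later step.
import json

with open("trace.json") as f:
    steps = json.load(f)   # assumed: [{"id": 1, "type": "tool_call", "tool": "search",
                           #            "output": "", "inputs_from": [...]}, ...]

# Which step ids were consumed as inputs by other steps?
consumed = {ref for step in steps for ref in step.get("inputs_from", [])}

for step in steps:
    if step.get("type") != "tool_call":
        continue
    output = (step.get("output") or "").strip()
    if not output and step.get("id") in consumed:
        print(f"step {step['id']} ({step.get('tool')}): empty tool result was used by a later step")
```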

It’s already way faster for me than scrolling logs.

Long-term, I’d like this to become a proper “cognition debugger” layer on top of whatever logs/traces you already have, especially for non-deterministic agents where “what happened?” is not obvious.

It’s model-agnostic as long as the agent can dump a trace.

I’m mostly curious if anyone else here hits the same pain.

If this sounds useful, tell me what a debugger like this must show for you to actually use it.

I’ll drop a demo link in the comments 🔗.

https://redd.it/1piyll2
@r_devops
Your AI agents are a compliance disaster waiting to happen

Just got out of a meeting with legal and I need to vent somewhere.

We have like six agents running in production now. Different teams built them over the past year. They work fine, users like them, everyone was happy. Then legal started asking questions for some audit prep and everything fell apart.

Can you prove what data this agent accessed when it made that decision? No. Can you show me a trace of why it recommended X to this customer? Also no. Can you demonstrate that PII wasn't sent to OpenAI? Definitely no. Can you prove GDPR compliance for the EU users? Lmao.

None of this stuff was even on anyone's radar when we were building. We were just trying to get the damn things working. Now legal is talking about shutting down two of the agents entirely until we can prove they're compliant. Which we can't. Because we logged basically nothing.

The thing that kills me is this isn't even hard technically. Audit logs, decision traces, data lineage. We know how to build this stuff. We just didn't because nobody asked and we were moving fast. Classic.
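For anyone starting the retrofit, the minimal version of "log every decision" is roughly a wrapper like this; the field names and the JSONL sink are assumptions, and real PII redaction/retention is the hard part this sketch skips:

```python
# Sketch: append-only audit records around every agent/tool/LLM step.
import json
import time
import uuid
from functools import wraps

AUDIT_LOG = "agent_audit.jsonl"

def audited(action: str):
    """Decorator that records inputs, outputs, outcome, and timing for an agent step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "id": str(uuid.uuid4()),
                "action": action,
                "timestamp": time.time(),
                "inputs": {"args": [repr(a) for a in args],
                           "kwargs": {k: repr(v) for k, v in kwargs.items()}},
            }
            try:
                result = fn(*args, **kwargs)
                record["outcome"] = "ok"
                record["output"] = repr(result)[:2000]   # truncate; redact PII before logging in practice
                return result
            except Exception as exc:
                record["outcome"] = f"error: {exc}"
                raise
            finally:
                with open(AUDIT_LOG, "a") as log:
                    log.write(json.dumps(record) + "\n")
        return wrapper
    return decorator

@audited("recommend_product")
def recommend_product(customer_id: str) -> str:
    return "product-42"   # placeholder for the real agent/LLM call
```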

Now I'm looking at retrofitting observability into agents that were built by people who already left the company. Some of this code is held together with prayers and YAML. One agent calls three different LLM providers and nobody documented why.

Anyone else getting hit with this? How are you handling audit requirements for agent stuff? Our legal team wants full decision trails and I'm not even sure where to start without rebuilding half of this from scratch.

https://redd.it/1pize7u
@r_devops
Jenkins alternative for workflows and tools

We are currently using Jenkins for a lot of automation workflows, calling all kinds of tools with various parameters. What would be an alternative? GitOps is not suitable for all scenarios. For example, I need to restore a specific customer database from a backup. Instead of running a script locally, I want some sort of Jenkins-like pipeline/workflow where I can specify various parameters. What kind of tools do you guys use for such scenarios?

https://redd.it/1pj0hw0
@r_devops
Ever mass forwarded a "quick fix" and mass forwarded your weekend?

Ever mass forwarded "yeah this looks like a quick fix" and mass forwarded your entire weekend?

Or approved a PR on Friday because you didn't want to be "that guy" — and spent Saturday debugging production?

I keep thinking about how no tutorial teaches you this stuff. You just screw up and learn.

But what if senior devs just... told you their screw-ups? Short videos, real situations, what they'd do differently.

Would you actually pay for that? Or is that what Reddit and YouTube are for?

https://redd.it/1pj2a99
@r_devops
Self-hosted k3s GitHub pipeline

Hi all,
I'm trying to build a DIY CI/CD solution on my VPS using k3s, ArgoCD, Tekton, and Helm.
I'm avoiding PaaS solutions like Coolify/Dokploy because I want to learn how to handle automation and autoscaling manually. However, I'm really struggling with the integration part (specifically GitHub webhooks failing, issues with my self-hosted registry, and Tekton itself).

It feels like I might be over-engineering for a single server.

- What can I do to simplify this stack while keeping it "cloud-native"?
- Are there better/simpler alternatives to Tekton for a setup like this?

Thanks for any keywords or suggestions!

https://redd.it/1pj24v6
@r_devops
For the Europeans here: how do you deal with agentic compliance?

I've seen a few people complain about this, and with the EU AI Act it's only getting worse. How are you handling it?

https://redd.it/1pj40zu
@r_devops
How do I actually speedrun DevOps?

My main background is sysadmin, been doing it for like 10 years. A few years back I decided to switch to DevOps because I didn't wanna do the physical stuff anymore. Mainly printers... I hate printers. Anyways, I started looking, found a DevOps job, and have been at it for 4+ years now.
The boss knew I didn't have actual DevOps experience, but based on my sysadmin background and willingness to learn and tinker, he hired me (I told him all about my homelab).

Here's the thing: at this company, for the past 4 years I haven't really done any actual "DevOps" stuff, simply because of the platforms and environments the company has. We have a GitHub account with a few repos that are mostly AI-generated apps/sites. The rest of the stack is a bunch of websites on other platforms like SiteGround, Squarespace, etc. Basically, for the past 4 years I've been more of a WordPress web admin who occasionally troubleshot someone's Microsoft account/Azure issues. We also have an AWS account, but we only use S3 for some images.

Every week, every month I would say to myself, "tomorrow I'ma learn Docker... or Terraform... or set up a cool CI/CD pipeline in GitHub to learn DevOps." Well, every day I was hella busy with the WP sites and other non-DevOps duties, so I never got to do anything else.
Fast-forward to today: the company is being bought out and the tech department will be gone, so I need to find a job.
While job hunting I realized (and had forgotten) that I need actual DevOps experience 😢😅 everyone is asking for AWS, GCP, Azure, Terraform, Ansible... and I have NOT touched any of those.
So, how do I learn the most important things in like... a week or so? I'm great at self-learning. Any project ideas I can whip up to speedrun DevOps?
My boss has told me to get certified in AWS or something, and yeah, I do want to. But I also feel like I can study hard, learn what I need, and reframe everything I've done for the past 4 years as "I automated X on AWS to improve Y" and use that during interviews.
Thoughts? Ideas?
Also, because of my 3 years of experience in basically WordPress and website design, I kind of just want to start a side gig doing that. I became a WordPress/Elementor pro, basically. Oh, and I actually learned a lot of JavaScript/HTML/CSS (I already knew enough Python/bash from sysadmin stuff).
Thanks in advance!

https://redd.it/1pj8cte
@r_devops