Reddit DevOps – Telegram
Open source observability - what is your take?

Hey there 👋

I currently use VictoriaMetrics/Grafana for metrics and Loki for logs (I also use ELK, but not every project has the budget to keep an ES cluster running, so S3 is a nice alternative).

What I'm missing from this stack is APM. Today I stumbled upon a link (which I've since lost) to a new S3-backed open source APM tool, and it got me thinking about this.

Since I'm already on the Grafana stack, I'm considering Tempo, but there are other alternatives like https://signoz.io/ https://openobserve.ai/ and Elastic APM. All three of those are pretty resource-hungry and I'd prefer something lighter with S3 storage.

Do you have any suggestions for other tools to evaluate? On the app side we're mostly hosting PHP and Python apps.

Happy New Year and thanks in advance for any tips!

https://redd.it/1q2u17c
@r_devops
What actually happens to postmortem action items after the incident is “over”?

Hi folks,

I’m trying to sanity-check something and would appreciate some honest answers from people doing on-call / incident work.

In places I’ve worked (small to mid-size teams, no dedicated SREs), we write postmortems after incidents, capture action items, sometimes assign owners, set dates… and then real life happens.

A few patterns I keep seeing:

action items slip quietly when other work takes priority
once prod is “stable”, the incident is mentally considered done
weeks later, it’s hard to tell what actually changed (especially for mid-sev incidents)
sometimes the same incident happens again in a slightly different form

Tooling-wise, it’s usually:

incidents/alerts arrive in Slack
postmortems written in Confluence
action items tracked in Jira (if they make it there at all)

My question isn’t how this should work, but how it actually works for you/your team:

What happens when a postmortem action item misses its due date?
Is there any real consequence, or does it just roll over?
Who notices, if anyone? Do you send a notification?
Do you explicitly track whether an incident led to completed changes, or does it fade once things are stable?
If incidents consistently resulted in completed follow-up work — and didn’t quietly fade after recovery — would that materially change your team’s on-call life?

Not looking for best practices. I’m just trying to understand whether this pain exists outside my bubble.


I appreciate any comments / opinions in this area :)

Cheers!

https://redd.it/1q30bt7
@r_devops
I have a DevOps opportunity, but I have no experience. Is it too risky?

Hi everyone,

I hope I'm not breaking any forum rules (I'm new, so I apologize in advance and will remove the post if necessary).

M35, I'm considering a job opportunity that would require me to leave a large multinational company for a smaller company looking for a mid-level developer in a DevOps role. I'm preparing for the interview by taking courses on Docker and Kubernetes and brushing up on Spring Boot.

In my current job, after six years, I'm still involved in legacy support and mainly manage tickets (about €1,800 net per month in a small town in central-northern Italy). I haven't written code for a few years, and even before that, I was never involved in full-fledged projects (ones I saw through from start to finish). In my role, every day is active and busy, but I'm not really a developer: I read logs, solve some problems, and respond to tickets, but I've never really acquired any particular technical skills.

I studied computer engineering, but I didn't finish, and this was my first and so far only job. I've often been told I should have been more proactive, but I didn't really know how to do more beyond writing a few PowerShell scripts to consult logs and respond to tickets. I feel like I've wasted the little I've studied.

The work environment, however, is fantastic, and my colleagues are exceptional. Even on a human level, they supported me when I went through a difficult period, and they didn't fire me even though I wasn't at my best. That's why I feel guilty about wanting to change, but I realize that, after all these years, I haven't learned anything about real programming.
I'm wondering if I should stay out of gratitude, or if it would be a mistake not to take advantage of the opportunity to learn new technologies at another company. In particular, I wonder if the DevOps role might be too challenging for me. So far, I've only seen it in courses, but I know the reality could be very different.

I wanted to hear from those in the industry.

Thanks so much in advance!

https://redd.it/1q2o742
@r_devops
What are the best practical DevOps tutorials that were released recently?

What are the best practical DevOps tutorials that were released recently? I am always on the lookout for new things to learn. Feel free to share.

https://redd.it/1q32e5o
@r_devops
Is it really worth getting into devops after spending years on another role?

I have been a QA engineer (manual + automation) for 8 years and was offered a DevOps position starting in July 2026 after passing an internal interview.
I have been studying for the CKA certificate for about 3 months and I’m close to scheduling the exam.
I already do the DevOps work in the team by managing a k8s cluster, fixing CI/CD pipelines, handling Grafana monitoring, setting alerts and playing with scripts.

Do I love it? Yes, I wish I had started earlier because I’ve always wanted to get my hands on the infra.
I am already tired of QA: they always say it’s automation, but 70-80% of it is manual work plus maintenance of an automation framework.

Questions that are bugging me at this point are:
Is it really worth it?
Is it future proof?
What’s the future of it with the evolution of AI and the mass layoffs which will keep occurring?


https://redd.it/1q365xe
@r_devops
DevOps/Platform engineers: what have you built on your own?

Hey folks,

I’m a platform engineer (Azure, AWS, Kubernetes, Terraform, Python, CI/CD, some Go). I want to start building my own thing, but I’m honestly stuck at the *idea* stage.

Most startup/product advice seems very app-focused (frontend, mobile apps, UX-heavy SaaS), and that’s not my background at all. I’m trying to understand:

* What kinds of products actually make sense for someone with a DevOps / platform engineering background?
* Has anyone here built something successful (or even just useful) starting from infra/automation skills?
* Did you double down on infra tools, or did you force yourself to learn app dev?

I’d love to hear real examples — even failed attempts are helpful.

Thanks!

https://redd.it/1q2t5ma
@r_devops
Cost guardrails as code: what actually works in production?

I’m collecting real DevOps automation patterns that prevent cloud cost incidents. Not selling anything. No links. Just trying to build a field-tested checklist from people who’ve been burned.

If you’ve got a story, share it like this:

Incident: what spiked (egress, logging, autoscaling, idle infra, orphan storage)
Root cause: what actually happened (defaults, bad limits, missing ownership, runaway retries)
Signal: how you detected it (or how you wish you did)
Automation that stuck: what you automated so it doesn’t depend on humans
Guardrail: what you enforced in CI/CD or policy so it can’t happen again

Examples of the kinds of automation I’m interested in:

“Orphan sweeper” jobs (disks, snapshots, public IPs, LBs)
“Non-prod off-hours shutdown” as a default
Budget + anomaly alerts routed to owners with auto-ticketing
Pipeline gates that block expensive SKUs or missing tags
Weekly cost hygiene loop: detect → assign owner → fix → track savings
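
As an illustration of the "pipeline gates" item above, here is a minimal sketch of a check a CI job could run against planned resources. Everything in it is an assumption made for the example: the `PlannedResource` shape, the required tag names, and the deny-list of SKUs are hypothetical, and a real gate would read them from your IaC plan output and your own policy.

```python
from dataclasses import dataclass, field

# Hypothetical policy values; real ones would come from your own standards.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}
BLOCKED_SKUS = {"p4d.24xlarge", "x2iedn.32xlarge"}


@dataclass
class PlannedResource:
    """Simplified stand-in for one resource in an IaC plan (assumed shape)."""
    name: str
    sku: str
    tags: dict = field(default_factory=dict)


def find_violations(resources):
    """Return human-readable policy violations for the planned resources."""
    violations = []
    for res in resources:
        missing = sorted(REQUIRED_TAGS - set(res.tags))
        if missing:
            violations.append(f"{res.name}: missing tags {', '.join(missing)}")
        if res.sku in BLOCKED_SKUS:
            violations.append(f"{res.name}: SKU {res.sku} needs explicit approval")
    return violations


def gate(resources):
    """Return an exit code: 0 lets the pipeline continue, 1 blocks it."""
    violations = find_violations(resources)
    for line in violations:
        print(f"BLOCKED {line}")
    return 1 if violations else 0


plan = [
    PlannedResource("api-server", "t3.medium",
                    {"owner": "team-a", "cost-center": "42", "environment": "dev"}),
    PlannedResource("training-box", "p4d.24xlarge", {"owner": "team-b"}),
]
exit_code = gate(plan)  # a CI step would pass this to sys.exit()
```

Returning an exit code rather than raising keeps the check easy to wire into any CI system, since a non-zero exit is what fails the step.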

I’ll summarize the best patterns in a top comment so the thread stays useful.

https://redd.it/1q2wzad
@r_devops
How do you go from incident review to actual alerts in production?

Every retro we do, someone says "we should have had an alert for this." Everyone nods. Ticket gets created.

Then it sits there for 3 weeks because nobody wants to write the PromQL.

By the time someone gets to it, we've already had another incident and the cycle repeats.

I've been messing with a tool that takes incident notes and spits out Prometheus alert configs automatically. Not sure if it's worth building out more or if I'm solving a problem only my team has.

How do you guys handle this? Is there an actual workflow that works or is everyone just letting alert tickets rot in the backlog like us?
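
For readers wondering what the last step of that retro-to-alert loop can look like, below is a rough sketch (not the tool mentioned above) that turns a few structured fields into a Prometheus alerting rule file. The keys `alert`, `expr`, `for`, `labels` and `annotations` are the standard rule-file fields; the metric `http_requests_total`, its `status` label, the threshold and the group name are assumptions chosen for illustration.

```python
import json


def build_alert_rule(name, expr, for_duration, severity, summary):
    """Build one alerting rule as a plain dict using Prometheus rule-file keys."""
    return {
        "alert": name,
        "expr": expr,
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def render_rule_group(group_name, rules):
    """Render rules as YAML text; json.dumps is used to quote strings safely."""
    lines = ["groups:", f"  - name: {json.dumps(group_name)}", "    rules:"]
    for rule in rules:
        lines.append(f"      - alert: {rule['alert']}")
        lines.append(f"        expr: {json.dumps(rule['expr'])}")
        lines.append(f"        for: {rule['for']}")
        for section in ("labels", "annotations"):
            lines.append(f"        {section}:")
            for key, value in rule[section].items():
                lines.append(f"          {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


rule = build_alert_rule(
    name="HighErrorRate",
    # http_requests_total and its "status" label are assumed metric names.
    expr='sum(rate(http_requests_total{status=~"5.."}[5m])) '
         '/ sum(rate(http_requests_total[5m])) > 0.05',
    for_duration="10m",
    severity="page",
    summary="More than 5% of requests have failed for 10 minutes",
)
print(render_rule_group("retro-followups", [rule]))
```

The hard part is still choosing the expression and threshold; generated output like this would need a check with `promtool check rules` and a human review before it goes anywhere near production.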

https://redd.it/1q38k59
@r_devops
DevOps/SRE coding assessment

Looking for some recommendations on how to improve on the coding assessment phase of interviews.

For context, I am self-taught but have 10+ years of experience as a DevOps/software engineer focusing on Kubernetes, building/maintaining CI/CD pipelines, Python scripting for automation, etc. About 4-5 years ago I was considering moving to San Francisco and had a ton of interviews. I feel like I did really well in the technical/infrastructure discussions until we got to the coding assessment. As I said, I'm self-taught, so I'm sure it was just spaghetti code (though I hope I've made some improvements in the last 4-5 years). My fiance and I are thinking about moving and I want to be better prepared for interviews.

I've done some research into things like LeetCode, bootcamps, mentorships, etc., but everything seems to be either a scam or to have mixed reviews.

https://redd.it/1q280sr
@r_devops
Built an AI DevOps assistant for AWS, NEED feedback..

Hey everyone,
My cofounder and I are building an AI-powered DevOps assistant aimed at startups and engineering teams using AWS. We'd love your raw, unfiltered feedback on the idea before we go further. 🙏

It’s basically a chat-based DevOps co-pilot that connects to your AWS account and helps you manage infra using natural language. It can:

Answer questions like:
“How many EC2s are running?”,
“Why are my costs high this month?”,
“Which stacks are failing?”

Convert prompts into AWS CLI commands (editable + safe approval flow)

Generate, iterate, and deploy CloudFormation templates from natural language

Integrate with GitHub/Bitbucket to:

- Scan repos for CloudFormation
- Trigger existing CI/CD pipelines
- Stream logs and diagnose failures
- Apply rule-based fixes via PRs

Enforce IAM-permissioned access, full audit logs, and org/team-based controls

We’re planning to add Terraform support next (already being requested).

☁️ This is why we’ve built it:

Infra is complex, DevOps is expensive, and a lot of startups struggle to operate AWS safely. We want this tool to feel like a senior DevOps engineer who answers questions, gives you the CLI/code to act, and handles pipelines safely with approvals.

https://redd.it/1q3dmd4
@r_devops
When is old?

At what age should someone hang up their hat on trying to get in the door? What door should older candidates try for?

https://redd.it/1q3jlw7
@r_devops
One Windows package manager to rule them all?

Just came across a nice article about an interface that brings all the various package managers together.

I mainly use Chocolatey, as it's what is integrated into the tool my company uses; however, this one, "UniGetUI", brings them all together in a GUI.

I haven't tried it myself yet, but the article seems too good not to share.

https://www.makeuseof.com/replace-microsoft-store-with-unigetui-package-manager/

https://redd.it/1q3kln2
@r_devops
Many companies are moving towards Dev-owned DevOps.

I’m seeing a trend where companies want developers to handle DevOps work directly.

For someone working as a DevOps engineer, what’s the best way to adapt?

What new skills are worth learning, and what roles make sense in the future?

Curious to hear how others are handling this shift.

https://redd.it/1q3h19o
@r_devops
CI/IaC is basically a control plane now… what guardrail helped the most?

It feels like everything is a control plane now. GitHub Actions, IaC pipelines, internal platforms, agents, all of it.

And the failure mode I keep seeing is “one small change lands everywhere” because the blast radius is huge and rollout/rollback isn’t really a thing.

Curious... What’s one guardrail you added that actually helped?

Canaries, progressive delivery, env isolation, policy checks, drift detection, JIT admin, whatever… doesn’t have to be fancy.

https://redd.it/1q3oifo
@r_devops
Is Kubernetes here to stay for a long time?

Is it worth investing time in learning K8s, or will it be hidden under PaaS? Is it a must-have skill for every DevOps engineer in the future, or is it expected to be buried under other technologies?

https://redd.it/1q3qgdx
@r_devops
Sci-Fi Author needs your help - "End of Integers"

Hey folks! I'm a career IT Ops Engineer, and Author, with just enough programmatic knowledge to be dangerous. I'm writing a Sci-Fi novel, and need your advice.

It's the year 2711, and I have an android-like bot that works in a research lab. She has a malfunction when her human boss asks her a question that she isn't supposed to answer.

That causes an error that makes her verbalize the terms and conditions of the leasing contract that she's governed by. Not in an informational way, but one that shows she's had a failure and isn't acting right.

When she's done, there's a one-second pause, followed by the statement End of Integers, which she says like it's a punctuation mark.

EDIT - I want the answer to sound programmatic, but also vague and not possible.

My Dev wife thinks it's a brilliant idea, since there is no such thing as an "end of integers."

My thought is there's a safeguard to keep her from telling anyone what she knows, but the code for the safeguard has a flaw that makes her say End of Integers.

1. Keep this, or use another type of error?
2. If another, which one would make more sense, for what I need to accomplish?

Thank you, and may your Secrets Management never fail, and blow up your Sprint schedule :)

https://redd.it/1q3rjh6
@r_devops
UAT for 40 +

We are rolling out a chatbot for our organization. Leadership wants all of corp tech to be able to soft test the feature and provide feedback: Jira ID, acceptance criteria, pass/fail, strengths, weaknesses.

Normally I would have test steps, but here it's really just launching the bot and asking it questions related to the description/acceptance criteria.

My question: how do you distribute and track something like this? I normally do feature releases, which are done via email. This seems like it might work better as a Microsoft Form with a Power Automate flow into a SharePoint list for metrics. It's also 40+ scenarios, which adds to the problem of how to distribute and track everything.

https://redd.it/1q3tzeb
@r_devops
Company I work for realized AI can’t replace DevOps and now Hiring again

Hi folks, I work as a freelance DevOps engineer, and in 2020–2022 I used to get 2-3 recruiter calls a day.. those were crazy times. It started to slowly fade off, and by mid-2023, although I still managed to get offers, it was noticeably harder.

Currently, the company I’m working at has a large proportion of developers compared to the DevOps team (I’d say ~15% DevOps, 85% devs). Our management tried multiple shiny tools to improve our processes, but we ended up using AI only for PR reviews, and even that is mostly for pre-screening. We still have to manually review things since AI makes mistakes and hallucinates.

For the past few years, the usual response around here was "Hey, these guys don’t know how to use AI... it’s a skill issue," but IMO anyone who thinks AI can automate DevOps hasn’t dealt with complex infrastructure beyond boilerplate.

During the past three years, I've heard all sorts of things: "Everything will be automated," "It’s just the first year of AI, wait and see, in a couple of years there won’t be dev jobs," "Devin will eliminate engineers.. (LOL to this one)", and so on. All this hype and bubble kept growing, yet where I worked there were no meaningful headcount reductions beyond cutting back on intern and junior roles doing mostly grunt work and boilerplate, and even that ended up hurting us.

Anyway, all of this could have remained speculation, if not for the fact that DevOps positions previously considered redundant due to "more efficient processes" are now being filled again, and the 5-6 DevOps engineers on our team are so overworked that we urgently need to hire more people.

In short (TL;DR), I haven’t seen any meaningful AI automation beyond what we already had, nor did it add much real value to our team. At best, it made us slightly more efficient, but at the cost of reduced maintainability and more complexity in the codebase. If you enjoy working in DevOps, there are still plenty of opportunities out there and likely more going forward.

https://redd.it/1q3ugf8
@r_devops
Another Helm Chart for Garage (MinIO Alternative for Homelabs & Small Deployments)

After MinIO abandoned the open-source project, I needed a new S3-compatible object store for my homelab. I tried the usual suspects (SeaweedFS, Ceph, etc.), but Garage stood out for its simplicity and focus on small, geo-distributed clusters.

I have published a Helm chart that goes way beyond the official one, making Garage a drop-in replacement for MinIO with a much smoother experience for Kubernetes users.

Repo: https://github.com/datahub-local/garage-helm1

What makes this Helm chart better than the official one?

1. Automated cluster configuration: No more manual CLI or YAML hacks. Just set your layout, buckets, and keys in values.yaml or secrets and a job will set them up for you.
2. Built-in WebUI: Deploy the Garage WebUI with a single flag for easy management.
3. Gateway API support: Native support for Kubernetes Gateway API (plus Ingress), so you’re ready for modern K8s networking.
4. Grafana dashboard & ServiceMonitor: Get instant metrics and dashboards out of the box.
5. Extra resources: Inject any custom K8s manifest (Secrets, ConfigMaps, etc.) directly via values.yaml.

Big thanks to #wittdennis — this chart is based on his original Helm chart for Garage!

If you’re looking for a MinIO alternative that’s actually open source and easy to run at home, give Garage (and this chart) a try. Feedback and PRs welcome!

https://redd.it/1q3utve
@r_devops
Those using GitLab + MS Teams - how do you handle MR notifications?

The native GitLab integration for Teams is pretty basic and Microsoft is retiring Office 365 connectors soon.


I've seen tools like PullNotifier for GitHub + Slack, but nothing similar for GitLab + Teams.


Anyone found a good solution for:

- Getting notified when assigned to review

- Avoiding channel spam from every commit/comment

- Tracking which MRs are still waiting for review?
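
One way to approach the first two points is a small webhook receiver that only reacts when someone is newly added as a reviewer and ignores everything else. The sketch below covers just the filtering step and works on a plain dict. The field names (`object_kind`, `changes.reviewers.previous/current`, `object_attributes.title/url`) reflect my understanding of GitLab's merge request webhook payload and should be treated as assumptions to verify against the GitLab docs; the sample event and URL are made up, and delivery to Teams is left out entirely.

```python
def newly_assigned_reviewers(event):
    """Return usernames that this webhook event added as reviewers."""
    if event.get("object_kind") != "merge_request":
        return []  # ignore pushes, comments, pipeline events, etc.
    change = event.get("changes", {}).get("reviewers")
    if not change:
        return []  # the MR was updated, but the reviewer list did not change
    before = {user["username"] for user in change.get("previous", [])}
    after = [user["username"] for user in change.get("current", [])]
    return [name for name in after if name not in before]


def build_notifications(event):
    """Build one short message per newly assigned reviewer."""
    mr = event.get("object_attributes", {})
    title = mr.get("title", "(untitled)")
    url = mr.get("url", "")
    return [
        f"@{name}: review requested on '{title}' {url}".rstrip()
        for name in newly_assigned_reviewers(event)
    ]


# Hypothetical, heavily trimmed payload used only to exercise the functions.
sample_event = {
    "object_kind": "merge_request",
    "object_attributes": {
        "title": "Fix login timeout",
        "url": "https://gitlab.example.com/group/app/-/merge_requests/7",
    },
    "changes": {
        "reviewers": {
            "previous": [{"username": "alice"}],
            "current": [{"username": "alice"}, {"username": "bob"}],
        }
    },
}
for message in build_notifications(sample_event):
    print(message)
```

Filtering on the reviewer change rather than on every merge request event is what keeps commit and comment noise out of the channel; tracking MRs still waiting for review would need some stored state on top of this.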


What's your workflow?

https://redd.it/1q3wxtu
@r_devops
What OS do you daily drive, and why?

I'm curious about what people working in the field use, and why you chose one OS over another.
Are there tools you've found that are only available on your distro of choice? Is it because of stability, or less bloat? Maybe it was the only option or you just like it?

https://redd.it/1q3zk3k
@r_devops