Reddit DevOps – Telegram
How did you get into DevOps and what actually mattered early on?

I’m learning DevOps right now and trying to be smart about where I spend my time.

For people already working in DevOps:

- What actually helped you get your first role?

- What did you stress about early on that didn’t really matter later?

- When did you personally feel “ready” for a job versus just learning tools?

One thing I keep thinking about is commands. I understand concepts pretty well, but I don’t always remember exact syntax. In real work, do you mostly rely on memory, or is it normal to lean on docs, old scripts, and Google as long as you understand what you’re doing?
I’m more interested in real experiences than generic advice. Would love to hear how it was for you.

https://redd.it/1q09vxa
@r_devops
Reflections on DevOps over the past year

This is more of a thinking-out-loud post than a hot take.

Looking back over the past year, I can’t shake the feeling that DevOps has gotten both more powerful and more fragile at the same time.

We have better tooling than ever:
- managed services everywhere
- more automation
- more abstraction
- AI creeping into workflows
- dashboards, alerts, pipelines for everything

And yet… a lot of the incidents I’ve seen still come down to the same old things.

Misconfigurations (still rampant at my company).
Shared failure domains that nobody realized were shared.
Deployments that technically “worked” but took the system down anyway (thinking of the AWS one specifically).
Observability that only told us what happened after users noticed.

It feels like we keep adding layers on top of systems without always revisiting the fundamentals underneath them.

I’ve been part of incidents where:
- redundancy existed on paper, but not in reality
- CI/CD pipelines became a bigger risk than the code changes themselves (felt this personally since our team took control of the cloud pipelines at my company)
- costs exploded quietly until someone finally asked “why is this so expensive?”
- security issues weren’t exotic attacks — just permissions that were too broad

None of this is new. But it feels more frequent, or at least more visible.

I’m genuinely curious how others see it:
- Do you feel like the DevOps role is shifting?
- Are we actually solving different problems now, or just re-solving the same ones with new tools?
- Has the push toward speed and abstraction made things easier… or just harder to reason about?

Not looking for definitive answers — just interested in how others experienced this past year.

https://redd.it/1q0cvl1
@r_devops
How do you track your LLM/API costs per user?

Building a SaaS with multiple LLMs (OpenAI, Anthropic, Mistral) + various APIs (Supabase, etc).

My problem: I have zero visibility on costs.

* How much does each user cost me?
* Which feature burns the most tokens?
* When should I rate-limit a user?

Right now I'm basically flying blind until the invoice hits.

Tried looking at Helicone/LangFuse but not sure I want a proxy sitting between me and my LLM calls.

How do you guys handle this? Any simple solutions?
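
One proxy-free option is to record usage yourself right after each call, since most LLM APIs report token counts in their responses. Below is a minimal sketch of that idea, not a drop-in solution: the `CostLedger` class, the model names, and the prices in `PRICES` are all hypothetical placeholders, and in a real app the totals would go to a database rather than memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1M tokens; replace with your providers' real rates.
PRICES = {
    "model-a": {"input": 1.00, "output": 2.00},
    "model-b": {"input": 3.00, "output": 15.00},
}


@dataclass
class CostLedger:
    """In-memory running totals of LLM spend, keyed by user and by feature."""

    by_user: dict = field(default_factory=lambda: defaultdict(float))
    by_feature: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, user_id, feature, model, input_tokens, output_tokens):
        """Add one call's cost to the totals and return that cost in USD."""
        price = PRICES[model]
        cost = (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000
        self.by_user[user_id] += cost
        self.by_feature[feature] += cost
        return cost

    def over_budget(self, user_id, budget_usd):
        """True once a user's accumulated spend reaches the budget."""
        return self.by_user[user_id] >= budget_usd


ledger = CostLedger()
# Token counts would come from the usage data in each provider response.
ledger.record("user-1", "chat", "model-a", input_tokens=1_000_000, output_tokens=500_000)
ledger.record("user-1", "summarize", "model-b", input_tokens=100_000, output_tokens=10_000)
print(dict(ledger.by_user))
print(dict(ledger.by_feature))
print(ledger.over_budget("user-1", budget_usd=2.0))
```

Calling `record` from one wrapper around every LLM call answers all three questions at once: `by_user` gives cost per user, `by_feature` shows which feature burns the most, and `over_budget` is a simple rate-limit trigger.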

https://redd.it/1q0ecii
@r_devops
I have been working on a self-hosted GitHub Actions runner orchestrator

Hey folks,

I have been working on CIHub, an open-source project that lets you run self-hosted GitHub Actions runners on your own bare-metal servers using Firecracker. Each job runs in its own isolated VM for better security.

It integrates directly with standard GitHub Actions workflows, allowing you to specify runner resources (e.g. by adding the label runs-on: cihub-2cpu-4gb-amd64), and includes a server + agent setup for scaling across machines.
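
To illustrate how a label like that can carry a resource request, here is a small sketch that parses it into CPU count, memory, and architecture. This is not CIHub's actual code: the `RunnerSpec` type, the `parse_runner_label` function, and the assumed `cihub-<N>cpu-<M>gb-<arch>` pattern are inferred from the single example label above.

```python
import re
from typing import NamedTuple


class RunnerSpec(NamedTuple):
    cpus: int
    memory_gb: int
    arch: str


# Assumed label shape, inferred from the one example: cihub-<N>cpu-<M>gb-<arch>
LABEL_RE = re.compile(r"^cihub-(\d+)cpu-(\d+)gb-([a-z0-9]+)$")


def parse_runner_label(label: str) -> RunnerSpec:
    """Turn a runs-on label into the VM resources it asks for."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized runner label: {label!r}")
    return RunnerSpec(int(match.group(1)), int(match.group(2)), match.group(3))


print(parse_runner_label("cihub-2cpu-4gb-amd64"))
```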

The project is still early and under active development, and I'd really appreciate any feedback or ideas!

GitHub: https://github.com/getcihub/cihub

https://redd.it/1q0gh41
@r_devops
Boss conflict with Scrum Relations during Christmas (Xmas-Nondenominational winter-solstice festivities) Holiday Season - PSU Course Focus

Hi all, hope you're enjoying Christmas (Xmas-Nondenominational winter-solstice festivities). Wanted to hear your thoughts on this situation. My boss and I were passive-aggressively arguing during the latest sprint meeting about new operation methodologies leading into Q1 of 2026. Background: as the scrum master of my sector, we currently operate with a 70% interest towards improving ART (Agile Release Train) performance, a 25% interest in current burndown navigation rounds, a 3.8% (tl;dr: this is calculated by total story points over an averaged period of time over three to four quarters divided by total confidence metric), and a 1.3% interest in handling "team issues" (story point assignment, workplace relationships, failed deadlines, simple stuff like that). My boss believes we should average out the interest relationship at 5% (tl;dr: this is calculated by total story points over an averaged period of time over three to four quarters divided by total confidence metric) rather than 3.8%. The internet is telling me this is due to a knowledge deficit caused by my non-acquisition of USUX scrum focus within the PSU scrum course. (I will admit, I was watching the newest Marvel movie (Fantastic Four, anyone???) and planning my Disney vacation while taking that part of the course. I tried getting my partner to screen record, but they were getting the new booster vaccine.)



Has anyone run into something similar in regard to priority assignments? Why specifically at the end of the year (for Gregorian calendar users) and not the end of the fiscal year (for American taxpayers)? Also, what scrum cert would you recommend for a 15-year-old child who is interested in turning his startup into a fully functioning scrum environment?

https://redd.it/1q0iwlm
@r_devops
why does metric high cardinality break things

Wrote a post on where I have seen people struggle with high cardinality and what can be done to avoid such scenarios. Any other tips you folks have seen that work well? https://last9.io/blog/why-high-cardinality-metrics-break/
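
For anyone new to the term, the core arithmetic is that a metric can produce up to one time series per combination of label values, so the worst case is the product of each label's cardinality. The sketch below only illustrates that multiplication; the label names and counts are made up, and real series counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a single HTTP request metric.
base = {"method": 5, "status": 6, "endpoint": 40}
with_user = {**base, "user_id": 10_000}

print(series_count(base))       # 1200
print(series_count(with_user))  # 12000000
```

Adding one unbounded label such as a user ID multiplies everything already there, which is why a single label change can take a metric from manageable to millions of series.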

https://redd.it/1q0kaqi
@r_devops
On-call / Ops folks: what actually happens when a deployment breaks at 2 AM?

Hi everyone,
I’m doing research to better understand real on-call / operations workflows, especially around deployments, rollbacks, and incident handling.
This is not a product pitch and I’m not selling anything.
I’m trying to learn from people who actually handle production responsibility.
If you’re involved in:
- deployments
- rollbacks
- uptime monitoring
- on-call rotations
I’d really appreciate your input.
You can reply publicly or DM if you prefer.




Questions
1. What happens when a deployment goes wrong in production?
(Step by step — alerts, decisions, actions)
2. Who usually decides to rollback, and how fast does that happen?
3. What tools are you actively using during an incident?
(CI/CD, monitoring, logs, scripts, manual steps)
4. What part of this process is the most stressful or error-prone?
5. What happens if the main on-call person is unavailable?
6. Is there anything you wish was automated — but currently isn’t? Why?
7. What would you never trust automation to do?
8. (Optional) How often does a bad deploy cause customer impact?




Thanks in advance — I’m genuinely trying to understand how this works in the real world.

https://redd.it/1q0lkb4
@r_devops
How do you prove Incident response works?

We have an incident response plan, on call rotations, alerts and postmortems. Now that customers are asking about how we test incident response, I realized we’ve never really treated it as something that needed evidence.
We handle incidents and we do have evidence like log files/hives/history etc but I want to know how to collect them faster and on a daily basis so they can be more presentable.
What should I show besides screenshots, and does "the more the merrier" apply to this kind of evidence?
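
One kind of evidence that is easy to regenerate daily is a summary computed from incident timestamps, such as mean time to acknowledge and mean time to resolve. Here is a rough sketch of that calculation. The incident records and field names (`alerted`, `acknowledged`, `resolved`) are invented for illustration; in practice they would be exported from whatever paging or ticketing tool you use.

```python
from datetime import datetime
from statistics import mean

# Made-up sample records with hypothetical field names.
incidents = [
    {
        "alerted": "2025-01-10T02:00:00",
        "acknowledged": "2025-01-10T02:04:00",
        "resolved": "2025-01-10T02:34:00",
    },
    {
        "alerted": "2025-02-03T14:00:00",
        "acknowledged": "2025-02-03T14:06:00",
        "resolved": "2025-02-03T14:26:00",
    },
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def summarize(records):
    """Incident count plus mean time to acknowledge and to resolve, in minutes."""
    return {
        "count": len(records),
        "mtta_minutes": mean(
            minutes_between(r["alerted"], r["acknowledged"]) for r in records
        ),
        "mttr_minutes": mean(
            minutes_between(r["alerted"], r["resolved"]) for r in records
        ),
    }


print(summarize(incidents))
```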

Any input helps ty!

https://redd.it/1q0nrdy
@r_devops
PostHog vs BetterStack

I'm moving off Sentry. Just underwhelmed with the value.

I'm an indie dev.

PostHog and Better Stack seem to be two of the best options under $50/mo.

Anyone tried both or either of them and have any insight they can share?

https://redd.it/1q0spin
@r_devops
Best DevOps roadmaps for 2025/26?

I’m a student who has been trying to get into DevOps for the past year or so, but I’m having a hard time getting started.

I’ve worked on a lot of projects with .NET mainly for school and whatnot, I’ve also had to learn some React and Flutter throughout my journey.

I’ve really liked the concept of DevOps for a while now, and usually I’ve learned a lot of the stuff I know about software engineering in general through courses, roadmaps and personal projects.

There is a really popular roadmap site which I like to browse through sometimes (not sure if mentioning it would be considered an ad, so I’d best avoid it), but it doesn’t feel complete.

I tried YouTube tutorials, but most of them feel very forced in their way of teaching and are probably sponsored by a course provider anyway.

So my question to the community: is there a proven and tested source for an optimal DevOps roadmap in 2025 (heading into 2026)? So far I’ve peeked into Docker and gotten comfortable with Linux, but project-based learning isn’t easy for me here, since you need some general knowledge of what problems DevOps actually solves. I don’t struggle with finding projects for technology I already know, because I know what it can and can’t do. But I’m barely touching the tip of the iceberg here! DevOps seems like such a huge rabbit hole, but it seems very interesting and I do want to learn more about it.

All help is much appreciated!

https://redd.it/1q0u9fg
@r_devops
Defensive CI/CD & IaC pre-commit scanner (Bash) — seeking abuse-case feedback

I built a defensive pre-commit security scanner in Bash focused on overlooked attack surfaces (static sites, IaC, CI/CD). Looking for threat-model and abuse-case review—not validation or promotion.


Zimara_v0.49.5

https://redd.it/1q0xtjr
@r_devops
Intermediate DevOps Project Ideas: Looking for Suggestions to Tie My Skills Together (AWS, Docker, Jenkins, etc.)

Hey r/devops,

I've been diving deeper into DevOps over the past year and feel like I've got a solid grasp on a bunch of tools, but now I want to put them into a real-ish project to solidify everything and have something cool for my portfolio/learning.

Here's what I've learned/practiced so far:

- AWS: EC2, ECS (Fargate mostly), S3, IAM, RDS, VPC
- Linux shell scripting
- Docker (containerizing apps)
- Jenkins (pipelines, plugins)
- SonarQube (code quality)
- Trivy (image scanning)
- GitLab (repos, basic CI)
- Ansible (playbooks, config management)

I haven't touched Terraform or Kubernetes yet (planning to start Terraform soon), so ideally something that doesn't require those.

I'm thinking something like a full CI/CD pipeline for a simple web app (maybe a Flask/Node todo app with RDS backend): GitLab -> Jenkins build/scan/push to ECR -> Ansible to deploy/update ECS service, with proper IAM/VPC security, etc.

But I'm open to better/more realistic ideas! What projects have helped you level up at this stage? Bonus if it's something that mimics real-world workflows without being too basic (not just a "hello world" deploy).

Appreciate any suggestions, resources, or even "don't do X because Y" advice. Thanks in advance!

https://redd.it/1q0zd6v
@r_devops