Reddit DevOps – Telegram
I built a tiny approval service to stop my cloud servers from burning money

I run a bunch of cloud servers for dev, testing, and experiments. Like everyone else, I’d forget to shut some of them down, burning money.


 I wanted automation to handle shutdowns safely, but every option felt heavy:

Slack bots
Workflow engines
Custom approval UIs
Webhooks and state machines

All I really wanted was a simple human approval before the cron job can shut down the server.

So I built ottr.run, a small service that turns approval into state, not an event.

The pattern is dead simple:

A script creates a one-time approval link
A human clicks approve
That click writes a value to a key/value store
The script is already polling and resumes

No callbacks, no webhooks, no OAuth, no long-running workers.
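
A minimal sketch of that polling loop in Python, assuming a generic key/value store rather than ottr.run's actual API (which isn't shown in this post). `InMemoryKV`, `new_approval_key`, `wait_for_approval`, and the `approval:` key format are hypothetical names used only for illustration:

```python
import time
import uuid


class InMemoryKV:
    """Stand-in for the real key/value store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def new_approval_key():
    # One-time key that would be embedded in the approval link.
    return f"approval:{uuid.uuid4().hex}"


def wait_for_approval(kv, key, timeout_s=3600.0, poll_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll the store until a decision is written or the timeout expires.

    Returns True only on an explicit "approved". A denial or a timeout
    returns False, so the safe default is to do nothing.
    """
    deadline = clock() + timeout_s
    while True:
        decision = kv.get(key)
        if decision is not None:
            return decision == "approved"
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In a cron job this would look roughly like: create the key, send the link to a human, then only run the shutdown if `wait_for_approval(kv, key)` returns `True`. Because the approval is stored state, the script can crash and restart and still pick up a click that already happened.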


This worked great for:

Auto-shutdown of idle servers
Risky infra changes
“Are you sure?” moments in cron jobs
Guardrails around cost-saving automations


Later I realized the same pattern applies to AI agents, but the original use case was pure DevOps: cheap, reliable human checkpoints for automation.





https://redd.it/1prkxvs
@r_devops
I built a small tool to turn incident notes into blameless postmortems — looking for DevOps feedback

Hey r/devops,



I built a small side project after getting tired of postmortems turning into political documents instead of learning tools.



After incidents we usually have:

- Slack threads
- timelines
- partial notes
- context scattered across tools



Turning that into a clean, exec-safe postmortem takes time and careful wording, especially if you’re trying to keep things blameless and system-focused instead of personal.



This tool takes raw incident notes and generates a structured postmortem with:

- Executive summary
- Impact
- Timeline
- Blameless root cause
- Action items



You can regenerate individual sections, edit everything, and export the full doc as Markdown to paste into Confluence / Notion / Docs. It’s meant as a drafting accelerator, not a replacement for review or accountability.



There’s a small free tier, then it’s $29/month if it’s useful. I’m mostly trying to sanity-check whether this solves a real pain for teams that write postmortems regularly.



Link: https://blamelesspostmortem.com



Genuinely interested in feedback from folks who actually run incidents:

- Does this match how you do postmortems?
- Where would this break down in real-world incidents?
- Would you ever trust something like this, even as a first draft?



https://redd.it/1prkfe8
@r_devops
Real-time location systems on AWS: what broke first in production

Hey folks,

Recently, we developed a real-time location-tracking system on AWS designed for ride-sharing and delivery workloads. Instead of providing a traditional architecture diagram, I want to share what actually broke once traffic and mobile networks came into play.

Here are some issues that failed faster than we expected:
- WebSocket reconnect storms caused by mobile network flaps, which increased fan-out pressure and downstream load instead of reducing it.
- DynamoDB hot partitions: partition keys that seemed fine during design reviews collapsed when writes clustered geographically and temporally.
- Polling-based consumers: easy to implement but costly and sluggish during traffic bursts.
- Ordering guarantees: after retries, partial failures, and reconnects, strict ordering became more of an illusion than a guarantee.

Over time, we found some strategies that worked better:
- Treat WebSockets as a delivery channel, not a source of truth.
- Partition writes using an entity + time window, rather than just the entity.
- Use event-driven fan-out with bounded retries instead of pushing everywhere.
- Design systems for eventual correctness, not immediate consistency.
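
To make the "entity + time window" point concrete, here is a rough Python sketch. The `<entity>#<window start>` key format, the 60-second window, and the function names are assumptions for illustration, not the exact schema we used:

```python
from datetime import datetime, timezone


def partition_key(entity_id: str, ts: datetime, window_s: int = 60) -> str:
    """Build a key of the form '<entity_id>#<window_start_epoch>'.

    All writes for one entity inside the same window share a key;
    the next window starts a new key.
    """
    epoch = int(ts.timestamp())
    window_start = epoch - (epoch % window_s)
    return f"{entity_id}#{window_start}"


def keys_for_range(entity_id: str, start: datetime, end: datetime,
                   window_s: int = 60) -> list[str]:
    """List every key a reader must query to cover [start, end]."""
    first = int(start.timestamp()) // window_s * window_s
    last = int(end.timestamp())
    return [f"{entity_id}#{w}" for w in range(first, last + 1, window_s)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(partition_key("driver-42", now))
```

The trade-off is on the read side: a query over a time range now has to touch one key per window, so the window size is a balance between write spread and read fan-out.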

I’m interested in how others handle similar issues:
- How do you prevent reconnect storms?
- Are there patterns that work well for maintaining order at scale?
- In your experience, which part of real-time systems tends to fail first?

Just sharing our lessons and eager to learn from your experiences.

https://redd.it/1proepg
@r_devops
KubeUser – Kubernetes-native user & RBAC management operator for small DevOps teams

Hey folks 👋

I’ve been working on an open-source project called **KubeUser** — a lightweight Kubernetes operator for managing user authentication, RBAC, and kubeconfigs using declarative custom resources. [github](https://github.com/openkube-hub/KubeUser)

It’s built for **small DevOps teams (1–10 people)** who don’t want to run **Keycloak, Dex, or a full IAM stack** just to give someone cluster access.

**What it does**

* Define Kubernetes users declaratively (`User` CRD)
* Generate client certificates via the Kubernetes CSR API
* Create RBAC bindings automatically
* Generate kubeconfigs as Kubernetes Secrets
* GitOps-friendly, Kubernetes-native, boring on purpose

No external IdP. No extra auth services. Just Kubernetes.

This isn’t trying to replace **Keycloak** — it’s focused on *simple, Kubernetes-native user lifecycle management*.

[https://github.com/openkube-hub/KubeUser](https://github.com/openkube-hub/KubeUser)

https://redd.it/1prqehq
@r_devops
Looking for Core Team & Contributors to Build a Non-Profit DevOps Community

Hey everyone 👋
Website: https://thedevopsworld.com
I’m building The DevOps World as a non-profit initiative registered in the Netherlands (chosen to make fundraising easier, though this can be decided by community vote). The goal is simple: create a strong, open community where people can learn, collaborate, ship real DevOps content, and help each other grow, without paywalls. We are 100+ members and growing.

Right now, we’re looking for teams / contributors from different communities who want to help shape and grow this from the ground up.

What we need help with (4 main areas)

1. Community building
onboarding members, moderation, events, partnerships, outreach

2. Content creation (code + blog posts)
hands-on labs, repositories, tutorials, project ideas, DevOps learning paths

blog posts, guides, tooling reviews, curated news, etc.

3. Website improvements for a growing community
role-based access, contributor workflows, community pages, playground integrations

4. Support on queries
answering questions, guiding learners, reviewing content, helping with troubleshooting

Core decision-making group

Alongside the above, we’re forming a core community that helps make key decisions: priorities, partnerships, governance, content direction, and how we grow sustainably.

Personally, I’d really like to see experienced DevOps professionals from different countries and walks of life in the core decision-making group, but it’s totally up to you.
Just tell me what interests you most: (1) community, (2) content, (3) website, (4) support, or core.

Funding / money models (open for discussion)

Because this is a non-profit, we’re not building this to “make money,” but we do need sustainable operations (hosting, tools, platform costs).
There are a few possible models (donations, sponsorships, partner contributions, affiliate-style models, etc.) — but we’ll only finalize anything after the community is formed, and it will be discussed openly with the core group.

Netherlands registration note

Since this initiative is based in the Netherlands and operates as a non-profit, we’re also looking for members who can register / participate from within the Netherlands (this helps with formal structure, legitimacy, and administration).

Partnerships update

I’ve already signed a collaboration deal with an AI company to provide a Cloud + AI playground for our community, and we have more partnerships in the queue. These will bring practical hands-on learning experiences to members.


---

If you’re interested, comment or DM with:

What area you want to contribute to (1–4 or core)

Your community/skills (optional)

If you’re based in the Netherlands (optional)


Let’s build something meaningful together. 🚀

https://redd.it/1prqhhp
@r_devops
GCP Professional Architect - LF course recommendations

For now I'm only following the GCP Learning Paths, looking more at AI and ML topics this year because it seems the exam changed recently and puts a lot of emphasis on GenAI with Vertex AI.

Has anyone taken the new exam and could recommend a Udemy/Coursera/other course to prepare for it, besides the learning paths and docs?


(P.S. I'm not from India. I think devops people like me have a lot of cloud experience and probably want to know a few providers' offerings; I'm mostly coming from the AWS stack.)

https://redd.it/1prq66n
@r_devops
How do DevOps teams reduce risk during AWS infrastructure changes?

I’ve noticed that in many small teams and startups, most production incidents happen during infrastructure changes rather than application code changes.
Even when using IaC tools like Terraform, issues still slip through — incorrect variables, missing dependencies, or last-minute console changes that bypass reviews.
For teams without a dedicated DevOps engineer, what processes or guardrails have actually worked in practice to reduce the blast radius of infra changes on AWS?
Interested in hearing what has worked (or failed) in real-world setups.

https://redd.it/1pred2p
@r_devops
Mods where are you?

95% of the posts here have 0 or fewer upvotes.

We want a place to talk DevOps. Not a place for 20-year-olds who want to get into DevOps and don't get that it's not an entry-level job.

And not a place for vendors to post AI slop...

https://redd.it/1prvlzh
@r_devops
Dynamic DevOps Roadmap

URL: https://devopsroadmap.io

Has anyone here tried this roadmap? If so, would you recommend it for a beginner? Also, I’m looking for a mentor / peer who can help with the problems / projects and offer constructive criticism (promise I won’t take it personally lol). For context, I’m a computer engineering undergrad (last year) and already familiar with basics like Linux, git, bash scripting, and Python.

P.S sorry for noob-posting.

https://redd.it/1ps13df
@r_devops
PCI DSS on AWS

Folks who work in the PCI domain, how do you deal with compliance when deploying services and resources on AWS using Terraform? What are the things you had to learn the hard way? What are some gotchas to look out for? I am currently in a hiring process for a role on a PCI DSS team and have never had to deal with PCI, so I'm curious to hear about your experiences.

Thank you.

https://redd.it/1ps5dxc
@r_devops
Best vps for ci/cd pipelines on a budget?

Our team is looking for a few vps instances to handle our ci/cd pipelines and a private docker registry. We have been looking at some of the newer providers that offer high ram and nvme storage because our builds are starting to get pretty heavy and the old sata drives just are not cutting it anymore. We need something with a solid network since we are pushing large images back and forth all day.

We are also considering some of the smaller players that seem to offer better specs for the same price point. Reliability is the biggest factor here because if the server goes down our whole dev workflow stops.

Has anyone tried some of the newer nvme focused providers recently? Are there any specific ones that handle high cpu load well without throttling? Would love to hear some real world experiences before we commit.



https://redd.it/1ps4twv
@r_devops
When people say “know what’s running,” it often gets interpreted as a philosophical or security-only concern.

I mean it very concretely.
A common scenario:
You inherit a system with monitoring, EDR, logging, dashboards
Everything is “green”
Nobody can clearly explain:
why certain services exist
which ones are intentional vs historical
what’s business-critical vs just still alive
who owns decisions made years ago
The system functions, alerts fire, CI/CD runs — but understanding has decayed faster than uptime.
In practice, I’ve found that most operational risk doesn’t come from missing tools, but from missing context.
Curious how others approach rebuilding that understanding without freezing delivery.

https://redd.it/1ps6wja
@r_devops
In law there’s the Magic Circle. What’s the real equivalent in tech?

In law there’s the Magic Circle. What’s the real equivalent in tech?

https://redd.it/1ps8cob
@r_devops
Career Trajectory

Hey everyone,

I’m looking for some honest career advice because I’m a bit unsure about my next step.

I have a bachelor’s in computer science and started my career in a DevOps engineer role for about 4 months, doing a mix of coding and ops. That project ended, and I moved into a system engineer role. I’ve been doing that for a little over a year now, working in a team of five on Linux and Windows servers for large clients.

My current work includes Ansible automation, kernel patching, OS upgrades, backups, troubleshooting, etc. I’ve learned a lot and built a solid base, but lately I feel like my learning curve is slowing down. Not bored, just not growing as fast as I’d like.

My long-term goal is to become a DevOps engineer in the next 3–4 years.

I now have an offer for a System Administrator role at another company, and I’m trying to figure out whether it’s a smart stepping stone or a potential detour. The title worries me a bit, but the actual responsibilities seem broader and more modern than my current role.

The role would involve:
• Working with Google Cloud Platform
• Managing on-prem infrastructure (Proxmox virtualization on Dell servers + Mac hardware)
• Docker for services and build processes
• Automation using Python and Ansible
• Ensuring reliable operation of IT systems (config management, infrastructure, integrations, and continuous improvements)
• Maintaining an office IT presence, hands-on user support, and onboarding/offboarding (hardware + accounts)
• Device management tools (Intune, NinjaOne, Mosyle)
• Supporting Linux, macOS, and Windows environments
• Contributing to security and compliance: patching, access controls, monitoring events, vulnerability remediation, and assisting with audits/access reviews alongside the security team
• Company-supported certifications (which my current company doesn’t offer)

On paper, this seems closer to DevOps fundamentals (cloud, automation, containers, infra ownership), but I’m still a bit concerned about drifting too far into end-user support or being labeled “just a sysadmin” long term.

For those who’ve gone from sysadmin → DevOps (or who hire DevOps engineers):
Does this sound like a good foundation for moving into DevOps in a few years, or a role that could slow that transition down if I’m not careful?

Thanks for any real-world insights.


I have rephrased this with AI since my English is not the best.


https://redd.it/1psafw5
@r_devops
Operational pain points of OTP/SMS systems?

I’m curious about OTP/SMS from an ops perspective.
If you’ve managed systems using Twilio or similar:
What operational risks showed up?
How did you monitor or control usage?
What caused alerts or panic moments?
Not promoting anything — genuinely interested in ops lessons.

https://redd.it/1psaxo4
@r_devops
Pipeline to search for new job opportunities

I live in Europe (EU citizen) in an LCOL country. I have a PhD and 2 YoE in a multinational company (DevOps). I'm thinking it's time to look for a new company, mostly for financial reasons.

I believe it's better to search for a fully remote position, most probably in the USA or a high-paying EU country.
Now I'm trying to set up a "pipeline" to do this efficiently. Time is not an issue since I already have a job.

My idea is:

1. Search linkedin for remote jobs. Any other source? Glassdoor maybe?

2. Find people at the most promising companies (the ones that posted a job) and reach out for inside info (what the company is like, what they're looking for, whether they'd give a referral, etc.)

3. Create a "big" version of my CV with most of the stuff I've done, regardless of job descriptions

4. Ask some AI tool (any suggestions?) to take the "big" CV and tailor it to the job description (supervised by me)

5. Apply to as many companies as I can in this targeted way (I don't like the one-CV-for-all approach).

General questions: What helped you approach USA/HCOL EU companies and get a job there?

What job application pipeline did you find works best (apart from networking, which is also something I plan to look into)?

https://redd.it/1ps9f1y
@r_devops
I'm so tired of using AI :/

I'm a senior devops engineer with 10+ years of experience. I'm at a company that uses PHP and a really old methodology for deployments. I've slowly been improving our workflows but my company really wants to use AI.

I've been using GitHub agents to automate a lot of our manual processes for onboarding new clients. Because we have clear processes for tasks I've found myself doing the following a lot:

- Given these 10 commits or 5 PRs use them as a template on how to create a new client space.
- Commits x-y show how we generate API keys and authorize them; can you generate an AGENTS.md file to document that process in a format where I can just tell you: "generate a new API key for company id #1234455"


My output due to AI has increased. But let's be real, I'm not programming, I'm not making .tpl files to fill in later, I'm just using our history to automate flows.

I miss solving complex issues. I miss working on issues where the answer isn't just "ask AI, leverage AI". I want to work on memory overflows and networking debugging and cdk/scripts, not giving Microsoft more money :/

https://redd.it/1psdwan
@r_devops
Friday night GPU spike hit $50k/day, shift-left governance fail, what tools prevent this chaos?

Got paged at 11pm Friday. GPU costs jumped to $50k/day from eng teams testing AI models. No quotas, no policies. We could've easily burned $200k by Monday. Spent much of my day manually killing instances, tagging everything.

This is our 3rd spike this quarter. We have no pre-deploy checks, no vuln-cost tying, no auto-enforcement on schedules/rightsizing. CloudHealth just shows the damage after the fact, and anomaly alerts fall on deaf eng ears.

I am here looking for advice before the next fire. What tools shift-left without turning my team into cloud cops? Would love to hear it all.

https://redd.it/1pshj2b
@r_devops
Which Infrastructure as Code tools are actually used most in production today?

I’m trying to understand real-world adoption, not just what’s popular in tutorials.

For teams running production workloads (AWS, GCP, Azure or multi-cloud):
- What IaC tool do you actually use day to day?
- Terraform / OpenTofu, CloudFormation, CDK, Pulumi, something else?
- And why did you choose it (team size, scale, compliance, velocity)?

Looking for practical answers, not marketing.

https://redd.it/1ps5058
@r_devops
Fast API with celery worker

Deployment strategy GitHub actions - ECS - EC2

EC2 2cpu - 4GB

Nginx serving front end less than 500mb

Fast API 1GB

Celery worker (fast api image )

The API has an upload requirement, but any time there’s an upload the FastAPI service restarts with exit code 137 (OOM, out of memory)…

File size 2kb
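
For anyone reading the 137: by the usual 128 + signal-number convention, it means the process was killed with signal 9 (SIGKILL), which is what the kernel OOM killer sends. A small Python sketch to decode it; `describe_exit_code` is just an illustrative helper, not part of any tool mentioned here:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exit code {code} (no matching signal on this platform)"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited on its own with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A SIGKILL on its own doesn't prove an OOM kill, so it's worth confirming against the container's stopped reason or the host's kernel log before tuning memory limits.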

https://redd.it/1psjbug
@r_devops
Why the hell do container images come with a full freaking OS I don't need?

Seriously, who decided my Go binary needs bash, curl, and 47 other utilities it'll never touch? I'm drowning in CVE alerts for stuff that has zero business being in production containers. Half my vulnerability backlog is noise from base image bloat.

Anyone actually using distroless or minimal images in prod? How'd you sell the team on it? Devs are whining they can't shell into containers to debug anymore but honestly that sounds like a feature not a bug.

Need practical advice on making the switch without breaking everything.

https://redd.it/1pskpsd
@r_devops