Fix deploy bugs before they land: a semantic firewall for devops + grandma clinic (beginner friendly, MIT)
last week i shared a deep dive on failure modes in ai stacks and got great feedback here. a few folks asked for a simpler, beginner friendly version for devops. this is that post. same math idea, plain language. the trick is simple. instead of patching after a bad deploy, you install a tiny semantic firewall before anything runs. if the state is unstable, it loops, narrows, or refuses. only a stable state is allowed to execute.
why you should care
after style: output happens then you scramble with rollbacks and quick fixes. the same class of failure returns with a new shape.
before style: a pre-output gate inspects state signals first. if boot order is wrong, a lock is pending, or the first call will burn, it stops early. fixes become structural and repeatable.
what this looks like in devops terms
No.14 bootstrap ordering. hot pan before eggs. readiness probes pass, caches warmed, migrations staged.
No.15 deployment deadlock. decide who passes the narrow door. total order, timeouts and backoff, fallback path.
No.16 pre-deploy collapse. wash the first pot. versions pinned, secrets present, tiny canary first.
No.8 debugging black box. recipe card next to the stove. every run logs which inputs and checks created the output.
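the "timeouts and backoff" part of No.15 can be sketched as a small retry helper. this is a minimal sketch, not part of the original script: the name retry_with_backoff and its argument order are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '[gate][fail] gave up after %s attempts: %s\n' "$attempt" "$*" >&2
      return 1
    fi
    printf '[gate] attempt %s failed, retrying in %ss\n' "$attempt" "$delay" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

one way to use it is to wrap a probe that already has its own timeout, for example retry_with_backoff 3 2 curl -m 5 -fsS "$HEALTH_URL/write-probe". the per-call timeout bounds each try, the attempt cap bounds the whole wait, and a non-zero return is where the fallback path would take over.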
quick demo. add a pre-output gate to ci
paste this into a repo as preflight.sh and call it from your pipeline. it fails fast with a clear reason.
#!/usr/bin/env bash
set -euo pipefail
say() { printf "[preflight] %s\n" "$*"; }
fail() { printf "[preflight][fail] %s\n" "$*" >&2; exit 1; }
# 1) bootstrap order
say "checking service readiness"
kubectl wait --for=condition=available --timeout=90s deploy/app || fail "app not ready"
kubectl wait --for=condition=available --timeout=90s deploy/db || fail "db not ready"
say "warming cache and index"
curl -fsS "$WARMUP_URL/cache" || fail "cache warmup failed"
curl -fsS "$WARMUP_URL/index" || fail "index warmup failed"
# 2) secrets and env
say "checking secrets"
[[ -n "${API_KEY:-}" ]] || fail "missing API_KEY"
[[ -n "${DB_URL:-}" ]] || fail "missing DB_URL"
# 3) migrations have a lane
say "ensuring migration lane is clear"
flock -n /tmp/migrate.lock -c "echo locked" || fail "migration lock held"
./migrate --plan || fail "migration plan invalid"
./migrate --dry || fail "migration dry run failed"
# 4) deadlock guards
say "testing write path with timeout"
curl -m 5 -fsS "$HEALTH_URL/write-probe" || fail "write probe timeout likely deadlock"
# 5) first call canary
say "shipping tiny canary"
resp="$(curl -fsS "$API_URL/ping?traffic=0.1")" || fail "canary failed"
grep -q '"ok":true' <<<"$resp" || fail "canary not ok"
say "preflight passed"
github actions wiring. run preflight before real work.
name: release
on: [push]
jobs:
  ship:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: setup
        # assumes WARMUP_URL is already defined for this job (for example through an env: block)
        run: echo "WARMUP_URL=$WARMUP_URL" >> "$GITHUB_ENV"
      - name: pre-output gate
        run: bash ./preflight.sh
      - name: deploy
        if: ${{ success() }}
        run: bash ./deploy.sh
kubernetes job with a gate. refuse if gate fails.
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-gate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: your-image:tag
          command: ["/bin/bash","-lc"]
          args:
            - |
              ./preflight.sh || { echo "blocked by semantic gate"; exit 1; }
              ./run-task.sh
minimal “citation first” for runbooks
the same idea works for human steps. put the card on the table before you act.
runbook step 2 – change feature flag
require: ticket id + monitoring link
refuse: if ticket or dashboard missing, do not flip flag
accept: when both are pasted and the dashboard shows baseline stable for 2 minutes
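the require / refuse / accept card above can also be checked by a script before anyone flips the flag. a minimal sketch, assuming the operator passes the ticket id, the dashboard link, and how many seconds the baseline has been stable. the function name runbook_gate and the example values are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# runbook_gate TICKET_ID DASHBOARD_URL STABLE_SECONDS
# hypothetical helper: prints "accept" and returns 0 only when all evidence is present,
# otherwise prints "refuse: <reason>" and returns 1.
runbook_gate() {
  local ticket="${1:-}" dashboard="${2:-}" stable_seconds="${3:-0}"
  if [ -z "$ticket" ]; then
    echo "refuse: ticket id missing"
    return 1
  fi
  case "$dashboard" in
    http://*|https://*) ;;
    *) echo "refuse: monitoring link missing"; return 1 ;;
  esac
  # 120 seconds matches the "baseline stable for 2 minutes" rule on the card
  if [ "$stable_seconds" -lt 120 ]; then
    echo "refuse: baseline stable for ${stable_seconds}s, need 120s"
    return 1
  fi
  echo "accept"
}
```

the script only checks that the evidence was supplied. it does not verify that the ticket exists or that the dashboard is really stable, so a human still reads the dashboard.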
what changes after you add the gate
you stop guessing. every failure maps to a number and a fix you can name.
fewer rollbacks.
first call failures are caught on the canary.
fewer flaky deploys. boot order and locks are tested up front.
black box debugging ends. each release has a small trace that explains why it was allowed to run.
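the "small trace" can be as simple as one line per check. a minimal sketch, assuming a plain text file is enough. record_check and the TRACE_FILE path are names made up here, they are not in the original preflight.sh.

```shell
#!/usr/bin/env bash
set -eu

# TRACE_FILE is a hypothetical path; point it at wherever your pipeline keeps artifacts.
TRACE_FILE="${TRACE_FILE:-./preflight-trace.log}"

# record_check NAME COMMAND [ARGS...]
# hypothetical helper: runs COMMAND, appends "<name> pass" or "<name> fail" to the trace,
# and returns 0 on pass, 1 on fail.
record_check() {
  local name="$1"
  shift
  if "$@"; then
    printf '%s pass\n' "$name" >> "$TRACE_FILE"
    return 0
  fi
  printf '%s fail\n' "$name" >> "$TRACE_FILE"
  return 1
}
```

wrapping each gate step this way, for example record_check secrets test -n "${API_KEY:-}", leaves a file that lists which checks let the release through. keeping that file as a build artifact is one option.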
how to try this in 60 seconds
1. copy preflight.sh into any pipeline or cron job.
2. set three env vars and one canary endpoint.
3. run. if it blocks, read the message, not the logs.
if you want the plain language guide
there is a beginner friendly “grandma clinic” that explains each failure as a short story plus the minimal fix. the labels above map to these numbers. start with No.14, No.15, No.16, No.8. if you need the doctor style prompt that points you to the exact page, ask and i can share it.
faq
q. do i need to install a platform or sdk
a. no. this is shell and yaml. it is a reasoning guard before output. you can keep your stack.
q. will this slow down release
a. it adds seconds. it removes hours of rollback and root cause churn.
q. can i adapt this for airflow, argo, jenkins
a. yes. drop the same gate into a pre step. the checks are plain commands.
q. how do i know it actually worked
a. acceptance targets. you decide them. at minimum require readiness passed, secrets present, no lock held, canary ok. if these hold three runs in a row, the class is fixed.
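the "three runs in a row" rule is easy to check mechanically. a minimal sketch, assuming each preflight run appends one line, pass or fail, to a history file. that file format and the name class_fixed are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -eu

# class_fixed HISTORY_FILE [REQUIRED_STREAK]
# hypothetical helper: HISTORY_FILE holds one line per run, "pass" or "fail", oldest first.
# succeeds only when the most recent REQUIRED_STREAK runs (default 3) all passed.
class_fixed() {
  local history="$1" required="${2:-3}"
  local recent total passed
  recent="$(tail -n "$required" "$history")"
  # grep -c exits non-zero when the count is 0, so "|| true" keeps set -e quiet
  total="$(printf '%s\n' "$recent" | grep -c . || true)"
  passed="$(printf '%s\n' "$recent" | grep -c '^pass$' || true)"
  [ "$total" -eq "$required" ] && [ "$passed" -eq "$required" ]
}
```

a history with fewer than three runs does not count as fixed, which keeps a single lucky pass from closing the issue.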
q. we also run ai agents to modify infra. does the same idea work
a. yes. add “evidence first” to the agent. tool calls only after a citation or a runbook page is present.
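for agents, "evidence first" can be enforced outside the model with a wrapper around the tool call. a minimal sketch, assuming the agent writes its citation or runbook reference to a file before asking for a tool. evidence_first and the file convention are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# evidence_first EVIDENCE_FILE COMMAND [ARGS...]
# hypothetical helper: runs COMMAND only if EVIDENCE_FILE exists and is non-empty,
# otherwise refuses and returns 1.
evidence_first() {
  local evidence="$1"
  shift
  if [ ! -s "$evidence" ]; then
    printf '[gate][refuse] no evidence at %s, not running: %s\n' "$evidence" "$*" >&2
    return 1
  fi
  "$@"
}
```

this checks only that something was cited, not that the citation is correct. it is a cheap first gate, and review of the evidence itself is a separate step.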
q: where is the plain language guide
a: “Grandma Clinic” explains the 16 common failure modes with tiny fixes. beginner friendly.
link:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/GrandmaClinic/README.md
closing
this feels different because it is not a patch zoo after the fact. it is a small refusal engine before the fact. once a class is mapped and guarded, it stays fixed. Thanks for reading my work
https://redd.it/1nj1okk
@r_devops
Are these types of DevOps interview questions normal for fresher/junior roles, or was this just overkill?
Hey everyone,
I recently gave a DevOps interview through Alignerr (AI-based assessment), and I honestly came out feeling like I got cooked. 🥲
The questions were way harder than I expected for a fresher/junior role. Some examples:
Identifying port 22.
How to separate broad staging and dev environments from a large Terraform configuration file.
Handling configs for multiple environments with variables.
Dealing with things bound to 0.0.0.0 and what policies you’d set around that.
General stuff about modules and structuring one big configuration.
Integrating Sentinel with a CI/CD pipeline.
I was expecting more “Terraform init/plan/apply” level or maybe some AWS basics, but these felt like senior-level production questions.
https://redd.it/1nj73nc
@r_devops
Help me give my career some direction
I am a 2021 B.Tech IT graduate from a private college in Manipal, India.
My career has been a mess ever since. Soon after graduation I went to the US to pursue a master's in 2021, but I didn't complete my degree and returned to India after about 6 months. Then I went back to the US 6 months later and returned again after about 3 months. So overall I spent about 2 years gaining nothing and going back and forth between India and the US. I also accumulated some debt in the process. The reason for this flip-flop was some untreated mental health issues.
After returning to India for the second time in 2023, and after an extensive search, I finally found a DevOps Engineer job at a firm in Bengaluru. The salary was good while the job lasted (15 LPA or $17k/year), but layoffs hit in 2024. I was lucky to find another job in Bengaluru which paid the same, but the thing is I never learned core DevOps skills: cloud management, Kubernetes, CI/CD pipelines etc. For 2 years I have been working only on Python and Bash based programs and scripts.
Now I am willing to pursue some certifications to aim for higher packages. Certified Kubernetes Administrator and AWS DevOps Engineer Professional are the ones I am targeting. But I am unsure if they will lead to higher packages at all. Most DevOps jobs in India are in WITCH-like consulting companies. I am unsure how to aim for product-based companies, especially in the current environment, when there are no jobs anywhere.
Should I try to switch to development, which seems so risky in the age of AI?
TL;DR: I am a lost engineer, currently employed but looking for ways to increase my compensation. Please help me give my career some direction. I have wasted a lot of time but I am still only 26, and have many years ahead of me.
https://redd.it/1nj7jd4
@r_devops
How are you keeping CI/CD security from slowing down deploys?
Our pipeline runs Terraform + Kubernetes deploys daily. We’ve got some IaC linting and container scans in place, but it feels like every added check drags the cycle out. Security wants more coverage, but devs complain every time scans add minutes.
How are you balancing speed and security here? Anyone feel like they’ve nailed CI/CD security without breaking velocity?
https://redd.it/1nj8ysz
@r_devops
How to get real-time experience with Rest Assured?
Hey everyone,
I’ve learned Rest Assured and Postman from YouTube and other online resources, but I don’t have any real-time industry experience using them.
From what I understand, Postman is mostly about validating status codes, response bodies, and response data. But I’m curious — how do companies actually use Rest Assured in real projects?
Also, if I want to practice and improve my skills, what kind of test cases should I automate beyond the basics? Any ideas on good sample APIs or projects to work on would be super helpful.
Thanks!
https://redd.it/1nja5q0
@r_devops
Cut our hiring process from 6 weeks to 2: here’s what changed
We were losing great candidates because our process dragged: six rounds, take-homes, endless scheduling. By the time we made an offer, they were gone.
The breakthrough was ditching fragmented evaluations. Now, we run one in-depth technical session where candidates work on a real problem we’re facing. The team is in the room, asking questions, giving context. Candidates love it when they get a peek into our environment, and we see how they think under realistic conditions.
Services like paraform help keep the pipeline moving, but the real change was shifting from “testing everything” to evaluating real performance.
https://redd.it/1njbyc2
@r_devops
Want to switch from Testing (3 YOE) to DevOps – Need guidance, roadmap, and resources
Hey everyone,
I’ve been working as a tester for almost 3 years, and I’m considering switching to DevOps. I know some basics of Jenkins and a bit about CI/CD pipelines, but I’m not very confident yet.
Recently, I’ve seen a lot of LinkedIn posts and articles saying that DevOps is booming and offers great opportunities. Is this really true right now?
If yes, could you please guide me on:
1. Where to start – which DevOps tools/concepts to learn first.
2. A roadmap to move from testing to DevOps step-by-step.
3. Study material/resources (courses, books, or projects) to learn and practice.
My goal is to become skilled enough to transition into a DevOps role. Any advice from people who have made this switch or are working in DevOps would be super helpful!
Thanks in advance 🙏
https://redd.it/1njavo8
@r_devops
tips for preparing for a devops course
hello everyone,
in a month im going to start a pretty intense course in devops, a course for people with a little bit of background in code and IT, meaning we wont start completely from scratch.
looking for tips on how to prepare.
I used to work in IT, and studied a python course in uni (mostly basic concepts and medium-hard leetcode).
I have a good base for networks, operating systems (windows from IT, and linux from studying online and using it daily).
most people I asked told me that networking, python and linux are the base of everything devops, though I feel like these are my strong sides, problem is, how do I know? I do leetcode in python, but how would one truly know he knows enough about linux and networking? how do I practice?
I just completed courses on udemy on ansible, jenkins, and docker, but how does one practice to make sure he actually knows his way around them? I dont like the concept of studying by just listening to the guy talk, with no real confidence that I actually understood anything he said.
the udemy course had practice labs on kodekloud which were nice but i've done them all, and I feel like they mostly checked my understanding on syntax and commands, its not checking my understanding of what these tools do and why im doing what im doing.
any tips for how to practice? and any other tips are welcome!
https://redd.it/1njcwc0
@r_devops
What was the 'killer feature' for your IDP?
I'm making a shopping list for IDP features and a loose roadmap. I'm curious to hear from those who have built/bought a "Platform" - what feature added the most value for your developers/infra teams? Was it something that people were asking for or something you didn't expect? Our objective is to build velocity, so less dev time mucking about trying to find which infra team should be helping them, and faster time to new app creation.
WHAT COULD GO WRONG?!?
https://redd.it/1njgpmt
@r_devops
The Ultimate SRE Reliability Checklist
A practical, progressive SRE checklist you can actually implement. Plain explanations. Focus on user impact. Start small, mature deliberately.
https://oneuptime.com/blog/post/2025-09-10-sre-checklist/view
https://redd.it/1njbgeh
@r_devops
From AWS Intern to Remote-Ready Cloud Engineer: Looking for Guidance
Hey everyone. I recently completed a cloud support engineering internship at AWS, where I learned how to handle global support cases involving EC2, IAM, and VPC. I also got greater exposure to Linux and highly available web applications, and developed a strong understanding of security, governance, and compliance principles.
I'm AWS SAA-C03, AIF-C01, and CCNA certified, and have solid hands-on skills in cloud diagnostics, CLI tooling, and automation.
I'm now looking to pivot into remote work — ideally with startups or dev shops where I can contribute to infrastructure support, observability, or AI ops. I’m based in Kenya, with strong internet and power, and comfortable working US/EU hours.
Would love to hear from anyone who’s hired globally or transitioned from a support background into DevOps or infra roles.
Any advice, referrals, or critique of my approach would be hugely appreciated!
Happy to DM my CV or portfolio if helpful 🙏
https://redd.it/1njotqk
@r_devops
Our postmortem process is broken and I'm tired of pretending it's not
Another week, another "we learned from this incident" postmortem that nobody will read in 3 months when the exact same thing breaks again.
Spent 2 hours today writing up why our auth service went down. Same root cause as January. Same action items. Same people nodding along saying "yeah we should definitely fix that technical debt."
I'm so tired of this cycle. Write detailed report, assign action items, everyone's too busy, repeat when it breaks again.
The worst part? Leadership keeps asking why we haven't "learned" from previous incidents. Bro, we learned plenty - we just don't have time to actually implement any of the fixes because we're constantly firefighting new issues.
Our retros have become these long documents that check boxes for compliance but don't actually prevent anything. Half the action items from last quarter are still sitting in our backlog with no owner and no deadline that means anything.
Been thinking we need something that actually tracks this stuff automatically and keeps the retros short instead of these novel-length reports. Maybe pulls in similar past incidents so we stop pretending this is the first time our Redis cache decided to take a nap.
Anyone found tools that actually make postmortems useful instead of just another Jira ticket graveyard?
https://redd.it/1njrnlb
@r_devops
Another week, another "we learned from this incident" postmortem that nobody will read in 3 months when the exact same thing breaks again.
Spent 2 hours today writing up why our auth service went down. Same root cause as January. Same action items. Same people nodding along saying "yeah we should definitely fix that technical debt."
I'm so tired of this cycle. Write detailed report, assign action items, everyone's too busy, repeat when it breaks again.
The worst part? Leadership keeps asking why we haven't "learned" from previous incidents. Bro, we learned plenty - we just don't have time to actually implement any of the fixes because we're constantly firefighting new issues.
Our retros have become these long documents that check boxes for compliance but don't actually prevent anything. Half the action items from last quarter are still sitting in our backlog with no owner and no deadline that means anything.
Been thinking we need something that actually tracks this stuff automatically and keeps the retros short instead of these novel-length reports. Maybe pulls in similar past incidents so we stop pretending this is the first time our Redis cache decided to take a nap.
Anyone found tools that actually make postmortems useful instead of just another Jira ticket graveyard?
RetryI
no dashes
Edit
https://redd.it/1njrnlb
@r_devops
Is it time to learn Kubernetes? - Zero Downtime Deployment with Docker
Hey Reddit, I've been stuck trying to achieve zero downtime deployment for a few weeks now, to the point I'm considering learning proper container orchestration (K8s). It's a web stack (Laravel, Nuxt, a few microservices) and what I have now works, but I'm not happy with the downtime... Any advice from some more experienced DevOps engineers would be much appreciated!
What I want to achieve:
Deployment to a dedicated server running Proxmox - managed hosting is out of the question
Continuous deployment (repo/registry) with rollbacks and zero downtime
Notifications for deployment success/failure
Simplicity and automation - the ability to push a commit from anywhere and have it go live
What I have currently:
Docker compose (5 containers)
GitHub Actions that build and publish to GHCR
Watchtower to pull and deploy images
Reverse proxy CT that routes via bridge to other CTs (e.g. 10.0.0.11:3000)
~80 env vars in a file on the server(s), mounted to the containers and managed via ssh
What I've tried:
Swarm for rolling updates with Watchtower
Blue/green with nginx upstream
Coolify/Dokploy (traefik)
Kamal
Nomad
Each of the above had pros and cons. Nginx had downtime. I don't want to trigger a deployment from the terminal. I don't need all the features of Coolify. Swarm had DNS/networking issues even when using `advertise-addr`...
Am I missing an obvious solution here? Docker is awesome but deploying it as a stack seems to be a nightmare!
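For the blue/green attempt, the ordering that matters can be sketched as below: start the idle color, wait until it reports healthy, and only then move traffic. This is a sketch under assumptions, not the poster's setup. `blue_green_switch` and the `is_healthy` callable are hypothetical names, and the actual proxy update (rewriting the nginx upstream and reloading) is deliberately left out.

```python
import time
from typing import Callable

def blue_green_switch(live: str, is_healthy: Callable[[str], bool],
                      attempts: int = 5, delay: float = 0.0) -> str:
    """Return the color that should receive traffic after a deploy attempt."""
    idle = "green" if live == "blue" else "blue"
    for _ in range(attempts):
        if is_healthy(idle):
            return idle  # new version is ready: point the proxy at it
        time.sleep(delay)
    return live  # never became healthy: keep serving the old version
```

The caller is expected to route traffic to whatever color is returned, so a failed health check leaves the old containers serving and doubles as the rollback path.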
https://redd.it/1njrj4u
@r_devops
How much time do you spend in your daily team stand-up meeting?
Since our new manager started, we have been spending 1 hour, 4 days per week, on daily team meetings. I think this is a bit too much, but others on the team appreciate it. We work remotely most of the time and it lets us exchange on a variety of subjects, but at the same time it's a real time sink: it's mostly the same 3 people talking, and most of the time about stuff that doesn't directly concern most of the team.
https://redd.it/1njuxdx
@r_devops
What should a Mid-level Devops Engineer know?
I was lucky enough to transition from a cloud support role to devops. Granted, the position seems to be mainly maintaining and enhancing (and adding to the infrastructure as requirements come) a somewhat mature infrastructure, although there are things I am learning: migrations that are taking place, infra modernization. The team is now mainly just myself - I have a senior, but as of last year they have been pretty much hands off. I only go to them when needed or when I have a question on something I don't have any information on. So it's a solo show between myself and the developers.
I would be lying if I said it's smooth sailing. Somewhat rough seas, and most of the time I am trying to read into cloud provider documentation or technology documentation to try and get a certain thing working. I don't always have the answers. I realize that's ok, but I feel that doesn't reflect well when I am the main POC.
Tech stack consists of EC2s, ECS Fargate, CloudWatch for metrics, and we recently moved from GitHub Actions to AWS CodePipeline, so I am slowly becoming familiar there.
We don't use K8s/EKS as that's overkill for our applications. That said, I feel like that is what 80% of folks use (based on this subreddit) - I was told ECS is somewhat similar to EKS but I am not sure that is true.
Just trying to get a gauge of what I should know as a mid-level engineer - most of the infrastructure is already established, so I don't have an opportunity to implement new things. Just enhancing what is there, troubleshooting prod and pipeline issues, and implementing new features.
Also, how long does it take to implement a new feature? Being the only devops engineer, sometimes it's smooth sailing, other times it's not, and I start to panic.
Looking to set up my own website (resume) and homelab at some point.
Open to ANY books as well, anything in particular you guys think will help me become a better engineer.
https://redd.it/1njvkra
@r_devops
What's the best way to detect vulnerabilities or issues with your API endpoints?
What's the best way to detect vulnerabilities or issues with your API endpoints? Is there anything free you would recommend?
https://redd.it/1njwmp0
@r_devops
Gitstrapped Code Server - fully bootstrapped code-server implementation
https://github.com/michaeljnash/gitstrapped-code-server
Hey all, wanted to share my repository, which takes code-server and bootstraps it with GitHub: it clones / pulls desired repos, enables code-server password changes from inside code-server, and adds other niceties that give a ready-to-go workspace that is easily provisioned and dead simple to set up.
I liked being able to jump into working with a repo in GitHub Codespaces and just get straight to work, but didn't like paying once I hit limits, so I threw this together. I also needed a lighter alternative to Coder for my startup, since we're only a few devs and Coder is probably overkill.
Can be bootstrapped either by env vars or inside code-server directly (ctrl+alt+g, or use the CLI in the terminal).
There are some other things I'm probably forgetting. Check the repo readme for a full breakdown of features. Makes provisioning workspaces for devs a breeze.
Thought others might find this handy, as it has saved me tons of time and effort. Coder is great, but for a team of a few devs or an individual this is much more lightweight and straightforward and keeps life simple.
Try it out and let me know what you think.
Future thoughts are to work on isolated environments per repo somehow, while avoiding dev containers so we just have the single instance of code-server, keeping things lightweight. Maybe have it automatically work with direnv for each cloned repo and have an exhaustive script to activate any type of virtual environment automatically when changing directory to the repo (anything from nix, to devbox, to activating a Python venv, etc.)
Cheers!
https://redd.it/1njxx7y
@r_devops
Implementing SA 2 Authorization & Secure Key Generation
We’re in the process of rolling out **SA 2 authorization** to strengthen our security model and improve integration reliability.
Key steps include:
* Enforcing stricter access control policies
* Generating new authorization keys for service-to-service integration
* Ensuring minimal disruption during rollout through staged deployment and testing
The main challenge is balancing **security hardening** with **seamless continuity** for existing integrations. A lot of this comes down to careful planning around key distribution, rotation, and validation across environments.
👉 For those who have implemented SA 2 (or similar authorization frameworks), what strategies did you find most effective in managing key rotation and integration testing?
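One common way to keep existing integrations working during a rotation is an overlap window, where the verifier accepts both the new key and the previous one until every client has switched. The sketch below shows that idea with plain HMAC signatures. It is not specific to SA 2, and the key ids, key values, and helper names are made up for illustration.

```python
import hashlib
import hmac
from typing import Optional

def sign(key: bytes, message: bytes) -> str:
    """HMAC-SHA256 signature of a message, as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(message: bytes, signature: str, active_keys: dict) -> Optional[str]:
    """Return the id of the active key that produced the signature, or None."""
    for key_id, key in active_keys.items():
        # compare_digest avoids leaking timing information during comparison.
        if hmac.compare_digest(sign(key, message), signature):
            return key_id
    return None

# Hypothetical overlap window: the new key (v2) and the old key (v1) are both valid.
active_keys = {"v2": b"new-secret", "v1": b"old-secret"}
old_client_sig = sign(b"old-secret", b"ping")
accepted_by = verify(b"ping", old_client_sig, active_keys)

# After the window closes, v1 is dropped and old signatures stop verifying.
after_cutover = verify(b"ping", old_client_sig, {"v2": b"new-secret"})
```

Logging which key id matched gives a simple signal for integration testing: once nothing verifies against the old id across environments, it can be retired.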
https://redd.it/1nk0kb0
@r_devops
Engineering Manager says Lambda takes 15 mins to start if too cold
Hey,
Why am I being told, 10 years into using Lambdas, that there’s some special wipe out AWS do if you don’t use the lambda often? He’s saying that cold starts are typical, but if you don’t use the lambda for a period of time (he alluded to 30 mins), it might have the image removed from the infrastructure by AWS. Whereas a cold start is activating that image?
He said it can take 15 mins to trigger a lambda and get a response.
I said, depending on what the function does, it’s only ever a cold start for a max of a few seconds - if that. Unless it’s doing something crazy and the timeout is horrendous.
He told me that he’s used it for a lot of his career and it’s never been that way
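One way to settle this kind of disagreement is to measure it. Module-level code in a Lambda function runs once, when a new execution environment is initialized, and warm invocations reuse that environment. A handler can therefore flag its own cold starts. The sketch below assumes a Python runtime, and the returned field names are made up. For reference, 15 minutes is Lambda's maximum configurable function timeout, which is a different thing from cold start latency.

```python
import time

# Module scope runs once per execution environment, i.e. on a cold start.
_COLD = True
_LOADED_AT = time.monotonic()

def handler(event, context):
    """Report whether this invocation had to initialize a new environment."""
    global _COLD
    was_cold, _COLD = _COLD, False
    return {
        "cold_start": was_cold,
        # Seconds since this environment was initialized.
        "env_age_s": round(time.monotonic() - _LOADED_AT, 3),
    }
```

Invoking the function after various idle periods and comparing total round-trip time for `cold_start` true versus false would show how long a cold start actually takes for this particular function.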
https://redd.it/1nk37sw
@r_devops