AWS ECS (CI/CD)
Which CI/CD are you all using, and which is better?
Note: it needs to be self-hosted.
https://redd.it/1niapty
@r_devops
DevOps team set up 15 different clusters 'for testing.' That was 8 months ago and we're still paying $87K/month for abandoned resources.
Our dev team spun up a bunch of AWS infra for what was supposed to be a two-week performance-testing sprint. We had EKS clusters, RDS instances (GP3 with provisioned IOPS), ELBs, EBS volumes, and a handful of supporting EC2s.
The ticket was closed, everyone moved on. Fast forward eight and a half months… yesterday I was doing some cost exploration in the dev account and almost had a heart attack. We were paying $87k/month for environments with no application traffic, near-zero CloudWatch metrics, and no recent console/API activity for eight and a half months. No owner tags, no lifecycle TTLs, lots of orphaned snapshots and unattached volumes.
Governance tooling exists, but the process to enforce it doesn’t. This is less about tooling gaps and more about failing to require ownership, automated teardown, and cost gates at provision time. Anyone have a similar story to make me feel better? What guardrails do you have to prevent this?
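On the guardrail question: even before proper governance tooling, a scheduled read-only sweep catches most of this. A minimal sketch, assuming AWS CLI v2; the "owner" tag key is a hypothetical convention, not anything AWS enforces.
#!/usr/bin/env bash
# Read-only weekly waste report for a dev account.
set -euo pipefail
# Unattached EBS volumes: status=available means nothing is using them.
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table
# Running instances with no owner tag: candidates for a teardown ticket or TTL.
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[] | [?!(Tags[?Key==`owner`])].[InstanceId,LaunchTime]' \
  --output table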
https://redd.it/1nieqfn
@r_devops
Pod requests are driving me nuts
Anyone else constantly fighting with resource requests/limits?
We’re on EKS, and most of our services are Java or Node. Every dev asks for way more than they need (like 2 CPU / 4Gi mem for something that barely touches 200m / 500Mi). I get they want to be on the safe side, but it inflates our cloud bill like crazy. Our nodes look half empty and our finance team is really pushing us to drive costs down.
Tried using VPA, but it's not really an option for most of our workloads. HPA is fine for scaling out, but it doesn't fix the "requests vs actual usage" mess. Right now we're staring at Prometheus graphs, adjusting YAML, rolling pods, rinse and repeat… a total waste of our time.
Has anyone actually solved this? Scripts? Some magical tool?
I keep feeling like I’m missing the obvious answer, but everything I try either breaks workloads or turns into constant babysitting.
Would love to hear what’s working for you.
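One low-tech approach before reaching for a magic tool: rank workloads by their measured-usage-to-request ratio and only renegotiate the worst offenders. A rough sketch, assuming Prometheus scrapes cAdvisor and kube-state-metrics and PROM_URL points at it:
#!/usr/bin/env bash
# Rank containers by 7-day p95 CPU usage as a fraction of their CPU request;
# ratios far below 1.0 are the over-provisioned offenders worth chasing.
set -euo pipefail
QUERY='quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m])
  / on(namespace, pod, container) group_left()
  kube_pod_container_resource_requests{resource="cpu"}'
curl -fsS "${PROM_URL:?}/api/v1/query" --data-urlencode "query=${QUERY}" |
  jq -r '.data.result[] | [.metric.namespace, .metric.pod, .metric.container, .value[1]] | @tsv' |
  sort -t$'\t' -k4 -n | head -20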
https://redd.it/1niec2z
@r_devops
We deploy our app on an EC2 instance with docker-compose. How do I get more observability into the Docker containers with AWS-native tooling? I'm unable to use config.json to scrape Docker metrics in the CloudWatch agent (cwagent).
e
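For context on the question: as far as I know the CloudWatch agent has no native Docker-socket collector on plain EC2, but it can scrape Prometheus endpoints, so the commonly documented route is a cAdvisor sidecar plus a scrape config. A sketch, with the path and job name as assumptions:
#!/usr/bin/env bash
# Expose per-container metrics via cAdvisor, then let the CloudWatch agent
# scrape them over its Prometheus support.
set -euo pipefail
docker run -d --name cadvisor -p 8080:8080 \
  -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest
# Hypothetical path; it must match the prometheus_config_path you set under
# logs.metrics_collected.prometheus in the agent's config.json.
cat > /opt/aws/amazon-cloudwatch-agent/etc/prometheus.yaml <<'EOF'
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['localhost:8080']
EOF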
https://redd.it/1nigaft
@r_devops
Any AI code review tools for GitHub PRs?
my agency’s been using cursor to ship features faster (seriously insane how much time it saves). BUT once code hits github prs… cursor doesn’t help. we still do manual reviews and end up missing dumb stuff. been going through this whole list of tools (coderabbit, qodo, codium, greptile, etc) and honestly i’m CONFUSED AF. every site says “best ai code review” but half of it feels like hype demos. currently following this list - https://www.codeant.ai/blogs/best-github-ai-code-review-tools-2025 but i think there is a lot missing here too?
all i really want is something that can act like a second pair of eyes before merge. doesn’t need to be magical, just catch obvious things humans miss. open source would be cool too, but i’m fine with paid IF IT ACTUALLY WORKS in production. anyone here using these daily? what’s worth the setup?
https://redd.it/1niif8l
@r_devops
Interacting with a webpage during tests
I'm implementing some features for a docker-compose-based application, among them backup and restore.
I'd like to add some tests for this.
The steps would be something like the following:
docker compose up
# Assert the instance is actually working by logging in
# Change username, profile image and update/install some apps
make backup
docker compose down --remove-orphans --volumes
docker compose up
make restore
# Assert the changes previously made are all still there
I'm having a hard time finding a good way to interact with the web page and do the steps prefixed with #. Do I have better options than writing scripts based on Playwright, Selenium or Cypress?
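Short of those three, probably not; but the harness around them can stay plain shell, so only the # steps need a browser driver. A sketch where the UI work lives in two hypothetical Playwright specs (e2e/seed.spec.ts, e2e/verify.spec.ts) and the script only orchestrates:
#!/usr/bin/env bash
# Backup/restore round trip. Browser-only steps (log in, change username,
# profile image, install apps) are delegated to hypothetical Playwright specs.
set -euo pipefail
docker compose up -d --wait              # --wait blocks until healthchecks pass
npx playwright test e2e/seed.spec.ts     # log in and make the changes to back up
make backup
docker compose down --remove-orphans --volumes
docker compose up -d --wait
make restore
npx playwright test e2e/verify.spec.ts   # assert the seeded changes survived
docker compose down --remove-orphans --volumes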
https://redd.it/1niji8v
@r_devops
Resources for learning Openshift for someone who's already experienced in Kubernetes?
I have 5 years of Kubernetes experience. I have a technical interview coming up for a job I'm determined to get, though it's an OpenShift job.
What are the best resources for learning OpenShift when you already understand Kubernetes?
https://redd.it/1niizpl
@r_devops
Which AI coding assistant is best for building complex software projects from scratch, especially for non-full-time coders?
Hi everyone,
I’m an embedded systems enthusiast with experience working on projects using Raspberry Pi, Arduino, and microcontrollers. I have basic Python skills and a moderate understanding of C, C++, and C#, but I’m not a full-time software developer. I have an idea for a project that is heavily software-focused and quite complex, and I want to build at least a prototype to demonstrate its capabilities in the real world — mostly working on embedded platforms but requiring significant coding effort.
My main questions are:
Which AI tools like ChatGPT, Claude, or others are best suited to help someone like me develop complex software from scratch?
Can these AI assistants realistically support a project of this scale, including architectural design, coding, debugging, and iteration?
Are there recommended workflows or strategies to effectively use these AI tools to compensate for my limited coding background?
If it’s not feasible to rely on AI tools alone, what are alternative approaches to quickly build a functional prototype of a software-heavy embedded system?
I appreciate any advice, recommendations for specific AI tools, or general guidance on how to approach this challenge.
Thanks in advance!
https://redd.it/1ninjyc
@r_devops
I may be over-relying on AI and I’m not sure how to stop
I understand that similar questions might have been asked before, but most of the answers assume the person is thinking of ditching AI entirely, and people say it’s only a tool and should be used.
My problem is I’m still basically at the first levels of DevOps, and I can’t for the life of me learn with a deadline. I understand the concepts and what almost everything does, but writing those scripts? Almost every time I have a project with a deadline, even a personal one, I use AI, and as the scripts and such are generally easy and simple, it does it in a single message.
I then assume I’ll finish everything and submit and then take the time to understand, and while I do actually understand, I wouldn’t be able to replicate or write some of those scripts completely on my own.
What did everyone do at the start? How did you start studying and understanding without relying much on AI? And when do you mix AI with your work? I know that maybe in the future we won’t be writing scripts, but I’d like to at least know how to write them, and then I can throw it at the AI.
https://redd.it/1nir0ap
@r_devops
Basic tool for small tasks during the day using the Pomodoro technique for focus
I have difficulty jumping from tool to tool, projects, languages, and you can't really track time with project management tools. I started writing a tool in Go after some courses and books. It works on Linux/WSL/macOS but not Windows, because I still have some issues.
You just start a task in your terminal like:
pomo-cli start --task "write post in reddit" --time 15 --background
Then a background process starts and a local DB is updated in $HOME/.pomo-cli. After a task finishes you receive a message in the terminal and it's added to the DB. You can also view statistics and pause a task. It helps me focus and take short breaks between changing repos or tools.
If anyone wants to use it:
https://github.com/arushdesp/pomo-cli
https://redd.it/1niqnir
@r_devops
I built a fully automated CI/CD pipeline for a Node.js app using Docker, Terraform & GitHub Actions
Hey everyone,
I just completed a hands-on project to practice modern DevOps workflows:
Built a Node.js service with a public route / and a protected route /secret using Basic Auth.
Dockerized the application to make it portable.
Provisioned a GCP VM with Terraform and configured firewall rules.
Set up a CI/CD pipeline with GitHub Actions to build the Docker image, push it to GitHub Container Registry, and deploy it automatically to the VM.
Managed secrets securely with GitHub Secrets and environment variables.
This project helped me learn how to connect coding, containerization, infrastructure as code, and automated deployments.
Check out the repo if you want to see the full implementation:
https://github.com/yanou16/dockerized-service
Would love feedback from anyone with experience deploying Dockerized apps in production!
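For anyone reading along, the "deploy to the VM" stage in a setup like this usually reduces to a short script the workflow runs over SSH. A hedged sketch; the user, host, port mapping, and tag are placeholders, not taken from the repo:
#!/usr/bin/env bash
# Hypothetical deploy.sh run by the workflow's last step: pull the image the
# pipeline just pushed to GHCR and swap the running container on the VM.
# VM_HOST and GHCR_PAT come from CI secrets.
set -euo pipefail
IMAGE="ghcr.io/yanou16/dockerized-service:latest"
ssh "deploy@${VM_HOST}" bash -s <<EOF
  echo "${GHCR_PAT}" | docker login ghcr.io -u yanou16 --password-stdin
  docker pull ${IMAGE}
  docker rm -f app 2>/dev/null || true
  docker run -d --name app -p 80:3000 --restart unless-stopped ${IMAGE}
EOF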
https://redd.it/1nit4lz
@r_devops
Shift left security practices developers like
I’ve been playing around with different ways to bring security earlier in the dev workflow without making everyone miserable. Most shift left advice I’ve seen either slows pipelines to a crawl or drowns you in false positives.
A couple of things that actually worked for us:
tiny pre-commit/PR checks (linters, IaC, image scans) → fast feedback, nobody complains
heavier stuff (SAST, fuzzing) → push it to nightly, don’t block commits
policy as code → way easier than docs that nobody reads
if a tool is noisy or slow, devs ignore it… might as well not exist
I wrote a longer post with examples and configs if you’re curious: Shift Left Security Practices Developers Like
Curious what others here run in their pipelines without slowing everything down.
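As a concrete version of the "tiny checks" tier above, the whole PR gate can be one short script, with SAST and fuzzing left to the nightly job. A minimal sketch; the specific linters are illustrative, not prescriptive:
#!/usr/bin/env bash
# Fast PR gate: only checks that finish in seconds run here. Failing only on
# HIGH/CRITICAL keeps the noise down so devs don't learn to ignore it.
set -euo pipefail
tflint --recursive                        # IaC lint
hadolint Dockerfile                       # Dockerfile lint
trivy image --severity HIGH,CRITICAL --exit-code 1 "${IMAGE:?set by CI}"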
https://redd.it/1niuyxw
@r_devops
Has anyone done local deployment on Proxmox and Kubernetes before?
How is this normally done, and is this a reasonable way to go about it? I'm looking to deploy local web applications that are only accessible on our on-site server.
https://redd.it/1niwx9u
@r_devops
Airbyte OSS is driving me insane
I’m trying to build an ELT pipeline to sync data from Postgres RDS to BigQuery. I didn’t know Airbyte would be this resource-intensive, especially for the job I’m trying to set up (syncing tables with thousands of rows, etc.). I had Airbyte working on our RKE2 cluster, but it kept failing due to insufficient resources. I finally spun up a single-node K3s cluster with 16GB RAM / 8 CPUs. Now Airbyte won’t even deploy on this new cluster. The Temporal deployment keeps failing, and the bootloader keeps complaining about a missing environment variable in a secrets file I never specified in extraEnv. I’ve tried the v1 and v2 charts; neither works. The v2 chart is the worst: helm template throws an error about an ingressClass config missing at the root of the values file, but the official chart’s values don’t show an ingressClass entry there. It’s driving me nuts.
Any recommendations for simpler OSS ELT tools I can use to sync data between Postgres and Google BigQuery?
Thank you!
https://redd.it/1nixzmx
@r_devops
Engineering leaders: how do you respond when leadership asks for the “ROI of a tool or of developers”?
Title. Curious how one could measure these consistently and reliably.
https://redd.it/1nj12px
@r_devops
Script/Automation "Orchestration": does this exist? Is GitHub Actions the best option? Maybe use "ETL" orchestration tools originally meant for data pipelines?
Many times, if an org is doing IaC or already using GHA (GitHub Actions), Azure DevOps, or a similar CI/CD platform, they'll inevitably leverage it for running scripts/automations as well, oftentimes for "manual" workflows. Things like "deploy a lab in AWS" or "rotate these secrets". Is there a better alternative?
I know there are ways to run automations, like Azure Automation accounts, AWS Lambdas, Azure Functions, etc. However, these are more programmatic and event-based, not really designed for putting in front of L1-L2 technicians/users who are terrified of GitHub/code and shouldn't have access anyway. I am aware you could use Slack/Teams with webhooks, build your own frontend of some sort to call webhooks, etc. I've done this using custom Slack bots + Lambdas and Azure Automation. However, it's not ideal, and there's essentially zero reporting.
I bring this up because I've joined an environment where GHA is used for what I'd call "automation orchestration". There are dozens of automation scripts built to go out and deploy things to AWS/Microsoft/cloud SaaS solutions, which require technicians to input 10-20 parameters per environment and run the workflow manually for new clients or dev environments. Some of these actions run dozens of PowerShell scripts and bash commands as steps, sequentially setting up cloud environments. Terraform does not cover all the options, so there are inevitably REST APIs that have to get hit, or PoSh/bash CLI commands for the various SaaS offerings that have to be used. Maybe in the future the TF providers will cover everything we need, but I digress.
Then there are automations that run against our managed environments, of which there are hundreds, each with their own unique parameters, to do things like secret management, cloud resource deployments, reporting, IaC tasks, building images, etc.
These workflows have to run on self-hosted runners for security and compliance reasons. It's all PowerShell, Python, bash, etc. Which means it's just running scripts on a container/VM to interact with public REST APIs at the end of the day, if we're being frank.
GHA can do a lot of this, and we've done a lot of creative engineering to make it work, but I think it's not exactly "built" for this sort of job. The Actions web UI isn't terribly featureful, nor built for the sort of "reporting" beyond what you can put in job summaries and error logs. It is fantastic for dev work, build tasks, etc., and I really enjoy it for those, don't get me wrong. It has worked well for our use, but perhaps we should be using something else?
Are there better solutions for, for lack of a better word, "automation orchestration"? A platform that simply runs scripts on a schedule, manually, or triggered? Similar to ETL orchestration solutions? Prefect, Airflow, and various DAG tools do something like this, but they're more built for Python and don't support j. A platform that has reporting, logs, and UIs for showing failures and results, all in one place? Additionally, it would have to be self-hosted.
I could be mistaken, and something like Airflow might do this quite easily; I'm not intimately familiar with those offerings, just that they perform a similar sort of orchestration functionality.
Is anyone utilizing GHA for similar use cases beyond simple IaC deployments? Would you have any recommendations? Thanks!
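For reference, the closest GHA gets to a technician-friendly form is workflow_dispatch inputs, which render as a web form in the Actions UI. A sketch of how the parameterized manual workflows described above usually look; the workflow name, inputs, and script path are hypothetical:
#!/usr/bin/env bash
# Sketch: a manually triggered, parameterized workflow. Technicians fill in
# form fields in the Actions UI instead of touching any code.
cat > .github/workflows/deploy-lab.yml <<'EOF'
name: deploy-lab
on:
  workflow_dispatch:
    inputs:
      client_name:
        denoscription: Client or environment identifier
        required: true
      region:
        type: choice
        options: [us-east-1, eu-west-1]
jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - shell: pwsh
        run: ./deploy-lab.ps1 -Client "${{ inputs.client_name }}" -Region "${{ inputs.region }}"
EOF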
https://redd.it/1nj0q9r
@r_devops
Fix deploy bugs before they land: a semantic firewall for devops + grandma clinic (beginner friendly, mit)
last week i shared a deep dive on failure modes in ai stacks and got great feedback here. a few folks asked for a simpler, beginner friendly version for devops. this is that post. same math idea, plain language. the trick is simple. instead of patching after a bad deploy, you install a tiny semantic firewall before anything runs. if the state is unstable, it loops, narrows, or refuses. only a stable state is allowed to execute.
why you should care
after style: output happens then you scramble with rollbacks and quick fixes. the same class of failure returns with a new shape.
before style: a pre-output gate inspects state signals first. if boot order is wrong, a lock is pending, or the first call will burn, it stops early. fixes become structural and repeatable.
what this looks like in devops terms
No.14 bootstrap ordering. hot pan before eggs. readiness probes pass, caches warmed, migrations staged.
No.15 deployment deadlock. decide who passes the narrow door. total order, timeouts and backoff, fallback path.
No.16 pre-deploy collapse. wash the first pot. versions pinned, secrets present, tiny canary first.
No.8 debugging black box. recipe card next to the stove. every run logs which inputs and checks created the output.
quick demo. add a pre-output gate to ci
paste this into a repo as preflight.sh and call it from your pipeline. it fails fast with a clear reason.
#!/usr/bin/env bash
set -euo pipefail
say() { printf "[preflight] %s\n" "$*"; }
fail() { printf "[preflight][fail] %s\n" "$*" >&2; exit 1; }
# 1) bootstrap order
say "checking service readiness"
kubectl wait --for=condition=available --timeout=90s deploy/app || fail "app not ready"
kubectl wait --for=condition=available --timeout=90s deploy/db || fail "db not ready"
say "warming cache and index"
curl -fsS "$WARMUP_URL/cache" || fail "cache warmup failed"
curl -fsS "$WARMUP_URL/index" || fail "index warmup failed"
# 2) secrets and env
say "checking secrets"
[[ -n "${API_KEY:-}" ]] || fail "missing API_KEY"
[[ -n "${DB_URL:-}" ]] || fail "missing DB_URL"
# 3) migrations have a lane
say "ensuring migration lane is clear"
flock -n /tmp/migrate.lock -c "echo locked" || fail "migration lock held"
./migrate --plan || fail "migration plan invalid"
./migrate --dry || fail "migration dry run failed"
# 4) deadlock guards
say "testing write path with timeout"
curl -m 5 -fsS "$HEALTH_URL/write-probe" || fail "write probe timeout likely deadlock"
# 5) first call canary
say "shipping tiny canary"
resp="$(curl -fsS "$API_URL/ping?traffic=0.1")" || fail "canary failed"
grep -q '"ok":true' <<<"$resp" || fail "canary not ok"
say "preflight passed"
github actions wiring. run preflight before real work.
name: release
on: [push]
jobs:
  ship:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: setup
        run: echo "WARMUP_URL=$WARMUP_URL" >> $GITHUB_ENV
      - name: pre-output gate
        run: bash ./preflight.sh
      - name: deploy
        if: ${{ success() }}
        run: bash ./deploy.sh
kubernetes job with a gate. refuse if gate fails.
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-gate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: your-image:tag
          command: ["/bin/bash","-lc"]
          args:
            - |
              ./preflight.sh || { echo "blocked by semantic gate"; exit 1; }
              ./run-task.sh
minimal “citation first” for runbooks
the same idea works for human steps. put the card on the table before you act.
runbook step 2 – change feature flag
require: ticket id + monitoring link
refuse: if ticket or dashboard missing, do not flip flag
accept: when both are pasted and the dashboard shows baseline stable for 2 minutes
what changes after you add the gate
you stop guessing. every failure maps to a number and a fix you can name.
fewer rollbacks.
first call failures are caught on the canary.
fewer flaky deploys. boot order and locks are tested up front.
black box debugging ends. each release has a small trace that explains why it was allowed to run.
how to try this in 60 seconds
1. copy preflight.sh into any pipeline or cron job.
2. set three env vars and one canary endpoint.
3. run. if it blocks, read the message, not the logs.
if you want the plain language guide
there is a beginner friendly “grandma clinic” that explains each failure as a short story plus the minimal fix. the labels above map to these numbers. start with No.14, No.15, No.16, No.8. if you need the doctor style prompt that points you to the exact page, ask and i can share it.
faq
q. do i need to install a platform or sdk
a. no. this is shell and yaml. it is a reasoning guard before output. you can keep your stack.
q. will this slow down release
a. it adds seconds. it removes hours of rollback and root cause churn.
q. can i adapt this for airflow, argo, jenkins
a. yes. drop the same gate into a pre step. the checks are plain commands.
q. how do i know it actually worked
a. acceptance targets. you decide them. at minimum require readiness passed, secrets present, no lock held, canary ok. if these hold three runs in a row, the class is fixed.
q. we also run ai agents to modify infra. does the same idea work
a. yes. add “evidence first” to the agent. tool calls only after a citation or a runbook page is present.
q. where is the plain language guide
a. “Grandma Clinic” explains the 16 common failure modes with tiny fixes. beginner friendly.
link:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/GrandmaClinic/README.md
closing
this feels different because it is not a patch zoo after the fact. it is a small refusal engine before the fact. once a class is mapped and guarded, it stays fixed. Thanks for reading my work
https://redd.it/1nj1okk
@r_devops
Are these types of DevOps interview questions normal for fresher/junior roles, or was this just overkill?
Hey everyone,
I recently gave a DevOps interview through Alignerr (AI-based assessment), and I honestly came out feeling like I got cooked. 🥲
The questions were way harder than I expected for a fresher/junior role. Some examples:
Identifying port 22.
How to separate broad staging and dev environments from a large Terraform configuration file.
Handling configs for multiple environments with variables.
Dealing with things bound to 0.0.0.0 and what policies you’d set around that.
General stuff about modules and structuring one big configuration.
Integrating Sentinel with a CI/CD pipeline.
I was expecting more “Terraform init/plan/apply” level or maybe some AWS basics, but these felt like senior-level production questions.
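For anyone prepping against the same question set: the staging/dev separation one is usually answered with workspaces or per-environment var files rather than anything exotic. A minimal sketch, with hypothetical paths:
#!/usr/bin/env bash
# One Terraform configuration, multiple environments: a workspace per env
# keeps state separate, a tfvars file per env keeps config separate.
# Hypothetical layout: env/dev.tfvars, env/staging.tfvars.
set -euo pipefail
terraform workspace select staging || terraform workspace new staging
terraform plan -var-file=env/staging.tfvars -out=staging.plan
terraform apply staging.plan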
https://redd.it/1nj73nc
@r_devops
Help me give my career some direction
I am a 2021 B.Tech IT graduate from a private college in Manipal, India.
My career has been a mess ever since. Soon after graduation I went to the US in 2021 to pursue a master's, but I didn't complete the degree and returned to India in about 6 months. Then I went back to the US 6 months later and returned again in about 3 months. So overall I spent about 2 years gaining nothing, going back and forth between India and the US, and I accumulated some debt in the process. The reason for this flip-flop was some untreated mental health issues.
After returning to India for the second time in 2023, after an extensive search, I finally found a DevOps Engineer job at a firm in Bengaluru. The salary was good while the job lasted (15 LPA, or about $17k/year), but layoffs hit soon after, in 2024. I was lucky to find another job in Bengaluru that paid the same, but the thing is I never learned core DevOps skills: cloud management, Kubernetes, CI/CD pipelines, etc. For 2 years I have been working only on Python- and Bash-based programs and scripts.
Now I am willing to pursue some certifications to aim for higher packages. Certified Kubernetes Administrator and AWS DevOps Engineer Professional are the ones I am targeting, but I am unsure if they will lead to higher packages at all. Most DevOps jobs in India are in WITCH-like consulting companies. I am unsure how to aim for product-based companies, especially in the current environment, when there are no jobs anywhere.
Should I try to switch to development, which seems so risky in the age of AI?
TL;DR: I am a lost engineer, currently employed but looking for ways to increase my compensation. Please help me give my career some direction. I have wasted a lot of time, but I am still only 26 and have many years ahead of me.
https://redd.it/1nj7jd4
@r_devops
How are you keeping CI/CD security from slowing down deploys?
Our pipeline runs Terraform + Kubernetes deploys daily. We’ve got some IaC linting and container scans in place, but it feels like every added check drags the cycle out. Security wants more coverage, but devs complain every time scans add minutes.
How are you balancing speed and security here? Anyone feel like they’ve nailed CI/CD security without breaking velocity?
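One pattern that adds coverage without adding wall-clock time is running scanners concurrently with the plan instead of serially. A sketch in plain shell; the tool choices are illustrative:
#!/usr/bin/env bash
# Run the slow scanners in parallel with the Terraform plan; any failure
# fails the gate, but the pipeline only pays for the slowest job once.
set -euo pipefail
pids=()
terraform plan -out=tf.plan & pids+=($!)
tfsec . & pids+=($!)
trivy image --severity HIGH,CRITICAL --exit-code 1 "${IMAGE:?}" & pids+=($!)
# wait surfaces each job's exit code, so a failed scan still blocks the deploy
for pid in "${pids[@]}"; do wait "$pid"; done
echo "plan and scans all green"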
https://redd.it/1nj8ysz
@r_devops