The Ultimate SRE Reliability Checklist
A practical, progressive SRE checklist you can actually implement. Plain explanations. Focus on user impact. Start small, mature deliberately.
https://oneuptime.com/blog/post/2025-09-10-sre-checklist/view
https://redd.it/1njbgeh
@r_devops
From AWS Intern to Remote-Ready Cloud Engineer: Looking for Guidance
Hey everyone — I recently completed a cloud support engineering internship at AWS, where I learned how to handle global support cases involving EC2, IAM, and VPC, got broader exposure to Linux and highly available web applications, and developed a strong understanding of security, governance, and compliance principles.
I'm AWS SAA-C03, AIF-C01, and CCNA certified, and have solid hands-on skills in cloud diagnostics, CLI tooling, and automation.
I'm now looking to pivot into remote work — ideally with startups or dev shops where I can contribute to infrastructure support, observability, or AI ops. I’m based in Kenya, with strong internet and power, and comfortable working US/EU hours.
Would love to hear from anyone who’s hired globally or transitioned from a support background into DevOps or infra roles.
Any advice, referrals, or critique of my approach would be hugely appreciated!
Happy to DM my CV or portfolio if helpful 🙏
https://redd.it/1njotqk
@r_devops
Our postmortem process is broken and I'm tired of pretending it's not
Another week, another "we learned from this incident" postmortem that nobody will read in 3 months when the exact same thing breaks again.
Spent 2 hours today writing up why our auth service went down. Same root cause as January. Same action items. Same people nodding along saying "yeah we should definitely fix that technical debt."
I'm so tired of this cycle. Write detailed report, assign action items, everyone's too busy, repeat when it breaks again.
The worst part? Leadership keeps asking why we haven't "learned" from previous incidents. Bro, we learned plenty - we just don't have time to actually implement any of the fixes because we're constantly firefighting new issues.
Our retros have become these long documents that check boxes for compliance but don't actually prevent anything. Half the action items from last quarter are still sitting in our backlog with no owner and no deadline that means anything.
Been thinking we need something that actually tracks this stuff automatically and keeps the retros short instead of these novel-length reports. Maybe pulls in similar past incidents so we stop pretending this is the first time our Redis cache decided to take a nap.
Anyone found tools that actually make postmortems useful instead of just another Jira ticket graveyard?
https://redd.it/1njrnlb
@r_devops
Is it time to learn Kubernetes? - Zero Downtime Deployment with Docker
Hey Reddit, I've been stuck trying to achieve zero-downtime deployment for a few weeks now, to the point I'm considering learning proper container orchestration (K8s). It's a web stack (Laravel, Nuxt, a few microservices), and what I have now works, but I'm not happy with the downtime... Any advice from more experienced DevOps engineers would be much appreciated!
What I want to achieve:
Deployment to a dedicated server running Proxmox - managed hosting is out of the question
Continuous deployment (repo/registry) with rollbacks and zero downtime
Notifications for deployment success/failure
Simplicity and automation - the ability to push a commit from anywhere and have it go live
What I have currently:
Docker Compose (5 containers)
GitHub Actions that build and publish to GHCR
Watchtower to pull and deploy images
Reverse proxy CT that routes via bridge to other CTs (e.g. 10.0.0.11:3000)
~80 env vars in a file on the server(s), mounted to the containers and managed via SSH
What I've tried:
Swarm for rolling updates with Watchtower
Blue/green with nginx upstream
Coolify/Dokploy (traefik)
Kamal
Nomad
Each of the above had pros and cons. Nginx had downtime. I don't want to trigger a deployment from the terminal. I don't need all the features of Coolify. Swarm had DNS/networking issues even when using `advertise-addr`...
Am I missing an obvious solution here? Docker is awesome but deploying it as a stack seems to be a nightmare!
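For context on the blue/green route: the switch can be made downtime-free if traffic only moves after the new containers pass a health check, and nginx is reloaded rather than restarted (a reload keeps existing connections alive). A minimal sketch under those assumptions - the project names, ports, upstream file path, and /healthz endpoint are placeholders, not part of the setup described above:

```bash
#!/usr/bin/env bash
# Blue/green switch sketch: bring up the new colour, health-check it, then repoint nginx.
# All names, ports, and paths here are illustrative assumptions.
set -euo pipefail

NEW_COLOR=${1:?usage: deploy.sh blue|green}
NEW_PORT=$([ "$NEW_COLOR" = blue ] && echo 3001 || echo 3002)

# Start the new colour alongside the one currently serving traffic
docker compose -p "app-$NEW_COLOR" pull
docker compose -p "app-$NEW_COLOR" up -d

# Only switch once the new containers actually answer health checks
for i in $(seq 1 30); do
  curl -fsS "http://127.0.0.1:$NEW_PORT/healthz" >/dev/null && break
  [ "$i" -eq 30 ] && { echo "health check never passed; aborting"; exit 1; }
  sleep 2
done

# Repoint the upstream and reload nginx gracefully (no dropped connections)
printf 'upstream app { server 127.0.0.1:%s; }\n' "$NEW_PORT" \
  | sudo tee /etc/nginx/conf.d/upstream.conf >/dev/null
sudo nginx -t && sudo nginx -s reload
# The old colour can now be stopped, or left running for instant rollback.
```

Watchtower can keep doing the image pulls; the piece that removes the downtime is gating the nginx reload on the health check instead of swapping containers in place.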
https://redd.it/1njrj4u
@r_devops
How much time do you spend in your daily team stand-up meeting
Since our new manager arrived, we have been spending 1 hour a day, 4 days per week, on team meetings. I think this is a bit too much, but others on the team appreciate it. We work remotely most of the time, and the meetings let us exchange on a variety of subjects, but they're also a real time sink: it's mostly the same 3 people talking, usually about things that don't directly concern most of the team.
https://redd.it/1njuxdx
@r_devops
What should a Mid-level DevOps Engineer know?
I was lucky enough to transition from a cloud support role to DevOps, though the position is mainly maintain-and-enhance (and add to the infrastructure as requirements come in) on a somewhat mature infrastructure. There are still things I'm learning - migrations taking place, infra modernization. The team is now mainly just myself; I have a senior, but as of last year they have been pretty much hands-off. I only go to them when needed or when I have a question about something I have no information on. So it's a solo show between myself and the developers.
I would be lying if I said it's smooth sailing. Somewhat rough seas, and most of the time I'm reading cloud provider or technology documentation to try to get a certain thing working. I don't always have the answers. I realize that's OK, but I feel that doesn't reflect well when I'm the main POC.
The tech stack consists of EC2s, ECS Fargate, and CloudWatch for metrics, and we recently moved from GitHub Actions to AWS CodePipeline, so I'm slowly becoming familiar there.
We don't use K8s/EKS, as that's overkill for our applications. That said, I feel like that's what 80% of folks use (based on this subreddit). I was told ECS is somewhat similar to EKS, but I'm not sure that's true.
Just trying to get a gauge of what I should know as a mid-level engineer - most of the infrastructure is already established, so I don't have much opportunity to implement new things. Just enhancing what is there, troubleshooting prod and pipeline issues, and implementing new features.
Also, how long does it take to implement a new feature? Being the only DevOps engineer, sometimes it's smooth sailing, other times it's not, and I start to panic.
Looking to set up my own website (resume) and homelab at some point.
Open to ANY books as well - anything in particular you think will help me become a better engineer.
https://redd.it/1njvkra
@r_devops
What's the best way to detect vulnerabilities or issues with your API endpoints?
What's the best way to detect vulnerabilities or issues with your API endpoints? Is there anything free you would recommend?
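Not an answer from the thread itself, but one commonly suggested free option is OWASP ZAP's baseline scan, which crawls a target and passively flags common issues (missing security headers, cookie flags, etc.) without attacking it. A sketch, with the target URL as a placeholder:

```bash
# Passive baseline scan of a staging endpoint with OWASP ZAP (free / open source).
# The target URL is a placeholder; -r writes an HTML report into the mounted directory.
docker run --rm -v "$(pwd):/zap/wrk" -t ghcr.io/zaproxy/zaproxy:stable \
  zap-baseline.py -t https://staging.example.com/api -r zap-report.html
```

For APIs specifically, the same image also ships zap-api-scan.py, which takes an OpenAPI or GraphQL definition so the scan exercises real endpoints instead of relying on crawling.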
https://redd.it/1njwmp0
@r_devops
Gitstrapped Code Server - fully bootstrapped code-server implementation
https://github.com/michaeljnash/gitstrapped-code-server
Hey all, wanted to share my repository, which takes code-server and bootstraps it with GitHub, clones/pulls desired repos, enables code-server password changes from inside code-server, and adds other niceties that give you a ready-to-go workspace - easily provisioned, dead simple to set up.
I liked being able to jump into working with a repo in GitHub Codespaces and just get straight to work, but didn't like paying once I hit limits, so I threw this together. I also needed a lighter alternative to Coder for my startup, since we're only a few devs and Coder is probably overkill.
It can be bootstrapped either via env vars or inside code-server directly (Ctrl+Alt+G, or use the CLI in the terminal).
Some other things I'm probably forgetting - check the repo README for a full breakdown of features. Makes provisioning workspaces for devs a breeze.
Thought others might find this handy, as it has saved me tons of time and effort. Coder is great, but for a team of a few devs or an individual this is much more lightweight and straightforward and keeps life simple.
Try it out and let me know what you think.
Future thoughts are to work on isolated environments per repo somehow, while avoiding dev containers so we just have the single instance of code-server, keeping things lightweight. Maybe have it automatically work with direnv for each cloned repo, with an exhaustive script to activate any type of virtual environment automatically when changing directory into the repo (anything from Nix, to Devbox, to activating a Python venv, etc.).
Cheers!
https://redd.it/1njxx7y
@r_devops
Implementing SA 2 Authorization & Secure Key Generation
We’re in the process of rolling out **SA 2 authorization** to strengthen our security model and improve integration reliability.
Key steps include:
* Enforcing stricter access control policies
* Generating new authorization keys for service-to-service integration
* Ensuring minimal disruption during rollout through staged deployment and testing
The main challenge is balancing **security hardening** with **seamless continuity** for existing integrations. A lot of this comes down to careful planning around key distribution, rotation, and validation across environments.
👉 For those who have implemented SA 2 (or similar authorization frameworks), what strategies did you find most effective in managing key rotation and integration testing?
https://redd.it/1nk0kb0
@r_devops
Engineering Manager says Lambda takes 15 mins to start if too cold
Hey,
Why am I being told, 10 years into using Lambdas, that there's some special wipe-out AWS does if you don't use the Lambda often? He's saying that cold starts are typical, but if you don't use the Lambda for a period of time (he alluded to 30 mins), the image might be removed from the infrastructure by AWS - whereas a cold start is activating that image?
He said it can take 15 mins to trigger a Lambda and get a response.
I said that, depending on what the function does, a cold start only ever costs a few seconds at most - if that. Unless it's doing something crazy and the timeout is horrendous.
He told me that he's used Lambda a lot over his career and it's never been that way.
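For what it's worth, the claim is easy to check empirically. A rough sketch with the AWS CLI, where the function name is a placeholder:

```bash
# Compare a cold invoke with a warm one. "my-fn" is a placeholder function name.
# The first call after a long idle period pays the cold-start penalty (typically
# hundreds of milliseconds to a few seconds for most runtimes); the second is warm.
time aws lambda invoke --function-name my-fn /dev/null
time aws lambda invoke --function-name my-fn /dev/null
# Cold starts also show up in the function's CloudWatch REPORT log lines, which
# include an "Init Duration" field only when a new execution environment was created.
```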
https://redd.it/1nk37sw
@r_devops
Thought I was saving $$ on Spark… then the bill came lol
So I genuinely thought I was being smart with my Spark jobs: scaling down, tweaking executor settings, setting timeouts, etc. Then the end of the month came and the cloud bill slapped me harder than expected. Turns out the jobs were just churning on bad joins the whole time. Sad to realize my optimizations were basically cosmetic. Ever get humbled like that?
https://redd.it/1nk4b52
@r_devops
Company I turned down in the past wants to talk after I reached out, how should I approach it?
A while back I was offered a great job abroad, but I turned it down. I recently asked their recruiter whether they have any roles, and surprisingly they want to talk.
I know I put them in a bad spot back then, and I wanted to ask how far you would go in explaining why I turned them down (family matters). I don't want to come across as desperate, but I also want to explain that I had a serious reason to turn them down at the time.
https://redd.it/1nk4qx1
@r_devops
How would you test Linux proficiency in an interview?
I am prepping for an interview where I think Linux knowledge might be my Achilles heel.
I come from a Windows/Azure/PowerShell background, but I have more than basic knowledge of Linux systems. I can write Bash, and I can troubleshoot and deploy Linux containers. Very good theoretical knowledge of Linux components and commands, but my production experience with core Linux is limited.
In my previous SRE/DevOps role we deployed Docker containers to Kubernetes and barely needed to touch the containers themselves.
I'm hoping to hear from more experienced folks here: what would you look for as proof of Linux expertise?
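Not from the post, but as a rough sketch of the ground such interviews tend to cover, the questions usually map to everyday diagnostic work rather than trivia:

```bash
# A sampler of areas Linux interviews tend to probe, expressed as commands
# (illustrative only; <pid> and the service name are placeholders):
ps aux --sort=-%mem | head          # what's running and what's eating memory
ss -tlnp                            # which process is listening on which port
df -h; du -sh /var/log/*            # disk usage and what's filling it
journalctl -u sshd --since today    # reading service logs under systemd
ip addr; ip route                   # basic network configuration
strace -p <pid> -e trace=network    # tracing syscalls of a misbehaving process
```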
Thanks
https://redd.it/1nk4xzu
@r_devops
Structured logs' ROI: is it worth it?
I suggested we invest in structured logging at work. We have a microservices platform.
Been getting lots of resistance - ROI unclear, etc.
Currently it takes us up to a whole day to get a clear picture of complex platform-related issues.
What's your experience been like?
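One way to frame the ROI discussion: that "whole day" largely goes into correlating free-text logs across services. With JSON logs that share a few fields such as level, service, and trace_id (the field names here are illustrative, not from the post), the same questions become single queries:

```bash
# Before: grep across services and hope the message wording is consistent.
grep -r "payment.*failed" /var/log/app/ | less

# After: JSON-lines logs with shared fields; filter and correlate with jq.
cat /var/log/app/*.json | jq -c 'select(.level == "error" and .service == "payments")'
cat /var/log/app/*.json | jq -c 'select(.trace_id == "4bf92f3577b34da6")'  # example trace id
```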
https://redd.it/1nk2rpm
@r_devops
I've been cleaning up CI/CD breaches for 5 years. Please learn from other people's mistakes.
I'm tired of getting 3am calls from CTOs whose companies are falling apart because of preventable CI/CD security issues.
Last month, I watched a team of incredibly talented engineers cry in a conference room. Their startup - 3 years of work, 40 employees, families depending on them - lost their Series B funding because investors discovered an 8-month-old breach during due diligence.
The heartbreaking part? It started with something we've all done: a developer copied a long-lived AWS key into Jenkins on a Friday afternoon to unblock a release. "Just temporary," the commit message said.
I see this pattern constantly:
We lock down production like Fort Knox
We leave our CI/CD systems wide open
We tell ourselves "we'll fix the security debt next sprint"
We never do
Some hard truths from my experience:
Your CI/CD is 4x more likely to be breached than prod
Average cost when it happens: $4.9M
Average time to discover it: 287 days
Most devastating part: it's usually preventable
I'm not trying to scare you. I'm trying to help you avoid the pain I see teams go through.
Quick health check you can do right now:
# Secrets in Git history?
git log --all --source --grep="password\|key\|secret" | wc -l
# Overprivileged CI runners?
kubectl auth can-i '*' '*' --as=system:serviceaccount:ci:default
If those commands return anything scary, you're not alone. Every company I've helped started exactly there. I wrote up everything I've learned from 200+ incident responses - the attack patterns, the real costs, and most importantly, how to prevent it. Not trying to sell anything, just tired of seeing good teams get hurt by stuff we can fix. The goal isn't perfect security. It's avoiding the preventable disasters that destroy companies and careers.
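Going one step beyond the grep above (my addition, not the author's): a dedicated secret scanner catches high-entropy strings and provider-specific token formats that keyword searches miss. A sketch assuming gitleaks is installed:

```bash
# Scan every branch's full history for committed secrets and write a JSON report.
gitleaks detect --source . --log-opts="--all" --report-path gitleaks-report.json
```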
Here's the link to the free guide: https://medium.com/@heinancabouly/the-50m-security-hole-in-your-ci-cd-pipeline-and-how-to-fix-it-before-attackers-find-it-9a1308fbb3dc?source=friends_link&sk=5997988b9e9fbf2c31189f24dcf26e73
Hope this helps someone avoid a 3am call like the ones I get.
https://redd.it/1nk9h1y
@r_devops
OTEL Collector + Tempo: How to handle frontend traces without exposing the collector?
Hey everyone!
I’m working with an environment using OTEL Collector + Tempo. The app has a frontend in Nginx + React and a backend in Node.js. My backend can send traces to the OTEL Collector through the VPC without any issues.
My question is about the frontend: in this case, the traces come from the public IP of the client accessing the app.
Does this mean I have to expose the Collector publicly (e.g., HTTPS + Bearer Token), or is there a way to keep the Collector completely private while still allowing the frontend to send traces?
Current setup:
Using GCP
Frontend and backend are running as Cloud Run services
They send traces to the OTEL Collector running on a Compute Engine instance
The connection goes through a Serverless VPC Access connector
Any insights or best practices would be really appreciated!
https://redd.it/1nkcf43
@r_devops
OpenTelemetry Collector: What It Is, When You Need It, and When You Don’t
Understanding the OpenTelemetry Collector - what it does, how it works, real architecture patterns (with and without it), and how to decide if/when you should deploy one for performance, control, security, and cost efficiency.
https://oneuptime.com/blog/post/2025-09-18-what-is-opentelemetry-collector-and-why-use-one/view
https://redd.it/1nkd6vl
@r_devops
G-Man: Automatically (and securely) inject secrets into any command
I have no clue if anyone will find this useful but I wanted to share anyway!
I created this CLI tool called [G-Man](https://github.com/Dark-Alex-17/gman) whose purpose is to automatically fetch and pass secrets to any command securely from any secret provider backend, while also providing a unified CLI to manage secrets across any provider.
I've found this quite useful if you have applications running in AWS, GCP, etc. that have configuration files that pull from Secrets Manager or some other cloud secret manager. You can use the same secrets locally for development, without needing to manually populate your local environment or configuration files, and can easily switch between environment-specific secrets to start your application.
## What it does
* `gman` lets you manage your secrets in any of the supported secret providers (it currently supports the 3 major cloud providers and a local encrypted vault if you prefer client-side storage)
* Store secrets once (local encrypted vault or a cloud secret manager)
* Then use `gman` to inject secrets securely into your commands via environment variables, flags, or auto-injection into configuration files.
* Can define multiple run profiles per tool so you can easily switch environments, sets of secrets, etc.
* Can switch providers on the fly via the `--provider` flag
* Sports a `--dry-run` flag so you can preview the injected command before running it
## Providers
- Local: encrypted vault (Argon2id + XChaCha20‑Poly1305), optional Git sync.
- AWS Secrets Manager: select profile + region; delete is immediate (force_delete_without_recovery=true).
- GCP Secret Manager: ADC (`gcloud auth application-default login`) or `GOOGLE_APPLICATION_CREDENTIALS`; deleting a secret removes all versions.
- Azure Key Vault: `az login`/DefaultAzureCredential; deleting a secret removes all versions (subject to soft-delete/purge policy).
## CI/CD usage
- Use least‑privileged credentials in CI.
- Fetch or inject during steps without printing values:
- `gman --provider aws get NAME`
- `gman --provider gcp get NAME`
- `gman --provider azure get NAME`
- `gman get NAME` (the default-configured provider you chose)
- File mode can materialize config content temporarily and restore after run.
- Add & get:
- `echo "value" | gman add MY_API_KEY`
- `gman get MY_API_KEY`
- Inject env vars for AWS CLI:
- `gman aws sts get-caller-identity`
- This is more useful when running applications that actually use the AWS SDK and need the AWS config beforehand, like Spring Boot projects, for example - but this gives you the idea.
- Inject Docker env vars via the `-e` flags automatically
- `gman docker run my/image` injects `-e KEY=VALUE`
- Inject into a set of configuration files based on your run profiles
- `gman docker compose up`
- Automatically injects secrets into the configured files, and removes them from the file when the command ends
## Install
- `cargo install gman` (macOS/Linux/Windows).
- `brew install Dark-Alex-17/managarr/gman` (macOS/Linux).
- One-line bash/powershell install:
- `bash` (Linux/MacOS): `curl -fsSL https://raw.githubusercontent.com/Dark-Alex-17/gman/main/install.sh | bash`
- `powershell` (Linux/MacOS/Windows): `powershell -NoProfile -ExecutionPolicy Bypass -Command "iwr -useb https://raw.githubusercontent.com/Dark-Alex-17/gman/main/noscripts/install_gman.ps1 | iex"`
- Or grab binaries from the [releases page](https://github.com/Dark-Alex-17/gman/releases/latest).
### Links
- GitHub: https://github.com/Dark-Alex-17/gman
And to preemptively answer some questions about this thing:
* I'm building a much larger, separate application in Rust that has an `mcp.json` file like Claude Desktop's, and I didn't want to require my users to put things like their GitHub tokens in plaintext in the file to configure their MCP servers. So I wanted a Rust-native way of storing, encrypting/decrypting, and injecting values into the `mcp.json` file, and I couldn't find another library
that did exactly what I wanted; i.e. one that supported environment variable, flag, and file injection into any command, and supported many different secret manager backends (AWS Secrets Manager, local encrypted vault, etc). So I built this as a dependency for that larger project.
* I also built it for fun. Rust is the language I've learned that requires the most practice, and I've only built 6 enterprise applications in Rust and 7 personal projects, but I still feel like there's a TON for me to learn.
So I also just built it for fun :) If no one uses it, that's fine! Fun project for me regardless and more Rust practice to internalize more and learn more about how the language works!
https://redd.it/1nkf33p
@r_devops
Ridiculous pay rate
I just came here to say I had a recruiter reach out, and they were offering a 24/hr pay rate for a DevOps engineer position.
What the hell is that pay? Thankful I'm already at a great FT job, but that's absurd for DevOps work, or really anything in IT.
And if it was just a scam to steal my information, they could have gone higher on the pay rate to make sending my resume over more enticing.
https://redd.it/1nkgnax
@r_devops
I'm currently transitioning from help desk to DevOps at my job - how can I do the best I can? I was told it will be "a lot" and I'm already lost in the code
So we purchased Puppet Enterprise to help automate the configuration management of our servers. I was a part of the general Puppet training but not involved in the configuration management side of the training. There were two parts.
Now I've been given this job: I have to automate the installation of all our security software and also our CIS benchmarks. Some work has been done, but there's a ton left to do.
I'm not going to lie, it feels like a daunting task - and I was told it would be - and I'm not even "fully" in the role; I still have to "split time," which IMO makes it even harder.
Right now I'm using my time at work to self-study almost the whole day.
I kind of like the fact that I could make a job out of this here, but there's just so much code and so many different branches. I sit there looking at some of the code and it overwhelms me how much I don't know - what does this attribute do, and why is this number zero? It's a lot, and I do wish I had some work-sponsored training, since I wasn't invited to the second week of training.
https://redd.it/1nkj7m7
@r_devops