our ci/cd testing is so slow devs just ignore failures now
we've got about 800 automated tests running in our ci/cd pipeline and they take forever. 45 minutes on average, sometimes over an hour if things are slow.
worse than the time is the flakiness. maybe 5 to 10 tests fail randomly on each run, always different ones. so now devs just rerun the pipeline and hope it passes the second time. which obviously defeats the purpose.
we're trying to do multiple deploys per day but the qa stage has become the bottleneck. either we wait for tests or we start ignoring failures which feels dangerous.
tried parallelizing more but we hit resource limits. tried being more selective about what runs on each pr but then we miss stuff. feels like we're stuck between slow and unreliable.
anyone solved this? need tests that run fast, don't fail randomly, and actually catch real issues.
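One first step that has worked for teams in this spot: rerun the suite several times against the same commit and quarantine anything whose outcome changes, instead of letting it block PRs. A minimal sketch of the detection step (the run data and test names below are illustrative, not from the post):

```python
from collections import defaultdict

def find_flaky(runs):
    """Given a list of {test_name: "pass"/"fail"} dicts from repeated
    CI runs on the SAME commit, return tests that both passed and
    failed -- likely flaky rather than genuinely broken."""
    outcomes = defaultdict(set)
    for run in runs:
        for test, result in run.items():
            outcomes[test].add(result)
    return sorted(t for t, seen in outcomes.items() if len(seen) > 1)

runs = [
    {"test_login": "pass", "test_cart": "fail", "test_search": "pass"},
    {"test_login": "pass", "test_cart": "pass", "test_search": "pass"},
]
print(find_flaky(runs))  # → ['test_cart']
```

Anything this flags goes into a quarantine list that runs but doesn't gate merges, which separates the "slow" problem from the "unreliable" one.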
https://redd.it/1qr00b5
@r_devops
made one rule for PRs: no diagram means no review. reviews got way faster.
tried a small experiment on our repo. every PR needed a simple flow diagram, nothing fancy, just how things move. surprisingly, code reviews became way easier. fewer back-and-forths, fewer “wait what does this touch?” moments. seeing the flow first changed how everyone read the code.
curious if anyone else here uses diagrams seriously in dev workflows?
https://redd.it/1qr131v
@r_devops
Build once, deploy everywhere and build on merge.
Hey everyone, I'd like to ask you a question.
I'm a developer learning some things in the DevOps field, and at my job I was asked to configure the CI/CD workflow. Since we have internal servers, and the company doesn't want to spend money on anything cloud-based, I looked for as many open-source and free solutions as possible given my limited knowledge.
I configured a basic IaC setup with bash scripts to manage ephemeral self-hosted runners from GitHub (I should have used GitHub's Actions Runner Controller, but I didn't know about it at the time), a Docker registry to hold the different repository images, and the workflows in each project.
Currently, the CI/CD workflow is configured like this:
A person opens a PR, Docker builds it, and that build is sent to the registry. When the PR is merged into the base branch, Docker deploys based on that built image.
But a problem shows up when two PRs branch from the same base. If PR A is merged, the deployment includes PR A's changes. If PR B is merged later, the deployment has PR B's changes but not PR A's, because PR B's image was built before PR A merged, against the old base.
For the changes from PR A and PR B to appear in a deployment, a new PR C must be opened after the merge of PR A and PR B.
I did it this way because, researching it, I saw the concept of "Build once, deploy everywhere".
However, this flow doesn't seem very productive, so researching again, I saw the idea of "Build on Merge", but wouldn't Build on Merge go against the Build once, deploy everywhere flow?
What flow do you use and what tips would you give me?
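For what it's worth, the two ideas don't actually conflict: "build once, deploy everywhere" means promoting one artifact unchanged through dev/stage/prod, while "build on merge" says which commit that one artifact is built from. Building after the merge and tagging the image with the merge commit SHA gives you both. A hypothetical sketch (the registry name is made up):

```python
def deploy_tag(merge_commit_sha: str) -> str:
    """Tag the post-merge build with the merge commit SHA, so the
    deployed artifact is exactly what's on the base branch -- and that
    same image is then promoted to every environment unchanged."""
    return f"registry.internal/app:{merge_commit_sha[:12]}"

# The PR-time build is tagged with the PR head SHA. After merging,
# the base branch points at a different commit, so deploying the
# PR-time image can miss changes merged in between (the PR A / PR B
# problem described above).
pr_head_sha = "a1b2c3" + "0" * 34
merge_sha = "d4e5f6" + "0" * 34
assert deploy_tag(pr_head_sha) != deploy_tag(merge_sha)
print(deploy_tag(merge_sha))  # → registry.internal/app:d4e5f6000000
```

The PR-time build still earns its keep as a CI gate ("does this branch build?"); it just isn't the artifact you deploy.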
https://redd.it/1qqhrbs
@r_devops
ECR alternative
Hey all,
We’ve been using AWS ECR for a while and it was fine, no drama. Now I’m starting work with a customer in a regulated environment and suddenly “just a registry” isn’t enough.
They’re asking how we know an image was built in GitHub Actions, how we prove nobody pushed it manually, where scan results live, and how we show evidence during audits. With ECR I feel like I’m stitching together too many things and still not confident I can answer those questions cleanly.
Did anyone go through this? Did you extend ECR or move to something else? How painful was the migration and what would you do differently if you had to do it again?
https://redd.it/1qr2zq2
@r_devops
What internal tool did you build that’s actually better than the commercial SaaS equivalent?
I feel like the market is flooded with complex platforms, but the best tools I see are usually the scripts and dashboards engineers hack together to solve a specific headache.
Who here is building something on the side (or internally) that actually works?
https://redd.it/1qr4ipm
@r_devops
Argo CD Image updater with GAR
Hi everyone! I need help finding resources on the Argo CD Image Updater with Google Artifact Registry, ideally covering the whole setup. I read the official docs; they have detailed steps for ACR on Azure, but I couldn't find anything specific to GCP. Can anyone suggest a good blog on this setup, or lend a hand?
https://redd.it/1qr6j5n
@r_devops
AGENTS.md for tbdflow: the Flowmaster
I’ve been experimenting with something a bit meta lately: giving my CLI tool a Skill.
A Skill is a formal, machine-readable description of how an AI agent should use a tool correctly. In my case, I wrote a SKILL.md for tbdflow, a CLI that enforces Trunk-Based Development.
One thing became very clear very quickly:
as soon as you put an AI agent in the loop, vagueness turns into a bug.
Trunk-Based Development only works if the workflow is respected. Humans get away with fuzzy rules because we fill in gaps with judgement, but agents don't. They follow whatever boundaries you actually draw, and if you are not very explicit about what _not_ to do, they will do it...
The SKILL.md for tbdflow does things like:
Enforce short-lived branches
Standardise commits
Reduce Git decision-making
Maintain a fast, safe path back to trunk (`main`)
What surprised me was how much behavioural clarity and explicitness suddenly matter when the “user” isn’t human.
Probably something we should apply to humans as well, but I digress.
If you don’t explicitly say “staging is handled by the tool”, the agent will happily reach for `git add .`
And that is because I (the skill author) didn’t draw the boundary.
Writing the Skill forced me to make implicit workflow rules explicit, and to separate intent from implementation.
From there, step two was writing an AGENTS.md.
`AGENTS.md` is about who the agent is when operating in your repo: its persona, mission, tone, and non-negotiables.
The final line of the agent contract is:
>Your job is not to be helpful at any cost.
>Your job is to keep trunk healthy.
Giving tbdflow a Skill was step one, giving it a Persona and a Mission was step two.
Overall, this has made me think of Trunk-Based Development less as a set of practices and more as something you design for, especially when agents are involved.
Curious if others here are experimenting with agent-aware tooling, or encoding DevOps practices in more explicit, machine-readable ways.
SKILL.md:
https://github.com/cladam/tbdflow/blob/main/SKILL.md
AGENTS.md:
https://github.com/cladam/tbdflow/blob/main/AGENTS.md
https://redd.it/1qr76ye
@r_devops
Python Crash Course Notebook for Data Engineering
Hey everyone! Some time back, I put together a crash course on Python specifically tailored for Data Engineers. I hope you find it useful! I have been a data engineer for 5+ years and went through various blogs and courses, along with my own experience, to make sure I cover the essentials.
Feedback and suggestions are always welcome!
📔 Full Notebook: Google Colab
🎥 Walkthrough Video (1 hour): YouTube - already has almost 20k views & a 99%+ positive rating
💡 Topics Covered:
1. Python Basics - Syntax, variables, loops, and conditionals.
2. Working with Collections - Lists, dictionaries, tuples, and sets.
3. File Handling - Reading/writing CSV, JSON, Excel, and Parquet files.
4. Data Processing - Cleaning, aggregating, and analyzing data with pandas and NumPy.
5. Numerical Computing - Advanced operations with NumPy for efficient computation.
6. Date and Time Manipulations - Parsing, formatting, and managing date/time data.
7. APIs and External Data Connections - Fetching data securely and integrating APIs into pipelines.
8. Object-Oriented Programming (OOP) - Designing modular and reusable code.
9. Building ETL Pipelines - End-to-end workflows for extracting, transforming, and loading data.
10. Data Quality and Testing - Using `unittest`, `great_expectations`, and `flake8` to ensure clean and robust code.
11. Creating and Deploying Python Packages - Structuring, building, and distributing Python packages for reusability.
Note: I have not considered PySpark in this notebook, I think PySpark in itself deserves a separate notebook!
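As a taste of the ETL topic in miniature, here's an extract-transform-load step using only the standard library (this sketch is mine, not taken from the notebook, which leans on pandas):

```python
import csv
import io
import json

def etl(csv_text: str) -> str:
    """Tiny ETL: extract rows from CSV, transform by dropping
    incomplete records and casting types, load by emitting JSON."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["amount"]:
            continue  # transform: drop records with a missing amount
        rows.append({"user": row["user"], "amount": float(row["amount"])})
    return json.dumps(rows)

raw = "user,amount\nalice,3.5\nbob,\ncarol,2\n"
print(etl(raw))
# → [{"user": "alice", "amount": 3.5}, {"user": "carol", "amount": 2.0}]
```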
https://redd.it/1qr93s8
@r_devops
Devops Project Ideas For Resume
Hey everyone! I’m a fresher currently preparing for my campus placements in about six months. I want to build a strong DevOps portfolio—could anyone suggest some solid, resume-worthy projects? I'm looking for things that really stand out to recruiters. Thanks in advance!
https://redd.it/1qr5t6q
@r_devops
How do you track and manage expirations at scale? (certs, API keys, licenses, etc.)
Hey folks,
I’m curious how other teams handle time-bound assets in real life. Things like:
* TLS certificates
* API keys and credentials
* Licenses and subscriptions
* Domains
* Contracts or compliance documents
In theory this stuff is simple. In practice, I’ve seen outages, broken pipelines, access loss, and last minute fire drills because something expired and nobody noticed in time.
I’ve worked in a few DevOps and SRE teams now, and I keep seeing the same patterns:
* spreadsheets that slowly rot
* shared calendars nobody owns
* reminder emails that get ignored
* “Oh yeah, X was supposed to renew that”
* "There is too much tools for that and people don't communicate properly on the new time-bound assets or the new places where they are used"
So I wanted to ask the community:
**How are you handling this today?**
Some specific questions I’m really interested in:
* Where do you store expiration info? Code, CMDB, wiki, spreadsheet, somewhere else?
* Do you track ownership or is it mostly implicit?
* How far in advance do you alert, if at all?
* Are expirations tied into incident response or ticketing?
* What’s broken for you today that you’ve just learned to live with?
I’m especially curious how this scales once you’re dealing with:
* multiple teams
* multiple cloud providers
* audits and compliance requirements
* people rotating in and out
If you’ve had a failure caused by an expiration, I’d love to hear what happened and what you changed afterward, if anything.
Context: I’m a DevOps engineer myself. After getting burned by this problem a few too many times, I ended up building a small tool focused purely on expiration lifecycle management. I won’t pitch it here unless people ask. The goal of this post is genuinely to learn how others are solving this today.
Looking forward to the war stories and lessons learned.
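For TLS certs specifically, the check itself is cheap enough to run from any scheduler; the hard part, as the post says, is ownership and alert routing. A stdlib-only sketch of the check (host, port, and thresholds are placeholders):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' field returned by ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2026 GMT'."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host: str, port: int = 443) -> int:
    """Fetch host's certificate over TLS and return whole days until
    it expires (negative means already expired)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = parse_not_after(cert["notAfter"]) - datetime.now(timezone.utc)
    return remaining.days

# Example alerting rule, run daily from cron/CI:
#   if days_until_expiry("example.com") < 14: page someone
```

The interesting design question is everything around this function: where the host list lives, who gets paged, and how far in advance.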
https://redd.it/1qrdfm8
@r_devops
Resources for Debugging Best Practices
Do you guys have any books, papers, videos or other resources to develop a more disciplined or systematic approach to debugging, either in the infrastructure / system space or just general software development? I feel like I spend a huge amount of time debugging, and while learning through experience is great, I’d love to know if there were any books that you found useful.
Edit: when I say debugging, I should probably broaden it to troubleshooting in general. "Debugging" suggests mostly code or terraform files, but maybe there are more basic principles to think about.
https://redd.it/1qreise
@r_devops
How do you catch cron jobs that "succeed" but produce wrong results?
I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the actual results are wrong.
I'm seeing cases where scripts complete successfully but produce incorrect or incomplete results:
* Backup script completes successfully but creates empty backup files
* Data processing job finishes but only processes 10% of records
* Report generator runs without errors but outputs incomplete data
* Database sync completes but the counts don't match
* File transfer succeeds but the destination file is corrupted
The logs show "success" - exit code 0, no exceptions - but the actual results are wrong. The errors might be buried in logs, but I'm not checking logs proactively every day.
I've Tried:
1. Adding validation checks in scripts - works, but you have to modify every script, and changing thresholds requires code changes. Also, what if the file exists but is from yesterday? What if you need to check multiple conditions?
2. Webhook alerts - require writing connectors for every script, and you still need to parse/validate the data somewhere
3. Error monitoring tools (Sentry, Datadog, etc.) - they catch exceptions, not wrong results. If your script doesn't throw an exception, they won't catch it
4. Manual spot checks - not scalable, and you'll miss things
The validation-in-script approach works for simple cases, but it's not flexible. You end up mixing monitoring logic with business logic. Plus, you can't easily:
* Change thresholds without deploying code
* Check complex conditions (size + format)
* Centralize monitoring rules across multiple scripts
* Handle edge cases like "file exists but is corrupted" or "backup is from yesterday"
I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) via a simple API call, and it alerts if something's off. No need to dig through logs, and you can adjust thresholds without deploying code.
How do you handle similar cases in your environment?
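One common baseline, whatever tool sits on top: a result-validation step that runs right after the job and exits non-zero when the output looks wrong, so the scheduler's own failure handling fires. A sketch with made-up thresholds (the path, size, and age limits are illustrative):

```python
import os
import time

def validate_backup(path, min_bytes=1_000_000, max_age_hours=24.0):
    """Return a list of problems with a backup file: missing,
    suspiciously small, or staler than the last expected run.
    An empty list means the result looks sane."""
    if not os.path.exists(path):
        return [f"{path}: missing"]
    problems = []
    size = os.path.getsize(path)
    if size < min_bytes:
        problems.append(f"{path}: only {size} bytes (expected >= {min_bytes})")
    age_h = (time.time() - os.path.getmtime(path)) / 3600
    if age_h > max_age_hours:
        problems.append(f"{path}: {age_h:.1f}h old (expected <= {max_age_hours}h)")
    return problems

# Run right after the backup step; a non-zero exit makes cron/CI notice:
#   problems = validate_backup("/backups/db.dump")
#   if problems:
#       print("\n".join(problems), file=sys.stderr)
#       sys.exit(1)
```

This catches the "exit 0 but empty file" and "backup is from yesterday" cases; moving the thresholds out of the script and into a central config is what a dedicated tool adds on top.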
https://redd.it/1qrjfqc
@r_devops
AWS vs Azure - learning curve.
So... sorry, don't mean to hate on Azure, but why is it so hard to grasp?
Here's my example, breaking into cloud architecture, and have been trying to create serverless workflows. Mind you I already have a solid understanding, as I am currently in the IT field.
Azure Functions gave me endless problems, and I never got it working. The function never got triggered, and Azure offered no help in the form of tips etc. Certain function plans are not allowed on the free tier; there are just so many hoops to jump through. Sifting through logs is daunting, as apparently you have to set up queries just to see them.
AWS on the other hand, within 2 hours, I was able to get my app up and running. So much help just with AWS basic tips and suggested help articles.
Am I the only one who feels this way about Azure?
https://redd.it/1qrl93k
@r_devops
Asked for honest feedback last month, got it, spent January actually fixing things
a few weeks ago I posted here about OpsCompanion. You told me where it sucked and what was cool. Appreciate everyone who took the time to try it.
I was an SRE at Cloudflare, so I know that behind every issue is a real person just trying to do their job: keeping things secure, helping devs out, or dealing with stuff getting thrown over the fence.
And now everyone is vibe coding with zero context or concern about prod. Honestly I am a little worried about where this is all headed.
I see what we are all dealing with and I want to help. Would love to hear what would actually make your days easier...really. not just another AI SRE thing.
Check it out: https://opscompanion.ai/
If it still sucks, let me know and I will fix it.
https://redd.it/1qroxnu
@r_devops
Will this AWS security project add value to my resume?
Hi everyone,
I’d love your input on whether the following project would meaningfully enhance my resume, especially for DevOps/Cloud/SRE roles:
Automated Security Remediation System | AWS
Engineered event-driven serverless architecture that auto-remediates high-severity security violations (exposed SSH ports, public S3 buckets) within 5 seconds of detection, reducing MTTR by 99%
Integrated Security Hub, GuardDuty, and Config findings with EventBridge and Lambda to orchestrate remediation workflows and SNS notifications
Implemented IAM least-privilege policies and CloudFormation IaC for repeatable deployment across AWS accounts
Reduced potential attack surface exposure time from avg 4 hours to <10 seconds
Do you think this project demonstrates strong impact and would stand out to recruiters/hiring managers? Any suggestions on how I could frame it better for maximum resume value?
Thanks in advance!
https://redd.it/1qrw38y
@r_devops
How do you manage database access?
I've worked at a few different companies. Each place had a different approach for sharing database credentials for on-call staff for troubleshooting/support.
Each team had a set of read-only credentials, but credentials were openly shared (usually on a public password manager) and not rotated often. Most of them required VPNs though.
I'm building a tool for managed, credential-less database access (will not promote here).
I'm curious to know what are the other best practices that teams follow?
https://redd.it/1qsjswf
@r_devops
From QA to DevOps - What’s your advice?
Hi everyone,
I’m currently working as a Software Quality Engineer with a background in test automation, and I’m planning to transition into a DevOps role within the next 1-2 years in the EU job market.
I already have hands-on experience with:
Docker
Linux
Some Kubernetes basics
Some basics with CICD Pipelines (Gitlab, GitHub Actions)
Grafana & Prometheus
Networking
My background is mainly in automation, scripting, and system reliability from a QA perspective. I’m now trying to identify the most effective next steps to become a solid DevOps candidate in Europe.
For those who’ve made a similar move (QA/SDET → DevOps), especially in the EU:
Which skills or tools should I prioritize next (I am currently getting deeper into Kubernetes)?
What kind of practical projects actually help in EU hiring processes?
Are certifications (e.g. AWS, CKA, etc.) valued, or is experience king?
How can I best position my QA background as an advantage?
https://redd.it/1qsi7kl
@r_devops
Honestly, would you recommend the DevOps path?
This isn't one of those "DevOps or something else?" questions per se. I'm wondering if you'd genuinely recommend the path to becoming a DevOps engineer. Are you happy where you are? Are the hours making you question your life choices, etc.? I'm looking forward to hearing genuine personal opinions.
I have a networking background and I currently work as a network engineer. I have several Cisco, AWS and Azure certifications and I have been doing this for a while. I fell in love with networking instantly and I still love it to this day. However it's a lot of the same and I have to travel/be away from my family more than I'd like. I have diagnosed ADHD which I am medicated for and it's been a blessing in my life. However, it's no secret that we get extra bored of repetitive tasks if there's nothing new and exciting.
Here I feel like the DevOps career is something that could be right up my alley, the amount of knowledge you need to have to just get started, the constantly changing environment, the never ending learning and the fact that there always seems to be something to do. Please correct me if I'm wrong.
I am now eligible for a "scholarship" of sorts to get a 2-year DevOps education for free, and I wonder if you'd take that chance if it was you? I was super excited until I realised that I have barely done any coding, and sure, there are courses in coding covered in this education, but there are also many other things. Since I have experience in other areas covered, I could focus more on the coding aspect. Do you think two years will be enough experience to get into a junior DevOps role without being a burden to said company?
Thank you for your time.
/M
https://redd.it/1qssoqt
@r_devops
Astrological CPU Scheduler with eBPF
Someone built a Linux CPU scheduler that makes scheduling decisions based on planetary positions and zodiac signs with eBPF and sched_ext...and it works! Obviously not something to run into production, but still a fun idea to play around with.
"Because if the universe can influence our lives, why not our CPU scheduling too?"
https://github.com/zampierilucas/scx_horoscope
https://redd.it/1qrzpbr
@r_devops
Getting pigeon-holed in my career - Need advice
A little background on myself: I have been working for the same company, on the same team, since I graduated a few years ago. I got an internship with them while I was studying CS and was lucky enough to get a FT role on the same team as soon as I graduated. The issue is that this is a small team that purely does infrastructure automation for a big bank. I work with other infrastructure engineering teams, helping automate many of their flows and turning them into Ansible pipelines. My company doesn’t even have Terraform; we use Azure’s built-in Bicep for IaC in the cloud and Ansible for IaC on-prem. I have minimal exposure to cloud, having only done a few automations and integrations with it.
With this job I have become an Ansible expert, and I am now knowledgeable on all the basics of infrastructure engineering, especially on-prem. However, I don’t see a path upwards in my career and wanted advice on how to break out of this pigeonhole as an Ansible automation expert into more conventional Cloud/DevOps engineering.
What are maybe some certs I can pursue? What are some other ways to take my skill and expand on it? Just feeling stuck…
https://redd.it/1qspv6s
@r_devops
Underground office has a sulfuric smell
I work in a windowless office one floor underground, about 3 m (10 ft) from a large server room. There’s also a large cabinet nearby with a tunnel underneath it carrying thick cables.
Last week the room smelled sulfurous—like fart gas or stovetop gas. Not the extremely putrid “rotten egg” H₂S smell, but a persistent sulfur/fart odor.
A building inspector initially dismissed it, but a safety professional later tested the room with a 4-gas detector (O₂, CO, NO₂, SO₂, H₂S). No detectable H₂S was found, even though the odor is still present after testing.
Building management claims renovations upstairs may have disturbed plumbing (e.g., drilling into a sewer pipe), and advised heavy ventilation. It didn't smell this morning, but it's almost lunch and I can smell it again. Now my head hurts.
My concern is whether it’s safe to continue working there. If this were H₂S, smell would not be a reliable warning at dangerous concentrations, so I’m trying to assess risk without being paranoid.
I've used a career tag cause this seems to be more metawork than actual work.
https://redd.it/1qsvcqe
@r_devops