Reddit DevOps – Telegram
Already 1.1 YOE in DevOps/SRE — Is Switching to SDE Worth It?

I have ~1.1 YOE as **DevOps/SRE** (first job). I didn’t “choose” it intentionally — this was the offer I got.
In college I did **web dev + some DSA**, but I’m not strongly inclined toward any single path.

My concern:

* How is **long-term growth for DevOps/SRE** in **top product-based companies**?
* I keep hearing that **DSA + coding rounds are still required** even for good DevOps/SRE roles.
* Given that, does it make sense to **revisit development**, or is it **better to stay in DevOps/SRE**, prepare DSA, and target top PBC SRE roles?

I am planning to switch jobs and start learning again, but I'm stuck on where to begin: pick up the development path while brushing up my DevOps skills, or stay in a DevOps role and aim for top companies and career growth from there.

I’m not emotionally attached to SDE or DevOps/SRE — I just want **strong growth, good roles, and long-term optionality**.

Would love to hear from experienced folks who’ve been in SRE / DevOps / SDE roles.

https://redd.it/1povxbz
@r_devops
Blogs to read suggestions

Please suggest some blogs for working DevOps engineers covering AWS, K8s, and monitoring, ideally with a focus on troubleshooting and real production use cases.

https://redd.it/1pozt7m
@r_devops
Pivoting from Legacy Telecom Ops (SIP/SMPP) to Cloud Native (Go/K8s). Does this roadmap scream "Mid-Level" to you?

Hello All,

I have 7 years of experience in Telecom Operations (troubleshooting SIP, SMPP, Network issues) while finishing my CS degree. I know exactly how systems break in production, but I'm tired of just fixing and monitoring all the time.

I am planning a hard pivot to Backend / SRE / DevOps roles. I want to escape "Ops Support" and leverage my domain knowledge.

My Transition Roadmap: I'm spending the next year bridging the gap between "Old School Telecom" and "Modern Cloud Native":

1. Legacy to Modern: Re-implementing basic Telecom engines (which I currently troubleshoot) using Go and gRPC.
2. Infrastructure: Moving from manual server configs to Kubernetes Operators and Terraform.
3. Observability: Instead of just reading logs, building the Prometheus/Grafana stacks myself.

The Question: Does the industry value a developer who understands low-level Telecom protocols (SIP/SMPP/TCP/UDP) but writes modern Go code? Can I market myself as a Mid-Level SRE/Backend Engineer with this mix, or does the lack of "professional software development experience" (despite 7 years in Ops) automatically reset me to Junior?

Any advice from folks who moved from Ops to Dev is appreciated.

https://redd.it/1pp1i0g
@r_devops
Alternatives for Github?

Hey, due to recent changes I want to move away from it with my projects and company.

But I'm not sure what else is out there. I don't want to self-host, and I know that Codeberg's main focus is open-source projects.

Do you have any recommendations?

https://redd.it/1pp33g9
@r_devops
Any recommendations?

Hi everyone. I recently found that I'm quite interested in DevOps (it started with homelabbing). For now I use my old laptop as my sandbox. Specs: Ubuntu 24, Intel Celeron 1005M CPU, 16 GB RAM, 500 GB HDD. What I've installed so far: Docker, Portainer, Watchtower, Jenkins, Gitea, Nginx, and Immich. Now I'm about to install Prometheus + Grafana.

Well, my question is: should I create a separate directory for my Docker containers? Will that work without trouble, or are there better ways to do this? For example, Docker has /var/lib/docker, but I saw a video about installing Prometheus and Grafana (I know reading the documentation is the better way, but nevertheless) that used a separate directory, and it looks like it works. I did the same, but my separate "docker" folder sometimes doesn't show up when I run "ls". I'd like to add a screenshot of how it looks in the video, but I can't add pictures for some reason.

https://redd.it/1pp1a0a
@r_devops
GCP quotas alerting

Hey all,
Is there a recommended way to configure proactive alerts when a GCP service is approaching its quota limit (e.g. 70–80%), instead of only finding out after the quota is exceeded?


I tried using Cloud Monitoring quota metrics, but it feels clunky, and I’m not confident it’ll catch things early enough. Why? We battle-tested it with a workload burst, and the alert reached us 10 minutes later. I am sure it can work for some use cases, but it would be great if there was something smarter that can almost "feel the trend", time it, and notify in advance, not after or right after.
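
One way to approximate "feeling the trend" is to extrapolate recent usage samples and alert on the estimated time left before the limit, rather than on a fixed percentage. Below is a minimal sketch in Python, assuming usage samples are already available as (timestamp in seconds, usage) pairs. The function name `seconds_until_quota` and the sample values are hypothetical, and nothing here calls a GCP API.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_quota(
    samples: Sequence[Tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate seconds until usage reaches `limit` using a least-squares line.

    `samples` holds (timestamp_seconds, usage) pairs, oldest first.
    Returns 0.0 if the latest sample is already at the limit, and None if
    there is too little data or usage is flat or falling.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp; no slope to fit
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    last_t, last_u = samples[-1]
    if last_u >= limit:
        return 0.0
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    crossing_t = (limit - intercept) / slope  # time at which the line hits the limit
    return max(crossing_t - last_t, 0.0)


# Hypothetical samples: usage grows by 100 units per minute toward a limit of 1000.
eta = seconds_until_quota([(0, 100), (60, 200), (120, 300)], limit=1000)
```

An alert could then fire when the estimate drops below a chosen lead time (say, 15 minutes) instead of when usage crosses 80%. A straight line will still react late to a sudden burst, so this is a complement to threshold alerts, not a replacement.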



Curious what others are doing in practice.

https://redd.it/1pp5n8m
@r_devops
How do I streamline the access update process in my org?

Dealing with a bunch of role changes at my company (project swaps, team changes, etc.) and access updates have been super messy. I've seen some people using HR-triggered workflows to try to automate this, but wondering if there are other things I should be looking into. I've been looking into Console to try to handle small permission tweaks that keep coming up. Would love to hear about how other ppl are handling this!

https://redd.it/1pp8kph
@r_devops
Colleague built a pretty neat tool for managing RabbitMQ DLQs

Hey all,

Just wanted to give a quick shoutout to a dev from my company who built a tool we’ve been using internally for a while now, it’s called Rabbit GUI (https://rabbitgui.com/), and it helps us manage RabbitMQ dead letter queues. We use it to read messages from the queue, search and filter, and republish only specific messages if needed.
We’ve had it in use for a couple months, and honestly, it’s been super handy. I definitely would not want to give it up.
Disclaimer, it’s a paid tool (lifetime license though, not a subscription), but I think the pricing’s fair for what it does.

Figured I’d help him get a bit more visibility since it’s actually been useful for us.
If anyone checks it out, I’d love to hear your thoughts, happy to pass along any feedback or questions to him!
Cheers

https://redd.it/1pp7fwq
@r_devops
Is SSL decryption still worth it for AI and SaaS visibility? Am a SecOps lead btw

Anyone still banking on SSL decryption for GenAI and SaaS app visibility? What's breaking in your environment: cert pinning, HSTS, user complaints?

Particularly curious about the network layer vs app layer debate. Seeing more teams pivot to browser-native controls but want to hear operational experiences. What's your take?

https://redd.it/1ppbi0c
@r_devops
Composable DXP in practice... flexibility win or long-term maintenance tax?

I’ve been seeing more teams move away from monolithic CMS platforms toward a composable DXP model with headless CMS, search, personalization, commerce, analytics, all loosely coupled and stitched together with APIs.

On paper it’s best-of-breed everything, faster iteration, and no vendor lock-in.

In practice though, it seems like the real tradeoff shows up later in:

* Integration ownership and version drift
* Observability across multiple vendors
* Reliability when one service upstream sneezes
* The ongoing cost of “keeping the stack composed”

For those running composable DXPs in production today:

* Has it meaningfully improved delivery speed or experience quality?
* Where did the complexity actually concentrate over time (build, ops, integration, governance)?
* And if you’ve lived on both sides, would you still choose composable over a modern all-in-one today?

Less interested in vendor marketing... more in the lived operational reality.

https://redd.it/1ppa6d2
@r_devops
Am I Junior Level at least?

So I'll preface by saying I work mainly as an SDET. But lately we've been moving over from Azure to AWS, and I was kind of the first person to start messing with things. I wanted to see if this is at least "junior level" based on what I've done. Also, we are using GitLab pipelines for CI/CD for the first time.

So far I have:

* Set up CI/CD pipelines in GitLab (ci-yaml file)
* Got a working pipeline for deploying to AWS (Beanstalk for now)
* Similarly set up a working pipeline to handle Terraform plan/apply
* E2E automated testing in pipelines (this is less DevOps and more SDET though)
* Got a decent understanding of Terraform modules. Set up Terraform modules for IAM and S3 Terraform state
* Dockerized our reporting tool (Allure) and work from ECR
* Documented and worked with DevOps on environments, shared resources, etc. for moving fully to GitLab as well as AWS

It doesn't feel like a lot, and I have a ways to go but I find it interesting. Yeah I obviously used A.I. for some of the syntax/CLI commands but I feel like I have a decent idea of Architecture.

https://redd.it/1ppejw8
@r_devops
How do you compare CI/CD providers?

I've been exploring which CI/CD provider to focus on for my organization over the past few months. We've got some things in GitHub Actions, and some in Azure DevOps, mostly because different groups of people set up different solutions.

But to be honest, I can't find a compelling reason to go with one or the other. Coin toss?

And then of course, there are other options out there.

What are the key differentiators that you have come across in exploring these tools?

https://redd.it/1pph1m7
@r_devops
This is the kind of work AI should be doing

I already knew what I needed to do. The problem wasn’t lack of knowledge, it was recall. I could’ve spent time poking around, trial and erroring, or Googling until something clicked. All of that would’ve pulled me out of the flow I was in.

Instead, I asked Cosine, got what I needed almost instantly, and kept going. No rabbit holes, no context switch, no wasted mental energy.

For me, that’s the right use of AI. Handle the small, forgettable details so I can stay focused on the parts that actually require thinking.

https://redd.it/1ppjq86
@r_devops
Is £95–100k total comp solid for a senior-ish DevOps role in London?

Hey all,

Looking for a quick sanity check from people in the London market.

I've got two offers for Platform Engineer/SRE roles at large non-FAANG companies in London. Base is in the £80–90k range, total comp comes out around £95–100k with bonus.

I'm 24 and a bit unsure if this is good for the market or if I should be pushing harder or looking elsewhere. I'm not trying to min-max, I just want to know if this is a solid place to be or if I'm undervaluing myself.

Would appreciate any perspective from people hiring or working in similar roles. Thanks!

https://redd.it/1ppj4y3
@r_devops
What’s the most common reason CI/CD pipelines break down in growing teams?

As teams grow, CI/CD pipelines that once worked fine can slowly turn messy. More people, more changes, quick fixes, and suddenly the pipeline feels fragile and breaks more often than it should. Tests become flaky, environments don’t match, and everyone starts blaming the tools instead of the process.

What do you think is the main reason CI/CD pipelines break down as teams scale?

https://redd.it/1pplnrt
@r_devops
New Features We Find Exciting in the Kubernetes 1.35 Release

Hey everyone! Wrote a blog post highlighting some of the features I think are worth taking a look at in the latest Kubernetes release, including examples to try them out.

Read here: https://metalbear.com/blog/kubernetes-1-35/

https://redd.it/1ppj9ur
@r_devops
On-demand runner on AWS CodeBuild with Bitbucket Pipelines

I made a package that enables AWS CodeBuild as an on-demand self-hosted runner for Bitbucket Pipelines.

The problem: AWS CodeBuild natively supports managed runners for GitHub Actions, GitLab, etc. - but not Bitbucket.

The solution: This package bridges that gap. Your Bitbucket Pipeline triggers CodeBuild via OIDC, which spins up an ephemeral self-hosted runner on-demand. When the build completes, the runner terminates automatically.

https://github.com/westito/aws-bitbucket-runner

https://redd.it/1ppn1xy
@r_devops
I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'.

Hey everyone,

I've been diving deep into the AWS SDKs specifically to understand how billing correlates with actual usage, and I realized something annoying: Status != Usage.

The AWS Console shows a NAT Gateway as "Available", but it doesn't warn you that it has processed 0 bytes in 30 days while still costing ~$32/month. It shows an EBS volume as "Available", but not that it was detached 6 months ago from a terminated instance.

I wanted to build something that digs deeper than just metadata.

So I wrote CloudSlash.

It’s an open-source CLI tool (AGPL) written in Go.

The Engineering: I wanted to build a proper specialized tool, not just a script.

* Heuristic Engine: It correlates CloudWatch Metrics (actual traffic/IOPS) with Infrastructure State to prove a resource is unused.
* The Findings:
    * Zombie EBS: Volumes attached to stopped instances for >30 days (or unattached).
    * Vampire NATs: Gateways charging hourly rates with <1GB monthly traffic.
    * Ghost S3: Incomplete multipart uploads (invisible storage costs).
* Stack: Go + Cobra + BubbleTea (for a nice TUI). It builds a strictly local dependency graph of your resources.
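
The three findings above can be illustrated with a small rule-based sketch. It is written in Python for illustration and is not CloudSlash's actual code (the tool itself is written in Go). The `Resource` fields, the `classify` function, and the returned labels are hypothetical, and the metric values are assumed to have been collected already. Only the thresholds (30 days, 1 GB) come from the post.

```python
from dataclasses import dataclass
from typing import Optional

# The post says "<1GB"; treating that as 1 GiB is an assumption.
ONE_GB = 1024 ** 3


@dataclass
class Resource:
    kind: str                       # "ebs", "nat", or "s3_multipart"
    attached: bool = False          # EBS: attached to any instance
    instance_stopped_days: int = 0  # EBS: days the attached instance has been stopped
    monthly_bytes: int = 0          # NAT: bytes processed over the last 30 days
    upload_complete: bool = True    # S3: whether the multipart upload was completed


def classify(r: Resource) -> Optional[str]:
    """Return a waste label for the resource, or None if no rule matches."""
    if r.kind == "ebs":
        if not r.attached or r.instance_stopped_days > 30:
            return "zombie-ebs"
    elif r.kind == "nat":
        if r.monthly_bytes < ONE_GB:
            return "vampire-nat"
    elif r.kind == "s3_multipart":
        if not r.upload_complete:
            return "ghost-s3"
    return None


# A NAT gateway that is "Available" but processed no traffic this month.
label = classify(Resource(kind="nat", monthly_bytes=0))
```

Keeping each rule as a plain predicate over already-fetched data makes it easy to add new waste patterns without touching the code that queries the cloud provider.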

Why Use It? It runs with ReadOnlyAccess. It doesn't send data to any SaaS (it's local). It allows you to find waste that the basic free-tier tools might miss.

I also added a "Pro" feature that generates Terraform import blocks and destroy plans to fix the waste automatically, but the core scanning and discovery are 100% free/open source.

I'd really appreciate any feedback on the Golang structure or suggestions for other "waste patterns" I should implement next.

Repo: https://github.com/DrSkyle/CloudSlash

Cheers!



https://redd.it/1ppnn2n
@r_devops
Unpopular opinion: DORA metrics are becoming "Vanity Metrics" for Engineering Health.

I’ve been looking at our dashboard lately, and on paper, we are an "Elite" team. Deployment frequency is up, and lead time is down.

But if I look at the actual team health? It’s a mess. The Senior Architects are burning out doing code reviews, we are accruing massive tech debt to hit that velocity, and I’m pretty sure we are shipping features that don't actually move the needle just to keep the "deploy count" high.

It feels like DORA measures the efficiency of the pipeline, but not the health of the organization.

I’m trying to move away from just measuring "Output" to measuring "Capacity & Risk" (e.g., Skill Coverage, Bus Factor, Cognitive Load).

Has anyone successfully implemented metrics that measure sustainability rather than just speed? How do you explain to a board that "High Velocity" != "Good Engineering"?

https://redd.it/1ppphjb
@r_devops
How do I optimise wasted runs on GitHub Actions?

This is from one repo that has not been that active in the last 7 days:

* 39 total CI minutes
* 14 minutes were non-productive
* Biggest drivers: failed/re-run workflows and duplicate runs for the same PR


We always assumed “this is normal,” but with billing changes, it adds up fast.

I am looking into some tools that could help with this, but I am curious how others are handling this...

* Do you actively cancel outdated PR runs?
* Or just accept the cost as the price of speed?
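
For scale, 14 of 39 minutes is roughly 36% non-productive. GitHub Actions can cancel outdated runs natively through the workflow `concurrency` setting with `cancel-in-progress: true`. To estimate beforehand how much that might save, here is a minimal Python sketch that sums the minutes of runs later followed by a newer run on the same PR. It assumes run records have already been exported as dicts with the hypothetical fields `pr`, `started`, and `minutes`, which is not the shape of any GitHub API response.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def superseded_minutes(runs: Iterable[dict]) -> float:
    """Sum the minutes of every run that a newer run on the same PR followed.

    This is an upper bound on savings: an older run may have finished
    before the newer one started, in which case cancelling saves nothing.
    """
    by_pr: Dict[int, List[dict]] = defaultdict(list)
    for run in runs:
        by_pr[run["pr"]].append(run)
    wasted = 0.0
    for pr_runs in by_pr.values():
        pr_runs.sort(key=lambda r: r["started"])
        wasted += sum(r["minutes"] for r in pr_runs[:-1])  # all but the latest run
    return wasted


def wasted_share(wasted: float, total: float) -> float:
    """Fraction of total CI minutes that was non-productive."""
    return wasted / total if total else 0.0


# Hypothetical export: PR 1 was pushed twice, PR 2 once.
runs = [
    {"pr": 1, "started": 1, "minutes": 5.0},
    {"pr": 1, "started": 2, "minutes": 4.0},
    {"pr": 2, "started": 1, "minutes": 3.0},
]
saved = superseded_minutes(runs)
```

Failed and manually re-run workflows are a separate driver and would need their own accounting, since cancelling superseded runs does not address them.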



https://redd.it/1pppfsd
@r_devops