Reddit DevOps – Telegram
How do you think your role will change over the next decade, and how are you preparing for it?

Hey everyone!

I’ve been having these thoughts lately that honestly give me a bit of anxiety. We’ve all seen how fast AI has evolved. It’s not perfect, but it’s improving at an unbelievable pace.

I work in DevOps, and I think I’ve been doing fairly well so far, but I can’t help wondering how sustainable this career really is in the long run.
The demand for DevOps engineers already feels lower compared to other tech roles, and with AI slowly taking over, I sometimes wonder how long this role will stay as relevant as it is today.

On top of that, tech jobs in general don’t feel very stable. It’s not like traditional careers where you can safely work till 60. Another thing I keep thinking about is what happens over the next decade, when a large cohort of younger engineers move into senior roles. There will be a lot of people competing for management and leadership positions, and we all know not everyone is going to get them. That makes the future feel even more uncertain.

Then there’s the financial angle. The world is more debt-driven than ever. Housing prices are through the roof, and for someone like me with no family backup, taking on a 15–20 year home loan feels risky.

So I wanted to get some honest perspectives from this community:
- How much can one really rely on a DevOps career (or tech in general) for the long term?
- How do you position yourself to stay relevant and employable as the industry keeps changing?
- What’s a realistic way to build a second stream of income as a hedge? I’ve looked into a few options, but nothing has really clicked with my skills or situation so far.

Would really appreciate hearing from others who’ve had similar thoughts, or from anyone who’s found a way to deal with this uncertainty better.

https://redd.it/1ohajid
@r_devops
How do you verify vulnerability deltas between provider hardened and official upstream images?

I started benchmarking some hardened base images against their official upstreams (Ubuntu, Alpine, Debian, etc.). Theoretically, the CVE count drops dramatically, but scanner metadata doesn’t always align. Some vulnerabilities are silently patched by upstream backports that scanners don’t recognize. Others look fixed in the hardened version but are really just suppressed by package removal. How do you objectively measure the delta between a hardened image and the stock one?
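To make the question concrete, here is a minimal sketch of the kind of comparison I mean. It assumes each scanner report has already been reduced to (CVE, package) pairs and that the hardened image's installed package list is available; `Finding`, `classify_delta`, and the bucket names are hypothetical, not any scanner's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One scanner finding, reduced to the fields needed for a delta."""
    cve: str
    package: str


def classify_delta(stock, hardened, hardened_packages):
    """Bucket stock-image findings by what happened to them in the hardened image.

    stock, hardened: iterables of Finding from the two scan reports.
    hardened_packages: set of package names installed in the hardened image.
    """
    hardened_pairs = {(f.cve, f.package) for f in hardened}
    buckets = {"still_present": [], "package_removed": [], "package_kept": []}
    for f in stock:
        pair = (f.cve, f.package)
        if pair in hardened_pairs:
            buckets["still_present"].append(pair)
        elif f.package not in hardened_packages:
            # Finding vanished because the package is gone, not because it was fixed.
            buckets["package_removed"].append(pair)
        else:
            # Package is still installed but the finding is gone: a real fix,
            # a backport, or a scanner mismatch. Needs manual verification.
            buckets["package_kept"].append(pair)
    return {name: sorted(pairs) for name, pairs in buckets.items()}
```

The `package_kept` bucket is the interesting one: it is where real fixes and scanner metadata mismatches end up mixed together, so it is the list that still needs checking against the distro's security tracker.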

https://redd.it/1ohbi0p
@r_devops
which roadmap?

Hey, I'm starting to study to become a DevOps engineer and I found two roadmaps. This one:
Become A DevOps Engineer in 2025: [A Practical Roadmap](https://devopscube.com/become-devops-engineer/)
And this one from roadmap.sh:
https://roadmap.sh/devops
I don't know which one to follow. Any help, please?

https://redd.it/1ohc5bp
@r_devops
Residency-first collaboration for regulated orgs: neutral notes on Gem Team

Regulated teams often need collaboration tools they can fully control. Gem Team is one example in this space - a secure B2B messenger that brings chat, voice, video, and file sharing together in one familiar workspace with enterprise-grade safeguards.

According to its docs, it supports meetings with up to 300 participants, including screen sharing, recording, and moderator roles. You also get presence indicators, message editing, delivery status, and native voice notes.

On the security side, it uses TLS 1.3, encryption at rest, and minimizes metadata. The platform runs on fail-safe clusters in Uptime Institute Tier III facilities. Deployment is flexible - on-prem, secure cloud, hybrid, or even fully air-gapped - with extras like IP masking and metadata shredding.

Data residency and lifecycle controls are customizable - you can choose where data is stored, set retention periods, and automate deletion on servers and endpoints. It aligns with ISO 27001, GDPR, and GCC regulations (including Qatar CRA).

Compared to cloud-only suites like Slack or Microsoft Teams, Gem Team focuses on data sovereignty, large meetings and recording out of the box, and no stated limits on message or file history.

https://redd.it/1ohee2r
@r_devops
Debugging LLM apps in production was harder than expected

I have been running an AI app with RAG retrieval, agent chains, and tool calls. Recently some users started reporting slow responses and occasionally wrong answers.

Problem was I couldn't tell which part was broken. Vector search? Prompts? Token limits? Was basically adding print statements everywhere and hoping something would show up in the logs.

APM tools give me API latency and error rates, but for LLM stuff I needed:

- Which documents got retrieved from the vector DB
- The actual prompt after preprocessing
- Token usage breakdown
- Where the bottlenecks are in the chain

My Solution:

Set up Langfuse (open source, self-hosted). It uses Postgres, ClickHouse, Redis, and S3, with web and worker containers.

The @observe() decorator traces the pipeline. Shows:

- Full request flow
- Prompts after templating
- Retrieved context
- Token usage per request
- Latency by step
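To show the idea of per-step tracing without depending on any SDK, here is a small stand-in. It is not the Langfuse API: `traced`, `TRACE`, and the toy pipeline functions are hypothetical names, and the decorator only records step names and latency.

```python
import functools
import time

TRACE = []  # collected spans, appended when each step finishes


def traced(step_name):
    """Record how long the wrapped function took under the given step name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failed steps are timed too.
                TRACE.append({"step": step_name, "seconds": time.perf_counter() - start})
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query):
    # Placeholder for the vector search call.
    return ["doc about " + query]


@traced("build_prompt")
def build_prompt(query, docs):
    # Placeholder for prompt templating.
    return "Context: " + " | ".join(docs) + "\nQuestion: " + query


@traced("pipeline")
def answer(query):
    docs = retrieve(query)
    return build_prompt(query, docs)
```

Calling `answer("caching")` leaves one entry per step in `TRACE`, inner steps first. A real tracing tool adds the inputs, outputs, and token counts on top of this.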

Deployment

Used their Docker Compose setup initially. Works fine for smaller scale. They have Kubernetes guides for scaling up. [Docs](https://langfuse.com/self-hosting)

Gateway setup

Added Anannas AI as an LLM gateway. Single API for multiple providers with auto-failover. Useful for hybrid setups when mixing different model sources.

Anannas handles gateway metrics, Langfuse handles application traces. Gives visibility across both layers. [Implementation Docs](https://langfuse.com/integrations/gateways/anannas)

What it caught

Vector search was returning bad chunks - embeddings cache wasn't working right. Traces showed the actual retrieved content so I could see the problem.

Some prompts were hitting context limits and getting truncated. Explained the weird outputs.

Stack

- Langfuse (Docker, self-hosted)
- Anannas AI (gateway)
- Redis, Postgres, ClickHouse

Trace data stays local since it's self-hosted.

If anyone is debugging similar LLM issues for the first time, this might be useful.

https://redd.it/1ohf70t
@r_devops
Any self-hostable alternatives to CodeRabbit?

As mentioned in the title, I'm looking for open-source, self-hosted alternatives to CodeRabbit that can be deployed in our own cloud and integrated with OpenAI, Claude, or other AI API keys. The reason is straightforward: we’re a startup with cloud startup credits, so rather than purchasing CodeRabbit, we’d prefer to use these existing credits to run a similar solution ourselves.

https://redd.it/1oheri0
@r_devops
what Git flow for a repo of Ansible playbooks?

Hello all! I started a new contract where I have to administer a Consul cluster, mainly with Ansible playbooks run through an AWX platform.

---

Currently there is one branch per environment and there is no difference between them.

So for each change, we merge the feature branch into each environment branch. It seems cumbersome to me. On the AWX platform we have a deployment template for each branch.

We are a team of 2, sometimes 3. I started to talk about tags and release/develop branches, but they don't know about those concepts.

I was thinking of proposing a trunk-based approach with rc and release tags, which will be linked to the AWX templates, with only one main branch plus feature branches.

Our development environments could be linked to our main branch, the staging environment to an rc tag, and our production to a release tag.

Also, there is no pipeline today, so I also wanted to add a job that automates updating the AWX templates so they point at the right tags.
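Here is a rough sketch of the selection logic such a job could use. It assumes a tag scheme like `v1.2.0-rc.1` for release candidates and `v1.2.0` for releases, which is only an example convention; `refs_for_environments` is a hypothetical helper, and the step that actually updates the AWX templates is left out.

```python
import re

# Assumed tag convention: v1.2.0 for releases, v1.2.0-rc.3 for release candidates.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?$")


def parse_tag(tag):
    """Return ((major, minor, patch), rc_number_or_None), or None if the tag does not match."""
    match = TAG_RE.match(tag)
    if not match:
        return None
    major, minor, patch, rc = match.groups()
    return (int(major), int(minor), int(patch)), (int(rc) if rc is not None else None)


def refs_for_environments(tags, main_branch="main"):
    """Pick the Git ref each environment's template should point at."""
    releases, candidates = [], []
    for tag in tags:
        parsed = parse_tag(tag)
        if parsed is None:
            continue  # ignore tags outside the convention
        version, rc = parsed
        if rc is None:
            releases.append((version, tag))
        else:
            candidates.append((version, rc, tag))
    return {
        "development": main_branch,
        "staging": max(candidates)[2] if candidates else None,
        "production": max(releases)[1] if releases else None,
    }
```

With tags `v1.0.0`, `v1.0.1`, `v1.1.0-rc.1`, and `v1.1.0-rc.2`, staging would follow `v1.1.0-rc.2` and production `v1.0.1`, while development stays on `main`.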

---

What do you think about it?
Do you have any advice or another approach?

thanks!


https://redd.it/1ohcxo2
@r_devops
Did you have to leetcode to get your DevOps role and was it worth it (i.e. financially)?

I have never had to leetcode for my DevOps jobs in the past 10 years. However, none of what I’ve ever done is more than 30% scripting/coding. I have learnt TypeScript and Go just to stay competitive but no one ever tested me on it. That being said, I’m working in a LCOL region of the US and I’m in the top percentile of this region. It’s not bad. I get envious of the FAANG income-earners from time to time but I largely can’t complain. Has anybody else seen benefits from learning leetcode for this field in particular?

https://redd.it/1ohk7dn
@r_devops
Monitoring Jenkins Nodes with Datadog

Hi Community,

We have a Jenkins controller connected to multiple build nodes.
I’d like to monitor the health and performance of these nodes using Datadog.

I’ve explored the available Jenkins metrics and events, but haven’t been able to find a clear way to capture node-level metrics (such as connectivity, availability, or job execution health) through Datadog.

Has anyone implemented Datadog monitoring for Jenkins nodes successfully?
If so, could you please share how you achieved it or point me toward relevant configuration steps or documentation?

Appreciate any guidance or best practices you can provide!

Thanks,

https://redd.it/1ohl2v1
@r_devops
AWS Apprunner - impossible to deploy with - how do you use it??

Trying to develop on App Runner with CDK, Python, etc., with a React/Next.js web app, a Node server, and Docker.

I keep running into "An error occurred (InvalidRequestException) when calling the StartDeployment operation: Can't start a deployment on the specified service, because it isn't in RUNNING state."

You would think you could just cancel the deployment, but the option is fully greyed out. I can't do anything and it's just hanging with very limited logging.
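For context, the error means StartDeployment is only accepted while the service status is RUNNING. A minimal sketch of waiting for that state before deploying, assuming a boto3 `apprunner` client is passed in (`wait_then_deploy` and its timeout defaults are hypothetical):

```python
import time


def wait_then_deploy(client, service_arn, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll the service status and start a deployment only once it is RUNNING.

    client: an object exposing describe_service and start_deployment,
    e.g. boto3.client("apprunner"). The sleep parameter is injectable for testing.
    """
    waited = 0
    while True:
        status = client.describe_service(ServiceArn=service_arn)["Service"]["Status"]
        if status == "RUNNING":
            return client.start_deployment(ServiceArn=service_arn)
        if status != "OPERATION_IN_PROGRESS":
            # PAUSED, CREATE_FAILED, etc. will not become RUNNING by waiting.
            raise RuntimeError(f"service is in {status}; a deployment cannot be started")
        if waited >= timeout_s:
            raise TimeoutError(f"service still {status} after {waited}s")
        sleep(poll_s)
        waited += poll_s
```

This does not un-stick a hanging operation. It only avoids firing StartDeployment while another operation is still in progress.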

How do you properly develop on this thing?

https://redd.it/1ohnrse
@r_devops
Playwright tests failing on Windows but fine on macOS

I'm running the same Playwright suite locally on macOS and in CI on Windows runners. It works perfectly on Mac but fails randomly on Windows. I tried disabling video recording and headless mode, with no luck. Has anyone else seen platform-specific instability like this?

https://redd.it/1ohrw95
@r_devops
Self-hosting MySQL on a Hetzner server

With all those managed databases out there it's an 'easy' choice to go for that, as we did years ago. We are currently paying 130 for 8 GB RAM and 4 vCPUs, but I was wondering how hard it would actually be to self-host this MySQL DB on a Hetzner server. The DB is mainly used by 8-9 integration/middleware applications, so there is always throughput, but no application data (passwords etc.) is stored.

What are the things I should think about, and would running this DB on a dedicated server, next to some Docker applications (the Laravel apps), be fine? Of course we would set up automatic backups.

The main reason I am looking into this is cost.

https://redd.it/1ohu7yj
@r_devops
How to stay up to date?

I’m a DevOps Engineer focused on building, improving and maintaining AWS infrastructure, so basically my stack is AWS, Terraform, GitHub Actions, a bit of Ansible (and Linux of course). Those are my daily tools. However, I want to apply to Big Tech companies and I realize they require multiple DevOps tools… As you might know, DevOps involves multiple tools, so how do you keep up to date with all of them? It is frustrating.

https://redd.it/1ohwcqk
@r_devops
How do you deal with stagnation when everything else about your job is great?

Hi everyone,

I’m a 13-year IT professional with experience mainly across DevOps, Cloud, and a bit of Data Engineering. I recently joined a service-based company about six months ago. The pay is decent, work-life balance is great, and the office is close by. I only need to go in a few days a month — so overall, it’s a very comfortable setup.

But the project and tech stack are extremely outdated. I was hired to help modernize things through DevOps, but most of the challenges are people- and process-related, not technical. The team is still learning very basic stuff, and there’s hardly any opportunity to work on modern tooling or architecture.

For the last few years, my learning curve was steep and exciting, but ever since joining this project, it’s almost flat. I’m starting to worry that staying in such an environment for too long could make me technologically handicapped in the long run.

I really don’t want to get stuck in a comfort zone and then realize years later that I’ve fallen behind. Because if, at some point, I want to switch jobs — whether for growth or monetary reasons — I might struggle to stay relevant.

So, I wanted to ask:
👉 How do you handle situations like this?
👉 How do you keep your skills sharp and your career moving forward when your current role offers comfort but little learning?

Would love to hear how others have navigated this phase without losing momentum.

https://redd.it/1ohwpk7
@r_devops
Experiment - bridging the gap between traditional networking and modern automation/API-driven approaches with AI

I work as a network admin, and the only time you hear about our team is when something breaks. We spend the vast majority of our time auditing the network, doing enhancements, verifying redundancies, all the boring things that need to be done. I've been thinking a lot about bridging the gap between traditional networking and modern automation/API-driven approaches to create tools and ultimately have proactive alarming and troubleshooting. Here’s a project I’ve been working on and am starting to document: https://youtu.be/rRZvta53QzI

There are a lot of videos of people showing a proof of concept of what AI can do for different applications, but nothing in-depth is out there. I spent the last 6 months really pushing the limits relative to the work I do to create something that is scalable, secure, restrictive and practical. Coding-wise, I did support for an Adobe ColdFusion application a lifetime ago, plus PowerShell scripting, so I do understand programming concepts, but I am a network admin first.

I would be curious to see if there are any actual developers exploring this space at this depth.

https://redd.it/1ohvdif
@r_devops
Guide: How to add Basic Auth to Prometheus (or any app) on Kubernetes with AWS ALB Ingress (using Nginx sidecar)

I recently tackled a common challenge that many of us face: securing internal dashboards like Prometheus when exposed via an AWS ALB Ingress. While ALBs are powerful, they don't offer native Basic Auth, often pushing you towards more complex OIDC solutions when a simple password gate is all that's needed.

I've put together a comprehensive guide on how to implement this using an Nginx sidecar pattern directly within your Prometheus (or any) application pod. This allows Nginx to act as the authentication layer, proxying requests to your app only after successful authentication.

What the guide covers:

- The fundamental problem of ALB & Basic Auth.
- Step-by-step setup of the Nginx sidecar with custom nginx.conf, 401.html, and health.html.
- Detailed `values.yaml` configurations for `kube-prometheus-stack` to include the sidecar, volume mounts, and service/ingress adjustments.
- Crucially, how to implement a "smart" health check that validates the entire application's health, not just Nginx's.
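The "smart" health check idea, shown as plain logic rather than Nginx config: the health endpoint reports healthy only if the app behind the proxy answers. This is an illustrative Python sketch, not the configuration from the article; `health_response` and `http_status` are hypothetical names, and the upstream URL in the usage note assumes Prometheus on its default port.

```python
from urllib.request import urlopen


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request to url."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


def health_response(app_url, fetch=http_status):
    """Answer the load balancer's health probe based on the upstream app's health.

    Returns (status_code, body). A plain proxy-only check would always return 200
    as long as the proxy is up, even if the app behind it is down.
    """
    try:
        healthy = fetch(app_url) == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's HTTP errors.
        healthy = False
    return (200, "ok") if healthy else (503, "upstream unhealthy")
```

For Prometheus this could be called as `health_response("http://127.0.0.1:9090/-/healthy")`, so the ALB target is marked unhealthy whenever Prometheus itself stops responding.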

This is a real-world, production-tested approach that avoids over-complication. I'm keen to hear your thoughts and experiences!

Read the full article here: https://www.dheeth.blog/enabling-basic-auth-kubernetes-alb-ingress/

Happy to answer any questions in the comments!

https://redd.it/1oi0ztc
@r_devops
what's a "best practice" you actually disagree with?

We hear a lot of dogma about the "right" way to do things in DevOps. But sometimes, strict adherence to a best practice can create more complexity than it solves.

What's one commonly held "best practice" you've chosen to ignore in a specific context, and what was the result? Did it backfire or did it actually work better for your team?

https://redd.it/1oi1daa
@r_devops
Observability Sessions at KubeCon Atlanta (Nov 10-13)

Here's what's on the observability track that's relevant to day-to-day ops work:

OpenTelemetry sessions:

[Taming Telemetry at Scale](https://sched.co/27FUv) - standardizing observability across teams (Tue 11:15 AM)
Just Do It: OpAMP - Nike's production agent management setup (Tue 3:15 PM)
[Instrumentation Score](https://sched.co/27FWx) - figuring out if your traces are useful or just noise (Tue 4:15 PM)
Tracing LLM apps - observability for non-deterministic workloads (Wed 5:41 PM)

CI/CD + deployment observability:

[End-to-end CI/CD observability with OTel](https://colocatedeventsna2025.sched.com/event/28D4A) - instrumenting your entire pipeline, not just prod (Wed 2:05 PM)
Automated rollbacks using telemetry signals - feature flags that roll back based on metrics (Wed 4:35 PM)
[Making ML pipelines traceable](https://colocatedeventsna2025.sched.com/event/28D7e) - KitOps + Argo for MLOps observability (Wed 3:20 PM)
Observability for AI agents in K8s - platform design for agentic workloads (Wed 4:00 PM)

Observability Day on Nov 10 is worth hitting if you have an All-Access pass. Smaller rooms, better Q&A, less chaos.

Full breakdown with first-timer tips: https://signoz.io/blog/kubecon-atlanta-2025-observability-guide/

Disclaimer: I work at SigNoz. We'll be at Booth 1372 if anyone wants to talk shop about observability costs or self-hosting.

https://redd.it/1oi1vw6
@r_devops
CI/CD pipelines are starting to feel like products we need to maintain

I remember when setting up CI/CD was supposed to simplify releases. Build, test, deploy, done.
Now it feels like maintaining the pipeline is a full-time job on its own.

Every team wants a slightly different workflow. Every dependency update breaks a step.
Secrets expire, runners go missing, and self-hosted agents crash right before release.
And somehow, fixing the pipeline always takes priority over fixing the app.

At this point, it feels like we’re running two products: the one we ship to customers, and the one that ships the product.

Does anyone else feel like their CI/CD setup has become its own mini ecosystem?
How do you keep it lean and reliable without turning into a build engineer 24/7?

https://redd.it/1oi3clf
@r_devops