Reddit DevOps – Telegram
ai made shipping faster but understanding slower

lately i’ve been thinking about how different building feels now compared to a few years ago. getting something off the ground is insanely fast. scaffolds, endpoints, ui, all done in a weekend. but when something breaks, i’m spending way more time reading than actually writing code.

i’ve ended up using different tools depending on what i’m working on. GitHub Copilot for in-editor autocomplete and quick suggestions, Replit Agent when i want help across bigger chunks of work, Claude Code when i need to talk through a codebase at a higher level. and on larger or messier repos, i’ve found cosine surprisingly useful to trace how logic flows across files when my mental map falls apart. it’s not doing magic, it just helps me see what already exists without burning energy.

it feels like the bottleneck shifted from “can i build this?” to “do i actually understand what’s already here?” curious how others are dealing with this. do you stick to one ai tool, or do you end up with a stack where each thing does one job well?

https://redd.it/1q4ermd
@r_devops
looking for good agile tools - how do you keep github issues and planning in sync?

we rely heavily on github, but things get messy when issues turn into real work items. how are teams syncing commits, PRs, and sprint work without constant manual updates? i'm looking for good agile tools that don't slow devs down.
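to make the "constant manual updates" part concrete, this is the kind of glue i'd rather not maintain by hand: a hypothetical webhook handler that advances linked issues when a PR merges. the endpoint path, label name, and the "Closes #N" convention in PR bodies are all assumptions here, not any particular tool's API.

```python
"""Hypothetical sketch: label linked issues when a PR merges.

Assumes a GitHub `pull_request` webhook pointed at /github, a token in
GITHUB_TOKEN, and PR bodies that reference issues as "Closes #123".
"""
import os
import re

import requests
from flask import Flask, request

app = Flask(__name__)
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

@app.post("/github")
def on_event():
    event = request.get_json()
    pr = event.get("pull_request")
    if event.get("action") == "closed" and pr and pr.get("merged"):
        repo = event["repository"]["full_name"]
        # find every "Closes #123" / "Fixes #123" in the PR denoscription
        for num in re.findall(r"(?:closes|fixes)\s+#(\d+)", pr.get("body") or "", re.I):
            # move the issue along the board instead of hand-editing it
            requests.post(
                f"{API}/repos/{repo}/issues/{num}/labels",
                headers=HEADERS, json={"labels": ["done"]}, timeout=10,
            )
    return "", 204
```

(github already auto-closes issues on "Closes #N" when the PR merges; the point of something like this is the extra board/status sync that closing alone doesn't do.)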

https://redd.it/1q4i05z
@r_devops
Wrote a deep dive on sandboxing for AI agents: containers vs gVisor vs microVMs vs Wasm, and when each makes sense

https://www.luiscardoso.dev/blog/sandboxes-for-ai

Wrote this after spending too long untangling the "just use Docker" vs "you need VMs" debate for AI agent sandboxing. I think the problem is that the word "sandbox" gets applied to four different isolation boundaries with very different security properties.

So I wrote this post to lay out those boundaries and when each one actually makes sense.

Interested in what isolation strategies folks here are running in production, especially for multi-tenant or RL workloads.
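One concrete way to see where the boundaries differ: with gVisor installed, moving a sandbox from plain runc to the runsc runtime is a one-flag change on `docker run`. A minimal sketch, assuming Docker with runsc registered per the gVisor docs; the image, resource limits, and timeout are illustrative:

```python
"""Sketch: run an untrusted snippet under gVisor instead of plain runc.

Assumes Docker with the `runsc` runtime registered (see gVisor docs).
"""
import subprocess

def run_sandboxed(code: str, timeout: int = 30) -> str:
    cmd = [
        "docker", "run", "--rm",
        "--runtime=runsc",      # gVisor's user-space kernel instead of runc
        "--network=none",       # no egress for the untrusted code
        "--memory=512m", "--cpus=1",
        "python:3.12-slim",
        "python", "-c", code,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else result.stderr

print(run_sandboxed("print(2 + 2)"))
```

Same container UX, but syscalls now hit gVisor's kernel rather than the host's, which is the whole point of that boundary.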

https://redd.it/1q4pvy6
@r_devops
researching the best subscription management software for 2026: outgrowing our billing spreadsheets.

our saas company is moving from a handful of enterprise clients to a true product-led growth model with hundreds of self-serve subscribers. our manual billing and account management processes are breaking. we're planning our 2026 tech stack and know we need a dedicated subscription management platform to handle billing, dunning, prorations, and plan changes.

when i search for the best subscription management software, the big names (chargebee, recurly, zuora, stripe billing) all seem strong, but it's hard to understand the nuances for a b2b saas company at our stage. we need solid revenue recognition, tax handling, and flexible pricing models (seats, usage, flat fee).

if any finance, operations, or product folks at a scaling saas company have recently gone through this evaluation, i'd appreciate your perspective. we need a platform that can scale with us for the next 5 years. any real-world insights are invaluable.

https://redd.it/1q4thrg
@r_devops
What are some fresh, underrated tools or products you’re loving right now?

doesn’t have to be strictly DevOps, just anything that made your workflow smoother, solved an annoying problem, or sparked a little “why didn’t I try this earlier” moment. What’s on your radar lately?

https://redd.it/1q7qv2o
@r_devops
Feeling stuck in my career as an SRE

I’m currently working as a Site Reliability Engineer. My role is mostly operational — setting up and tweaking YAMLs, running cloud operations on Azure, keeping applications stable, handling container and web application deployments, troubleshooting lower env and production issues, fixing pipeline failures and build issues, and working closely with multiple DevOps teams. I also manage monitoring and observability using Datadog and Splunk.

I don’t usually build CI/CD pipelines from scratch or create Kubernetes clusters end to end — my work is more about operations, reliability, and incremental improvements rather than greenfield builds.

I have around 11 years of experience, earn a good salary, and hold certifications including Azure Architect, GCP ACE, Terraform, and AWS Associate. On paper things look fine, but lately I feel stuck career-wise. I don’t feel like I’m moving up anymore, either in responsibility or role scope.

I’d especially love to hear from senior, staff, or principal engineers (or managers who’ve coached people at that level): how did you break out of this kind of plateau, and what changes actually made a difference?

I’m curious — has anyone else been in a similar situation at this stage of their career?

What did you do to move forward?

Any advice or perspectives would be really appreciated.

https://redd.it/1q7ci6t
@r_devops
Got screwed on MLOps project payment - $11k paid out of $18k, need advice

Hey folks,
So I'm in some BS situation right now and honestly don't know if I'm being paranoid or actually getting shafted.
Started a contract gig ~4 months back. Client needed their ML stack unfucked - they had data scientists pushing models to prod with literally zero pipeline, no monitoring, nothing. My job was:
- Spin up proper MLOps infra on AWS (SageMaker + custom containers)
- Get their LLM stuff production-ready (they were running GPT wrappers with no fallbacks lmao)
- Build out some agentic workflows for their support chatbot
- Set up proper observability: Prometheus/Grafana, cost tracking, the works
- Lock down their IAM because it was a dumpster fire
Rate was $18k split across 3 milestones - $6k each for planning, implementation, and deployment/handoff.
Here's where it gets weird:
First $6k hit my account fine. Second milestone, I shipped the entire ML pipeline, containerized everything, got their models deploying automatically. Invoice them, get... $2.5k. Ask WTF, they say "we're reviewing costs quarterly now", and I just let it slide.
I didn't go aggressive because tbh I had like $9k buffer saved up and my project pipeline was dry. Figured I would finish strong, they would see the value, make it right.
Fast forward - I'm basically done. Their LLM agents are handling 60% of tickets autonomously, inference costs down 40%, everything's monitored. I even wrote runbooks for their junior devs. Invoice the last $6k.
Two weeks of ghosting, then they schedule a call. Offer me $3.2k as "completion bonus" bringing total to like $11.7k.
Their reasoning: "timeline extended beyond scope and we had infrastructure costs we didn't anticipate."
Bro. The timeline extended because THEY kept pivoting on which LLM provider to use (we went OpenAI -> Anthropic -> back to OpenAI). The infra costs went DOWN because of my work. I literally showed them the FinOps dashboards.
I'm sitting here like...? Do I just take the L and move on? My savings are getting thin and I don't have another gig yet, so part of me is like "just take the $3k and don't make enemies."
But another part is pissed because the work is legitimately good and in production making them money.
What would you do in my position?
Anyone been in something similar? I've had clients like this before who didn't pay me and ignored my follow-ups once the contract work was done. There's a special place in hell for these guys.

https://redd.it/1q7iv92
@r_devops
Former Cloudflare SRE building a tool to keep a live picture of what’s actually running. Looking for honest feedback

Hey everyone, I’m Kenneth, founder of OpsCompanion.

I spent years as a Senior SRE at Cloudflare. One thing that became painfully clear is that most outages, security issues, and compliance fire drills don’t come from a lack of tools. They come from missing context. People don’t know what’s running, how things connect, or what changed recently, especially once systems sprawl across clouds, repos, and teams.

That’s why I’m building OpsCompanion.

OpsCompanion helps engineers:

* Keep a live, visual picture of what’s running and how things connect
* Answer “what changed?” without digging through five tools, Slack threads, or the god-awful state of documentation most teams are dealing with today
* Preserve operational context so the next on-call isn’t starting from zero

This isn’t about adding more logs or alerts, or slapping AI onto existing platforms and calling it AGI. It’s about giving engineers the same mental model I used to carry in my head, but shared and kept up to date.

We’ve opened up free access for a small, curated group of engineers who work close to production. If it’s useful, great. If not, I genuinely want to know why and what would make it useful.

Free access here:
[https://opscompanion.ai/](https://opscompanion.ai/)

Everyone who signs up during this early window will get a lifetime deal once we have that part set up (I'll reach out via email), my sincere gratitude, and a real say in the product roadmap.

I’ll be in the comments. Happy to answer questions, hear skepticism, get roasted a bit, or talk about what it actually takes to be an SRE or DevOps engineer in 2026.

https://redd.it/1q7xn5c
@r_devops
DevOps Engineer: Which certifications are worth doing for the future?

Hi everyone,

I’m a DevOps Engineer with a few years of experience and I’m looking to invest in certifications that will actually help me in the long run.

Which certifications would you recommend that are relevant now and also future-proof?

Cloud, Kubernetes, security, SRE or anything else?

Would love to hear from people who’ve seen real career benefits from certs. Thanks!

https://redd.it/1q7hz9b
@r_devops
AI is making CI/CD the slowest part of DevOps. How are teams handling this?

Something I’m noticing more and more: code gets written incredibly fast now (AI + Copilot + agents), but CI pipelines haven’t caught up. A 10–15 min CI run used to be fine when coding took hours. Now it’s the dominant part of the feedback loop.

Symptoms I’m seeing:

- PRs pile up waiting on CI
- Teams reduce test coverage just to move faster
- More issues slip into prod because “CI was green but shallow”

Curious how others are handling this:

- selective test execution?
- better change-impact detection?
- just throwing more compute at CI?
- or accepting slower feedback?

Would love concrete approaches that actually worked.
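On the selective-execution option, the simplest version is just mapping changed paths to test targets. A minimal sketch, assuming a repo where `src/<pkg>/` is covered by `tests/<pkg>/`; the layout and mapping rule are made up for illustration:

```python
"""Sketch of selective test execution: run only tests affected by a diff.

Assumes src/<pkg>/... is covered by tests/<pkg>/...; adjust the mapping.
"""
import subprocess
import sys

def changed_files(base: str = "origin/main") -> list[str]:
    # files this branch touches relative to the base branch
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def impacted_test_dirs(files: list[str]) -> set[str]:
    targets = set()
    for f in files:
        parts = f.split("/")
        if parts[0] == "src" and len(parts) > 1:
            targets.add(f"tests/{parts[1]}")   # src/foo/... -> tests/foo
        elif parts[0] == "tests" and len(parts) > 1:
            targets.add(f"tests/{parts[1]}")   # edited tests run themselves
        else:
            return {"tests"}                   # unknown file: run everything
    return targets

if __name__ == "__main__":
    dirs = sorted(impacted_test_dirs(changed_files()))
    print("running:", " ".join(dirs))
    sys.exit(subprocess.run(["pytest", *dirs]).returncode)
```

The "unknown file means run everything" fallback is what keeps this from turning into "CI was green but shallow".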

https://redd.it/1q817dy
@r_devops
suggestion needed: How do you manage hundreds of minimal container images in an air-gapped environment?

We operate in isolated networks where artifacts can’t be pulled from the internet. Updating minimal images while keeping security current is challenging. What strategies do you use to automate vulnerability updates safely?
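For reference, one common pattern, sketched below: mirror a curated, rebuilt base-image set to removable media with skopeo on the connected side, then load it into the internal registry on the isolated side. Image names, paths, and the registry address are placeholders:

```python
"""Sketch of an air-gap image transfer with skopeo (names are placeholders)."""
import subprocess

IMAGES = ["alpine:3.20", "python:3.12-slim"]   # your curated, rebuilt base set
EXPORT_DIR = "/mnt/transfer"                   # removable media mount

def archive_path(image: str) -> str:
    return f"{EXPORT_DIR}/{image.replace('/', '_').replace(':', '_')}.tar"

def export_connected_side():
    for image in IMAGES:
        # pull straight from upstream into an OCI archive on the media
        subprocess.run(
            ["skopeo", "copy",
             f"docker://docker.io/library/{image}",
             f"oci-archive:{archive_path(image)}"],
            check=True,
        )

def import_isolated_side(registry: str = "registry.internal:5000"):
    for image in IMAGES:
        # push the archive into the air-gapped registry
        # (add --dest-tls-verify=false only if the registry is HTTP-only)
        subprocess.run(
            ["skopeo", "copy",
             f"oci-archive:{archive_path(image)}",
             f"docker://{registry}/{image}"],
            check=True,
        )
```

Rebuild and rescan on the connected side on a schedule, ship the archives across, and let the isolated side trust only what came off the media.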

https://redd.it/1q81va8
@r_devops
Anyone else finding observability for LLM workloads is a completely different beast?

We just started deploying some AI-heavy services and honestly I feel like I'm learning monitoring all over again. Traditional metrics like CPU and memory barely tell you anything useful when your inference times are all over the place and token usage is spiking randomly.

The unpredictability is killing me. One minute everything looks fine, the next minute latency is through the roof because some user decided to send a novel-length prompt. And don't even get me started on trying to correlate model performance with actual infrastructure costs. It's like playing whack-a-mole, but the moles are invisible.

I've spent the last few weeks trying to build out a proper observability framework for this stuff and realized most of what I learned about traditional APM only gets you halfway there. You need visibility into token throughput, embedding latencies, and model versioning, and you somehow have to tie all that back to user experience metrics.

Curious how everyone else is handling observability for their AI/ML infrastructure? What metrics are you actually finding useful vs what turned out to be noise?
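For the token-throughput piece, the mechanics are just custom metrics next to your existing scrape targets. A minimal sketch with prometheus_client; the metric names, labels, and buckets here are illustrative, not any standard:

```python
"""Sketch of LLM-specific metrics with prometheus_client (names illustrative)."""
import time
from prometheus_client import Counter, Histogram, start_http_server

TOKENS = Counter(
    "llm_tokens_total", "Tokens processed",
    ["model", "direction"],   # direction: prompt / completion
)
LATENCY = Histogram(
    "llm_request_seconds", "End-to-end inference latency",
    ["model"],
    # wide buckets: novel-length prompts blow way past web-request latencies
    buckets=(0.25, 0.5, 1, 2, 5, 10, 30, 60, 120),
)

def observe_call(model: str, prompt_tokens: int, completion_tokens: int, started: float):
    TOKENS.labels(model=model, direction="prompt").inc(prompt_tokens)
    TOKENS.labels(model=model, direction="completion").inc(completion_tokens)
    LATENCY.labels(model=model).observe(time.time() - started)

if __name__ == "__main__":
    start_http_server(9100)   # expose /metrics for Prometheus to scrape
    t0 = time.time()
    # ... call the model here ...
    observe_call("chat-model-v1", prompt_tokens=812, completion_tokens=125, started=t0)
```

Per-model token counters also give you the cost-correlation angle almost for free: multiply by your per-token price in a recording rule.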

https://redd.it/1q83pi8
@r_devops
How do you handle small webhook payload changes during local testing?

When testing webhooks locally, I often hit the same issue.

If one field in the payload needs to change, the usual options are to retrigger the external event or dig through a dashboard to resend something close enough. It works, but it’s slow and a bit clumsy.

Curious how others deal with this.
Do you have a workflow that makes small payload tweaks easier, or is this just how it is?
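One workflow that can make the tweaks easier: treat a captured payload as a fixture. Save one real delivery to a file, then re-send it locally with small overrides instead of re-triggering the event. A sketch, with the endpoint and field names as placeholders:

```python
"""Sketch: replay a captured webhook payload locally with small tweaks."""
import json

import requests

LOCAL_ENDPOINT = "http://localhost:8000/webhooks/payments"   # your handler

def replay(payload_file: str, **overrides):
    with open(payload_file) as f:
        payload = json.load(f)
    for dotted, value in overrides.items():
        node = payload
        *path, leaf = dotted.split("__")   # amount__value -> payload["amount"]["value"]
        for key in path:
            node = node[key]
        node[leaf] = value
    resp = requests.post(LOCAL_ENDPOINT, json=payload, timeout=5)
    print(resp.status_code, resp.text[:200])

# capture one real delivery to a file once, then tweak per run:
# replay("captured/payment_succeeded.json", status="failed", amount__value=0)
```

One caveat: most providers sign payloads, so locally you either re-sign with a test secret or disable signature verification in dev.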

https://redd.it/1q82815
@r_devops
Where the Cloud Ecosystem is Heading in 2026: Top 5 Predictions

Wrote a blog about where I feel the cloud ecosystem is heading in 2026. Here's a summary of the blog:


1. The AI Vibe Check

The "just add AI" honeymoon phase is ending. At KubeCon London, sessions were packed on buzzwords alone; by Atlanta, the mood had shifted to skepticism. In 2026, organizations will stop chasing hype and start demanding proof of ROI, better security audits, and a clear plan for Day 2 operations before integrating AI features.

2. Kubernetes Moves to the "Back Seat"

Kubernetes is no longer the star of the show and is more like the engine under the hood. We’re seeing a massive surge in adoption of projects like Crossplane, kro, and Kratix. Platform teams are moving away from forcing developers to touch K8s primitives, instead favoring abstractions and self-service APIs. The goal for 2026: developer experience (DevEx) that hides the complexity of the cluster.

3. The Death of Local Dev Environments

Local environments can’t keep up with modern cloud complexity or the speed of AI coding agents. The "slow feedback loop" (waiting for CI/Staging) is the new bottleneck. 2026 will be the year of production-like cloud dev environments.

4. The "Specific" AI SRE

We aren't at the "autopilot cluster" stage yet. While tools like K8sGPT and kagent are gaining ground, we won't see general-purpose AI managing entire clusters. Instead, 2026 will favor task-specific agents with limited scope and strict permissions. It’s about empowering SREs, not replacing them.

5. Open Source Fatigue

Organizations are hitting a saturation point with overlapping CNCF projects. In 2026, the "cool factor" won't be enough to drive adoption. Teams are becoming hyper-selective, prioritizing long-term maintainability, community health, and clear roadmaps over whatever is currently trending on GitHub.

https://redd.it/1q83ft5
@r_devops
Got lucky with a Junior SRE role — how do I not waste it?

Honestly, I got lucky.


I recently moved from Helpdesk to a Junior SRE/DevOps role at a startup.

I have very little actual DevOps background, but I want to use this opportunity to build a serious career.

Since I'm the only SRE, I have full access to everything. I want to use this "sandbox" to fast-track to a solid level in 2 years. If you were me, how would you prioritize?

What paid off the most early on? (Terraform, CI/CD, networking, observability, etc.)
What real-world implementation taught you the most about how systems fit together?
Which tools/trends are noise early on?
How did you keep improving without burning out?

Note: I'm currently a CS student considering dropping out to focus 100% on this role. Is the practical experience worth more than the paper in the current market?

Thanks!

https://redd.it/1q85aco
@r_devops
Can I try DevOps, or am I missing something I should master first?

I need a professional opinion from someone in DevOps. I’ve had a turbulent and fragmented professional path, and I’d like to know if there’s anyone who can guide me and tell me from which point I should start over.

My story is a bit long:

I graduated in Computer Engineering, a 5-year program (2019–2023), with half of it (2020–2023) during the pandemic. That period came with difficulties in networking and a lack of hands-on practice due to the remote format via cellphone (I didn’t have enough income to buy my own equipment).

With a lot of difficulty, I managed to get 2 internships.

I interned at a construction company whose focus was industrial and residential automation. Naive as I was, all they actually taught me was how to request product quotations. I tried to learn by observing others, but it wasn't enough and had no real connection to computing.

Despite that, in 3 months I managed to save enough money to build my first PC, and then I spent 4 months applying for other internship positions until I got a support role.

The support position was at a small company with 12 employees, focused on assisting elderly people, and my supervisor was a systems analyst.

In this new internship, I studied NDG Linux Essentials, CCNA1, Python, computer assembly and maintenance, Windows Server (application and network management with Active Directory), Flask, JavaScript, Docker, Docker Compose, Git, GitHub, and Nginx.

My supervisor left, and I was hired by the company to work in IT, but officially under the role of administrative assistant. I accepted because I needed the money, but today I believe it was a mistake.

Being the only IT person, I was very busy managing and maintaining everything, without knowing if I was doing things the right way.

What was supposed to be 3 months while I looked for another job ended up becoming 2 years, and now, in 2026, I feel obsolete and out of the job market (I don’t even have a LinkedIn profile).

Today, I have about 90% of my time free because I automated all my tasks.

After researching a lot, I’m thinking about starting a DevOps journey, but I’d like to know if it makes sense to try DevOps without having a developer portfolio and without even knowing how to create a website beyond a basic Flask app or WordPress.

I have few certifications, and unfortunately, from engineering I only have the degree transcript, since the course itself went through all that turbulence.

At the moment, I'm a "do-everything" person who knows a bit of everything but isn't really good at anything. What should I do to build a solid foundation and a strong specialization?

https://redd.it/1q86v3m
@r_devops
Why incidents and failures matter more than perfect uptime

Over time, you encounter various challenges. Deployments fail, systems break, and some decisions don't work as expected. This is often how real experience is built.

When people are hired, the focus is usually on successful systems, uptime, and automation. Sometimes, though, you're asked about incidents, outages, or things that went wrong. And those moments often show real experience.

What kind of difficulties or mistakes did you face while working with production systems, and what did they teach you?

https://redd.it/1q88ybp
@r_devops
The SEO Ecosystem in 2026: Why Rankings Are Now Built, Not Chased

SEO in 2026 isn’t about chasing algorithms or isolated hacks anymore. It’s an interconnected ecosystem where multiple forces work together to determine search visibility and long-term performance. What you see on the surface, rankings and traffic, is the result of deeper signals operating in sync.

Search visibility today is shaped by AI-driven algorithms that constantly interpret user behavior and intent. Search engines are getting better at understanding why users search, not just what they type. That’s why search behavior analysis has become a core strategy, not an afterthought.

Content quality has also evolved. It’s no longer about volume or keywords, but about depth, clarity, topical authority, and usefulness across the entire journey. Pages that genuinely solve problems and demonstrate expertise naturally earn credibility and trust, reinforced by strong brand signals and authoritative backlinks.

Community input is another growing influence. Mentions, discussions, shared experiences, and real-world engagement help search engines validate relevance beyond the website itself. Supporting all of this are solid technical foundations that allow efficient crawling, indexing, and performance.

Finally, user signals act as continuous feedback loops. Engagement, satisfaction, and interaction confirm whether a page truly deserves its position. In 2026, SEO success comes from aligning all these elements into one cohesive strategy, built for sustainability, not shortcuts.

#SEO2026 #SEOEcosystem #FutureOfSearch #AIAndSEO #ContentQuality #SearchVisibility #TechnicalSEO #DigitalStrategy

https://redd.it/1q89kei
@r_devops
Is building a full centralized observability system (Prometheus + Grafana + Loki + network/DB/security monitoring) realistically a Junior-level task if done independently?

Hi r/devops,

I’m a recent grad (2025) with ~1.5 years equivalent experience (strong internship at a cloud provider + personal projects). My background:

• Deployed Prometheus + Grafana for monitoring 50+ nodes (reduced incident response ~20%)

• Set up ELK/Fluent Bit + Kibana alerting with webhooks

• Built K8s clusters (kubeadm), Docker pipelines, Terraform, Jenkins CI/CD

• Basic network troubleshooting from campus IT helpdesk

Now I’m trying to build a full centralized monitoring/observability system for a pharmaceutical company (traditional pharma enterprise, ~1,500–2,000 employees, multiple factories, strong distribution network, listed on stock exchange). The scope includes:

1. Metrics collection (CPU/RAM/disk/network I/O) via Prometheus exporters

2. Full logs centralization (syslog, Windows Event Log, auth.log, app logs) with Loki/Promtail or similar

3. Network device monitoring (switches/routers/firewalls: SNMP traps, bandwidth per interface, packet loss, top talkers – Cisco/Palo Alto/etc.)

4. Database monitoring (MySQL/PostgreSQL/SQL Server: IOPS, query time, blocking/deadlock, replication)

5. Application monitoring (.NET/Java: response time, heap/GC, threads)

6. Security/anomaly detection (failed logins, unauthorized access)

7. Real-time dashboards, alerting (threshold + trend-based, multi-channel: email/Slack/Telegram), RCA with timeline correlation

I’m confident I can handle the metrics part (Prometheus + exporters) and basic logs (Loki/ELK), but the rest (SNMP/NetFlow for network, DB-specific exporters with advanced alerting, security patterns, full integration/correlation) feels overwhelming for me right now.

My question for the community:

• On a scale of Junior/Mid/Senior/Staff, what level do you think this task requires to deliver independently at production quality (scalable, reliable alerting, cost-optimized, maintainable)?

• Is it realistic for a strong Junior+/early-Mid (2–3 years exp) to tackle this solo, or is it typically a Senior+ (4–7+ years) job with real production incident experience?

• What are the biggest pitfalls/trade-offs for beginners attempting this? (e.g., alert fatigue, storage costs for logs, wrong exporters)

• Recommended starting point/stack for someone like me? (e.g., begin with Prometheus + snmp_exporter + postgres_exporter + Loki, then expand)

I’d love honest opinions from people who’ve built similar systems (open-source or at work). Thanks in advance – really appreciate the community’s insights
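For the starting-point question, the first thing I plan to script on top of the starter stack is a sanity check that the exporters are actually reporting before building dashboards on them. A minimal sketch against the Prometheus HTTP API; the URL and job names are placeholders:

```python
"""Sketch: verify scrape targets are up before trusting any dashboards."""
import requests

PROM = "http://prometheus.internal:9090"
EXPECTED_JOBS = {"node", "snmp", "postgres", "loki"}   # your scrape job names

def down_targets():
    # instant query against the standard `up` metric
    r = requests.get(f"{PROM}/api/v1/query", params={"query": "up == 0"}, timeout=10)
    r.raise_for_status()
    return [m["metric"] for m in r.json()["data"]["result"]]

def missing_jobs():
    r = requests.get(f"{PROM}/api/v1/targets", timeout=10)
    r.raise_for_status()
    seen = {t["labels"]["job"] for t in r.json()["data"]["activeTargets"]}
    return EXPECTED_JOBS - seen

if __name__ == "__main__":
    print("down targets:", down_targets())
    print("missing jobs:", missing_jobs())
```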

https://redd.it/1q88ulk
@r_devops
SBOM generation for a .NET app in a container

I'm trying to create a reliable way to track the packages we use (for license and CVE issues). So far I'm using CycloneDX for our .NET apps and cyclonedx-npm for our React apps. This is working fine.

I'm now looking to make this work for a .NET app deployed via Docker, and I'm not sure how to proceed. Currently I'm generating two SBOMs:

1. CycloneDX for the .NET application code (captures NuGet packages with versions)
2. Syft for the container image (captures OS packages and other container dependencies)

My questions:

- Should I merge these SBOMs into one, or treat them as separate projects in Dependency-Track?
- Syft doesn't seem to capture NuGet package versions properly; if I only use Syft's SBOM, I'm missing important .NET dependency details
- Is there a better tool than Syft for .NET containers, or a way to make Syft scan the published app files properly?

What approach do you use for tracking both application dependencies AND container dependencies for .NET apps in Docker?
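For the merge option, a sketch of the simplest DIY version: combine the two CycloneDX JSON files into one, de-duplicating components by purl. File names are placeholders, and the CycloneDX CLI also ships a merge command that's worth trying before rolling your own:

```python
"""Sketch: merge app-level and image-level CycloneDX JSON SBOMs by purl."""
import json

def merge_sboms(app_path: str, image_path: str, out_path: str):
    with open(app_path) as f:
        merged = json.load(f)          # keep the app SBOM's metadata as the base
    with open(image_path) as f:
        image = json.load(f)

    seen = {c.get("purl") for c in merged.get("components", [])}
    for comp in image.get("components", []):
        # purl identifies a package across ecosystems (nuget, deb, apk, ...)
        if comp.get("purl") not in seen:
            merged.setdefault("components", []).append(comp)
            seen.add(comp.get("purl"))

    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)

merge_sboms("dotnet-app.cdx.json", "container-image.cdx.json", "merged.cdx.json")
```

The trade-off: one merged SBOM gives a single Dependency-Track project per deployable, while separate projects keep "app bug" and "rebuild base image" ownership cleanly split.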

https://redd.it/1q8erp9
@r_devops