Reddit DevOps – Telegram
Stay in a stable job or work for an AI company.

Hi,

I am working for a company in Berlin as a senior infrastructure engineer. The company is stable but does not pay well. I am working on impactful projects and working hard. I asked for a raise, but it seems I will not get a significant increase, maybe 5-8%.

Meanwhile, I am interviewing with an AI company, not EU-based. It got a 130M investment last year and wants to expand in EMEA.
They pay ~30% more than what I make at the moment.

Given the market, does it make sense to take the risk or stay in a stable job for a while until the market gets better?






https://redd.it/1pmn9hh
@r_devops
Anyone automating their i18n/localization workflow in CI/CD?

My team is building towards launching in new markets, and the manual translation process is becoming a real bottleneck. We've been exploring ways to integrate localization automation into our DevOps pipeline.

Our current setup involves manually extracting JSON strings, sending them out for translation, and then manually re-integrating them—it’s slow and error-prone. I've been looking at ways to make this a seamless part of our "develop → commit → deploy" flow.

One tool I came across and have started testing for this is the Lingo.dev CLI. It's an open-source, AI-powered toolkit designed to handle translation automation locally and fits into a CI/CD pipeline. Its core feature seems to be that you point it at your translation files, and it can automatically translate them using a specified LLM, outputting files in the correct structure.



The concept of integrating this into a pipeline looks powerful. For instance, you can configure a GitHub Action to run the lingo.dev i18n command on every push or pull request. It uses an i18n.lock file with content checksums to translate only changed text, which keeps costs down and speeds things up.
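For anyone picturing what that would look like, here's a hedged sketch of such a workflow. The trigger, job layout, Node setup, and the `OPENAI_API_KEY` secret name are my assumptions, not documented defaults; only the `lingo.dev i18n` command and the `i18n.lock` checksum behavior come from the tool's description above:

```yaml
# .github/workflows/i18n.yml - illustrative sketch, not official config
name: i18n
on:
  pull_request:

jobs:
  translate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Translate only changed strings
        # assumed invocation; consults i18n.lock checksums to skip
        # text that has not changed since the last run
        run: npx lingo.dev@latest i18n
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

A follow-up step could commit the regenerated translation files back to the PR branch so reviewers see them alongside the source change.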

I'm curious about the practical side from other DevOps/SRE folks:

When does automation make sense? Do you run translations on every PR, on merges to main, or as a scheduled job?

Handling the output: Do you commit the newly generated translation files directly back to the feature branch or PR? What does that review process look like?

Provider choice: The CLI seems to support both "bring your own key" (e.g., OpenAI, Anthropic) and a managed cloud option. Any strong opinions on managing API keys/credential rotation in CI vs. using a managed service?

Rollback & state: The checksum-based lock file seems crucial for idempotency. How do you handle scenarios where you need to roll back a batch of translations or audit what was changed?

Basically, I'm trying to figure out if this "set it and forget it" approach is viable or if it introduces more complexity than it solves. I'd love to hear about your real-world implementations, pitfalls, or any alternative tools in this space.

https://redd.it/1pmnax4
@r_devops
How to master

Amid mass layoffs and restructuring, I ended up on a DevOps team, coming from a backend engineering team.

It's been a couple of months. I am mostly doing pipeline support work, meaning application teams use our templates and infra, and we support them in all areas from onboarding to stability.

There are a ton of teams and their stacks are very different (hence the different templates). How do I get a grasp of all the pieces?

I know seeking help without giving a ton of info is hard, but I'd like to know: is there a framework I can follow to understand all the moving parts?

We are on GitLab and AWS. I appreciate your help.

https://redd.it/1pmsh7u
@r_devops
How long will Terraform last?

It's a Sunday thought, but here it is. I am basically 90% Terraform at my current job. Everything else is learning new tech stacks that I deploy with Terraform, or maybe a script or two in Bash or PowerShell.

My Sunday night thought is: what will replace Terraform? I really like it. I hated Bicep. No state file, and you can't expand outside the Azure ecosystem.

Pulumi is too developer-oriented and I'm an infra guy. I guess if it gets to the point where developers can fully grasp infra, they could take over via Pulumi.

That's about as far as I can think.

https://redd.it/1pmzitq
@r_devops
How do you convince leadership to stop putting every workload into Kubernetes?

Looking for advice from people who have dealt with this in real life.

One of the clients I work with has multiple internal business applications running on Azure. These apps interact with on-prem data, Databricks, SQL Server, Postgres, etc. The workloads are data-heavy, not user-heavy. Total users across all apps is around 1,000, all internal.

A year ago, everything was decoupled. Different teams owned their own apps, infra choices, and deployment patterns. Then a platform manager pushed a big initiative to centralize everything into a small number of AKS clusters in the name of better management, cost reduction, and modernization.

Fast forward to today, and it’s a mess. Non-prod environments are full of unused resources, costs are creeping up, and dev teams are increasingly reckless because AKS is treated as an infinite sink.

What I’m seeing is this: a handful of platform engineers actually understand AKS well, but most developers do not. That gap is leading to:
1. Deployment bottlenecks and slowdowns due to Helm, Docker, and AKS complexity
2. Zero guardrails on AKS usage, where even tiny Python scripts are deployed as cron jobs in Kubernetes
3. Batch jobs, experiments, long-running services, and one-off scripts all dumped into the same clusters
4. Overprovisioned node pools and forgotten workloads in non-prod running 24x7
5. Platform teams turning into a support desk instead of building a better platform

At this point, AKS has become the default answer to every problem. Need to run a script? AKS. One-time job? AKS. Lightweight data processing? AKS. No real discussion of whether Functions, ADF, Databricks jobs, VMs, or even simple schedulers would be more appropriate.

My question to the community: how have you successfully convinced leadership or clients to stop over-engineering everything and treating Kubernetes as the only solution? What arguments, data points, or governance models actually worked for you?


https://redd.it/1pn0h49
@r_devops
Advice Needed for Following DevOps Path

Ladies and gentlemen, I am grateful in advance for your support and assistance.
I need advice about my DevOps path. I am self-taught and have used Linux since 2008; I love Linux so much that I went on to study DevOps by doing. I used AI tools to create real-world scenarios for DevOps + RHCSA + RHCE, and I uploaded them to GitHub in 3 repos (2 projects). I know getting stuck is part of the path, especially in DevOps, and I know I am not good at asking for help - I struggle with how to ask for help and where to ask.

I would appreciate it if anyone could check my projects and repos and give me an overview of the work: is it good, so I can continue on this path, or is it not, and I had better look for another career?

Project 1 (first 2 repos - Linux, Automation) is finished. Project 2 (last repo - High Availability) is still incomplete, at Milestone 0. I have been struggling for a long time with how to connect to the private instances from the public instances. I am using AWS and have tried a lot, from plain SSH to the AWS SSM plugin, and I still can't do it.
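On the SSH side specifically, the usual pattern for reaching a private instance is jumping through the public one with ProxyJump. A minimal sketch of an `~/.ssh/config` - every hostname, IP, user, and key path below is a placeholder assumption, not taken from the repos:

```
# ~/.ssh/config - reach a private instance via the public one
# (all names, IPs, users, and key paths are placeholders)
Host bastion
    HostName 203.0.113.10           # public IP of the public instance
    User ec2-user
    IdentityFile ~/.ssh/lab-key.pem

Host private-node
    HostName 10.0.2.15              # private IP of the private instance
    User ec2-user
    IdentityFile ~/.ssh/lab-key.pem
    ProxyJump bastion               # tunnel through the public instance
```

With this in place, `ssh private-node` connects in one step. The private instance's security group must also allow SSH from the public instance's subnet or security group, which is a common thing to miss.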

In summary, I want advice to decide whether to carry on with DevOps or not.

Links:

Project 01 ( Repo 01 + Repo 02 ) | RHCSA & RHCE Path

01 - **enterprise-linux-basics-Prjct\_01**

02 - **linux-automation-infrastructure-Prjct\_02**

Project 02 ( Repo 03 ) | High Availability

03 - **linux-high-availability-Prjct\_03**

https://redd.it/1pn2bm0
@r_devops
How do you know which feature changed to determine which script to run in the CI/CD pipeline?

Hi,

I think I have set up almost everything and have just this issue left. Currently the repo contains a lot of features. When someone enhances one feature and creates a PR, do you run the tests for all the features?

Let's say I have 2 scripts: script/register_model_a and script/register_model_b. These registration scripts create a new version, run evaluation, and log to MLflow.

But I don't know what the best practice is for this case. Would you define a folder for each module and detect which folder the changed files are in, to decide which feature is being enhanced? Or just run all the tests?
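For what it's worth, the folder-per-feature approach can be sketched in a few lines of shell. The folder names (`model_a/`, `model_b/`) are illustrative assumptions; the register script paths follow the post. In a real PR pipeline you would feed it the output of `git diff --name-only origin/main...HEAD`:

```shell
#!/bin/sh
# Map changed file paths to the feature scripts that need re-running.
# Takes the changed-file list as an argument so the logic is testable;
# CI would pass in `git diff --name-only origin/main...HEAD`.
scripts_to_run() {
  changed="$1"
  out=""
  if printf '%s\n' "$changed" | grep -q '^model_a/'; then
    out="$out script/register_model_a"
  fi
  if printf '%s\n' "$changed" | grep -q '^model_b/'; then
    out="$out script/register_model_b"
  fi
  # unquoted on purpose: word splitting collapses the leading space
  echo $out
}
```

A common compromise is to use this selection on PRs for fast feedback, and still run the full test suite on merges to main as a safety net.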



Thank you!

https://redd.it/1pn3g69
@r_devops
DevOps tech knowledge for a job application (Agile Coach) (GitLab, CI/CD, Docker, Ansible) - how to get into it?

Hi folks,

any suggestions on how to get into the topic?
A job offer for an agile coach requires these skills, just for context.
Apart from having downloaded stuff from GitHub before, I'm pretty much a newbie in this field.
How do I get started, and what are good tutorials and sources? What do I even need to know for such a position?

Thanks a lot!

https://redd.it/1pn3vwo
@r_devops
How do you keep storage management simple as infrastructure scales?

I am working on a setup where data volume and infrastructure will grow steadily over time. What starts as a simple storage layer can quickly turn into something that needs constant attention if it is not designed carefully.

For those managing larger or growing environments, how do you keep storage from becoming an operational burden? Do you rely on automation, strict conventions, or regular cleanup and review processes?

I am interested in approaches that reduce day to day overhead while keeping systems reliable.

https://redd.it/1pn5uk0
@r_devops
Single Machine Availability: is it really a problem?

*Discussing Virtual Private Servers for simple systems :)*

Virtual Private Server (VPS) is not really a single physical machine - it is a single logical machine, with many levels of redundancy, both hardware and software, implemented by cloud providers to deliver High Availability. Most cloud providers have at least 99.9% availability, stated in their service-level agreements (SLAs), and some - DigitalOcean and AWS for example - offer 99.99% availability. This comes down to:

24 * 60 = 1440 minutes in a day
30 * 1440 = 43 200 minutes in a month
60 * 1440 = 86 400 seconds in a day

99.9% availability:
86 400 - 86 400 * 0.999 = 86.4 seconds of downtime per day
43 200 - 43 200 * 0.999 = 43.2 minutes of downtime per month

99.99% availability:
86 400 - 86 400 * 0.9999 = 8.64 seconds of downtime per day
43 200 - 43 200 * 0.9999 = 4.32 minutes of downtime per month

Depending on the chosen cloud provider, this is availability we can expect from the simplest possible system, running on a single virtual server. What if that is not enough for us? Or maybe we simply do not trust these claims and want to have more redundancy, but still enjoy the benefits of a Single Machine System Simplicity? Can it be improved upon?

First, let's consider short periods of unavailability - up to a few seconds. These will most likely be the most frequent ones and fortunately, the easiest to fix. If our VPS is not available for just 1 to 5 seconds, it might be handled purely on the client side by having retries - retrying every request up to a few seconds, if the server is not available. For the user, certain operations will just be slower - because of possible, short server unavailability - but they will succeed eventually, unless the issue is more severe and the server is down for longer.
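The retry idea described above can be sketched in a few lines of shell. The `retry` helper and its arguments are illustrative, not from any particular tool:

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY
# seconds between attempts; succeed as soon as CMD does.
retry() {
  max=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      return 1                    # still failing after MAX attempts
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

A client could wrap its HTTP call as, say, `retry 5 1 curl -fsS https://example.com/api`, turning a few seconds of server downtime into slightly slower requests rather than errors - which is exactly the trade described above.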

Before considering possible solutions for this longer case, it is worth pausing and asking - *maybe that is enough?* Let's remember that with *99.9% and 99.99% availability* we expect at most *86.4 or 8.64 seconds* of downtime per day.

Most likely, these interruptions will be spread throughout the day, so simple retries can handle most of them without users even noticing. Let's also remember that Complexity is often the Enemy of Reliability, and that our system is as reliable as its weakest link. If we really want additional redundancy and the ability to deal with potentially longer periods of unavailability, there are at least two ways of going about it - but maybe they are not worth the Complexity they introduce?

I would then argue that in most cases, *99.9% - 99.99% availability* delivered by the cloud provider + simple client retry strategy, handling most short interruptions, is good enough. Should we want/need more, there are tools and strategies to still reap the benefits of a Single Machine System Simplicity while having ultra high redundancy and availability - at the cost of additional Complexity.

*I write deeper and broader pieces on topics like this on my blog. Thanks for reading!*

https://redd.it/1pn77hn
@r_devops
debugging CI failures with AI? this model says it’s trained only for that

my usual workflow:

push code

get some CI error

spend 2 hrs reading logs to figure out what broke

fix something stupid

then i saw this paper on a model called chronos-1 that’s trained only on debugging workflows ... stack traces, ci logs, test errors, etc. no autocomplete. no hallucination. just bug hunting. claiming 80.3% accuracy on SWE-bench Lite (GPT-4 gets 13.8%).

paper: https://arxiv.org/abs/2507.12482

anyone think this could actually be integrated into CI pipelines? or is that wishful thinking?



https://redd.it/1pn8lqm
@r_devops
New to software testing

Hi everyone 👋

I’m pretty new to software testing and trying to learn from the community - asking questions, reading discussions, and understanding best practices.

There are a lot of platforms out there, and I’m not sure where beginners actually get good feedback and meaningful discussions (not just noise).

Please use the poll below - I'd really appreciate your advice 🙏

Where do you think a beginner in testing/dev should engage with the community?


https://redd.it/1pn97oh
@r_devops
What's working to automate the code review process in your ci/cd pipeline?

Trying to add automated code review to our pipeline but running into issues. We use GitHub Actions for everything else and want to keep it there instead of adding another tool.

Our current setup is pretty basic: lint, unit tests, security scan with Snyk. All good, but they don't catch logic issues or code quality problems, so our seniors still have to manually review everything, which takes forever.

I've looked into a few options, but most seem to either be too expensive for what they do or require a ton of setup. We need something that just works with minimal config; we don't have time to babysit another tool.

What's actually working for people in production? Bonus points if it integrates nicely with GitHub Actions and doesn't slow down our builds - they already take 8 minutes, which is too long.

https://redd.it/1pndos3
@r_devops
Tutorial: From ONNX Model to K8s: Building a Scalable ML Inference Service with FastAPI, Docker, and Kind

Hey r/devops,

I recently put together a full guide on building a production-grade ML inference API and deploying it to a local Kubernetes cluster. The goal was simplicity and high performance, leading us to use FastAPI + ONNX.

Here's the quick rundown of the stack and architecture:

# The Stack:

Model: ONNX format (for speed)
API: FastAPI (asynchronous, excellent performance)
Container: Docker
Orchestration: Kubernetes (local cluster via Kind)

# Key Deployment Details:

1. Kind Setup: Instead of spinning up an expensive cloud cluster for dev/test, we used kind create cluster. We then loaded the Docker image directly into the Kind cluster nodes.
2. Deployment YAML: Defined 2 replicas initially, crucial resource requests (e.g., cpu: "250m") and limits to prevent noisy neighbors and manage scheduling.
3. Probes: The Deployment relied on:
Liveness Probe on `/health`: Restarts the pod if the service hangs.
Readiness Probe on `/health`: Ensures the Pod has loaded the ONNX model and is ready before receiving traffic.
4. Auto-Scaling: We installed the Metrics Server and configured an HPA to keep the target CPU utilization at 50%. During stress testing, Kubernetes immediately scaled from 2 to 5 replicas. This is the real MLOps value.
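For reference, the HPA described in step 4 would look roughly like the manifest below. The Deployment name is an assumption; the 50% CPU target and the 2-5 replica range match the numbers above:

```yaml
# hpa.yaml - illustrative sketch matching the setup described above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api          # assumed Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 2               # the initial replica count
  maxReplicas: 5               # ceiling observed during stress testing
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # scale when avg CPU exceeds 50%
```

Note that this requires the Metrics Server (as installed above), since the HPA reads pod CPU usage from the metrics API.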

If you're dealing with slow inference APIs or inconsistent scaling, give this FastAPI/K8s setup a look. It dramatically simplifies the path to scalable production ML.

Happy to answer any questions about the config or the code!

https://redd.it/1pnfgxn
@r_devops
OpsOrch | Unified Ops Platform

Hi all, I built OpsOrch, an open-source orchestration layer that provides one unified API across incidents, logs, metrics, tickets, messaging, and service metadata.

It sits on top of the tools most DevOps and SRE teams already run, such as PagerDuty, Jira, Prometheus, Elasticsearch, Datadog and Slack, and normalizes them into a single schema instead of trying to replace them.

OpsOrch does not store operational data. It brokers requests through pluggable adapters, either in-process Go providers or JSON-RPC plugins, and returns unified structures. There is also an optional MCP server that exposes everything as typed tools for agent and automation use.

You can find the project overview on opsorch.com and the documentation on opsorch.com/docs.

# Why I built this

During incidents, most workflows still require hopping between paging, tickets, metrics, logs, and chat systems.

PagerDuty or Opsgenie for paging and incidents.
Jira or GitHub for tickets.
Prometheus or Datadog for metrics.
Elasticsearch, Loki, or Splunk for logs.
Slack or Teams for coordination.

Each system has its own auth model, schemas, and query semantics. OpsOrch aims to be a small, transparent glue layer that lets you reason across all of them without migrating data or buying a black-box “single pane of glass”.

# What’s available today

The core orchestration service is written in Go and licensed under Apache-2.0.

There are adapters available for PagerDuty, Jira, Prometheus, Elasticsearch, Slack, and mock providers for local testing, all maintained under the OpsOrch GitHub organization.

An MCP server exposes incidents, logs, metrics, tickets, and services as agent tools.

There is no vendor lock-in and no data gravity. OpsOrch does not become your system of record.

# Looking for feedback from DevOps and SREs on

The architecture, particularly the stateless core plus adapter model.
The plugin approach, in-process vs JSON-RPC.
Security and governance concerns.
Which integrations would make this immediately useful in real incident response.

Happy to answer questions or take criticism. This is built with real incident workflows in mind.

https://redd.it/1pngwsm
@r_devops
CKS Exam Re-try (second chance) in 2025

Hey guys, I'm retaking the CKS exam in the next 2 days.
Do you have any experience with a second attempt? Did you see questions repeated from your first try?




https://redd.it/1pniah1
@r_devops
CDKTF repository forks

There are some active discussions in the https://cdk.dev/ Slack channel #terraform-cdk about building community-driven forks of the existing Hashicorp/IBM CDKTF repositories. A number of developers who work at organizations that are heavily reliant on CDKTF have offered to pitch in.

There is currently a live proof of concept fork of the main cdktf repository that one developer made: https://github.com/TerraConstructs/terraform-cdk

And one OpenTofu developer said he and some other OpenTofu developers would be happy to collaborate with that community-driven effort to keep CDKTF alive:

>The OpenTofu maintainers are happy to collaborate with that project once it's up and running, but we will not be directly involved.

https://redd.it/1pnkcqb
@r_devops
Why did we name virtual switches "bridges"?

Title says it all. A bridge is a virtual switch: you plug virtual Ethernet cables in on both ends. So why did we name it a bridge and not a vSwitch?

https://redd.it/1pnjvke
@r_devops
Book Recommendations

Hello all,

As someone on a learning journey, I was curious whether you have any recommendations for books around DevOps that you wish other engineers or teammates would read?

I have read: The Phoenix Project, The Unicorn Project, and Production-Ready Microservices.

https://redd.it/1pnmf1k
@r_devops
My Raspberry pi pi3d Project

Hey, I am Warthog. I am part of the technolab team. We developed an app that helps prepare images for a particular Raspberry Pi pi3d picture frame, all under one platform.

Our app's name is MetaPi, currently on the Play Store.


WHAT does MetaPi do?
It edits, crops, and sends images according to your pi3d picture frame. No more using 3-4 different apps to do the same thing.

Key features?
It provides smooth viewing and editing of image metadata for free, unlike other apps where you have to pay to see and edit the metadata of your images. In MetaPi you can view, categorize, and edit the metadata for your images however you like.

Moreover, you can filter out metadata tags, crop at any resolution, change the location inside the metadata in real time, and share free of cost via Drive, iCloud, and other platforms, through which your Raspberry Pi can read the prepared images for your own picture frame.



https://redd.it/1pno4k6
@r_devops