Reddit DevOps – Telegram
Every team wants "MLOps", until they face the brutal truth of DevOps under the hood

I’ve lost count of how many early-stage teams build killer ML models locally then slap them into production thinking a simple API can scale to millions of clients... until the first outage hits, costs skyrocket or drift turns the model to garbage.

And they assign it to a solo dev or junior engineer as a "side task".

Meanwhile:

No one budgets for proper tooling like registries or observability.

Scaling? "We'll Kubernetes it later".

Monitoring? Ignored until clients churn from slow responses.

Model updates? Good luck versioning without a registry - one bad push and you're rolling back at 3AM.

MLOps is DevOps fundamentals applied to ML: CI/CD, IaC, autoscaling, and relentless monitoring.



I put together a hands-on video demo: Building a scalable ML API with FastAPI, MLflow registry, Kubernetes and Prometheus/Grafana monitoring. From live coding to chaos tested prod, including pod failures and load spikes. Hope it saves you some headaches.

https://youtu.be/jZ5BPaB3RrU?si=aKjVM0Fv1DTrg4Wg

https://redd.it/1qz7e1r
@r_devops
I wrote a script to automate setting up a fresh Mac for Development & DevOps (Intel + Apple Silicon)

Hey everyone,

I recently reformatted my machine and realized how tedious it is to manually install Homebrew, configure Zsh, set up git aliases, and download all the necessary SDKs (Node, Go, Python, etc.) one by one.

To solve this, I built `mac-dev-setup` – a shell script that automates the entire process of bootstrapping a macOS environment for software engineering and DevOps.

**Repo:** [https://github.com/itxDeeni/mac-dev-setup](https://github.com/itxDeeni/mac-dev-setup)

**Why I built this:** I switch between an older Intel MacBook Pro and newer M-series Macs. I needed a single script that was smart enough to detect the architecture and set paths correctly (`/usr/local` vs `/opt/homebrew`) without breaking things.

**Key Features:**

* **Auto-Architecture Detection:** Automatically adjusts for Intel (x86) or Apple Silicon (ARM) so you don't have to fiddle with paths.
* **Idempotent:** You can run it multiple times to update your tools without duplicating configs or breaking existing setups.
* **Modular Flags:**
    * `--minimal`: Just the essentials (Git, Zsh, Homebrew).
    * `--skip-databases`: Prevents installing heavy background services like Postgres/MySQL if you prefer using Docker for that (saves RAM on older machines!).
    * `--skip-cloud`: Skips AWS/GCP/Azure CLIs if you don't need them.
* **DevOps Ready:** Includes Terraform, Kubernetes tools (kubectl, k9s), Docker, and Ansible out of the box.
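
As a rough illustration of the architecture detection described above, here is a minimal Python sketch. It is not taken from `mac-dev-setup` (which is a shell script); the helper name `homebrew_prefix` and the lookup table are assumptions made for this example. The two prefixes are the `/usr/local` and `/opt/homebrew` paths mentioned in the post.

```python
import platform

# Hypothetical lookup table for illustration; not part of mac-dev-setup.
HOMEBREW_PREFIXES = {
    "arm64": "/opt/homebrew",  # Apple Silicon
    "x86_64": "/usr/local",    # Intel
}


def homebrew_prefix(machine=None):
    """Return the default Homebrew prefix for a CPU architecture string.

    If no architecture is given, fall back to what the running
    interpreter reports via platform.machine().
    """
    machine = machine or platform.machine()
    try:
        return HOMEBREW_PREFIXES[machine]
    except KeyError:
        # Fail loudly instead of guessing a path on an unknown platform.
        raise ValueError(f"unsupported architecture: {machine}")
```

Calling `homebrew_prefix()` with no argument uses the current machine, so the same code path could serve both an Intel MacBook Pro and an M-series Mac.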

**What it installs (by default):**

* **Core:** Homebrew, Git, Zsh (with Oh My Zsh & plugins).
* **Languages:** Node.js (via nvm), Python, Go, Rust.
* **Modern CLI Tools:** `bat`, `ripgrep`, `fzf`, `jq`, `htop`.
* **Apps:** VS Code, iTerm2, Docker, Postman.

**How to use it:** You can clone the repo and inspect the code (always recommended!), or run the one-liner in the README.

    git clone https://github.com/itxDeeni/mac-dev-setup.git
    cd mac-dev-setup
    ./setup.sh


I’m looking for feedback or pull requests if anyone has specific tools they think should be added to the core list.

Hope this saves someone a few hours of setup time!

Cheers,

itzdeeni

https://redd.it/1qz5itd
@r_devops
Best way to get started?

I've been wanting to start learning DevOps, but I don't know where to start.

My background is in IT. I've worked for the last 5 years as a data center technician, mostly installing servers, and I have experience with fiber optics.

I also did a CCNA course about two years ago (I don't know if it's relevant).

If any more information is needed, please ask below and I'll add it.

Thanks in advance! :)

https://redd.it/1qza05a
@r_devops
Simple Terraform module for multi-service AWS ECS (Fargate/EC2)

Hi everyone,

I've been working on a Terraform module to simplify deploying containerized apps on AWS ECS. I wanted something that handles the boilerplate for VPC, Load Balancers, and ECS while keeping the interface clean for multiple services.

Repo: [https://github.com/NazarSenchuk/terraform-aws-ecs](https://github.com/NazarSenchuk/terraform-aws-ecs)

Main things it handles:

* Dynamic VPC setup (public/private subnets, NAT, etc).
* Single variable switch between Fargate with spot and EC2.
* Support for all types of deployments and Service Connect.
* Multi-service management in one block.

Example:

    module "ecs_cluster" {
      source  = "NazarSenchuk/awsecs/aws"
      version = "1.0.0"

      general = {
        environment = "prod"
        project     = "my-app"
        region      = "us-east-1"
      }

      infrastructure = { type = "FARGATE" }

      services = {
        web = {
          name          = "web-service"
          img           = "nginx:latest"
          desired_count = 2
          alb_path      = "/*"
          deploy = {
            enabled  = true
            strategy = "ROLLING"
          }
        }
      }
    }

Registry link: [here](https://registry.terraform.io/modules/NazarSenchuk/awsecs/aws)
More examples: [here](https://github.com/NazarSenchuk/terraform-aws-ecs/blob/main/docs/examples.md)

Would appreciate any feedback on the structure, or any suggestions for additional parameters I should add.

Thanks.

https://redd.it/1qzbnx3
@r_devops
Release service-bus-tui v1.0.0-alpha

Hey everyone,

I’m working on a small tool for exploring Azure Service Bus entities and messages directly from the terminal. There’s still a lot of work to do, but you can already browse messages from topics/subscriptions and queues.

GitHub: https://github.com/MonsieurTib/service-bus-tui

https://redd.it/1qzbnl5
@r_devops
Need advice: am I overthinking or is our message queue setup really so insecure?

I'm pretty new to this team (3 months in) and noticed something that seems off but nobody's mentioned it so maybe I'm missing context.

We're running a multi-tenant SaaS and use message queues to pass events between services. The queue itself has no authentication or authorization configured. For example, tenant A could technically subscribe to tenant B's topics if they knew the topic names.
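
To make the gap concrete, this is roughly the kind of per-tenant check that is missing. It is only a sketch: the `tenant.<id>.<event>` topic naming scheme and the `can_subscribe` helper are assumptions made for illustration, not how the actual setup names topics, and a real deployment would enforce this in the broker's own authentication/ACL layer rather than in application code.

```python
def can_subscribe(tenant_id, topic):
    """Allow a tenant to subscribe only to topics under its own prefix.

    Assumes (hypothetically) that topics are named
    'tenant.<tenant_id>.<event>'. The trailing dot in the prefix keeps
    tenant '1' from matching topics that belong to tenant '10'.
    """
    prefix = f"tenant.{tenant_id}."
    return topic.startswith(prefix)
```

With no check like this anywhere, knowing a topic name is the only thing standing between one tenant and another tenant's events.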

When I asked about it my senior said "it's fine, everything's on a private network" but that doesn't feel like enough? Isn't that basically security through obscurity?

Am I being paranoid or should I push back on this? Don't want to be that junior who questions everything but also this seems like a pretty big issue.

https://redd.it/1qzeubb
@r_devops
Coming from a Kubernetes-heavy SRE background and moving into AWS/ECS ops – could use some perspective

Hey all, looking for some perspective from people who’ve been around this longer than me.

I’ve been working as an SRE for just under three years now, and almost all of that time has been in Kubernetes-based environments. I spent most of my days dealing with production issues, on-call rotations, scaling problems, deployments that went sideways, and generally keeping clusters alive. Observability was a big part of my work too, Prometheus, Grafana, ELK, Datadog, some Jaeger tracing. Basically living inside k8s and the tooling around it.

I’m now interviewing for a role that’s a lot more AWS-ops heavy, and honestly it feels like a bit of a mental shift. They don’t run Kubernetes at all. Everything is ECS on AWS, and the role is much more focused on things like cost optimization, release and change management, versioning, and day-to-day production issues at the AWS service level. None of that sounds crazy to me in theory, but I can feel where my experience is thinner when it comes to AWS-native workflows, especially around ECS and FinOps.

I’m not trying to pretend I’m an AWS expert. I know how to think about capacity, failures, rollbacks, and noisy systems, but now I’m trying to translate that into how AWS actually does things. Stuff like how people really manage releases in ECS, where AWS costs usually get out of hand in real environments, and what ops teams actually look at first when something breaks in production outside of Kubernetes.

If you’ve moved from a Kubernetes-heavy setup into more traditional AWS or ECS-based ops work, I’d really like to hear how that transition went for you. What did you wish you understood earlier? What mattered way more than you expected? And what things did you overthink that turned out not to be that important?

Just trying to level myself up properly and not walk into this role blind. Appreciate any advice.

https://redd.it/1qzhbcr
@r_devops
Vouch: earn the right to submit a pull request (from Mitchell Hashimoto)

Mitchell Hashimoto got tired of watching open-source maintainers drown in AI-generated pull requests. So he built Vouch, a contributor trust management system. The concept is almost absurdly simple: before you can submit a PR to a project using Vouch, someone already trusted has to vouch for you.

The whole thing lives in a single text file inside the repo. One username per line. A minus sign means denounced. You can parse it with grep.
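
As a rough sketch of how little parsing that takes, here is a Python version based only on the description above (one username per line, a leading minus sign for denounced). The function names, and the handling of blank lines and `#` comments, are assumptions for illustration; this is not Vouch's actual code and the real file format may differ in its details.

```python
def parse_trust_list(text):
    """Split a one-username-per-line trust list into vouched and denounced sets."""
    vouched, denounced = set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        if line.startswith("-"):
            # A leading minus sign marks a denounced user.
            denounced.add(line[1:].strip())
        else:
            vouched.add(line)
    return vouched, denounced


def may_submit_pr(username, text):
    """A user may submit a PR only if vouched for and not denounced."""
    vouched, denounced = parse_trust_list(text)
    return username in vouched and username not in denounced
```

In this sketch a denouncement wins over a vouch if a name somehow appears both ways, which seems like the safer default.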

Sigstore verifies artifacts. SLSA verifies builds. Dependabot checks dependencies. None of them answer the question of whether a given person should be contributing to a project at all. That's the gap Vouch fills: contributor trust, not artifact trust.

Hashimoto designed it the same way he designed Terraform. Declarative. Human-readable. Version-controlled. Instead of .tf files for infrastructure, you get .td files for trust. Same brain, different domain.

The xz-utils backdoor is the elephant in the room. "Jia Tan" spent two years earning trust through legitimate contributions before planting a CVSS 10.0 backdoor. Vouch wouldn't have stopped that attack. But the vouch record (who vouched for them, and when) would've been visible in the git history, and the denouncement would propagate to every project subscribing to that vouch list. Less of a lock, more of a security camera.

Ghostty is already integrating it. The repo picked up 600 stars in three days. A GitHub staff member commented on the HN thread saying they'd ship changes "next week."

The concerns are real though. Gatekeeping is the obvious one. Open source is supposed to be open, and Vouch creates an explicit barrier where there wasn't one before. One HN commenter called it "social credit on GitHub." The persona gaming problem hasn't gone away either; someone could still spend months building trust before going rogue.

Hashimoto himself flags it as experimental. But it's the first serious attempt at making contributor trust visible and version-controlled.

I wrote up the full breakdown, including how Vouch compares to PGP's web of trust, Advogato, and Debian's maintainer process, here if you want the deep dive.

https://redd.it/1qzgoao
@r_devops
State of OpenTofu?

Has OpenTofu gained anything on Terraform? Has it proven itself as an alternative?


I unfortunately don't use IaC in my current deployment but I'm curious how the landscape has changed.



https://redd.it/1qz67sq
@r_devops
Need advice: trying to document an installation guide for production

Hey guys, I recently open-sourced a pretty huge self-hosted project. I've set up a docker-compose.yaml that worked fine for local deployments, but I suppose I made a lot of rookie mistakes for a production deployment guide.

I don't have much experience in DevOps except for small services and deploying websites with nginx+letsencrypt, and when people started coming to me for advice on why their setup failed, I was a bit overwhelmed.

For the last three evenings I've been trying to come to a default installation guide for a reverse proxy that would work fine for production.

So, the current setup is pretty standard:

- docker-compose.yaml with setup on localhost by default
- pretty much a default Go backend container
- frontend container that builds the frontend with baked in nginx that serves the static files on / and sets up a localhost reverse proxy on /api

 

My initial prod setup directed people to build the images manually and to edit the frontend/nginx.conf.template that the frontend container uses, so that people change their servername/adjust their IP address and so on.

Well, after debugging a couple environment-specific problems that people faced trying to deploy it this way, I realized that I need to adjust the guide ASAP.

At first, I thought that I needed to remove the baked-in nginx from the frontend container and move it up to `docker-compose.yaml`, but then I've read a suggestion on the internet that I can just put another reverse proxy in front of the frontend-internal nginx one.

 

So, my current thinking process is:

1. adjust nginx.conf.template to accept the DOMAIN and BACKEND PORT, so that they're provided by docker-compose, not changed by the user (or should the baked in nginx.conf be left untouched, without accepting those env vars, staying localhost-only?)
2. add a new container in docker-compose for prod setups - caddy with a reverse proxy in front (maybe as an override file)

Also, is it fine to mix caddy and nginx this way? or am I better off overhauling the setup entirely? If so, what's the best course of action for me?

In case someone wants to take a look: https://github.com/Vsein/Neohabit (the setup files are docker-compose.yaml, .env.example, frontend/nginx.conf.template; all of them are mentioned in the installation guide "building manually from source")

And here's what I've been trying to do: https://github.com/Vsein/Neohabit/pull/110

Anyway, sorry if this post is amateurish, I just genuinely feel like I'm wasting my time trying to do something that might be a wrong direction entirely.

https://redd.it/1qzmth6
@r_devops
How do devs secure their notebooks?

Hi guys,
How do devs typically secure/monitor the hygiene of their notebooks?
I scanned about 5000 random notebooks on GitHub and ended up finding almost 30 aws/oai/hf/google keys (frankly, they were inactive, but still).
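
For anyone curious what such a scan can look like, here is a minimal sketch, not the exact tooling used for the numbers above. It only checks for the well-known AWS access key ID shape (`AKIA` followed by 16 uppercase letters or digits) in cell sources of an nbformat v4 notebook; the function name is made up for this example, and cell outputs and other key types are deliberately left out.

```python
import json
import re

# AWS access key IDs for long-term credentials start with "AKIA"
# followed by 16 uppercase alphanumeric characters.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_key_ids(notebook_json):
    """Return (cell_index, key_id) pairs found in a notebook's cell sources.

    Hypothetical helper: scans only the 'source' of each cell, not outputs.
    """
    notebook = json.loads(notebook_json)
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        # nbformat stores source either as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for match in AWS_ACCESS_KEY_ID.finditer(source):
            hits.append((index, match.group(0)))
    return hits
```

Something like this could run as a pre-commit hook, though a dedicated secret scanner will cover far more key formats.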



https://redd.it/1qzn7f2
@r_devops
Priority Dilemma: Academic GPA vs. Personal Projects in DevOps

Hi everyone,

I’m a first-year Computer Science student, and I’m currently facing a dilemma that I’d love to get your take on (especially from the recruiters and hiring managers here).

On one hand, a high GPA is often seen as a critical resource and a primary screening tool for many companies.

On the other hand, I feel that the DevOps world is highly practical. A project that demonstrates a complete end-to-end pipeline (using tools like GitHub Actions, AWS, Docker, K8s, Terraform, Ansible, etc.) shows hands-on toolchain knowledge and real-world application, qualities that are hard to measure through a GPA alone.

I’d like to ask about your priorities:

1. When screening for a Junior or Student position, what would make you stop and look at my CV: a 90 GPA with no projects, or an 80 GPA with a portfolio that demonstrates a deep understanding of CI/CD and IaC?

2. Do you have any tips on how to properly present such projects on a CV or in an interview to effectively reflect architectural understanding?

Thanks in advance for your insights! 🙏

https://redd.it/1qzoupy
@r_devops
What should I prepare / learn in detail before a DevOps / Cloud Engineer internship? (GitLab, Terraform, AWS)

Hi everyone,

I have a **DevOps / Cloud Engineer internship** coming up (about **4–5 months long**), and the main tools used are **GitLab, Terraform, and AWS**.

For context, I already have:

* **AWS Solutions Architect Associate**
* **Terraform Associate**
* **CKA (In progress)**

So I’m familiar with the **concepts and theory**, but I don’t have much **real hands-on / production-style experience yet**, which I’d like to work on before the internship starts.

I’d really appreciate advice from people in DevOps / cloud roles on:

* What **hands-on skills** I should focus on with:
* **GitLab** (CI/CD pipelines, runners, YAML, etc.)
* **Terraform** (state management, modules, best practices?)
* **AWS** (which services matter most at intern level?)
* Any **common gaps interns usually have**, even with certs
* Things you wish you had practiced *before* your first DevOps / cloud role

I’m not trying to master everything, just want to be **useful quickly and not completely lost** on day one 😅

Any advice, learning priorities, or “focus on this, ignore that” tips would be really appreciated. Thanks!

https://redd.it/1qzrs6y
@r_devops
What decides where to run the build: on Git runners or on cloud build machines? Which is better in the long run if you may have multiple clouds?

We're currently using AWS CI/CD, but the new DevOps guy is using Git runners.

No idea what the right strategy is.


https://redd.it/1qzw544
@r_devops
[Weekly/temp] Built a tool? New idea? Seeking feedback? Share in this thread.

This is a weekly thread for sharing new tools, side projects, github repositories and early stage ideas like micro-SaaS or MVPs.

What type of content may be suitable:

* new tools solving something you have been doing manually all this time
* something you have put together over the weekend and want to ask for feedback
* "I built X..."

etc.

If you have built something like this and want to show it, please post it here.

Individual posts of this type may be removed and redirected here.

Please remember to follow the rules and remain civil and professional.

*This is a trial weekly thread.*

https://redd.it/1qzyfzf
@r_devops
Weekly/temp DevOps ENTRY LEVEL - internship / fresher & changing careers

This is a weekly thread to ask questions about getting into DevOps.

If you are a student, or want to start a career in DevOps but do not know how, ask here.

Changing careers but do not have basic prerequisites? Ask here.

Before asking:

* try to search if your question was asked and answered
* try these resources: [https://roadmap.sh/devops](https://roadmap.sh/devops) (please suggest more)

_____________

Individual posts of this type may be removed and redirected here.

Please remember to follow the rules and remain civil and professional.

This is a trial weekly thread.

https://redd.it/1qzzvku
@r_devops
SSL/TLS explained (newbie-friendly): certificates, CA chain of trust, and making HTTPS work locally with OpenSSL

I kept hearing “just add SSL” and realized I didn’t actually understand what a certificate proves, how browsers trust it, or what’s happening during verification—so I wrote a short “newbie’s log” while learning.

In this post I cover:

* What an “SSL certificate” (TLS, really) is: issuer info + public key + signature
* Why the signature matters and how verification works
* The chain of trust (Root CA → Intermediate CA → your cert) and why your OS/browser already trusts certain roots
* A practical walkthrough: generate a local root CA + sign a localhost cert (SAN included), then serve a local site over HTTPS with a tiny Python server + import the root cert into Firefox
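
To make the chain-of-trust idea concrete, here is a toy Python sketch. It is only a mental model and not code from the blog post: certificates are reduced to names, the `issued_by` mapping and the function name are made up for illustration, and no signatures are checked. Real verification validates the cryptographic signature, validity dates, and more at every hop.

```python
def chain_to_trusted_root(subject, issued_by, trusted_roots, max_depth=10):
    """Walk issuer links from a certificate up to a trusted root.

    Returns the list of names along the chain, or None when the walk
    ends at an untrusted self-signed cert, hits an unknown issuer, or
    exceeds max_depth. Toy model: names only, no signature checks.
    """
    chain = [subject]
    current = subject
    for _ in range(max_depth):
        if current in trusted_roots:
            return chain
        issuer = issued_by.get(current)
        # A missing issuer or an untrusted self-signed cert ends the walk.
        if issuer is None or issuer == current:
            return None
        chain.append(issuer)
        current = issuer
    return None
```

Importing a local root CA into Firefox corresponds to adding its name to `trusted_roots` in this picture: until then, the walk from the localhost cert ends at a root nobody trusts.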

Blog Link: https://journal.farhaan.me/ssl-how-it-works-and-why-it-matters

https://redd.it/1r07ejx
@r_devops
Monitoring performance and security together feels harder than it should be

One thing I have noticed is how disconnected performance monitoring and cloud security often are. You might notice latency or error spikes, but the security signals live somewhere else entirely. Or a security alert fires with no context about what the system was doing at that moment.

Trying to manage both sides separately feels inefficient, especially when incidents usually involve some mix of performance, configuration, and access issues. Having to cross check everything manually slows down response time and makes postmortems messy.

I am curious if others have found ways to bring performance data and security signals closer together so incidents are easier to understand and respond to.

https://redd.it/1r0dbxa
@r_devops
When is it time to quit?

I wrapped up a tech panel for a Principal Azure Engineer role at an investment bank a couple of hours ago. This followed an interview with the hiring manager last Wednesday. We know each other from the past, i.e., I’ve interviewed for multiple roles at this firm over the last 5-6 years.

This role landed on my LinkedIn feed randomly. I commented on the post and emailed the hiring manager directly, we had a short back-and-forth, and his recruiter called me almost immediately. The process has been unusually smooth by modern standards.

Today’s panel felt strong. I’m confident I cleared the bar with both the Azure SME and the hiring manager. I saw visible agreement on several answers, got verbal acknowledgment more than once and handled questions from a junior panelist with ease. I was told that I’m “first in line” (not sure if that means FIFO or first on the shortlist), however, it seemed to be directionally positive.

Here’s the problem: I was laid off a little over six months ago and I am EXHAUSTED. It's like I've been on the hamster wheel of interviews since 8/4/2025. I’ve done the prep, the loops, the panels, the follow-ups. I know I’m good enough to be gainfully employed as a DevOps engineer.

If this role doesn’t turn into an offer, I’m seriously questioning whether I want to continue in tech at all. I don’t know if I have it in me to keep doing 5–7 round interview gauntlets, only to be rejected for vague reasons like “culture fit” or not smiling enough. I’ve given my adult life to STEM / engineering / corporate IT / tech and I am exhausted from having to engage with recruiters who want someone to take managerial roles for IC level pay.

I’m not bitter about rejection. I’m tired of dysfunction...hiring managers who don’t know the difference between EC2 and AWS Lambda, recruiters who can’t distinguish an AWS account from an Azure subscription and BS interview processes that ding candidates for being "too intense".

So I’m asking honestly: when is it time to walk away?
For those who’ve been at a similar crossroads...did you step back temporarily, change strategy or leave tech altogether?


TL;DR: Six months, countless interviews, strong signals in today's tech panel. If today's tech panel doesn’t result in an offer, I’m seriously considering being done with the tech interview industrial complex.

https://redd.it/1r0jghq
@r_devops
Security findings come in Jira tickets with zero context

Security scanner runs nightly and I wake up to 15 Jira tickets. Each one says fix CVE-2025-XXXX in dependency Y with no explanation of what the dependency does, where it's used, or why it matters.

I'm supposed to drop whatever sprint work I'm on, research the CVE, find where we use that package, assess actual risk, test the upgrade, and hope nothing breaks.

Meanwhile the ticket was auto-generated and the security team has no idea what they're asking me to fix. Just "scanner said critical, so here's a ticket."

Why can't these tools give actual context? Something like "this package is used in the auth flow, the vulnerability allows account takeover, here's how to fix it," instead of just screaming CVE numbers at me.

https://redd.it/1r4xpz9
@r_devops
Duplicate writes in multi-step automation: where do you enforce idempotency?

Genuine question.

We run multi-step automation that touches tickets, DB writes, API calls and emails.

A step partially failed or timed out. We restarted the run. A downstream write had already happened. Result: duplicate tickets, duplicate notifications.

This does not feel like a simple retry problem. It is about where step boundaries live and how side effects stay idempotent across an entire run.

Things we are trying:

* Treating write-capable steps differently from read-only steps
* Requiring idempotency keys or operation ids for side effects
* Making re-runs step-scoped instead of whole-run
* Keeping a durable per-step ledger with inputs, outputs and timestamps
* Adding manual pause or cancel before certain write steps
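
To show the shape of the idempotency-key plus per-step-ledger idea, here is a minimal Python sketch. It is not our production code: `StepLedger` and `run_once` are hypothetical names, and the in-memory dict stands in for what would have to be a durable store (a database table, for instance).

```python
class StepLedger:
    """In-memory stand-in for a durable per-step ledger.

    Assumption: a real ledger would persist to a database so that it
    survives process restarts; a dict is used here only for illustration.
    """

    def __init__(self):
        self._results = {}

    def run_once(self, run_id, step, action):
        """Run action() at most once per (run_id, step) and replay its result."""
        key = (run_id, step)  # acts as the idempotency key for this side effect
        if key in self._results:
            # Step already completed in an earlier attempt: skip the side effect.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result
```

This covers the "restart the whole run" case, but not a crash between `action()` finishing and the ledger write. Closing that window still seems to require the downstream API to accept the same idempotency key, or the write and the ledger entry to share a transaction.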

It still feels easy to get wrong.

Where do you enforce idempotency in practice?

* Application layer
* Workflow engine
* Middleware or sidecar
* Sagas or outbox pattern
* Approval gates

If you have shipped long-running automation with real side effects, what worked and what caused incidents?

https://redd.it/1r4u7zr
@r_devops