Reddit DevOps – Telegram
Becoming better on the coding side?

Does anyone have any recommendations or suggestions for becoming better on the programming side of the house?

It feels as if every job posting wants you to be not only a strong Linux admin proficient with Kubernetes, Terraform, databases, and the flavor-of-the-month observability and GitOps tools, but also a full-stack dev.

I’ve got about 10 years of experience in IT but it’s all on the ops side of the house and I feel like I lack an understanding of “programming”.

I’ve gone through CS50P, Automate the Boring Stuff, and boot.dev. I’m fairly comfortable with basic Python, Bash, and PowerShell scripts, and I automate everything I can. I manage my scripts with git and have set up pipelines to deploy infrastructure, but I feel like I’m just missing some piece of the puzzle.

Is the answer to go back to school for a CS or software engineering degree through somewhere like WGU? That doesn’t seem like the right call, since my goal isn’t to be a dev. I’d love to move into an SRE/DevOps/platform engineering role, but I don’t have the coding chops and just feel stuck at the moment.

Does anyone have any recommendations?

https://redd.it/1qy1gvf
@r_devops
Why most background workers aren’t actually crash-safe

I’ve been working on a long-running background system and kept noticing the same failure pattern: everything looks correct in code, retries exist, logging exists — and then the process crashes or the machine restarts and the system quietly loses track of what actually happened.

What surprised me is how often retry logic is implemented as control flow (loops, backoff, exceptions) instead of as durable state (yeah I did that too). It works as long as the process stays alive, but once you introduce restarts or long delays, a lot of systems end up with lost work, duplicated work, or tasks that are “stuck” with no clear explanation.

The thing that helped me reason about this was writing down a small set of invariants that actually need to hold if you want background work to be restart-safe — things like expiring task claims, representing failure as state instead of stack traces, and treating waiting as an explicit condition rather than an absence of activity.
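To make the "durable state instead of control flow" point concrete, here is a minimal sketch of those invariants using SQLite as the task store. The schema and function names are illustrative, not from any particular framework: claims expire so a crashed worker releases its work, and failure is recorded as state rather than a stack trace that dies with the process.

```python
import sqlite3
import time

def init(db):
    # Task state lives in the database, not in the worker's control flow,
    # so a crash or restart cannot lose track of what actually happened.
    db.execute("""CREATE TABLE IF NOT EXISTS tasks (
        id               INTEGER PRIMARY KEY,
        payload          TEXT NOT NULL,
        status           TEXT NOT NULL DEFAULT 'pending',  -- pending | claimed | done | failed
        claim_expires_at REAL,     -- expiring claims release work held by dead workers
        attempts         INTEGER NOT NULL DEFAULT 0,
        last_error       TEXT)""")

def claim(db, lease_seconds=30):
    """Claim one runnable task: pending, or claimed but with an expired lease."""
    now = time.time()
    row = db.execute(
        """SELECT id, payload FROM tasks
           WHERE status = 'pending'
              OR (status = 'claimed' AND claim_expires_at < ?)
           LIMIT 1""", (now,)).fetchone()
    if row is None:
        return None   # waiting is an explicit condition, not an absence of activity
    task_id, payload = row
    # Conditional UPDATE so two workers cannot claim the same task.
    cur = db.execute(
        """UPDATE tasks
           SET status = 'claimed', claim_expires_at = ?, attempts = attempts + 1
           WHERE id = ? AND (status = 'pending' OR claim_expires_at < ?)""",
        (now + lease_seconds, task_id, now))
    db.commit()
    return (task_id, payload) if cur.rowcount == 1 else None

def complete(db, task_id):
    db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
    db.commit()

def fail(db, task_id, err):
    # Failure is state you can query later, not a lost exception.
    db.execute("UPDATE tasks SET status = 'failed', last_error = ? WHERE id = ?",
               (err, task_id))
    db.commit()
```

With this shape, a restart simply means workers call `claim()` again; anything half-finished shows up as an expired claim instead of silently vanishing.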

Curious how others here think about this, especially people who’ve had to debug background systems after a restart.

https://redd.it/1qy1ve1
@r_devops
Security analyst trying to move into DevOps/Cloud — what am I missing?

I’m finding myself stuck between choices, and maybe someone who does DevOps or works with cloud systems could share what it’s actually like. One path feels uncertain, the other unclear; people handling this work day to day would know how it plays out. Hearing real stories instead of polished answers would help more than anything right now.

Background:

1.7 years at PwC as a Security Operations Analyst

My day-to-day: tracking threats with SIEM and SOAR tools, responding quickly when incidents popped up, meeting ISO 27001 requirements on data safety, cleaning up over-provisioned access rights for Linux users, running data loss prevention to keep sensitive files from leaking, and coordinating closely with infrastructure groups to keep systems aligned.

I had to leave the job for family reasons and have now been unemployed for 1.5 years.

I noticed my focus shifting while I was in that role, and even more afterwards: it drifted toward setup and systems rather than alert chasing. What stood out wasn’t the response grind but how things were built behind it.

So after leaving, I spent significant time building hands-on DevOps/DevSecOps skills:

Learning and making projects with docker + k8s

GitOps deployments using ArgoCD

Monitoring with Grafana

CI/CD pipelines using GitHub Actions, Docker, Trivy, GHCR

AWS serverless project using Lambda, API Gateway, DynamoDB, IAM

Terraform for infrastructure provisioning

I’m aiming for positions in DevSecOps, cloud, or DevOps, and staying clear of returning to straight SOC work. What pulls me forward isn’t the old path but blending security into systems as they’re built: automation, infrastructure, and building safeguards early instead of chasing alerts later. Jumping back into reactive monitoring is off the table; sticking only to incident tracking doesn’t fit where I’m headed or how tech is moving.

Problem:

Still no interviews, even after redoing everything: new materials, a fresh focus on Cloud Security and DevSecOps. Hard work doesn’t always open doors, it turns out. The frustration builds slowly, because I’ve actually done the tasks, touched the systems, built things myself. Yet old labels stick too hard; once SOC, always seen that way, it feels like, and that word drags along assumptions I can’t shake off fast enough.

Faking skills isn’t my goal. An honest shift feels right instead.

So, for people who have already made that change:

What path took you from a SOC role into working with DevOps or cloud systems?

Maybe DevSecOps feels like a stretch right now; could starting with junior DevOps make more sense? Currently I have two accounts for applying: one as a DevOps fresher, where I get calls but get rejected because they’re looking for 2024–2025 graduates while I graduated in 2022, and one as an experienced candidate.

Then again, jumping straight into security-infused workflows might align better. I’m genuinely unsure which path makes more sense.

What makes a resume/interview stand out for someone in this situation?

Could it be there's something I haven't noticed yet?

People who’ve walked this road already would know what actually works; their steps have covered the ground I’m standing on now.

https://redd.it/1qy71it
@r_devops
What tools can I use to visualise a Terraform plan?

I’m new to Terraform. Before my `terraform apply` goes live, how can I see what resources will be created and how?
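For what it’s worth, the first-party CLI already covers most of this. The commands below are standard Terraform; the `tfplan` filename is just a convention, the `jq` step assumes `jq` is installed, and the graph step assumes Graphviz:

```shell
# Write the execution plan to a file and review it before it goes live
terraform plan -out=tfplan

# Human-readable view of exactly what will be created/changed/destroyed
terraform show tfplan

# Machine-readable view, useful for scripted policy checks
terraform show -json tfplan | jq '.resource_changes[].change.actions'

# Rough dependency-graph picture (requires Graphviz's `dot`)
terraform graph | dot -Tsvg > graph.svg
```

Third-party plan visualisers exist on top of the JSON output, but `plan -out` plus `show` answers the core "what will actually happen" question.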

https://redd.it/1qydkqu
@r_devops
I vibe coded a site to practice DevOps skills. Would love some feedback.

A week ago I started building skillops because I’m tired of doing generic LeetCode questions for DevOps interviews. I want to turn this into a way for candidates to actually show off their skills in a real environment.

Currently, there are 3 hands-on challenges: Terraform, K8s, and GitHub Actions. I’d love if you could give them a try and share your feedback so I can grow this in the right direction.

Access it here: https://skillops.io (No login/signup required).

Happy to discuss the roadmap or technical stack!

https://redd.it/1qyt8q6
@r_devops
Moving from Sysadmin for SMB to Devops

Hi everyone,

I’m currently a sysadmin working mainly with SMBs (up to ~80–100 users).

I have 6 years of experience, and my biggest project was the network deployment of a big mall in Montréal (180 APs, HA firewall, 60 switches with single-mode fiber, DAS infra, etc.). I am 30 years old and I live in Montreal (Canada).

My background is mostly networking and systems: firewalls, switches, access points, Windows servers, AD, backups, troubleshooting, keeping things running with limited resources. I’ve always had very good feedback from clients and users.

That said, I’ve never worked for large enterprises or in big-scale environments, and I’m starting to feel stuck in what I’d call a “classic / old-school sysadmin” role: managing small infrastructures, doing a bit of everything, but without real exposure to cloud-native or modern DevOps practices.

I’m seriously considering moving towards cloud / DevOps, but I have a few doubts and I’d like honest opinions from people already in the field.

My main concerns:

• I don’t come from a software development background

• I can read scripts and do some automation, but I’m clearly not a former dev

• I’m worried this could be a hard blocker for DevOps roles

On the other hand:

• I’m highly motivated

• I’m ready to spend the next 6–12 months doing labs, learning properly and building real projects

• I’m planning to work on technologies like:

• Docker / Kubernetes

• CI/CD (GitHub Actions, GitLab CI, etc.)

• Terraform / IaC

• Cloud platforms (AWS / Azure)

• The goal would be to have solid, demonstrable projects I can show during interviews

What I’m really trying to understand is:

• Is this transition realistic from an SMB sysadmin background?

• Is the lack of a strong dev background a deal breaker, or something that can be compensated with infra + automation skills?

• Does motivation + consistent practice over ~1 year actually pay off in this field?

• Any recommendations on what to focus on first or what to avoid?

I’m not looking for shortcuts or buzzwords — I just want to evolve, work on more modern stacks, and avoid stagnating in small-scale sysadmin work forever.

Thanks in advance for any feedback, even blunt or critical ones. I’d rather hear the truth than sugar-coated answers.

https://redd.it/1qyrahx
@r_devops
Release Antigravity Link v1.0.10 – Fixes for the recent Google IDE update

Hey everyone,

If you’ve been using Antigravity Link lately, you probably noticed it broke after the most recent Google update to the Antigravity IDE. The DOM changes they rolled out essentially killed the message injection and brought back all the legacy UI elements we were trying to hide, which made it unusable. I just pushed v1.0.10 to Open VSX and GitHub, which gets everything back to normal.

What’s fixed:

Message Injection: Rebuilt the way the extension finds the Lexical editor. It’s now much more resilient to Tailwind class changes and ID swaps.

Clean UI: Re-implemented the logic to hide redundant desktop controls (Review Changes, old composers, etc.) so the mobile bridge feels professional again.

Stability: Fixed a lingering port conflict that was preventing the server from starting for some users.

You’ll need to update to 1.0.10 to get the chat working again. You can grab it directly from the VS Code Marketplace (Open VSX), or in the Antigravity IDE: open the Extensions window (Ctrl + Shift + X), click the little wheel on Antigravity Link, and select "Download Specific Version" and choose 1.0.10, or just set it to auto-update. You can find it by searching for "@recentlyPublished Antigravity Link". I only tested this on Windows, so if you run into any other weirdness with the new IDE layout, let me know by opening an issue on GitHub.

GitHub: https://github.com/cafeTechne/antigravity-link-extension

https://redd.it/1qywaq6
@r_devops
How do adult-content platforms usually evaluate infrastructure providers?

Hi everyone,

I’m trying to understand how engineering or DevOps teams working on high-traffic, adult-content platforms typically evaluate and choose their infrastructure or storage providers.

From an ops perspective, are these decisions usually driven by referrals, private communities, industry-specific forums, or direct outreach? Are there particular technical concerns (traffic patterns, abuse handling, storage performance, legal workflows, etc.) that tend to weigh more heavily compared to other industries?

I’m not looking to pitch anything here — just trying to learn how this segment approaches infrastructure decisions so I can better understand the ecosystem.

Any insights or experiences would be really helpful.

Thanks!

https://redd.it/1qz2e0w
@r_devops
Software Engineer to Cloud/DevOps

Has anyone here successfully transitioned from software development (especially web development) to cloud engineering or DevOps? How was the experience? What key things did you learn along the way? How did you showcase your new skills to land a job?



https://redd.it/1qz34mj
@r_devops
I've lost production data several times. So I'm developing a tool to prevent this from happening again.

Hi everyone, I'm Benjamin, founder and freelancer.

A little anecdote: during my career, I've managed web development agencies and worked in startups and SMEs. Over the years, we inevitably lost data due to corrupted or nonexistent backups. Nobody checked. Great!

This prompted me to dig deeper into the subject. It turns out that only about 50% of backups are successfully restored (which is frightening when you think about it), and almost no one performs regular restore tests. We just trust the green checkmark on the backup and move on.

I examined the existing solutions. Veeam and Commvault offer backup validation features, but they only work within their own ecosystem and are geared towards large enterprises. If you're an SME using PostgreSQL on S3 or another combination of tools, there's practically nothing available.

That's how I started developing RestoreProof. The idea is quite simple: you deploy a small runner on your infrastructure, it retrieves your backup, restores it to an isolated container, performs the defined checks (SQL queries, file integrity, etc.), generates a signed report, and then deletes all the data. No data leaves your network.
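The core loop described above is easy to sketch. The toy below uses SQLite as a stand-in for a real database and a file copy as a stand-in for the actual restore step (all names are illustrative, not RestoreProof's API), but the invariant is the same: restore into an isolated scratch area, run defined checks, report, and delete everything.

```python
import os
import shutil
import sqlite3
import tempfile

def restore_and_verify(backup_path, checks):
    """Restore a backup into an isolated scratch copy, run each check,
    then delete the data. Returns a report dict instead of a green checkmark."""
    scratch = tempfile.mkdtemp(prefix="restore-test-")
    report = {"backup": backup_path, "checks": {}}
    try:
        restored = os.path.join(scratch, "restored.db")
        shutil.copy(backup_path, restored)   # stand-in for pg_restore / s3 fetch
        db = sqlite3.connect(restored)
        for name, (sql, expected) in checks.items():
            value = db.execute(sql).fetchone()[0]
            report["checks"][name] = (value == expected)
        db.close()
    finally:
        shutil.rmtree(scratch)               # no data survives the test
    report["ok"] = all(report["checks"].values())
    return report
```

The interesting part is that the check runs against the *restored* copy, so a corrupted or empty backup fails loudly instead of passing silently.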

The report feeds into a dashboard that's useful for compliance (ISO 27001, SOC 2), but honestly, the main benefit is ensuring your backups are working correctly.

I'm particularly curious: how do you manage backup testing today? Do you test restores or do you prefer to wait for a problem to occur? I'd like to know how other teams handle it.

https://redd.it/1qz5iwn
@r_devops
Open Source Terraform Modules for SAMA (Saudi) & NESA (UAE) Compliance

I built a set of Terraform modules pre-configured for Gulf region compliance (SAMA/NESA).

The Problem: Deploying to KSA/UAE requires strict data residency (GCP Dammam, Oracle Jeddah), mandatory encryption (CMEK), and log retention policies that differ from standard US/EU setups.

The Solution:

Modules for AWS, GCP, Azure, and OCI.

Enforces Private Subnets (no public DBs).

Enforces KMS rotation (365 days).

Hardcoded region checks to prevent accidental `us-east-1` deployments.
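A hardcoded region check like that is typically just an input validation; a sketch of the pattern (the allowed-region list here is illustrative, not taken from the repo):

```terraform
variable "region" {
  type = string

  validation {
    condition     = contains(["me-south-1", "me-central-1"], var.region)
    error_message = "Deployments are restricted to approved Gulf regions; us-east-1 is blocked."
  }
}
```

The validation fails at `terraform plan` time, so a mistaken `us-east-1` never reaches `apply`.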

Repo: https://github.com/SovereignOps/terraform-aws-sama

https://redd.it/1qz2vfs
@r_devops
Every team wants "MLOps", until they face the brutal truth of DevOps under the hood

I’ve lost count of how many early-stage teams build killer ML models locally, then slap them into production thinking a simple API can scale to millions of clients... until the first outage hits, costs skyrocket, or drift turns the model to garbage.

And they assign it to a solo dev or junior engineer as a "side task".

Meanwhile:

No one budgets for proper tooling like registries or observability.

Scaling? "We'll Kubernetes it later".

Monitoring? Ignored until clients churn from slow responses.

Model updates? Good luck versioning without a registry - one bad push and you're rolling back at 3AM.

MLOps is DevOps fundamentals applied to ML: CI/CD, IaC, autoscaling, and relentless monitoring.



I put together a hands-on video demo: building a scalable ML API with FastAPI, an MLflow registry, Kubernetes, and Prometheus/Grafana monitoring, from live coding to chaos-tested prod, including pod failures and load spikes. Hope it saves you some headaches.

https://youtu.be/jZ5BPaB3RrU?si=aKjVM0Fv1DTrg4Wg

https://redd.it/1qz7e1r
@r_devops
I wrote a script to automate setting up a fresh Mac for Development & DevOps (Intel + Apple Silicon)

Hey everyone,

I recently reformatted my machine and realized how tedious it is to manually install Homebrew, configure Zsh, set up git aliases, and download all the necessary SDKs (Node, Go, Python, etc.) one by one.

To solve this, I built `mac-dev-setup` – a shell script that automates the entire process of bootstrapping a macOS environment for software engineering and DevOps.

**Repo:** [https://github.com/itxDeeni/mac-dev-setup](https://github.com/itxDeeni/mac-dev-setup)

**Why I built this:** I switch between an older Intel MacBook Pro and newer M-series Macs. I needed a single script that was smart enough to detect the architecture and set paths correctly (`/usr/local` vs `/opt/homebrew`) without breaking things.

**Key Features:**

* **Auto-Architecture Detection:** Automatically adjusts for Intel (x86) or Apple Silicon (ARM) so you don't have to fiddle with paths.
* **Idempotent:** You can run it multiple times to update your tools without duplicating configs or breaking existing setups.
* **Modular Flags:**
* `--minimal`: Just the essentials (Git, Zsh, Homebrew).
* `--skip-databases`: Prevents installing heavy background services like Postgres/MySQL if you prefer using Docker for that (saves RAM on older machines!).
* `--skip-cloud`: Skips AWS/GCP/Azure CLIs if you don't need them.
* **DevOps Ready:** Includes Terraform, Kubernetes tools (kubectl, k9s), Docker, and Ansible out of the box.
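The "Idempotent" claim above usually boils down to check-before-install; a minimal sketch of the pattern (the function name is mine, and the actual `brew install` is stubbed with an echo so the sketch runs anywhere):

```shell
ensure_tool() {
  # Install only if missing, so repeated runs are safe no-ops.
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 already installed, skipping"
  else
    echo "would install $1"   # in the real script: brew install "$1"
  fi
}

ensure_tool ls
ensure_tool definitely-missing-tool
```

The same guard works for config files (`[ -f ~/.zshrc.d/aliases ] || cp ...`), which is what keeps re-runs from duplicating configs.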

**What it installs (by default):**

* **Core:** Homebrew, Git, Zsh (with Oh My Zsh & plugins).
* **Languages:** Node.js (via nvm), Python, Go, Rust.
* **Modern CLI Tools:** `bat`, `ripgrep`, `fzf`, `jq`, `htop`.
* **Apps:** VS Code, iTerm2, Docker, Postman.

**How to use it:** You can clone the repo and inspect the code (always recommended!), or run the one-liner in the README.

```bash
git clone https://github.com/itxDeeni/mac-dev-setup.git
cd mac-dev-setup
./setup.sh
```


I’m looking for feedback or pull requests if anyone has specific tools they think should be added to the core list.

Hope this saves someone a few hours of setup time!

Cheers,

itzdeeni

https://redd.it/1qz5itd
@r_devops
Best way to get started?

I’ve been wanting to start learning DevOps, but I don’t know where to start.

My background is in IT; I’ve been working for the last 5 years as a Data Center Technician, mostly installing servers, with experience in fiber optics as well.

I also did a CCNA course about two years ago (I don’t know if it’s relevant).

If any more information is needed, please let me know and I’ll add it.

Thanks in advance! :)

https://redd.it/1qza05a
@r_devops
Simple Terraform module for multi-service AWS ECS (Fargate/EC2)

Hi everyone,

I've been working on a Terraform module to simplify deploying containerized apps on AWS ECS. I wanted something that handles the boilerplate for VPC, Load Balancers, and ECS while keeping the interface clean for multiple services.

Repo: [https://github.com/NazarSenchuk/terraform-aws-ecs](https://github.com/NazarSenchuk/terraform-aws-ecs)

Main things it handles:

* Dynamic VPC setup (public/private subnets, NAT, etc).
* Single variable switch between Fargate with spot and EC2.
* Support for multiple deployment strategies and ECS Service Connect.
* Multi-service management in one block.

Example:

module "ecs_cluster" {
  source  = "NazarSenchuk/awsecs/aws"
  version = "1.0.0"

  general = {
    environment = "prod"
    project     = "my-app"
    region      = "us-east-1"
  }

  infrastructure = { type = "FARGATE" }

  services = {
    web = {
      name          = "web-service"
      img           = "nginx:latest"
      desired_count = 2
      alb_path      = "/*"
      deploy = {
        enabled  = true
        strategy = "ROLLING"
      }
    }
  }
}

Registry link: [here](https://registry.terraform.io/modules/NazarSenchuk/awsecs/aws)
More examples: [here](https://github.com/NazarSenchuk/terraform-aws-ecs/blob/main/docs/examples.md)

Would appreciate any feedback on the structure or if anyone has suggestions or additional parameters i need to add.

Thanks.

https://redd.it/1qzbnx3
@r_devops
Release service-bus-tui v1.0.0-alpha

Hey everyone,

I’m working on a small tool for exploring Azure Service Bus entities and messages directly from the terminal. There’s still a lot of work to do, but you can already browse messages from topics/subscriptions and queues.

Github : https://github.com/MonsieurTib/service-bus-tui

https://redd.it/1qzbnl5
@r_devops
Need advice: am I overthinking or is our message queue setup really so insecure?

I'm pretty new to this team (3 months in) and noticed something that seems off but nobody's mentioned it so maybe I'm missing context.

We're running a multi tenant saas and use message queues to pass events between services. The queue itself has no authentication or authorization configured. Like tenant A could technically subscribe to tenant B's topics if they knew the topic names.

When I asked about it my senior said "it's fine, everything's on a private network" but that doesn't feel like enough? Isn't that basically security through obscurity?

Am I being paranoid or should I push back on this? Don't want to be that junior who questions everything but also this seems like a pretty big issue.
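For what it's worth, topic-level authorization doesn't have to mean a heavyweight broker migration; even an application-side guard is better than trusting the network boundary. A minimal sketch of the namespacing-plus-check pattern (class and method names are illustrative, not any particular client library):

```python
class TenantScopedBus:
    """Wraps a raw pub/sub client so every subscribe/publish is checked
    against the caller's tenant, instead of trusting the private network."""

    def __init__(self, raw_client, tenant_id):
        self._raw = raw_client
        self._tenant = tenant_id

    def _check(self, topic):
        # Topics are namespaced as "<tenant_id>.<name>"; reject cross-tenant access.
        if not topic.startswith(self._tenant + "."):
            raise PermissionError(
                f"tenant {self._tenant!r} may not access {topic!r}")

    def subscribe(self, topic, handler):
        self._check(topic)
        self._raw.subscribe(topic, handler)

    def publish(self, topic, message):
        self._check(topic)
        self._raw.publish(topic, message)
```

Most brokers (RabbitMQ, Kafka, cloud queues) also support real per-client credentials and topic ACLs, which is the stronger fix; the wrapper just shows how cheap the first step can be.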

https://redd.it/1qzeubb
@r_devops
Coming from a Kubernetes-heavy SRE background and moving into AWS/ECS ops – could use some perspective

Hey all, looking for some perspective from people who’ve been around this longer than me.

I’ve been working as an SRE for just under three years now, and almost all of that time has been in Kubernetes-based environments. I spent most of my days dealing with production issues, on-call rotations, scaling problems, deployments that went sideways, and generally keeping clusters alive. Observability was a big part of my work too, Prometheus, Grafana, ELK, Datadog, some Jaeger tracing. Basically living inside k8s and the tooling around it.

I’m now interviewing for a role that’s a lot more AWS-ops heavy, and honestly it feels like a bit of a mental shift. They don’t run Kubernetes at all. Everything is ECS on AWS, and the role is much more focused on things like cost optimization, release and change management, versioning, and day-to-day production issues at the AWS service level. None of that sounds crazy to me in theory, but I can feel where my experience is thinner when it comes to AWS-native workflows, especially around ECS and FinOps.

I’m not trying to pretend I’m an AWS expert. I know how to think about capacity, failures, rollbacks, and noisy systems, but now I’m trying to translate that into how AWS actually does things. Stuff like how people really manage releases in ECS, where AWS costs usually get out of hand in real environments, and what ops teams actually look at first when something breaks in production outside of Kubernetes.

If you’ve moved from a Kubernetes-heavy setup into more traditional AWS or ECS-based ops work, I’d really like to hear how that transition went for you. What did you wish you understood earlier? What mattered way more than you expected? And what things did you overthink that turned out not to be that important?

Just trying to level myself up properly and not walk into this role blind. Appreciate any advice.

https://redd.it/1qzhbcr
@r_devops
Vouch: earn the right to submit a pull request (from Mitchell Hashimoto)

Mitchell Hashimoto got tired of watching open-source maintainers drown in AI-generated pull requests. So he built Vouch, a contributor trust management system. The concept is almost absurdly simple: before you can submit a PR to a project using Vouch, someone already trusted has to vouch for you.

The whole thing lives in a single text file inside the repo. One username per line. A minus sign means denounced. You can parse it with grep.
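Based on the format described above (one username per line, leading minus for denounced), the grep-ability claim checks out; here is a self-contained demo with a sample file (the filename and usernames are illustrative):

```shell
# Build a sample trust file: one username per line, "-" prefix = denounced.
cat > trust.td <<'EOF'
mitchellh
alice
-mallory
EOF

grep -v '^-' trust.td    # currently vouched contributors
grep    '^-' trust.td    # denounced contributors
grep -q '^alice$' trust.td && echo "alice is vouched"
```

That flatness is the point: the trust state diffs cleanly in git, so "who vouched for whom, and when" is just `git log -p trust.td`.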

Sigstore verifies artifacts. SLSA verifies builds. Dependabot checks dependencies. None of them answer the question of whether a given person should be contributing to a project at all. That's the gap Vouch fills: contributor trust, not artifact trust.

Hashimoto designed it the same way he designed Terraform. Declarative. Human-readable. Version-controlled. Instead of .tf files for infrastructure, you get .td files for trust. Same brain, different domain.

The xz-utils backdoor is the elephant in the room. "Jia Tan" spent two years earning trust through legitimate contributions before planting a CVSS 10.0 backdoor. Vouch wouldn't have stopped that attack. But the vouch record would've been visible in the git history (who vouched for them, and when), and the denouncement would propagate to every project subscribing to that vouch list. Less of a lock, more of a security camera.

Ghostty is already integrating it. The repo picked up 600 stars in three days. A GitHub staff member commented on the HN thread saying they'd ship changes "next week."

The concerns are real though. Gatekeeping is the obvious one. Open source is supposed to be open, and Vouch creates an explicit barrier where there wasn't one before. One HN commenter called it "social credit on GitHub." The persona gaming problem hasn't gone away either; someone could still spend months building trust before going rogue.

Hashimoto himself flags it as experimental. But it's the first serious attempt at making contributor trust visible and version-controlled.

I wrote up the full breakdown, including how Vouch compares to PGP's web of trust, Advogato, and Debian's maintainer process, here if you want the deep dive.

https://redd.it/1qzgoao
@r_devops
State of OpenTofu?

Has OpenTofu gained anything on Terraform? Has it proven itself as an alternative?


I unfortunately don't use IaC in my current deployment but I'm curious how the landscape has changed.



https://redd.it/1qz67sq
@r_devops
Need advice: trying to document an installation guide for production

Hey guys, I recently open-sourced a pretty huge self-hosted project. I've set up a docker-compose.yaml that worked fine for local deployments, but I suppose I made a lot of rookie mistakes for a production deployment guide.

I don't have much experience in DevOps except for small services and deploying websites with nginx+letsencrypt, and when people started coming to me for advice on why their setup failed, I was a bit overwhelmed.

For the last three evenings I've been trying to come to a default installation guide for a reverse proxy that would work fine for production.

So, the current setup is pretty standard:

- docker-compose.yaml with setup on localhost by default
- pretty much a default Go backend container
- frontend container that builds the frontend, with a baked-in nginx that serves the static files on / and reverse-proxies /api to the backend on localhost

 

My initial prod setup directed people to build the images manually and to edit the frontend/nginx.conf.template that the frontend container uses, so that people change their servername/adjust their IP address and so on.

Well, after debugging a couple environment-specific problems that people faced trying to deploy it this way, I realized that I need to adjust the guide ASAP.

At first, I thought that I needed to remove the baked-in nginx from the frontend container and move it up to `docker-compose.yaml`, but then I read a suggestion online that I can just put another reverse proxy in front of the frontend-internal one.

 

So, my current thinking process is:

1. adjust nginx.conf.template to accept the DOMAIN and BACKEND_PORT env vars, so that they're provided by docker-compose, not changed by the user (or should the baked-in nginx.conf be left untouched, without accepting those env vars, staying localhost-only?)
2. add a new container in docker-compose for prod setups - caddy with a reverse proxy in front (maybe as an override file)
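For the reverse-proxy-in-front option, the Caddy side can stay tiny. A sketch of a Caddyfile, assuming the frontend service is named `frontend` in docker-compose and the domain is a placeholder:

```caddyfile
# Caddy obtains and renews TLS certificates for the domain automatically.
yourdomain.example {
    # Forward everything to the frontend container's internal nginx,
    # which keeps serving / and proxying /api exactly as before.
    reverse_proxy frontend:80
}
```

This way the baked-in nginx can stay localhost-only and users never edit nginx.conf.template; the only per-deployment knob is the domain in the Caddyfile (or a DOMAIN env var interpolated into it).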

Also, is it fine to mix caddy and nginx this way? or am I better off overhauling the setup entirely? If so, what's the best course of action for me?

In case someone wants to take a look: https://github.com/Vsein/Neohabit (the setup files are docker-compose.yaml, .env.example, frontend/nginx.conf.template; all of them are mentioned in the installation guide "building manually from source")

And here's what I've been trying to do: https://github.com/Vsein/Neohabit/pull/110

Anyway, sorry if this post is amateurish; I just genuinely feel like I might be wasting my time on something that's the wrong direction entirely.

https://redd.it/1qzmth6
@r_devops