Reddit DevOps – Telegram
ACA autoscaling killing long-running jobs: best practice?

Using Azure Container Apps with HTTP autoscaling (a threshold of 10 concurrent users) for report generation. During scale up/down, replicas get terminated and reports fail mid-execution.

Questions:
• Is this the right pattern for long-running jobs on ACA?
• Any Service Bus lock timeout gotchas?
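One common mitigation for scale-in is to let a replica finish its in-flight job on SIGTERM instead of pulling new work. Below is a minimal, stdlib-only Python sketch of that drain loop. It assumes the platform sends SIGTERM and allows a grace period before killing the replica (check the app's termination grace period setting). The in-memory `queue.Queue` is a stand-in for a Service Bus receiver, and `run_worker` / `install_sigterm_handler` are hypothetical names. With a real Service Bus queue, a job that outlives the message lock also needs lock renewal (the Python `azure-servicebus` SDK ships an `AutoLockRenewer` for this). Otherwise the lock expires and the message can be redelivered to another consumer while the first one is still working on it.

```python
import queue
import signal
import threading
from typing import Callable, List


def install_sigterm_handler(stop: threading.Event) -> None:
    """On SIGTERM, ask the worker to stop taking new jobs (must be called from the main thread)."""
    def _handler(signum, frame):
        stop.set()
    signal.signal(signal.SIGTERM, _handler)


def run_worker(jobs: "queue.Queue[str]", process: Callable[[str], None],
               stop: threading.Event, poll_seconds: float = 0.1) -> List[str]:
    """Process jobs until `stop` is set; a job that has already started always runs to completion."""
    done: List[str] = []
    while not stop.is_set():
        try:
            job = jobs.get(timeout=poll_seconds)
        except queue.Empty:
            continue
        process(job)  # in-flight work is never interrupted by the stop flag
        done.append(job)
        jobs.task_done()
    return done


if __name__ == "__main__":
    stop_event = threading.Event()
    install_sigterm_handler(stop_event)
    q: "queue.Queue[str]" = queue.Queue()
    for name in ("report-1", "report-2", "report-3"):
        q.put(name)
    # Simulate a scale-in signal arriving while the first report is being generated.
    finished = run_worker(q, lambda job: stop_event.set(), stop_event)
    print(finished, q.qsize())  # ['report-1'] 2
```

Jobs left in the queue are picked up by a replica that is still running, which is the behaviour you want from a peek-lock consumer that never completed those messages.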

https://redd.it/1r4hkzu
@r_devops
Dual boot or VMware

I started learning DevOps a while ago. I used to practice on VMware, but sometimes the machine freezes, especially when I am learning k8s. So I started thinking about dual booting, but I don't know if that is good enough for learning and practicing all the tools, or whether I should give the machine more specs.

https://redd.it/1r57rba
@r_devops
Homelab or digital ocean?

I need to do projects to learn and to show off on my resume, but I'm a student and I don't have money. I thought maybe I should use a cloud provider's free trial to show competency with servers (Terraform), but all signs lead me to believe that homelabbing is what will get me through a special interview I have in a month and a half. Should I make the investment and homelab, or try to do projects with a cloud provider?

https://redd.it/1r58xhz
@r_devops
What does “config hell” actually look like in the real world?

I've heard about "Config Hell" and have looked into different things like IAM sprawl and YAML drift but it still feels a little abstract and I'm trying to understand what it looks like in practice.

I'm looking for war stories on when things blew up, why, what systems broke down, who was at fault. Really just looking for some examples to ground me.

I'd take anything worth reading on it, too.

https://redd.it/1r5ew1g
@r_devops
Where can I find courses?

Hello all,

I want advice on where to find good courses about DevOps, Kubernetes, Docker, and AWS.

A course that covers most of this in one go would be even better.

https://redd.it/1r5kgn9
@r_devops
I made a single binary alternative to Grafana+Prometheus for monitoring Docker on remote servers

I got tired of needing a full Grafana + Prometheus + Loki + Alertmanager stack just to monitor a handful of Docker containers across a couple of VPSs, so I built a simpler alternative.

A single-binary agent runs on your server, collecting host metrics from /proc, monitoring containers via the Docker socket (read-only), tailing logs, and evaluating alert rules. You define alert conditions in a TOML config (container down, high CPU, disk filling up, unhealthy health checks, restart loops) and get notified via email or webhooks. You connect from your machine over SSH via a TUI: no exposed ports, no HTTP server, nothing to firewall.
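To make the alert-rule idea concrete, here is a minimal Python sketch of threshold rules that must hold for several consecutive samples before firing, which is one common way to avoid flapping alerts. The rule fields (`metric`, `op`, `threshold`, `for_samples`) and metric names are hypothetical, not tori's actual TOML schema; they are written as the dicts a TOML parser would return.

```python
import operator
from typing import Dict, List

# Hypothetical rule set, shaped like a parsed TOML [[alerts]] array (not tori's real schema).
RULES: List[Dict] = [
    {"name": "high_cpu", "metric": "cpu_percent", "op": ">", "threshold": 90.0, "for_samples": 3},
    {"name": "disk_filling", "metric": "disk_used_percent", "op": ">=", "threshold": 85.0, "for_samples": 1},
    {"name": "restart_loop", "metric": "restarts_last_10m", "op": ">=", "threshold": 3, "for_samples": 1},
]

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}


def firing_alerts(rules: List[Dict], history: List[Dict[str, float]]) -> List[str]:
    """Return names of rules whose condition held for each of the last `for_samples` samples.

    `history` is ordered oldest -> newest; each sample maps metric name -> value.
    """
    fired = []
    for rule in rules:
        n = rule["for_samples"]
        recent = history[-n:]
        if len(recent) < n:
            continue  # not enough data yet to judge this rule
        check = OPS[rule["op"]]
        if all(rule["metric"] in s and check(s[rule["metric"]], rule["threshold"]) for s in recent):
            fired.append(rule["name"])
    return fired


if __name__ == "__main__":
    samples = [
        {"cpu_percent": 95.0, "disk_used_percent": 60.0, "restarts_last_10m": 0},
        {"cpu_percent": 97.0, "disk_used_percent": 61.0, "restarts_last_10m": 0},
        {"cpu_percent": 93.0, "disk_used_percent": 86.0, "restarts_last_10m": 4},
    ]
    print(firing_alerts(RULES, samples))  # ['high_cpu', 'disk_filling', 'restart_loop']
```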

It deploys as a Docker Compose service or a systemd unit. It currently uses under 50 MB of RAM on my own servers, stores data in SQLite with 7-day retention, and reloads its config on SIGHUP.
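A 7-day SQLite retention policy usually boils down to a periodic `DELETE` of rows older than the cutoff. A minimal sketch using Python's built-in `sqlite3`, assuming a hypothetical `metrics(ts, name, value)` table with Unix-epoch timestamps (this is not tori's actual schema):

```python
import sqlite3
import time
from typing import Optional

RETENTION_SECONDS = 7 * 24 * 3600  # 7 days


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS metrics (ts INTEGER NOT NULL, name TEXT NOT NULL, value REAL)")
    # An index on ts keeps the retention DELETE from scanning the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_metrics_ts ON metrics (ts)")
    return conn


def prune_old(conn: sqlite3.Connection, now: Optional[int] = None) -> int:
    """Delete samples older than the retention window and return how many rows were removed."""
    now = int(time.time()) if now is None else now
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM metrics WHERE ts < ?", (now - RETENTION_SECONDS,))
    return cur.rowcount


if __name__ == "__main__":
    db = open_db()
    now = 1_700_000_000
    db.executemany("INSERT INTO metrics VALUES (?, ?, ?)", [
        (now - 8 * 24 * 3600, "cpu_percent", 40.0),  # 8 days old: pruned
        (now - 1 * 24 * 3600, "cpu_percent", 55.0),  # 1 day old: kept
    ])
    print(prune_old(db, now))  # 1
```

Running the prune on a timer (or after each write batch) keeps the database bounded without needing a separate cleanup service.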

There's a GIF of the TUI in the repo if you want to see it in action. It's MIT licensed. I really just built it to solve my own problem, so feel free to check it out, but expect bugs if you do :)

https://github.com/thobiasn/tori-cli

https://redd.it/1r5mp8g
@r_devops
Can the CKA replace real k8s experience in job hunting?

Senior DevOps engineer here, at a biotech company. My team focuses more on the left side of the SDLC: helping developers create and improve build pipelines, integrating cloud resources like S3 and EC2 into that process, and creating self-service jobs on Jenkins/GitHub Actions.

TL;DR: I need to find another job. However, most DevOps jobs I've seen require k8s at scale, with a focus on reliability/observability. I have worked with Kubernetes lightly (inspecting pod failures, etc.), but nothing that would prepare me to deploy and maintain a Kubernetes cluster. Because of this, I'm working toward the CKA to address those gaps.

To hiring managers out there: would you hire someone with the CKA, or accept it as a replacement for X years of real Kubernetes experience?

For those of you who obtained the CKA for this reason, did it help you in your job search?

https://redd.it/1r5nh8z
@r_devops
How I Built a Production-Grade Kubernetes Homelab on 2 Recycled PCs (Proxmox + Talos Linux, ~€150)

I wrote a detailed walkthrough on building a production-grade Kubernetes homelab using 2 recycled desktop PCs (~€150 total). The stack covers Proxmox for virtualization, Talos Linux as an immutable K8s OS, ArgoCD for GitOps, and Traefik + Cloudflare Tunnel for external access.

Key topics: Infrastructure as Code with Terraform, GlusterFS for replicated storage, External Secrets Operator with Bitwarden, and a full monitoring stack (Prometheus + Grafana + Loki).

Full article: https://medium.com/@sylvain.fano/how-i-built-a-production-grade-kubernetes-homelab-in-2-weekends-with-claude-code-b92bca5091d3

Happy to discuss architecture decisions or answer any questions!

https://redd.it/1r5m7ir
@r_devops
Weekly/temp DevOps ENTRY LEVEL - internship / fresher & changing careers

This is a weekly thread to ask questions about getting into DevOps.

Are you a student, or do you want to start a career in DevOps but don't know how? Ask here.

Changing careers but don't have the basic prerequisites? Ask here.

Before asking:

- Search to see whether your question has already been asked and answered.
- Try these resources:
  - [https://roadmap.sh/devops](https://roadmap.sh/devops)
  - (please suggest more)

_____________

Individual posts of this type may be removed and redirected here.

Please remember to follow the rules and remain civil and professional.

This is a trial weekly thread.

https://redd.it/1r659ga
@r_devops
I've run Docker Swarm in production for 10 years. $166/year. 24 containers. Two continents. Zero crashes. Here's why I never migrated to Kubernetes.

Every week on Reddit someone asks about Docker Swarm and the responses are always the same: "Swarm is dead." "Just use K8s." "Nobody runs Swarm in production."

I've run Swarm in production for a decade. Not a toy setup — multi-node clusters, manager redundancy, 4-6 replicas per service, rolling deployments in batches of two with automatic rollback on healthcheck failure. Zero customer downtime. Over the years I optimized the architecture down to 24 containers across two continents on $166/year total infrastructure.

I finally wrote the article I wish existed when I made my choice ten years ago. 7,400 words. Real production numbers. Working code. No affiliate links. No "it depends" cop-out.

**What's in it:**

* Side-by-side YAML comparison: 27 lines (Compose) → 42 lines (Swarm) → 170+ lines (K8s) for the same app
* Healthcheck comparison table testing 6 failure scenarios — K8s wins 2 out of 6
* A working 150-line autoscaler that's actually smarter than K8s HPA (adaptive polling vs fixed 15s intervals)
* Cost breakdown: $166/year vs $1,584-2,304/year minimum for EKS
* CAST AI 2024 data: 87% idle CPU, 68% of pods overprovisioned 3-8x, $50-500K annual waste per cluster
* Why your Node.js containers are 7x bigger than they need to be and how that drives false demand for autoscaling
* Why you should never expose Node.js directly to the internet (and what to do instead)
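For context on the HPA comparison: the Kubernetes docs give the HPA's core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skip scaling when that ratio is within a tolerance of 1.0 (0.1 by default), and use a 15-second default sync period. The sketch below implements that rule alongside one possible adaptive-polling scheme. `next_poll_interval`, its 5 s / 60 s bounds, and the replica limits are hypothetical illustrations, not the article's autoscaler. The real controller also applies a scale-down stabilization window (300 s by default), which this sketch omits.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                         tolerance: float = 0.1, min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA's documented core rule: ceil(current * current_metric / target), with a no-op band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: don't scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def next_poll_interval(current_metric: float, target_metric: float,
                       fast: float = 5.0, slow: float = 60.0) -> float:
    """Hypothetical adaptive polling: poll faster as load approaches or exceeds the target.

    At or above the target we poll every `fast` seconds, at zero load every `slow`
    seconds, and in between we interpolate linearly on utilisation.
    """
    utilisation = max(0.0, min(1.0, current_metric / target_metric))
    return slow - (slow - fast) * utilisation


if __name__ == "__main__":
    print(hpa_desired_replicas(4, 90, 60))  # 6 (ratio 1.5)
    print(hpa_desired_replicas(4, 63, 60))  # 4 (ratio 1.05 is within tolerance)
    print(next_poll_interval(0, 60), next_poll_interval(60, 60))  # 60.0 5.0
```

Whether adaptive polling is "smarter" depends on the workload: polling faster near the threshold reacts sooner to spikes, at the cost of more metric queries during busy periods.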

The only feature K8s genuinely has that Swarm lacks is autoscaling, and Datadog's own 2023 report shows only ~50% of K8s organizations even use HPA. So half the industry is paying the full complexity tax for a feature they don't use.

Not saying K8s is bad. It's an incredible system for the 1% who need it. But the data shows 99% don't — they're paying 10-100x more for capabilities they never touch while 87% of their CPU does nothing.

[Read Full Web Article Here](https://thedecipherist.com/articles/docker_swarm_vs_kubernetes/?utm_source=reddit&utm_medium=post&utm_campaign=docker-swarm-vs-kubernetes&utm_content=launch-post&utm_term=r-devops)

Happy to answer any questions. I've been running this setup since before K8s hit 1.0.

https://redd.it/1r6krmk
@r_devops
Security Scanning, SSO, and Replication Shouldn't Be Behind a Paywall — So I Built an Open-Source Artifact Registry

Side project I've been working on — but more than anything I'm here to pick your brains.

I felt like there was no truly open-source solution for artifact management. The ones that exist cost a lot of money to unlock all the features. Security scanning? Enterprise tier. SSO? Enterprise tier. Replication? You guessed it. So I built my own.

Artifact Keeper is a self-hosted, MIT-licensed artifact registry. 45+ package formats, built-in security scanning (Trivy + Grype + OpenSCAP), SSO, peer mesh replication, WASM plugins, Artifactory migration tooling — all included. No open-core bait-and-switch.

What I really want from this post:

- Tell me what drives you crazy about Artifactory, Nexus, Harbor, or whatever you're running
- Tell me what you wish existed but doesn't
- If something looks off or missing in Artifact Keeper, open an issue or start a discussion

GitHub Discussions: https://github.com/artifact-keeper/artifact-keeper/discussions

GitHub Issues: https://github.com/artifact-keeper/artifact-keeper/issues

You don't have to submit a PR. You don't even have to try it. Just tell me what sucks about artifact management and I'll go build the fix.

But if you do want to try it:

https://artifactkeeper.com/docs/getting-started/quickstart/

Demo: https://demo.artifactkeeper.com

GitHub: https://github.com/artifact-keeper

https://redd.it/1r6pwxy
@r_devops
Can we stop with the LeetCode for DevOps roles?

I just walked out of an interview where I was asked to reverse a binary tree on a whiteboard. For a Platform Engineering role.

In what world does that help me troubleshoot a 502 error in an Nginx ingress or optimize a Jenkins build that’s taking 40 minutes?

I'd much rather be asked:

1. "How do you handle a dev who refuses to follow the CI/CD flow?"
2. "Walk me through how you’d debug a DNS issue in a multi-region cluster."
3. "Explain the trade-offs of using a Service Mesh."

Is anyone else still seeing heavy LeetCode, or are companies finally moving toward practical, scenario-based testing?

https://redd.it/1ra4poz
@r_devops
Recently Accepted Jr Devops Role!!

I recently accepted a junior DevOps role where I'll allegedly be using a lot of Terraform and Ansible. Since I'm still waiting on the official start date, I figured I'd start learning them early so the ramp-up is quicker, and man...


I did the Terraform hello world yesterday, spinning up a Docker container, and that was fun enough, so when I woke up today I set myself a goal: provision and configure a vanilla Minecraft server before I go to sleep. Ten hours later, here I am writing this post with a vanilla server chugging away on my t3.small as I run across the world, amazed at how much I got done today. Boys, I fear my journey has just begun, and I am excited for what is ahead of me!

https://redd.it/1ragqui
@r_devops
Our "self-service platform" is just a Jira board with extra steps

We spent six months building an "internal developer platform," and I just realized it's basically a form that creates a Jira ticket, which then gets manually processed by the same three people as before. The only difference is that now there's a React frontend on top of it.

Has anyone here actually built a platform that genuinely reduced toil and that developers use voluntarily? What did you get right that we clearly didn't?

https://redd.it/1radws1
@r_devops
REST API development in a microservices world: where does governance even fit, and who owns it?

Sixty services, and the API layer looks like a yard sale: different auth patterns, versioning nobody agreed on, and rate limiting that exists on maybe half of them and is configured differently on each one that has it.

Platform team (three people including me) keeps getting pulled into incidents that should belong to service teams but don't because there's no standard anyone actually follows. And every time I raise this in an architecture review I get "it depends" answers that don't help me figure out what to actually do next week.

Gateway enforcement or CI/CD enforcement? Who owns the standard, the platform team or the service teams? How do you make teams follow it without becoming the bottleneck for every API deployment?
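One way to do CI/CD-side enforcement without the platform team reviewing every deployment is a lint step that each service's pipeline runs against its own OpenAPI spec. Below is a minimal Python sketch. The three rules (a declared security scheme, a `/v<N>/` path prefix, and a hypothetical `x-rate-limit` extension on every operation) are example policies, not a standard, and the spec is assumed to be already parsed into a dict.

```python
import re
from typing import Dict, List

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}
VERSIONED_PATH = re.compile(r"^/v\d+/")


def lint_openapi(spec: Dict) -> List[str]:
    """Return human-readable policy violations for a parsed OpenAPI 3 document."""
    problems: List[str] = []
    schemes = spec.get("components", {}).get("securitySchemes", {})
    if not schemes:
        problems.append("no components.securitySchemes defined")
    global_security = bool(spec.get("security"))

    for path, item in spec.get("paths", {}).items():
        if not VERSIONED_PATH.match(path):
            problems.append(f"{path}: path is not versioned (expected /v<N>/...)")
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters' or 'summary'
            where = f"{method.upper()} {path}"
            if not (global_security or op.get("security")):
                problems.append(f"{where}: no security requirement")
            if "x-rate-limit" not in op:  # hypothetical house extension
                problems.append(f"{where}: missing x-rate-limit")
    return problems


if __name__ == "__main__":
    spec = {
        "openapi": "3.0.3",
        "components": {"securitySchemes": {"bearer": {"type": "http", "scheme": "bearer"}}},
        "security": [{"bearer": []}],
        "paths": {
            "/v1/reports": {"get": {"x-rate-limit": {"rpm": 600}}},
            "/reports/{id}": {"delete": {}},
        },
    }
    for p in lint_openapi(spec):
        print(p)
    # In a pipeline, a non-empty result would fail the step (e.g. sys.exit(1)).
```

Because the check lives in each service's own pipeline, the platform team owns the rules while service teams own fixing their specs, which is one answer to the ownership question; a gateway can still enforce the runtime pieces, such as rate limits.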

https://redd.it/1ralzfj
@r_devops
Looking to work for free on real devops projects to gain experience

Hi everyone,

I'm learning DevOps and looking to work under an experienced DevOps freelancer to understand real-world projects and workflows.

I'm comfortable with:

- AWS basics (EC2, VPC, IAM, ALB)
- Linux & networking fundamentals
- CI/CD basics
- Hands-on practice with deployments and troubleshooting

I'm not asking for payment. I'm happy to assist with tasks like documentation, monitoring, testing, basic deployments, or shadowing: anything that helps reduce your workload while I learn.

If you're a freelancer who could use an extra pair of hands (or know someone who might), I'd really appreciate connecting via DMs.

Thanks for reading!

https://redd.it/1r9zhgx
@r_devops
Infra-aware tool

Hi. I was recently hired at a big product company and noticed how difficult the onboarding process is: outdated Confluence pages, unclear inventory. Nobody can say for sure how many clusters we have (except maybe the CTO), VMs are spread across OCI, AWS, and Azure, and there are hundreds of build configurations in TeamCity for various purposes.

So for me, as a new DevOps engineer, getting a handle on this infra takes months, and I am still finding things I was never aware of.

The question is: if there were an infra-aware ChatGPT that you could ask things like "how many Windows ARM64 VMs do we have?" or "which k8s clusters are below version 1.30?", would it make sense for your team? Would it reduce your operational overhead the way it would for me?
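Underneath the chat interface, a question like "which k8s clusters are below 1.30?" reduces to a filter over a unified inventory, and building that inventory is the hard part such a tool would have to solve first. A minimal Python sketch over a hypothetical, already-collected inventory (the record fields and cluster names are illustrative only):

```python
import re
from typing import Dict, List, Tuple

# Hypothetical inventory rows, as a collector might normalise them from OCI, AWS and Azure.
CLUSTERS: List[Dict[str, str]] = [
    {"name": "payments-prod", "cloud": "aws", "version": "v1.29.6"},
    {"name": "search-stg", "cloud": "azure", "version": "1.30.2"},
    {"name": "legacy-eu", "cloud": "oci", "version": "v1.28.10"},
]


def parse_minor(version: str) -> Tuple[int, int]:
    """Turn 'v1.29.6' or '1.30.2' into (1, 29) / (1, 30) so versions compare numerically."""
    m = re.match(r"^v?(\d+)\.(\d+)", version)
    if not m:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(m.group(1)), int(m.group(2))


def clusters_below(clusters: List[Dict[str, str]], floor: str) -> List[str]:
    """Names of clusters whose major.minor version is strictly below `floor`."""
    limit = parse_minor(floor)
    return sorted(c["name"] for c in clusters if parse_minor(c["version"]) < limit)


if __name__ == "__main__":
    print(clusters_below(CLUSTERS, "1.30"))  # ['legacy-eu', 'payments-prod']
```

Comparing parsed integer tuples rather than raw strings avoids the classic trap where "1.9" sorts after "1.30" lexically.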

https://redd.it/1ram2sv
@r_devops
How likely is it that Reddit itself keeps subs alive by leveraging LLMs?

Is Reddit becoming Moltbook? It feels like half of the posts and comments are written by agents: the same syntax, the same structure, zero mistakes, written as if for a robot.

WTF is happening? It's not only this sub but a lot of them. Dead internet theory seems more and more real...

https://redd.it/1ralvnj
@r_devops
jq 101 – Practical guide to parsing JSON from the CLI

If you spend your days in the AWS CLI, Azure CLI, Kubernetes, or Terraform, you already know: you’re swimming in JSON. Most folks just pipe everything to grep, scroll through endless output, or hack together a Python script for a problem jq solves in seconds.

So, I put together a straight-to-the-point technical guide. It covers the core jq moves: things like .key, .array[], select(), length, and sort_by. I walk through real examples with a public API, and I tie those examples directly to what you see in AWS and Azure CLI outputs. The patterns I show? They handle about 90% of what you actually deal with in the cloud.
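For readers new to jq, here is what those core filters do, spelled out in Python against a trimmed payload in the shape of `aws ec2 describe-instances` output (the instance data is made up). Each docstring gives the jq filter the function mirrors; this illustrates jq's semantics and is not a replacement for it.

```python
import json
from typing import Any, Dict, List

# Trimmed, made-up output in the shape of `aws ec2 describe-instances`.
RAW = """
{"Reservations": [
  {"Instances": [
    {"InstanceId": "i-0b", "State": {"Name": "running"}, "LaunchTime": "2024-03-02T10:00:00+00:00"},
    {"InstanceId": "i-0a", "State": {"Name": "stopped"}, "LaunchTime": "2024-01-15T08:30:00+00:00"}
  ]},
  {"Instances": [
    {"InstanceId": "i-0c", "State": {"Name": "running"}, "LaunchTime": "2023-11-20T12:00:00+00:00"}
  ]}
]}
"""


def all_instances(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """jq: .Reservations[].Instances[]  (iterate both arrays, flattening the result)"""
    return [inst for res in doc["Reservations"] for inst in res["Instances"]]


def running_ids(doc: Dict[str, Any]) -> List[str]:
    """jq: [.Reservations[].Instances[] | select(.State.Name == "running") | .InstanceId]"""
    return [i["InstanceId"] for i in all_instances(doc) if i["State"]["Name"] == "running"]


def ids_oldest_first(doc: Dict[str, Any]) -> List[str]:
    """jq: [.Reservations[].Instances[]] | sort_by(.LaunchTime) | map(.InstanceId)"""
    # ISO-8601 timestamps with the same UTC offset sort correctly as plain strings, as in jq.
    return [i["InstanceId"] for i in sorted(all_instances(doc), key=lambda i: i["LaunchTime"])]


if __name__ == "__main__":
    doc = json.loads(RAW)
    print(len(all_instances(doc)))  # jq: [.Reservations[].Instances[]] | length  -> 3
    print(running_ids(doc))         # ['i-0b', 'i-0c']
    print(ids_oldest_first(doc))    # ['i-0c', 'i-0a', 'i-0b']
```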

No stories, no fluff. Just clear, practical jq tricks built for DevOps and SRE work. If you’re in the CLI all the time but JSON filtering still feels awkward, this guide clears things up.

Link:

https://medium.com/@odinumbelino/jq-101-how-to-parse-json-like-a-pro-a883ca08b3f9

Feedback welcome.

https://redd.it/1raeo3r
@r_devops
I'm being asked to provide input

I was recently asked which platform I would pick for our new self-service pipeline. There are only two options: ECS or EKS/AKS. We have a presence on both providers. I know little about either, so I can't decide. My boss seems to be leaning toward k8s since his team has used it before, but he is still asking me which technology I would use. He also mentioned Argo CD; I saw it in action at a CNCF conference and was quite impressed with the demo. How would you decide?

Oh, and he is aware that building the new self-service tooling can take several months, and he's OK with that.

https://redd.it/1radje6
@r_devops
Is it possible to use your IDE on your phone??

Hey devs, is there any way I can use my IDE directly on my phone, so that what I have on my laptop stays in sync with my phone too?

Is this possible?

https://redd.it/1rat13s
@r_devops