Reddit DevOps – Telegram
Kubernetes homelab

Hello guys
I’ve just finished my internship in the DevOps/cloud field, working with GKE, Terraform, Terragrunt and many more tools. I’m now curious to deepen my foundation: do you recommend investing money to build a homelab setup? Is it worth it?
And if yes how much do you think it can cost?

https://redd.it/1oi7lab
@r_devops
playwright vs selenium alternatives: spent 6 months with flaky tests before finding something stable

Our pipeline has maybe 80 end to end tests and probably 15 of them are flaky. They'll pass locally every time, pass in CI most of the time, but fail randomly maybe 1 in 10 runs. Usually timing issues or something with how the test environment loads.

The problem is now nobody trusts the CI results. If the build fails, first instinct is to just rerun it instead of actually investigating. I've tried increasing wait times, adding retry logic, all the standard stuff. It helps but doesn't solve it.
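For the timing issues described above, one common alternative to longer fixed waits is to poll for the condition the test actually depends on, up to a deadline. This is a minimal sketch only: `wait_until` is a hypothetical helper, not part of Playwright or any other library, and the timeout and interval values are illustrative assumptions.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration: returns True as soon as the
    condition holds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a stand-in for "the test environment has finished loading".
started = time.monotonic()
ready = wait_until(lambda: time.monotonic() - started > 0.2, timeout=2.0)
```

The test proceeds as soon as the condition holds rather than after a fixed sleep, so a slow CI run only costs extra time when it actually needs it.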

I know the real answer is probably to rewrite the tests to be more resilient but nobody has time for that. We're a small team and rewriting tests doesn't ship features.

Wondering if anyone's found tools that just handle this better out of the box. We use playwright currently. I tested spur a bit and it seemed more stable but haven't fully migrated anything yet. Would rather not spend three months rewriting our entire test suite if there's a better approach.

What's actually worked for other teams dealing with this?

https://redd.it/1oi8z4m
@r_devops
Practicing interviews taught me more about my job than any cert

I didn't expect mock interviews to change how I handle emergencies. I've done AWS certifications, Jenkins pipelines, and Prometheus dashboards. All useful, sure. But none of them taught me how to work in the real world.

While prepping for a role switch, I started running scenario drills from the iqb interview question bank and recording myself with my beyz coding assistant. GPT would also randomly throw out mock interview prompts like "Pipeline rollback error" or "Alarms surge at 2 a.m."

Replaying my own answers, I realized my thinking was scattered. There was a huge gap between what I thought in my head and what I actually said. I'd jump straight to a Terraform or Kubernetes fix, skipping the rollback logic and even forgetting who was responsible for what. I began to wonder if I was easily disrupted by the backlog of tasks at work, too.

Many weeks passed in this chaotic state... with no clear idea of what I'd actually done, whether I'd made any progress, or whether I'd documented anything. So, when faced with many interview questions, I couldn't use STAR or other methods to describe the challenges I encountered and the final results of my projects.

So now, I've started taking notes again... I write down my thoughts before I start. Then I list to-do items. For example, I check Grafana trends, connect with PagerDuty, and review recent merges in GitHub, and then take action. This helps me slow down and avoid making stupid mistakes that waste time re-analyzing bugs.

https://redd.it/1oiabuh
@r_devops
How do you all feel about Wiz?

Curious who’s used the DSO tool/platform Wiz, what your experiences were, and your opinions on it… is it widely used in the industry and I’ve just somehow managed to not be exposed to it to this point?

I’m being asked to review our org’s proposal to use it as part of a DSO implementation plan that I just found out exists, and I’m slightly annoyed that there are a bunch of vendor products on it I’ve not been exposed to, which is really saying something tbh haha.

https://redd.it/1oie7ji
@r_devops
Amazon layoffs, any infra engineers impacted?

Today, Amazon announced 30k layoffs. Most posts I’ve seen on LinkedIn were from HR/Recruiting. Curious to know if they laid off any DevOps/SRE, as that would imply a lot of Amazon engineers would be coming onto the market. Anyone hear anything?

https://redd.it/1oigvwn
@r_devops
Intel SGX alternative migration - moved to Intel TDX and AMD SEV with better results

Built our entire privacy stack around Intel SGX. Then Intel announced they're discontinuing the attestation service in 2025.

Spent two months in panic mode migrating everything. Painful process but honestly ended up in a better place than before.

New setup uses Intel TDX and AMD SEV with a universal API layer so we're not locked into one vendor anymore. Performance is actually better than SGX was and we have proper redundancy now. If one TEE vendor has issues we can failover to another.
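The failover idea in the paragraph above can be sketched roughly as follows. This is an assumption-heavy illustration, not the poster's actual API layer: the backend functions, the `ProviderUnavailable` exception, and `run_with_failover` are all hypothetical names, and the backends are stand-ins that make no real TDX or SEV calls.

```python
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Raised by a backend that cannot serve the request right now."""


# Each backend is a (name, callable) pair; the callable takes a payload
# and returns a result. Names and behaviour are placeholders.
Backend = tuple[str, Callable[[bytes], bytes]]


def run_with_failover(backends: Sequence[Backend],
                      payload: bytes) -> tuple[str, bytes]:
    """Try each backend in order; return (name, result) from the first that works."""
    errors: list[str] = []
    for name, call in backends:
        try:
            return name, call(payload)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def tdx_backend(payload: bytes) -> bytes:
    # Stand-in that simulates an outage on the first vendor.
    raise ProviderUnavailable("simulated outage")


def sev_backend(payload: bytes) -> bytes:
    # Stand-in that "processes" the payload.
    return payload.upper()


used, result = run_with_failover(
    [("tdx", tdx_backend), ("sev", sev_backend)], b"hello"
)
```

Callers only see the common interface, so the order of backends (or the set of vendors) can change without touching application code.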

If you're still on SGX, start planning your migration now. The deadline is closer than you think and these projects always take longer than estimated.

https://redd.it/1oiaznw
@r_devops
Looking to learn more about authentication

Hey there,

For some background, I started as a dev 10+ years ago, always did some infra on the side, and switched to mainly infra ~6 years ago.

My specialty is kubernetes, including metal clusters and a lot of observability on the Grafana stack at interesting scale (a few dozen TB of logs a day).

Thing is, I'm behind on authentication / authorization subjects, as it was often already in place or managed by someone else.

I'm currently trying to redo the auth system for a personal project, and taking a lot of time to learn about all the ways to solve my issues (centralizing auth / perms, authenticating Apis via gateway, trying to follow zero trust more closely with maybe some mesh).

I'd be happy to share the knowledge I have, and receive some in return in subjects I'm weaker at.

If anyone is interested in a conversation, hit me up!

Cheers

https://redd.it/1oiemho
@r_devops
Would you let devs do this?

In our organization, we have a team that is responsible for 'devops'. They connect the security, dev, and infra teams' needs to deliver a product. The development team has recently decided that they want to be completely in charge of building their artifacts on their own systems (local laptops, etc.) and that the folks with devops responsibilities only need to take their artifacts and run them. We've expressed our concerns about this process to management, but it appears to be a losing battle of attrition with them. The current pipeline has many security processes built in that can notify devs of issues early and allow them to fix them before even getting to a test or deployment stage. Am I crazy for thinking we shouldn't shift those processes to deployment time, and that we should keep the roles/responsibilities separated as they are? What do you all think, and what do you do in your orgs?

This isn't a time issue: the current pipeline, with the security features in place, takes less than 4 minutes to run.

https://redd.it/1oisa32
@r_devops
rolling back to bare metal kubernetes on top of Linux?

Since Broadcom is raising our license cost 300% (after negotiation and discount) we're looking for options to reduce our license footprint.

Our existing k8s is just running on Linux VMs in our vSphere with Rancher. We have some workloads in Tanzu but nothing critical.

Have I just been out of the game of running OSes on bare-metal servers, or is there a good reason why we don't just convert a chunk of our ESX servers to Debian and run Kubernetes on there? It'll make a couple hundred thousand dollars' difference annually...

https://redd.it/1oit22p
@r_devops
Suggestion

Honestly, Linode’s fine, but it feels kinda outdated. The support’s okay, but the UI and performance can be inconsistent. I know there’s GCP, Azure, and AWS out there. Which one’s the best to learn that’s modern, flexible, and still affordable?

https://redd.it/1oiv2d5
@r_devops
Suggestion about learning active directory

Hello all,
I am learning DevOps from scratch on YouTube. I started with AWS and recently learned IAM; the next topic is Active Directory setup. The use case the YouTuber gave was that with many users (for example, 2,000), it becomes difficult to set up each user, set up IAM roles, handle role switching, and so on. I can follow what he is doing and how he is doing it, but I find it hard to relate to because I do not have a networking background. Should I learn this topic? Is it important for learning DevOps? Please share your inputs.

https://redd.it/1oiw1rp
@r_devops
Self-hosted alternatives to Jira that don't require a PhD to set up?

We want to move away from Atlassian but every self-hosted alternative seems to require days of configuration or is missing critical features. What are people actually using that works out of the box?

https://redd.it/1oijtow
@r_devops
kafka complexity was killing our team's productivity so we switched to something with zero dependencies


Look, I love Kafka. I really do. But running it for the past 18 months has been exhausting. We're a team of 4 backend engineers. That's it. And somehow we were spending like 30% of our time just keeping Kafka alive. Not building features. Not fixing bugs. Just babysitting infrastructure. Rolling updates felt like defusing a bomb. Debugging cluster issues meant we'd all gather around someone's monitor staring at JMX metrics and heap dumps like we were reading tea leaves. When something broke (and it would break), you could kiss the next 6 hours goodbye.

I remember this one time we had a broker go down at 2am. Spent until 8am trying to get the cluster stable again. My wife was not thrilled, and the kicker? The actual issue was some obscure zookeeper connection timeout that only showed up under specific load conditions. Cool, great, love that for us.

An option was hiring a dedicated Kafka admin but we absolutely couldn't afford it, so I started digging into alternatives. Tried rabbitmq first because everyone talks about it, solid tool but it didn't really solve the distributed streaming thing we needed. Then I looked at pulsar and yeah no, that's even more complex with bookkeeper on top of zookeeper. Hard pass.

My thinking shifted. I stopped asking "what's exactly like Kafka but simpler" and started asking "what could solve our problems?" We needed reliable messaging, streaming, replay capability. Did we need Kafka's specific partition model? Honestly... probably not. We were using it because it's what you're "supposed" to use, you know?
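To make "replay capability" concrete: the core idea is an append-only log where a consumer can re-read messages from a chosen sequence number. The toy below is an in-memory sketch under that assumption only; the `Stream` class and its methods are hypothetical and are not the Kafka or NATS JetStream API.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Toy in-memory append-only log; not the NATS JetStream or Kafka API."""
    messages: list[bytes] = field(default_factory=list)

    def publish(self, data: bytes) -> int:
        """Append a message and return its 1-based sequence number."""
        self.messages.append(data)
        return len(self.messages)

    def replay(self, start_seq: int = 1) -> list[tuple[int, bytes]]:
        """Return (sequence, message) pairs from start_seq onward."""
        return [
            (seq, msg)
            for seq, msg in enumerate(self.messages, start=1)
            if seq >= start_seq
        ]


stream = Stream()
for event in (b"created", b"paid", b"shipped"):
    stream.publish(event)

# A consumer that last processed sequence 1 resumes from sequence 2.
missed = stream.replay(start_seq=2)
```

A real system adds persistence, acknowledgements, and retention limits on top of this, which is where most of the operational complexity lives.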

I found nats with jetstream and I'll be honest, I was unsure about it. It looked too simple. Single binary, no zookeeper, no massive config files. My brain was like "this won’t work." But I spun it up anyway. Clustering just worked out of the box with no drama. We've been running it for 3 months now in production so we don't even manage the infrastructure, which is nice because remember, we're 4 people, processing about 500k messages per day. Haven't had a single issue that took more than 10 minutes to resolve.

My team is shipping features again, actual features. Not spending our days in kafka land. It's wild how much mental energy we got back. I'm not saying kafka is bad. If you're at massive scale and you've got the team to manage it, it's probably perfect for you. But for smaller teams? Sometimes you're using enterprise tools when you don't have enterprise problems, and that's okay.

https://redd.it/1oixphz
@r_devops
AI was implemented as a trial in my company, and it’s scary.

I know that almost everyday someone comes up and says AI will take my job and I’m scared but I promise to keep this short and maybe different.

I am currently a junior DevOps engineer, so I don’t have huge experience or knowledge, but I was told that the team is trying to bring Claude Code into VS Code for the dev team, plus MCPs for provisioning and then later for monitoring generally and taking action when something fails.

The trial went so well that it scared me a little: Claude Code planned and worked across hundreds of files, found what it needed to do, and did it on the first try (it is now fully implemented).

With the MCP, it was like a junior devops/SRE, and after that trial, the company stopped the hiring cycle and the team is kept at only 4 instead of expanding to 6 as planned, and honestly from what I saw, I even think they might view it as “4 too many”.

This is all happening 3 years after ChatGPT released, 3 years and people are already getting scared shitless. I thought AI was a good boost, but I don’t think management would see it as a boost, but a junior replacement and maybe later a full replacement.

https://redd.it/1oiytfa
@r_devops
[Advice] Best way to build and distribute an internal CLI tool (PHP vs Node)?

Hey !

I’m currently working on an internal CLI tool for my team — mainly to automate recurring tasks like syncing databases, uploading assets, or triggering deployments.

I’m hesitating between several approaches, and I’d love to get some feedback, especially about distribution and updates.

**Context:**

* Team setup: mixed macOS (only me) / Windows
* Goal: make it easy for anyone to just run commands like "*toto sync*" or "*toto deploy*" through SSH, without a complex setup.

**Options I’m considering:**

1. Node.js CLI (using clack js) → Publish on Private GitLab’s npm Registry, install globally
2. PHP CLI (using Laravel Prompts or Laravel Zero) → Distributed either as a Composer global package or a single PHAR binary via GitLab Releases.

**My questions:**

* Which approach feels the most maintainable and easiest to distribute internally?
* Is the PHAR format still considered a good modern option?
* How do you handle auto-updates for your internal CLI tools?
* Do you prefer Composer global, npm -g, or shipping a standalone binary?

Looking for the cleanest, cross-platform way to build a team CLI that’s:

* easy to install,
* modern (interactive prompts, spinners, multiselect, etc.),
* and able to run SSH/rsync commands to deploy WordPress projects.

Would love to hear how you do it internally or see examples if you’ve built something similar!

https://redd.it/1oiz484
@r_devops
How does your team promote your products? Which channel?

Hi all, I’m curious about how web developers and their teams promote their own products or tools.

Do you mainly use email marketing to reach your audience or do you rely more on social media, blogs, or other channels?



https://redd.it/1oj0lfc
@r_devops
Help! My side project is burning cash on Google Cloud SQL 😅need a free database host

I’ve deployed my machine learning web app on Google Cloud, but I’ve started incurring charges. I’m now looking for a free alternative for hosting.

The app consists of:

* A frontend hosted on Vercel
* Two APIs (one for data processing and another for connecting to the ML .pkl model)
* A MySQL database that stores all the data used by the APIs

From what I understand, the costs are coming from the MySQL database hosted on Cloud SQL. It’s already cost me around $3 in just a week, which is not sustainable since the app doesn’t generate any income.
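As a rough projection of the figure above, and assuming usage stays constant at the observed rate (an assumption, since Cloud SQL billing depends on instance size and uptime), the weekly cost extrapolates as follows:

```python
WEEKLY_COST_USD = 3.0  # figure reported in the post
WEEKS_PER_YEAR = 52

# Linear extrapolation; assumes the same usage every week.
yearly = WEEKLY_COST_USD * WEEKS_PER_YEAR
monthly = yearly / 12

print(f"~${monthly:.2f}/month, ~${yearly:.2f}/year")
# prints "~$13.00/month, ~$156.00/year"
```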

I’m looking for a free MySQL hosting option (or something similar) that can work with my current setup. I’ve tried alternatives like CockroachDB and Firebase, but I found them a bit confusing. Before committing to another platform, I wanted to ask for recommendations.

Thanks in advance!

https://redd.it/1oj1lyp
@r_devops
We’re building a small fintech app – AWS vs Azure? Need advice on structure, security, and cost

Hey everyone,

I’m part of a small team building a mobile app (iOS & Android) for home financing. The app’s purpose is to let users create a profile, go through a credit evaluation via a third-party integration, and eventually manage parts of their financing process in a secure and compliant way.

We’re at the stage where we need to decide on the overall backend and authentication setup, and I’d really appreciate some insight from people who’ve been there before.

Here’s what we care about:

- Keeping costs low, especially early on (MVP phase).

- Minimizing our data responsibility – ideally, we don’t want to directly handle sensitive personal data due to GDPR.

- Maintaining a secure and scalable architecture.

- Using something our team (mostly .NET/C# devs) can work with comfortably.


We’ve been comparing three main approaches:

1. AWS (Cognito + API Gateway + Lambda + DynamoDB)

- Super low cost for early usage (Cognito free up to ~10k MAU, Lambda pay-per-use).

- Easy to scale, and no server maintenance.

- .NET 8 works great with Lambda now.

- Slightly less integrated if we ever need to connect with Microsoft services later.

2. Azure (Entra ID B2C + Azure Functions + CosmosDB)

- Strong enterprise-level security and compliance.

- Better if we end up needing Office 365 / Power BI / MS ecosystem integration.

- B2C is free up to 50k users, but setup and maintenance seem more complex.

- Costs and admin overhead might ramp up faster.

At this point, I’m leaning toward AWS because it seems cheaper, easier to maintain, and gives us a clean, serverless architecture with minimal ops.

But I’d love to hear your experiences:

- Have you built similar apps (fintech, identity-heavy, serverless)?

- How have you handled user authentication and third-party integrations securely?

- Any surprises or gotchas you’ve faced with Cognito, Entra B2C, or Auth0?

- Would you choose differently if you had to start over?


Any advice, lessons learned, or real-world insights would be massively appreciated 🙏

Thanks!

https://redd.it/1oj2vky
@r_devops
Any tool for debugging mobile viewport breakpoints remotely?

Our responsive app works fine on desktop but certain breakpoints on Android Chrome look broken. I can’t tether every phone to inspect it. Is there any way to live-debug mobile browsers remotely?

https://redd.it/1oiws4y
@r_devops
No Kubernetes experience, Am I cooked?

I'm currently in a role in which everything is deployed via AWS ECS Fargate containers. I have been supporting these applications for a little while now. There is not a TON of net-new things to work on and learn. Just browsing roles and job descriptions, I am seeing a ton of companies asking for Kubernetes experience. It seems like 80-90% of the roles want this for a mid-level engineer. Are this many companies actually using Kubernetes, whether it be AWS EKS, Azure AKS, or Google's Kubernetes offering?

I have no experience, and frankly, Kubernetes is overkill for my current work application, so I wouldn't be able to gain on-the-job experience. That said, am I cooked in this job market (outside of the market already being doo-doo in general)? I have come across posts from folks who study for the cert but seem not to have hands-on experience, which is a route I DON'T want to go down; not sure what the thought process is on that lol.

I've thought about doing it in my spare time, but kids and wife take up a good majority of my weekend, and I'm not sure which method of learning Kubernetes is the most effective and recommended by the community.

https://redd.it/1oj5dlq
@r_devops