Reddit DevOps – Telegram
Deployments kept failing in production for the dumbest reason

Spent two months chasing phantom bugs that turned out not to be bugs at all. Our staging environment would work perfectly and all tests were green, but once we deployed to production everything exploded. And if we tried again with the same code, sometimes it'd work and sometimes not. It made zero sense.

Figured out the issue was just services not knowing where to find each other. We had configs spread across different repos that would get updated at different times, so service A deploys on Monday expecting service B to be at one address, but service B already moved on Friday and nobody updated the config. We switched everything to figure out addresses at runtime instead of hardcoding them. We looked at a few options like Consul for service discovery, Kubernetes DNS, or even just etcd for config management. In the end we went with Synadia because it handles service discovery plus the messaging we needed anyway. Now services find each other automatically. Sounds like an obvious solution in hindsight, but we wasted so much time thinking it was code problems.
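
For anyone wondering what "figure out addresses at runtime" can look like, here is a minimal sketch using plain DNS lookups (the Kubernetes DNS option mentioned above), not the Synadia setup the poster ended up with. The function name `resolve_service` and the `lookup` parameter are made up for illustration; only `socket.getaddrinfo` is a real standard-library call.

```python
import socket
from typing import Callable, List, Tuple


def resolve_service(
    name: str,
    port: int,
    lookup: Callable = socket.getaddrinfo,
) -> List[Tuple[str, int]]:
    """Resolve a service name to (ip, port) pairs at call time.

    Nothing is cached or hardcoded, so a service that moved since the
    last deploy is picked up on the next lookup. `lookup` is injectable
    (hypothetical parameter) so the function can be tested without DNS.
    """
    infos = lookup(name, port, type=socket.SOCK_STREAM)
    addresses: List[Tuple[str, int]] = []
    for info in infos:
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr starts with (ip, port) for both IPv4 and IPv6.
        ip, resolved_port = info[4][0], info[4][1]
        if (ip, resolved_port) not in addresses:
            addresses.append((ip, resolved_port))
    return addresses
```

Calling something like `resolve_service("service-b.default.svc.cluster.local", 8080)` right before connecting (the hostname is a placeholder) keeps the address out of any repo config, which is the failure mode described above.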

Feel kind of stupid it took this long to figure out, but at least it's fixed now.

https://redd.it/1qc7kln
@r_devops
Solving Factorio with Terraform

Just released this video not too long ago, and while it's part entertainment, I'd be curious about your impressions of the conclusion. When is Terraform overkill?

https://redd.it/1qcaap6
@r_devops
Devcontainers question

Just a quick question because I came across a YouTube video where the creator was talking about doing everything out of devcontainers, so that if he gets a new PC, he just has to clone a repo and everything he needs is right there. And I got to thinking: rather than installing Azure CLI, PowerShell, Python, Go, etc., why can't these things just be set up in a devcontainer, so when work issues a temp laptop or a new laptop, boom, I am good to go? So I was curious if anyone is doing or has done this. I thought of having just a single devcontainer with all things installed, but I also thought of having different devcontainers with different versions of things, like older versions of PowerShell.


So tell me, have you seen or done anything like this? Thoughts / suggestions?


TY in advance.

https://redd.it/1qcecqt
@r_devops
Open-source Amazon SES email backend (looking for early feedback)

Hi everyone,

I’m building a small open-source email backend on top of Amazon SES, focused only on the essentials.

Initial features:

Domain verification helpers (SPF, DKIM)

Simple API to send emails via SES

Receive emails via SES → webhook

Basic domain & sending status checks

No UI, no hosted service — just a clean, self-hostable backend to remove SES boilerplate and glue code.

Before releasing it publicly, I’d appreciate feedback:

Is this useful for teams already using SES?

Any must-have features I should include in the OSS core?

Similar tools I should look at?

Thanks!

https://redd.it/1qcfss9
@r_devops
Long running browser automation keeps failing, not sure what I’m missing

I’ve been building a few automation scripts for browser-based workflows like signing into apps, navigating dashboards, and pulling structured data. Early tests with Selenium and Puppeteer looked solid, but once I let jobs run for extended periods, things started to fall apart. Sessions expire, tabs lose state, and the browser context becomes unreliable.

Out of curiosity, I also tried Hyperbrowser and noticed it handled longer executions more gracefully. It wasn’t flawless, but it stayed up far longer and avoided the repeated crashes I was seeing elsewhere.

For people running browser automation in production, how do you usually approach stability? Is this mostly about aggressive retries and health checks, or are there architectural choices or runtime settings that make a bigger difference for long lived sessions?
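
To make the "aggressive retries" option concrete, here is a rough sketch of one common pattern: give every attempt a brand-new session instead of reusing a long-lived one, and back off between failures. It is deliberately library-agnostic. `make_session`, `task`, and `run_with_fresh_sessions` are hypothetical names, and the session object only needs an optional `close()` method; wiring in Selenium or Puppeteer is left out.

```python
import time
from typing import Any, Callable


def run_with_fresh_sessions(
    make_session: Callable[[], Any],
    task: Callable[[Any], Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run `task` against a new session per attempt, with exponential backoff."""
    last_error = None
    for attempt in range(max_attempts):
        session = make_session()  # never reuse a possibly stale session
        try:
            return task(session)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
        finally:
            # Always release the session, whether the task succeeded or not.
            close = getattr(session, "close", None)
            if callable(close):
                close()
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

The design choice here is to treat sessions as disposable: expired logins and lost tab state stop mattering if no single browser context has to survive for hours.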



https://redd.it/1qchlhr
@r_devops
What’s the most painful, time-wasting part of your workflow right now?

Hey everyone — We’re part of a small team building workflow / automation tools, and we’re trying to understand real pain points people actually run into day to day.

If you could remove one frustrating or repetitive part of your current workflow, what would it be?

Would really love to hear about things like:

• What task feels the most painful or repetitive

• How often it happens (daily / weekly / per project)

• What you’re using today to deal with it (manual steps, scripts, spreadsheets, tools, etc.)

• Why existing tools or automations don’t quite solve it

We’re not here to pitch anything — just collecting honest problems to learn where tools break down and where people still rely on workarounds.

If you’d rather not comment publicly, DMs are totally fine too.

Thanks in advance — really appreciate any insight 🙏

https://redd.it/1qcikrj
@r_devops
What do you use for real time device monitoring and alert system?

I currently have a small but expanding infrastructure and need to continuously monitor the performance of specific devices on the network. I am looking for a system that allows me to define customized threshold values based on metrics like CPU, RAM, and traffic, and receive alerts accordingly.
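
To pin down what "customized threshold values" means in practice, here is a tiny sketch of the check itself, independent of any monitoring product. The function name `check_thresholds` and the metric names are invented for illustration; a real tool would also handle collection, alert routing, and deduplication.

```python
from typing import Dict, List


def check_thresholds(
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return one alert message per metric that is above its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not reported are skipped rather than alerted on.
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts
```

For example, a device reporting `{"cpu_percent": 92.0, "ram_percent": 40.0}` against limits of 85.0 and 90.0 would raise a single CPU alert.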

https://redd.it/1qcjejq
@r_devops
Got to a confused phase in career...

I feel like I still lack a broad mindset when it comes to approaching a problem.

I'm not sure where to place myself in the job ranks. I can figure out by myself how to build a proper CI/CD pipeline, provision the whole infra for a project from scratch, etc. My point is that I can implement and create, but I still feel like I'm lacking a broader view. When I approach a task, I feel like I’m just doing it mindlessly without understanding 'the game.' It’s not that I’m bad at system design, but I feel like I am missing something specific to step from 'good' to 'excellent', and it isn't just about technical skills. If you’ve broken through this plateau, what was the turning point that helped you level up?

Apologies for the rant in advance.

https://redd.it/1qcjvwo
@r_devops
What tools are powering reliable browser automation for enterprise needs in 2026?

Scaling browser automation for production workflows has been challenging since many sites lack APIs. We rely on them for tasks like extracting reports, filling forms, refreshing dashboards, capturing dynamic data, and accessing login-secured account views. Local scripts with Puppeteer or Playwright function briefly but fail when websites alter their structure slightly or sessions lapse during extended operations. We evaluated options including browserless, Browserbase, and Hyperbrowser to identify what holds up best in real production scenarios. Self-managed tools offer flexibility yet demand ongoing tweaks and monitoring. Cloud platforms simplify deployment but often struggle with reliability during repeated cron jobs or complex authentication sequences. No solution yet provides seamless 24/7 performance for high-volume enterprise use. I'm curious about production setups. Do you guys manage in-house browser farms or prefer fully managed cloud platforms? How do you approach masking automation from DOM inspection versus direct element manipulation?

https://redd.it/1qckg0t
@r_devops
Is "FinOps" actually a standalone career, or are companies just failing to train DevOps engineers properly?

I've been seeing a massive spike in "FinOps Engineer" roles lately, but looking at the job descriptions, 80% of it just looks like "DevOps with a budget mandate."

In a perfect world, cost optimization is just another non-functional requirement that every senior engineer should own. Creating a separate "FinOps Team" often feels like a band-aid for engineering teams that don't care about efficiency.

However, I see the flip side: At enterprise scale, the bill is so complex that maybe you do need a full-time specialist.

For those of you doing this full-time: Do you feel like a valued specialist, or are you just chasing engineers to tag their resources all day? Is this a viable long-term career path, or will it eventually fold back into general Platform Engineering?

https://redd.it/1qclmv4
@r_devops
Help regarding an architecture

I am currently using New Relic for stats and logs, which is very costly. Now I am trying to use Fluent Bit + OpenTelemetry + Grafana, but I wanted to know whether there are any better alternatives to this approach, or what the bottlenecks in it could be.

I also wanted to know your experience with these tools, if you have used them.

Thanks in advance.



https://redd.it/1qcm8ve
@r_devops
Senior Software Engineer considering a move to Cloud/DevOps – looking for advice

Hi everyone,

I’m a senior software engineer with several years of experience, mainly full-stack JavaScript and Java, with a strong backend focus. Lately, seeing how the market is going, I’ve been feeling a bit uneasy — especially with developer roles getting hundreds of applications within hours.

Given the current situation in IT (and particularly software development), I’m seriously considering pivoting toward Cloud / DevOps.

I already have:
• A solid systems administration foundation
• Hands-on experience with cloud, CI/CD, etc.

What I’m unsure about:
• Is moving to Cloud/DevOps a smart strategic move right now?
• How difficult is the transition from a senior backend role?
• What skills should I double down on first (Kubernetes, Terraform, AWS/GCP certs, Linux internals, etc.)?

Would love to hear from people who:
• Made a similar transition
• Are currently working in Cloud/DevOps

Thanks in advance 🙏

https://redd.it/1qcmacp
@r_devops
What constitutes a submission for CNCF to consider for their portfolio?

Hi there,

I have been in DevOps since 2010 and have kept developing myself with the latest tech.

I had an innovative idea and started building a product for which there is currently nothing similar out there.

I want to submit it to CNCF but really have no insight into the process.

I can google and get the instructions, but I want to hear from people who have submitted their products (either accepted or rejected) and understand how it works 🫡

I'd appreciate it if anyone who has been through this before could share some valuable insights.

Cheers!!

https://redd.it/1qcqg3u
@r_devops
[Update] StatefulSet Backup Operator v0.0.5 - Configurable timeouts and stability improvements

Hey everyone!

Quick update on the StatefulSet Backup Operator - continuing to iterate based on community feedback.

**GitHub:** [https://github.com/federicolepera/statefulset-backup-operator](https://github.com/federicolepera/statefulset-backup-operator)

**What's new in v0.0.5:**

* **Configurable PVC deletion timeout for restores** - New `pvcDeletionTimeoutSeconds` field lets you set a custom timeout for PVC deletion during restore operations (default: 60s). This was a pain point for people using slow storage backends where PVCs take longer to delete.
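
As a rough illustration of what a deletion timeout like this governs, the sketch below polls until a resource is gone or the deadline passes. This is not the operator's actual code (which is not shown in this post). `wait_until_deleted` and the `exists` callback are hypothetical, and a real implementation would query the Kubernetes API for the PVC instead of calling a local function.

```python
import time
from typing import Callable


def wait_until_deleted(
    exists: Callable[[], bool],
    timeout_seconds: float = 60.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `exists()` until it returns False or the timeout elapses.

    Returns True if the resource disappeared in time, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while exists():
        if clock() >= deadline:
            return False  # still present after the allowed time
        sleep(poll_interval)
    return True
```

With slow storage backends the loop simply needs a later deadline, which is what raising the timeout from 60 to 120 seconds amounts to.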

**Recent changes (v0.0.3-v0.0.4):**

* Hook timeout configuration (`timeoutSeconds`)
* Time-based retention with `keepDays`
* Container name selection for hooks (`containerName`)

**Example with new timeout field:**

yaml

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
spec:
  statefulSetRef:
    name: postgresql
  backupName: postgres-backup
  scaleDown: true
  pvcDeletionTimeoutSeconds: 120  # Custom timeout for slow storage (new!)

**Full feature example:**

yaml

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgresql
  schedule: "0 2 * * *"
  retentionPolicy:
    keepDays: 30  # Time-based retention
  preBackupHook:
    containerName: postgres  # Specify container
    timeoutSeconds: 120  # Hook timeout
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]

**What's working well:**

The operator is getting more production-ready with each release. Redis and PostgreSQL are fully tested end-to-end. The timeout configurability was directly requested by people testing on different storage backends (Ceph, Longhorn, etc.) where the default 60s wasn't enough.

**Still on the roadmap:**

* Combined retention policies (`keepLast` + `keepDays` together)
* Helm chart (next priority)
* Webhook validation
* Prometheus metrics

**Following up on OpenShift:**

Still haven't tested on OpenShift personally, but the operator uses standard K8s APIs so theoretically it should work. If anyone has tried it, would love to hear about your experience with SCCs and any gotchas.

As always, feedback and testing on different environments is super helpful. Also happy to discuss feature priorities if anyone has specific use cases!

https://redd.it/1qcstdn
@r_devops
Need to stay focused during 12 hour on-call without ruining sleep, what works for you?

I've been doing an on-call rotation every 3 weeks for about 8 months now, and the focus part during those long shifts is harder than dealing with the actual incidents. Like I can troubleshoot production issues fine, that's not the problem, it's more about maintaining any sort of mental sharpness for 12+ hours straight while also not completely destroying my sleep schedule for the next week afterwards.

By hour 8 or 9 my brain just starts turning to mush, especially on those shifts where nothing's really breaking and I'm just sitting there monitoring dashboards waiting for alerts. Coffee stops helping around midday and just makes me feel jittery and kind of anxious which is obviously not ideal when you might need to make quick calls about prod systems. Energy drinks made me feel worse after the rush dropped.

The sleep thing is probably the bigger issue though? Because even if I time my caffeine right I still end up lying in bed at 2am completely wired even though I'm exhausted, then the next day I'm useless. Can't really nap during quiet periods either because my brain won't let me disconnect knowing I could get paged any second.

Just curious what other people do for these situations because my current approach of drinking more coffee and hoping for the best is clearly not working lol. Not expecting some perfect solution, just wondering if anyone's found something that's at least better than what I'm doing now.

https://redd.it/1qcuwmb
@r_devops
Best way to download a Python package as part of CI/CD jobs?

Hi folks,

I’m building a read-only cloud hygiene / cleanup evaluation tool and currently in CI it’s run like this:

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"

- name: Install CleanCloud
  run: |
    python -m pip install --upgrade pip
    pip install -e ".[dev,aws,azure]"

This works fine, but I’m wondering whether requiring Python in CI/CD is a bad developer experience.

Ideally, I’d like users to be able to:

download a single binary (or similar)
run it directly in CI
avoid managing Python versions/dependencies

Questions:

Is the Python dependency totally acceptable for DevOps/CI workflows?
Or would you expect a standalone binary (Go/Rust/PyInstaller/etc.)?
Any recommended patterns for distributing Python-based CLIs without forcing users to manage Python?

Would really appreciate opinions from folks running tooling in real pipelines.


The config is here: https://github.com/cleancloud-io/cleancloud/blob/main/.github/workflows/main-validation.yml#L21-L29

Thanks!

https://redd.it/1qcvlmr
@r_devops
How to find untagged/unused cloud resources before they cost too much?

Hi,

In my experience, untagged and unused cloud resources can quietly lead to very large bills if they’re left unattended. Terraform helped with provisioning, but it didn’t really address ongoing cloud hygiene.
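
As a rough idea of what a basic hygiene check involves, here is a small sketch that flags resources missing required tags. It works on plain dictionaries so it stays cloud-agnostic. The function name `find_untagged`, the `id`/`tags` keys, and the tag names are assumptions for illustration, not the format of any particular provider or tool.

```python
from typing import Dict, Iterable, List


def find_untagged(
    resources: Iterable[dict],
    required_tags: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each resource id to the required tags it is missing.

    Resources that carry every required tag (with a non-empty value)
    are left out of the result.
    """
    required = list(required_tags)
    findings: Dict[str, List[str]] = {}
    for resource in resources:
        tags = resource.get("tags") or {}
        missing = sorted(tag for tag in required if not tags.get(tag))
        if missing:
            findings[resource["id"]] = missing
    return findings
```

Running a check like this on a schedule against an inventory export is one way to catch ownerless resources early, before they show up on the bill.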


I ran into this problem myself and came across https://www.getcleancloud.com/

I’m curious, have others run into unexpectedly high cloud bills due to unused or untagged resources?
What precautions or processes have worked for you?



https://redd.it/1qcy7ff
@r_devops
CVE counts are terrible security metrics and we need to stop pretending otherwise

Been saying this for years. CVE-2023-12345 in some obscure library function you never call gets the same weight as an RCE in your web framework. Half my critical alerts are for components in test containers that never see production traffic.

Real risk assessment needs exploit context, reachability analysis, and actual attack surface mapping. A distroless image with 5 CVEs can be infinitely safer than a bloated base whose scans are "clean" only because its vulnerabilities haven't been discovered yet.
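
To show the difference between counting CVEs and weighing them, here is a toy sketch that ranks findings by context instead of raw severity. The `Finding` fields, the `priority` and `triage` names, and especially the weights (0.1 and 2.0) are arbitrary assumptions chosen for illustration, not an established scoring standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    cve_id: str
    cvss: float            # base severity score, 0.0 to 10.0
    reachable: bool        # is the vulnerable code path actually called?
    in_production: bool    # does the component see production traffic?
    exploit_known: bool    # is there a known exploit in the wild?


def priority(finding: Finding) -> float:
    """Scale the base score by context (weights are illustrative only)."""
    score = finding.cvss
    if not finding.reachable:
        score *= 0.1   # unreachable code is heavily discounted
    if not finding.in_production:
        score *= 0.1   # test-only components are heavily discounted
    if finding.exploit_known:
        score *= 2.0   # known exploits are boosted
    return score


def triage(findings: List[Finding]) -> List[Finding]:
    """Return findings sorted from most to least urgent."""
    return sorted(findings, key=priority, reverse=True)
```

Under these weights, a CVSS 9.8 finding in an unreachable function of a test container ranks below a CVSS 5.0 finding that is reachable in production, which is the ordering a raw CVE count hides.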

We're optimizing for the wrong metrics and burning out teams with noise.

https://redd.it/1qczgxo
@r_devops
Manual cloud vs modern cloud — am I hurting my career staying here?

I apologize for the lengthy post in advance.

**Quick context**

* Currently a Cloud Systems Administrator
* Working in higher-ed at a community college (public sector) with gov benefits
* YOE
* Very hands-on, broad responsibility role

What I work on:

**AWS**

* VPC networking (subnets, route tables, IGW/NAT etc.)
* Security Groups, NACLs, firewalls
* Setting up VPC peering connections
* Application Load balancers
* Site-to-Site VPN tunneling
* IAM and Cloud Security
* On-prem-to-cloud migrations

**Azure**

* Azure Virtual Desktop
* VM provisioning and maintenance
* Storage and profile management
* Remote user access
* Cost Optimization

**Hyper-V (on-prem)**

* VM provisioning
* Storage allocation
* Host/guest management

**Microsoft/Identity/Endpoint**:

I manage the full Microsoft 365 admin stack:

* Intune – device enrollment, compliance/config policies, app packaging, patching
* Defender – threat policies, Defender for Identity, automated response
* Purview – DLP, data classification, eDiscovery
* Entra ID – SSO (SAML/OIDC), enterprise apps, Conditional Access, user/group mgmt
* Exchange Online – mail flow rules, mailbox management
* SharePoint Online – access and permissions

**Infra, Security & Identity**:

* Firewall management
* Active Directory (Domain Controllers, hybrid identity)

# The kicker:

One concern I have is that I know we’re doing cloud *“the wrong way.”* Most infrastructure is provisioned manually through the console rather than using Infrastructure as Code with version control. Mainly because we’re a smaller environment and many of our AWS servers were lifted-and-shifted from on-prem, we’re not constantly spinning up new resources.

Also a lot of our workloads could likely be handled by managed services instead of EC2:

* Web apps on App Runner or Elastic Beanstalk
* Databases on RDS
* Containers instead of long-running VMs
* SMTP relay via Amazon SES instead of a self-managed server

Instead, the approach tends to be more traditional: *“everything runs on EC2 with the necessary ports open.”*

I’m 26 and don’t want to stagnate or fall behind industry best practices, though benefits and stress level for my role are very manageable.

On top of that, at this school the only real upward progression from my current role is into an IT Director / management position. While I respect that path, it’s not where I want to go right now. I want to continue growing as a hands-on technical engineer, not move into people management or budgeting-heavy leadership roles.

Lastly, because it's a small IT department, everyone wears many hats, and (albeit seldom) I may have to help manage cameras/speakers/projectors during events, help with cabling, end-user support, and on-prem infrastructure setup (if we are under-staffed).

**What I’m trying to figure out:**

* Whether I should try to specialize in devops/security/identity types of roles or stay put for the benefits, low stress, and W/L balance.
* What roles realistically align with what I’m already doing.
* What skills I’m missing that would unlock the next tier of roles.

If you were in my position:

* What would your next move be?
* What skills would you prioritize?
* What job titles would you apply for?

I appreciate any perspective.

https://redd.it/1qd09z5
@r_devops