Reddit DevOps – Telegram
[Update] StatefulSet Backup Operator v0.0.5 - Configurable timeouts and stability improvements

Hey everyone!

Quick update on the StatefulSet Backup Operator - continuing to iterate based on community feedback.

**GitHub:** [https://github.com/federicolepera/statefulset-backup-operator](https://github.com/federicolepera/statefulset-backup-operator)

**What's new in v0.0.5:**

* **Configurable PVC deletion timeout for restores** - A new `pvcDeletionTimeoutSeconds` field lets you set a custom timeout for PVC deletion during restore operations (default: 60s). This was a pain point for people using slow storage backends where PVCs take longer to delete.

**Recent changes (v0.0.3-v0.0.4):**

* Hook timeout configuration (`timeoutSeconds`)
* Time-based retention with `keepDays`
* Container name selection for hooks (`containerName`)

**Example with new timeout field:**

```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
spec:
  statefulSetRef:
    name: postgresql
  backupName: postgres-backup
  scaleDown: true
  pvcDeletionTimeoutSeconds: 120  # Custom timeout for slow storage (new!)
```

**Full feature example:**

```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgresql
  schedule: "0 2 * * *"
  retentionPolicy:
    keepDays: 30             # Time-based retention
  preBackupHook:
    containerName: postgres  # Specify container
    timeoutSeconds: 120      # Hook timeout
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]
```

**What's working well:**

The operator is getting more production-ready with each release. Redis and PostgreSQL are fully tested end-to-end. The timeout configurability was directly requested by people testing on different storage backends (Ceph, Longhorn, etc.) where the default 60s wasn't enough.

**Still on the roadmap:**

* Combined retention policies (`keepLast` + `keepDays` together)
* Helm chart (next priority)
* Webhook validation
* Prometheus metrics

**Following up on OpenShift:**

Still haven't tested on OpenShift personally, but the operator uses standard K8s APIs so theoretically it should work. If anyone has tried it, would love to hear about your experience with SCCs and any gotchas.

As always, feedback and testing on different environments is super helpful. Also happy to discuss feature priorities if anyone has specific use cases!

https://redd.it/1qcstdn
@r_devops
Need to stay focused during 12 hour on-call without ruining sleep, what works for you?

I've been doing an on-call rotation every 3 weeks for about 8 months now, and the focus part during those long shifts is harder than dealing with the actual incidents. Like I can troubleshoot production issues fine, that's not the problem, it's more about maintaining any sort of mental sharpness for 12+ hours straight while also not completely destroying my sleep schedule for the next week afterwards.

By hour 8 or 9 my brain just starts turning to mush, especially on those shifts where nothing's really breaking and I'm just sitting there monitoring dashboards waiting for alerts. Coffee stops helping around midday and just makes me feel jittery and kind of anxious which is obviously not ideal when you might need to make quick calls about prod systems. Energy drinks made me feel worse after the rush dropped.

The sleep thing is probably the bigger issue though? Because even if I time my caffeine right I still end up lying in bed at 2am completely wired even though I'm exhausted, then the next day I'm useless. Can't really nap during quiet periods either because my brain won't let me disconnect knowing I could get paged any second.

Just curious what other people do for these situations because my current approach of drinking more coffee and hoping for the best is clearly not working lol. Not expecting some perfect solution, just wondering if anyone's found something that's at least better than what I'm doing now.

https://redd.it/1qcuwmb
@r_devops
Best way to download a python package as part of CI/CD jobs ?

Hi folks,

I’m building a read-only cloud hygiene / cleanup evaluation tool and currently in CI it’s run like this:

```yaml
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"

- name: Install CleanCloud
  run: |
    python -m pip install --upgrade pip
    pip install -e ".[dev,aws,azure]"
```

This works fine, but I’m wondering whether requiring Python in CI/CD is a bad developer experience.

Ideally, I’d like users to be able to:

download a single binary (or similar)
run it directly in CI
avoid managing Python versions/dependencies

Questions:

Is the Python dependency totally acceptable for DevOps/CI workflows?
Or would you expect a standalone binary (Go/Rust/PyInstaller/etc.)?
Any recommended patterns for distributing Python-based CLIs without forcing users to manage Python?

Would really appreciate opinions from folks running tooling in real pipelines.
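One low-friction middle ground worth sketching (this is an assumption about your packaging, not your actual setup): GitHub's hosted runners ship with pipx preinstalled, so if the tool is published to PyPI, a single step can fetch and run it without a separate setup-python step. Package name, extras, and flags here are hypothetical:

```yaml
# Hypothetical step, assuming the CLI is published to PyPI as "cleancloud"
- name: Run CleanCloud
  run: pipx run --spec "cleancloud[aws,azure]" cleancloud --help
```

This keeps users on plain Python packaging while hiding the interpreter management; a standalone binary (Go/Rust or PyInstaller) only becomes worth the build complexity if you need to support runners without any Python at all.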


The config is here: https://github.com/cleancloud-io/cleancloud/blob/main/.github/workflows/main-validation.yml#L21-L29

Thanks!

https://redd.it/1qcvlmr
@r_devops
How to find out Cloud untagged/unused resources before they cost too much?

Hi,

In my experience, untagged and unused cloud resources can quietly lead to very large bills if they’re left unattended. Terraform helped with provisioning, but it didn’t really address ongoing cloud hygiene.


I ran into this problem myself and came across https://www.getcleancloud.com/

I’m curious, have others run into unexpectedly high cloud bills due to unused or untagged resources?
What precautions or processes have worked for you?



https://redd.it/1qcy7ff
@r_devops
CVE counts are terrible security metrics and we need to stop pretending otherwise

Been saying this for years. CVE-2023-12345 in some obscure library function you never call gets the same weight as an RCE in your web framework. Half my critical alerts are for components in test containers that never see production traffic.

Real risk assessment needs exploit context, reachability analysis, and actual attack surface mapping. A distroless image with 5 CVEs can be infinitely safer than a bloated base with "clean" scans that just haven't been discovered yet.

We're optimizing for the wrong metrics and burning out teams with noise.

https://redd.it/1qczgxo
@r_devops
Manual cloud vs modern cloud — am I hurting my career staying here?

I apologize for the lengthy post in advance.

**Quick context**

* Currently a Cloud Systems Administrator
* Working in higher-ed at a community college (public sector) with gov benefits
* YOE
* Very hands-on, broad responsibility role

What I work on:

**AWS**

* VPC networking (subnets, route tables, IGW/NAT etc.)
* Security Groups, NACLs, firewalls
* Setting up VPC peering connections
* Application Load balancers
* Site-to-Site VPN tunneling
* IAM and Cloud Security
* On-prem-to-cloud migrations

**Azure**

* Azure Virtual Desktop
* VM provisioning and maintenance
* Storage and profile management
* Remote user access
* Cost Optimization

**Hyper-V (on-prem)**

* VM provisioning
* Storage allocation
* Host/guest management

**Microsoft/Identity/Endpoint**:

I manage the full Microsoft 365 admin stack:

* Intune – device enrollment, compliance/config policies, app packaging, patching
* Defender – threat policies, Defender for Identity, automated response
* Purview – DLP, data classification, eDiscovery
* Entra ID – SSO (SAML/OIDC), enterprise apps, Conditional Access, user/group mgmt
* Exchange Online – mail flow rules, mailbox management
* SharePoint Online – access and permissions

**Infra, Security & Identity**:

* Firewall management
* Active Directory (Domain Controllers, hybrid identity)

# The kicker:

One concern I have is that I know we’re doing cloud *“the wrong way.”* Most infrastructure is provisioned manually through the console rather than using Infrastructure as Code with version control, mainly because we’re a smaller environment: many of our AWS servers were lifted-and-shifted from on-prem, and we’re not constantly spinning up new resources.

Also a lot of our workloads could likely be handled by managed services instead of EC2:

* Web apps on App Runner or Elastic Beanstalk
* Databases on RDS
* Containers instead of long-running VMs
* SMTP relay via Amazon SES instead of a self-managed server

Instead, the approach tends to be more traditional: *“everything runs on EC2 with the necessary ports open.”*

I’m 26 and don’t want to stagnate or fall behind industry best practices, though benefits and stress level for my role are very manageable.

On top of that, at this school the only real upward progression from my current role is into an IT Director / management position. While I respect that path, it’s not where I want to go right now. I want to continue growing as a hands-on technical engineer, not move into people management or budgeting-heavy leadership roles.

Lastly, because it's a small IT department, everyone wears many hats, and (though seldom) I may have to help manage cameras/speakers/projectors during events, help with cabling, end-user support, and on-prem infrastructure setup (if we are under-staffed).

**What I’m trying to figure out:**

* Whether I should try to specialize in devops/security/identity types of roles or stay put for the benefits, low stress, and W/L balance.
* What roles realistically align with what I’m already doing.
* What skills I’m missing that would unlock the next tier of roles.

If you were in my position:

* What would your next move be?
* What skills would you prioritize?
* What job descriptions would you apply for?

I appreciate any perspective.

https://redd.it/1qd09z5
@r_devops
Learn devops outside of a company

How can I actually learn devops without working for a company? Without spending a lot of money or setting up my own application, how can I learn devops? I've never worked on a complicated or high-volume enough project, but I want to learn how to handle one if I ever get there.

https://redd.it/1qd1lrs
@r_devops
Should this subreddit introduce post flairs?

Dear community,

We are considering introducing some small changes in this subreddit. One of the changes would be to... introduce post flairs.

I think post flairs might improve overall experience. For example you can set your expectations about the contents of the thread before opening it, or filter according to your interests.

However, we would like to hear from all of you. You can tell us in a few ways:

a) by voting, please see the poll,

b) if you think of a better flair option, or if you don't like some of the proposed ones, put your thoughts in the comments,

c) upvote/downvote proposed options in comments (if any) to keep it DRY.

Feel free to discuss.

The list, just to start

- 'Discussion'
- 'Tooling' or 'Tools'
- 'Vendor / research' ?
- 'Career'
- 'Design review' or 'Architecture' ?
- 'Ops / Incidents'
- 'Observability'
- 'Learning'
- 'AI' or 'LLM' ?
- 'Security'

It would be good to keep the list short while still covering the core principles that make up DevOps. But it is also good to have a few extra flairs to cover other types of posts.

Thank you all.


https://redd.it/1qd2pc3
@r_devops
We made ktfmt 100x faster by eliminating JVM warmup - same approach works for any Java/Kotlin compilation in CI/CD

I've been working on Elide, which uses GraalVM native-image to compile Java/Kotlin tools (like javac, kotlinc) into native binaries. This eliminates JVM warmup overhead in CI/CD pipelines.

Our CEO Sam recently contributed a PR to Facebook's ktfmt (Kotlin formatter) showing up to 100x speedup for formatting tasks in CI. See the benchmarks here.

The principle is pretty simple. Every time your CI runs javac or any JVM-based tool, the JVM boots and warms up before actual work happens. For small-to-medium projects (under ~10k classes) or formatting changed files, warmup time often exceeds actual processing time.

Our approach takes standard Java/Kotlin compilers and compiles them to native binaries via GraalVM. Same compiler, same inputs, same outputs, which means zero warmup penalty.

There are some honest tradeoffs, e.g. for very large projects (10k+ classes) the performance gap closes as JVM JIT warmup pays off. But for typical CI jobs (compiling changed files, running formatters, incremental builds), the native compilation wins significantly.
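For anyone curious what the build step looks like, a minimal sketch with a stock GraalVM toolchain (this is the generic native-image invocation, not Elide's actual build pipeline; jar and output names are made up):

```
# Generic GraalVM sketch, not Elide's actual build
# Requires a GraalVM JDK with the native-image tool installed
native-image -jar ktfmt.jar ktfmt-native   # ahead-of-time compile the jar
./ktfmt-native --help                      # starts with no JVM warmup
```

Real tools usually need extra native-image configuration for reflection and resources, which is where most of the porting effort goes.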

Would love feedback on whether faster JVM tool execution matters for your CI/CD workflows.

GitHub: https://github.com/elide-dev/elide

https://redd.it/1qd2zmc
@r_devops
Considering using monday dev for sprint planning, agile, backlog visibility, and integrations

we have never used monday dev before and are considering it for our dev team. we are currently evaluating tools for sprint planning, agile, backlog visibility, and integrations with github and slack, but don't want something overly complex out of the gate.

for teams that adopted it from scratch:
how was the initial setup and onboarding?
did devs actually like using it day to day?
anything you wish you knew before switching?

looking for honest first time experiences before we test it internally.

https://redd.it/1qda88t
@r_devops
How do you balance SBOM detail with actionable vulnerability prioritization?

SBOMs for minimal images can get huge. Not every vulnerability is relevant, and it’s hard to decide which ones to address first.
How do you focus on the most critical issues without getting lost in the details?

https://redd.it/1qdcrvb
@r_devops
What C library is missing from the ecosystem that would actually be useful?

I want to write a practical C library that solves a real problem, but I'm struggling to find a gap worth filling.

Background:

I'm a DevOps engineer with solid C experience (alongside Go, Python, etc.) and I want to contribute something useful to the open source ecosystem. Not a "learning project" - something people would actually use in production.

Areas I've been considering:

1. Configuration parsing - TOML 1.0.0 compliant library (most C TOML parsers are outdated or incomplete)
2. Observability primitives - lightweight metrics/tracing that doesn't pull in massive dependencies
3. Container/cgroup utilities - low-level tools for working with namespaces/cgroups without shelling out
4. Network utilities - something that sits between raw sockets and full HTTP libraries
5. Data serialization - fast, simple formats that aren't JSON/Protobuf

What I'm NOT looking for:

"Just use language X instead" - I know C isn't for everything, but some domains need it
Crypto libraries (that's a minefield I'm avoiding)
Reimplementing existing mature libraries

Questions:

What C libraries do you wish existed when building infrastructure tooling?
What do you end up writing custom wrappers for repeatedly?
Any pain points with existing C libraries in DevOps/infrastructure space?

The TOML parser idea came from noticing that a lot of tools (especially Rust/Go projects) use TOML configs, but C integration is spotty. Is that actually a problem worth solving, or am I overthinking it?

Would love to hear what would genuinely make your life easier, even if it's niche. Bonus points if it's something that would integrate well with container/Kubernetes tooling.

https://redd.it/1qdgn0k
@r_devops
Search compressed files without decompressing - just shipped Crystal Unified

Hey everyone!  Just shipped something I'm pretty excited about - Crystal Unified Compressor.  

The big deal: Search through compressed archives without decompressing. Find a needle in 700MB or 70GB of logs in milliseconds instead of waiting to decompress, grep, then clean up.  

What else it does:
- Firmware delta patching - Create tiny OTA updates by generating binary diffs between versions. Perfect for IoT/embedded devices, game patches, and other updates
- Block-level random access - Read specific chunks without touching the rest
- Log files - 10x+ compression (6-11% of original size) on server logs + search in milliseconds
- Genomic data - 4:1 compression on DNA sequences
- Time series / sensor data - Delta encoding that crushes sequential numeric patterns
- Parallel compression - Throws all your cores at it

Decompression runs at 1GB/s+.

Check it out: https://github.com/powerhubinc/crystal-unified-public  
We published it under BSL 1.1 license.

Would love thoughts on where you've seen this kind of thing needed in your own work.

https://redd.it/1qdkelh
@r_devops
What I like about being a senior engineer

What I don't like about being a senior engineer:


* I'm no longer in a room full of people smarter than me.
* I don't trust my ego sometimes. That's a me thing.


What I like about being a senior engineer:


* When I speak things I know something about, people pretty much listen.
* I get to have a meaningful impact on organizational outcomes, I get to work on big projects.
* I really enjoy mentoring junior people who are open to it.

https://redd.it/1qdmhoi
@r_devops
Do you think justfiles underdeliver everywhere except packing scripts into a single file?

I'm kinda disappointed in justfiles. In the documentation it looks nice; in practice it creates a whole other set of hassle.

I'm trying to automate and document a few day-to-day tasks + deployment jobs. In my case it's a quite simple env (dev, stage, prod) + target (app1, app2) combination.

I'd want to basically write something like `just deploy dev app1`, `just tunnel dev app1-db`.

Initially I tried to have some map-like structure and variables, but justfile doesn't support this. Fine, I've written all the constants manually by convention, like DEV_SOMETHING, PROD_SOMETHING.

Okay, then I figured I need a way to pick the value conditionally. So for the test I picked this pattern:

```
arg("env", pattern="dev|stage|prod")
arg("target", pattern="app1|app2")
deploy env target:
    {{ if env == "dev" { "instance_id=" + DEV_INSTANCE_ID } else { "" } }}
    {{ if env == "prod" { "instance_id=" + PROD_INSTANCE_ID } else { "" } }}
    ...
```

Which is already ugly enough, but what are my options?

But then I faced the need to pick values based on a combination of env + target conditions, e.g. for port forwarding, where all the ports should be different. At this point I found out that justfile doesn't support AND or OR in if conditions. Parsing and evaluating AND or OR operations isn't much harder than == and != itself.
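One workaround for the missing AND (just a sketch; recipe, port, and host names are made up): since just does support string concatenation and chained else-if in expressions, you can collapse the two conditions into a single combined key and compare against that:

```
tunnel env target:
    ssh -N -L {{ if env + "-" + target == "dev-app1" { "5433" } else if env + "-" + target == "dev-app2" { "5434" } else { "5432" } }}:localhost:5432 bastion
```

It's ugly and doesn't scale past a handful of combinations, which rather proves the point about the missing map type.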

Alright. Then I thought, maybe I'm approaching this completely wrong, maybe I need to generate all the tasks and treat the justfile as a rendering engine for scripts and tasks? I thought maybe I need to use some for loop and basically try to generate deploy-{{env}}-{{target}}: root-level tasks with fully instantiated script definitions?

But justfile doesn't support that either.

I also thought about implementing some additional functions to simplify it, or some kind of render-time evaluation, but justfile doesn't support such functions either.

So, at this point I'm quite disappointed in the value proposition of justfile, because honestly packing the scripts into a single file is about the only value it brings. I know, maybe it's me, maybe I expected too much from it, but then what's the point of it?

I've looked through GitHub issues; there are things in development, like custom functions and probably loops, but it's been about 3 or 4 years since I first heard about them, and the main limitations are still there. And the only thing I found regarding multiple conditions in if is that, instead of just implementing the simplest operator evaluation, they're thinking about integrating Python as a scripting language. Like, why? You already have an additional tool to set up, "just" itself; bringing in another runtime that actually gives programming features, of which you need only the simplest operators and maps, kind of defeats the whole purpose. At this point it seems like reverting completely to plain bash scripts makes more sense than this.

What's your experience with just? All the threads I've seen about justfiles are already 1-3 years old, want to hear more fresh feedback about it.

https://redd.it/1qdnjhz
@r_devops
Research: how are teams controlling and auditing AI agents in production?

Hey folks,

We are researching how teams running AI agents in production deal with things like cost spikes, access control, and “what did this agent actually do?”

We put together a short anonymous survey (5–7 min) to understand current practices and gaps.

This is not a sales pitch. We are validating whether this is even a real problem worth solving.

Would appreciate honest, even skeptical feedback.

👉 https://forms.gle/yo7xwf6DrAnk2L5x7


https://redd.it/1qdoyc0
@r_devops
How big of a risk is prompt injection for client-facing chatbots or voice agents?

I’m trying to get a realistic read on prompt injection risk, not the “Twitter hot take” version. When people talk about AI agents running shell commands, the obvious risks are clear. You give an agent too much power and it does something catastrophic like deleting files, messing up git state, or touching things it shouldn’t. But I’m more curious about client-facing systems. Things like customer support chatbots, internal assistants, or voice agents that don’t look dangerous at first glance. How serious is prompt injection in practice for those systems?

I get that models can be tricked into ignoring system instructions, leaking internal prompts, or behaving in unintended ways. But is this mostly theoretical, or are people actually seeing real incidents from it?

Also wondering about detection. Is there any reliable way to catch prompt injection after the fact, through logs or output analysis? Or does this basically force you to rethink the backend architecture so the model can’t do anything sensitive even if it’s manipulated?

I’m starting to think this is less about “better prompts” and more about isolation and execution boundaries.
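The execution-boundary idea can be sketched in a few lines (all tool names and handlers here are hypothetical, not any particular framework's API): instead of trusting the model's output, gate every tool call through an allow-list, so an injected prompt can at worst invoke tools the session was already entitled to use:

```python
# Sketch: an allow-list gate between the model's tool-call output and
# actual execution. Tool names and handlers are hypothetical.
ALLOWED_TOOLS = {"search_kb", "get_order_status"}

def execute_tool(name, args, handlers):
    """Run a model-requested tool only if it is explicitly allowed."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} not permitted for this agent")
    return handlers[name](**args)

handlers = {
    "search_kb": lambda query: f"results for {query}",
    "get_order_status": lambda order_id: "shipped",
    "delete_account": lambda user_id: "deleted",  # exists, but unreachable
}

print(execute_tool("search_kb", {"query": "refunds"}, handlers))
# An injected instruction asking for delete_account is refused:
try:
    execute_tool("delete_account", {"user_id": "42"}, handlers)
except PermissionError as exc:
    print("blocked:", exc)
```

The point is that the boundary lives outside the model: no amount of prompt manipulation widens the set of reachable actions, which is exactly the "isolation over better prompts" framing.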

Would love to hear how others are handling this in production.

https://redd.it/1qdr4hg
@r_devops
A Friday production deploy failed silently and went unnoticed until Monday

We have automated deployments that run Friday afternoons, and one of them silently failed last week. The pipeline reported green, monitoring did not flag anything unusual, and everyone went home assuming the deploy succeeded.

On Monday morning we discovered the new version never actually went out. A configuration issue prevented the deployment, but health checks still passed because the old version was continuing to run. Customers were still hitting bugs we believed had been fixed days earlier.

What makes this uncomfortable is realizing the failure could have gone unnoticed for much longer. Nothing in the process verified that the running build actually matched what we thought we deployed. The system was fully automated, but no one was explicitly confirming the outcome.

Automation removed friction, but it also removed curiosity. The pipeline succeeded, dashboards looked fine, and nobody thought to validate that the intended version was actually live. That is unsettling, especially since the entire system was designed to prevent exactly this kind of failure.
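A cheap guard for this failure mode, sketched below (the endpoint and version strings are assumptions, not your actual stack): make the pipeline's final step compare the version the running service reports, e.g. from a /version endpoint or a build-info label, against the version the pipeline just shipped, and fail loudly on mismatch:

```python
# Sketch: post-deploy verification. In a real pipeline, fetch_version
# would query the live service (a /version endpoint, an image label,
# etc.); here it is stubbed so the logic is self-contained.
def verify_release(expected: str, fetch_version) -> bool:
    """Return True only if the running build reports the expected version."""
    return fetch_version() == expected

# Old build still running: the deploy did NOT actually go out
assert not verify_release("v2.4.1", lambda: "v2.4.0")
# Reported version matches what the pipeline shipped
assert verify_release("v2.4.1", lambda: "v2.4.1")
print("release verified")
```

Wired in as a required pipeline stage, this turns "green but nothing shipped" into a hard failure on Friday afternoon instead of a surprise on Monday.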

https://redd.it/1qdl5m8
@r_devops