Reddit DevOps – Telegram
Monitoring infra cost for on-prem infrastructure (not cloud): which tool do you use?

Hi,

We need a tool to estimate the infrastructure cost of deploying a new application that will be hosted on-prem or in a local data center: the cost of vCPU, memory, storage, and databases, plus the labor cost of provisioning them.

Could you please tell me which tools you use for this?

Thank you

https://redd.it/1p1eblo
@r_devops
Building prod image with certificate

What’s the best way to inject SSL certificates into a Docker build process? I’m currently copying the certs as part of the Dockerfile, which is fine, but I’d rather only do it during the prod build process.

Thanks

https://redd.it/1p1mkrn
@r_devops
What are the best SAST tools for identifying security vulnerabilities?

What are the best SAST tools for identifying security vulnerabilities? We already use Snyk at work, so I was wondering if there are free tools I can use to find even more security issues.

https://redd.it/1p1nw0d
@r_devops
Has anyone ever felt burn out and found changes to really help?

Reading through this sub, I see I’m not too original in thinking about a side gig with manual labor or hands-on work; it doesn’t seem uncommon. Maybe the better question would be, did that help? Did you exit the industry ultimately or just find balance with other interests?

https://redd.it/1p1qb40
@r_devops
Is maintaining a VPC/ rented servers really that much more effort than what the cloud providers offer?

Hey everyone,

I’m stuck trying to choose between going all-in on AWS or running everything on a Hetzner + K8s setup for 2 projects that are going commercial. They're low-traffic B2B/B2C products where a bit of downtime isn’t the end of the world, and after going in circles, I still can’t decide which direction makes more sense. I've used both approaches to some extent in the past, for nothing too business critical, and had a pleasant-ish experience with both.

I am 99% certain I am fine with either choice and we'll be able to migrate from one to the other if need be, but I am genuinely curious to hear people's opinions.

**AWS:**
I *want* to just pay someone else to deal with the operational headaches, that’s the big appeal. But the price feels ridiculous for what we actually need. A “basic” setup ends up being ~$400/month, with $100 just for the NAT Gateway. And honestly, the complexity feels like overkill for a small-scale product that won’t need half the stuff AWS provides. The numbers may be a bit off, but if I want proper subnets, endpoints, and all the setup around a VPC that I'd call necessary, the cost really ramps up. I doubt we'd go over $400-600 even if we have prod and staging, but still.

**Hetzner:**
On the flip side, I love the bang for the buck. A small k3s cluster on Hetzner has been super straightforward, reliable, and mostly hands-off in my pet projects. Monitoring is simple, costs are predictable, and it feels like I’m actually in control. The turn-off with self-hosting is running my own S3-compatible storage, secrets manager, or registry. I’ve done it before, but I don’t really want the ongoing babysitting.

Right now I’m leaning toward a hybrid: Hetzner for compute + database, and AWS (or someone else) for managed services like S3 and Secrets Manager.

**What I’d love feedback on:**

* If you’ve been in this exact 50/50 situation, what was the one thing that pushed you to choose one over the other?
* Is a hybrid setup actually a good idea, or do the hidden costs (like data transfer) ruin the savings?
* And if I *do* self-host, what are the lowest-maintenance, production-ready alternatives to S3/Secrets/ECR that really “just work” without constant hand-holding?

Maybe I am too much in my head and can't see things clearly, but my question boils down to this: is self-hosting/having servers really that much hassle and effort? I've had single machines in a bare-bones Docker setup run for a year without any intervention. At the same time I don't want to spend all my time on infra rather than on the product, but I don't feel like AWS would save me that much time in this regard.

Looking for that one insight to break the deadlock. Appreciate any thoughts!

https://redd.it/1p1fiw9
@r_devops
QA tests blocking our CI/CD pipeline 45min per run, how do you handle this bottleneck?

We've got about 800 automated tests in the pipeline and they're killing our deployment velocity. 45 min average, sometimes over an hour if resources are tight.

The time is bad enough but the flakiness is even worse. 5 to 10 random test failures every run, different tests each time. So now devs just rerun the pipeline and hope it passes the second time which obviously defeats the entire purpose of having tests.

We're trying to ship multiple times daily, but the QA stage has become the bottleneck, so we either wait for slow tests or start ignoring failures, which feels dangerous. We tried parallelizing more but hit resource limits. We also tried running only the relevant tests per PR, but then we miss regressions.

It feels like we're stuck between slow and unreliable. Anyone actually solved this problem? We need tests that run fast, don't randomly fail, and catch real issues. I'm starting to think the whole approach might be flawed.
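Two of the pain points above (uneven parallel shards and "different tests fail each time") can be made measurable before changing tools. The following is a minimal sketch, not a description of any particular CI product: it assumes you can export per-test durations and pass/fail results per commit from your pipeline, and the names `shard_by_duration` and `find_flaky` are hypothetical.

```python
from collections import defaultdict


def shard_by_duration(durations, shard_count):
    """Greedy longest-first split: each test goes to the currently lightest shard.

    `durations` maps test name -> seconds (assumed to come from past CI runs).
    """
    shards = [{"tests": [], "total": 0.0} for _ in range(shard_count)]
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        lightest = min(shards, key=lambda s: s["total"])
        lightest["tests"].append(name)
        lightest["total"] += seconds
    return shards


def find_flaky(results):
    """Flag tests that both passed and failed on the same commit.

    `results` is an iterable of (commit, test_name, passed) tuples.
    A test that fails consistently on a commit is a real failure, not flaky.
    """
    outcomes = defaultdict(set)
    for commit, test, passed in results:
        outcomes[(commit, test)].add(passed)
    return sorted({test for (_, test), seen in outcomes.items() if seen == {True, False}})
```

Balancing shards by recorded duration shortens the slowest shard without adding runners, and a rerun-based flaky list gives you a concrete quarantine candidate set instead of "rerun and hope".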

https://redd.it/1p1uh6c
@r_devops
Looking at how FaceSeek works made me think about the DevOps side of large scale image processing

I tried a face search tool called FaceSeek with an old photo just out of curiosity. The quick response time surprised me and it made me think about the DevOps challenges behind something like that. It reminded me that behind every fast public facing feature there is usually a lot of work happening with pipelines, caching strategies, autoscaling, and monitoring.

I started wondering how a system like FaceSeek handles millions of embeddings, how it manages indexing jobs, and how it keeps latency reasonable when matching images against large datasets. It also made me think about what the CI and CD setup for this kind of workload would look like, especially when updating models or deploying new versions that might change the shape of the data.

This is not a promotion for FaceSeek. It simply sparked a technical question.

For those experienced in DevOps work, how would you approach designing the infrastructure for a system that depends on heavy preprocessing tasks, vector search, and bursty user traffic? I am especially curious about how to structure queues, scale workers, and maintain observability for something that needs to handle unpredictable spikes.

Would love to hear thoughts from people who have dealt with similar workloads.

https://redd.it/1p1v5vl
@r_devops
Logs, logs, and more logs… Spark job failed again!

I’m honestly getting tired of digging through Spark logs. Job fails, stage fails, logs are massive… and you still don’t know where the hell in the code it actually broke.

It’s 2025. Devs using Supabase or MCP can literally click on a cursor in their IDE and go straight to the problem. So fast. So obvious.

Why do we Spark folks still have to hunt through stages, grep through logs, and guess which part of the code caused the failure? Feels like there should be a way to jump straight from the alert to the exact line of code.

Has anyone actually done this? Any ideas, tricks, or hacks to make it possible in real production? I’d love to know because right now it’s a huge waste of time.
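One partial hack, sketched here under loud assumptions: the job is PySpark, the failing executor or driver log contains standard Python traceback lines of the form `File "...", line N, in func`, and your own code lives under a known path prefix (`app_root` below is a hypothetical parameter). This does nothing for Scala/JVM stack traces, and `last_app_frame` is an illustrative name, not part of Spark.

```python
import re

# Matches the frame lines Python prints in a traceback.
FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def last_app_frame(log_text, app_root):
    """Return (path, line, func) for the last traceback frame under app_root.

    Python prints the most recent call last, so the last matching frame is
    the deepest point in your own code before the error. Returns None when
    no frame under app_root is found.
    """
    hit = None
    for match in FRAME.finditer(log_text):
        if match.group("path").startswith(app_root):
            hit = (match.group("path"), int(match.group("line")), match.group("func"))
    return hit
```

The returned path and line number can be pasted into an alert message or turned into an editor link, which at least removes the grep step for Python-side failures.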

https://redd.it/1p1w2f3
@r_devops
Trying to level up again… but the learning paths all feel chaotic lately

I currently work at a startup. I've been in DevOps long enough to be considered "experienced," but not long enough to feel like I truly understand where the field is headed. My current work involves Kubernetes emergency drills and CI/CD tuning, and half the company discussions revolve around "AI-driven infrastructure," when nobody really understands it, lol.

I tried to create a learning plan, but it turned into a bunch of uncategorized tabs: Kelsey Hightower talks, in-depth analysis of Grafana, a half-finished Terraform course, and a ton of system design materials for interviews. One minute I'm in my VSCode notes, the next I'm quickly sketching in Miro, and occasionally I use Beyz coding assistant or Copilot to check if my presentation is correct.

What confuses me is how fragmented everything feels. One second I'm learning about PDBs, the next I'm reading about cost anomalies, and then some blog tells me I need to understand L4/L7 load balancing for an "interview." I don't currently have a clear roadmap that "fits me." I only have scattered puzzle pieces, and I have to piece them together while also dealing with the constant impact of industry changes.

So I'm curious, how do others rebuild their learning structures when faced with an overwhelming amount of information? Do you focus on in-depth study of a particular topic, or do you rotate through different topics each week?

https://redd.it/1p1xkf2
@r_devops
I built anomalog - a tool to quickly diff log files between deployments – in-browser, and no data uploads

As an engineer wearing the “DevOps” hat, I often had to compare logs from different deployments/environments to figure out what changed (think: “Why is Prod acting weird when Stage was fine?”). I got frustrated doing this by hand, so I created Anomalog (https://anomalog.com), a lightweight log comparison tool to automate the process.

What it does: You feed Anomalog two log files (say, logs from the last successful deploy vs. the latest one), and it highlights all the lines that are in one log but not the other. This makes it super easy to spot new errors, config differences, or any unexpected output introduced by a release. It’s essentially a diff tuned for logs – helpful for pinpointing issues between versions.

Tech notes: It’s a static web app (HTML/JS) that runs entirely in your browser, so no logs are sent to any server. You can even run it offline or self-host it. The comparison is done via client-side parsing and set logic on log lines. It handles large log files (tested up to a few hundred MB) by streaming the comparison. And since it’s browser-based, it’s cross-platform by default. Open-sourced on GitHub [placeholder] – contributions welcome!
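For readers wondering what "set logic on log lines" can look like, here is a minimal sketch of the general idea. It is not Anomalog's code (the tool is described as HTML/JS), and it assumes an ISO-8601-like timestamp prefix that gets stripped so identical messages from different runs compare equal; `normalize` and `only_in_first` are hypothetical names.

```python
import re

# Assumed timestamp shape, e.g. "2025-01-01T10:00:00Z "; real logs may need another pattern.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*\s*")


def normalize(line):
    """Drop the trailing newline and the leading timestamp, if present."""
    return TIMESTAMP.sub("", line.rstrip("\n"))


def only_in_first(first_lines, second_lines):
    """Normalized lines present in first_lines but absent from second_lines.

    Keeps the order of first appearance and reports each distinct line once.
    """
    known = {normalize(line) for line in second_lines}
    emitted = set()
    result = []
    for line in first_lines:
        key = normalize(line)
        if key not in known and key not in emitted:
            emitted.add(key)
            result.append(key)
    return result
```

Calling it once in each direction gives "new in the failing run" and "missing from the failing run", which covers both the new-error case and the disappeared-config-line case.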

Why it’s useful: It can save time in CI/CD troubleshooting – for example, compare a working pipeline log to a failing one to quickly isolate what’s different. Or use it in incident post-mortems to spot what an attacker’s run did versus normal logs. We’ve been using it internally for config drift detection by comparing daily cron job logs. Early tests caught an issue where a config line disappeared in one environment – something that would’ve been a needle in a haystack otherwise.

I’d love for folks here to try it out. It’s free and doesn’t require any install (just a web browser). Feedback is hugely appreciated – especially on how it could fit into your workflows or any features that would make it more DevOps-friendly. If you have ideas (or find a log format it struggles with), let me know. Thanks for reading, and I hope Anomalog can save you some debugging time! 🙌

https://redd.it/1p1x107
@r_devops
What Is API Contract Testing and Why DevOps Teams Rely on It in 2025?

What exactly counts as API contract testing in 2025, and how are DevOps teams integrating it into CI/CD?

At a basic level, an API contract defines how services talk to each other. A contract test checks that the service actually matches that spec as the code changes. The whole point is to catch breaking changes before they hit staging or production.

Right now we’re using OpenAPI/JSON Schema as the source of truth and running checks in CI before merges, plus mock servers for early validation. But I’m curious how other teams are handling it.
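To make "checks in CI" concrete, here is a minimal sketch of what a response-shape check does. It covers only a small subset of JSON Schema (`type`, `required`, `properties`, `items`) and is an illustration, not a replacement for a real validator or for any of the tools listed below; `violations` is a hypothetical helper name.

```python
# Map JSON Schema type names to Python types (subset only).
TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def violations(value, schema, path="$"):
    """Return a list of human-readable mismatches between value and schema."""
    expected = schema.get("type")
    if expected:
        ok = isinstance(value, TYPES[expected])
        # bool is a subclass of int in Python; do not accept it as a number.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(violations(value[key], sub_schema, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(violations(item, schema["items"], f"{path}[{index}]"))
    return errors
```

In a CI gate, a non-empty list for any recorded response would fail the job, which is the "catch breaking changes before staging" behavior described above.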

Tools I’m looking at:

• Apidog – tight integration between API design + testing + mock servers; built-in contract checks against the OpenAPI spec
• Pact – strong for consumer-driven flows
• Postman – works for schema validation
• Dredd – validates real responses directly against the OpenAPI contract
• Stoplight – good for design-first workflows
• Karate – API test automation with schema checks

For teams running microservices or multiple pipelines:

1. Do you treat contract tests as a mandatory CI/CD gate?
2. How do you prevent drift between the OpenAPI spec and the actual implementation?
3. Any tool setups that scale without becoming a maintenance headache?

Looking for real workflows, not the textbook definitions.

https://redd.it/1p1wzcj
@r_devops
How do you cope with burnout

I'm at the point in my life where I can barely function in this field anymore. The constant change and grind. The occasional brutal on-call experience where you're trying to debug some k8s cluster environment at 2am.

I'm in my mid 40s and tech has been good money but also the biggest source of misery for me the last 20 years.

I've become obsessed with the FIRE movement and specifically CoastFi where I can just work some bullshit job for lower pay and let my retirement savings compound.

Unfortunately I don't know what else I would do for an occupation and I'm tired. Learning new things is not exciting anymore. Not sure if it's age related or perhaps I've always had lower IQ that's starting to catch up with me in my recent work struggles. Not sure.

How are people coping with burnout in this ridiculous field, having to constantly adapt to the whims of the business and an industry that I don't give two shits about anymore?

Has anyone benefited from antidepressants/SSRIs to fix their brain and keep the tech job going?

https://redd.it/1p1zf4n
@r_devops
our startup grew too fast and now our processes are chaos

When we were 5 devs, everything ran smoothly. Now we are 20 and everything is on fire. Our Jira setup is too rigid, Linear is too minimal for our needs, and ClickUp feels tough every time we try to customize anything. We desperately need a system which scales without turning into hidden columns. Something flexible, visual but powerful enough for complex dev workflows.

https://redd.it/1p20pq6
@r_devops
Is it just me or is modern dev work starting to feel like playing Jenga with someone shaking the table?

Every time I fix one thing, something else breaks in a completely unrelated part of the stack. Half my week is just debugging stuff I didn’t even touch. Does anyone else feel like software used to be calmer? Am I finally losing it?

https://redd.it/1p221gc
@r_devops
Need your suggestion ASAP

I have 5.5 years of experience in DevOps tooling, cloud, and Python/shell automation. I recently joined a product-based company as a DevOps lead. Within a week of my joining, they laid off the product owner who hired me. 😓

Things went very south for me and the team. Now the senior manager (who is also a senior dev) is asking me to learn C# and become a backend developer because he thinks there is no need for DevOps.

In this company, the cloud/infra team created their own tool for DevOps/infra provisioning, which can connect to a git repo, provision the infra, and deploy to it in a single click.

If I choose to become a C#/.NET developer, I’ll be losing the DevOps track, and if I stick with DevOps, I won’t have much work to justify my position on the team.

What would you guys do in this situation? How would you justify DevOps here?

https://redd.it/1p2213j
@r_devops
GuardScan - Free Security Scanner & Code Review Tool for CI/CD Pipelines

Hey r/devops,

I've built a tool that may be useful for your CI/CD pipelines, particularly if you're implementing DevSecOps or shift-left security.

What is GuardScan?

It's a privacy-first CLI security scanner and code reviewer that you can integrate into your CI/CD workflows. It's designed to catch security issues before they reach production.

DevOps-Relevant Features:

🔄 CI/CD Ready:

Works with GitHub Actions, GitLab CI, Jenkins, CircleCI
Proper exit codes for pipeline integration
JSON/SARIF output formats
Configurable severity thresholds

🔒 Security Scanning:

Secrets detection (prevents credential leaks)
Dependency vulnerability scanning
OWASP Top 10 detection
Docker & IaC security (Terraform, K8s, CloudFormation)
API security analysis

📊 Code Quality Gates:

Cyclomatic complexity limits
Code smell detection
License compliance checking
Test coverage validation

🎯 Privacy & Control:

Self-hosted option (MIT license)
Code stays on your infrastructure
No external dependencies for security scanning
Works in air-gapped environments

Quick Integration:

# .github/workflows/security.yml
- name: Security Scan
  run: |
    npm install -g guardscan
    guardscan security --fail-on high


Why I built this:

Most security scanning tools are either expensive or require uploading code to third-party services. For regulated industries or sensitive codebases, that's a non-starter. GuardScan runs entirely on your infrastructure.

Free & Open Source:

No subscriptions or usage limits
MIT License
GitHub: https://github.com/ntanwir10/GuardScan

Would love to hear how you're handling security scanning in your pipelines!

https://redd.it/1p281i2
@r_devops
Vendor could use an update

I've been working with a vendor that says they are "trusted by over 80,000 companies". Their tool is open source with a paid addon for enterprises. My org bought the software and now we have to set it up. So in the kick-off meeting I point out to their "Success Engineer" that they have installation guides for server and for Docker, but not Kubernetes. The Docker instructions include an example docker-compose.yml file and specific instructions on how to set environment variables with Docker, stuff about Docker volumes, etc. Very detailed. But they don't even mention Kubernetes on their website. I asked him if there was anything particular about Kubernetes I should watch out for, and suggested that having a guide written by them might be nice in the future.

He said, "We don't have a guide for Kubernetes because there's so many different ways to deploy it. We didn't want to be prescriptive." "Prescriptive" was the word he used. But like, if there's so many different ways to install the software on Kubernetes (there's not), wouldn't that be the reason why you'd want to be prescriptive? To offer your customers a baseline install they could work from?

The PostgreSQL docs they gave us were just their standard database install doc with "RDS" pasted in in a couple of places because we told them we used RDS. It says RDS at the top and suggests using gp3 disks, so they understand that we're using AWS. But then it has lines like "create or modify /etc/postgresql/postgresql.conf" and provides full maintenance scripts, shebang line and all, to put on the database server that doesn't exist.

The vendor has actually been great so far and their product seems solid, so no shade there, and luckily I'm a 10x engineer so I can translate all this as needed. 😁 It's just... if you're offering enterprise software in the year 2025, shouldn't you expect your customers to be using one of a certain common set of technologies and be prepared for that with documentation and experience?

https://redd.it/1p2amtu
@r_devops
How to Monitor MariaDB and ScyllaDB for a stress test Comparison

Hi! I want to show the performance benefits of ScyllaDB compared to MariaDB. How can I do this? I tried to write the code for it using Vibecode, but it was too complicated, so I decided to do it myself. The problem is, I don’t see much information about this, and I’m still too junior to know the right tools or how to write a Docker Compose file. Could you guys help me out? Even if you only know how to monitor one of them, that would be super helpful. Thanks!
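As a starting point for the comparison itself, here is a minimal, database-agnostic sketch of a latency harness. It assumes you wrap each database call in a zero-argument function yourself (`run_query` is a hypothetical placeholder, and no MariaDB or ScyllaDB driver is used here), then compare percentiles rather than averages. It does not replace proper server-side monitoring.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def measure(run_query, iterations):
    """Call run_query `iterations` times and summarize wall-clock latency in seconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "max": latencies[-1],
    }
```

Running the same workload function against each database, with the same iteration count and data set, gives numbers that can be put side by side; a single-threaded loop like this measures latency only, so throughput under concurrency needs a separate test.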

https://redd.it/1p2eqsj
@r_devops
Anyone else getting way more take-homes in tech interviews this year?

Some say interviews are easier now, others say it just turned into unpaid mini projects.

One thing I keep seeing people say is that because of AI, companies are pushing take-homes since it’s supposedly harder to cheat compared to live coding.

Is this actually happening to you too?

https://redd.it/1p2gg8m
@r_devops