Reddit DevOps – Telegram
Is AI the future or is it a big scam?

I am really confused. I am a Unity developer, and I see that nowadays 90% of jobs are about AI and agentic AI.

But at the same time, every time I give any AI a coding task, for example how to implement this:
https://github.com/CyberAgentGameEntertainment/InstantReplay?tab=readme-ov-file

I get a lot of NONSENSE, lies, false claims, code that does not even compile, etc.

And from what I hear from colleagues, they feel the same.

And at the same time, I do not see any real-world application of AI other than "casual chatting" or coding no more complex than "what is 2+2?"

Can someone clarify this for me? Are there really good uses of AI?

https://redd.it/1pe6ho7
@r_devops
Bitbucket bait-and-switched, now charging $15/month per self-hosted runner

I saw this morning that Bitbucket has announced self-hosted runner v5 which comes with some interesting new features, but they also changed their pricing from no charge for self-hosted runners to $15/month per concurrent build slot. So now if you're trying to run multiple builds at once or parallelizing releases on your own hardware they want you to pay for the privilege.

This seems crazy to me as we are using self-hosted runners to save money by using our own hardware for builds. We just spent months moving a bunch of our pipelines over to BB and it just seems so wrong that after all that, they can just threaten to make our releases (which rely on parallelizing pipelines) take over 10x as long unless we want to pony up a monthly fee that we really can't afford on top of what we're already paying for users and hardware or instances to actually run the builds.

GitHub doesn't charge for self-hosted runners. GitLab doesn't either. It looks like CircleCI does, but the included concurrency is higher, or unlimited if you have an enterprise plan. So this feels like a total ripoff and a bait-and-switch, because they know moving to another CI platform is a massive undertaking.

https://www.atlassian.com/blog/bitbucket/announcing-v5-self-hosted-runners

https://redd.it/1pe8wzd
@r_devops
Snyk AI-BOM CLI launched on Product Hunt today

Hey ops friends, how are you getting a grip on scattered AI usage across the org?

Snyk launched AI-BOM on Product Hunt today, with a demo of how it works via the CLI:

$ snyk aibom --experimental

If you head over to producthunt.com and scroll down there's a video and more screenshots that show how it works.

Curious to get feedback and any input you have, if you are at all concerned about discovery and rogue usage of LLMs, AI libraries like LangChain, AI SDK, or other libraries without IT approval, or even just one-off MCP servers downloaded from the Internet.

https://redd.it/1pe928d
@r_devops
Observability Overload: When Monitoring Creates More Work Than It Saves

I've set up comprehensive monitoring and alerting, but now I'm drowning in data and alerts. More visibility hasn't made things better, it's made them worse.

**The problem:**

* Hundreds of metrics to track
* Thousands of potential alerts
* Alert fatigue from false positives
* Debugging issues takes longer because of so much data
* Can't find signal in the noise

**Questions:**

* How do you choose what to actually monitor?
* What's a reasonable alert threshold before alert fatigue?
* Should you be alerting on everything, or just critical paths?
* How do you structure alerting for different severity levels?
* Tools for managing monitoring complexity?
* How do you know monitoring is actually helping?

**What I'm trying to achieve:**

* Actionable monitoring, not noise
* Early warning for real issues
* Reasonable on-call experience
* Not spending all time responding to false alarms

How do you do monitoring without going insane?

https://redd.it/1pe7r1f
@r_devops
IT profile

Guys, I need help with something. I'm asking in all humility, so please don't make fun, lol.

I've been in the IT area for about 6 years, I started working as an IT intern, I did everything.

At the time I was working with ERP Protheus, it gave me very good information about the system, how a company operates, etc., but I didn't have much contact with anything.

I was hired as an assistant and later became an analyst. I was responsible for the IT department: support, networks, telephony, new solutions, updating and supporting the ERP, testing, and servers such as AD, DNS, DHCP, etc.

I changed jobs and joined as an analyst, it was just me in the department, a company with 250 employees.

I had to make do in my 30s, I had no passwords, no processes, no management... Nothing.

Today I am an IT supervisor and lead another analyst and other third parties who provide services.

I manage the network of the headquarters and branches, including markets, I am responsible for bringing new solutions, I create reports in SQL for senior management, I take care of cloud telephony, I am the administrator of the ERP system, I manage other security solutions, I manage cell phones with MDM, I design networks and cameras for new and existing units.

I feel like Severino and I don't even earn 5,000.00. I'm lost: there are so many fronts I need to focus on that I can't say what I am, what I do, how much I deserve, etc.

Has anyone reached this stage, and if so, what did you do to get out?

I see myself as more in the management field than in the technical field, but at the same time I like to be ahead and resolve particular issues that keep the company running.

Even though I do a lot of things and post about them on LinkedIn, not a single visitor has shown interest in me in all this time.

This makes me feel like I'm out of date and that companies don't look at professionals with my profile, which scares me.

https://redd.it/1pecqef
@r_devops
I built a tool that generates your complete reliability stack from a single YAML file

What it does:
Define service once in YAML (name, tier, dependencies, SLOs)
Generate: Grafana dashboards, Prometheus alerts, PagerDuty setup, SLOs
Technology-aware: knows PostgreSQL, Redis, Kafka, etc. have different metrics
See reliability health across all your services in one command

Example output for a payment-api service:
12-28 panel Grafana dashboard (based on dependencies)
400+ battle-tested Prometheus alerts
PagerDuty team, escalation policy, service (tier-based defaults)
SLO definitions with error budget tracking

Bonus - org-wide visibility:

$ nthlayer portfolio
Overall Health: 78% (14/18 SLOs meeting target)
Critical: 5/6 healthy
! payment-api needs reliability investment
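
For readers wondering how the headline number in that output could be derived, here is a minimal sketch of the arithmetic. It is not taken from nthlayer's source: the function names and the error-budget formula are assumptions, using the common definition of an error budget as the failures an SLO target allows.

```python
def portfolio_health(slos_meeting_target: int, total_slos: int) -> int:
    """Share of SLOs meeting target, as a whole-number percentage."""
    if total_slos <= 0:
        raise ValueError("total_slos must be positive")
    return round(100 * slos_meeting_target / total_slos)


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    Assumes a ratio-style SLO: slo_target is e.g. 0.999 for "99.9% good".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


# 14 of 18 SLOs meeting target, as in the sample output above.
print(portfolio_health(14, 18))
```

With 14 of 18 SLOs on target this gives 78, which matches the "Overall Health" line in the post.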

Works with your existing stack - generates configs for the tools you
already use.

Live demo: https://rsionnach.github.io/nthlayer

Early alpha - feedback welcome from folks who deal with this toil daily.

GitHub: https://github.com/rsionnach/nthlayer

https://redd.it/1pedzz0
@r_devops
So what does the career path of a really good DevOps engineer look like?

As a new grad in computer science and someone who's intermediate at full stack engineering, I've just decided to pivot to a junior DevOps role at a company my friend is referring me to. I found it interesting, and I also wrote a bit of code in Go and loved it.

I was curious: let's say you're a really good DevOps engineer who decides to work hard at it and get CKA and AWS certified. What does the career path of such an engineer look like, and what income levels can they reach?

And finally, what entrepreneurial opportunities are open to you with this skillset and experience in the tech industry? Consulting?

https://redd.it/1peorui
@r_devops
How good is devops as a career?

So, currently I am working as a QA at a certain company. I am currently doing my bachelor's and will graduate this coming September 2026. I am planning to choose DevOps as my career and will try to go abroad for further studies. How good is DevOps as a career, and how hard is it to reach a reasonably good level? What are the market requirements for a DevOps intern? Can anyone help me with this?


https://redd.it/1peps4g
@r_devops
Cloudflare is down again

All I see is "500 Internal Server Error"... almost everywhere...

Is it just me?

https://redd.it/1peqa4c
@r_devops
CycloneDX or SPDX

Hi everyone! We (BellSoft) are trying to determine which SBOM format to use for our hardened images. There are obvious considerations: SPDX is more about licenses, while CycloneDX is more about security.

But what we don't know is what people actually want, need, or prefer to use.

So, here's the question: what do you need/use/want? And another one: which of the tools you use support which format?

https://redd.it/1peqqdx
@r_devops
How do you guys get into Cloud with no previous experience?

Some things about me first.

I started out as a junior software engineer building websites. I found a lot of people were not paying, so I decided to chase my other love, security and hacking. I tried the freelance thing for ~2 years.

Then I started a new job in a security operations center. The job was fun at the start, but as I kept learning more and getting more responsibilities, I found out that it had nothing to do with what I had in mind, at least in most companies in Greece. At the end of the day, it was just us overselling other people's products. But I built up a lot of experience managing Linux servers, the ELK stack, networking, etc. I stayed in that job for 2 years.

Then I got an offer from a friend to work as a sysadmin. There I got to work with backups, deploying new software, Ansible, Jenkins, Hetzner (mostly managing dedicated servers), installing and managing DBs (MariaDB), proxies, caches, self-hosted email, DNS, and a loooot more in general. I also coded a lot in Go and Python, which I loved. I stayed there 4+ years. The job was fine, but the employer crossed a lot of lines that made people quit, and the environment stopped being what it was.

Then, thanks to all the knowledge I got from these jobs, I decided that I actually love what people call DevOps. And I chased that position next!

Now I have been working as a DevOps engineer for the past 5 years: working with Kubernetes (all kinds of flavors), deploying with Bamboo, automating a ton of stuff every day, managing VMs, dockerizing apps, deploying in all kinds of environments, and managing Kafka clusters (mainly CDC via Strimzi, sync via MM2). Lately I have been using Azure (Foundry + AI Search) to create agents that serve our documentation to users, to improve onboarding and generally assist people across all managerial positions who raise the same questions again and again, or developers who need specific environment info, how-tos, etc.

So what's all this intro about? Cloud is nowhere to be seen. Terraform is nowhere to be seen. ArgoCD is nowhere to be seen. And these are the big 3 right now in terms of wanted skills. I even made my own projects, used these tools, and got certifications (AZ-900, AZ-104, Terraform Associate), but I never got to use them since I got them, so now I can't say that I even know anything. It's been 3 years since I got these. And I can't go around paying out of my own pocket all the time to learn something that I won't get to use anytime soon.

My main problem is: how on earth do you get into these positions in any way other than taking a huge pay cut and starting again from a lower position? Companies, at least where I live, do not seem to care that these are all just tools, and that someone with experience will catch up fast, given some time.

You either have what they want or you don't. And with DevOps evolving every other year (with AI/MLOps being the new shiny thing), how can you get into these areas if your company is not set up to use these tools and technologies? I wish I had enough money to throw around on new projects. But I don't. How do you guys manage to keep up with evolving tech and not fall behind? What has your experience been so far with getting into positions where you lack some of the knowledge the listing asks for?

https://redd.it/1peqoij
@r_devops
Building a complete Terraform CI/CD pipeline with automated validation and security scanning

We recently moved our infrastructure team off a laptop-based Terraform workflow. The solution was layered validation in CI/CD. Terraform fmt and validate run in pre-commit hooks. tflint catches quality issues and deprecated patterns during PR checks. tfsec blocks security misconfigurations like unencrypted buckets or overly permissive IAM policies. Then Conftest with OPA enforces organizational policies that used to live in wikis.
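
To make the layering concrete, here is a rough sketch of running those checks in order and stopping at the first failure. It is not the author's pipeline: the plan file name `tfplan.json`, the `policy` directory, and the function names are assumptions, and in the post the first two checks run in pre-commit hooks rather than in one script.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def validation_steps(plan_json: str = "tfplan.json", policy_dir: str = "policy") -> List[List[str]]:
    """Commands for each validation layer, cheapest first.

    plan_json and policy_dir are hypothetical paths, not from the post.
    """
    return [
        ["terraform", "fmt", "-check", "-recursive"],  # formatting
        ["terraform", "validate"],                     # syntax and internal consistency
        ["tflint"],                                    # quality issues, deprecated patterns
        ["tfsec", "."],                                # security misconfigurations
        ["conftest", "test", plan_json, "--policy", policy_dir],  # OPA policies
    ]


def run_steps(
    steps: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> Optional[Sequence[str]]:
    """Run steps in order; return the first failing command, or None if all pass."""
    for cmd in steps:
        if run(cmd) != 0:
            return cmd
    return None
```

Passing `run` as a parameter keeps the ordering logic testable without any of the tools installed.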

One key decision was using OIDC authentication instead of long-lived access keys. GitHub Actions authenticates directly to AWS without storing credentials. Every infrastructure change requires PR review, shows the plan output as a comment, and needs manual approval before apply runs.

Drift detection runs on a schedule and creates issues when it finds manual changes. Infracost posts cost estimates in PRs so expensive mistakes get caught during review. The entire pipeline uses open-source tools and works without Terraform Cloud.
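
The post does not say how drift is detected. One common way, assumed here, is `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. A minimal sketch (function names and the working directory are hypothetical):

```python
import subprocess
from typing import Callable, Sequence


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded, nothing to change
    if code == 2:
        return "drift"    # plan succeeded, live state differs from code
    return "error"        # plan itself failed


def check_drift(
    workdir: str,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> str:
    """Run a plan in workdir and report whether drift was found."""
    cmd = ["terraform", f"-chdir={workdir}", "plan", "-detailed-exitcode", "-input=false"]
    return classify_plan_exit_code(run(cmd))
```

A scheduled job could call `check_drift` and open an issue only when the result is `"drift"`.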

Starting advice: don't enable every security rule at once. You'll get 100+ warnings and your team will ignore it. Start with HIGH severity findings, fix those, then tighten gradually.
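
As an illustration of "start with HIGH, then tighten", here is a small sketch that blocks only on findings at or above a chosen severity. The shape of a finding (a dict with a `severity` key) and the severity names are assumptions for illustration, not any scanner's actual output format.

```python
from typing import Dict, List

# Ordered from least to most severe; names are assumed, not tool-specific.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(findings: List[Dict[str, str]], minimum: str = "HIGH") -> List[Dict[str, str]]:
    """Return only the findings at or above the minimum severity."""
    threshold = SEVERITY_ORDER.index(minimum.upper())
    return [
        f for f in findings
        if SEVERITY_ORDER.index(f["severity"].upper()) >= threshold
    ]


findings = [
    {"rule": "example-unencrypted-bucket", "severity": "HIGH"},
    {"rule": "example-missing-tags", "severity": "LOW"},
]
# Fail the check only when something at HIGH or above is present.
should_fail = bool(blocking_findings(findings))
```

Lowering `minimum` to `"MEDIUM"` later tightens the gate without changing anything else.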

I documented the complete setup with working GitHub Actions workflows and policy examples: Production Ready Terraform with Testing, Validation and CI/CD

What's your approach to Terraform governance and automated validation?

https://redd.it/1pet61x
@r_devops
How did you reduce testing overhead at your startup without sacrificing quality?

Our engineering team is 8 people and we're drowning in testing overhead. Between unit tests, integration tests, and e2e tests we're spending almost 30% of sprint time on testing related work (writing, maintaining, fixing flaky tests).

Don't get me wrong, I know testing is important and we've caught a lot of bugs before production. But the overhead is getting ridiculous: we're moving slower than our competitors because we're spending so much time on test maintenance.

Curious how other startups have tackled this, especially teams that scaled testing without adding dedicated QA headcount. Did you find better tools? Change your testing strategy? Just accept the overhead as the cost of quality?

We're using Playwright right now, which is better than Selenium but still requires constant maintenance. Every UI change breaks tests, even with data-testid attributes. CI times are also getting long, which slows down deployment velocity.

Looking for practical advice from people who've actually solved this, not theoretical best practices. What worked for you?

https://redd.it/1peu9wa
@r_devops
Finally cut our CI/CD test time from 45min to 12min, here's how

We had 800 tests running in the pipeline, taking forever and failing randomly. Devs were ignoring test failures and just merging because nobody trusted the tests anymore.

We tried a bunch of things that didn't work: we parallelized more but hit resource limits, split tests into tiers but then missed bugs, and rewrote flaky tests but new ones kept appearing.

What actually worked was rethinking our whole testing approach. We moved away from traditional selector-based testing for most functional tests because those were breaking constantly on UI changes. We kept some integration tests for specific scenarios but changed the approach for the bulk of our test suite.

We also implemented better test selection so we're not running everything on every PR. It's a risk-based approach: we analyze what code changed and run the relevant tests, while the full suite still runs on the main branch and nightly.
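
The post gives no implementation details, so the following is only a sketch of what change-based test selection could look like. The path patterns, test directories, and the rule "unknown file means run everything" are all assumptions.

```python
from fnmatch import fnmatch
from typing import Dict, Iterable, List

# Hypothetical mapping from source path patterns to the tests that cover them.
TEST_MAP: Dict[str, List[str]] = {
    "src/billing/*": ["tests/billing"],
    "src/auth/*": ["tests/auth"],
    "src/shared/*": ["tests/billing", "tests/auth"],
}
FULL_SUITE = ["tests"]


def select_tests(changed_files: Iterable[str], branch: str,
                 test_map: Dict[str, List[str]] = TEST_MAP) -> List[str]:
    """Pick test targets for a change; fall back to the full suite when unsure."""
    if branch == "main":
        return FULL_SUITE  # main (and nightly) always run everything
    selected = set()
    for path in changed_files:
        matches = [tests for pattern, tests in test_map.items() if fnmatch(path, pattern)]
        if not matches:
            return FULL_SUITE  # unmapped file: be conservative
        for tests in matches:
            selected.update(tests)
    return sorted(selected)
```

The conservative fallback matters: a selection scheme that silently skips tests for unmapped files is how the "split into tiers but missed bugs" problem tends to come back.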

The pipeline now runs in about 12 min on average, and test failures actually mean something again. Devs trust them enough to investigate instead of just rerunning, and it literally feels like we finally have a sustainable QA process in CI/CD.

https://redd.it/1pev57u
@r_devops
Yeah... it's Datadog again. How do you cope with that?

So we got a new bill, again over target. I've seen this story over and over on this sub, and each time the advice was:

- check what you don't need
- apply filters
- change retentions, etc.

Maybe, just maybe, this time someone will have some new ideas on how to tackle the issue more broadly?

https://redd.it/1peoe0x
@r_devops
Hosting 20M+ Requests Daily

I’ve been reading the HN comments on the battle between Kubernetes and tools like Uncloud. It reminded me of a real story from my own experience: how I hosted 20M+ daily requests and thousands of WebSocket connections.

Once, some friends reached out and asked me to build a crypto mining pool very quickly ("yesterday"). The idea was that if it took off, we would earn enough to buy a Porsche within a month or two. (We almost made it, but that’s a story for another time.)

I threw together a working prototype in a week and asked them to bring in about 5 miners for testing. About 30 minutes later, 5 miners arrived. An hour later, there were 50. Two hours later, 200. By the next day, we had over 2000, ...

The absolute numbers might not look stunning, but you have to understand the behavior: every client polled our server every few seconds to check if their current block had been solved or if there was new work. On top of that, a single client could represent dozens of GPUs (and there was no batching or anything). All of this generated significant load.
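
A quick back-of-the-envelope sketch shows why polling adds up so fast. The 5-second interval is an assumption (the post only says "every few seconds"), and each client is counted as a single poller even though one could stand for dozens of GPUs.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(clients: int, poll_interval_s: float) -> float:
    """Daily request count if every client polls at a fixed interval."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return clients * SECONDS_PER_DAY / poll_interval_s


def interval_for_daily_load(clients: int, daily_requests: float) -> float:
    """Poll interval (seconds) at which the clients produce the given daily load."""
    if daily_requests <= 0:
        raise ValueError("daily_requests must be positive")
    return clients * SECONDS_PER_DAY / daily_requests


# 2000 clients polling every 5 seconds (assumed interval).
print(requests_per_day(2000, 5))
# Interval at which 2000 clients alone would reach 20M requests per day.
print(interval_for_daily_load(2000, 20_000_000))
```

Under these assumptions, 2000 clients at a 5-second interval already produce about 34.5 million requests a day, and a poll roughly every 8.6 seconds is enough to reach 20 million.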

By the time we hit 200 users, I was urgently writing a cache and simultaneously spinning up socket servers to broadcast tasks. I had absolutely no time for Kubernetes or similar beauty. From the moment of launch, everything ran on tmux ;-)

At the peak, I had 7 servers running. Each hosted a tmux session with about 8-10 panes.

Tmux acted not just as a multiplexer, but as a dashboard where I could instantly see the logs for every app. In case a server crashed, I wrote a custom script to automatically restore the session exactly as it was.

This configuration survived perfectly until the project eventually died.

Are there any lessons or conclusions here? Nope ;-) The whole thing came together by chance and didn't last as long as we’d hoped.

But ever since then, whenever I see a discussion about these kinds of tools, an old grandpa deep inside me wakes up, smiles, and says: "If I may..."



https://redd.it/1peyjbq
@r_devops
Transition from backend to devops/infrastructure/platform

How did you transition from a backend to a platform/infra position?

I find myself really bored with developing backend business stuff. However I find myself really interested in the infrastructure side of things. K8s, containers, monitoring and observability. And each time I discover new tools, I feel really excited to try them out.

Also, it feels like the infra side of things has a lot of interesting problems, and I gravitate towards these. How would I slowly transition towards these roles? I’m also thinking of studying and getting the CKA cert next year.

https://redd.it/1pf1g3d
@r_devops
Made a nifty helper script for acme.sh

I recently had trouble with user permissions while configuring slapd on Alpine. So I made this little script called apit to "config"fy the installation of certs. It is just 100 lines of pure UNIX sh, and should work everywhere.

Sharing it here in the hopes it might be useful for someone.

https://redd.it/1pf3lab
@r_devops
The Missing Foundation of Non-Human Identity

I’ve been working on an identity/authorization system for machines and kept getting stuck on a basic question: what is machine identity, independent of any one stack (Kubernetes, cloud, OAuth, etc.)?

This post proposes a simple model based on where identity originates (self-proven / attested / asserted), what privileges it has at birth, and how it lives over time (disposable vs durable). I’ve also mapped common systems like SSH, SPIFFE/SPIRE, API keys, IoT, and AI agents into it.
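
As a reading aid, the three axes named above can be written down as a small data type. This is only a sketch of the axes as summarized here, not the linked post's own formalization; the type names, the permission strings, and the example workload are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Origin(Enum):
    """Where the identity comes from."""
    SELF_PROVEN = "self-proven"
    ATTESTED = "attested"
    ASSERTED = "asserted"


class Lifetime(Enum):
    """How the identity lives over time."""
    DISPOSABLE = "disposable"
    DURABLE = "durable"


@dataclass(frozen=True)
class MachineIdentity:
    name: str
    origin: Origin
    birth_privileges: FrozenSet[str]  # what it may do the moment it exists
    lifetime: Lifetime

    def is_born_unprivileged(self) -> bool:
        """True if the identity starts with no privileges at all."""
        return not self.birth_privileges


# Hypothetical example, not a classification taken from the post.
build_job = MachineIdentity(
    name="example-build-job",
    origin=Origin.ATTESTED,
    birth_privileges=frozenset({"read:artifacts"}),
    lifetime=Lifetime.DISPOSABLE,
)
```

Writing the axes as independent fields makes it easy to ask whether a real system fits one value per axis, which is the kind of counterexample the author is asking for.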

I’d be very interested in counterexamples, ways this breaks down in real systems, or prior art I’ve missed.

Here's the post: https://www.hessra.net/blog/the-missing-foundation-of-non-human-identity

https://redd.it/1pf76t3
@r_devops
In AI/infra/devtools companies with usage-based pricing, who actually owns “adoption”?

In a lot of AI / infra / devtools products that charge by usage (requests, tokens, build minutes, cluster hours, etc.), there’s this blurry line after the deal is closed:

On paper, it looks like “someone on the post-sales side” owns adoption,
but in reality, I keep hearing about Solution Architects, Technical Account Managers, “technical success” folks, field engineers, SREs, and even core engineers getting dragged in when a key account’s usage isn’t where it’s supposed to be.

Sometimes usage is way below what was expected, sometimes it spikes in weird ways, sometimes it’s flat, but everyone feels something is off. And then suddenly there’s a Slack war room and a bunch of people with very different goals looking at the same graphs.

In your org (AI/infra/devtools, usage-based or pay-as-you-go):

When usage is clearly off for an important customer, who actually takes the lead on figuring out what’s going on and what to do about it, and what does that usually look like from your side?

Curious how this plays out in real life vs. how the org chart says it should.

https://redd.it/1pf8mi5
@r_devops
Job Switch

Currently working as a devops engineer and I like it a lot, been doing this for about 7-8 years. I want to switch into more backend/distributed systems but not sure what programming languages are best for this. I see it being split between Python & Go.

For anyone who has transitioned from DevOps to BE/DSE or the other way around: what language would you say is best to learn?

I’m trying to lock in for the next 12 months alongside grad school.

https://redd.it/1pf9974
@r_devops