Reddit DevOps – Telegram
Did I break the server, or was it already broken?

I work at a mid-sized AEC firm (~150 employees) doing automation and computational design. I'm not a formally trained software developer - I started in a more traditional domain expertise role and gradually moved into writing C# tools, add-ins, and automation scripts. There's one other person doing similar work, but we're largely self-taught.

Our file infrastructure runs on a Linux Samba server that holds 100TB+ of data and serves all 150 employees plus maybe 50 more users. The development workflow that existed when I started was to work directly on the network drives. The other automation developer has done this for years with smaller projects, and it seemed to work fine.

What Happened

I started working on a project to consolidate scattered scripts and small plugins into a single, cohesive add-in. This meant creating a larger Visual Studio solution with 30+ projects - basically migrating from "loose scripts on the network" to "proper solution architecture on the network."

Over 7-8 days, the file server experienced complete outages lasting 30-40 minutes daily. Users couldn't access files, work stopped, and IT had to investigate. IT traced the problem to my user account holding approximately 120 simultaneous file handles - significantly more than any other user (about 30).

IT emailed my manager and his boss asking them to investigate what I'm doing and why I might be locking so many files, essentially framing me as the main cause of the outages. The other cause they named is that the latest version of the main software used in the AEC field (Autodesk Revit) is designed to create many small files locked by each individual user. Even if that is true, it sounds to me like a ridiculous explanation for a server crash.

Should a production file server serving 200 users be brought down by one user's 120 file handles? I've already moved to local development - that's not the question. I want to understand whether I did something genuinely problematic or whether the server couldn't handle a normal development workload. Even if my workflow was suboptimal, should it be possible for one developer opening Visual Studio to bring down the entire file server for half an hour? This feels like a capacity planning issue.

https://redd.it/1qx1u5r
@r_devops
What is your biggest pain point

Seriously wondering this.

I am a non-technical individual. In fact, I am a recruiter for VC-backed early-stage tech companies in AI/Infrastructure/Data. I partner with VCs and build GTM teams for startups.

I am currently working with a cyber vendor who quite literally is a couple of guys with no founder or cyber experience, but who were just recognized by Insight Partners. They literally just went out and asked CISOs what they struggled with and were able to make something from nothing with the right people.

Not saying that I could ever do that, but I want to find the people who are solving the problems you all have in common.

Are each of these AI tools making life easier? Is some form of consolidation needed, given the conflict of interest between code generation and code review tools? Are AI workflow tools good, or has n8n cornered the market with nowhere left to improve?

So many questions. Explain it to me like a 5 year old.

https://redd.it/1qx4eh5
@r_devops
We’re testing double enforcement for irreversible ops after restart/retry issues

We’ve been running into the same operational question:
What actually protects an irreversible external mutation if the service restarts after authorization but before commit?
Most flows authorize once at ingress and then execute later. But between those two points we’ve seen:

* pod restarts
* retry storms
* duplicated webhooks
* race conditions across workers
* stale grants surviving longer than expected

Ingress validation alone doesn’t protect the commit moment.
So we’re testing a stricter pattern:

Gate A validates the proposed action at ingress (ordering + replay protection).
The system processes normally.

Gate B re-validates the same bound action immediately before the external mutation (idempotency + continuity check).
If either fails, the operation freezes instead of attempting the external call.
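
To make the two gates concrete, here is a minimal Python sketch of the pattern as described. Every name in it (`DoubleGate`, `gate_a`, `gate_b`, `Frozen`, `grant_ttl_s`) is hypothetical, and the state lives in memory purely for illustration; surviving the restarts mentioned above would require a durable store shared across workers. It also assumes each action carries its own unique request ID so that two legitimate, otherwise identical operations get different fingerprints.

```python
import hashlib
import json
import time


class Frozen(Exception):
    """Raised when a gate fails; the operation is parked instead of executed."""


def bind(action: dict) -> str:
    """Stable fingerprint of the proposed action (the "bound action")."""
    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()


class DoubleGate:
    def __init__(self, grant_ttl_s: float = 30.0):
        self.grant_ttl_s = grant_ttl_s
        self.seen_nonces = set()  # Gate A: replay protection
        self.last_seq = {}        # Gate A: ordering, per actor
        self.committed = set()    # Gate B: idempotency keys

    def gate_a(self, actor, seq, nonce, action, now=None):
        """Ingress check: reject replays and out-of-order requests, issue a grant."""
        now = time.time() if now is None else now
        if nonce in self.seen_nonces:
            raise Frozen("replayed request")
        if seq <= self.last_seq.get(actor, -1):
            raise Frozen("out-of-order request")
        self.seen_nonces.add(nonce)
        self.last_seq[actor] = seq
        return {"digest": bind(action), "issued_at": now}

    def gate_b(self, grant, action, now=None):
        """Commit-edge check: same action, grant still fresh, not yet committed."""
        now = time.time() if now is None else now
        digest = bind(action)
        if digest != grant["digest"]:
            raise Frozen("action changed since authorization")
        if now - grant["issued_at"] > self.grant_ttl_s:
            raise Frozen("stale grant")
        if digest in self.committed:
            raise Frozen("already committed")
        # Recorded before the external call, so a crash mid-call freezes the
        # operation for review rather than retrying it blindly.
        self.committed.add(digest)
        return digest

    def commit(self, grant, action, mutate, now=None):
        """Run Gate B, then perform the external mutation with an idempotency key."""
        key = self.gate_b(grant, action, now)
        return mutate(action, key)
```

Recording the key before calling `mutate` is one possible choice: it favors "freeze and investigate" over "retry", which matches the freeze behaviour described above but means a failed call needs manual or out-of-band resolution.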
We’re specifically testing this against real external side effects (payments, state transitions, etc.) under forced restarts and concurrent retry scenarios.
Curious how others handle this boundary.
Do you rely on idempotent APIs downstream and ingress validation upstream, or do you re-enforce at the commit edge as well?

https://redd.it/1qx5bvm
@r_devops
Unable to get to interview stage after screening

Hi guys, I was recently part of an organization restructure and got laid off. So I’ve been looking for new roles for the past two weeks, and I’ve applied to around 70+ roles. I’ve heard back from about 7–8 for initial screenings, where they said it’s a great match and that they would forward my resume to the hiring manager, but then nothing has happened.

For example, I applied to Deloitte and the recruiter did a phone screening on Tuesday and seemed happy with me, but it’s Friday now and still nothing. Yesterday, a recruiter at another company told me he was really busy and asked me to call him. When I did, he said he’d like to bring me in for an interview and would call me back, but he had to rush to a meeting. Since then, no callback. I tried following up and calling again today but it went to voicemail (he did say he’s on his phone a lot and very busy).

Other companies have sent technical tests or done initial calls, and same thing — nothing since.

Am I being impatient? I haven’t been out in the job market for 4–5 years, so I’m not sure what the normal pace is now, because my previous interview process was all sorted in a week from screening to the offer letter.

https://redd.it/1qx7d1s
@r_devops
Resources to learn CrossPlane

Hi everyone! I want to learn how to set up and use Crossplane. Are there any resources online similar to cloudguru/kodekloud for this, or just the Crossplane docs?

https://redd.it/1qx8jci
@r_devops
New to AI tools, looking for real-world recommendations

Hi, I’m pretty new to AI and trying to figure out which tools are actually worth using.
What websites do you rely on for work, studying, or daily tasks?
Would love to hear what’s been useful for you.

https://redd.it/1qx74i2
@r_devops
$225 in prizes - incident diagnosis speed competition this Saturday

Hosting a live incident diagnosis competition this Saturday, 1pm-1:45pm PST on Google Meet.

2 rounds, 2 incidents. You get access to our playground telemetry, GitHub, Confluence docs. First person to find the root cause, present evidence, and propose a fix wins.

Prizes
- 1st: $100 Amazon gift card
- 2nd: $75
- 3rd: $50

At the end, we'll show what our AI found for the same incidents, and how long it took. Humans only for the prizes though.

Think of it as a CTF but for incident response.

DM me to sign up!

https://redd.it/1qxaay5
@r_devops
What should be the next step in DevOps?

Whenever people talk about DevOps, all I hear is that Terraform is the word on everyone's lips now, along with IaC and the like. But as someone who wants to move into DevOps, what would be the best way to use all these different tools and build projects?

I know that projects in the DevOps domain are not the same as projects in other domains. With an ML pipeline, I would build it, post it on GitHub, and be done. DevOps projects don't work that way. Any suggestions on how to build DevOps projects?

https://redd.it/1qxc50o
@r_devops
Do you commit Helm charts to your Git repo or pull them on the fly?

Hi, I have a question:

When using open-source tools like Prometheus, Grafana, or Ingress-NGINX in production, do you:

* Keep the full chart source code in your repo (vendoring)?
* Or just keep a Chart.yaml with dependencies (pointing to public repos) and your values.yaml?

I see the benefits of "immutable" infrastructure by having everything locally, but keeping it updated seems like a nightmare. How do you balance security/reliability with maintainability?

I've had situations where the repository became unavailable after a while. On the other hand, downloading everything and pushing it to your own repository is tedious.

Currently using ArgoCD, if that matters. Thanks!

https://redd.it/1qxc9of
@r_devops
Stop writing brittle Python glue code for your security pipelines (Open Source)

In every DevOps role I've had, "security automation" usually meant a folder full of unmaintained Python or Bash scripts running on a random Jenkins node.

It works until the API changes, or the guy who wrote it leaves.

We wanted a proper orchestration layer for this stuff without paying $50k for enterprise SOAR tools. So we built ShipSec Studio and open-sourced it.

It’s a visual workflow builder that lets you chain tools together.

What it replaces:

* Writing a script to parse Trufflehog JSON output.
* Manually hooking up Nuclei scans to Jira/Slack.
* Cron jobs for cloud compliance checks (Prowler).

You can drag-and-drop the logic, handle errors visually, and deploy it via Docker on your own infra.

We just released it under Apache. We’re a small team trying to make security automation accessible, so if you think this is useful, a star on the repo would mean a lot to us.

Repo: github.com/shipsecai/studio

Let me know if you run into any issues deploying the container.

https://redd.it/1qxe2mj
@r_devops
Choosing DevOps instead of SDE: is it a good choice? More info in body

Hello,

I'm a fresher and have been actively applying for jobs since December (mostly SDE and full-stack roles).

I can clearly see that entry-level jobs are slowly vanishing; even when I find something, it asks for 2+ years of experience.

It's my personal belief that AI is slowly killing junior and entry-level roles.

It made me wonder: is there any entry-level role that can't be affected by AI?

I asked some people in my circle, and one of my friends said DevOps. I don't know whether that's true.

That's why I'm asking you all.

Does DevOps have more job potential than SDE/full-stack in the current situation?

Is it a good idea to switch to DevOps, or should I continue on the SDE path?


Thanks for reading this far!!!

https://redd.it/1qxeiq3
@r_devops
Too many reports

Hello,

I’m working on CI/CD pipelines where we’re generating more and more reports from different tools:

* SonarQube (code quality, coverage, technical debt)
* Test frameworks (Vitest, Jest, Selenium, Playwright, Cypress…)
* Sometimes performance / E2E tests as well

Each tool outputs its own format (often JSON / XML / HTML), and in the end the information is scattered all over the place.

How do you handle this on your side? Do you use a dedicated tool, a shared folder on the network, or something else to store everything? (If you have a solution name, I’m definitely interested.)
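
As a point of comparison for "something else", one low-tech option is to reduce each report to a few numbers in the pipeline itself before storing anything. The Python sketch below assumes the test tools are configured to emit JUnit-style XML (a `<testsuite>` or `<testsuites>` root with `tests`, `failures`, `errors`, and `skipped` attributes); that format and the function name `summarize_junit` are assumptions for illustration, not something stated in this post.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Collapse one JUnit-style XML report into total counts."""
    root = ET.fromstring(xml_text)
    # A report is either a single <testsuite> or a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            # Missing attributes are treated as zero.
            totals[key] += int(suite.get(key, 0))
    return totals
```

The resulting dictionaries are small enough to attach to the pipeline run as a single JSON artifact, with the full HTML reports kept only for failed runs.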

I’m mainly looking for real-world feedback to avoid building an overcomplicated Rube Goldberg machine.
Thanks in advance 🙏

https://redd.it/1qxfgnc
@r_devops
German DevOps Community

Hi folks, I'm looking to switch jobs in Germany. So far I have always known somebody at the company I was switching to, and dealing with all these external recruitment companies seems like a pain. I just had an unpleasant experience with a recruiter who called themselves a DevOps team lead because they have been handling external DevOps recruitment for a few years, but who was of course not tech-savvy.

So basically I'd like to skip external recruitment and find a German community of DevOps engineers (or adjacent fields) where I can interact, find out about open job listings, talk a bit, and maybe get a referral.

Is somebody aware of such a space or something similar?

https://redd.it/1qxgb5h
@r_devops
deeploy v0.2.0 - lightweight Git-to-container PaaS for single-node DevOps setups

Built a small self-hosted PaaS for teams/projects that don’t need Kubernetes overhead.

Deploy from git, run on Docker, manage projects and pods via a panel-based TUI.

Designed for simple VPS or homelab infra. Uses Docker + SQLite.

Curious how others approach single-node deployment workflows.

https://github.com/deeploy-sh/deeploy
https://deeploy.sh

https://redd.it/1qxj19u
@r_devops
How to transition from Technical Support Engineer at Microsoft to a DevOps role (long-term plan advice needed)

I’m starting as a Technical Support Engineer (IC1) at Microsoft after months of job searching and want to eventually move into DevOps / SRE.

For those who’ve gone from support → DevOps:

- What skills mattered most (automation, Linux, cloud, etc.)?

- How long did you stay in support before moving?

- Is internal mobility realistic or is switching companies easier?

- What mistakes should I avoid early on?

I don’t want to rush, but I also don’t want to stagnate. Any real-world advice would help.

https://redd.it/1qxhqwb
@r_devops
RubyShell scripting tool v1.5.0 released!!

A library made to help devs create automations, CLI software, and user scripts.

Coming soon: the `sh.remote` command, which executes RubyShell blocks on remote servers via SSH, bringing the same familiar syntax to remote administration.

    # Run a couple of commands on a single remote host
    sh.remote("user@server") do
      ls("-la")
      cat("/etc/hostname")
    end

    # Deploy over a non-default SSH port
    sh.remote("deploy@production", port: 2222) do
      cd("/var/www/app")
      git("pull", "origin", "main")
      bundle("install")
      systemctl("restart", "app")
    end

    # Repeat the same block across several servers
    %w[web1 web2 web3].each do |server|
      sh.remote("admin@#{server}.example.com") do
        apt("update")
      end
    end

https://redd.it/1qxln25
@r_devops
Trying to move from IT support / managed services into DevOps or Solutions Architect. Where do I realistically start?

Hi everyone,

I’m trying to move into a DevOps/Solutions Architect path and I honestly don’t know where to start.

A bit about me for context: I’m currently working in Managed Services and incident management, dealing with tickets, change management, service delivery, Jira, RCA and daily operations. I’ve completed ITIL Foundation and CompTIA Cloud+ (CV0-004). I also have a background in basic networking, Linux fundamentals and some coding.

My problem is this: I don’t know what a realistic and practical roadmap looks like.

Can someone please help me understand:

• Should I focus on AWS or Azure first (and why)?

• Is there a good learning platform you would actually recommend for this path?

• What order should I follow when learning DevOps or cloud engineering properly?

• What kind of projects should I be building as a beginner, and how do I even start building them?

• How do I move from a support and operations role into a DevOps or Solutions Architect role in a realistic way?

I’m not looking for shortcuts. I just need a clear direction and a structured path so I don’t keep jumping between tools and courses without progress.

https://redd.it/1qxewvt
@r_devops
Is the SRE title officially a trap?

I've noticed a trend lately: 'Platform Engineer' roles seem to get to build the cool internal tools and IDPs, while 'SRE' roles are increasingly becoming the catch-all bin for "everything that is broken in production."

It feels like the SRE title is slowly morphing back into "Ops Support" while the actual engineering work shifts to Platform teams.

If you were starting over in 2026, would you still aim for SRE, or pivot straight to Platform/Cloud Engineering?

https://redd.it/1qxoqcr
@r_devops
What is your logging format - trying to configure my k8s logging

Hello. I am evaluating otel-collector and grafana alloy, so I want to export some of my apps logs to Loki for developers to look at.

However, we have a mix of logs - JSON and logfmt (python and go apps).

I understand that the easiest and most straightforward option would be to log in JSON format, and I made that work with otel-collector. Easy. But I cannot quite figure out how to enable logfmt support. Is there no straightforward way?

Is it worth spending time on supporting logfmt, or should I just configure everything to log in JSON?
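
For a sense of how small the gap between the two formats is, here is a rough Python sketch that turns a logfmt line into the same kind of record a JSON logger would emit. It is only an illustration of the conversion, not collector configuration; the function names `logfmt_to_dict` and `normalize` are made up, and it leans on `shlex` for quoting, so an unquoted value containing a stray apostrophe or quote would raise `ValueError`.

```python
import json
import shlex


def logfmt_to_dict(line: str) -> dict:
    """Parse one logfmt line (key=value pairs, optional double quotes) into a dict."""
    record = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        # A bare key with no "=" is treated as a boolean flag.
        record[key] = value if sep else True
    return record


def normalize(line: str) -> str:
    """Return the line as a JSON string, whether it arrived as JSON or logfmt."""
    line = line.strip()
    if line.startswith("{"):
        return json.dumps(json.loads(line), sort_keys=True)
    return json.dumps(logfmt_to_dict(line), sort_keys=True)
```

Note that logfmt values all come out as strings (`"200"` rather than `200`), which is one practical argument for standardizing on JSON at the source if the apps can be changed.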

I am new to this new world of logging, please advise.

Thanks.

https://redd.it/1qxn9f7
@r_devops
Fellow old-heads that got out, what does your career look like these days?

I'm pushing 40 years of physical existence, and 15 of those have been spent staring at AWS consoles and terminal windows. I'm not burnt out at the moment, but I wonder as I sit here and let Claude write an entire Python script to make some quick backend changes to a couple dozen GitHub repos (that management requested this morning but apparently needed two weeks ago), what's next? The story seems to be the same everywhere I go: A) join promising startup, do interesting work for a few years, C-suite cycles out, company either crashes, spins its wheels for another few years, or we get acquired, or B) come close to jumping off a bridge studying for big tech roles, only to get to the final round to be told, "hey, we were just kidding about full remote the three times you asked us, we need you in [insert city 1000 miles away here] with a 2.5x CoL". If the market was better I'd start pivoting towards full-on software engineering, but alas, many of our glorious technological leaders decided it was a good idea to cozy up to whatever governmental facade of the time would give them quick quarterly wins and over-gorged shareholders, so here we are.

For those of you older DevOps folk that successfully escaped and made career transitions without taking huge hits to your comp, what are you doing these days? Are you happy (or at least content)? Do you have regrats?

From a quick search, it seems a lot of the recent threads asking these questions are from AI doomers (which you know, understandable, I get it and hate it... but damn does it make reading Terraform docs so much easier) and folks unknowingly knee deep in a burn-out cycle; I want to hear from people that took the plunge and are happy with it, or at the very least, content not being in Cloud Infrastructure.

https://redd.it/1qxrlkh
@r_devops
Team is relying on hardcoded real IPs in nginx for local testing and ifconfig IP aliasing, with DB root access for everyone. What are the risks?

Hi all,

Looking for a sanity check from people with more infra experience.



Our rough setup looks like this:

* Prod and staging running in cloud (EC2)
* Databases and services in private IP space
* DNS names resolve to these private IPs

For local dev and testing, everyone is instructed to do this:

* use ifconfig to alias a real internal IP
* hardcode the IP in nginx config
* use same DNS names locally as in staging and prod
* use root access for DB

I wonder about routing ambiguity.

What happens if some people are accidentally on the VPN and some are not, or if someone forgot the ifconfig step (whether on or off the VPN) and then runs commands against the database?

Is there a risk that people end up hitting prod/staging/other people's machines instead of their local DB?
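
One way to take the guesswork out of that question is a preflight check that refuses to run unless the name really resolves to an address owned by the local machine. The Python sketch below is an assumption-laden illustration, not part of the setup described: the function names are made up, it covers IPv4 only, and it relies on the fact that binding a socket normally succeeds only for locally assigned addresses (a host with non-local bind enabled would defeat it).

```python
import socket


def is_assigned_locally(ip: str) -> bool:
    """True if this machine owns `ip` (binding normally only works for local addresses)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, 0))  # port 0: let the OS pick any free port
            return True
        except OSError:
            return False


def check_target(hostname: str) -> str:
    """Resolve `hostname` and fail loudly if it would leave this machine."""
    ip = socket.gethostbyname(hostname)
    if is_assigned_locally(ip):
        return f"{hostname} -> {ip} (local)"
    raise RuntimeError(f"{hostname} -> {ip} is NOT local; refusing to continue")
```

If the ifconfig alias is missing, the bind fails and the check raises instead of letting the command travel over the VPN to whichever real host owns that IP.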

https://redd.it/1qxs0g8
@r_devops