Reddit DevOps – Telegram
Is DevOps getting harder, or are we just drowning in our own tooling?

Has DevOps actually become more complex, or have we slowly buried ourselves under layers of tools, scripts, and processes that nobody fully understands anymore?

Across our org, we somehow ended up with ArgoCD for some teams, Jenkins for others, GitHub Actions in a few pockets, and someone even brought in Prefect just for one workflow. On the infra side we have Terraform, but also Pulumi for one team’s project, plus Datadog and Prometheus running in parallel because no one wanted to kill either one.

Then testing and quality brought their own mix. Some people track work in plain sheets, others use light test management options like Qase or Tuskr, and analytics has its own stack with Mixpanel, Amplitude, and random scripts floating around. None of these tools are bad, but together they create maintenance overhead that quietly grows in the background.

At this point, every deployment touches five separate systems and at least one integration someone wrote two years ago and swears is “temporary”. When something breaks, half the time we are troubleshooting the toolchain instead of the code.

How do your teams deal with this?
Do you standardize everything hard?
Let teams pick their stack as long as they own the pain?
Or is a certain level of tool chaos just the reality of modern DevOps?

Where do you personally draw the line?

https://redd.it/1p04lsx
@r_devops
Centralising compliance across clouds. Is it worth building our own pipeline?

Maybe we should build our own internal compliance reporting pipeline instead of relying on native tools. Hear me out. We could pull logs from CloudTrail, Azure Monitor, and GCP Logging, dump everything into a data lake or SIEM, and run standard queries and dashboards. Yes, it’ll take effort up front, but the payoff could be huge in terms of audit readiness and consistency. On the other hand, maintaining that might become its own beast. Has anyone built something like this?
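
To make the "standard queries" idea concrete, one option is to normalise each provider's audit event into a single shape before it lands in the lake. Below is a minimal Python sketch under assumptions: the `AuditEvent` type and the `normalize_*` helpers are hypothetical, and the field names follow the commonly documented CloudTrail, Azure activity log, and GCP audit log layouts in simplified form, so they should be checked against real exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical common schema for cross-cloud audit records."""
    cloud: str
    timestamp: str
    actor: str
    action: str


def normalize_cloudtrail(e: dict) -> AuditEvent:
    # Assumed CloudTrail fields: eventTime, eventName, userIdentity.arn
    return AuditEvent(
        cloud="aws",
        timestamp=e["eventTime"],
        actor=e.get("userIdentity", {}).get("arn", "unknown"),
        action=e["eventName"],
    )


def normalize_azure(e: dict) -> AuditEvent:
    # Assumed Azure activity log fields: time, caller, operationName
    return AuditEvent(
        cloud="azure",
        timestamp=e["time"],
        actor=e.get("caller", "unknown"),
        action=e["operationName"],
    )


def normalize_gcp(e: dict) -> AuditEvent:
    # Assumed GCP audit log fields: timestamp, protoPayload.methodName,
    # protoPayload.authenticationInfo.principalEmail
    payload = e.get("protoPayload", {})
    return AuditEvent(
        cloud="gcp",
        timestamp=e["timestamp"],
        actor=payload.get("authenticationInfo", {}).get("principalEmail", "unknown"),
        action=payload.get("methodName", "unknown"),
    )
```

Once everything shares one schema, a single query ("who did what, when") can run across all three clouds. The maintenance cost mentioned above lives mostly in keeping these mappings current as providers change their formats.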

https://redd.it/1p04qn1
@r_devops
I finally got rid of Vercel/Render after $200/mo bills and migrated to my own VPS. Here's what I learned

For years, I was terrified of managing my own server. I mean, who wouldn't be? Vercel, Render, and Supabase made everything so easy.
Push to GitHub, and boom, your app is live. No SSH, no nginx configs, no worrying about SSL certificates or process managers.

**But then my bills started climbing.**

What started as $20/month quickly escalated to over $200 as my side projects gained traction.
Meanwhile, I kept seeing people talk about running everything on a $10 Hetzner VPS.

I thought they were crazy. "There's no way I can manage that," I told myself.

# The migration that changed everything

When one of my apps hit a traffic spike and Vercel wanted to charge me $300+ for that month, I finally snapped. I spun up a Hetzner VPS and started migrating.

And you know what? **It was harder than it should have been.**

Not because VPS hosting is inherently difficult — but because the tooling gap is massive. With Vercel, I had:

* One-click deploys from GitHub
* Automatic SSL
* Real-time logs
* Environment variable management
* Zero-downtime deployments

On my VPS? I had... SSH and a prayer.

# The real problem: UX, not capability

Here's what frustrated me: **servers are actually more powerful and flexible than PaaS platforms**. But the user experience is stuck in 2010.

I tried Coolify (it's great, by the way), but it consumed too many resources on my small VPS and added another layer I had to manage.

I didn't want a control panel taking up 1GB of RAM. I just wanted the **Vercel experience, but for my own server**.

# So I built something for myself

I ended up building a desktop app that connects to my VPS via SSH and gives me:

* GitHub integration with one-click deploys
* Automatic nginx config and SSL (Let's Encrypt)
* Real-time deployment logs
* Environment variables management
* Process monitoring

The key difference from control panels? **It runs on my local machine** — zero footprint on the server. It's literally just "SSH with a nice GUI."
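
For anyone curious what an "automatic nginx config" step can boil down to, here is a rough Python sketch that renders a reverse-proxy server block. It is not the app's actual code: `render_nginx_site`, the domain, and the port are made-up illustrations, and the SSL step (for example, running a Let's Encrypt client afterwards) is left out.

```python
def render_nginx_site(domain: str, upstream_port: int) -> str:
    """Return an nginx server block that proxies a domain to a local app port."""
    if not 1 <= upstream_port <= 65535:
        raise ValueError(f"invalid port: {upstream_port}")
    # Doubled braces are literal braces inside the f-string.
    return f"""server {{
    listen 80;
    server_name {domain};

    location / {{
        proxy_pass http://127.0.0.1:{upstream_port};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }}
}}
"""


if __name__ == "__main__":
    # Hypothetical app listening on port 3000
    print(render_nginx_site("app.example.com", 3000))
```

A tool working over SSH would then write this file into the server's nginx sites directory and reload nginx; those steps depend on the distro layout, so they are not shown.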

# Why I'm sharing this

I'm not here to bash PaaS platforms. Vercel and Render are incredible for certain use cases. But if you're:

* Running multiple side projects
* Paying $100+/month for simple Next.js apps
* Comfortable with the terminal but want better UX
* Worried about vendor lock-in

**You can absolutely manage your own VPS** without sacrificing developer experience.

# The results

I'm now running 5 production apps on a single $20/month Hetzner VPS (8GB RAM, 4 vCPUs).

My monthly bill went from ~$200 to $20. Same apps, same performance, but I actually have MORE control over everything.

# My honest take

* **PaaS platforms are worth it** if you're making money and don't want to think about infrastructure
* **VPS hosting makes sense** once you have 3+ projects or you're spending $50+/month
* **The tooling gap is real** — this is the actual barrier, not server management itself
* **Coolify is great** if you have a beefier VPS (4GB+ RAM) and want a full control panel
* **Not competing with anything** — there's room for different approaches

The goal isn't to convince everyone to migrate. It's to show that **managing your own server doesn't have to be intimidating** if you have the right tools to bridge that UX gap.

Has anyone else made the PaaS → VPS migration? What was your experience?

https://redd.it/1p067pw
@r_devops
Nginx, PHP-FPM, and Redis in a single container

Is it OK to put Redis, Nginx, and PHP-FPM in one container? What are the things I should keep in mind?

I am going to run it on AWS ECS.

Context: I'm trying it in staging first, but if it works as expected it is going to process 15M requests every day.
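
For scale, 15M requests a day is a modest average rate, and the peak is what matters for sizing. A quick back-of-the-envelope in Python, where the 5x peak factor is purely an assumed placeholder and not a number from the post:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second if traffic were perfectly flat."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_factor: float) -> float:
    """Estimated peak rate, given an assumed peak-to-average ratio."""
    return average_rps(requests_per_day) * peak_factor


if __name__ == "__main__":
    print(f"average: {average_rps(15_000_000):.0f} req/s")
    print(f"assumed 5x peak: {peak_rps(15_000_000, 5.0):.0f} req/s")
```

That works out to roughly 174 requests per second on average. Measuring the real peak-to-average ratio in staging would replace the assumed factor.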

https://redd.it/1p073j8
@r_devops
Autoscaling EC2 during huge spikes

How are you guys managing autoscaling with an ALB + EC2 setup? I know we can set up an Auto Scaling group, but in my case there are huge spikes in traffic and the group doesn't get enough time to scale. What can be done in this case?

Also, when it starts scaling it goes straight to the max number of instances. The scaling policy triggers if average CPU is more than 50%.
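
A toy model of why a CPU-based policy can jump to the max on a spike: if capacity is sized in proportion to how far the metric sits above the target, a big enough burst asks for more instances than the group allows. This is a simplified proportional calculation for illustration, not AWS's actual scaling algorithm; `desired_capacity` and the numbers are hypothetical.

```python
import math


def desired_capacity(current: int, avg_cpu: float, target_cpu: float,
                     min_size: int, max_size: int) -> int:
    """Proportional sizing: scale the fleet by metric/target, then clamp."""
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # 4 instances pinned at 100% CPU against a 50% target want 8,
    # which gets clamped to a max group size of 6.
    print(desired_capacity(4, 100.0, 50.0, min_size=2, max_size=6))
```

CPU tops out at 100%, so during a spike the metric under-reports real demand, and new instances still need boot time before they take traffic. That combination is why the group can lag the spike and still end up at its ceiling.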

https://redd.it/1p07tu3
@r_devops
Are these "real production war scenarios" trainings any good? Has anyone bought this?

I came across this training on LinkedIn. They are teaching real production war scenarios; it says "Master production-grade tools, fire-drill scenarios, and cross-cloud architectures. Every skill here is forged through real outages, real deployments, and real engineering war rooms. " https://elite.infrathrone.xyz/

Does anyone have any idea about it? How is it?

https://redd.it/1p08yh8
@r_devops
Can you really automate QA testing without headcount or is everyone just lying?

Serious question, because I'm tired of the LinkedIn hype. Every other post is someone claiming they "automated 90% of QA" and "eliminated manual testing", but then you talk to them and they still have a QA team.

Here's my situation: we have 3 QA engineers for a team of 25 devs. They're constantly underwater, and we keep getting bugs in production anyway. Leadership wants to "automate QA" instead of hiring more people, but I'm skeptical this is actually possible. It feels like one of those things that works in theory but not in practice.

I've seen test automation frameworks, we use some already, but they still need someone to write and maintain the tests and they don't catch the weird edge cases that a human would. Plus our integration tests are flaky as hell and take forever to run.

So what's the reality here? Can you actually reduce headcount with automation or is it just shifting the work around? And if you did pull this off, what did you use? Not interested in solutions that require hiring a separate automation team, that defeats the whole point.

https://redd.it/1p0a727
@r_devops
Is the internet really decentralized, or just fragile?

Most people don’t realize this: the internet they think is distributed is actually held together by a handful of infrastructure chokepoints. Cloudflare sneezes, and half the web catches a fever. We’ve built our digital world on a fragile stack of AWS, Cloudflare, Google Cloud, and a few telcos.

When one fails, everything collapses like dominoes. The internet wasn’t supposed to be this vulnerable.

Edit: By “Internet” I meant what regular users experience daily: the apps, websites, payments, and services they rely on.

https://redd.it/1p0bcub
@r_devops
Do you have a backup plan in case your provider goes down?

I've been seeing an issue with Cloudflare for almost 45 minutes now. I didn't prepare any plan for this case, and I can't move my DNS because Namecheap is also down. How do you prepare for such cases?

https://redd.it/1p0b4tf
@r_devops
A few weeks back Docker Hub was down, along with a bunch of others; now Cloudflare

Can someone senior please tell us, wtf is going on lately?

How's this happening? This sounds like a DevOps problem, but it could be a physical IT problem as well, such as data center failures.

Any info about these outages?

As an up-and-coming DevOps engineer, I would like to be ready for anything, and this is interesting to me, since there are always surprises in this field, it seems.

https://redd.it/1p0aa5g
@r_devops
Datadog? Eval

Hello! I’m interviewing for a role at Datadog and want to get some candid feedback on their product. If you use it in any capacity it’d be great to hear the good, bad, and ugly. How are you using it? How has it impacted your day to day or overall strategy? What are the downsides? I know there are already threads in here but I want to be sure I get any feedback on new feature launches or recent changes. Thanks in advance!

https://redd.it/1p0ffnz
@r_devops
Curious About Internal Workflows During Massive Outages

With the current Cloudflare outage going on, I’ve been wondering what the internal workflow looks like inside large tech companies during incidents of this scale.

How do different teams coordinate when something huge breaks?

Do SRE/DevOps/Network teams all jump in at once or does it follow a strict escalation path?
And how is communication handled across so many teams and time zones?

https://redd.it/1p0bsur
@r_devops
Trying to transition to Devops

Hi all, pretty new here and was hoping for some advice.

Context: By trade I’m currently a civil design engineer, with my uni background also being in civil engineering. I’ve been doing it for about 2 years now.

Recently I’ve been really interested in DevOps and I’m determined to transition my career. I started by learning Python and I’m pretty confident at an intermediate level. I’ve also done my first Azure certification (AZ-900) to get my fundamentals right. I have also done some networking fundamentals and I’m pretty confident in my understanding of the OSI layers. I’m currently working on getting my admin associate certification (AZ-104). My plan is to learn Terraform afterwards, as well as Azure DevOps or GitHub Actions (leaning towards GitHub Actions). I’m learning PowerShell slowly on the side right now too.

Outside of my core learning I’ve done some high-level research on containerization and orchestration too, knowing I’ll have to focus on those when the time comes.

Just wanted to get thoughts from people who already do it and a steer on what would help, thanks.

https://redd.it/1p0mx4b
@r_devops
Is there anything new to learn in 2025?

Aside from Kubernetes and Terraform, is there anything to learn as a software developer or DevOps engineer? What would you suggest and why?

https://redd.it/1p0sfmj
@r_devops
Hello DevOps fam, I just passed my AWS SAA and wanna go straight to learning DevOps

Hello fam, I just passed my AWS course. Are there any YouTube channels, Udemy courses, etc. that you'd recommend for learning DevOps? Any recommendations are greatly appreciated.

https://redd.it/1p0t19c
@r_devops
What a day...

I spent the last 3 weeks working on a project management pipeline that was heavy in GitHub Actions and was set to demo it today in a huge meeting in front of all of the project managers and developers. I started the demo at 3:30 EST this afternoon.

I started off at the user creation command line and created a new user, switched to them and ran a custom SSH and GitHub config wizard I wrote which abstracted away the burdens of dealing with configuring those for PMs.

It worked flawlessly. It ran the check, verified everything was good, pulled repos. It was golden.

I went further into the systems and went to have it send some project management files into a branch to be picked up by CI....

Suddenly git was broken. I was flabbergasted.

It was 3:40, GitHub was down. I sat there like an idiot fudging it for 10 minutes until the meeting moved to another presentation....

It was devastating....

What a day fellas (fellettes), what a day...

https://redd.it/1p0vx27
@r_devops
When does Policy-as-Code become "The Slow Lane" for developers?

Hey r/devops,

I'm working on scaling up our internal developer platform (IDP) and one of the biggest points of friction is how we enforce DevSecOps and compliance policies without killing our velocity. We're trying to shift left, but it feels like we've just shifted all the pipeline friction right onto the developer's lap.

We moved from a few post-merge human approval tollgates to an aggressive Policy-as-Code strategy using tools like Open Policy Agent (OPA) with Rego on every pull request (PR).

The result? Our security posture is fantastic. Our IaC drift is near zero. But our average PR time is up 25%, and the team is starting to view the pipeline as an adversary, not an enabler.

The checks are running: SAST, SCA, Terrascan, custom checks for naming conventions, and resource tagging compliance. All before merge. The problem is that a failed low-severity SAST finding can hold up a critical patch that has a clean functional change.

My burning question to the community:

How are you balancing the enforcement of non-critical-but-mandatory policies (like resource tagging or specific naming conventions) in the pipeline?

1. Do you have an explicit 'fail fast/fail hard' policy only for critical security issues, and let minor compliance issues run through the main pipeline, alerting to a dashboard for follow-up? (i.e., making them blocking in pre-prod but non-blocking in the main CI?)
2. Are you using a separate, performance-optimized "compliance-only" pipeline that runs less frequently, thereby unblocking the core CI/CD flow?

I’m looking for actual tooling or architectural patterns that allow for selective blocking that doesn't rely on us writing custom logic in every single Jenkinsfile/GitHub Action workflow.
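
One pattern for selective blocking is to keep the block-or-warn decision in a single shared gate that every pipeline calls, rather than in each workflow file. A minimal Python sketch under assumptions: the `Finding` shape, the severity names, and the `BLOCKING` set are hypothetical, and a real setup would feed this from the scanners' JSON output.

```python
from dataclasses import dataclass

# Assumed policy: only these severities fail the PR; the rest are reported.
BLOCKING = {"critical", "high"}


@dataclass(frozen=True)
class Finding:
    tool: str       # e.g. "sast", "sca", "tagging"
    severity: str   # "critical" | "high" | "medium" | "low"
    message: str


def gate(findings: list[Finding]) -> tuple[bool, list[Finding], list[Finding]]:
    """Split findings into blocking and advisory; pass only if nothing blocks."""
    blocking = [f for f in findings if f.severity.lower() in BLOCKING]
    advisory = [f for f in findings if f.severity.lower() not in BLOCKING]
    return (not blocking, blocking, advisory)


if __name__ == "__main__":
    ok, blocking, advisory = gate([
        Finding("sast", "low", "unused variable flagged"),
        Finding("tagging", "medium", "missing cost-center tag"),
    ])
    print("merge allowed" if ok else "merge blocked")
    for f in advisory:
        print(f"follow-up [{f.tool}/{f.severity}]: {f.message}")
```

With the policy in one place, tightening it (say, making tagging failures blocking in pre-prod only) is a change to the gate, not to every Jenkinsfile or workflow. The advisory list is what would go to the follow-up dashboard.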

https://redd.it/1p0z7dp
@r_devops