Reddit DevOps – Telegram
Best VPS for CI/CD pipelines on a budget?

Our team is looking for a few VPS instances to handle our CI/CD pipelines and a private Docker registry. We have been looking at some of the newer providers that offer high RAM and NVMe storage, because our builds are getting pretty heavy and the old SATA drives just aren't cutting it anymore. We need something with a solid network, since we push large images back and forth all day.

We are also considering some of the smaller players that seem to offer better specs at the same price point. Reliability is the biggest factor here, because if the server goes down our whole dev workflow stops.

Has anyone tried some of the newer NVMe-focused providers recently? Are there any specific ones that handle high CPU load well without throttling? Would love to hear some real-world experiences before we commit.



https://redd.it/1ps4twv
@r_devops
When people say “know what’s running,” it often gets interpreted as a philosophical or security-only concern.

I mean it very concretely.
A common scenario:
- You inherit a system with monitoring, EDR, logging, and dashboards.
- Everything is “green”.
- Nobody can clearly explain:
  - why certain services exist
  - which ones are intentional vs historical
  - what’s business-critical vs just still alive
  - who owns decisions made years ago
The system functions, alerts fire, CI/CD runs — but understanding has decayed faster than uptime.
In practice, I’ve found that most operational risk doesn’t come from missing tools, but from missing context.
Curious how others approach rebuilding that understanding without freezing delivery.

https://redd.it/1ps6wja
@r_devops
In law there’s the Magic Circle. What’s the real equivalent in tech?

In law there’s the Magic Circle. What’s the real equivalent in tech?

https://redd.it/1ps8cob
@r_devops
Career Trajectory

Hey everyone,

I’m looking for some honest career advice because I’m a bit unsure about my next step.

I have a bachelor’s in computer science and started my career in a DevOps engineer role for about 4 months, doing a mix of coding and ops. That project ended, and I moved into a system engineer role. I’ve been doing that for a little over a year now, working in a team of five on Linux and Windows servers for large clients.

My current work includes Ansible automation, kernel patching, OS upgrades, backups, troubleshooting, etc. I’ve learned a lot and built a solid base, but lately I feel like my learning curve is slowing down. Not bored, just not growing as fast as I’d like.

My long-term goal is to become a DevOps engineer in the next 3–4 years.

I now have an offer for a System Administrator role at another company, and I’m trying to figure out whether it’s a smart stepping stone or a potential detour. The title worries me a bit, but the actual responsibilities seem broader and more modern than my current role.

The role would involve:
• Working with Google Cloud Platform
• Managing on-prem infrastructure (Proxmox virtualization on Dell servers + Mac hardware)
• Docker for services and build processes
• Automation using Python and Ansible
• Ensuring reliable operation of IT systems (config management, infrastructure, integrations, and continuous improvements)
• Maintaining an office IT presence, hands-on user support, and onboarding/offboarding (hardware + accounts)
• Device management tools (Intune, NinjaOne, Mosyle)
• Supporting Linux, macOS, and Windows environments
• Contributing to security and compliance: patching, access controls, monitoring events, vulnerability remediation, and assisting with audits/access reviews alongside the security team
• Company-supported certifications (which my current company doesn’t offer)

On paper, this seems closer to DevOps fundamentals (cloud, automation, containers, infra ownership), but I’m still a bit concerned about drifting too far into end-user support or being labeled “just a sysadmin” long term.

For those who’ve gone from sysadmin → DevOps (or who hire DevOps engineers):
Does this sound like a good foundation for moving into DevOps in a few years, or a role that could slow that transition down if I’m not careful?

Thanks for any real-world insights.


I have rephrased this with AI since my English is not the best.


https://redd.it/1psafw5
@r_devops
Operational pain points of OTP/SMS systems?

I’m curious about OTP/SMS from an ops perspective.
If you’ve managed systems using Twilio or similar:
What operational risks showed up?
How did you monitor or control usage?
What caused alerts or panic moments?
Not promoting anything — genuinely interested in ops lessons.

https://redd.it/1psaxo4
@r_devops
Pipeline to search for new job opportunities

I live in Europe (EU citizen) in an LCOL country. I have a PhD and 2 YoE in DevOps at a multinational company. I'm thinking it's time to look for a new company, mostly for financial reasons.

I believe it's better to look for a fully remote position, most probably in the USA or a high-paying EU country.
Now I'm trying to set up a "pipeline" to do this efficiently. Time is not an issue since I already have a job.

My idea is:

1. Search LinkedIn for remote jobs. Any other source? Glassdoor maybe?

2. Find people at the most promising companies (those that posted a job) and contact them for inside info (what the company is like, what they are looking for, whether they can refer me, etc.).

3. Create a "big" version of my CV with most of the stuff I've done, regardless of job descriptions.

4. Ask some AI tool (any suggestions?) to take the "big" CV and tailor it to the job description (supervised by me).

5. Apply to as many companies as I can in this targeted way (I don't like the one-CV-for-all approach).

General questions: What helped you approach USA/HCOL EU companies and get a job there?

What job application pipeline did you find works best (apart from networking, which is also something I plan to look into)?

https://redd.it/1ps9f1y
@r_devops
I'm so tired of using AI :/

I'm a senior DevOps engineer with 10+ years of experience. I'm at a company that uses PHP and a really old methodology for deployments. I've slowly been improving our workflows, but my company really wants to use AI.

I've been using GitHub agents to automate a lot of our manual processes for onboarding new clients. Because we have clear processes for tasks I've found myself doing the following a lot:

- Given these 10 commits or 5 PRs use them as a template on how to create a new client space.
- Commits x-y show how we generate API keys and authorize them. Can you generate an AGENTS.md file that documents that process in a format where I can just tell you: "generate a new API key for company id #1234455"?


My output has increased thanks to AI. But let's be real: I'm not programming, I'm not making .tpl files to fill in later, I'm just using our history to automate flows.

I miss solving complex issues. I miss working on issues where the answer isn't just "ask AI, leverage AI". I want to work on memory overflows and networking debugging and cdk/scripts, not giving Microsoft more money :/

https://redd.it/1psdwan
@r_devops
Friday night GPU spike hit $50k/day, shift-left governance fail, what tools prevent this chaos?

Got paged at 11pm Friday. GPU costs jumped to $50k/day from eng teams testing AI models. No quotas, no policies. We could've easily burned $200k by Monday. Spent much of my day manually killing instances, tagging everything.

This is our 3rd spike this quarter. We have no pre-deploy checks, no vuln-cost tying, and no auto-enforcement of schedules or rightsizing. CloudHealth just shows the damage after the fact, and anomaly alerts land on deaf eng ears.

I am here looking for advice before the next fire. What tools shift-left without turning my team into cloud cops? Would love to hear it all.
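
A minimal sketch of what a "pre-deploy check" could look like, in Python. This is not a real tool or any cloud provider's API: the instance names, prices, `LaunchRequest`, and `check_launch` are all hypothetical, and the idea is only to show a launch being denied up front when its worst-case cost exceeds the team's remaining budget.

```python
from dataclasses import dataclass

# Hypothetical hourly prices in USD; real prices vary by provider and region.
HOURLY_PRICE_USD = {"gpu-small": 1.00, "gpu-large": 32.00}


@dataclass
class LaunchRequest:
    team: str
    instance_type: str
    count: int
    max_hours: int  # mandatory auto-shutdown window


def projected_cost(req: LaunchRequest) -> float:
    """Worst-case cost if every instance runs for the full window."""
    return HOURLY_PRICE_USD[req.instance_type] * req.count * req.max_hours


def check_launch(
    req: LaunchRequest, spent_usd: float, team_budget_usd: float
) -> tuple[bool, str]:
    """Return (allowed, reason); deny before launch instead of alerting after."""
    if req.instance_type not in HOURLY_PRICE_USD:
        return False, f"unknown instance type {req.instance_type!r}"
    if req.max_hours <= 0:
        return False, "an auto-shutdown window (max_hours) is required"
    cost = projected_cost(req)
    remaining = team_budget_usd - spent_usd
    if cost > remaining:
        return False, f"projected ${cost:,.2f} exceeds remaining ${remaining:,.2f}"
    return True, f"projected ${cost:,.2f} fits remaining ${remaining:,.2f}"


# Eight large GPU instances for a 72-hour weekend against a $10k team budget.
print(check_launch(LaunchRequest("ml", "gpu-large", 8, 72), 0.0, 10_000.0))
```

In a setup like this the check would run in CI or in an admission step before anything is provisioned, so the conversation happens at request time rather than at 11pm on a Friday.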

https://redd.it/1pshj2b
@r_devops
Which Infrastructure as Code tools are actually used most in production today?

I’m trying to understand real-world adoption, not just what’s popular in tutorials.

For teams running production workloads (AWS, GCP, Azure or multi-cloud):
- What IaC tool do you actually use day to day?
- Terraform / OpenTofu, CloudFormation, CDK, Pulumi, something else?
- And why did you choose it (team size, scale, compliance, velocity)?

Looking for practical answers, not marketing.

https://redd.it/1ps5058
@r_devops
FastAPI with Celery worker

Deployment strategy: GitHub Actions - ECS - EC2

EC2: 2 CPU, 4 GB

Nginx serving the front end: less than 500 MB

FastAPI: 1 GB

Celery worker (FastAPI image)

The API has an upload requirement, but any time there’s an upload the FastAPI service restarts with exit code 137 (OOM, out of memory)…

File size: 2 KB
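
For readers unfamiliar with the 137: by the usual shell and container convention, an exit code above 128 means "terminated by signal (code - 128)", so 137 is 128 + 9, and signal 9 is SIGKILL, which is what the Linux OOM killer sends. A small sketch of that arithmetic, assuming a Linux host (`describe_exit_code` is a made-up helper name):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        # Raises ValueError if code - 128 is not a known signal number.
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

This only confirms that the process was killed with SIGKILL, which is consistent with an out-of-memory kill; it does not say what used the memory.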

https://redd.it/1psjbug
@r_devops
Why the hell do container images come with a full freaking OS I don't need?

Seriously, who decided my Go binary needs bash, curl, and 47 other utilities it'll never touch? I'm drowning in CVE alerts for stuff that has zero business being in production containers. Half my vulnerability backlog is noise from base image bloat.

Anyone actually using distroless or minimal images in prod? How'd you sell the team on it? Devs are whining they can't shell into containers to debug anymore but honestly that sounds like a feature not a bug.

Need practical advice on making the switch without breaking everything.

https://redd.it/1pskpsd
@r_devops
6 years in DevOps (~14 projects) and I’m burning out — considering management or cybersecurity

I’m looking for some perspective from people who’ve been in this field longer than me.

I’ve been working in DevOps for ~6 years. I wouldn’t call myself a “rockstar” or a deep specialist in one niche, but I’ve had decent breadth: I’ve worked across ~14 projects (many different technologies). I’ve touched a lot of the standard DevOps stack: AWS/Azure, Terraform, Kubernetes, different CI/CD systems, Helm charts, the usual stuff.

And lately I’ve been asking myself: do I actually want to keep doing this long-term?

I’m not quitting tomorrow, but I’m noticing something that looks a lot like burnout (or at least the early version of it). The biggest issue isn’t that I can’t do the work — it’s that I’m losing interest in the idea of being a “Senior DevOps” whose life is just… shipping more Helm charts and deployment pipelines forever. I’m starting to worry that if I keep pushing the same path, I’ll end up stuck and miserable.

On top of that, I’ve been thinking a lot about doing something that feels genuinely useful / meaningful. For me, “useful” looks like working on problems that matter beyond shipping features — and honestly, I’ve always seen military work as something with a stronger sense of purpose. That made me consider a longer-term plan: move into cybersecurity and potentially transition from civilian work into a military role (or defense-related work). My hope is that it would give me a stronger feeling that I’m doing something important.

So I’ve started thinking about alternative directions that still use my background, but feel like forward movement rather than “more of the same.” A few paths I’m considering:

- Engineering manager / technical project manager / delivery-type role. I have a 3-year degree in IT Project Management.
- Cybersecurity (especially cloud/Kubernetes security, incident response, defensive security). Potentially aiming for a role that could translate into defense/military work later.

What I’m hoping to get from this post:

1. If you hit this “I don’t want to do the same DevOps work forever” phase — what did you do?
2. For people who moved from DevOps into management (EM/PM/TPM) — what skills mattered most and what surprised you?
3. For people who moved from DevOps into cybersecurity — what was the best entry point (cloud security, detection/response, security engineering, GRC, etc.) and what would you do differently?
4. Any advice for figuring out whether this is real burnout vs just needing a change of project/company?
5. If anyone has experience moving from civilian tech into defense/military-related work (even indirectly) — what should I know upfront?

I’d really appreciate any stories, recommendations, or even “here’s what I wish I knew earlier.”

Thanks.

https://redd.it/1psn7om
@r_devops
What are the biggest observability challenges with AI agents, ML, and multi‑cloud?

As more teams adopt AI agents, ML-driven automation, and multi-cloud setups, observability feels a lot more complicated than “collect logs and add dashboards.”

My biggest problem right now: I often wait hours before I even know what failed or where in the flow it failed. I see symptoms (alerts, errors), but not a clear view of which stage in a complex workflow actually broke.

I’d love to hear from people running real systems:

1. What’s the single biggest challenge you face today in observability with AI/agent-driven changes or ML-based systems?
2. How do you currently debug or audit actions taken by AI agents (auto-remediation, config changes, PR updates, etc.)?
3. In a multi-cloud setup (AWS/GCP/Azure/on-prem), what’s hardest for you: data collection, correlation, cost/latency, IAM/permissions, or something else?
4. If you could snap your fingers and get one “observability superpower” for this new world (agents + ML + multi-cloud), what would it be?

Extra helpful if you can share concrete incidents or war stories where:

- Something broke and it was hard to tell whether an agent/ML system or a human caused it.
- Traditional logs/metrics/traces weren’t enough to explain the sequence of stages or who/what did what when.

Looking forward to learning from what you’re seeing on the ground.

https://redd.it/1psn5qc
@r_devops
Lewin and modern DevOps

I recently read an amazing piece by Dr. Richard Claydon called “Lewin, Rewritten: Rethinking ‘How Change Works’ for a Run / Serve / Change World”.

It explores Kurt Lewin’s change models in a modern context, and my thoughts immediately wandered into the world of DevOps.

We spend so much time talking about the "DevOps" toolchain: Kubernetes, Cloud platforms, DORA metrics. But anyone who has led a transformation knows the tools are rarely (if ever) the hard part.

The hard part is the human system.

I realized that Lewin’s 3-stage model (Unfreeze, Change, Refreeze) maps very well to the engineering challenges we face today. It explains why we hit the "J-curve" of poor performance, why "Unfreezing" habits is so hard, and why we need to rethink what "Refreezing" means in an agile world.

I’ve written up my reflections on how Lewin’s thinking applies to modern DevOps and engineering leadership here:

https://cladam.github.io/2025/12/22/lewin-and-devops/

https://redd.it/1psuvjv
@r_devops
Application-layer attacks bypassing traditional defenses

Hey all, even strong posture programs sometimes miss runtime risks such as application-layer exploits, which trigger alerts only after significant damage has been done.

This ArmoSec blog on cloud runtime threats highlights the most common runtime vectors and practical detection strategies. Have you seen runtime attacks in production? How did you detect them early?

https://redd.it/1pswoea
@r_devops
Found a really clean kubectl cheat sheet with 100+ essential commands

Was looking for a simple kubectl reference that doesn’t require jumping through the docs every time.

Came across this cheat sheet that groups 100+ commonly used kubectl commands by use case — getting resources, debugging, logs, exec, contexts, namespaces, rollouts, etc.

What I liked:

- It’s task-based, not just a random command dump

- Easy to scan when you’re in the middle of debugging

- Covers the stuff you actually use day-to-day

Link:

https://www.makcloudhance.com/kubectl-cheat-sheet/

Sharing in case it helps someone else. If you know similar resources, drop them here too.

https://redd.it/1psyaqv
@r_devops
Experiences with Agentless security (Wiz / Orca), any concerns?

Hi all,

For those of you using Agentless Cloud Security tools like Wiz or Orca, I’m curious about your experience so far.

Are you generally happy with the agentless model?
Do you have any concerns around the fact that disk snapshots are copied to the vendor’s infrastructure and scanned from there?

In particular, I’m wondering:

How comfortable are you with the data exposure / trust model?
Did this raise concerns from security, legal, or compliance teams?
Were there specific mitigations or contractual guarantees that made this acceptable?
Or is the operational simplicity worth the trade-off for you?

Not trying to argue one way or another, just looking to understand how practitioners are thinking about this in real-world environments.

Thanks!

https://redd.it/1psz2ra
@r_devops
restricting user list to those assigned to project

I'm new so sorry if this is a dumb question, but I'm getting complaints from users editing work items in the web interface -

1. Clicking in the assigned user textbox is confusing people because they expect a dropdown, and when they don't see one they assume they don't have permission to edit. There is no affordance telling them they need to type something first.

2. It searches over the entire organization. I have a project manager that says this is unacceptable, visibility needs to be restricted to those who have been assigned to the project.

There's too much search noise trying to google this so maybe someone can tell me what's going on here, if they plan to fix this or what the rationale is.

https://redd.it/1pt0vyx
@r_devops
GenAI is fun… until you try to keep it running in prod

GenAI is fun… until you try to keep it running in prod 😅

I’ve been seeing tons of GenAI demos lately and yeah, they look great. But every time I end up thinking, okay cool, but how do you operate this thing after the demo?

Recently AWS started talking more seriously about GenAIOps.
GenAI just doesn’t behave like normal apps. Same prompt, different output. “Works” but not always right. Tokens quietly draining money. Stuff breaks in weird ways.

Funny thing is, just recently I found myself using shell scripts and multi-stage Azure DevOps pipelines to build some guardrails and ops around GenAI workflows. Not fancy, but very real. And that’s when it hit me: yeah, this absolutely needs its own ops mindset.

AWS is basically saying the same: treat prompts, models, agents like deployable artifacts. Monitor quality, not just uptime. Add safety, cost controls, evals. It’s like MLOps… but leveled up for GenAI chaos.
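
To make "monitor quality, not just uptime" and "cost controls, evals" a bit more concrete, here is a rough Python sketch of an eval gate for a prompt version. It is not taken from the AWS post: `evaluate_prompt`, `fake_model`, the substring-match scoring, and the per-token price are all assumptions for illustration.

```python
from typing import Callable

# Hypothetical price; real per-token pricing depends on model and provider.
USD_PER_1K_TOKENS = 0.01


def evaluate_prompt(
    model: Callable[[str], tuple[str, int]],
    cases: list[tuple[str, str]],
    min_pass_rate: float,
    max_cost_usd: float,
) -> dict:
    """Run (input, expected substring) cases and gate on quality and cost."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passed = 0
    tokens = 0
    for prompt, expected in cases:
        output, used = model(prompt)
        tokens += used
        if expected.lower() in output.lower():
            passed += 1
    pass_rate = passed / len(cases)
    cost = tokens / 1000 * USD_PER_1K_TOKENS
    return {
        "pass_rate": pass_rate,
        "cost_usd": cost,
        "promote": pass_rate >= min_pass_rate and cost <= max_cost_usd,
    }


def fake_model(prompt: str) -> tuple[str, int]:
    """Stand-in for a real model call: returns (text, tokens used)."""
    answers = {"capital of France?": "Paris", "2 + 2?": "5"}
    return answers.get(prompt, ""), 500


cases = [("capital of France?", "paris"), ("2 + 2?", "4")]
print(evaluate_prompt(fake_model, cases, min_pass_rate=0.9, max_cost_usd=1.0))
```

The service behind `fake_model` would be "up" the whole time here, yet the gate refuses to promote the prompt because half the answers are wrong, which is the gap between uptime and quality the post is pointing at.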

This feels less like hype and more like reality catching up. We’re clearly moving from GenAI experiments to GenAI systems. And systems always need ops.

Good read if you’re curious: https://aws.amazon.com/blogs/machine-learning/operationalize-generative-ai-workloads-and-scale-to-hundreds-of-use-cases-with-amazon-bedrock-part-1-genaiops/

I hope you are happy now @mods. 😜

#AWS #GenAIOps #GenerativeAI #DevOps #MLOps #CloudEngineering

https://redd.it/1pt3b7w
@r_devops
LLMs in prod: are we replacing deterministic automation with trust-based systems?

Hi,

Lately I’m seeing teams automate core workflows by wiring business logic in prompts directly to hosted LLMs like Claude or GPT.

Example I’ve seen in practice:
a developer says in chat that a container image is ready, the LLM decides it’s safe to deploy, generates a pipeline with parameters, and triggers it. No CI guardrails, no policy checks, just “the model followed the procedure”.

This makes me uneasy for a few reasons:

• Vendor lock-in at the reasoning/decision layer, not just APIs

• Leakage of operational knowledge via prompts and context

• Loss of determinism: no clear audit trail, replayability, or hard safety boundaries


I’m not anti-LLM. I see real value in summarization, explanation, anomaly detection, and operator assistance. But delegating state-changing decisions feels like a different class of risk.

Has anyone else run into this tension?

• Are you keeping LLMs assistive-only?

• Do you allow them to mutate state, and if so, how do you enforce guardrails?

• How are you thinking about this from an architecture / ops perspective?
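
One possible shape for the guardrail question, sketched in Python under stated assumptions: the LLM may only propose an action, and a deterministic policy check plus an append-only audit record decide whether it runs. The policy values, `ProposedAction`, `violations`, and `gate` are hypothetical names, not any product's API.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical policy values for illustration only.
ALLOWED_ACTIONS = {"deploy"}
ALLOWED_ENVIRONMENTS = {"staging"}
TRUSTED_REGISTRY = "registry.example.internal/"


@dataclass(frozen=True)
class ProposedAction:
    action: str
    environment: str
    image: str
    ci_passed: bool  # read from the CI system, not from the model's claim


def violations(p: ProposedAction) -> list[str]:
    """Deterministic checks: the same proposal always gets the same verdict."""
    found = []
    if p.action not in ALLOWED_ACTIONS:
        found.append(f"action {p.action!r} not allowed")
    if p.environment not in ALLOWED_ENVIRONMENTS:
        found.append(f"environment {p.environment!r} needs human approval")
    if not p.image.startswith(TRUSTED_REGISTRY):
        found.append("image is not from the trusted registry")
    if not p.ci_passed:
        found.append("CI has not passed for this image")
    return found


def gate(p: ProposedAction, audit_log: list[str]) -> bool:
    """Record every proposal and verdict, then allow only clean proposals."""
    problems = violations(p)
    audit_log.append(
        json.dumps({"proposal": asdict(p), "violations": problems}, sort_keys=True)
    )
    return not problems


log: list[str] = []
print(gate(ProposedAction("deploy", "production", "docker.io/app:latest", False), log))
print(log[-1])
```

The design choice is that the model never holds the authority: it fills in a structured proposal, and replaying the audit log through `violations` reproduces every decision without calling the model again.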

Curious to hear how others are handling this long-term.

https://redd.it/1pt3xw5
@r_devops