Reddit DevOps – Telegram
Job Skills to Gain

This is going to sound like a weird ask, but I am asking for some suggestions on some skills I should learn.

I’m currently a senior cloud engineer and have a lot of the tech stuff down. If something is new to me, I’m good enough to put it together and use AI to fill in the gaps.

I’m looking at things that could help enhance my career to architect or manager level. I was thinking about doing a communication course but the ones I found on Udemy were super dry.

I was also thinking of data analytics, but I’m not sure where I could use it since I’m a consultant.

Any suggestions would be appreciated.

https://redd.it/1p6qj05
@r_devops
Stop looking at CPU usage, start looking at PSI

Simple example with two Linux servers:

Server A: CPU ~100%. Latency is low, requests are fast. Doing video encode.
Server B: CPU ~40%. API calls are timing out, SSH is lagging.

If you only look at CPU graphs, A looks worse than B. In reality A is just busy. B is the one under pressure because tasks are waiting for CPU. I still see alerts / autoscaling rules like:

>CPU > 80% for 5 minutes

CPU% just says “cores are busy”. It does not say “tasks are stuck”.

Linux (4.20+) has PSI (Pressure Stall Information) in /proc/pressure/*.
This tells you how much time tasks are stalled on CPU / memory / IO.

Example from /proc/pressure/cpu:

some avg10=0.00 avg60=5.23 avg300=2.10 total=1234567

Here avg60=5.23 means: over the last 60 seconds, at least one task was stalled waiting for CPU 5.23% of the time.
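
A minimal sketch of reading that file, assuming the line format shown above. The function names and the 5% threshold are made up for illustration and are not taken from Linnix:

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI file contents into {"some": {"avg10": ..., "total": ...}, ...}."""
    parsed = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is cumulative stall time in microseconds; avgN are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        parsed[kind] = values
    return parsed


def cpu_under_pressure(psi, threshold_pct=5.0):
    """Return True when the 60-second 'some' average reaches the threshold.

    The 5% default is an illustrative assumption, not a recommended value.
    """
    return psi["some"]["avg60"] >= threshold_pct


def read_cpu_psi(path="/proc/pressure/cpu"):
    """Read and parse the live PSI file (needs Linux 4.20+ with PSI enabled)."""
    return parse_psi(Path(path).read_text())


# The sample line from above, parsed without touching /proc.
sample = "some avg10=0.00 avg60=5.23 avg300=2.10 total=1234567"
psi = parse_psi(sample)
```

On a real box you would call `cpu_under_pressure(read_cpu_psi())` instead of parsing the sample string.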

For a small observability project I hack on (Linnix, eBPF-based), I stopped using load average and switched to /proc/pressure/cpu for the “is this box in trouble?” logic. False alarms dropped a lot.

Longer write-up with more details is here:
https://parth21shah.substack.com/p/stop-looking-at-cpu-usage-start-looking

Anyone here actually using PSI in prod alerts?

https://redd.it/1p6rur8
@r_devops
I built an open-source tool for debugging Kubernetes with LLMs - Kubently

Hey y'all - I've been working on a side project and figured this community might find it useful (or tear it apart, or most likely both). I've learned a lot just building it. I've been part of another agentic platform engineering project (CAIPE), which introduced me to a lot of the concepts, so I'm definitely grateful for that, but building this from scratch was a bigger undertaking than I originally intended, ha!

Full disclosure - there's lots of room for improvement and I have lots of ideas on how to make it better, but I wanted to get some community feedback on what I have so far to understand whether this is something people are actually interested in or a total miss. I think it's useful as is, but I definitely built it with future enhancements in mind (i.e. black-box architecture, easy to swap out the core agent logic/LLM/etc.), so it's not an insane undertaking when I get around to tackling them.

**Kubently** is an open-source tool for troubleshooting Kubernetes agentically - basically lets you debug clusters through natural conversation with any major LLM. The name is a play on "Kubernetes" + "agentically" if that wasn't obvious.

Why I built it: kubectl output is verbose, debugging is manual, managing multiple clusters means constant context-switching, and honestly agents debug faster than I can half the time. So I built something that fixes this.

**What it does:**

* ~50ms command delivery via SSE
* Read-only operations by default (secure by design)
* Native A2A protocol support - works with whatever LLM you're running
* Integrates with existing A2A systems like [CAIPE](https://cnoe-io.github.io/ai-platform-engineering/)
* LangGraph/LangChain
* Runs on any K8s cluster - EKS, GKE, AKS, bare metal, doesn't matter
* Multi-cluster from day one - deploy lightweight executors to each cluster, manage from single API

Docs: [https://kubently.io](https://kubently.io)

GitHub: [https://github.com/kubently/kubently](https://github.com/kubently/kubently)

Would love feedback, bug reports, or feature requests. And if you find it useful, a star on GitHub would be awesome.

https://redd.it/1p6sld9
@r_devops
As a freshman in college in Europe, how should I get into devops in 2025?

So I figured the question isn't whether AI threatens DevOps, since the "traditional way" of approaching any specialization is basically threatened.

How do I get into DevOps given all the AI resources available? I feel lost in a sea of resources, most of which honestly don't make much sense, so this subreddit might be a good place to ask.

Thank you for your perspective in advance!

https://redd.it/1p6u465
@r_devops
Found a great GitHub repo of hands-on DevOps/Cloud projects

Hey folks,

I came across this GitHub repo, which seems like a solid collection of practical DevOps and cloud infrastructure projects for learning and building skills:

https://github.com/NotHarshhaa/DevOps-Projects


What I want feedback on (that’s why I’m sharing):
• Do you guys think the scope and complexity of these projects reflect “real-world DevOps” work?
• Are there parts or types of projects you’d consider essential for a strong DevOps portfolio that are missing?
• Would working through these give enough depth for someone preparing for cloud or DevOps roles (or certs)?
• Any concerns about using this kind of repo-based learning as a proxy for on-the-job experience?

If you know of better repos / project collections, or have had a similar experience learning via GitHub I’d love to hear about that too.

Thanks!


https://redd.it/1p6rote
@r_devops
Trying to break into SRE — need guidance

Hey everyone,
I’m looking to transition into an SRE role and I’m not fully sure what direction to take from here. I’m currently in a TechOps role where most of my time goes into debugging production issues, monitoring system behavior, and handling incident-style problems at an L1/L2 level.

Here’s what I’ve worked with so far:

Manual debugging using browser DevTools (network tab, console errors, API/asset failures)
Basic API investigation (REST + GraphQL)
Monitoring and observability: New Relic (dashboards + logs), Pingdom, Grafana
Linux fundamentals: logs, permissions, SSH, basic troubleshooting
Automating tasks using Bash, Python (early stage), and Playwright (web automation)
Cron-based scheduling for scripts and recurring jobs
Source control: Git basics (branches, merge, revert, etc.)
Beginner cloud exposure (mostly AWS concepts but not deep hands-on yet)
Basic networking: DNS, ports, VPN, proxy behavior, routing, CDN troubleshooting

Outside my day job, I’ve been doing bug bounty as a side skill to sharpen my debugging mindset. I mainly focus on web security weaknesses and medium-level writeups, not just low-effort submissions. One of the notable findings I reported was to Salesforce — nothing huge, but it got acknowledged and boosted my confidence that I can spot real-world failures, not just theoretical ones.

Recently I’ve been learning Docker and Docker Compose and planning to move toward Kubernetes next. I’m also trying to learn CI/CD and Infrastructure-as-Code (Terraform, aws-cdk), but it’s hard to judge if I’m prioritizing the right things.

What I’m looking for help with:

What’s the expected foundational skill set for someone trying to break into SRE from support/TechOps?
Should I prioritize a cloud cert (AWS/GCP), or get hands-on with Kubernetes, Terraform, pipelines, etc. first?
Are there any projects that would make my profile stand out instead of just listing tools or tutorials?
How do you know when you’re “actually ready” to apply for SRE roles?
How to land my first DevOps/SRE job?

Any guidance, personal experience, or roadmap recommendations from folks who’ve already made this jump would help a lot.
Thanks in advance.

https://redd.it/1p6pz6v
@r_devops
FREE APP PROMOTION

DM me your app and we can talk about a possible collaboration

In simple terms, what I do is help founders grow early traction through short form content. We create and send out ready to post TikToks tailored to your app’s niche and you just post them. It is a collaboration. You get consistent reach and user feedback, while we handle the creative and strategy side.

No cost at all. The reason is we already produce hundreds of TikToks weekly, and what we really need are real founders who can post them. In return, you get content that is customized for your app, consistent posting without the burnout, and real reach that helps you find users and feedback faster.

You could do it solo, but this just saves you time, keeps it consistent, and gets you exposure with zero risk or learning curve.

https://redd.it/1p6zvnd
@r_devops
Which AI to choose for coding? URGENT

I'm not a developer and have no coding knowledge.

Quick backstory: I hired a local developer to build accounting software for my workshop because my budget was low. He did a really poor job, so I moved to AI platforms to work on it myself. The software is an Electron app written in JavaScript. It has one index file for the backend and logic, and the frontend folder has all the pages in it. I used AI platforms to modify the software and got a lot done, but now I am stuck and cannot find a good AI platform to finish things for me.
They all mess it up and ruin the code.
The files are mostly around 1000 lines each (a bit less or more). I have been working on this day and night for 3 months, and it takes so much time to get a single issue fixed because the AI keeps giving me errors, and I have to go back and forth a thousand times to get working, error-free code. I fix one thing, then run the whole build again to see if it works, then again and again and again.

Now I'm willing to buy a subscription for one of the AIs so I can depend on it completely and wrap this up quickly, because I cannot waste any more time. It is affecting my work and time management really badly.

Platforms i have used:(free ones)

Chatgpt
Deepseek
Grok(best one yet)
Github copilot
Gemini
Claude

I am not a professional and have no idea which models to use on which platforms, but I feel like I should go for a premium plan so that maybe it works more intelligently.

Please help me get this done🙏🏼

https://redd.it/1p71gom
@r_devops
Does hybrid security create invisible friction no one admits?

Hybrid security policies don’t just block access, they subtly shape how people work. Some teams duplicate work just to avoid policy conflicts. Some folks even find workarounds, probably not great. Nobody talks about it because it’s invisible to leadership, but it’s real. Do you all see this in your orgs, or is it just us?


https://redd.it/1p72igz
@r_devops
devs who’ve tested a bunch of AI tools, what actually reduced your workload instead of increasing it?



i’ve been hopping between a bunch of these coding agents and honestly most of them felt cool for a few days and then started getting in the way. after a while i just wanted a setup that doesn’t make me babysit it.

right now i’ve narrowed it down to a small mix. cosine has stayed in the rotation, along with aider, windsurf, cursor’s free tier, cody, and continue dev. tried a few others that looked flashy but didn’t really click long term.

curious what everyone else settled on. which ones did you keep, and which ones did you quietly uninstall after a week?

https://redd.it/1p72pjc
@r_devops
Relying on AI for learning, is it good or bad?

Hello everyone! I recently quit my Game Dev job and decided that DevOps is a better field for my mindset and work style so i made the switch.

I'm currently building my own homelab from scratch so i can use it as my portfolio and i can actually have some autonomy under my belt that i can rely on for my daily life. I'm pretty new to this, just started last week. So far i can confidently say that i have knowledge about the stuff i integrated.

Short summary of what i have;

I have 2 Arch PCs and 1 Debian server PC that I set up manually with partitions, encryption, etc. I practice Linux daily on my main PC and use the terminal consistently. I SSH into the other two PCs when I want to do something. Debian currently runs Linkding behind an Nginx reverse proxy. I plan to integrate GitHub Actions CI, Grafana and Prometheus next. I have a few Bash scripts I run for my own use and I can code in Python. The homelab is being documented on GitHub with README files.

I quite enjoy learning something completely new to me and making progress in it, but I do a lot of stuff by asking AI and learning why and how I should do it that way. I'm mostly following its recommendations even though I find different approaches from time to time.

I wonder if it's too dangerous for learning to approach AI as an assistant like this or am i just overthinking, i can't be sure. What are your thoughts about this, what would your recommendations be?

https://redd.it/1p73dns
@r_devops
Looking for a few Network / Automation Engineers to try a new multi-vendor CLI + automation workflow tool

Hey all,

I’m working with a small team on a new workflow tool for network and automation engineers. Before we open it to a bigger audience, we’re looking for a few people who regularly deal with things like:

• Multi-vendor networks (Cisco, Juniper, Arista, etc.)
• Lots of parallel SSH sessions
• Repetitive CLI workflows
• Troubleshooting or debugging across multiple devices
• Lab work (CML, EVE-NG, GNS3, vendor simulators)
• Python/Ansible automation or CI/CD validation

The goal is to make everyday operational tasks a lot smoother, especially for people who are constantly jumping between devices or dealing with multi-vendor issues.

We’re looking for a handful of engineers willing to try it out and give honest feedback based on your real workflows.

Happy to compensate for your time (approximately 1 hr/day for 1–2 months).

If this sounds interesting, feel free to DM me or drop a comment and I’ll reach out with details.



Thanks!



https://redd.it/1p736at
@r_devops
If you had to pick one vendor for cross-browser + mobile + API testing, who’s your shortlist?

Our QA team is trying to consolidate tools instead of juggling 3–4 platforms.
Which vendors actually deliver all-in-one testing (cloud devices, browsers, API monitors)?
Is TestGrid, LambdaTest, or BrowserStack closer to a “single pane of glass,” or is that still unrealistic?

https://redd.it/1p75ume
@r_devops
How to run Llama 3.1 70B on EC2

Hi,
Has anyone tried to run Llama 3.1 70B on an EC2 instance?

If so, which instance size did you choose?
I’m trying to run the same model with Ollama but can’t figure out the right instance size.



https://redd.it/1p7ak1j
@r_devops
I built an agentless K8s cost auditor (Bash + Python) to avoid long security reviews

I've been consulting for startups and kept running into the same wall: we needed to see where money was being wasted in the cluster, but installing tools like Kubecost or CastAI required a 3-month security review process because they install persistent agents/pods.

So I built a lightweight, client-side tool to do a "15-minute audit" without installing anything in the cluster.

How it works:
1. It runs locally on your machine using your existing kubectl context.
2. It grabs kubectl top metrics (usage) and compares them to deployments (requests/limits).
3. It calculates the cost gap using standard cloud pricing (AWS/GCP/Azure).
4. It prints the monthly waste total directly to your terminal.
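
A rough sketch of the calculation in steps 2 to 4, not the actual wozz code. The prices, the `Workload` fields, and the way the memory buffer is applied (usage plus a percentage) are illustrative assumptions:

```python
from dataclasses import dataclass

# Hypothetical flat prices; real cloud pricing varies by provider, region and instance type.
CPU_PRICE_PER_CORE_HOUR = 0.04
MEM_PRICE_PER_GIB_HOUR = 0.005
HOURS_PER_MONTH = 730


@dataclass
class Workload:
    name: str
    cpu_request_cores: float  # from the deployment spec
    cpu_usage_cores: float    # from `kubectl top`
    mem_request_gib: float
    mem_usage_gib: float


def monthly_waste(w, mem_buffer=0.20):
    """Estimate the monthly cost of requested-but-unused CPU and memory.

    mem_buffer is headroom kept on top of observed memory usage; applying it
    as usage * (1 + buffer) is an assumption made for this sketch.
    """
    cpu_excess = max(0.0, w.cpu_request_cores - w.cpu_usage_cores)
    mem_needed = w.mem_usage_gib * (1 + mem_buffer)
    mem_excess = max(0.0, w.mem_request_gib - mem_needed)
    hourly = cpu_excess * CPU_PRICE_PER_CORE_HOUR + mem_excess * MEM_PRICE_PER_GIB_HOUR
    return hourly * HOURS_PER_MONTH


# An over-provisioned workload and one that already uses more than it requests.
api = Workload("api", cpu_request_cores=2.0, cpu_usage_cores=0.5,
               mem_request_gib=4.0, mem_usage_gib=1.0)
worker = Workload("worker", cpu_request_cores=1.0, cpu_usage_cores=1.2,
                  mem_request_gib=2.0, mem_usage_gib=1.9)
total = monthly_waste(api) + monthly_waste(worker)
```

Clamping at zero keeps a workload that uses more than it requests from showing up as negative waste.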

Features:
100% Local: No data leaves your machine.
Stateless Viewer: If you want charts, I built a client-side web viewer (drag & drop JSON) that parses the data in your browser.
Privacy: Pod names are hashed locally before any export/visualization.
MIT Licensed: You can fork/modify it.

Repo: https://github.com/WozzHQ/wozz

Quick Start:
curl -sL https://raw.githubusercontent.com/WozzHQ/wozz/main/scripts/wozz-audit.sh | bash

I'm looking for feedback on the waste calculation logic—specifically, does a 20% safety buffer on memory requests feel right for most production workloads?

Thanks!

https://redd.it/1p7baoc
@r_devops
I built a simple CLI tool to audit AWS IAM keys because I was tired of clicking through the Console. Roast my code.

Hey everyone,

I've been working on hardening cloud setups for a while and noticed I always run the same manual checks: looking for users without MFA, old access keys (>90 days), and dormant admins.

So I wrote a Python script (Boto3) to automate this and output a simple table.
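
Roughly, the MFA and key-age checks look like the sketch below. This is a simplified illustration rather than the code in the repo: the function names and the shape of the findings are assumptions, and the dormant-admin check is left out. The Boto3 calls used are `list_users`, `list_access_keys` and `list_mfa_devices`.

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def stale_keys(key_metadata, now=None, max_age=MAX_KEY_AGE):
    """Return IDs of access keys older than max_age.

    key_metadata is the "AccessKeyMetadata" list returned by list_access_keys.
    """
    now = now or datetime.now(timezone.utc)
    return [k["AccessKeyId"] for k in key_metadata if now - k["CreateDate"] > max_age]


def audit_users(iam):
    """Collect per-user findings. `iam` is a Boto3 IAM client."""
    findings = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            name = user["UserName"]
            keys = iam.list_access_keys(UserName=name)["AccessKeyMetadata"]
            mfa = iam.list_mfa_devices(UserName=name)["MFADevices"]
            findings.append({
                "user": name,
                "mfa_enabled": bool(mfa),
                "stale_keys": stale_keys(keys),
            })
    return findings


# Offline example with made-up key IDs and a fixed "now", so no AWS access is needed.
now = datetime(2025, 11, 26, tzinfo=timezone.utc)
sample_keys = [
    {"AccessKeyId": "AKIAEXAMPLEOLD", "CreateDate": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"AccessKeyId": "AKIAEXAMPLENEW", "CreateDate": datetime(2025, 11, 1, tzinfo=timezone.utc)},
]
old = stale_keys(sample_keys, now=now)
```

Against a real account you would run `audit_users(boto3.client("iam"))` with credentials that have read-only IAM permissions.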

It’s open-source. I’d love some feedback on the logic or suggestions on what other security checks I should add.
repo

https://redd.it/1p7bbop
@r_devops
DevOps engineer here – want to level up into MLOps / LLMOps + go deeper into Kubernetes. Best learning path in 2026?

I’ve been working as a DevOps engineer for a few years now (CI/CD, Terraform, AWS/GCP, Docker, basic K8s, etc.). I can get around a cluster, but I know my Kubernetes knowledge is still pretty surface-level.

With all the AI/LLM hype, I really want to pivot/sharpen my skills toward MLOps (and especially LLMOps) while also going much deeper into Kubernetes, because basically every serious ML platform today runs on K8s.

My questions:

1. What’s the best way in 2025 to learn MLOps/LLMOps coming from a DevOps background?
Are there any courses, learning paths, or certifications that you actually found worth the time?
Anything that covers the full cycle: data versioning, experiment tracking, model serving, monitoring, scaling inference, cost optimization, prompt management, RAG pipelines, etc.?
2. Separately, I want to become really strong at Kubernetes (not just “I deployed a yaml”).
Looking for a path that takes me from intermediate → advanced → “I can design and troubleshoot production clusters confidently”.
CKA → CKAD → CKS worth it in 2025? Or are there better alternatives (KodeKloud, Kubernetes the Hard Way, etc.)?

I’m willing to invest serious time (evenings + weekends) and some money if the content is high quality. Hands-on labs and real-world projects are a big plus for me.

https://redd.it/1p7ey3d
@r_devops
Impostor Syndrome in Tech: Why It Hits Hard and What to Do About it

Have you ever thought you are not good enough at work? That you are not smart enough for the job, and it’s all just luck? That’s called impostor syndrome! And it’s more common than you think, because many people don’t even dare to talk about it!

I wrote a post about that mainly focusing on DevOps, but it’s still valid for software engineering, and the tech industry in general:

- What is impostor syndrome, and what is not?
- Why does impostor syndrome hit hard?
- What to do about impostor syndrome?


Impostor Syndrome in Tech: Why It Hits Hard and What to Do About it

Enjoy :-)

https://redd.it/1p7fqm6
@r_devops
I feel lost, how do I manage to build the right pipeline as a junior dev in my company without a senior?

I have about 2 years of experience as a software developer.

In my last job I had a good senior who taught me a bit of DevOps with Azure DevOps. My current boss doesn't know much about CI/CD and DevOps strategies in general; basically, he worked directly on production and copied the compiled .exe to the server when done...

Over the past months, in the few free moments I had, I've set up a very simple pipeline on Bitbucket that runs on a self-hosted Windows machine:

BUILD->DEPLOY

But now I want to improve it by adding more steps. At the very least I want to version the db, because otherwise it's a mess. I've set up a test machine with a test database. I was thinking about starting simple with:

BUILD -> UPDATE TEST DB -> UPDATE PRODUCTION DB -> DEPLOY

Is this OK? Should each of us use a local copy of the db to work with? Do we always have to check for new changes in the db when working with it? We use Visual Studio.

I feel lost. I know that each environment is different and there isn't a strategy that works for everyone, but I don't even know where I can learn about this.

https://redd.it/1p7bbz6
@r_devops
What’s the worst kind of API analytics setup you’ve inherited from a previous team?

Is it just me or do most teams over-engineer API observability?

https://redd.it/1p7iich
@r_devops