Reddit DevOps – Telegram
Rest api development in a microservices world, where does governance even fit and who owns it

Sixty services and the api layer looks like a yard sale. Different auth patterns, versioning nobody agreed on, rate limiting that exists on maybe half of them and is configured differently on each one that has it.

Platform team (three people including me) keeps getting pulled into incidents that should belong to service teams but don't because there's no standard anyone actually follows. And every time I raise this in an architecture review I get "it depends" answers that don't help me figure out what to actually do next week.

Gateway enforcement or ci/cd enforcement? Who owns the standard, platform or the services? How do you make teams follow it without becoming the bottleneck for every api deployment?

https://redd.it/1ralzfj
@r_devops
Looking to work for free on real devops projects to gain experience

Hi everyone,

I'm learning DevOps and looking to work under an experienced DevOps freelancer to understand real-world projects and workflows.

I'm comfortable with:

\- AWS basics (EC2, VPC, IAM, ALB)

\- Linux & networking fundamentals

\- CI/CD basics

\- Hands-on practice with deployments and troubleshooting

I'm not asking for payment. I'm happy to assist with tasks like documentation, monitoring, testing, basic deployments, or shadowing—anything that helps reduce your workload while | learn.

If you're a freelancer who could use an extra pair of hands (or know someone who might), I'd really appreciate connecting via DMs.

Thanks for reading!

https://redd.it/1r9zhgx
@r_devops
Infra aware tool

Hi. Got hired recently to a big product company and noticed how difficult is onboarding process. Outdated confluence pages, unclear inventory. Nobody can tell for sure how many clusters we have(except CTO maybe), VMs are spread across OCI, AWS and Azure clouds. Hundreds of build configurations in TeamCity for various purposes.

So for me as a new devops getting hands on this infra takes months and still I am finding stuff that I was never aware of.

Question is - if there will be some infra aware chat gpt that you can ask like how many VMs we have with windows arm 64 or which k8s clusters are below 1.30 version, etc. would it make sense in your team ? Would it solve your operational overhead as it would do for me?

https://redd.it/1ram2sv
@r_devops
How likely it is Reddit itself keeps subs alive by leveraging LLMs?

Is reddit becoming Moltbook.. it feels half of the posta and comments are written by agents. The same syntax, structure, zero mistakes, written like for a robot.

Wtf is happening, its not only this sub but a lot of them. Dead internet theory seems more and more real..

https://redd.it/1ralvnj
@r_devops
jq 101 – Practical guide to parsing JSON from the CLI

If you spend your days in the AWS CLI, Azure CLI, Kubernetes, or Terraform, you already know: you’re swimming in JSON. Most folks just pipe everything to grep, scroll through endless output, or hack together a Python noscript for a problem jq solves in seconds.

So, I put together a straight-to-the-point technical guide. It covers the core jq moves: things like .key, .array[\], select(), length, and sort_by. I walk through real examples with a public API, and I tie those examples directly to what you see in AWS and Azure CLI outputs. The patterns I show? They handle about 90% of what you actually deal with in the cloud.

No stories, no fluff. Just clear, practical jq tricks built for DevOps and SRE work. If you’re in the CLI all the time but JSON filtering still feels awkward, this guide clears things up.

Link:

https://medium.com/@odinumbelino/jq-101-how-to-parse-json-like-a-pro-a883ca08b3f9

Feedback welcome.

https://redd.it/1raeo3r
@r_devops
1
I'm being asked to provide inputs

I was asked recently which platform I should pick for our a new self-service pipeline. There are only 2 options given, ECS or EKS/AKS. We have presence on both providers. My knowledge on both is little so I can't decide which one to choose. It seems like my boss is leaning towards k8s since his team has used it before. However, he is still asking me which technology I should use. He also mentioned argocd. I saw it in action in a cncf conference and was quite amazed with the demo. How would you decide on it?

Oh, he is aware that it can take several months in building the new self service tooling and he's ok with that.

https://redd.it/1radje6
@r_devops
Is it possible to use your IDE on your phone??

Hey devs, I wanted to ask if there is any way that I can use my IDE directly on my phone? So that what I have on my laptop is syncing with my phone too.

Is this possible?

https://redd.it/1rat13s
@r_devops
Former software developers, how did you land your first DevOps role?

Hi there! I’m currently a senior full stack software developer in a .NET/react/Azure stack. I love programming and building products but my real passion is building Linux machines, working with Docker and kubernetes, building pipelines, writing automations and monitoring systems, and troubleshooting production issues. I have AWS experience in a previous job where we deployed services to an EKS cluster using GitOps (argocd)

I am currently learning everything I can get my hands on in the hopes of transitioning my career to full time DevOps (infra/cloud engineer, SRE, platform engineer, DevOps engineer, etc)

Right now I’m targeting moving internally - my company does not have a DevOps team and our architects handle all the k8s deployments, IaC, azure environments, etc and it’s proving to be a real bottleneck. I have some buy in already about standing up a true DevOps team but I fear I’ll be passed over because I’m thought to be too valuable on the product development side (inferred from convo with my manager).

I’ve also been scouring job boards for DevOps jobs but am still figuring out the gaps in my current knowledge to get me prepared for an external interview.

I also am in the process of building a kubernetes home lab on bare metal, and I run a side business building and hosting client apps on my Linode k8s cluster.

If you came from product dev as a software developer and are now full time DevOps, how did you do it?

Note: I am in the US.

Edit: adding that I am currently trying to learn Go as a compliment to the DevOps skills I have already - i noticed a lot of DevOps jobs are actually big on python - worth learning instead?

https://redd.it/1ratwxc
@r_devops
Self-Studying Data Engineering — Project Ideas & Open-Source Contributions

I'm a student self-learning Data Engineering. I have a few questions regarding :

1. Projects - What DE projects actually matter when applying without a traditional background in it ? What have you built or seen that genuinely impressed a hiring team?
2. Open Source - I want to contribute to DE/ML open source to learn in public and build credibility. Where should a self-taught person start , who doesn't have years of experience of production ? Specific repos with good onboarding would mean a lot.

FYI: I'm self-taught, comfortable with Python and SQL, dbt ; still learning concepts and growing stack.

https://redd.it/1rat38p
@r_devops
Need Suggestion for Devops Begineer

I'm beginning to learn DevOps, and I'd like to find internship/junior opportunities to get hands-on experience in the field. I am starting with foundational technologies such as Linux, Git, Docker, and CI/CD Pipelines but would appreciate any advice regarding how to proceed.



Here are my current skills/progress:

Docker containerization and using docker-compose

Using GitHub Actions and Jenkins for simple CI/CD

Cloud experiments using Free tier (AWS)



I have some questions specifically about remote opportunities.

What kind of portfolio projects would be attractive to remote companies?

What tools should I familiarize myself with that would be beneficial for remote or part-time positions?

What are some effective methods of applying for remote positions? (LinkedIn outreach, Upwork, AngelList, open-source?)

Are there any resources (virtual internships/bootcamps) that would provide me with valuable remote experience?

https://redd.it/1rashfr
@r_devops
Sprints/Agile/Scrum? What to use when not really doing Programming?

Sorry if this is a silly question but I would love to understand what others are doing?

For context, I was previously a SysAdmin specialising in On Prem servers. Three years ago, I moved to a Cloud Engineer role. I was the only Cloud Engineer for but I do now have a junior reporting to me.

Most of our work isn't programming. We do IaC and there's noscripting in Bash/PowerShell but we're not reporting to Project Managers the stage of a project, etc. A lot of our work is more to do with deployments, troubleshooting servers, maintenance, cost optimisation, etc.

Generally my to do list has always been captured in a notebook but I'm conscious we're not doing Sprints/Agile/Standup and I am wondering if I am missing out on something really powerful... When I've watched videos it sounds quite confusing with Scrum Managers, etc but I'm also concerned that if I went elsewhere as a Senior with no experience in these strategies I would look quite bad.

We have Jira at work - I personally found it quite complicated - Epics, Stories, Poker?, etc. I tried setting up a "sprint start" and "sprint end" meeting but it ended up just being a regular catchup because a lot of our work takes longer than a week since we are often waiting on other teams and dealing with ad-hoc tickets, etc.

Sorry if this isn't a great question. I feel a bit dumb asking but I would love to get a few "Day in the Life" examples from others so I can see how we compare and how I can better improve.

Thanks!

https://redd.it/1raz8zq
@r_devops
Starting Cloud/DevOps career — is full CCNA worth it or are networking basics enough?

Hi all,

I’m a CS student planning to move into Cloud/DevOps as a fresher and looking at a 6-8 month training program. They cover Linux + CCNA (networking) in the first half and AWS + DevOps tools in the second half.

My main confusion is about CCNA — for someone targeting entry-level DevOps roles, is doing the full CCNA actually worth the time, or are networking fundamentals (IP, DNS, ports, routing basics, etc.) enough to learn on my own?

If you were starting again as a beginner, what would you focus on instead to become job-ready faster?

Would really appreciate practical advice from people working in DevOps/Cloud. Thanks!

https://redd.it/1raogv8
@r_devops
Built a tool to search production logs 30x faster than jq

I built zog in Zig (early stages)

Goal: Search JSONL files at NVMe speed limits (3+ GB/s)

Key techniques:

1. SIMD pattern matching - Process 32 bytes/instruction instead of 1

2. Double-buffered async I/O - Eliminate I/O wait time

3. Zero heap allocations - All scanning in pre-allocated buffers

4. Pre-compiled query plans - No runtime overhead

Results: 30-60x faster than jq, 20-50x faster than grep

Trade-offs I made:

\- No JSON AST (can't track nesting)

\- Literal numeric matching (90 ≠ 90.0)

\- JSONL-only (no pretty-printed JSON)

For log analysis, these are acceptable limitations for the massive speedup.

GitHub: https://github.com/aikoschurmann/zog

Would love to get some feedback on this.

I was for example thinking about doing a post processing step where I do a full AST traversal after having done an early fast selection.

https://redd.it/1rb1kh3
@r_devops
Autonomous agents/complex workflows

Hey guys. I’m working on a small project and I need to find builders who are building autonomous agents and complex workflows. I’m not selling anything but just looking to talk about your set up and possibly running your agents through my alpha. My project is an execution and governance layer that sits between agent intent and agent action for reference.

https://redd.it/1rb3h2j
@r_devops
What's actually broken about post-mortems at your company?

What was the most broken part of your post-mortem process? Not the incident itself, the aftermath.For me, the worst part is always the "How did we miss this in staging?" question. It's never a simple answer, and trying to explain environmental drift or non-deterministic race conditions to a VP who just wants a "yes/no" feels like a losing battle. I end up writing a doc that's half technical narrative, half political damage control, and neither half is actually useful the next time something breaks. Curious whether this is universal or just a me problem. Maybe your team has actually figured this out. I genuinely want to know if anyone has a process that doesn't feel like reconstruction work after the fact.

https://redd.it/1raym9p
@r_devops
I turned my portfolio into my first DevOps project

Hi everyone!

I'm a software engineering student and wanted to share how (and why) I migrated my portfolio from Vercel to Oracle Cloud.

My site is fully static (Astro + Svelte) except for a runtime API endpoint that serves dynamic Open Graph images. A while back, Astro's sitemap integration had a bug that was specific to Vercel and was taking a while to get fixed. I'd also just started learning DevOps, so I used it as an excuse to move over to OCI and build something more hands on.

The whole site is containerized with Docker using a Node.js image. GitLab CI handles building and pushing the image to Docker Hub, then SSHs into my Ubuntu VM and triggers a deploy.sh noscript that stops the old container and starts the new one. Caddy runs on the VM as a reverse proxy, and Cloudflare sits in front for DNS, SSL, and caching.

The site itself is pretty simple but I'm really proud of the architecture and everything I learned putting it together.

Feel free to check out the repo and my site!

https://redd.it/1rba39q
@r_devops
Update: I built RunnerIQ in 9 days — priority-aware runner routing for GitLab, validated by 9 of you before I wrote code. Here's the result.

Two weeks ago I posted here asking if priority-aware runner scheduling for GitLab was worth building. 4,200 of you viewed it. 9 engineers gave detailed feedback. One EM pushed back on my design 4 times.

I shipped it. Here's what your feedback turned into.

## The Problem

GitLab issue #14976 — 523 comments, 101 upvotes, open since 2016. Runner scheduling is FIFO. A production deploy waits behind 15 lint checks. A hotfix queued behind a docs build.

## What I Built

4 agents in a pipeline:

- Monitor — Scans runner fleet (capacity, health, load)
- Analyzer — Scores every job 0-100 priority based on branch, stage, and pipeline context
- Assigner — Routes jobs to optimal runners using hybrid rules + Claude AI
- Optimizer — Tracks performance metrics and sustainability

## Design Decisions Shaped by r/devops Feedback

| Your Challenge | What I Built |
|---|---|
| "Why not just use job tags?" | Tag-aware routing as baseline, AI for cross-tag optimization |
| "What happens when Claude is down?" | Graceful degradation to FIFO — CI/CD never blocks |
| "This adds latency to every job" | Rules engine handles 70% in microseconds, zero API calls. Claude only for toss-ups |
| "How do you prevent priority inflation?" | Historical scoring calibration + anomaly detection in Agent 4 |

## The Numbers

- 3 milliseconds to assign 4 jobs to optimal runners
- Zero Claude API calls when decisions are obvious (~70% of cases)
- 712 tests, 100% mypy type compliance
- $5-10/month Claude API cost vs hundreds for dedicated runner pools
- Advisory mode — every decision logged for human review
- Falls back to FIFO if anything fails. The floor is today's behavior. The ceiling is intelligent.

## Architecture

Rules-first, AI-second. The hybrid engine scores runner-job compatibility. If the top two runners are within 15% of each other, Claude reasons through the ambiguity and explains why. Otherwise, rules assign instantly with zero API overhead.

Non-blocking by design. If RunnerIQ is down, removed, or misconfigured — your CI/CD runs exactly as it does today.

## Repo

Open source (MIT): https://gitlab.com/gitlab-ai-hackathon/participants/11553323

Built in 9 days from scratch for the GitLab AI Hackathon 2026. Python, Anthropic Claude, GitLab REST API.

---

Genuine question for this community: For teams running shared runner fleets (not K8s/autoscaling), what's the biggest pain point — queue wait times, resource contention, or lack of visibility into why jobs are slow? Trying to figure out where to focus the v2.0 roadmap.


https://redd.it/1rbgbft
@r_devops
Auto removal of posts from new accounts

Dear community, we heard you and we feel the same.

The settings for this sub were configured to automatically remove posts from new accounts. No more reviewing in the mod queue. There is just too many?

There may be still some false positives, we will keep an eye, please continue to report if you see something is wrong.

For the genuine posters, we are sorry but it is not the end of the world - take your time to look around, participate in existing threads, grow your account.

For the advertisements, self promotions, business startups and solo startups - it is clear that this community does not tolerate such posts very well.

There will always be someone unhappy with this decision or that decision, but cannot satisfy everyone. Sorry for that.

Enjoy your on topic discussions and please remain civil and professional, this is DevOps sub, related to DevOps industry, not a playground.

https://redd.it/1reggwn
@r_devops
After 8 years, my chaos testing tool learned to speak containerd — Pumba v1.0

Pumba is a CLI for chaos testing containers. Kill them. Inject network delays. Drop packets. Stress their CPUs until something breaks. Named after the Lion King warthog because a tool that intentionally breaks things should have a sense of humor about it.

For 8 years, it only spoke Docker. Then Docker stopped being the only container runtime that mattered, and here we are.

What changed:

pumba --runtime containerd --containerd-namespace k8s.io kill my-container


Three flags, full feature parity. Every chaos command works on both runtimes.

Things I learned the hard way building this:

1. Containerd's API is a different mindset. Docker gives you --net=container:X for network namespace sharing. Containerd hands you OCI specs and says "figure it out." More control, more footguns. Same destination, stick shift instead of automatic.

2. Sidecar cleanup will keep you up at night. When your parent context cancels, your sidecar still needs SIGKILL, wait for exit, task deletion, container removal. context.WithoutCancel() from Go 1.21 saved this from being a second background context just for deferred cleanup. Before 1.21, the workaround was ugly.

3. Container naming is a different kind of chaos. Kubernetes: io.kubernetes.container.name. nerdctl: nerdctl/name. Docker Compose: com.docker.compose.service. Raw containerd: here's a SHA256, best of luck. Pumba resolves all of them automatically, because nobody should be running ctr containers list and grepping for an ID just to inject a network delay.

4. cgroups v2 path construction depends on driver (cgroupfs vs systemd) and cgroup version, producing wildly different filesystem paths. Auto-detection is the only approach that works. The cg-inject binary handles all combinations and ships inside the ghcr.io/alexei-led/stress-ng scratch image.

5. Real OOM kills are not SIGKILL. This is worth repeating. Most chaos tools "simulate" OOM by sending SIGKILL and marking the checkbox. Real OOM kills produce OOMKilled: true in container state, different Kubernetes events, different alerting paths, different restart behavior. With --inject-cgroup, stress-ng shares the target's cgroup. Fill memory to the limit and the kernel OOM-kills the whole cgroup. We validated this with 40 advanced Go integration tests, including scenarios where the target gets OOM-killed mid-chaos and we verify Pumba detects it and cleans up without panicking.

GitHub: https://github.com/alexei-led/pumba

If you're doing chaos on containerd-based clusters, I'd be curious what gaps you're hitting. And if you're not doing chaos testing at all... that's a choice. Just an increasingly uncomfortable one.

https://redd.it/1rh4py4
@r_devops
Interviewed somebody today; lots of skills, not much person

I interviewed a person today for a DevOps role. His resume was very thick with technical things. Software he's used, frameworks, programming languages, security and compliance regulations, standards, etc. There was not much about how he worked with those things, what he did with them, which bits he was more familiar with and less familiar with.

I tried to get an idea about what kind of techie he is. Did he learn these things on his own? Or is he driven more by learning things as needed for the job? Has he designed anything on his own? Is he lawful good or chaotic neutral or...? Etc.

The answers I got made it feel like most of what he's done is work where someone else directed him, he coordinated with other teams, used vendor tools with pre-determined actions, ran noscripts, etc. This is okay, since this wasn't for a senior role. But it made me think about how important it is, as a job seeker, to give a potential employer an idea of what kind of work you do. It's not just about checking boxes or flexing on hard skills, but showing that you're a person as well. Especially since these days everyone's on the lookout for AI chatbot answers. In this case, maybe he was just nervous. Maybe he's not good in formal situations. Or maybe he's just "not a good fit", as they say.

https://redd.it/1rfr007
@r_devops
Lucrative DevOps Fields/Jobs?

Based on your experience, what DevOps positions tend to pay high salaries(250k+)?

I come from a networking background but since then ive made the switch to devops. Back then in the networking space if you wanted to make a lot of money you would get a CCIE certification and try to work at a networking vendor such as Cisco,Arista, and Juniper. There's also the option of working high frequency trading companies where stress levels are high but so is the pay..

Whats the equivalent for DevOps?

Do companies like AWS pay their in-house DevOps engineers a lot? What skills does the industry value to command that type of pay? Are there high paying DevOps vendors out there? I know certifications arent really valued anymore like they used to be.

https://redd.it/1rfvwf4
@r_devops