Starting Cloud/DevOps career — is full CCNA worth it or are networking basics enough?
Hi all,
I’m a CS student planning to move into Cloud/DevOps as a fresher and looking at a 6-8 month training program. They cover Linux + CCNA (networking) in the first half and AWS + DevOps tools in the second half.
My main confusion is about CCNA — for someone targeting entry-level DevOps roles, is doing the full CCNA actually worth the time, or are networking fundamentals (IP, DNS, ports, routing basics, etc.) enough to learn on my own?
If you were starting again as a beginner, what would you focus on instead to become job-ready faster?
Would really appreciate practical advice from people working in DevOps/Cloud. Thanks!
https://redd.it/1raogv8
@r_devops
Built a tool to search production logs 30x faster than jq
I built zog in Zig (early stages)
Goal: Search JSONL files at NVMe speed limits (3+ GB/s)
Key techniques:
1. SIMD pattern matching - Process 32 bytes/instruction instead of 1
2. Double-buffered async I/O - Eliminate I/O wait time
3. Zero heap allocations - All scanning in pre-allocated buffers
4. Pre-compiled query plans - No runtime overhead
Results: 30-60x faster than jq, 20-50x faster than grep
Trade-offs I made:
- No JSON AST (can't track nesting)
- Literal numeric matching (90 ≠ 90.0)
- JSONL-only (no pretty-printed JSON)
For log analysis, these are acceptable limitations for the massive speedup.
GitHub: https://github.com/aikoschurmann/zog
Would love to get some feedback on this.
I was, for example, thinking about adding a post-processing step that does a full AST traversal after an initial fast selection.
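That two-stage idea, a fast literal selection followed by a full parse on the survivors, can be sketched in a few lines of Python (a toy illustration of the approach, not zog's implementation):

```python
import json

def scan(lines, key, value):
    """Two-stage JSONL scan: a cheap substring prefilter rejects most
    lines, then a full JSON parse on the survivors resolves semantics
    the byte-level pass can't (e.g. 90 vs 90.0)."""
    hits = []
    for line in lines:
        # Stage 1: fast literal prefilter (what zog does at SIMD speed).
        if key not in line:
            continue
        # Stage 2: exact check via a real parse, skipping malformed lines.
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if obj.get(key) == value:
            hits.append(obj)
    return hits
```

The prefilter does the bulk rejection at raw byte speed; the parser only sees the small candidate set, so its cost is amortized away.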
https://redd.it/1rb1kh3
@r_devops
Autonomous agents/complex workflows
Hey guys. I’m working on a small project and looking to connect with builders creating autonomous agents and complex workflows. I’m not selling anything, just hoping to talk about your setup and possibly run your agents through my alpha. For reference, my project is an execution and governance layer that sits between agent intent and agent action.
https://redd.it/1rb3h2j
@r_devops
What's actually broken about post-mortems at your company?
What was the most broken part of your post-mortem process? Not the incident itself, the aftermath.
For me, the worst part is always the "How did we miss this in staging?" question. It's never a simple answer, and trying to explain environmental drift or non-deterministic race conditions to a VP who just wants a "yes/no" feels like a losing battle. I end up writing a doc that's half technical narrative, half political damage control, and neither half is actually useful the next time something breaks.
Curious whether this is universal or just a me problem. Maybe your team has actually figured this out. I genuinely want to know if anyone has a process that doesn't feel like reconstruction work after the fact.
https://redd.it/1raym9p
@r_devops
I turned my portfolio into my first DevOps project
Hi everyone!
I'm a software engineering student and wanted to share how (and why) I migrated my portfolio from Vercel to Oracle Cloud.
My site is fully static (Astro + Svelte) except for a runtime API endpoint that serves dynamic Open Graph images. A while back, Astro's sitemap integration had a bug that was specific to Vercel and was taking a while to get fixed. I'd also just started learning DevOps, so I used it as an excuse to move over to OCI and build something more hands on.
The whole site is containerized with Docker using a Node.js image. GitLab CI handles building and pushing the image to Docker Hub, then SSHes into my Ubuntu VM and triggers a deploy.sh script that stops the old container and starts the new one. Caddy runs on the VM as a reverse proxy, and Cloudflare sits in front for DNS, SSL, and caching.
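For readers curious what that flow looks like, here is a hypothetical `.gitlab-ci.yml` sketch; job names, variables, and image tags are assumptions for illustration, not the actual repo's config:

```yaml
deploy:
  stage: deploy
  script:
    # Build and publish the image to Docker Hub
    - docker build -t "$DOCKERHUB_USER/site:latest" .
    - docker login -u "$DOCKERHUB_USER" -p "$DOCKERHUB_TOKEN"
    - docker push "$DOCKERHUB_USER/site:latest"
    # SSH into the VM and swap containers via the deploy script
    - ssh "$VM_USER@$VM_HOST" './deploy.sh'
```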
The site itself is pretty simple but I'm really proud of the architecture and everything I learned putting it together.
Feel free to check out the repo and my site!
https://redd.it/1rba39q
@r_devops
Update: I built RunnerIQ in 9 days — priority-aware runner routing for GitLab, validated by 9 of you before I wrote code. Here's the result.
Two weeks ago I posted here asking if priority-aware runner scheduling for GitLab was worth building. 4,200 of you viewed it. 9 engineers gave detailed feedback. One EM pushed back on my design 4 times.
I shipped it. Here's what your feedback turned into.
## The Problem
GitLab issue #14976 — 523 comments, 101 upvotes, open since 2016. Runner scheduling is FIFO. A production deploy waits behind 15 lint checks. A hotfix queued behind a docs build.
## What I Built
4 agents in a pipeline:
- Monitor — Scans runner fleet (capacity, health, load)
- Analyzer — Scores every job 0-100 priority based on branch, stage, and pipeline context
- Assigner — Routes jobs to optimal runners using hybrid rules + Claude AI
- Optimizer — Tracks performance metrics and sustainability
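The Analyzer's 0-100 scoring could look roughly like this; the weights below are invented for the example, not RunnerIQ's actual calibration:

```python
def score_job(branch, stage, is_hotfix=False):
    """Illustrative 0-100 priority score from pipeline context."""
    score = 40  # baseline for any job
    if branch in ("main", "master"):
        score += 25          # production branches outrank feature work
    if stage == "deploy":
        score += 20          # deploys outrank lint/docs stages
    elif stage in ("lint", "docs"):
        score -= 20
    if is_hotfix:
        score += 35
    return max(0, min(100, score))
```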
## Design Decisions Shaped by r/devops Feedback
| Your Challenge | What I Built |
|---|---|
| "Why not just use job tags?" | Tag-aware routing as baseline, AI for cross-tag optimization |
| "What happens when Claude is down?" | Graceful degradation to FIFO — CI/CD never blocks |
| "This adds latency to every job" | Rules engine handles 70% in microseconds, zero API calls. Claude only for toss-ups |
| "How do you prevent priority inflation?" | Historical scoring calibration + anomaly detection in Agent 4 |
## The Numbers
- 3 milliseconds to assign 4 jobs to optimal runners
- Zero Claude API calls when decisions are obvious (~70% of cases)
- 712 tests, 100% mypy type compliance
- $5-10/month Claude API cost vs hundreds for dedicated runner pools
- Advisory mode — every decision logged for human review
- Falls back to FIFO if anything fails. The floor is today's behavior. The ceiling is intelligent.
## Architecture
Rules-first, AI-second. The hybrid engine scores runner-job compatibility. If the top two runners are within 15% of each other, Claude reasons through the ambiguity and explains why. Otherwise, rules assign instantly with zero API overhead.
Non-blocking by design. If RunnerIQ is down, removed, or misconfigured — your CI/CD runs exactly as it does today.
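The 15% margin rule above can be sketched like this (hypothetical names, not RunnerIQ's actual API):

```python
def pick_runner(scores, margin=0.15):
    """Rules-first routing sketch: assign instantly when one runner is a
    clear winner; return None to signal 'escalate to the AI tie-breaker'
    when the top two compatibility scores are within `margin`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, b_score), (_, s_score) = ranked[0], ranked[1]
    if s_score >= b_score * (1 - margin):
        return None  # ambiguous: hand off to the LLM for reasoning
    return best
```

The clear-winner path never touches the network, which is what keeps the common case in microseconds.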
## Repo
Open source (MIT): https://gitlab.com/gitlab-ai-hackathon/participants/11553323
Built in 9 days from scratch for the GitLab AI Hackathon 2026. Python, Anthropic Claude, GitLab REST API.
---
Genuine question for this community: For teams running shared runner fleets (not K8s/autoscaling), what's the biggest pain point — queue wait times, resource contention, or lack of visibility into why jobs are slow? Trying to figure out where to focus the v2.0 roadmap.
https://redd.it/1rbgbft
@r_devops
Auto removal of posts from new accounts
Dear community, we heard you and we feel the same.
The settings for this sub were configured to automatically remove posts from new accounts. No more reviewing in the mod queue; there are simply too many.
There may still be some false positives. We will keep an eye on it; please continue to report anything that looks wrong.
For the genuine posters, we are sorry, but it is not the end of the world: take your time to look around, participate in existing threads, grow your account.
For the advertisements, self-promotions, business startups and solo startups: it is clear that this community does not tolerate such posts very well.
There will always be someone unhappy with this decision or that one, but we cannot satisfy everyone. Sorry for that.
Enjoy your on topic discussions and please remain civil and professional, this is DevOps sub, related to DevOps industry, not a playground.
https://redd.it/1reggwn
@r_devops
After 8 years, my chaos testing tool learned to speak containerd — Pumba v1.0
Pumba is a CLI for chaos testing containers. Kill them. Inject network delays. Drop packets. Stress their CPUs until something breaks. Named after the Lion King warthog because a tool that intentionally breaks things should have a sense of humor about it.
For 8 years, it only spoke Docker. Then Docker stopped being the only container runtime that mattered, and here we are.
What changed:
`pumba --runtime containerd --containerd-namespace k8s.io kill my-container`
Three flags, full feature parity. Every chaos command works on both runtimes.
Things I learned the hard way building this:
1. Containerd's API is a different mindset. Docker gives you `--net=container:X` for network namespace sharing. Containerd hands you OCI specs and says "figure it out." More control, more footguns. Same destination, stick shift instead of automatic.
2. Sidecar cleanup will keep you up at night. When your parent context cancels, your sidecar still needs SIGKILL, wait for exit, task deletion, container removal. `context.WithoutCancel()` from Go 1.21 saved this from needing a second background context just for deferred cleanup. Before 1.21, the workaround was ugly.
3. Container naming is a different kind of chaos. Kubernetes: `io.kubernetes.container.name`. nerdctl: `nerdctl/name`. Docker Compose: `com.docker.compose.service`. Raw containerd: here's a SHA256, best of luck. Pumba resolves all of them automatically, because nobody should be running `ctr containers list` and grepping for an ID just to inject a network delay.
4. cgroups v2 path construction depends on driver (cgroupfs vs systemd) and cgroup version, producing wildly different filesystem paths. Auto-detection is the only approach that works. The `cg-inject` binary handles all combinations and ships inside the `ghcr.io/alexei-led/stress-ng` scratch image.
5. Real OOM kills are not SIGKILL. This is worth repeating. Most chaos tools "simulate" OOM by sending SIGKILL and marking the checkbox. Real OOM kills produce `OOMKilled: true` in container state, different Kubernetes events, different alerting paths, different restart behavior. With `--inject-cgroup`, stress-ng shares the target's cgroup. Fill memory to the limit and the kernel OOM-kills the whole cgroup. We validated this with 40 advanced Go integration tests, including scenarios where the target gets OOM-killed mid-chaos and we verify Pumba detects it and cleans up without panicking.
GitHub: https://github.com/alexei-led/pumba
If you're doing chaos on containerd-based clusters, I'd be curious what gaps you're hitting. And if you're not doing chaos testing at all... that's a choice. Just an increasingly uncomfortable one.
https://redd.it/1rh4py4
@r_devops
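The label lookup from point 3 amounts to a simple precedence list. A toy Python illustration of the idea (not Pumba's Go implementation):

```python
# Runtime-specific labels that carry a human-readable container name,
# tried in precedence order.
NAME_LABELS = (
    "io.kubernetes.container.name",   # Kubernetes
    "nerdctl/name",                   # nerdctl
    "com.docker.compose.service",     # Docker Compose
)

def resolve_name(container_id, labels):
    """Resolve a friendly container name across runtimes, falling back
    to the truncated raw ID for bare containerd containers."""
    for label in NAME_LABELS:
        if label in labels:
            return labels[label]
    return container_id[:12]
```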
Interviewed somebody today; lots of skills, not much person
I interviewed a person today for a DevOps role. His resume was very thick with technical things. Software he's used, frameworks, programming languages, security and compliance regulations, standards, etc. There was not much about how he worked with those things, what he did with them, which bits he was more familiar with and less familiar with.
I tried to get an idea about what kind of techie he is. Did he learn these things on his own? Or is he driven more by learning things as needed for the job? Has he designed anything on his own? Is he lawful good or chaotic neutral or...? Etc.
The answers I got made it feel like most of what he's done is work where someone else directed him: he coordinated with other teams, used vendor tools with pre-determined actions, ran scripts, etc. This is okay, since this wasn't for a senior role. But it made me think about how important it is, as a job seeker, to give a potential employer an idea of what kind of work you do. It's not just about checking boxes or flexing hard skills, but showing that you're a person as well, especially since these days everyone's on the lookout for AI chatbot answers. In this case, maybe he was just nervous. Maybe he's not good in formal situations. Or maybe he's just "not a good fit", as they say.
https://redd.it/1rfr007
@r_devops
Lucrative DevOps Fields/Jobs?
Based on your experience, what DevOps positions tend to pay high salaries(250k+)?
I come from a networking background, but I've since made the switch to DevOps. Back in the networking space, if you wanted to make a lot of money you would get a CCIE certification and try to work at a networking vendor such as Cisco, Arista, or Juniper. There was also the option of working at high-frequency trading companies, where stress levels are high but so is the pay.
What's the equivalent for DevOps?
Do companies like AWS pay their in-house DevOps engineers a lot? What skills does the industry value to command that type of pay? Are there high-paying DevOps vendors out there? I know certifications aren't valued the way they used to be.
https://redd.it/1rfvwf4
@r_devops
Helm in production: lessons and gotchas
Hi everyone! I've been using Helm in production at scale for the past few years and collected lessons and gotchas that surprised me:
- Helm doesn't manage CRDs.
- `--wait` doesn't wait for readiness of all resources.
- Dry run is dependent on the state of an existing release.
- Values can be validated with JSON schema.
- OCI registries can be used for charts alongside container images.
I think the tip about values validation is the coolest, because loading the schema into yaml-language-server is a great development-experience boost and helps LLMs do better work writing values.
Hope you find this post useful; I think even experienced Helm users can learn something from it.
https://redd.it/1rgdp5x
@r_devops
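On the values-validation tip: Helm checks a `values.schema.json` (JSON Schema) in the chart root against the merged values on `helm install`, `helm upgrade`, and `helm lint`. A minimal sketch of such a schema, with field names invented for the example:

```json
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["image"],
  "properties": {
    "image": {
      "type": "object",
      "required": ["repository", "tag"],
      "properties": {
        "repository": { "type": "string" },
        "tag": { "type": "string" }
      }
    },
    "replicaCount": { "type": "integer", "minimum": 1 }
  }
}
```

The same file is what yaml-language-server can load to give completion and inline validation while editing values.yaml.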
ECS CICD Rollback?
Hi guys! What's the best way to roll back in an ECS CI/CD pipeline? Should I describe the last active task definition and rerun it (though that will show a diff against the task definition in GitHub), or just revert to the last successful action? I think the second option would be better, but is there another solution?
any blogs or suggestions would be great
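One common pattern is to re-point the service at the previous task-definition revision with `aws ecs update-service --cluster <cluster> --service <service> --task-definition <previous-arn>`. Selecting that revision from the registered ARNs can be sketched as (a hypothetical helper, plain Python):

```python
def previous_revision(arns, current):
    """Given task-definition ARNs ending in ':<revision>' (the shape the
    ECS ListTaskDefinitions API returns), pick the revision just below
    the currently deployed one, i.e. the natural rollback target."""
    rev = lambda arn: int(arn.rsplit(":", 1)[1])
    older = [a for a in arns if rev(a) < rev(current)]
    return max(older, key=rev) if older else None
```

Reverting the Git commit and rerunning the pipeline keeps the repo and the deployed state in sync, so it is the cleaner default; the ARN approach is faster when you need the service healthy before CI finishes.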
https://redd.it/1rfx80d
@r_devops
What is platform engineering exactly?
Every time I tell someone what I like and how I think, they end up in some way or another recommending platform engineering.
For example, I’ve always wanted to contribute to open source projects I liked but never felt I was technically there to help outside infra and cloud, which prompted another “PE is perfect for you.” Yet every explanation I get is different, and not slightly different: each one could describe a different role entirely.
I won’t make this post long by listing exactly what I like and what I don’t, but I want to understand what platform engineering actually is, and why it keeps being recommended to me. I’d also appreciate some examples of the output of such a role compared to, say, a typical DevOps role.
https://redd.it/1rhefsl
@r_devops
Cloud Engineer roadmap check: Networking + Linux completed, next steps?
I’m transitioning to Cloud Engineering from scratch. I’ve completed basic networking (TCP/IP, DNS, subnetting) and Linux fundamentals (CLI, file permissions, processes). I’m currently learning Git and GitHub. My goal is to get a junior cloud role in 6–9 months. What should I focus on next?
https://redd.it/1rezupb
@r_devops
CleanCloud v1.6.3 - 20 rules to find what's costing you money in AWS/Azure
A while ago I posted about CleanCloud - a shift-left cloud waste reporting tool that enforces hygiene as a CI/CD gate, now with cost estimates and a `--fail-on-cost` CLI option.
AWS Rules (10):
1. Unattached EBS volumes (HIGH)
2. Old EBS snapshots
3. Infinite retention logs
4. Unattached Elastic IPs (HIGH)
5. Detached ENIs
6. Untagged resources
7. Old AMIs
8. Idle NAT Gateways
9. Idle RDS instances (HIGH)
10. Idle load balancers (HIGH)
Azure Rules (10):
1. Unattached Managed Disks
2. Old Snapshots
3. Unused Public IPs
4. Empty Load Balancers
5. Empty Application Gateways
6. Empty App Service Plans
7. Idle VNet Gateways
8. Stopped (Not Deallocated) VMs — still incurring full compute charges
9. Idle SQL Databases (zero connections 14+ days)
10. Untagged Resources
Every finding includes:
- Confidence level (HIGH / MEDIUM)
- Evidence and signals used
- Resource details and age
- Cost waste estimates
Enforce in CI/CD:
`cleancloud scan --provider aws --all-regions --fail-on-confidence HIGH --fail-on-cost 2000`
Exit 0 = pass.
Exit 2 = policy violation.
`pipx install cleancloud` and run your first scan in 5 minutes.
If you’re one of the 200+ users who have downloaded CleanCloud, we’d love to hear what you found.
Please open an issue here or leave a comment below.
https://redd.it/1rf84m8
@r_devops
27001 didn’t change our stack but it sure as hell changed our discipline
We missed two deals so it finally made sense to leadership to pursue ISO 27001.
We did end up tightening parts of our stack: a few workflows became more structured, and some knowledge moved out of people’s heads and into systems. Those changes had their own benefits, but they weren’t the real shift.
The uncomfortable part was answering questions we’d never formally defined. A lot of our processes were muscle memory, and ISO forced us to define them, assign ownership, and create a review cadence.
The discipline we gained changed everything.
https://redd.it/1reqg60
@r_devops
Why does docker output everything to standard error?
Every time I look inside my GitHub workflows I see everything written to stderr. Why does this happen?
Thank you!
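A likely reason, worth verifying against the Docker docs: the docker CLI and BuildKit reserve stdout for machine-readable results (image IDs, inspect JSON) and send human-oriented progress to stderr, so even a successful build fills stderr. A minimal shell sketch of that convention, with made-up output strings:

```shell
# docker-style stream separation: stdout carries the parseable result,
# stderr carries progress/diagnostics. Output strings here are invented
# for illustration.
emit() {
  echo "sha256:abc123"                            # result -> stdout
  echo "#1 [internal] load build definition" >&2  # progress -> stderr
}
result=$(emit 2>/dev/null)   # a pipeline can consume stdout alone
echo "result=$result"
```

GitHub Actions renders everything written to stderr in its log view, which is why the build output shows up there even when the job succeeds.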
https://redd.it/1rhts32
@r_devops
Build a website for DevOps Learning
Hey folks
After a long time, I finally rebuilt (vibe-coded) and revamped one of my old projects, DevOps Atlas.
It’s basically a one-stop search engine for DevOps learning resources.
The goal is simple:
Help DevOps engineers discover high-quality learning resources without endless searching.
Any suggestions and feedback are most welcome. Check it out at https://devopsatlas.com/ and let me know what you think!
https://redd.it/1rhwo1p
@r_devops
hackerbot-claw: An AI-Powered Bot Actively Exploiting GitHub Actions - Microsoft, DataDog, and CNCF Projects Hit So Far
https://www.stepsecurity.io/blog/hackerbot-claw-github-actions-exploitation#attack-6-aquasecuritytrivy---evidence-cleared
Now trivy repo is empty.... https://github.com/aquasecurity/trivy
Some advice:
1. Verify the integrity of your Trivy binaries if you installed them at the end of February
2. Switch to the Docker image (if still available on GHCR/Docker Hub) and verify the Cosign signatures
3. Keep Checkov or Grype as a fallback
4. Audit your GitHub Actions workflows: no pull_request_target combined with a checkout of the fork, no unescaped ${{ }} in run blocks
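Point 4 can be roughly automated. Here is a hypothetical grep-based audit; the heuristics are illustrative only (multi-line `run: |` blocks need a real YAML parser to catch reliably):

```shell
# Hypothetical audit for the two risky GitHub Actions patterns above.
audit_workflows() {
  dir="$1"
  for f in "$dir"/*.yml "$dir"/*.yaml; do
    [ -f "$f" ] || continue
    # pull_request_target + checkout of the fork's head = fork code
    # running with a privileged token
    if grep -q 'pull_request_target' "$f" && \
       grep -q 'github\.event\.pull_request\.head' "$f"; then
      echo "REVIEW: $f checks out fork code under pull_request_target"
    fi
    # ${{ }} expanded inside run: -> possible shell injection via
    # attacker-controlled fields (PR title, branch name, ...)
    if grep -q 'run:.*\${{' "$f"; then
      echo "REVIEW: $f interpolates \${{ }} directly in a run block"
    fi
  done
}

audit_workflows ".github/workflows"
```

A hit on the first check means fork-supplied code can run with write permissions; a hit on the second means untrusted input can be injected straight into the shell. The usual fixes are dropping pull_request_target, or passing expressions through env vars instead of inlining them.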
https://redd.it/1ri4nwu
@r_devops