Reddit DevOps – Telegram
How to go back to W2

I’ve been working for myself for the last 6 years. Built a small B2B SaaS and have strong relationships with my customers.

I'm tired of consulting and ready to wind that part of the business down. I still have high-margin subscription revenue (low six figures ARR) and maintain the infrastructure, though it's low effort these days.

Now, I’m interested in working for a large company. Something 9-5 where I can work with smart, driven people. I miss working with passionate peers. I only have a couple employees now who work 95% independently day to day.

I want to work on something new and exciting, but without killing myself or sinking all my money into it (I have young kids).

Am I even employable in my situation? I have no clue. I’m not in a rush, just looking for advice. Thank you!

https://redd.it/1nzpn8i
@r_devops
How are you scheduling GPU-heavy ML jobs in your org?

From speaking with many research labs over the past year, I've heard that ML teams usually fall back to either SLURM or Kubernetes for training jobs. They've shared challenges with both:

* SLURM is simple but rigid, especially for hybrid/on-demand setups
* K8s is elastic, but manifests and debugging overhead don't make for a smooth researcher experience

We’ve been experimenting with a different approach and just released Transformer Lab GPU Orchestration. It’s open-source and built on SkyPilot + Ray + K8s. It’s designed with modern AI/ML workloads in mind:

* All GPUs (local plus 20+ clouds) are presented to researchers as a unified pool they can reserve
* Jobs can burst to the cloud automatically when the local cluster is fully utilized
* Distributed orchestration (checkpointing, retries, failover) is handled under the hood
* Admins get quotas, priorities, and utilization reports
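
To give a flavor of the researcher-facing UX this kind of stack aims for, here's a rough sketch using SkyPilot's CLI (which the project builds on); the cluster name and training script are made up:

    # Request 8 A100s from the pooled resources; if the local cluster is full,
    # SkyPilot can fall back to whichever cloud has free capacity.
    sky launch -c train-run --gpus A100:8 "python train.py"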

I'm curious how DevOps folks here handle ML training pipelines, and whether you've run into any of the challenges we've heard about.

If you're interested, please check out the repo (https://github.com/transformerlab/transformerlab-gpu-orchestration) or sign up for our beta (https://lab.cloud). Again, it's open source and easy to set up as a pilot alongside your existing SLURM installation. I appreciate your feedback.

https://redd.it/1nzrmqx
@r_devops
Building Dockerfiles in container jobs - GitLab CI, ADO, GitHub Actions

Most CI runners nowadays let us run pipeline jobs in containers, which is great because you don't need to manage software on the agent VM itself.

However, are there any established practices for building Dockerfiles when the job itself runs in a container? A few years ago the answer was Docker's docker-in-docker support. How does the landscape look now?
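
For context, the docker-in-docker pattern I'm referring to still looks roughly like this inside a GitLab CI job container, with daemonless builders like kaniko as the usual alternative (registry name is illustrative):

    # Classic dind: the job container talks to a sibling docker:dind service over TCP.
    export DOCKER_HOST=tcp://docker:2375
    docker build -t registry.example.com/app:latest .
    docker push registry.example.com/app:latest

    # Daemonless alternative: kaniko builds and pushes without a Docker daemon.
    /kaniko/executor --context . --dockerfile Dockerfile \
      --destination registry.example.com/app:latest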

https://redd.it/1nzpj9g
@r_devops
How to learn devops in 2025

Hello everyone! I’m new to DevOps and looking for the best ways to learn efficiently. I’d really appreciate any recommendations or resources!

https://redd.it/1nzsf9n
@r_devops
I open-sourced NimbusRun: autoscaling GitHub self-hosted runners on VMs (no Kubernetes)

TL;DR: If you run GitHub Actions on self-hosted VMs (AWS/GCP) and hate paying the “idle tax,” NimbusRun spins runners up on demand and scales back to zero when idle. It’s cloud-agnostic VM autoscaling designed for bursty CI, GPU/privileged builds, and teams who don’t want to run a k8s cluster just for CI. Azure not supported yet.

Repo: https://github.com/bourgeoisie-hacker/nimbus-run

# Why I built it

* Many teams don't have k8s (or don't want to run it for CI).
* Some jobs don't fit well in containers (GPU, privileged builds, custom drivers/NVMe).
* Always-on VMs are simple but expensive. I wanted scale-to-zero with plain VMs across clouds.
* It was a fun project :)

# What it does (short version)

* Watches your GitHub org/webhooks for `workflow_job` & `workflow_run` events.
* Brings up ephemeral VM runners in your cloud (AWS/GCP today), tags them to your runner group, and tears them down when done.
* Gives you metrics, logs, and a simple, YAML-driven config for multiple "action pools" (instance types, regions, subnets, disk, etc.).

# Show me setup (videos)

AWS setup (YouTube): https://youtu.be/n6u8J6iXBMw
GCP setup (YouTube): https://youtu.be/nwrBL12NqiE


# Quick glance: how it fits

1. Deploy the NimbusRun service (container or binary) where it can receive GitHub webhooks.
2. Configure your action pools (per cloud/region/instance type, disks, subnets, SGs, etc.).
3. Point your GitHub org webhook at NimbusRun for `workflow_job` & `workflow_run` events.
4. Run a workflow with your runner labels; watch VMs spin up, execute, and scale back down.

Example workflow:

    name: test
    on:
      push:
        branches:
          - master # or any branch you like
    jobs:
      test:
        runs-on:
          group: prod
          labels:
            - action-group=prod # required | same as group name
            - action-pool=pool-name-1 # required
        steps:
          - name: test
            run: echo "test"


# What it’s not

* Not tied to Kubernetes.
* Not vendor-locked to a single cloud (AWS/GCP today; Azure not yet supported).
* Not a billing black box: you can see the instances, images, and lifecycle.

# Looking for feedback on

* Must-have features before you'd adopt (spot/preemptible strategies, warm pools, GPU images, Windows, org-level quotas, etc.).
* Operational gotchas in your environment (networking, image hardening, token handling).
* Benchmarks that matter to you (cold-start SLOs, parallel burst counts, cost curves).

# Try it / kick the tires

* Repo: https://github.com/bourgeoisie-hacker/nimbus-run
* Follow one of the videos above (AWS/GCP).
* Open an issue if anything's rough; happy to iterate quickly on Day-0 feedback.

https://redd.it/1nzw6k9
@r_devops
"Infrastructure as code" apparently doesn't include laptop configuration

We automate everything. Kubernetes deployments, database migrations, CI/CD pipelines, monitoring, scaling. Everything is code.

Except laptop setup for new hires. That's still "download these 47 things manually and pray nothing conflicts."

New devops engineer started Monday. They're still configuring their local environment on Thursday. Docker, kubectl, terraform, AWS CLI, VPN clients, IDE plugins, SSH keys.

We can spin up entire cloud environments in minutes but can't ship a laptop that's ready to work immediately?

This feels like the most obvious automation target ever. Why are we treating laptop configuration like it's 2015 while everything else is fully automated?
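
For the record, the pieces aren't exotic. A minimal bootstrap sketch for a Debian-based laptop might look like this (tool list and versions are illustrative; pin your own):

    #!/usr/bin/env bash
    # New-hire laptop bootstrap (sketch): install the usual DevOps toolchain.
    set -euo pipefail

    sudo apt-get update
    sudo apt-get install -y docker.io awscli unzip curl

    # kubectl: fetch the latest stable release from the official endpoint.
    curl -fsSLo kubectl "https://dl.k8s.io/release/$(curl -fsSL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -m 0755 kubectl /usr/local/bin/kubectl

    # terraform: a pinned release zip from HashiCorp.
    curl -fsSLo tf.zip https://releases.hashicorp.com/terraform/1.9.5/terraform_1.9.5_linux_amd64.zip
    unzip -o tf.zip terraform && sudo install -m 0755 terraform /usr/local/bin/terraform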

https://redd.it/1nzxwbc
@r_devops
Just passed my CKA certification with a 66% score

The passing score is 66%, and I got a score of... 66%!

Honestly, this exam was way harder than people on Reddit make it out to be. After I finished, my first thought was that there was only a 50% chance I'd passed. I'd say it was a bit easier than killer.sh, but not by much; it had many challenging questions too. There was even a question about activating Linux kernel features, which I had no idea how to do. Luckily I found something in the Kubernetes documentation and copied what I read. For comparison, my killer.sh score was about 40%.
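
For anyone wondering what those kernel tasks look like, they usually boil down to loading a module and setting sysctls, along the lines of the container-runtime prerequisites in the docs (illustrative, not the actual exam question):

    # Load the bridge netfilter module now and on every boot.
    sudo modprobe br_netfilter
    echo br_netfilter | sudo tee /etc/modules-load.d/k8s.conf

    # Enable the sysctls kubeadm's preflight checks expect.
    cat <<'EOF' | sudo tee /etc/sysctl.d/k8s.conf
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    EOF
    sudo sysctl --system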

Good luck to anyone taking the exam; it's tougher than you'd expect!

https://redd.it/1nzyog5
@r_devops
If you're a DevOps consultant (or firm)

Hi all, I was about to make a move but thought I'd ask for some advice from consultants here first.
I run a vCISO firm and I'm trying to expand my partnership network for things like audit prep for security compliance. Is there a natural path for DevOps consultants in general to offer this to their clientele?


Is this a partnership that would make sense? They architect/build the infra; we secure it. I just don't want partnerships where they'd need to go out of their way to "sell"; I'd rather offer a no-brainer upsell.


I know I have early-stage clients who would need DevOps consultants, but I have no idea how it works the other way. Any insights here would be awesome. Thanks!

https://redd.it/1o02i89
@r_devops
Deployment responsibilities

How do you guys handle deployment responsibilities? In particular, security tooling. For example, our security team identifies what needs deploying (EDR agent updates, vuln scanners, etc.), but my platform team ends up owning all the operational work of rolling it out. Looking for examples of how other orgs divide this responsibility. If it helps, we're mostly a k8s shop, using Argo to manage our deployments.
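
For concreteness, one split we've considered (repo URL and names made up): the security team owns a manifests repo, and the platform team registers it once as an Argo CD application, so rollout follows whatever security merges:

    # Platform team registers an app tracking the security team's repo; after
    # this, rollout responsibility effectively shifts to whoever merges there.
    argocd app create edr-agent \
      --repo https://git.example.com/security/k8s-agents.git \
      --path edr \
      --dest-server https://kubernetes.default.svc \
      --dest-namespace security \
      --sync-policy automated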

Thanks!

https://redd.it/1o03n02
@r_devops
Stoplight is shutting down, what are the best alternatives?

Just saw that SmartBear is officially sunsetting Stoplight, and honestly, that's pretty disappointing. A lot of teams (mine included) used it for API design, testing, and documentation; it was clean, stable, and actually developer-friendly.

Now with Stoplight going away, I’m curious what everyone else is planning to switch to. I’ve been checking out a few alternatives, but still not sure which one really fits best.

Here are some tools I’ve seen mentioned so far: SwaggerHub, Insomnia, Redocly, Hoppscotch, Apidog, RapidAPI Studio, Apiary, Paw, Scalar, Documenso, OpenAPI.Tools

Has anyone tried migrating yet?

Which of these actually feels close to Stoplight in workflow and team collaboration?

Any good open-source or self-hosted options worth looking at?

For those who’ve already switched, what’s working and what’s not?

Would love to hear real experiences before committing to a new stack. Seems like everyone’s trying to figure this one out right now.

https://redd.it/1o05jr6
@r_devops
Built something, open for your valuable feedback and suggestions to improve

Hello guys,

I was working as an intern, had good networking, and met a lot of wonderful people. Since I always wanted to finish my allocated tasks before the deadline, I was constantly relying on LLMs and switching between multiple accounts whenever I hit a usage limit. I felt a gap: I'd try to learn the concepts after building, but there was an intellectual-privacy risk of leakage, plus a lot of hallucinations.

I've always liked Linux and The Rust Programming Language, so I wanted that kind of privacy for code. The idea is #Zero_Knowledge: secrets are redacted, and the code you send is abstracted with non-meaningful placeholders plus a local mapping. For example, openai_key: str | None = os.getenv("OPENAI_API_KEY") becomes variable_1: str | None = os.getenv(string_literal_1), and function names become tokens like <<FUNC_A3B4C5>>. For Python I came across Abstract Syntax Tree (AST) parsing and used that for the rewriting. This disrupts the LLM's pattern-matching engine, forcing it to focus only on the generic logic problem and preventing it from "guessing" the purpose of your code or hallucinating.

The LLM is prompted, with built-in line-by-line guidance, to return only the difference (a unified diff) for modified files, GitHub-style, which drastically cuts output tokens and reduces API costs. A project file tree and clear, standard Markdown code fences give the LLM the full context of a multi-file project, addressing the common problem of LLMs missing the big picture of a complex system.

Good existing tools like #Flake8, #Bandit, #ESLint, #tsc, and #Cargo run in parallel across multiple languages to check for syntax, security, and type issues. The final code is executed inside a resource-limited, network-disabled Docker sandbox to run tests (user-provided or generated placeholders). This catches runtime failures, and there's static concurrency analysis for complex Python code that flags potential lock-order deadlocks.

I've added support for local machines, with short setup instructions if you have a decent system. It works in Google Chrome; #Safari is blocking something and I'm working on it. The LLM gets an authoritative ROLE persona for a professional, security-conscious tone, and is made to commit to #Chain_of_Thought reasoning before generating code, which significantly improves fix quality and reduces hallucinations.

It's a BRING YOUR OWN KEY (#BYOK) model, so you use your favourite API and keep control. I limited the tiers only to keep my own bills down while running this. I'm building and improving this as one person, so reach out with all your feedback.



It's live! And it's #ZERO_PIRATE -> 0pirate



https://0pirate.com/

#developer #devtools

https://redd.it/1o05qd6
@r_devops
I pushed Python to 20,000 requests sent/second. Here's the code and kernel tuning I used.

I wanted to share a personal project exploring the limits of Python for high-throughput network I/O. My clients would always say "lol no python, only go", so I wanted to see what was actually possible.

After a lot of tuning, I managed to get a stable ~20,000 requests/second from a single client machine.

Here's 10 million requests submitted at once:

https://preview.redd.it/it065teb8ntf1.png?width=600&format=png&auto=webp&s=24c2f105fa6860f49b276983eb809f23b217bb8e

The code itself is based on asyncio and a library called rnet, which is a Python wrapper for the high-performance Rust library wreq. This lets me get the developer-friendly syntax of Python with the raw speed of Rust for the actual networking.

The most interesting part wasn't the code, but the OS tuning. The default kernel settings on Linux are nowhere near ready for this kind of load. The application would fail instantly without these changes.

Here are the most critical settings I had to change on both the client and server:

* Increased max file descriptors: every socket is a file, and the default limit of 1024 is the first thing you'll hit. `ulimit -n 65536`
* Expanded ephemeral port range: the client needs a large pool of ports to make outgoing connections from. `net.ipv4.ip_local_port_range = 1024 65535`
* Increased connection backlog: the server needs a bigger queue to hold incoming connections before they're accepted; the default is tiny. `net.core.somaxconn = 65535`
* Enabled TIME_WAIT reuse: this is huge. It lets the kernel quickly reuse sockets stuck in a TIME_WAIT state, which is essential when you're opening/closing thousands of connections per second. `net.ipv4.tcp_tw_reuse = 1`
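
Consolidated, the same tuning looks like this (apply on both client and server; the fd limit should also go in /etc/security/limits.conf to persist):

    # Per-shell file descriptor limit.
    ulimit -n 65536

    # Kernel settings from above, dropped into sysctl.d so they survive reboots.
    cat <<'EOF' | sudo tee /etc/sysctl.d/90-loadtest.conf
    net.ipv4.ip_local_port_range = 1024 65535
    net.core.somaxconn = 65535
    net.ipv4.tcp_tw_reuse = 1
    EOF
    sudo sysctl --system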

I've open-sourced the entire test setup, including the client code, a simple server, and the full tuning scripts for both machines. You can find it all here if you want to replicate it or just look at the code:

GitHub Repo: **https://github.com/lafftar/requestSpeedTest**

Blog Post (I go in a little more detail): **https://tjaycodes.com/pushing-python-to-20000-requests-second/**

On an 8-core machine, this setup hit ~15k req/s, and it scaled to ~20k req/s on a 32-core machine. Interestingly, the CPU was never fully maxed out, so the bottleneck likely lies somewhere else in the stack.

I'll be hanging out in the comments to answer any questions. Let me know what you think!

https://redd.it/1o08brn
@r_devops
Udemy $9 courses or Manning $50 (physical) books: which offer higher ROI for DevOps learners?

Say you want to learn Docker, Kubernetes, CI/CD, Prometheus, Grafana, the ELK stack, etc. Not just installing them, but actually learning to use them from a modern sysadmin POV.

Would you rather spend the money on Udemy or on Manning books (physical copies)?

I have pdfs of almost all books and never read pdfs. But I do read physical copies.

https://redd.it/1o08vlc
@r_devops
Migrating from Confluence to other alternatives

Similar to this post: https://www.reddit.com/r/devops/comments/10ksowi/alternative_to_atlassian_jira_and_confluence/

I am looking into migrating our existing confluence wiki to some other alternative.

As far as I understand, my main issue is that Confluence uses its own custom macro elements. I have also tried using Atlassian's Python API to export pages and attachments, but the output is in Confluence's XHTML storage format, not plain HTML.

So I will have to read the exported XHTML in Python and convert the macro elements into plain HTML elements so that it renders properly in the browser with the information intact.
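
For reference, pulling a page's storage-format body is a single REST call; the site, page ID, and credentials here are placeholders:

    # Export one page's storage-format (XHTML) body via the Confluence REST API.
    curl -s -u you@example.com:$API_TOKEN \
      "https://YOURSITE.atlassian.net/wiki/rest/api/content/PAGE_ID?expand=body.storage" \
      | jq -r '.body.storage.value' > page.xhtml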

Is there any other way to do this?

Is there another way to export the pages so that importing them into an alternative is actually easier?

https://redd.it/1o0a2l5
@r_devops
Thoughts on AI-SRE tools in the market

Hi Folks,

I've been reading and seeing a lot about the at least 20 AI-SRE tools that aim to either automate or completely replace SREs. My biggest problem here is that a LOT of this already exists in the form of automation. Correlating application alarms to infrastructure metrics, for instance, is trivial. On the other hand, in my experience, business logic bugs are very gnarly for AI to detect or suggest a fix for today. (A mistyped switch case, as demo'd by some AI-SRE tools, is not what I'd call a business logic bug.)

Production issues have always been snowflakes IME, and most of the automation is very trivial to set up if it isn't already present.

Interested in what folks think about the existing tooling. To name a few: bacca, rootly, sre, resolve, incident.

https://redd.it/1o089mj
@r_devops
SFTP to S3 Transfer Options

I have the following:

1. Access to the SFTP Server.
2. An S3 bucket configured.

Requirement: we want to transfer data from the SFTP server to the AWS S3 bucket on a periodic basis. I am torn between AWS Transfer Family and rclone. Please help me understand how each can be used and when to choose one over the other. I would really appreciate it.
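
For the rclone route, a sketch; the remote names and paths are placeholders you'd define with `rclone config`, and the cron schedule is just an example:

    # One-shot sync from the SFTP server to S3.
    rclone sync mysftp:/outbound s3:my-bucket/inbound --transfers 8 --log-level INFO

    # Hourly cron entry for the periodic requirement:
    # 0 * * * * /usr/local/bin/rclone sync mysftp:/outbound s3:my-bucket/inbound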


https://redd.it/1o0ff16
@r_devops
Trying to understand an SSL chain of trust...

Pardon my ignorance when it comes to certificate management, but I'm hoping someone might have clarity on a question I have.

I own a Java Spring Boot Kubernetes project living in AWS. We use a Java Alpine Docker container. Our web service calls an external application using SOAP requests, and all is working today.

What I'm trying to understand is how our calls are working over HTTPS (it uses basic username/password auth), because the target application has a GlobalSign public certificate, and when I run a java keytool command against our JRE cacerts file in my Kubernetes pod, I don't see any GlobalSign certs listed in it. I see some Entrust certs, AWS RDS certs, and my organization's internal certificates. Does Java just automatically trust outgoing connections to a public CA such as GlobalSign? Any thoughts? I just want to be sure this connection doesn't break in the future if this external platform ever renews its GlobalSign certificate.
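
Two checks that would narrow this down (hostname is a placeholder): list what the JRE actually trusts, then look at the chain the external service presents:

    # List the JRE's trusted roots (Java 9+ can address the default store directly;
    # the default cacerts password is "changeit").
    keytool -list -cacerts -storepass changeit | grep -i globalsign

    # Show the full chain the server sends, including which root it chains up to.
    openssl s_client -connect external.example.com:443 -showcerts </dev/null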

Thanks!

https://redd.it/1o0hn2u
@r_devops
Entra ID in DevOps workflows.

My last post was about IAM and DevOps. This inquiry is about IAM and DevOps as well, but in a slightly different context.

Microsoft Entra ID tends to be the most used IAM solution out there. It's so widely used that even shops with AWS as their primary cloud provider use Entra ID, largely because Office applications are everywhere. Do any of you work for companies that predominantly use AWS but use Entra ID for IAM? How does that work in DevOps? Is it just another tool for you to work with? Is it easy to integrate into your workflows, or is it a pain in the ass to manage?

https://redd.it/1o0gzg9
@r_devops
Alternatives for basic postman-ish things

I know Michael Douglas in the film Wall Street proudly said "greed is good", but at least $14 per month per user for Postman is... err... naughty.

I can see there are a few open-source alternatives, but I wonder, from a management/silent-delivery/DevOps perspective, are there ones to run to and ones to run from?

https://redd.it/1o0k1yq
@r_devops
The requirements went up. The foot-in-the-door goalpost has moved a lot. Share some advice, please? Adjust my thinking fallacies.

Hello dear /r/devops.

 

## The preface

I'm feeling something akin to sadness. The standards, complexity and oversaturation of the field have raised the barrier to unexpected levels. Or am I just setting expectations too high in my head? Please amend my thinking, which is as follows.


## Current situation
As you too know, entry is quite hard now. It was easier before, but I always planned to rely on the wow factor, which seems completely gone now. **What do I mean by this?**

My strategy as a beginner to the field consisted of being better than average but not phenomenal, having certs that the majority don't have, and just being interesting in general, with a lot of rare but not spectacular projects. This was more than was required of a junior. I didn't intend to get paid in the beginning either; I was fine with an internship, just to shadow, learn more, and fill my gaps. I was happy to just be there and contribute, and later become an actual junior on payroll.

 

For example, not very hard but rare, sought-after stuff in 2020 for a junior would be, at least from my perspective:

* Self-hosting your own GitLab instance,
* A fully set-up, working CI/CD pipeline for a project of yours (e.g. a web scraper),
* Network routing at a junior netadmin level (CCNA equiv): setting up IDS and IPS, p2p VPN, WireGuard,
* Sysadmin stuff, very in-depth Linux, such as:
  * Writing basic AppArmor rules and focusing on hardening, same for the kernel (mostly just automating things, setting them up, following written notes); not SELinux guru tier, just the normal level,
* And then also writing crappy but working code; that was the fantastic first foot in the door I mentioned above. *To not write crappy code you need conventions and experience, which you get as you work.*



## The outcomes?

This "portfolio", alongside a CCNA, one cloud cert of respectable tier (GCP/AWS/Azure), possibly something Linux-related (nice but not strictly needed), and a university diploma if you managed to get one in time (I did not), would get replies from interviewers and people seeking juniors such as:

 

"Very nice! Not shockingly rare or awe-level amazing, but really nice, good try, you know very broadly, respect". Good junior! We want you.

 

Basically, people would always be intrigued by the things I mentioned above, and would like the broad knowledge, the interest in embedded and electronics, the passion, and a ton of projects, often not directly related, such as writing my own drivers, embedded stuff, PCB design in KiCad, and some radio stuff (all side hobbies of mine).

 

## The reckoning

And then ML exploded. LLMs came. GPT came. AI came. Outsourcing came. Cheap workforce won out. Juniors became useless.

I shared some of the things I've done. It didn't intrigue anyone.

 

"I can teach that to a junior in a week" or "AI can be trained to do that for free".

 

I was always against gatekeeping. I always spread the knowledge. But it was hard to come by while I was learning the old-fashioned way. I learned this through years of reading manpages, experimenting, building my own homelab, wasting nights trying things out, talking on IRC and other places, asking people, sharing and exchanging knowledge, all while slaving away at another job, without the support of my family or anyone. I relied on myself.

 

And now I look at the field and realize I can't match it anymore. As much as I learn, it's never enough or impressive.

Remember back in the day when spinning up your own Docker containers was pretty cool? Like, oh wow man! Your own container. Really nice. VMs are EOL!

 

Now? I tried out some LLMs. There's no way I can match them. Sure, they make some mistakes that I fix. But the mistakes usually aren't noticed by me: I run the code, it shows mistakes, I fix the mistakes. It's all self