Reddit DevOps – Telegram
Are there established, open-source Kubernetes sandbox environments that are pre-configured to implement specific DevOps design patterns and are easily extensible for experimenting with and integrating new or unfamiliar technologies?

I want to try out various things in my local WSL2 environment, so I'm looking for suggestions to save some time.

https://redd.it/1p4your
@r_devops
Traefik bug squashed

Anyone else been getting bugged out by Traefik?
Just spent a week having a horrible time getting sites online.
Epic fails.
Used BACKTICK PLACEHOLDER.
Ran sed after deploying.
All set.

https://redd.it/1p4zv3d
@r_devops
CICD System with Templating

The title says it all: I'm looking for a CI/CD system that will let a platform team create modules with sane inputs and behavior for development teams to then use freely. I see a lot of great tools out there like Woodpecker, Semaphore, and Gitness, but none seem to support such functionality aside from GitLab CI and Jenkins. Is there possibly a third gem out there that I'm not aware of? Later Drone versions let you do that with Starlark (a Python dialect), but that software has long been discontinued. Thank you in advance for your input.
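
For reference, the Drone/Starlark approach mentioned above looked roughly like this: the platform team ships a function, and dev teams call it with a few inputs. A hedged sketch (Starlark is close enough to Python; the pipeline fields and names are illustrative):

```python
# .drone.star - hedged sketch of a platform-team module in Drone's Starlark.
# Dev teams only touch the main() call; the module owns the shape.

def main(ctx):
    return build_pipeline(name="api", go_version="1.22")

def build_pipeline(name, go_version):
    return {
        "kind": "pipeline",
        "type": "docker",
        "name": name,
        "steps": [
            {
                "name": "test",
                "image": "golang:%s" % go_version,
                "commands": ["go test ./..."],
            },
        ],
    }
```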

https://redd.it/1p51eu6
@r_devops
Spark UI is painful for debugging, anyone else feel this?

I love Spark, but the Web UI drives me crazy. Debugging failing jobs or figuring out why certain stages are slow takes forever. The UI shows logs and stages, but you cannot easily connect a stage failure to the exact task or code that caused it. You end up hunting through logs for minutes while the job keeps running.

It would be amazing to have a UI that highlights failing tasks, shows which stage is the bottleneck, and lets you jump straight from an alert to the exact part of the plan or code. Something like stage-level metrics combined with error pointers.

Right now I just stare at the UI spinning and think there has to be a better way. I want to see what others do when they get stuck in this mess, or even just commiserate with someone who has fought the same battle.
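
Until a better UI exists, one stopgap is to pull stage-level metrics programmatically instead of refreshing the page. A minimal PySpark sketch using the status tracker (field names per pyspark.status; the polling loop is arbitrary):

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-watch").getOrCreate()
tracker = spark.sparkContext.statusTracker()

def report_stages():
    # Print per-stage task counts for everything currently running.
    for stage_id in tracker.getActiveStageIds():
        info = tracker.getStageInfo(stage_id)
        if info is None:
            continue
        print(f"stage {stage_id} ({info.name}): "
              f"{info.numCompletedTasks}/{info.numTasks} done, "
              f"{info.numActiveTasks} running, {info.numFailedTasks} failed")

# Poll a few times while an action runs in another thread/cell.
for _ in range(5):
    report_stages()
    time.sleep(10)
```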


https://redd.it/1p568fb
@r_devops
Need advice on implementing CI/CD

Hey, I work at a SaaS company with many teams. I joined recently and noticed that there is no CI/CD process in place. I decided to automate the workflow, but I learned that the QA team is doing something similar to CI/CD, although not using Jenkins. We also have our own build tool based on Ant, as well as our own deployment tool. We typically trigger only 3–4 builds per day. I want to implement a proper CI/CD pipeline here. QA testing happens after the build is deployed to the test servers, and we also have a code check process that enforces certain company-specific rules.
How can I implement CI/CD in this environment? Any ideas?
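
One low-risk way to start, given the existing Ant build and in-house deploy tool, is to wrap them as ordered stages under any CI server and fail fast. A hedged sketch; `code-check` and `deploy-tool` are placeholders for your internal tools, and `ant build` assumes a target of that name:

```python
import subprocess

# Ordered pipeline stages wrapping the tools that already exist.
STAGES = [
    ("code check", ["code-check", "--rules", "company-rules.xml"]),
    ("build",      ["ant", "build"]),
    ("deploy",     ["deploy-tool", "--env", "test"]),
    # QA's existing suite could slot in here as a fourth stage.
]

def run_pipeline():
    for name, cmd in STAGES:
        print(f"--- stage: {name} ---")
        subprocess.run(cmd, check=True)  # raise and stop on first failure

if __name__ == "__main__":
    run_pipeline()
```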

https://redd.it/1p56tkp
@r_devops
Tako AI v1.5 - Your Okta AI sidekick

We just released **Tako AI v1.5** – an open-source agent for managing Okta environments that actually writes, tests, and fixes its own code.

**How it works:**

* Reads Okta API docs + your DB schema before writing any code
* Generates Python/SQL scripts and runs them in a secure sandbox
* If it hits an error, it reads the stack trace and rewrites the code automatically
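
To make that loop concrete, here is a minimal sketch of a generate-run-fix cycle. This is not Tako's actual code: `llm_generate` stands in for whatever produces the script, and a plain subprocess stands in for the real sandbox:

```python
import subprocess
import sys
import tempfile

def run_with_self_correction(task, llm_generate, max_attempts=3):
    prompt = f"Write a Python script that: {task}"
    for _ in range(max_attempts):
        code = llm_generate(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        # Run the generated script in a subprocess (stand-in for a sandbox).
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return result.stdout          # success: hand back the output
        # Failure: feed the stack trace back to the model and retry.
        prompt = (f"This script failed:\n{code}\n"
                  f"Traceback:\n{result.stderr}\nFix it.")
    raise RuntimeError("still failing after retries")
```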

**Key features:**

* Runs on fast, cheap models (Gemini Flash, Haiku) without sacrificing accuracy
* Self-correction loop catches hallucinations
* Read-only by default, fully sandboxed, zero cloud dependencies
* Switches intelligently between local DB queries and live API calls

It's like having a junior engineer who reads the docs, tests their code, and fixes their own bugs—except it takes milliseconds instead of hours.

**GitHub:** [**https://github.com/fctr-id/okta-ai-agent**](https://github.com/fctr-id/okta-ai-agent)
**Blog:** [**https://iamse.blog/2025/11/23/tako-ai-v1-5-your-new-okta-ai-sidekick/**](https://iamse.blog/2025/11/23/tako-ai-v1-5-your-new-okta-ai-sidekick/)

Happy to answer questions about the architecture or self-healing logic.

https://redd.it/1p56wat
@r_devops
Specs for home build server

I would like to get some used machines for a build server to host my side projects at home. It will run Git and build Docker images using something like TeamCity. Would an i3-12100 with 8GB RAM be fine, or should I get an i5? What about those N100 mini PCs, or used SFF machines with something like an 8th-gen Intel CPU?

I was also thinking of a way to run multiple agents so that I can run builds in parallel.

https://redd.it/1p59dx9
@r_devops
serverless vs server for mobile app discussion

context: not-startup company (so they have funds) wants a POS-type mobile app with some offline functionality. it handles daily business operations, so mostly cross-module logic (inventory, checkout, etc.).

proposed solution: aws lambda functions

so, i am very new to the cloud (admittedly just through this specific job, cloud really isn't my main interest) and i am more of a seasoned/capable app developer/software engr (whatever you wanna call it). i am familiar with AWS services & their use cases. but for this specific context, as a dev, i think an ec2 server or maybe even ECS + fargate would work better than individual lambda functions. especially with cross-module logic, won't that require multiple lambdas talking to each other? (don't get me started on the debugging.) the strong points i see for lambda are the unpredictable workload (what if the company's clients don't use said mobile app, so you pay for unnecessary idle server time) and the cost. (but assuming this actually solves a problem for the company's clients, i don't see why they wouldn't use it.)

but basically i go server here because, well, i just like servers more, i guess. in terms of development, debugging, and QA, i just think using a server is cleaner for this scenario - basically managing the backend as a whole.

i'm trying to be as open as possible. so if there is a strong point in terms of management, development, debugging, workflow, cost & stuff, or anything that can convince a developer about lambda / serverless, please do share, because i'm having a hard time accepting it. i can adapt, no doubt, but i feel like i need more convincing to actually go "ah, i see why serverless is useful for this specific scenario..."

i've talked to chatgpt (YEAH AI) about this but i don't fully trust it because... it's AI. and the conversation i had with my co-worker wasn't very convincing either. so i guess i'm just hoping other seasoned developers who have used the cloud will share their thoughts.
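
on the idle-cost point, a back-of-envelope sketch in python. every number below is an assumption (check current AWS pricing), it just shows why low or spiky traffic favors lambda:

```python
# back-of-envelope comparison; all prices/volumes below are assumptions
REQS_PER_MONTH = 200_000            # a fairly quiet POS app
AVG_DURATION_S = 0.3                # seconds per invocation
MEMORY_GB      = 0.5
PRICE_PER_REQ  = 0.20 / 1_000_000   # lambda request price, $
PRICE_GB_S     = 0.0000166667       # lambda compute price per GB-second, $
SMALL_SERVER   = 15.00              # always-on instance, $/month

lambda_cost = REQS_PER_MONTH * (
    PRICE_PER_REQ + MEMORY_GB * AVG_DURATION_S * PRICE_GB_S
)
print(f"lambda ~ ${lambda_cost:.2f}/mo vs server ~ ${SMALL_SERVER:.2f}/mo")
# at this volume lambda wins by a lot; crank the traffic up roughly 30x
# and the always-on server starts winning instead
```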

please do correct me if i'm wrong, just don't be mean. (this is my first post, so please delete if i violate any of the rules - i mean that's exactly what's going to happen lol)

https://redd.it/1p5ad4k
@r_devops
Has anyone actually replaced Docker with WASM or other ‘next‑gen’ runtimes in production yet? Worth it or pure hype?

How many of you have pushed beyond experiments and are actually running WebAssembly or other ‘next‑gen’ runtimes in prod alongside or instead of containers?

What did you gain or regret after a few real releases, especially around cold starts, tooling, and debugging?

https://redd.it/1p5aomo
@r_devops
is generating Docker/Terraform/K8s configs still a huge pain for you?

I'm trying to confirm whether this is an actual problem or if I'm imagining it.

For anyone working with infrastructure:
When you need Docker Compose files, Kubernetes YAML, or Terraform configs, what’s the part that slows you down or annoys you the most?

A few things I’m curious about:
• Do you manually write these files every time?
• Do you reuse templates?
• Do you rely on AI, or does it make mistakes that cost you time?
• What’s the worst part of translating a simple description into working config files?
• What would a perfect solution look like for you?

Not building anything yet. Just researching whether this pain point is common before I commit to making a tool. Any specifics from your experience would help a lot.
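
On the template-reuse angle, a minimal sketch of the approach many teams land on: parameterized templates rather than hand-writing each file. Jinja2 here; the service fields are purely illustrative:

```python
from jinja2 import Template

# Tiny parameterized docker-compose generator; fields are illustrative.
COMPOSE_TMPL = Template("""\
services:
  {{ name }}:
    image: {{ image }}
    ports:
      - "{{ port }}:{{ port }}"
    environment:
      - ENV={{ env }}
""")

print(COMPOSE_TMPL.render(name="api", image="ghcr.io/acme/api:1.2.3",
                          port=8080, env="staging"))
```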

https://redd.it/1p5c5j8
@r_devops
Trying to get on the MLOps wave, what would transitioning into it look like?

Hi all, I am working as a DevOps engineer and want to transition into MLOps and jump on the AI wave while it's hot. I want to leverage it for a higher salary, better benefits, etc. I am wondering how to go about it: what should I learn? Should I start with the theory and learn machine learning, or jump straight in and use n8n and Claude to do actual stuff? Are there any courses that are worthwhile?

https://redd.it/1p5ce9j
@r_devops
i need help, always drowning in Spark logs

I swear every time I open a Spark job it is like opening a firehose of data. Logs, metrics, execution plans sometimes reach 2GB for a single run. You dig through it thinking you will find the culprit but it is just endless noise.

We tried tracking down slow stages and memory issues. Turns out maybe 5% of the data is actually useful. The rest is just redundant metrics, debug lines, and execution steps that do not lead anywhere.

The Spark UI is not much better. Loading large plans can take 5 to 10 mins. You sit there staring at the screen wondering if it is going to give you anything at all.
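
Not a fix for the UI, but one blunt lever that cuts the log firehose at the source is raising the log level. A one-liner sketch in PySpark (finer-grained filtering belongs in log4j2.properties):

```python
from pyspark.sql import SparkSession

# Drop INFO/DEBUG chatter from driver and executors; WARN and above remain.
spark = SparkSession.builder.appName("quieter").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # also accepts ERROR, INFO, DEBUG, ...
```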

https://redd.it/1p5cfiw
@r_devops
Migrating from CodeCommit to GitHub. How to convince internal stakeholders

CodeCommit is on the chopping block (AWS already stopped onboarding new customers). It might not disappear in the next month, or even in the next year, but I do not feel that it has a long time left before further deprecation.

The company I work at -- like many others -- is deeply embedded in the AWS ecosystem, and the current feeling is "if it's not broke, don't fix it." Aside from my personal gripes with CodeCommit, I feel that for the sake of longevity it is important that my company switches over to another git provider, more specifically GitHub.

One of my tasks for the next quarter is to work on standardizing internal operations and future-proofing my team, and I would love to start discussions on migrating from CodeCommit over to GitHub.

The issue at this point is making the case for doing it now rather than waiting for CodeCommit to be fully decommissioned. From what I have gathered, the relevant stakeholders are primarily concerned about the following:

* We already use AWS for everything else, so it would break our CI/CD pipelines
* All of our authorization/credentials are AWS-based, so GitHub would not be compatible and would require different access provisioning
* We use Jira for project management, and it is already configured in AWS
* It is not as secure as AWS for storing our code
* ... various other considerations like these

I will admit that I am not too familiar with the security side of things; however, I do know that most of these are not actual roadblocks. We can integrate Jira, we can configure OIDC/IAM support for GitHub Actions and securely run our CI/CD inside our AWS ecosystem, etc.
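
For the credentials objection specifically, the mechanism is worth showing: a GitHub Actions job can trade its OIDC token for short-lived AWS credentials via STS, so no static AWS keys ever live in GitHub. A hedged Python sketch of that exchange (in a real workflow the aws-actions/configure-aws-credentials action does this for you; the role ARN is made up, and the two ACTIONS_* env vars only exist in a job with `id-token: write`):

```python
import os

import boto3
import requests

ROLE_ARN = "arn:aws:iam::123456789012:role/github-actions-deploy"  # made up

# Ask GitHub's OIDC endpoint for a short-lived identity token.
resp = requests.get(
    os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"],
    params={"audience": "sts.amazonaws.com"},
    headers={"Authorization": f"Bearer {os.environ['ACTIONS_ID_TOKEN_REQUEST_TOKEN']}"},
    timeout=10,
)
oidc_token = resp.json()["value"]

# Trade the GitHub-issued token for short-lived AWS credentials.
creds = boto3.client("sts").assume_role_with_web_identity(
    RoleArn=ROLE_ARN,
    RoleSessionName="gha-deploy",
    WebIdentityToken=oidc_token,
)["Credentials"]  # expires on its own; nothing static stored in GitHub
```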

So my question for the community is two-fold: (1) Have you or your organization dealt with this as well, and if so how did you migrate? (2) Does anyone have any better, more concrete ideas for how to sell this to internal stakeholders, both technical and non-technical?

Thank you all in advance!

https://redd.it/1p5d9eu
@r_devops
Best OpsGenie alternatives? sunset is forcing migration, 50-person eng team

been putting off dealing with the opsgenie sunset (April 2027) but leadership wants us to migrate Q1 next year so it's time to rip off the band-aid

running a 50-person engineering team, about 12-15 incidents per month, mostly during work hours with the occasional late night

current setup is opsgenie for on-call + Slack for comms + Confluence for post-mortems. It's not sexy but it works (most of the time). we've had some issues with schedules before and the wrong person being messaged.

looking for alternatives that won't require retraining everyone or months of setup. research so far puts it between pagerduty, incident.io or firehydrant but i need to do more digging and want to hear perspectives on here.

thanks

https://redd.it/1p5grfj
@r_devops
Looking for something to manage service accounts and AI agents

Our engineering team manages over 400 service accounts for CI/CD, Terraform, microservices, and databases. We also create hundreds of short-lived credentials weekly for AI model endpoints and data jobs. Vault plus spreadsheets no longer scales: rotation stays manual and audit logs live in different tools.

We need one system that gives service accounts short-lived tokens, hands AI agents scoped credentials that auto-expire, shows every non-human identity in the same dashboard as users, keeps full audit trails, and rotates secrets without breaking jobs. We are 80 people with a normal budget.

Teams that solved this already: share the platform you use, your current number of non-human identities, time from pilot to production, and real cost per month or per identity. This decides our business case this quarter. Thanks for direct answers.
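
Whatever platform you evaluate, the core mechanic described here is mint-and-expire rather than store-and-rotate. A minimal sketch of it with plain Vault, just to anchor the comparison (hvac; the policy name, TTL, and how the issuing service itself authenticates are all assumptions):

```python
import hvac

client = hvac.Client(url="https://vault.internal:8200")
client.token = "..."  # however your issuing service logs in (assumption)

# Scoped token for one AI agent: one policy, expires by itself,
# so there is nothing to rotate or revoke by hand.
agent_token = client.auth.token.create(
    policies=["ai-agent-readonly"],
    ttl="15m",
    renewable=False,
)["auth"]["client_token"]
```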

https://redd.it/1p5g3fs
@r_devops
Is "Self-Documenting Code" a lie we tell ourselves to avoid writing docs?

Honest question for this sub. I'm reviewing our team's velocity and I've noticed a recurring pattern: my Senior devs are spending about 20-30% of their week acting as "human documentation" for new hires or juniors.

We have the standard "read the code" culture, but the reality is that context is lost the moment a PR is merged. When someone touches that module 6 months later, they spend hours deciphering why things were done that way.

I'm trying to figure out if this is a tooling problem or a discipline problem.

How are you guys handling this at scale? Do you actually enforce documentation updates on every PR? Or have you found a way to automate the "boring part" of explaining function logic so Seniors can actually code?

Feels like we are burning expensive time on something that should be solved by now.
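
On the enforcement question, some teams wire a crude gate into CI: if source changed but no docs did, the PR fails. A hedged sketch (the src/ and docs/ layout and the origin/main base are assumptions):

```python
import subprocess
import sys

# List files changed on this branch relative to the base branch.
diff = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

src_changed = [p for p in diff if p.startswith("src/")]
docs_changed = [p for p in diff if p.startswith("docs/") or p.endswith(".md")]

if src_changed and not docs_changed:
    print("source changed but no docs updated:", *src_changed, sep="\n  ")
    sys.exit(1)  # fail the CI step, blocking the PR
```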

https://redd.it/1p5hzfx
@r_devops
Shai Hulud Launches Second Supply-Chain Attack (2025-11-24)

Came across this (quite frightening) information. Some infected npm packages are executing malicious code to steal credentials and other secrets from developer machines, then publishing them publicly on GitHub. Right now, thousands of new repos are being created to leak secrets. If you're using Node in your pipeline, you should take a look at this.

Link to the article: https://www.aikido.dev/blog/shai-hulud-strikes-again-hitting-zapier-ensdomains (not affiliated in any way with them)
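
For quick triage before a proper audit: the attack executes through npm lifecycle scripts, so listing installed packages that declare install hooks gives you a review list (and `npm ci --ignore-scripts` blocks the hooks outright). A rough sketch, not a substitute for checking your lockfiles against published indicator lists:

```python
import json
from pathlib import Path

HOOKS = {"preinstall", "install", "postinstall"}

# Flag every installed package that declares an npm lifecycle hook.
for pkg_json in Path("node_modules").rglob("package.json"):
    try:
        meta = json.loads(pkg_json.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue
    hooks = HOOKS & set(meta.get("scripts", {}))
    if hooks:
        print(f"{meta.get('name', pkg_json.parent.name)}: {sorted(hooks)}")
```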

https://redd.it/1p5ih1j
@r_devops
We surveyed 200 Platform Engineers at KubeCon


Disclaimer: I’m the CEO of Port (no promotional stuff).

During KubeCon Atlanta a few weeks ago, we ran a small survey at our booth (~200 responses) to get a pulse on what Platform Engineering teams are actually dealing with day-to-day. Figured this subreddit might find some of the patterns interesting.

https://info.getport.io/hubfs/State%20of%20KubeCon%20Atlanta%202025.pdf

https://redd.it/1p5i5zz
@r_devops
My laptop died and locked me out of my homelab. It was the best thing that ever happened to my project.

Hello r/devops,

This is my second time posting on this sub after this post (link) where I shared my project for automating an RKE2 cluster on Proxmox with Terraform and Ansible. I got some great feedback, and since then, I've integrated HashiCorp Vault. It's been a journey, and I wanted to share what I learned.

Initially, I just thought having an automated K8s cluster was cool. But I soon realized I needed different environments (dev, staging, prod) for testing, verification, and learning. This forced me into a bad habit: copying .env files, pasting them into temp folders, and managing a mess of variables. After a while, I got it working but was tired of it. The whole idea was automation, and the manual steps to set up the automation were defeating the purpose.

Then, my laptop died a week ago (don't ask me why; it just didn't boot anymore, something related to TPM hardware changes).

And with it, I lost everything: all my environment variables, the only SSH key I'd authorized on my VMs, and my kubeconfig file. I was completely locked out of my own cluster. I had to manually regenerate the cloud-init files, swap the SSH keys on the VM disks, and fetch all the configs again.

This was the breaking point. I decided to build something more robust that would solve both the "dead laptop" problem and the manual copy/paste problem.

My solution was HashiCorp Vault + GitHub Actions.

At first, I was just using Vault as a glorified password manager, a central place to store secrets. I was still manually copying from Vault and pasting into .env files. I realized I was being "kinda dumb" until I found the Vault CLI and learned what it could really do. That's when I got the idea: run the entire Terraform+Ansible workflow in GitHub Actions.

This opened a huge rabbit hole, and I learned a ton about JWT/OIDC authentication. Here's what my new pipeline looks like:

1. GitHub Actions Auth: I started by (badly) using the Vault root token. I quickly learned I could have GHA authenticate to Vault using OIDC. The runner gets a short-lived JWT from GitHub, presents it to Vault, and Vault verifies it. No static Vault tokens in my GHA repo. I just need a separate, one-time Terraform project to configure Vault to trust GitHub's OIDC provider.
2. Dynamic SSH Keys: Instead of baking my static admin SSH key into cloud-init, I now configure my VMs to trust my Vault's SSH CA public key. When a GHA job runs, it (see the sketch after this list):
   * Generates a brand new, fresh SSH keypair for that job.
   * Asks Vault (using its OIDC token) to sign the new public key.
   * Receives a short-lived SSH certificate back.
   * Uses that certificate to run Ansible. When the job is done, the key and cert are destroyed and useless.
3. kubectl Auth: I applied the same logic to kubectl. I found out Vault can also be an OIDC provider. I no longer have to SSH into the control plane to fetch the admin config. I just use the kubelogin plugin. It pops open a browser, I log into Vault, and kubectl gets a short-lived OIDC token. My K8s API server (which I configured to trust Vault) maps that token to an RBAC role (admin, developer, or viewer) and grants me the right permissions.
4. In-Cluster Secrets: Finally, external-secrets-operator. It authenticates to Vault using its own K8s ServiceAccount JWT (just like the GHA runner), pulls secrets, and creates/syncs native K8s Secret objects. My pods don't even know Vault exists.
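
To make steps 1 and 2 concrete, here is a condensed sketch of the same flow: hvac for the JWT login, plain HTTP for Vault's documented SSH-sign endpoint. The role names, the `ssh-client-signer` mount path, and the GHA_JWT variable are assumptions, not the exact setup above:

```python
import os

import hvac
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]

# 1. Trade the runner's short-lived GitHub OIDC JWT for a Vault token.
client = hvac.Client(url=VAULT_ADDR)
login = client.auth.jwt.jwt_login(role="gha-infra", jwt=os.environ["GHA_JWT"])
client.token = login["auth"]["client_token"]

# 2. Ask Vault's SSH CA to sign a public key generated fresh for this job;
#    the returned certificate expires on its own and dies with the job.
with open("id_ed25519.pub") as f:
    public_key = f.read()

resp = requests.post(
    f"{VAULT_ADDR}/v1/ssh-client-signer/sign/ansible",  # mount/role assumed
    headers={"X-Vault-Token": client.token},
    json={"public_key": public_key, "cert_type": "user"},
    timeout=10,
)
resp.raise_for_status()
with open("id_ed25519-cert.pub", "w") as f:
    f.write(resp.json()["data"]["signed_key"])  # hand this to ssh/Ansible
```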

With all of that, now if I want to add a node, I just change a JSON file that defines my VMs, commit it, and open a PR. GitHub Actions runs terraform plan and posts the output as a comment. If I like it, I merge.

A new pipeline kicks off, fetches all secrets from Vault, applies the Terraform changes, and then runs Ansible (using a dynamic SSH cert) to bootstrap K8s. The cluster is fully configured with all my

Domain monitoring tool - looking for feedback/advice!

Hi guys!

For the past few months I've been working on a little tool that routinely monitors the WHOIS/RDAP data, DNS records, and SSL status of domains. If any of these change, you'll get a little email immediately letting you know.

I would really appreciate feedback on any aspect of the project, whether that's the landing page, something inside the app itself, or anything else.

It doesn't have any ghastly AI features (nor does it need them!) and has only been worked on by myself, so I'm pretty eager for feedback.

You can find the project here: https://domainwarden.app
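
For anyone curious what such checks involve, here is a minimal sketch of the SSL-status part, pulling a certificate's expiry; the WHOIS/RDAP and DNS diffs would be separate lookups:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_expiry(host, port=443):
    # Open a TLS connection and read the validated peer certificate.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2025 GMT".
    return datetime.strptime(
        cert["notAfter"], "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)

print(cert_expiry("example.com"))
```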

Thank you so much for any feedback! I do appreciate it. :)

https://redd.it/1p5q7y3
@r_devops