Manual SBOM validation is killing my team, what base images are you folks using?
Current vendor requires manual SBOM validation for every image update. My team spends 15+ hours weekly cross-referencing CVE feeds against their bloated Ubuntu derivatives. 200+ packages per image, half we don't even use.
Need something with signed SBOMs that work, daily rebuilds, and minimal attack surface. Tired of vendors promising enterprise security then dumping manual processes on us.
Considered Chainguard, but it became way too expensive for our scale. Heard of Minimus, but my team is sceptical.
What's working for you? Skip the marketing pitch please.
https://redd.it/1pk9rgg
@r_devops
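For what it's worth, the cross-referencing step itself is mechanical enough to script. A minimal sketch (the data shapes here are illustrative, not any vendor's API): take the package list parsed from a signed SBOM and intersect it with a CVE feed keyed by package name:

```python
def vulnerable_packages(sbom_packages: dict[str, str],
                        cve_feed: dict[str, set[str]]) -> dict[str, str]:
    """Return SBOM packages whose pinned version appears in the CVE feed.

    sbom_packages: {package name: installed version} parsed from the SBOM.
    cve_feed: {package name: set of affected versions} from your CVE source.
    """
    return {name: version
            for name, version in sbom_packages.items()
            if version in cve_feed.get(name, set())}

# Toy data: two of the three installed packages have an affected version.
sbom = {"openssl": "3.0.2", "zlib": "1.2.11", "curl": "8.5.0"}
feed = {"openssl": {"3.0.1", "3.0.2"}, "zlib": {"1.2.11"}}
print(vulnerable_packages(sbom, feed))  # {'openssl': '3.0.2', 'zlib': '1.2.11'}
```

The real work is keeping the feed fresh and mapping distro package names to CVE identifiers, but none of it needs to be a 15-hour-a-week manual job.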
Meta replaces SELinux with eBPF
SELinux was too slow for Meta so they replaced it with an eBPF based sandbox to safely run untrusted code.
bpfjailer handles things legacy MACs struggle with, like signed binary enforcement and deep protocol interception, without waiting for upstream kernel patches and without measurable performance regressions across any workload or host type.
Full presentation here: https://lpc.events/event/19/contributions/2159/attachments/1833/3929/BpfJailer%20LPC%202025.pdf
https://redd.it/1pkhl58
@r_devops
I didn't like that cloud certificate practice exams cost money, so I built some free ones
https://exam-prep-6e334.web.app/
https://redd.it/1pklpt2
@r_devops
Help troubleshooting Skopeo copy to GCP Artifact Registry
I wrote a small script that copies a list of public images to a private Artifact Registry account. I used skopeo and everything works on my local machine, but it won't when run in the pipeline.
The error I see is reported below, and it seems to be related to the permissions of the service account used for skopeo, but it is an artifactRegistry.admin...
time="2025-12-11T17:06:12Z" level=fatal msg="copying system image from manifest list: trying to reuse blob sha256:507427cecf82db8f5dc403dcb4802d090c9044954fae6f3622917a5ff1086238 at destination: checking whether a blob sha256:507427cecf82db8f5dc403dcb4802d090c9044954fae6f3622917a5ff1086238 exists in europe-west8-docker.pkg.dev/myregistry/bitnamilegacy/cert-manager: authentication required"
https://redd.it/1pkmv3j
@r_devops
EKS CI/CD security gates, too many false positives?
We've been trying this security gate in our EKS pipelines. It looks solid, but it's not. A webhook pushes risk scores and critical findings into PRs, and if certain IAM or S3 issues pop up, merges get blocked automatically. The problem is that medium-severity false positives keep breaking dev PRs: old dependencies in non-prod namespaces constantly trip the gate. Custom Node.js policies help a bit, but tuning thresholds across prod, stage, and dev for five accounts is a nightmare. It feels like the tool slows devs down more than it protects production.
Anyone here running EKS deploy gates? How do you cut the noise? Ideally, you only block criticals for assets that are actually exposed. Scripts or templates for multi-account policy inheritance would be amazing. Right now we poll /api/v1/scans after a Helm dry-run. It works, but it's clunky. It feels like we're bending CI/CD pipelines to fit the tool rather than the other way around. Any better approaches or tools that handle EKS pipelines cleanly?
https://redd.it/1pko996
@r_devops
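On the "only block criticals for assets that are actually exposed" idea: the gate decision itself is small once you have exposure data — the hard part is keeping that data accurate. A sketch using a hypothetical finding schema (severity plus asset id; nothing tool-specific):

```python
def should_block(findings: list[dict], exposed_assets: set[str]) -> bool:
    """Block the merge only when a critical finding touches an exposed asset.

    findings: list of {"severity": ..., "asset": ...} dicts from the scanner.
    exposed_assets: asset ids reachable from outside (internet-facing, etc.).
    """
    return any(f["severity"] == "critical" and f["asset"] in exposed_assets
               for f in findings)

findings = [
    {"severity": "critical", "asset": "s3://internal-logs"},  # critical, not exposed
    {"severity": "medium",   "asset": "api-gateway-prod"},    # exposed, not critical
]
print(should_block(findings, {"api-gateway-prod"}))  # False: nothing blocks the merge
```

Everything below that bar can still be reported in the PR as advisory noise-free context rather than a hard gate; that keeps non-prod namespaces from breaking dev PRs while still surfacing the findings.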
The agents I built are now someone else's problem
Two months since I left and I still get random anxiety about systems I don't own anymore.
Did I ever actually document why that endpoint needs a retry with a 3-second sleep? Or did I just leave a comment that says "don't touch this"? Pretty sure it was the comment.
Knowledge transfer was two weeks. The guy taking over seemed smart but had never worked with agents. I walked him through everything I could remember, but so much context just lives in your head: why certain prompts are phrased weird, which integrations fail silently, that one thing that breaks on Tuesdays for reasons I never figured out.
He messaged me once the first week asking about a config file, and then nothing since. Either everything is fine, or he's rebuilt it all, or it's on fire and nobody told me. I keep checking their status page like a psycho.
I know some of that code is bad. I know the docs have gaps. I know there are at least two hardcoded things I kept meaning to fix. That's all someone else's problem now and I can't do anything about it.
Does this feeling go away, or do you just collect ghosts from every job?
https://redd.it/1pkrsm5
@r_devops
Buildstash - Platform to organize, share, and distribute software binaries
We just launched a tool I'm working on called Buildstash. It's a platform for managing and sharing software binaries.
I'd worked across game dev, mobile apps, and agencies - and found every team had no real system for managing their built binaries. Often just dumped in a shared folder (if someone remembered!)
No proper system for versioning, keeping track of who'd signed off what when, or what exact build had gone to a client, etc.
Existing tools for managing build artifacts are really more focused on package repository management, and miss all the other types of software that aren't deployed that way.
That's the gap we'd seen and looked to solve with Buildstash. It's for organizing and distributing software binaries targeting any and all platforms, however they're deployed.
And we've really focused on the UX and making sure it's super easy to get set up - integrating with CI/CD or catching local builds - so it's accessible to teams of all sizes.
For mobile apps, it'll handle integrated beta distribution. For games, it has no problem with massive binaries targeting PC, consoles, or XR. Embedded teams who are keeping track of binaries across firmware, apps, and tools are also a great fit.
We launched open sign-up for the product on Monday, then shipped another feature every day this week -
Today we launched Portals - a custom-branded space you can host on your website, and publish releases or entire build streams to your users. Think GitHub Releases but way more powerful. Or think of any time you've seen a custom-built interface on a developer's website for finding past builds by platform, browsing nightlies, viewing releases, etc. - Buildstash Portals can do all that out of the box for you, customizable in a few minutes.
So that's the idea! I'd really love feedback from this community on what we've built so far / what you think we should focus on next?
- Here's a demo video - https://youtu.be/t4Fr6M_vIIc
- landing - https://buildstash.com
- and our GitHub - https://github.com/buildstash
https://redd.it/1pkslis
@r_devops
Is the promise of "AI-driven" incident management just marketing hype for DevOps teams?
We are constantly evaluating new platforms to streamline our on-call workflow and reduce alert fatigue. Tools that promise AI-driven incident management and full automation are everywhere now, like MonsterOps and similar providers.
I’m skeptical about whether these AIOps platforms truly deliver significant value for a team that already has well-defined runbooks and decent observability. Does the cost, complexity, and setup time for full automation really pay off in drastically reducing Mean Time To Resolution compared to simply improving our manual processes?
Did the AI significantly speed up your incident response, or did it mainly just reduce the noise?
https://redd.it/1pku5b6
@r_devops
Serverless BI?
Have people worked with serverless BI yet, or is it still something you’ve only heard mentioned in passing? It has the potential to change how orgs approach analytics operations by removing the entire burden of tuning engines, managing clusters, and worrying about concurrency limits. The model scales automatically, giving data engineers a cleaner pipeline path, analysts fast access to insights, and ops teams far fewer moving parts to maintain. The real win is that sudden traffic bursts or dashboard surges no longer turn into operational fire drills because elasticity happens behind the scenes. Is this direction actually useful in your mind, or does it feel like another buzzword looking for a problem to solve?
https://redd.it/1pktuxa
@r_devops
Tracking dependencies across a large deployment pipeline
We have a large deployment environment where there are multiple custom tenants running different versions of code via release channels.
An issue we've had with these recent npm package vulnerabilities is that, while it's easy to track what is merged into main branch via SBOMs and tooling like socket.dev, snyk, etc., there is no easy way to view all dependencies across all deployed versions.
This is because there's such a large amount of data: 10-20 tags for each service, across ~100 services. And while each tag generally isn't running different dependencies, it becomes a pain to answer "Where, across all services, tenants, and release channels, is version 15.0.5 of next deployed?"
Has anyone dealt with this before? It seems like a big-data problem, and I'm not an expert at that. I can run custom SBOMs against those tags, but I quickly hit the GH API limits.
As I type this out: since not every tag will be a complete refactor (most won't be), consecutive tags will likely contain the same dependencies. So maybe for each new tag release, run git diff against the previous tag and only store the changes in a DB or something?
https://redd.it/1pkthr3
@r_devops
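The diff-per-tag idea at the end is sound and tiny to implement once each tag's dependencies are a {package: version} map (however you extract it — SBOM, lockfile, etc.). A sketch of storing only deltas and answering the "where is next 15.0.5 deployed" question (names here are illustrative):

```python
def diff_deps(prev: dict[str, str], curr: dict[str, str]) -> dict:
    """Store only what changed between two tags, not the full package list."""
    return {
        "added":   {p: v for p, v in curr.items() if p not in prev},
        "removed": {p: prev[p] for p in prev if p not in curr},
        "changed": {p: (prev[p], curr[p]) for p in prev
                    if p in curr and prev[p] != curr[p]},
    }

def where_deployed(index: dict, package: str, version: str) -> list:
    """index maps (service, tag) -> {package: version}; answer the audit query."""
    return sorted((svc, tag) for (svc, tag), deps in index.items()
                  if deps.get(package) == version)

index = {
    ("billing", "v1.2"): {"next": "15.0.5", "react": "18.2.0"},
    ("billing", "v1.3"): {"next": "15.0.6", "react": "18.2.0"},
    ("search",  "v2.0"): {"next": "15.0.5"},
}
print(where_deployed(index, "next", "15.0.5"))  # [('billing', 'v1.2'), ('search', 'v2.0')]
print(diff_deps(index[("billing", "v1.2")], index[("billing", "v1.3")]))
```

The full per-tag map can always be reconstructed by replaying deltas from a baseline, so storage stays proportional to actual churn rather than tags × services; the query-side index can live in any DB that can do an exact-match lookup on (package, version).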
How do approval flows feel in feature flag tools?
On paper they sound great: they check the compliance and accountability boxes. But in practice I've seen them slow things down, turn into bottlenecks, or just get ignored.
For anyone using LaunchDarkly / Unleash / GrowthBook etc.: do approvals for feature flag changes actually help you? Who ends up approving things in real life? Do they make things safer or just more annoying?
https://redd.it/1pkt83z
@r_devops
Hyper-Volumetric DDoS: The 6,500 Daily Attacks Overwhelming Modern Infrastructure 🌊
https://instatunnel.my/blog/hyper-volumetric-ddos-the-6500-daily-attacks-overwhelming-modern-infrastructure
https://redd.it/1pkrneh
@r_devops
Proxy solution for maven, node.js and oci
We use https://reposilite.com as a proxy for maven artifacts and https://www.verdaccio.org for node.js.
Before we pick yet another piece of software as a proxy for OCI artifacts (images, Helm charts), we were wondering whether there's a solution (paid or free) that supports all of the mentioned types.
Anybody got a hint?
https://redd.it/1pl16nk
@r_devops
Self-hosted WandB
We really like using WandB at my company, but we want to deploy it in a CMMC environment, and they have no support for that. Has anyone here self-hosted it using their operator? My experience is that the operator has tons of support but not much flexibility, and given our very specific requirements for data storage and ingress, it doesn't work for us. Does anyone have a working example using a custom Ingress Controller and maybe Keycloak for user management?
https://redd.it/1pl2uha
@r_devops
How much time should seniors spend on reviews? Trying to save time on manual code reviews
Our seniors are spending like half their time reviewing PRs, and everyone's frustrated. Seniors feel like they're not coding anymore, juniors are waiting days for feedback, and leadership is asking why everything takes so long.
I know code review is important and seniors should be involved, but this seems excessive. We have about 8 seniors and 20 mid/junior engineers, and everyone's doing PRs constantly. Seniors get tagged on basically everything because they know the systems best.
Trying to figure out what's reasonable here. Should seniors be spending 20 hours a week on reviews? 10? Less? And how do you actually reduce it without quality going to shit? We tried having seniors only review certain areas, but then knowledge silos got worse.
https://redd.it/1pl6jj7
@r_devops
An open-source, realistic exam simulator for CKAD, CKA, and CKS featuring timed sessions and hands-on labs with pre-configured clusters
[https://github.com/sailor-sh/CK-X](https://github.com/sailor-sh/CK-X) - found a really neat thing
* open-source
* designed for **CKA / CKAD / CKS** prep
* **hands-on labs**, not quizzes
* built around **real k8s clusters** you interact with using `kubectl`
* capable of **timed sessions**, to mimic exam pressure
https://redd.it/1pl6rau
@r_devops
How in tf are you all handling 'vibe-coders'?
This is somewhere between a rant and an actual inquiry, but how is your org currently handling the 'AI' frenzy that has permeated every aspect of our jobs? I'll preface this by saying, sure, LLMs have some potential use-cases and can sometimes do cool things, but it seems like plenty of companies, mine included, are touting it as the solution to all of the world's problems.
I get it, if you talk up AI you can convince people to buy your product and you can justify laying off X% of your workforce, but my company is also pitching it like this internally. What is the result of that? Well, it has evolved into non-engineers from every department in the org deciding that they are experts in software development, cloud architecture, picking the font in the docs I write, you know...everything! It has also resulted in these employees cranking out AI-slop code on a weekly basis and expecting us to just put it into production--even though no one has any idea of what the code is doing or accessing. Unfortunately, the highest levels of the org seem to be encouraging this, willfully ignoring the advice from those of us who are responsible for maintaining security and infrastructure integrity.
Are you all experiencing this too? Any advice on how to deal with it? Should I just lean into it and vibe-lawyer or vibe-c-suite? I'd rather not jump ship as the pay is good, but, damn, this is quickly becoming extremely frustrating.
*long exhale*
https://redd.it/1pl96e8
@r_devops
how long until someone runs prod from chrome?
scrolling reddit, I saw something… unsettling
https://labs.leaningtech.com/blog/browserpod-beta-announcement.html
It’s a tool that lets you run node, py, and other runtimes directly in the browser
a little more of this, and we’ll genuinely be running k8s nodes - or something very kuber-adjacent - inside the browser itself
https://redd.it/1pl8am5
@r_devops
GitHub Secret Leaks: The 13 Million API Credentials Sitting in Public Repos 🔐
https://instatunnel.my/blog/github-secret-leaks-the-13-million-api-credentials-sitting-in-public-repos
https://redd.it/1plbs5l
@r_devops
What a Fintech Platform Team Taught Me About Crossplane, Terraform and the Cost of “Building It Yourself”
I recently spoke with a platform architect at a fintech company in Northern Europe.
They’ve been building their internal platform for about three years. Today, they manage **50-60 Kubernetes clusters in production**, usually **2-3 clusters per customer**, across multiple clouds (Azure today, AWS rolling out), with strong isolation requirements because of banking and compliance constraints.
I recently spoke with a platform architect at a fintech company in Northern Europe.
They’ve been building their internal platform for about three years. Today, they manage **50-60 Kubernetes clusters in production**, usually **2-3 clusters per customer**, across multiple clouds (Azure today, AWS rolling out), with strong isolation requirements because of banking and compliance constraints.
Platform Engineering Tips is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
In other words: not a toy platform.
What they shared resonated with a lot of things I see elsewhere, so I’ll summarize it here in an anonymized way. If you’re in DevOps / platform engineering, you’ll probably recognize parts of your own world in this.
# Their Reality: A Platform Team at Scale
The platform team is around **7 people** and they own two big areas:
**Cloud infrastructure automation & standardization**
* Multi-account, multi-cluster setup
* Landing zones
* Compliance, security, DR tests, audits
* Cluster lifecycle, upgrades, observability
**Application infrastructure**
* Opinionated way to build and run apps
* Workflow orchestration running on Kubernetes
* Standardized “packages” that include everything an app needs: cluster, storage, secrets, networking, managed services (DBs, key vault, etc.)
Their goal is simple to describe, hard to execute:
>“Our goal is to do this at scale in a way that’s easy for us to operate, and then gradually put tools in the hands of other teams so they don’t depend on us.”
Classic platform mandate.
# Terraform Hit Its Limits
They started with Terraform, like many teams do. It worked… until it didn’t. This is what they hit:
**State problems at scale**
* Name changes and refactors causing subtle side effects
* Surprises when applies suddenly behave differently
**Complexity**
* Multiple pipelines for infra vs app
* Separate workflows for clusters, cloud resources, K8s resources
**Drift and visibility**
* Keeping Terraform state aligned with reality became painful
* Not a good fit when you want continuous reconciliation
Their conclusion:
>“We pushed Terraform to its limits for this use case. It wasn’t designed to orchestrate everything at this scale.”
That’s not Terraform-bashing. Terraform is great at what it does. But once you try to use it as **the control plane of your platform**, it starts to crack.
# Moving to a Kubernetes-Native Control Plane
So they moved to a **Kubernetes-native model**.
Roughly:
* **Crossplane** for cloud resources
* **Helm** for packaging
* **Argo CD** for GitOps and reconciliation
* A **hub control plane** managing all environments centrally
* Some custom controllers on top
Everything (clusters, databases, storage, secrets, etc.) is now represented as **Kubernetes resources**.
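To make that concrete, here is a minimal sketch of what a managed database looks like under this model: a Crossplane claim, sitting in the same API as Deployments and Services. The resource kind, API group, and parameters below are hypothetical — the article doesn’t show the team’s actual schemas, which depend on the XRDs they define.

```yaml
# Hypothetical Crossplane claim: kind, group, and parameters are
# illustrative, not this team's actual schema.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 50
    version: "15"
  compositionSelector:
    matchLabels:
      provider: azure          # same claim could match an AWS composition
  writeConnectionSecretToRef:
    name: orders-db-conn       # app reads credentials from this Secret
```

The payoff is that the database lives in the same GitOps flow as the application manifests, so drift is handled by continuous reconciliation rather than periodic `terraform plan` runs.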
Key benefit:
>*“We stopped thinking ‘this is cloud infra’ vs ‘this is app infra’.*
*For us, an environment now is the whole thing: cluster + cloud resources + app resources in one package.”*
So instead of “first run this Terraform stack, then another pipeline for K8s, then something else for app config”, they think in **full environment units**. That’s a big mental shift.
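The “full environment unit” idea can be sketched with a single Argo CD Application syncing one Git path that holds the cluster claims, cloud resources, and app manifests together. The `Application` API here is real Argo CD; the repository URL, project, and path are made up for illustration.

```yaml
# Sketch: one Argo CD Application = one whole environment
# (cluster + cloud resources + app resources in a single Git tree).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: customer-a-prod
  namespace: argocd
spec:
  project: customers
  source:
    repoURL: https://git.example.com/platform/environments.git
    targetRevision: main
    path: customer-a/prod      # Crossplane claims and app manifests together
  destination:
    server: https://kubernetes.default.svc
    namespace: customer-a
  syncPolicy:
    automated:
      prune: true              # delete resources removed from Git
      selfHeal: true           # revert out-of-band changes (drift)
```

One sync loop then covers what previously took several disconnected pipelines.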
# UI vs GitOps vs CLI: Different Teams, Different Needs
One thing that came out strongly:
* Some teams **don’t want to touch infra at all**. They just want: *“Here’s my code, please run it.”*
* Some teams are **comfortable going deep into Kubernetes and YAML**.
* Others want a **simple UI** to toggle capabilities (e.g. “enable logging for this environment”).
So they’re building **multiple abstraction layers**:
* **GitOps interface** as the “middle layer” (already established)
* A **CLI** for teams comfortable with infra
* Experiments with **UI portals** on top of their control plane
They experimented with tools like **Backstage**, using them as thin UIs on top of their existing orchestration:
>*“We built a lot of the UI in a portal by connecting it to
our control plane and CRDs. You go to an environment and say ‘enable logging’, it runs the GitOps changes in the background.”*
Because they already have the orchestration layer (Crossplane + Argo CD + custom controllers), portals can stay “just portals”: UI on top of an existing engine.
This is important: a portal *without* a strong control plane becomes just a dashboard. A portal *with* a strong control plane becomes a real self-service platform.
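As a purely illustrative sketch of what “the portal runs GitOps changes in the background” could look like: a custom resource that the portal edits via a Git commit and a custom controller reconciles. Kind, group, and every field below are invented; the article doesn’t show the team’s actual CRDs.

```yaml
# Hypothetical Environment custom resource behind the portal's toggles.
# All names and fields are invented for illustration.
apiVersion: platform.example.org/v1alpha1
kind: Environment
metadata:
  name: customer-a-prod
spec:
  cluster:
    region: westeurope
    nodeCount: 3
  features:
    logging: true       # the "enable logging" toggle from the UI
    monitoring: true
    backup: false
```

Clicking “enable logging” then becomes a one-line YAML change committed to Git; Argo CD syncs it and a controller installs the logging stack. The portal stays a thin UI because the engine already exists.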
# The Real Challenges Are Not (Only) Technical
The interesting part of the conversation wasn’t “we use Crossplane” or “we use GitOps”. That’s expected. The harder problems they described were:
# 1. Different maturity levels across teams
* Some teams want full control over infra
* Some don’t care and just want things to “work”
* Some like GitOps, others are allergic to it
>*“It’s very hard to build a single solution that makes everyone happy.*
*You end up making trade-offs and accepting you won’t please all teams.”*
Hence the multi-layer approach.
# 2. Doing this with a small team
Even with 7 people, running:
* 50-60 clusters
* strict isolation per customer
* multi-cloud
* compliance, security, DR tests
* audits
…is hard.
>*“We want to automate as much as possible. Manual operations at this scale just don’t work.”*
This is where the real cost of “build it yourself” shows up. Even a very strong team ends up spending a lot of time on **operations and glue**, not on differentiating features.
# 3. Third-Party Tools vs Banking Compliance
They tried to adopt third-party tools for observability (Datadog, Sumo Logic, etc.). Technically, this made sense. Organizationally, it became painful.
* Every external SaaS triggered **risk assessment** on the customer side
* Technical teams were fine
* Legal and risk teams often said “no”
* Out of several customers, **only a few** accepted standardized third-party observability tools
The result:
* No consistent, standardized third-party layer
* More pressure to build and operate internally
If you’re in a regulated environment, this probably sounds familiar.
# Build vs Buy: The Platform Engineer’s Dilemma
One thing I appreciated was how honest they were about the **trade-offs**. On one side, building your own platform means:
* you control everything
* you can shape it to your domain
* you avoid some vendor risks
On the other side:
* A 7-person platform team easily costs **\~900,000€/year** (or more), roughly 130,000€ per engineer fully loaded
Most of their time is not spent on “cool problems”. It’s spent on: upgrades, security and compliance obligations, DR testing, provider bugs, drift, documentation, keeping everything running.
As they said:
>*“Sometimes buying seems expensive, but people don’t account for the time cost. A lot of money is wasted in time spent building and maintaining everything.”*
And they’re right. The build vs buy decision is less about tools, more about **where you want your team’s energy to go**.
# What I Took Away From This Conversation
A few things I keep seeing across companies, and this call reinforced them:
1. **Terraform is fantastic, but not a silver bullet for platforms.** Using it as the main engine for a large-scale, multi-cluster, multi-tenant control plane is painful.
2. **Kubernetes-native control planes are powerful when you unify cloud infra + app infra.** Treating “an environment” as a single unit (cluster + cloud resources + app resources) is a big win.
3. **Teams need multiple interfaces.** CLI, GitOps, and UI all have their place. Different teams want different levels of abstraction.
4. **Platform teams underestimate how much they’ll have to build around UX, RBAC, audit, and self-service.** This is where a lot of hidden time goes.
5. **Regulated environments distort the tool landscape.** You can’t always just “adopt Datadog” or “plug in X SaaS”. Legal and risk vetoes matter as much as technical arguments.
6. **Build vs buy is not a one-time decision.** You might build a strong internal platform today and later decide to complement or replace parts of it with external platforms as constraints change.
# You’re Not the