Terraform's dependency on github.com - what are your thoughts?
Hi all,
About two weeks ago (December the 18th), github.com's reachability was affected by an issue on their side.
See -> https://www.githubstatus.com/incidents/xntfc1fz5rfb
We needed to do maintenance that very day. All of our Terraform providers were defined with the defaults ("go get it from GitHub"), and we didn't have any Terraform caching active.
We had to run some Terraform scripts multiple times and hope we wouldn't get a 500/503 from GitHub while downloading the providers. In the end we succeeded, but it took a lot more time than anticipated.
We have since worked on hosting all of our Terraform providers in a locally hosted location.
Some tuning in .terraformrc, some extras in our CI/CD pipeline for running Terraform.
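For anyone curious, a minimal sketch of this kind of setup (the mirror path and include patterns are illustrative, not the poster's exact config): pre-download the providers into a filesystem mirror and tell the Terraform CLI to use only that mirror:

# download the providers this configuration needs into a local mirror directory
terraform providers mirror /opt/terraform/provider-mirror

# tell the Terraform CLI to install providers only from that mirror
cat > ~/.terraformrc <<'EOF'
provider_installation {
  filesystem_mirror {
    path    = "/opt/terraform/provider-mirror"
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    exclude = ["registry.terraform.io/*/*"]
  }
}
EOF

In CI the same file can be supplied via the TF_CLI_CONFIG_FILE environment variable so every runner picks up the mirror.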
All together a nice project to put together: it forces you to think about which providers we are actually using, and which versions we exactly need.
But it also creates another technical corner in our infrastructure to maintain. For example, when we want to bump one of the provider versions, we need to perform additional tasks.
What are your thoughts about this? Some services are treated like the electricity and water of the internet. They are always there (GitHub / Docker Hub / Cloudflare) - until they are not, and recently we have noticed a lot of the latter.
One thought is that this doesn't happen that often; they have top-of-the-line infra and expertise.
It isn't worth doing this kind of workaround if you are not running infra for a hospital or a bank.
The other, more personal thought is that I like the disruptive nature of these incidents; they push you to think past the assumption that these tech building blocks are too big to fail.
And it plants the doubt that maybe it isn't so wise for everybody to stick to the same golden standards from the big seven in Silicon Valley.
Tell me!?
https://redd.it/1pzfe7e
@r_devops
Kubernetes concepts in 60 seconds
Trying an experiment: explaining Kubernetes concepts in under 60 seconds.
Would love feedback.
Check out the videos on YouTube
https://youtube.com/@soulmaniqbal?si=pZCVwXQizNQXFzv1
https://redd.it/1pzfsir
@r_devops
qa tests blocking deploys 6 times today, averaging 40min per run
our pipeline is killing productivity. we've got this selenium test suite with about 650 tests that runs on every pr and it's become everyone's least favorite part of the day.
takes 40 minutes on average, sometimes up to an hour. but the real problem is the flakiness. probably 8 to 12 tests fail on every single run, always different ones. devs have learned to just click rerun and grab coffee.
we're trying to ship multiple times per day but qa stage is the bottleneck. and nobody trusts the tests anymore because they've cried wolf so many times. when something actually fails everyone assumes it's just another selector issue.
tried parallelizing more but hit our ci runner limits. tried being smarter about what runs when but then we miss integration issues. feels like we're stuck between slow and unreliable.
anyone actually solved this problem? need tests that are fast, stable, and catch real bugs. starting to think the whole selector based approach is fundamentally flawed for complex modern webapps.
https://redd.it/1pzgupz
@r_devops
Looking for help for my startup
Hey all!
I'm coming here to seek some guidance or help on how to tackle the next challenge in the startup I am creating.
We currently have various services that some clients are using, and our next step is white-labeling a certain type of website.
Right now, we operate this website, which runs from a mono-repo with React and Next.js and is tightly coupled to an admin panel in a different repository.
The website requests data from the admin panel, including secrets at server boot (I did this to allow my future self to deploy multiple websites from the same codebase without having a mess of secrets on GitHub). These secrets are pulled from the admin panel using a slug I assigned to my website. Ideally, other websites in the future will use this same system.
The problem (or challenge): what's the way to go to have multiple deployments happen every time we merge into the main branch? Currently I am using GH Actions, but to me it doesn't look sustainable in the future once we have many white-labeled websites running out there.
It's also important to mention that each website will have its own external Supabase, an internal (self-hosted) Redis instance, and all of them will use our centralized Soketi (Pusher alternative, self-hosted) service... So, ideally, the solution would include provisioning that external Supabase (this is easy, APIs exist for that), a dedicated Redis, and a server to host the backend.
I've been a Software Engineer for the last 7-8 years but never really had to actually take care of devops / infra / you-call-it. I'm really open to learn all of this, had multiple conversations with Claude but I always prefer human-to-human information transfers.
Thank you!
https://redd.it/1pzjdwk
@r_devops
I'm rejecting the next architecture PR that uses a Service Mesh for a team of 4 developers. We are gaslighting ourselves.
I’ve been lurking here for years, and after reading some recent posts, I need to say something that might make me unpopular with the "CV-Driven Development" crowd.
We are engineering our own burnout.
I've sat on hiring panels for the last 6 months, and the state of "Senior" DevOps is terrifying. I’m seeing a generation of engineers who can write complex Helm charts but can’t explain how DNS propagation works or debug a TCP handshake.
Here is my analysis of why our industry is currently broken:
1. The Abstraction Addiction We are solving problems we don't have. I saw a candidate last week propose a multi-cluster Kubernetes setup with Istio for a simple internal CRUD app. When I asked why not just use a boring EC2 instance or ECS task, they looked at me like I suggested using FTP. We are choosing tools not because they solve a business problem, but because we want to put them on our LinkedIn. We are voluntarily taking on the operational overhead of Netflix without having their scale or their headcount.
2. The Death of Debugging To the user who posted "New DevOps please learn networking": Thank you. We are abstracting away the underlying systems so heavily that we are creating engineers who can "configure" but cannot "fix." When the abstraction leaks (and it always does, usually at 3 AM), these "YAML Engineers" are helpless because they don't understand the Linux primitives underneath.
3. Hiring is a Carnival Game We ask for 8 rounds of interviews to test for trivia on 15 different tools, but we don't test for systems thinking. Real seniority isn't knowing the flags for every CLI tool; it's knowing when not to use a tool. It's about telling management, "No, we don't need to migrate to that shiny new thing."
4. Complexity = Job Security (False) We tell ourselves that building complex systems makes us valuable. It doesn't. It makes us pagers. The best infra engineers I know build systems so boring that they sleep through the night. If you are currently building a resume-padder architecture: Stop.
If you are a Junior: Stop trying to learn the entire CNCF landscape. Learn Linux. Learn Networking. Learn a noscripting language deeply. If you are a Senior: Stop checking boxes. Start deleting code.
The most senior thing you can do is build something so simple it looks like a junior did it, but it never goes down.
/endrant
https://redd.it/1pzkibf
@r_devops
Release management nightmare - how do you track what's actually going out?
Just had our third surprise production issue this month bc nobody knew which features were bundled in our release. Engineering says feature X is ready, QA cleared it last week, but somehow it wasn't in the build that went out Friday.
We have relied on Slack threads and manual Git tag checking; they served us fine for a while, but I think we've reached a breaking point. How does this roll up to leadership when they ask what shipped this sprint? Like, what are you using for release management to ensure everything falls into place?
https://redd.it/1pzi7l9
@r_devops
ai generated k8s configs saved me time then broke prod in the weirdest way
context: migrating from docker swarm to k8s. small team, needed to move fast. i had some k8s experience but never owned a prod cluster
used cursor to generate configs for our 12 services. honestly saved my ass, would have taken days otherwise. got deployments, services, ingress done in maybe an hour. ran in staging for a few days, did some basic load testing on the api endpoints, looked solid
deployed tuesday afternoon during low traffic window. everything fine for about 6 hours. then around 9pm our monitoring started showing weird patterns - some requests fast, some timing out, no clear pattern
spent the next few hours debugging the most confusing issue. turns out multiple things were breaking simultaneously:
our main api was crashlooping but only 3 out of 8 pods. took forever to realize the ai set liveness probe initialDelaySeconds to 5s. works fine in staging where we have tiny test data. prod loads way more reference data on startup, usually takes 8-10 seconds but varies by node. so some pods would start fast enough, others kept getting killed mid-initialization. probably network latency or node performance differences, never figured out exactly why
while fixing that, noticed our batch processor was getting cpu throttled hard. ai had set pretty conservative limits - 500m cpu for most services. batch job spikes to like 2 cores during processing. didn't catch it in staging because we never run the full batch there, just tested the api layer
then our cache service started oom killing. 256Mi limit looked reasonable in the configs but under real load it needs closer to 1Gi. staging cache is basically empty so never saw this coming
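for what it's worth, a minimal sketch of the kind of settings that address those three failure modes - a startup probe that tolerates slow initialization, plus requests/limits sized to real usage (names, ports and numbers here are illustrative, not the actual configs):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: main-api
spec:
  replicas: 8
  selector:
    matchLabels:
      app: main-api
  template:
    metadata:
      labels:
        app: main-api
    spec:
      containers:
      - name: api
        image: registry.example.com/main-api:1.0   # placeholder image
        ports:
        - containerPort: 8080
        startupProbe:                 # allows 5s x 12 = 60s of startup before liveness takes over
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 12
        livenessProbe:                # only runs once the startup probe has succeeded
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "2"                  # spiky batch-style work needs headroom, not a 500m ceiling
            memory: 1Gi               # sized from observed usage, not a generic default
EOF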
the configs themselves were fine, just completely generic. real problem was my staging environment told me nothing useful:
- test dataset is 1% of prod size
- never run batch jobs in staging
- no real traffic patterns
- didn't know startup probes were even a thing
- zero baseline metrics for what "normal" looks like
basically ai let me move fast but i had no idea what i didn't know. thought i was ready because the yaml looked correct and staging tests passed
took about 2 weeks to get everything stable:
- added startup probes (game changer for slow-starting services)
- actually load tested batch scenarios
- set up prometheus properly, now i have real data
- resource limits based on actual usage not guesses
tried a few different tools for generating configs after this mess. cursor is fast but pretty generic. copilot similar. someone mentioned verdent which seems to pick up more context from existing services, but honestly at this point i just validate everything manually regardless of what generates it
costs are down about 25% vs swarm which is nice. still probably over-provisioned in places but at least it's stable
lesson learned: ai tools are incredible for velocity but they don't teach you what questions to ask. it's like having an intern who codes really fast but never tells you when something might be a bad idea
https://redd.it/1pzn5f9
@r_devops
How would you define proactive AWS Hygiene and Ownership process
We currently lack a standardized way to track ownership, lifespan, and relevance of AWS resources, especially in non-prod accounts. This leads to unused resources, unnecessary cost, and ambiguity during alerts or incidents. We need a proactive process to keep AWS environments clean and accountable.
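For context, one small building block for this kind of process could be required ownership tags that are surfaced automatically. A rough sketch (the tag key, region and tooling are just examples; assumes jq and the Resource Groups Tagging API are available):

# list resources in a region that are missing an "owner" tag
aws resourcegroupstaggingapi get-resources --region eu-west-1 --output json \
  | jq -r '.ResourceTagMappingList[]
           | select((.Tags // []) | map(.Key) | index("owner") | not)
           | .ResourceARN'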
While I will share some thoughts of my own, I want to ask fellow practitioners: how would you define such a process? What steps would be good here? What requirements do you feel we as DevOps need here?
https://redd.it/1pzlj8c
@r_devops
Holiday hack: EKS with your own machines
Hey folks, I’m hacking on a side project over the holidays and would love a sanity check from folks running EKS at scale.
Problem: EKS/EC2 is still a big chunk of my AWS bills even after the “usual” optimizations. I’m exploring a way to reduce EKS costs even further without rewriting everything from scratch to move off EKS.
Most advice (and what I’ve done before) clusters around:
- Spot + smart autoscaling (Karpenter, consolidation, mixed instance types)
- Rightsizing requests/limits, bin packing, node shapes, and deleting idle workloads
- Graviton/ARM where possible
- Reduce cross-AZ spend (or even go single AZ if you can)
- FinOps visibility (Kubecost, etc.) to find the real culprits (eg, unallocated requests)
- “Kubernetes tax” avoidance: move some workloads to ECS/Fargate when you can
But even after doing all this, EC2 is just… expensive.
So I'm playing around with a hybrid EKS cluster:
- Keep the managed EKS control plane in AWS
- Run worker nodes on much cheaper compute outside AWS (e.g. bare metal servers on Hetzner)
- Burst to EC2 for spikes using labels/taints + Karpenter on the AWS node pools
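If it helps picture the burst part, a rough sketch using Karpenter's v1 NodePool schema - a tainted AWS pool that only workloads carrying a matching toleration can spill onto (pool name, taint key, limits and the node class are all illustrative):

kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: aws-burst
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: example.com/aws-burst      # keeps everyday pods off the expensive EC2 capacity
          value: "true"
          effect: NoSchedule
  limits:
    cpu: "64"                             # cap how much EC2 the burst pool may provision
EOF
# pods that are allowed to burst carry the matching toleration:
#   tolerations:
#   - key: example.com/aws-burst
#     operator: Equal
#     value: "true"
#     effect: NoSchedule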
AWS now offers “EKS Hybrid Nodes” for this, but the pricing is even more expensive than EC2 itself (why?), so I’m experimenting with a hybrid setup without that managed layer.
Questions for the crowd:
- Would you ever run production workloads on off-AWS worker nodes while keeping EKS control plane in AWS? Why/why not?
- What’s the biggest deal-breaker: networking latency, security boundaries, ops overhead, supportability, something else?
If this resonates, I’m happy to share more details (or a small writeup) once I’ve cleaned it up a bit.
https://redd.it/1pzom7p
@r_devops
I made a CLI game to learn Kubernetes by fixing broken clusters (50 levels, runs locally on kind)
Hey,
I built this thing called K8sQuest because I was tired of paying for cloud sandboxes and wanted to practice debugging broken clusters.
## What it is
It's basically a game that intentionally breaks things in your local kind cluster and makes you fix them. 50 levels total, going from "why is this pod crashing" to "here's 9 broken things in a production scenario, good luck."
Runs entirely on Docker Desktop with kind. No cloud costs.
## How it works
1. Run ./play.sh - game starts, breaks something in k8s
2. Open another terminal and debug with kubectl
3. Fix it however you want
4. Run validate in the game to check
5. Get a debrief explaining what was wrong and why
The game has hints, progress tracking, and step-by-step guides if you get stuck.
## What you'll debug
- World 1: CrashLoopBackOff, ImagePullBackOff, pending pods, labels, ports
- World 2: Deployments, HPA, liveness/readiness probes, rollbacks
- World 3: Services, DNS, Ingress, NetworkPolicies
- World 4: PVs, PVCs, StatefulSets, ConfigMaps, Secrets
- World 5: RBAC, SecurityContext, node scheduling, resource quotas
Level 50 is intentionally chaotic - multiple failures at once.
## Install
git clone https://github.com/Aryan4266/k8squest.git
cd k8squest
./install.sh
./play.sh
Needs: Docker Desktop, kubectl, kind, python3
## Why I made this
Reading docs didn't really stick for me. I learn better when things are broken and I have to figure out why. This simulates the actual debugging you do in prod, but locally and with hints.
Also has safety guards so you can't accidentally nuke your whole cluster (learned that the hard way).
Feedback welcome. If it helps you learn, cool. If you find bugs or have ideas for more levels, let me know.
GitHub: https://github.com/Aryan4266/k8squest
https://redd.it/1pzr4jh
@r_devops
Docker's hardened images, just Bitnami panic marketing or useful?
Our team's been burned by vendor rug pulls before. Docker drops these hardened images right after the Bitnami licensing drama. Feels suspicious.
Limited to Alpine/Debian only, CVE scanning still inconsistent between tools, and suppressed vulns worry me.
Anyone moving prod workloads to these? What's your take?
https://redd.it/1pzrz1p
@r_devops
How do you integrate identity verification into CI/CD without slowing pipelines?
Hey folks, DevOps teams always need identity verification that plugs straight into pipelines without blocking deployments or creating security gaps. Most solutions either slow everything down or leave staging environments exposed, and we're looking for clean API handoffs that deliver reliable signals at real scale.
Does anyone know what works seamlessly for CI/CD flows?
https://redd.it/1pzuoy1
@r_devops
I got tired of the GitHub runner scare, so I moved my CI/CD to a self-hosted Gitea runner.
With the recent uncertainty around GitHub runner pricing and data privacy, I finally moved my personal projects to a self-hosted Gitea instance running on Docker.
The biggest finding: Gitea Actions is compatible with existing GitHub Actions .yaml files. I didn't have to rewrite my pipelines; I just spun up a local runner container, pointed it to my Gitea instance, and the existing scripts worked immediately.
It’s now running on my home server (Portainer) with $0 cost, zero cold-starts, and total data privacy.
Full walkthrough of the docker-compose setup and runner registration: https://youtu.be/-tCRlfaOMjM
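For anyone wanting to try the same, a minimal sketch of the runner side (instance URL, token and paths are placeholders; the image and env var names follow Gitea's act_runner docker image, so double-check against the current docs):

cat > docker-compose.yml <<'EOF'
services:
  runner:
    image: gitea/act_runner:latest
    restart: unless-stopped
    environment:
      GITEA_INSTANCE_URL: https://gitea.example.home         # your Gitea instance
      GITEA_RUNNER_REGISTRATION_TOKEN: <token-from-gitea-ui>
      GITEA_RUNNER_NAME: home-runner
    volumes:
      - ./runner-data:/data                                  # runner registration and state
      - /var/run/docker.sock:/var/run/docker.sock            # lets the runner start job containers
EOF
docker compose up -d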
Is anyone else running Gitea Actions for actual production workloads yet? Curious how it scales.
https://redd.it/1pzvjv0
@r_devops
How do u know a CloudFormation CHANGE won’t break something subtle?
You change one resource.
The stack deploys successfully.
Nothing errors.
But something downstream breaks.
How do you catch that before deploy?
Or do you just accept the risk?
Curious how people think about this in practice.
https://redd.it/1pzu7dl
@r_devops
Does anyone here use rapidapi? Having issues making a payment
I'm trying to add my card to purchase a subscription, yet my card keeps declining. So then I decided to use Klarna as a pay-later option and it got declined. Then I used Affirm and the loan was charged, but the payment was blocked by RapidAPI. The only conclusion I can come to is that I was making API calls from my laptop over a hotspot, so I don't know if RapidAPI considered this a proxy and decided to block me from making payments?
https://redd.it/1q00zkz
@r_devops
I built a browser extension for managing multiple AWS accounts
I wanted to share this browser extension I built a few days ago. I built it to solve my own problem while working with different clients’ AWS environments. My password manager was not very helpful, as it struggled to keep credentials organized in one place and quickly became messy.
So I decided to build a solution for myself, and I thought I would share it here in case others are dealing with a similar issue.
The extension is very simple and does the following:
- Stores AWS accounts with nicknames and color coding
- Displays a colored banner in the AWS console to identify the current account
- Supports one-click account switching
- Provides keyboard shortcuts (Cmd or Ctrl + Shift + 1 to 5) for frequently used accounts
- Allows importing accounts from CSV or `~/.aws/config`
- Groups accounts by project or client
I have currently published it on the Firefox Store:
https://addons.mozilla.org/en-US/firefox/addon/aws-omniconsole/
The source code is also available on GitHub:
https://github.com/mraza007/aws-omni
https://redd.it/1q02rc4
@r_devops
Is it just me or are some KodKloud course materials AI-generated?
Been using KodeKloud for a while now — love the hands-on labs and sandbox environments, they're genuinely useful for practical learning.
But I've started noticing some of the written course content has all the hallmarks of AI-generated text:
- Forced analogies every other paragraph ("think of it like a VIP list...")
- Formulaic transitions ("First things first," "Next up," "Time for a test run")
- Repeated phrases/typos that suggest no human reviewed it ("violations and violations," "real-world world scenario")
- Generic safety disclaimers at the end
Combined with other production issues I've noticed — choppy video edits, inconsistent audio quality, pixelated graphics, cropped screenshots cutting off text — it feels like they're prioritizing quantity over quality.
Anyone else noticing this? For what we pay, I'd expect better QA on the content. The practical stuff is solid but the courseware itself feels rushed.
EDIT: Typo in the title, oops, KodeKloud.
https://redd.it/1q04riy
@r_devops