Scheduling ML Workloads on Kubernetes
Hey guys. This article covers NVIDIA Kai-Scheduler, including gang scheduling, bin packing, consolidation, and queue features:
https://martynassubonis.substack.com/p/scheduling-ml-workloads-on-kubernetes
https://redd.it/1oehdnd
@r_devops
Suggestions for tools to improve a DevOps engineer's quality of life
I'm looking for suggestions that will improve my day-to-day operations as a DevOps engineer across the whole stack. For example, a tool or IDE that helps visualize and interact with the K8s cluster. I'm aware of something called Lens IDE but haven't looked too much into it. Or autocompletion/suggestions for Dockerfiles, etc. Anything, really. What is something you are using and would never go back to not using again?
https://redd.it/1oebaei
@r_devops
Anyone else feel AI is making them a faster typist, but a dumber developer? 😩
I feel like I'm not programming anymore, I'm just auditing AI output.
Copilot/Cursor is great for boilerplate. It’ll crank out a CRUD endpoint in seconds. But then I spend 3x the time trying to spot the subtle, contextual bug it slipped in (e.g., a tiny thread-safety issue, or a totally wrong way to handle an old library).
It feels like my brain’s problem-solving pathways are atrophying. I trade the joy of solving a hard problem for the anxiety of verifying a complex, auto-generated one. This isn't higher velocity; it's just a different, more draining kind of work.
Am I alone in feeling this cognitive burnout?
https://redd.it/1oepjg3
@r_devops
Spent 40k on a monitoring solution we never used.
The purchase decision:
- Sales demo looked amazing
- Promised AI-powered anomaly detection
- Would solve all our monitoring problems
- Got VP approval for 40k annual contract
What happened:
- Setup took 3 months
- Required custom instrumentation
- AI features needed 6 months of data
- Dashboard was too complex
- Team kept using Grafana instead
One year later:
- Login count: 47 times
- Alerts configured: 3
- Useful insights: 0
- Money spent: $40,000
Why it failed:
- Didn't pilot with smaller team first
- Bought for features, not current needs
- No champions within the team
- Too complex for our maturity level
- Existing tools were good enough
Lesson: Enterprise sales demos show what's possible, not what you need. Start with free tools and upgrade when you feel the pain.
(https://x.com/brankopetric00/status/1981484857440993523)
https://redd.it/1oeqkvs
@r_devops
Autoscaling RabbitMQ
I am busy working on a project to replace our AWS managed RabbitMQ service with RabbitMQ hosted on an EC2 instance. We want to move away from the managed service due to the mandatory maintenance window imposed by AWS.
We are a startup, so money is tight, and I am looking to do this in the most cost-effective manner.
My current thinking is having one dedicated Reserved Instance that runs 24/7, then an ASG that can spin up a spot instance or two when we have a message storm.
We are an IoT company, and when the APN blips, all our devices reconnect at once, causing our current RabbitMQ service's CPU to spike.
So I would like an extra node to spin up, assist the master node with processing, and then gracefully scale down again, leaving us with a single-instance Rabbit.
Is Rabbit built to handle this type of thing? I am getting conflicting information and am looking to hear from someone who has gone down this route before.
Any advice or experience welcome.
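On the ASG half of this (setting RabbitMQ behavior aside for a moment), a target-tracking policy on average CPU would cover the scale-out-then-drain cycle. A minimal boto3 sketch; the ASG name and target value are hypothetical:
```python
# Minimal boto3 sketch: target-tracking scale-out for a burst ASG when
# average CPU spikes. "rabbitmq-burst" and the 60% target are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="rabbitmq-burst",  # hypothetical ASG of spot nodes
    PolicyName="cpu-storm-scale-out",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,      # add capacity above ~60% average CPU
        "DisableScaleIn": False,  # let the extra nodes drain away afterwards
    },
)
```
The RabbitMQ side is the harder part: a classic queue lives on its home node, so an extra cluster node only absorbs load if queues and client connections actually land on it (quorum queues or a load balancer in front help), and the node needs to leave the cluster cleanly on scale-in.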
https://redd.it/1oeqo8r
@r_devops
A fast, private, secure, open-source S3 GUI
Since the web interfaces for Amazon S3 and Cloudflare R2 are a bit tedious, a friend of mine and I decided to build nicebucket, an open-source alternative using Tauri and React, released under the GPLv3 license.
I think it is useful for anyone who works with S3, R2, or any other S3-compatible service. We do not track any data and store all credentials safely via the native keychains.
We are still quite early so feedback is very much appreciated!
https://redd.it/1oeql17
@r_devops
GitHub: nicebucket-org/nicebucket (a fast, private, open-source S3 GUI)
Built a desktop app for unified K8s + GitOps visibility - looking for feedback
Hey everyone,
We just shipped something and would love honest feedback from the community.
What we built: Kunobi is a new platform that brings Kubernetes cluster management and GitOps workflows into a single, extensible system — so teams don’t have to juggle Lens, K9s, and GitOps CLIs to stay in control.
We make it easier to use Flux and Argo by enabling seamless interaction with GitOps tools. We’ve focused on addressing pain points we’ve faced ourselves — tools that are slow, memory-heavy, or just not built for scale.
Key features include:
- Kubernetes resource discovery
- Full RBAC compliance
- Multi-cluster support
- Fast keyboard navigation
- Helm release history
- Helm values and manifest diffing
- Flux resource tree visualization
[Here’s a short demo video for clarity.](https://youtu.be/y0m5L_XqGps?si=CSKS5Dqby-NqIixH)
Who we are: Kunobi is built by Zondax AG, a Swiss-based engineering team that’s been working in DevOps, blockchain, and infrastructure for years. We’ve built low-level, performance-critical tools for projects in the CNCF and Web3 ecosystems - Kunobi started as an internal tool to manage our own clusters, and evolved into something we wanted to share with others facing the same GitOps challenges.
Current state: It’s rough and in beta, but fully functional. We’ve been using it internally for a few months.
What we’re looking for:
- Feedback on whether this actually solves a real problem for you
- What features/integrations matter most
- Any concerns or questions about the approach
Fair warning — we’re biased since we use this daily. But that’s also why we think it might be useful to others dealing with the same tool sprawl.
Happy to answer questions about how it works, architecture decisions, or anything else.
🔗 https://kunobi.ninja — download the beta here.
https://redd.it/1oetwyc
@r_devops
MongoDB pod doesn't create user inside container
This is my MongoDB manifest YAML file. When the pod runs successfully, I check inside the MongoDB container and my user was not created, despite adding mongo-init.js to the docker-entrypoint-initdb.d folder.
I do the same with docker-compose and everything works fine!
How do I fix this issue? Please help.
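A common cause worth checking (a guess, since the manifest isn't shown): the official mongo image only runs docker-entrypoint-initdb.d scripts when the data directory is empty, so a PersistentVolume that already contains data skips them entirely; docker-compose with a fresh volume runs them, which would explain the difference. If that's the issue, either wipe the volume so the init script runs, or create the user directly. A minimal pymongo sketch, with hypothetical host and credentials:
```python
# Workaround sketch: create the user directly with pymongo instead of
# relying on docker-entrypoint-initdb.d, which the official image only
# runs when the data directory is empty (a reused PersistentVolume skips
# it). Host, credentials, and database/user names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://root:rootpass@mongodb:27017/?authSource=admin")
client["appdb"].command(
    "createUser",
    "appuser",
    pwd="apppass",
    roles=[{"role": "readWrite", "db": "appdb"}],
)
```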
https://redd.it/1oeuvm5
@r_devops
Real-world production Ansible projects for a CV
Hi all,
I have a network engineer background. I have written playbooks for network devices, mainly for F5.
But I was contacted for an Ansible job, so I need to put more "system" or DevOps-type projects on my CV.
Can you give me ideas of what you are doing in production, so I can do it myself and put it on my CV?
Would an Ansible certificate be useful? I have the basics.
https://redd.it/1oetwcf
@r_devops
Only allow specific country IP range to SSH
Hi, may I know the simplest way to allow a specific country's IP range to access SSH on my VPS?
I prefer using UFW rather than iptables, because I am a newbie and afraid that drilling down into iptables will mess things up.
I am reading this post but am not sure whether it is valid for Ubuntu:
https://blog.reverside.ch/UFW-GeoIP-and-how-to-get-there/
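If you go the UFW route, here is a minimal Python sketch of the rule-generation step. It assumes a hypothetical country-cidrs.txt exported from a GeoIP provider (one CIDR per line), and it only prints the ufw commands so you can review them before applying anything. Note that country lists can run to thousands of blocks, which is where plain UFW gets unwieldy and the ipset approach in the linked post shines:
```python
# Sketch: turn a file of country CIDR blocks into ufw allow rules for SSH.
# "country-cidrs.txt" is a hypothetical export from a GeoIP provider,
# one CIDR per line. Print first, apply later; keep console access handy
# so you cannot lock yourself out.
import ipaddress

with open("country-cidrs.txt") as f:
    for line in f:
        cidr = line.strip()
        if not cidr or cidr.startswith("#"):
            continue
        ipaddress.ip_network(cidr)  # raises ValueError on malformed entries
        print(f"ufw allow from {cidr} to any port 22 proto tcp")

# After allowing the ranges, deny SSH from everywhere else:
print("ufw deny 22/tcp")
```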
https://redd.it/1oexn4l
@r_devops
our postmortem from last week just identified the same root cause from june
had database connection pool exhaustion issue last tuesday. took three hours to fix. wrote the postmortem yesterday and vp pointed out we had the exact same issue in june.
pulled up that postmortem. action items were increase pool size and add better monitoring. neither happened because we needed to ship features to stay competitive.
so we shipped features for four months while the known prod issue sat unfixed. then it broke again and leadership acted shocked.
now they want to know why we keep having repeat incidents. maybe because postmortem action items go into backlog behind feature work and nobody looks at them until the same thing breaks again.
third time this year we've had a repeat incident where the fix was documented but never implemented. starting to wonder why we even write postmortems if nothing changes.
how do you actually get action items prioritized or is this just accepted everywhere?
https://redd.it/1oeyqqd
@r_devops
Database branches to simplify CI/CD
Careful some self-promo ahead (But I genuinely think this is an interesting topic to discuss).
In my experience failed migrations and database differences between environments are one of the most common causes of incidents. I have had failed deployments, half-applied migrations and even full-blown outages because someone didn't consider the legacy null values that were present in production but not on dev.
Many devs think "down migrations" are the answer to this. But they are hard to get right since a rollback of the code usually also removes the migration code from the container.
I work at Tiger Data (formerly Timescale) and we released a feature to fork an existing database this week. I wasn't involved in the development of the underlying tech, but it uses a copy-on-write mechanism that makes the process complete in under a minute. IMO these kinds of features are a great way to simplify CI/CD and prevent issues like the ones I mentioned above.
Modern infrastructure like this (e.g., Neon also has branches) offers a lot of options to simplify CI/CD. You can cheaply create a clone of your production database and use it for testing your migrations. You can even get a good idea of how long your migrations will take to run by doing that.
Of course, you'll also need to clean up afterwards and figure out whether the additional cost of automatically running a DB instance in your workflow is worth it. You could in theory go even further and use the mechanism to spin up a complete test environment for each PR a developer creates, similar to how this is often done for frontend changes in my experience.
In practice, a lot of the CI/CD setups I have worked with at other companies are really dusty and do not take advantage of the capabilities of the infrastructure that is available. It's also often hard to get buy-in from decision makers to invest time in this kind of automation. But when it works, it is downright beautiful.
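To make the idea concrete, here is a provider-agnostic Python sketch of a CI step that forks production, tests migrations against the fork, and cleans up. The control-plane endpoints, environment variables, and response fields are all hypothetical (substitute your provider's actual fork/branch API), and it assumes your Alembic env.py reads DATABASE_URL:
```python
# Provider-agnostic sketch of a CI step that tests migrations on a fork of
# production. Endpoints, env vars, and response fields are hypothetical;
# substitute your provider's real fork/branch API (Tiger Data, Neon, ...).
import os
import subprocess
import requests

API = os.environ["DB_PROVIDER_API"]      # hypothetical control-plane URL
HEADERS = {"Authorization": f"Bearer {os.environ['DB_PROVIDER_TOKEN']}"}

# 1. Fork production; copy-on-write should make this fast and cheap.
fork = requests.post(f"{API}/databases/prod/forks", headers=HEADERS, timeout=60)
fork.raise_for_status()
fork_id, fork_dsn = fork.json()["id"], fork.json()["dsn"]

try:
    # 2. Run the real migrations against the fork (and time them for free),
    #    assuming the Alembic env reads DATABASE_URL for its connection.
    subprocess.run(["alembic", "upgrade", "head"],
                   env={**os.environ, "DATABASE_URL": fork_dsn},
                   check=True)
    # 3. Smoke tests against the migrated fork would go here.
finally:
    # 4. Always delete the fork so test instances don't accumulate cost.
    requests.delete(f"{API}/databases/forks/{fork_id}",
                    headers=HEADERS, timeout=60)
```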
https://redd.it/1of09uc
@r_devops
Tiger Data blog: "Fast, Zero-Copy Database Forks: Deploy on Fridays with Confidence"
Linux admin to DevOps
I am moving from a Linux admin role to a DevOps role via an internal move.
The thing is, I know a little of everything: Ansible, Terraform, Docker, Kubernetes, and Jenkins, but I don't write anything complex or big. And I won't have many people to guide me in the new team. How should I start? Where do I begin? I have a month before I land in the new team.
https://redd.it/1oezeoz
@r_devops
Question: Version Bumping and Automating Releases
I work at a small company (2-person dev team) and there are no real protocols in place for version control or CI/CD. It's basically very smart scientists creating tools to aid R&D and QA on our product.
I don't want to re-invent the wheel, but I also want to take advantage of the freedom I have at work to learn how these processes and tools come about.
Our entire tech stack is basically Python, using PyQt to make Windows desktop applications (yes, I'm developing entirely on Windows).
The workflow I've come up with is the following:
- Versions are tracked in a .py file, referenced by my PyInstaller .spec file and by main.py to update the title bar version and the file name version after compiling
- I have a script that bumps the version on dev when I'm ready to put out a new release; it accepts major, minor, or patch to determine how the version is bumped
- The script pushes the tag to main, which then triggers a GitHub Actions workflow
- The workflow compiles the app and creates a release with a changelog generated from the commits between version tags (e.g., a summary of commits between v1.0.0..v1.1.0)
I'm trying to implement a git-flow branching system, but have not incorporated release branches yet. Here's some ASCII art from Claude (with a review and edits) attempting to demonstrate my release workflow from what I described (going bottom to top, like git log):
* Merge main back into dev - sync release v1.2.0 (HEAD -> dev)
|\
| * v1.2.0 - release tagged on main (release created on GH here) (tag: v1.2.0, main)
| |\
| | * Merge dev into main for release v1.2.0
| |/
| * QA complete on dev (dev)
| * Merge feat/fix into dev
| |\
| | * Implement feature X (feat/fix)
| | * Branch feat/fix created from dev
| |/
* Dev baseline before feature work
I know the workflow is missing release branches; ideally a release branch would be cut from dev, merged into main (where the GH release is created) and back into dev, with a hotfix branch from main if needed.
My question is mostly about the automation of all the above workflows. How are people managing versions? Is a .py file, given my stack, a reasonable/professional approach?
Could I offload more of this process to GH Actions, and have, say, a script called release.py (or .sh) that triggers the entire process?
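For the bump step, here is a minimal sketch of the kind of script described above, assuming the version lives in app/_version.py as __version__ = "1.2.3" (the path and layout are hypothetical):
```python
# Minimal bump-and-tag sketch. Assumes app/_version.py contains a line
# like: __version__ = "1.2.3"  (path is hypothetical). Pushing the tag is
# what fires the GitHub Actions release workflow.
import re
import subprocess
import sys

VERSION_FILE = "app/_version.py"  # hypothetical path

def bump(part: str) -> str:
    src = open(VERSION_FILE).read()
    major, minor, patch = map(int, re.search(
        r'__version__\s*=\s*"(\d+)\.(\d+)\.(\d+)"', src).groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    new = f"{major}.{minor}.{patch}"
    with open(VERSION_FILE, "w") as f:
        f.write(re.sub(r'__version__\s*=\s*"[^"]*"',
                       f'__version__ = "{new}"', src))
    return new

if __name__ == "__main__":
    part = sys.argv[1] if len(sys.argv) > 1 else "patch"
    tag = f"v{bump(part)}"
    subprocess.run(["git", "commit", "-am", f"chore: bump version to {tag}"],
                   check=True)
    subprocess.run(["git", "tag", tag], check=True)
    subprocess.run(["git", "push", "origin", "HEAD", tag], check=True)
```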
https://redd.it/1of1sa6
@r_devops
Outsider Curiosity - Outages
I sat through the Alaska Airlines “IT outage” yesterday and it got me very curious about how these situations get managed behind the scenes.
I’m very curious to know how many people are involved in troubleshooting/debugging something like that. Is there a solid staff that’s scheduled around the clock that can be trusted? Or does the company have to call in the savant no matter what time of day it is? Intuitively I feel like this could potentially be a “too many cooks in the kitchen” situation if the task isn’t handed over to a select group.
Are you clocking overtime during these situations or everyone’s salaried and just has to suck it up? Are the suits breathing down your neck during an outage or do they give you some space to work?
I feel like there must be some good insider stories here that I haven’t heard/read before. Feel free to link me any reading. Apologies if this is a common post in this sub, it’s just been on the front of my mind since last night.
https://redd.it/1of4qje
@r_devops
What Happened This Week at AWS – Full Technical Breakdown
Step 1: Routine Update and Rare Automation Failure
During the night between October 19 and 20, a routine update was performed on the core DynamoDB API system in the us-east-1 region.
The update triggered an internal automated process responsible for synchronizing DNS records between servers in different data centers.
During this synchronization, a rare race condition occurred, where two systems simultaneously wrote conflicting information to the internal DNS table.
Step 2: DNS Failure – Loss of Synchronization Between Servers
Because of this error, one of AWS’s internal DNS management systems lost synchronization with several regional DNS servers.
As a result, services trying to access DynamoDB using internal domain names were unable to resolve them to IP addresses, causing all calls to the DynamoDB API to fail.
Step 3: A Cross-Service Chain Reaction
DynamoDB is a foundational service relied upon by many other components, including Lambda, EC2, SQS, Step Functions, and Redshift.
When DynamoDB stopped responding, massive backlogs and timeouts started to build up in dependent services.
This created abnormal loads and partial outages across multiple cloud systems.
Step 4: Impact on the Internal Network Fabric
At the same time, the NLB Health Monitor, which is responsible for detecting healthy servers and routing internal traffic, received faulty data due to the DNS failure.
As a result, active servers were mistakenly marked as unavailable and removed from the network fabric, worsening the incident and increasing pressure on the remaining infrastructure.
Step 5: Detection and Containment
At 02:01 AM PDT (12:01 PM Saudi Arabia time), AWS teams identified that the issue originated from DNS resolution of the DynamoDB API.
By 02:24 AM, the chain reaction was contained, the DNS records were fixed, and the API returned to normal operation.
However, dependent services like Lambda and SQS required several more hours to rebuild queues and stabilize.
Step 6: Recovery and Lessons Learned
As a precautionary measure, AWS temporarily limited new EC2 launches to stabilize the system and prevent further cascading failures.
By the evening (US time), AWS reported that all services had fully recovered.
According to AWS, the root cause was faulty internal automation that led to DNS desynchronization, not a cyberattack.
Root Cause Summary
A faulty automation process and a race condition in the internal DNS synchronization system caused loss of name resolution for the DynamoDB API in us-east-1.
This triggered a widespread chain reaction, impacting many other services in the cloud infrastructure.
The incident lasted approximately nine hours, leading to outages or performance degradation in major applications like Snapchat, Reddit, and Fortnite, along with over a hundred additional services.
AWS announced plans to strengthen its monitoring and automation mechanisms to reduce the likelihood of similar incidents in the future.
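As a toy illustration of the failure class in Step 1 (emphatically not AWS's actual system): two automation workers doing an uncoordinated read-modify-write on a shared record table will silently lose one side's update, leaving the table in a state neither intended:
```python
# Toy lost-update race: both workers snapshot the table, modify their
# copy, then blindly write the whole snapshot back. Last writer wins and
# the other update vanishes. Models the failure class only.
import threading
import time

dns_table = {"dynamodb.us-east-1": "10.0.0.1"}

def apply_plan(new_ip: str) -> None:
    snapshot = dict(dns_table)               # read
    time.sleep(0.01)                         # window where snapshot goes stale
    snapshot["dynamodb.us-east-1"] = new_ip  # modify
    dns_table.clear()                        # blind overwrite of shared state
    dns_table.update(snapshot)

t1 = threading.Thread(target=apply_plan, args=("10.0.0.2",))
t2 = threading.Thread(target=apply_plan, args=("10.0.0.3",))
t1.start(); t2.start(); t1.join(); t2.join()

print(dns_table)  # which IP survives depends on thread scheduling
```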
https://redd.it/1of5763
@r_devops
I have an interview lined up for DevOps Engineer I, need guidance
Hey folks, I have a DevOps engineer interview lined up (tech stack: GCP and GKE). I have 1 year of experience as an SRE and no hands-on cloud experience, as my current org is on-prem.
I am not sure how to approach the preparation. Should I be honest, say I don't have hands-on experience with cloud tools but am familiar with the concepts, and revise my basics?
Or should I try some hands-on experiments with these tools? I only have about a week until the interview.
Anyone with similar experience of switching from on-prem to cloud, please let me know how you approached it.
Any relevant study material is highly appreciated.
https://redd.it/1of4u8w
@r_devops
Adding my on-call shifts into my private calendar? Looking for best practices
Hey all,
are you pushing your on-call shifts from your Incident Response tool (e.g. PagerDuty/Opsgenie/FireHydrant) into your personal calendars or do you keep it 100% in your professional calendar?
Asking for best practices from the community. Adding it to my personal calendar feels like work will completely take over my private life. But I guess that's just the way it is?
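One middle ground: most incident tools expose a private per-user WebCal/ICS feed you can subscribe to from a personal calendar, which avoids manual copying. If you'd rather control exactly what gets synced, here is a rough sketch that pulls shifts from PagerDuty's REST API and writes an .ics file; the token, schedule ID, and date range are placeholders:
```python
# Sketch: export upcoming PagerDuty on-call shifts to an .ics file you can
# import into (or serve to) a personal calendar. Token, schedule ID, and
# date range are placeholders; adjust to your setup.
import requests

TOKEN = "YOUR_API_TOKEN"   # placeholder
SCHEDULE_ID = "PXXXXXX"    # placeholder

resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={"Authorization": f"Token token={TOKEN}"},
    params={
        "schedule_ids[]": SCHEDULE_ID,
        "since": "2025-01-01T00:00:00Z",   # placeholder window
        "until": "2025-03-01T00:00:00Z",
    },
    timeout=30,
)
resp.raise_for_status()

def ics_ts(ts: str) -> str:
    # "2025-01-06T09:00:00Z" -> "20250106T090000Z"
    return ts.replace("-", "").replace(":", "")

lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//oncall-export//EN"]
for shift in resp.json()["oncalls"]:
    if not shift.get("start") or not shift.get("end"):
        continue  # skip open-ended / permanent assignments
    lines += [
        "BEGIN:VEVENT",
        f"DTSTART:{ics_ts(shift['start'])}",
        f"DTEND:{ics_ts(shift['end'])}",
        f"SUMMARY:On-call ({shift['user']['summary']})",
        "END:VEVENT",
    ]
lines.append("END:VCALENDAR")

with open("oncall.ics", "w") as f:
    f.write("\r\n".join(lines))
```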
https://redd.it/1of811q
@r_devops
Is RHCE enough for jr DevOps?
Sorry, I've been depressed due to family circumstances, so I'm just trying to find motivation to push forward, since my Red Hat certification expires on November 15th. I started in support at an MSP in 2020, then spent a year earning the CCNA, two years on the RHCSA, and put in around six months on CCNP ENCOR until I realized I was going in two different directions. I use GNS3 to lab everything to memory, since COVID allowed remote work.
But I didn't find a lot of opportunities, and it seems the Linux role became DevOps operations, so I decided to go for the RHCE. I feel I'm close, though I've been on this certificate wheel for so long, while my sister will soon graduate with a bachelor's in registered nursing. I couldn't afford college since I had to support my family, but I loved learning; in fact, my curiosity in my practice labs led me to encounter linting (hence why CI/CD is needed), which Cisco encourages under DevNet, so that was on the roadmap. Now it does feel like I wasted my 20s, when so many HR filters screen you out for degrees anyway. Rant aside, it seems like it's never enough, at least to leave the proverbial helpdesk.
So I want to ask: would the RHCE be the turning point? I don't know how hard finding entry-level DevOps roles will be, and I don't know where I'll be in the next few months, whether living alone or under a bridge. I'm not asking for a seven-figure role, just somewhere I can progress and feel there's something to push toward.
https://redd.it/1ofalho
@r_devops
Git repo question
Do you think this repo is legit? https://github.com/robertlestak/vault-secret-sync
https://redd.it/1of549t
@r_devops
List of my job interview experiences
A while ago I found myself in the sudden predicament of finding a new role. I interviewed with multiple Platform Engineer roles in companies in London and wish to share my experiences. Feel free to add any of your anonymous experiences in the comments:
- Loadsure - recruiter call, ghosted, role was filled
- Checkatrade - final stage, senior engineer had attitude issues, feedback was word spaghetti.
- Lifi - ghosted
- GSS - nice call, comp too low
- Appvia - weird, recruiter call, rejected due to "not using AWS enough recently". I've split the last decade across all 3 main providers... a good engineer can adapt?
- FDM - passed tech test, comp too low
- Mubi - more of an architectural tech test, felt good vibes, ghosted
- Zyte - ghosted
- NTT Data - comp too low
- Lightricks - 5 stages + take home, lowball comp, mega waste of time
- Citibank - surprisingly nice folk, 3 stages, ghosted, big fans of Golang
- WWT - good interview, job freeze
- anon trading fintech- 4 stages, offer, deep interview but fair
- brutal fintech - harsh grilling, immediate offer
- Trailmix games - comp too low
- Blackrock - offer, very deep interview
- Mastercard - offer, nice folk
- Balyasny - hedgefund lottery, talk to 5 people, ghosted
- JP Morgan - Senior VP with huge attitude problems. Staring at different screens and sighing. Worst of them all by far. Felt like a lecture, should we all just memorise ciphersuites and talk about multicasting? Ego trip
- Lloyds bank, fun but too long drawn out, comp lowball
- Synechron, good vibe, ghost
- Fasanara, hedge fund, brutal multi-round in-person interview, feedback: want CDK experience... but tested me on Terraform? Circus
- Zencore, perfect match, comp too low
- Nucleus security, good vibe, ghosted
- MUFG, ghosted
- Palantir - auto rejection email
- US Bank - auto rejection email
- BCG - auto rejection email
- Vitol - auto rejection email
- DRW - hire freeze
- PA Consulting - hire freeze
- IG Group - auto rejection email
A couple I can't mention, but in the end the offer I accepted ended up being from the nicest interview process. Interviewing is exhausting, and frankly in 2020 I'd walk into a role. Stay strong to those on their search.
Advice to companies: you don't realise it, but you might be the candidate's 7th interview of the week. Cut to the chase, make hiring processes short and to the point... and pay if you want talent.
https://redd.it/1ofgpal
@r_devops