How’s the DevOps/SRE job market in India right now for experienced folks?
Hey folks,
Just wanted to check how the job scene’s been lately for people with 10+ years of experience in DevOps/SRE.
I’ve got around 13 years of hands-on experience across IaC, CI/CD, cloud platforms, automation, and monitoring. But honestly, I haven’t been getting as many interview calls lately.
I’m based in a city that’s mostly full of service-based companies, so I’ve been actively looking for remote opportunities, ideally with product-based or global companies.
Curious to know —
• How’s the market looking for senior DevOps/SRE roles?
• Are remote jobs still a thing for Indian engineers?
• Any tips on improving visibility — like where to look, how to get noticed, certifications that actually help, or any job boards that work?
Would love to hear how others are navigating this phase.
https://redd.it/1o4pwsv
@r_devops
How to bootstrap argoCD cluster with Bitwarden as a secrets manager?
So, to start things off, I'm relatively new to DevOps and GitOps. I'm trying to bootstrap an argoCD cluster using the declarative approach. As you know, argoCD has an application-spec repository whose credentials it needs at bootstrap, because that's where the config files live. After reading the docs I found that the external secrets operator server needs to run over HTTPS (and the docs recommend cert-manager for this). So I'm trying to initialize the cluster with the argoCD configs, sealed secrets, and an ESO to fetch the secrets, BUT the ESO needs HTTPS, which again means cert-manager. Other than manually installing cert-manager outside of Argo and setting it up that way, how would I do it? I'm also considering just putting the secrets in a sealed secret without an ESO to bootstrap Argo first, and then installing everything else. If I missed anything, please let me know.
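One common way out of this (a sketch, not from the post): commit the repo-credential sealed secret with the initial bootstrap manifests, then use an app-of-apps whose children are ordered with sync-wave annotations, so cert-manager lands before the ESO and the ESO before anything that consumes ExternalSecrets. The repo URL and paths below are placeholders:

```yaml
# Hypothetical child Application in an app-of-apps; cert-manager gets the
# earliest wave so its certificates exist before the ESO webhook needs them.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # ESO at "0", consumers at "1"+
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops.git  # placeholder
    path: infra/cert-manager
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      prune: true
    syncOptions:
      - CreateNamespace=true
```

Argo CD only starts a wave once the previous wave is healthy, which is what breaks the chicken-and-egg loop here.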
https://redd.it/1o4sacp
@r_devops
How to totally manage GitHub with Terraform/OpenTofu?
Basically, all I need to do is create teams, permissions, repositories, branching and merge strategy, and projects (Kanban) in Terraform or OpenTofu. How can I test it out first, before running it against my org account? Since we're setting up a new project, I thought we could manage all of this via the GitHub provider.
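One low-risk way to dry-run this is to point the provider at a throwaway free organization and `plan`/`apply` there before switching `owner` to the real org. A minimal sketch with the `integrations/github` provider (names and the version pin are illustrative); as far as I know the provider covers classic repo projects but not the newer Projects (v2) boards, so verify that part before committing to it:

```hcl
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "~> 6.0" # illustrative pin
    }
  }
}

provider "github" {
  owner = var.github_owner # point at a throwaway test org first
}

resource "github_team" "platform" {
  name    = "platform"
  privacy = "closed"
}

resource "github_repository" "service" {
  name       = "example-service"
  visibility = "private"
  auto_init  = true
}

# Grant the team access to the repo
resource "github_team_repository" "platform_service" {
  team_id    = github_team.platform.id
  repository = github_repository.service.name
  permission = "push"
}

# Branching/merge policy on main
resource "github_branch_protection" "main" {
  repository_id = github_repository.service.node_id
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 1
  }
}
```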
https://redd.it/1o4s1nl
@r_devops
Centralizing GitHub repo deployments with environment variables and secrets: what is the best strategy?
I have somewhere around 30+ repos that use a .py script to deploy the code via GitHub Actions. The .py file is the same in every repo, except for the environment variables and secrets passed in from the GitHub repository configuration. Still, it's a hassle to update every repo after each change to the .py file. It wasn't too much work until now, but I've decided to tackle it.
I am thinking about "consolidating" it so that:
- There is a single repo that serves as the "deployment code" for the other repos
- The other repos connect to and use the .py file in that template repo to deploy code
Is this a viable approach? Additionally, if I check out both repos, will the connection to the service originate from the child repo or the template repo?
Any other thoughts are appreciated.
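For what it's worth, this is roughly what GitHub's reusable workflows are for, and they answer the checkout question: inside a `workflow_call` workflow, `actions/checkout` checks out the *calling* repo by default, so the deploy runs against the child repo's code. A sketch (org, repo, and secret names are made up; if the central repo is private, the second checkout also needs a token):

```yaml
# central repo, e.g. your-org/deploy-workflows/.github/workflows/deploy.yml
on:
  workflow_call:
    inputs:
      environment:
        type: string
        required: true
    secrets:
      DEPLOY_TOKEN:
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # checks out the CALLING (child) repo
      - uses: actions/checkout@v4   # also fetch the shared deploy.py
        with:
          repository: your-org/deploy-workflows
          path: .deploy
      - run: python .deploy/deploy.py --env "${{ inputs.environment }}"
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
```

Each child repo then needs only a thin caller, and keeps its own secrets:

```yaml
# each child repo: .github/workflows/deploy.yml
on: [push]
jobs:
  deploy:
    uses: your-org/deploy-workflows/.github/workflows/deploy.yml@main
    with:
      environment: production
    secrets: inherit   # forwards the child repo's configured secrets
```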
https://redd.it/1o506dx
@r_devops
Built a Claude Code plugin for Google Genkit with 6 commands + VS Code extension
I built a plugin that adds /genkit-init, /genkit-run, /genkit-flow (with RAG/Chat/Tool templates), /genkit-deploy, and /genkit-doctor commands.
Also published a VS Code extension with the same features + code snippets and a Genkit Explorer sidebar.
Quick install:
• Claude Code: /plugin marketplace add https://github.com/amitpatole/claude-genkit-plugin.git
• VS Code: ext install amitpatole.genkit-vscode
Supports TypeScript, JS, Go, Python. Works with Claude, Gemini, GPT, and local models. Deploys to Cloud Run, Vercel, Docker, etc. Comes with a specialized @genkit-assistant that knows Genkit inside-out.
Built 34 plugins total (test generation, monitoring, image/audio/video, vector DBs, etc.) - all MIT licensed.
GitHub: https://github.com/amitpatole/claude-genkit-plugin
Would love feedback from the community!
https://redd.it/1o51vkq
@r_devops
AWS to GCP Migration Case Study: Zero-Downtime ECS to GKE Autopilot Transition, Secure VPC Design, and DNS Lessons Learned
Just wrapped up a hands-on AWS to GCP migration for a startup, swapping ECS for GKE Autopilot, S3 for GCS, RDS for Cloud SQL, and Route 53 for Cloud DNS across dev and prod environments. We achieved near-zero downtime using Database Migration Service (DMS) with continuous replication (32 GB per environment) and phased DNS cutovers, though we did run into a few interesting SSL validation issues with Ingress.
Key wins:
* Strengthened security with private VPC subnets, public subnets backed by Cloud NAT, and SSL-enforced Memorystore Redis.
* Bastion hosts restricted to debugging only.
* GitHub Actions CI/CD integrated via Workload Identity Federation for frictionless deployments.
If you’re planning a similar lift-and-shift, check out the full step-by-step breakdown and architecture diagrams in my latest Medium article.
[Read the full article on Medium](https://medium.com/@rasvihostings/migrating-a-startup-from-aws-to-gcp-a-step-by-step-journey-efeb2bc20334)
What migration war stories do you have? Did you face challenges with Global Load Balancer routing or VPC peering?
I’d love to hear how others navigated the classic “chicken-and-egg” DNS swap problem.
**(I led this project; happy to answer any questions!)**
https://redd.it/1o5044g
@r_devops
Getting pushback on agent deployment for security tools
Our infra team is losing their minds over the number of agents we're being asked to deploy. Performance monitoring, vulnerability scanning, compliance checks, runtime protection. Each vendor wants their own agent installed everywhere.
Management keeps asking why we can't just use agentless security solutions instead. I get the appeal but wondering about coverage gaps.
What's everyone's experience with agentless vs agent-based approaches? Are we missing critical visibility without agents?
https://redd.it/1o54mnq
@r_devops
Dealing with fake traffic on NGINX instance
Hi, I didn't know which subreddit to use for this; hopefully there are people with relatable experience here.
My nginx instance (reverse-proxying multiple services) was recently hit with a flood of, I don't know, DDoS attacks? It doesn't make a lot of sense, because my stuff is irrelevant to anybody, but it did trigger CPU usage alarms on otherwise calm VPSs. I played with fail2ban, added some filters, and the biggest offenders are now banned.
However, it made me look closer at my access.log, and I still don't like what I'm seeing. Requests every 1-2 seconds on average, the IPs are always different and come from all over the world, and they clearly show signs of scraping. Is there a way to get rid of that? I have limit_req set up (but it's tricky: in testing, I haven't been able to distinguish between wget -r and a user hitting F5 multiple times) and User-Agent filtering, but as you can see, these are legit-looking User-Agents:
2025-10-13T02:06:48+00:00 - 200 - 14.188.178.49 - GET /config-links/commit/test/unit/dest/1.txt?follow=1 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
2025-10-13T02:06:49+00:00 - 200 - 66.249.79.206 - GET /cmake-common/tree/project/__init__.py?id=8534a341eba07fba8fe3a3eadfbe0e9be2072065 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:06:49+00:00 - 200 - 201.69.206.43 - GET /math-server/plain/test/benchmarks/lexer.cpp?id=6aac08009254909aab3e0359f3ad7ab4e87a91e9 HTTP/1.1 - Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36
2025-10-13T02:06:50+00:00 - 200 - 45.175.114.54 - GET /windows-home/diff/%25APPDATA%25/ghc/ghci.conf?follow=1&id=e87414387fe6060b81955b31376136ca1cb8a8eb HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36
2025-10-13T02:06:51+00:00 - 200 - 45.234.17.16 - GET /maintenance/tree/inventory.ini?h=old&id=c3af9ee6eafe56c4be78bf6c356c789255d27a08 HTTP/1.1 - Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
2025-10-13T02:06:54+00:00 - 200 - 66.249.79.206 - GET /winapi-common/log/?id=3a75e40fa6d92cea4b908fe537831219186cd0f0 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:06:54+00:00 - 200 - 14.175.66.1 - GET /cmake-common/log/examples?follow=1&h=v0.1&id=795dd9e87e44d1c49f160cd003cdde4113ee8247 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36
2025-10-13T02:06:57+00:00 - 200 - 14.191.94.42 - GET /config-links/log/Makefile?follow=1&h=debian&id=51d1d3010aeadf2bd9da82aaa549bd7a6f2632ed HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36
2025-10-13T02:07:03+00:00 - 200 - 191.219.191.160 - GET /blog/diff/Gemfile?id=59114a1dfa1c71c285443b183a61e9639fb4edff HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/89.0.4389.72 Safari/537.36
2025-10-13T02:07:10+00:00 - 200 - 45.187.141.12 - GET /linux-home/diff/.minttyrc?h=macos&id=0778b117c0f5949dc65340185cc35d0b1db560d9 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36
2025-10-13T02:07:11+00:00 - 200 - 113.176.179.2 - GET /jekyll-docker/log/?id=7d1824a5fac0ed483bc49209bbd89f564a7bcefe HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/88.0.4324.96 Safari/537.36
2025-10-13T02:07:12+00:00 - 301 - 149.100.11.243 - GET / HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
2025-10-13T02:07:14+00:00 - 200 - 190.12.104.161 - GET /cmake-common/plain/.clang-format?h=v3.2&id=0282c2b54f79fa9063e03443369adfe1bc331eaf HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36
2025-10-13T02:07:16+00:00 - 200 - 179.222.178.65 - GET /cmake-common/commit/toolchains/boost?h=v3.4&id=37b051e99fc6b0706f5dc4b2f01dbbbb9b96355a HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/79.0.3945.88 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 66.249.79.193 - GET /cgitize/diff/?h=v2.1.0&id=8d2422274ae948f7412b6960597f5de91f3d8830 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:07:17+00:00 - 200 - 179.49.32.156 - GET /config-links/diff/debian/changelog?h=debian%2Fv2.0.3-5&id=0a4df2ead72546cca8328581b1b41b172b83e769 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 14.231.40.70 - GET /vk-noscripts/commit/vk/utils?h=v1.0.1&id=ee7a170df79287aac3bccfead716377ec8600c5c HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36
2025-10-13T02:07:18+00:00 - 200 - 113.177.166.37 - GET /wireguard-config/plain/.ruby-version?id=ab97b021462809453a38b4f6b87944acd00d51b9 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/84.0.4147.125 Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 177.141.68.37 - GET /infra-terraform/log/.gitattributes?follow=1&h=v1.2.0&id=78dd4f3cc9d408df69fac270860b283e310fe379 HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4950.0 Iron Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 124.243.188.173 - GET /sorting-algorithms/commit/Gemfile?h=migration&id=9b3e6d409340369a6b450e997723f773f0aa3505&follow=1 HTTP/2.0 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47
(The log format is customized; I don't like the default one. Googlebot is fine.) Any tips? Like setting up a reCAPTCHA or something?
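One pattern that helps with the "can't tell wget -r from F5" problem (a sketch; ranges and rates are illustrative, not from the post): key `limit_req` on the client IP, but give verified crawler ranges an empty key, which nginx excludes from the zone entirely. You can then set the rate low enough to hurt scrapers without ever touching Googlebot. Verify crawler IPs against Google's published ranges rather than trusting the User-Agent:

```nginx
# http{} context: exempt verified crawler ranges, rate-limit everyone else.
geo $limited_ip {
    default        1;
    66.249.64.0/19 0;   # a published Googlebot range; keep this list current
}

map $limited_ip $limit_key {
    0 "";                    # empty key = request not counted in the zone
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=scrapers:10m rate=30r/m;

server {
    location / {
        # burst absorbs a human mashing F5; sustained crawling gets 429s
        limit_req zone=scrapers burst=20 nodelay;
        limit_req_status 429;
        # ... existing proxy_pass config ...
    }
}
```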
https://redd.it/1o57jqx
@r_devops
Ever heard of KubeCraft?
I was looking for resources and saw someone on this sub mention it. $3,500 for a 1-year bootcamp? I'm skeptical because I can't find many reviews of it.
For some additional background: I currently work in cyber (OT risk management with some AWS vulnerability management responsibilities) and I'm looking to transition into a cloud engineering role. My company gives us an L&D stipend, and so far I've used it to get Adrian Cantrill's AWS SAA course and an annual subscription to KodeKloud. I've still got a good amount left and was going to put it toward Nana's DevOps course and homelab equipment.
https://redd.it/1o570k3
@r_devops
Is cost a metric you care about?
Trying to figure out whether DevOps or software engineers should care about building efficient software (AI or not), in the sense of being optimized both for scalability/performance and for cost.
It seems that in the age of AI we're myopically focused on increasing output, not even outcomes. Think about it: productivity - let's assume you increase it, you have a way to measure it, and you conclude: yes, it's up. Is anyone looking at costs as well, just to put things into perspective?
Or is the predominant company mindset: cost is a "tomorrow" problem, let's get growth first?
When does cost become a problem, and who's solving it?
🙏🙇
https://redd.it/1o51juz
@r_devops
Simplifying OpenTelemetry pipelines in Kubernetes
During a production incident last year, a client’s payment system failed and all the standard tools were open. Grafana showed CPU spikes, CloudWatch logs were scattered, and Jaeger displayed dozens of similar traces. Twenty minutes in, no one could answer the basic question: which trace is the actual failing request?
I suggested moving beyond dashboards and metrics to real observability with OpenTelemetry. We built a unified pipeline that connects metrics, logs, and traces through shared context.
The OpenTelemetry Collector enriches every signal with Kubernetes metadata such as pod, namespace, and team, and injects the same trace context across all data. With that setup, you can click from an alert to the related logs, then to the exact trace that failed, all inside Grafana.
The full post covers how we deployed the Operator, configured DaemonSet agents and a gateway Collector, set up tail-based sampling, and enabled cross-navigation in Grafana: OpenTelemetry Kubernetes Pipeline
If you are helping teams migrate from kube-prometheus-stack or dealing with disconnected telemetry, OpenTelemetry provides a cleaner path. How are you approaching observability correlation in Kubernetes?
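For anyone who wants the shape of it, the enrichment-plus-sampling part of such a pipeline looks roughly like this contrib-Collector config (a sketch; the exporter endpoint, metadata keys, and thresholds are placeholders, not taken from the article):

```yaml
# Gateway Collector sketch: enrich spans with k8s metadata, then keep only
# error and slow traces via tail-based sampling.
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  k8sattributes:
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name, k8s.deployment.name]
  tail_sampling:
    decision_wait: 10s          # buffer spans before deciding per trace
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}

exporters:
  otlp:
    endpoint: tempo:4317        # placeholder backend
    tls: {insecure: true}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, tail_sampling]
      exporters: [otlp]
```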
https://redd.it/1o5c3bk
@r_devops
Are self-destructing secrets a good approach to authenticating a GitHub Actions self-hosted runner securely?
I created a custom self-hosted, Oracle Linux-based GitHub runner Docker image. The entrypoint script supports 3 ways of authentication:
* a short-lived registration token from the web UI
* a PAT token
* GitHub App auth -> .pem key + installation ID + app ID
Now, the first option is pretty safe to use even as a container env var because it's short-lived. I'm more concerned about the other two. My main gripe is that the container user which runs the GitHub connection service is the same user that runs the pipelines. So anyone who uses the pipelines can use them to read the .pem or the PAT. Yes, you could use GitHub secrets to "obfuscate" the strings, but you have to remember to do it every time, and there are other ways to extract them anyway.
So I created a self-destructing secrets mechanism: Docker mounts a local folder as a volume (it has to have full RW permissions on it), and you place the private-key.pem or pat.token files there. When the entrypoint.sh script runs, it uses either of them to authenticate the runner, clears the folder, and then starts the main service. If it can't delete the files, it refuses to start.
But I have a feeling this is a problem that's already been solved another way. Even though I couldn't find info on how to use two different users (one for runner authentication and one for pipelines), this security flaw feels too big for there not to be a better (and more appropriate) way to do it.
https://redd.it/1o5ctbh
@r_devops
What are the best integrations for developers?
I’ve just started using monday dev for our dev team. What integrations do you find most useful for dev-related tools like GitHub, Slack or GitLab?
https://redd.it/1o5c74n
@r_devops
monday dev vs ClickUp: why did you make the switch?
We moved from ClickUp to monday dev for its simpler interface and better automation. Curious about others’ experiences?
https://redd.it/1o5fjds
@r_devops
Built a 3-tier web app using AWS CDK and the CLI
Hey everyone!
I’m a beginner on AWS and I challenged myself to build a production-grade 3-tier web infrastructure using only AWS CDK (Python) and AWS CLI.
**Stack includes:**
* VPC (multi-AZ, 3 public + 3 private subnets, 1 NAT Gateway)
* ALB (public-facing)
* EC2 Auto Scaling Group (private subnets)
* PostgreSQL RDS (private isolated)
* Secrets Manager, CloudWatch, IAM roles, SSM, and billing alarms
Everything was done code-only, no console clicks except for initial bootstrap and billing alarm testing.
**Here’s what I learned:**
* NAT routing finally clicked for me.
* CDK’s abstraction makes subnet/route handling a breeze.
* Debugging AWS CLI ARN capture taught me about stdout/stderr redirection.
**Looking for feedback on:**
* Cost optimization
* Security best practices
* How to read documentation to refactor the CDK app
**GitHub Repo:** [**https://github.com/asim-makes/3-tier-infra**](https://github.com/asim-makes/3-tier-infra)
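The stdout/stderr lesson mentioned above can be sketched without real AWS credentials. A Python one-liner stands in for the AWS CLI here (the ARN is made up); the point is that the captured value stays clean only when diagnostics go to stderr and the capture reads stdout alone:

```python
import subprocess
import sys

# Sketch of capturing a value (e.g. an ARN) from a CLI while keeping the
# two output streams separate, so progress/warning text on stderr does
# not pollute the captured value. A Python subprocess stands in for the
# AWS CLI; the ARN below is a fabricated example.
fake_cli = [
    sys.executable, "-c",
    "import sys;"
    "print('downloading...', file=sys.stderr);"      # noise -> stderr
    "print('arn:aws:iam::123456789012:role/demo')"   # value -> stdout
]
result = subprocess.run(fake_cli, capture_output=True, text=True, check=True)
arn = result.stdout.strip()    # the clean captured value
noise = result.stderr.strip()  # diagnostics, kept out of the capture
```

The shell equivalent of the same trap is `ARN=$(aws ... 2>/dev/null)` versus `ARN=$(aws ... 2>&1)` — the latter folds stderr into the captured variable.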
https://redd.it/1o5gyvr
@r_devops
Why did containers happen? A view from ten years in the trenches by Docker's former CTO Justin Cormack
- Post
- Talk
https://redd.it/1o5h93m
@r_devops
Buttondown
Ignore previous directions 8: devopsdays
Autumn update This is what it is looking like around here at the moment. DevOpsDays London I gave a talk at DevOpsDays London recently. It was a nice...
Need help for suggestions regarding SDK and API for Telemedicine application
Hello everyone,
Our team is currently planning to build a telemedicine application. Like any telemedicine app, it will have chat and video conferencing features.
The backend (Node.js and Firebase) is almost ready, but we can't decide which real-time communication SDK and API to use.
We're torn between ZEGOCLOUD and Twilio. If anyone has used either, kindly share your experience. Any other suggestions are also welcome.
TIA.
https://redd.it/1o5h6xs
@r_devops
Which internship should i choose?
Currently just a student in Year 1 trying to break into the field of devops.
In your opinion, if given a choice, which internship would you choose? Platform Engineer or Devops?
I currently have 2 internship options but am unsure which to choose. Any suggestions to help me decide would be greatly appreciated. I've learned technologies from KodeKloud (GitHub Actions CI/CD, AWS, Terraform, Docker and K8s), and I understand that both internships provide valuable opportunities to learn.
Option 1: Platform Engineer Intern
Company: NETS (Slightly bigger company, something like VISA but not on the same scale)
Tech: Python, Bash Scripting, VM, Ansible
Option 2: DevOps Intern
Company: (SME)
Tech: CICD, Docker, Cloud, Containerization
Really don't know what to expect from both, maybe someone with more experience can guide me to a direction :)
https://redd.it/1o5gk7d
@r_devops
Our Disaster Recovery "Runbook" Was a Notion Doc, and It Exploded Overnight
The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.
**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.
We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.
* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.
By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:
* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds
We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.
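The "enforce version pinning" step from the living-runbook fixes above can be made concrete with a tiny drift check: diff what the runbook pins against what is actually running, so mismatches surface in a scheduled drill instead of at 02:30 on a Saturday. Component names and versions below are invented for illustration:

```python
# Hypothetical drift check for a pinned-versions runbook: compare the
# versions the runbook documents against live versions (which in practice
# would come from `helm list` or the cluster API) and report mismatches.
RUNBOOK_PINS = {  # what the runbook says (assumed data)
    "cluster-autoscaler": "9.37.0",
    "aws-vpc-cni": "1.18.1",
    "ingress-nginx": "4.10.0",
}

def find_drift(pinned: dict[str, str], live: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {component: (pinned, live)} for every mismatch or missing entry."""
    drift = {}
    for name, want in pinned.items():
        have = live.get(name, "MISSING")
        if have != want:
            drift[name] = (want, have)
    return drift

# Simulated live state: one component drifted, one is gone entirely.
live_versions = {"cluster-autoscaler": "9.37.0", "aws-vpc-cni": "1.16.0"}
drift = find_drift(RUNBOOK_PINS, live_versions)
```

Run on a schedule, a check like this turns "the doc assumed a default CNI we no longer run" into a failing job weeks before the outage, rather than a discovery made mid-incident.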
**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?
https://redd.it/1o5mdjd
@r_devops