Trying to get precise historical resource usage from Railway — why is this so hard?
I’ve been trying to get the exact resource usage (CPU, memory, network, etc.) for a specific Railway project within a specific time range, but I can’t seem to find a proper way to do it.
The API doesn’t give me consistent data, and the dashboard only shows recent stats.
Has anyone here managed to pull accurate historical usage from Railway?
Would really appreciate any pointers or workarounds.
https://redd.it/1oat9ne
@r_devops
VPS + Managing DB Migrations in CI
Hi all, I'm posting a similar question I posed to r/selfhosted, basically looking for advice on how to manage DB migrations via CI. I have this setup:
1. VPS running services (frontend, backend, db) via docker compose (using Dokploy)
2. SSH locked down to only allow access via private VPN (using Tailscale)
3. DB is not exposed to external internet, only accessible to other services within the VPS.
The issue is I can't determine what the right CI/CD process should be for checking/applying migrations. My thought is that CI needs to access the prod DB at two points in time: when a PR is opened, to check whether any migrations would be needed, and when deploying, to apply migrations as part of that process.
I previously had my DB open to the internet on e.g. port 5432. This worked since I could just access via standard connection string, but I was seeing a lot of invalid access logs, which made me think it was a possible risk/attack surface, so I switched it to be internal only.
After switching the DB to no longer be accessible from the internet, I have a new set of issues: just accessing the DB and running commands against it is tricky. It seems my options are:
1. Keep the DB port open and just deal with attack attempts. I was not successful configuring UFW to allow only Tailscale TCP traffic, but if that's possible it's probably a good option (see the sketch after this list).
2. Close the DB port and run migrations/checks against the DB via SSH somehow, but this gets complex. As an example, if I wanted to run a migration for Better Auth, as far as I can tell it can't be run in the prod container on startup, since it requires npx plus files (migration scripts, the auth.ts file) that are tree-shaken/minified/chunked away by the standard build/packaging process and no longer present. So if we go this route, it seems like it needs a custom container just for migrations (assuming we spin it up as a separate ephemeral service).
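For option 1, a minimal sketch of the UFW side, assuming Tailscale's default interface name tailscale0 and Postgres on 5432 (untested against this exact setup):

```sh
# Accept Postgres only when it arrives on the Tailscale interface;
# UFW evaluates rules in order, so the allow must come before the deny.
sudo ufw allow in on tailscale0 to any port 5432 proto tcp
sudo ufw deny 5432/tcp
sudo ufw status numbered   # confirm the allow sits above the deny

# Caveat: ports published by Docker bypass UFW (Docker manages its own
# iptables chains), so either bind the publish to the Tailscale IP in
# compose (e.g. "100.x.y.z:5432:5432") or don't publish the port at all.
```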
How are other folks managing this? I'm open to any advice or patterns you've found helpful.
https://redd.it/1oavnun
@r_devops
Building a DevOps homelab and AWS portfolio project. Looking for ideas from people who have done this well
Hey everyone,
I am setting up a DevOps homelab and want to host my own portfolio website on AWS as part of it. The goal is to have something that both shows my skills and helps me learn by doing. I want to treat it like a real production-style setup with CI/CD, infrastructure as code, monitoring, and containerization.
I am trying to think through how to make it more than just a static site. I want it to evolve as I grow, and I want to avoid building something that looks cool but teaches me nothing.
Here are some questions I am exploring and would love input on:
• How do you decide what is the right balance between keeping it simple and adding more components for realism?
• What parts of a DevOps pipeline or environment are worth showing off in a personal project?
• For hands-on learning, is it better to keep everything on AWS or mix in self-hosted systems and a local lab setup?
• How do you keep personal projects maintainable when they get complex?
• What are some underrated setups or tools that taught you real-world lessons when you built your own homelab?
I would really appreciate hearing from people who have gone through this or have lessons to share. My main goal is to make this project a long-term learning environment that also reflects real DevOps thinking.
Thanks in advance.
https://redd.it/1oazccy
@r_devops
Browser Automation Tools
I’ve been playing around with Selenium and Puppeteer for a few workloads, but they crash way too often and maintaining them is a pain. Browserbase has been decent, there’s a new one called steel.dev, and I’ve tried browser-use too but it hasn’t been that performant for me. I’m trying to use browser automation more and more for web testing and deep research - is there anything else that works well?
Curious what everyone’s using browser automation for these days: scraping, AI agents, QA? What actually makes your setup work well? What tools are you running, what problems have you hit, and what makes one setup better than another in your experience?
Big thanks!
https://redd.it/1ob3xsf
@r_devops
CI/CD template for FastAPI: CodeQL, Dependabot, GHCR publishing
Focus is the pipeline rather than the framework.
* Push triggers tests, lint, and CodeQL
* Tag triggers Docker build, health check, push to GHCR, and a GitHub Release (roughly the trigger split sketched below)
* Dependabot for dependencies and Actions
* Optional Postgres and Sentry via secrets, without breaking first run
Repo: https://github.com/ArmanShirzad/fastapi-production-template
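Not from the template itself, but roughly how that push/tag split is usually wired in Actions; the file names, job names, and steps below are assumptions, not the repo's actual workflows:

```yaml
# ci.yml (assumed name): tests, lint, CodeQL on every branch push / PR
on:
  push:
    branches: ["**"]   # branch pushes only; tag pushes skip this workflow
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt && pytest
---
# release.yml (assumed name): build and push to GHCR on version tags only
on:
  push:
    tags: ["v*.*.*"]
jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      packages: write   # required to push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - run: |
          docker build -t ghcr.io/owner/app:${{ github.ref_name }} .  # owner/app must be lowercase
          docker push ghcr.io/owner/app:${{ github.ref_name }}
```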
https://redd.it/1ob387d
@r_devops
Job Market is crazy
The job market is crazy out there right now; I am lucky I currently have a job and am just browsing. I applied to one position where I meet all the requirements and, it felt like, was sent a rejection email before I even received the Indeed confirmation. I understand they cannot look at all resumes, but what are these AIs looking for when all the skills match their requirements?
I wish anyone dealing with real job hunting the best of luck.
https://redd.it/1ob6vej
@r_devops
We developed a web monitoring tool ZomniLens and want your opinion
We've recently built a web monitoring tool [https://zomnilens.com](https://zomnilens.com) to detect website anomalies. The following features are included in the Standard plan:
* 60s monitoring interval.
* Supports HTTP GET, POST and PUT
* Each client has a beautiful service status page, kept private by default for security and data protection; it can be made public at any time if desired. [demo page](https://dashboard.zomnilens.com/dashboard/snapshot/1pZmr4luQyVrhU1RzcABxiBytNVS8ceq?orgId=0&from=2025-10-07T20:00:21.510Z&to=2025-10-08T01:00:21.510Z&timezone=browser)
* Currently it supports email and SMS alerts. We are working on integrating other alerting channels (Slack, Webex, etc.) and they will be included in the same Standard pricing plan once available.
* Alerts are triggered on downtime, slow response time, soon-to-expire SSL certificates, and keyword-matching failures.
We would like to hear your thoughts on:
* Which features do you think the service is missing that you'd like us to include in future releases?
* What other areas should the service improve on?
Feel free to submit a free trial request via [https://zomnilens.com/pricing/](https://zomnilens.com/pricing/), try it out, and let me know whether it works for your personal or business needs.
https://redd.it/1ob3lwk
@r_devops
I can’t understand Docker and Kubernetes practically
I am trying to understand Docker and Kubernetes - and I have read about them and watched tutorials. I have a hard time understanding something without being able to relate it to something practical that I encounter in day-to-day life.
I understand that a Dockerfile is the blueprint to create a docker image, and a docker image can then be used to create many docker containers, which are running instances of that image. Kubernetes can then be used to orchestrate containers - meaning it can scale them as necessary to meet user demand. Kubernetes creates as many or as few pods as the configuration dictates; pods consist of one or more containers and run on nodes, each of which has a kubelet. Kubernetes load balances and is self-healing - excellent stuff.
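To pin down the part I think I understand, here is a minimal sketch with entirely hypothetical names and image (nothing real behind it):

```yaml
# Hypothetical Deployment: "keep 3 copies of this container image running."
# Kubernetes starts 3 Pods from the image, replaces any that crash
# (self-healing), and `kubectl scale deployment web-api --replicas=10`
# or an autoscaler changes the count when demand grows.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.0.0  # built from a Dockerfile
          ports:
            - containerPort: 8000
```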
WHAT DO YOU USE THIS FOR? I need an actual example. What is in the docker containers???? What apps??? Are applications on my phone just docker containers? What needs to be scaled? Is the google landing page a container? Does Kubernetes need to make a new pod for every 1000 people googling something? Please help me understand, I beg of you. I have read about functionality and design and yet I can’t find an example that makes sense to me.
Edit: First, I want to thank you all for the responses, most are very helpful and I am grateful that you took time to try and explain this to me. I am not trolling, I just have never dealt with containerization before. Folks are asking for more context about what I know and what I don't, so I'll provide a bit more info.
I am a data scientist. I access datasets from data sources either on the cloud or download smaller datasets locally. I've created ETL pipelines, I've created ML models (mainly using tensorflow and pandas, creating customized layer architectures) for internal business units, I understand data lake, warehouse and lakehouse architectures, I have a strong statistical background, and I've had to pick up programming since that's where I am less knowledgeable. I have a strong mathematical foundation and I understand things like Apache Spark, Hadoop, Kafka, LLMs, Neural Networks, etc. I am not very knowledgeable about software development, but I understand some basics that enable my job. I do not create consumer-facing applications. I focus on data transformation, gaining insights from data, creating data visualizations, and creating strategies backed by data for business decisions. I also have a good understanding of data structures and algorithms, but almost no understanding about networking principles. Hopefully this sets the stage.
https://redd.it/1odqipt
@r_devops
Who actually owns container security?
In our company, developers build Dockerfiles, ops teams run Kubernetes, and security just scans the results. When a vulnerability is found, nobody agrees on who should fix it: devs say "not my code," ops say "not my job," and security doesn't have access. Who owns container security in your org? Is it devs, ops, or security?
https://redd.it/1oe24mm
@r_devops
Linux Sysadmin Competency
Hey all! I’ve recently started work in DevOps as a junior engineer. I’ll be handling GHE administration and creating/administering CI/CD workflows, plus some basic K8s work once those two priorities are covered.
My background: I’m currently on a career switch and took a course on cloud & DevOps.
What can I do to quickly gain the skill set and competency for a Linux sysadmin role? Which exams should I consider? Which Udemy courses are useful? I’ll be getting a KodeKloud subscription once I’m proficient and moving on to Kubernetes.
I’ll be working in a secure, air-gapped environment.
https://redd.it/1oe58ra
@r_devops
Which bullets are the most impressive?
Which 5-7 of these accomplishments would you prioritize for a senior/lead engineer? I have limited space and want to highlight what's most impressive to hiring managers and technical leaders.
* **Serverless architecture processing 1M+ transformations/month at 300ms latency** - Built high-performance async content pipeline using AWS Lambda, S3, CloudFront, and httpx
* **Complete product economics infrastructure** - Designed token-based pricing, gamified leaderboards, affiliate referral system, and usage-based metered billing handling 30K+ API calls/month
* **Multi-tenancy PostgreSQL database design** - Implemented UUID-based multi-tenancy with SQLAlchemy ORM and Alembic migrations on AWS RDS
* **OAuth2 authentication system** - Integrated Clerk provider with async httpx client for secure cross-platform identity management
* **£0 to $6.4K monthly revenue in 6 months** - Architected and monetized the entire platform from scratch
* **34% churn reduction** - Used behavioral cohort analysis and DynamoDB event tracking to drive data-driven product decisions
* **Stripe payment integration** - Built complete billing infrastructure with webhook handlers triggering Lambda functions via API Gateway and SQS queues
* **73% deployment time reduction** - Built automated IaC CI/CD pipelines using AWS CDK, Terraform, and Nx distributed caching across multi-stage environments
* **Production-grade Nx Python monorepo** - Evolved codebase with clean separation of concerns, dependency injection, and modular boundaries
* **Comprehensive testing suite** - Unit, integration, and E2E tests with IaC deployment enabling continuous delivery across dev/staging/prod
* **Scaled team from 1 to 5 developers** - Established technical hiring process and onboarded developers while maintaining code quality
* **Developer experience infrastructure** - Built Docker containerization and local testing suites enabling team to ship production features
* **GenAI video/image editing automation** - Implemented AI-powered content pipeline serving production workloads
Over the past 2 years I have been building a bootstrapped company, adding to it each day; these are the main things. Which should I include on my resume?
https://redd.it/1oed073
@r_devops
New DevOps engineer — how do you track metrics to show impact across multiple clients/projects?
Hey folks,
I’ve recently been promoted to a DevOps Engineer at a large IT outsourcing company. My team works on a wide range of projects — anything from setting up CI/CD pipelines with GitHub Actions, to managing Rancher Kubernetes clusters, to creating Prometheus/Grafana dashboards. Some clients are on AWS, others on GCP, and most are big enterprises with pretty monolithic and legacy setups that we help modernize.
I love the variety (it’s a great place to learn), but I’m trying to be proactive about tracking my performance and impact — both for internal promotions and for future job opportunities.
The challenge is that since I jump between projects for different clients, it’s hard to use standardized metrics. A lot of these companies don’t track things like “deployment frequency” or “lead time to production,” and I’m not sure what’s realistic for me to track personally.
So I’d really appreciate your help:
What DevOps metrics or KPIs do you personally track to demonstrate your impact?
How do you handle this when working across multiple clients or short-term projects?
Any tips on what to log or quantify so it’s useful later (e.g., for a performance review or a resume)?
I want more oomph than things like “implemented GitHub Actions CI/CD for X project” or “migrated on-prem app to GCP” - a way to make my future work appear more impactful.
Thanks in advance
https://redd.it/1oeeiuu
@r_devops
How is AI changing DevOps?
Hey everyone,
Most of us have been using AI tools in our DevOps work for a while now, and I think we're at an interesting point to reflect on what we're actually learning.
I'm curious to hear from the community:
What's working well? Which AI tools have genuinely improved your workflow? What use cases have been most valuable?
Where are the gaps? What hasn't lived up to the hype? Where do these tools still fall short?
How is the role changing? Are you noticing shifts in where you spend your time or what skills are becoming more important?
Best practices emerging? Have you developed any strategies or approaches that others might benefit from?
I suspect many of us are navigating similar questions about how to stay effective and relevant as the landscape evolves. Would be great to hear what you're all experiencing and how you're thinking about it.
Looking forward to the discussion!
https://redd.it/1oefomy
@r_devops
How do you handle configuration drift in your environments?
We've been facing issues with configuration drift across our environments lately, especially with multiple teams deploying changes. It’s becoming a challenge to keep everything in sync and compliant with our standards.
What strategies do you use to manage this? Are there specific tools that have helped you maintain consistency? I'm curious about both proactive and reactive approaches.
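One proactive pattern that often comes up (a sketch under assumptions, not something from the post): if the environments are Terraform-managed, a scheduled read-only plan can flag drift before anyone hits it.

```sh
# Hypothetical nightly drift check: with -detailed-exitcode,
# `terraform plan` exits 0 when live infra matches the committed
# config, 2 when it has drifted, and 1 on error.
terraform plan -input=false -lock=false -detailed-exitcode
status=$?
if [ "$status" -eq 2 ]; then
  echo "drift detected - page the owning team or open a PR"
elif [ "$status" -ne 0 ]; then
  echo "plan failed"
fi
```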
https://redd.it/1oe4q90
@r_devops
AWS us-east-1 outage postmortem
AWS’s retrospective on the DynamoDB disruption in US-East-1 isn’t remarkable because something broke; things break every day.
What stands out is how long it took to see the full extent of the picture, and how predictable that delay was.
A small defect in DNS automation quietly rewrote endpoint records.
To be clear, this wasn’t DNS. It was a latent race condition that surfaced through DNS.
At AWS scale, even something as simple as “which IP should this endpoint resolve to” is managed by layers of automation. DynamoDB’s routing is backed by thousands of load balancers across multiple AZs, with automated systems continuously adjusting DNS records.
That one race condition broke the implicit contract every AWS service in us-east-1 relied on: that DynamoDB would always be reachable.
Everything downstream continued behaving as if everything was still consistent: DynamoDB calls timed out, EC2 provisioning stalled, Network Load Balancers reported bad health checks.
There were no alerts paging that “DNS is down.” But there were a lot of individual alerts paging for several other reasons.
At Rootly, we see this pattern everywhere. The hardest part of a major incident isn’t the fix; it’s realizing that ten small, seemingly unrelated failures are all symptoms of the same root cause.
Every distributed system runs on invisible contracts: this record will resolve, this endpoint will respond, this region will behave like the others. Boundaries are the invisible contracts that are baked into how teams and systems reason about the digital world.
When one breaks silently the failure can hide behind normal behaviour. Systems continue with exactly what they have been programmed to do, now just on the wrong assumptions.
By the time patterns become visible, the real question isn’t what failed; it’s how many other systems still trust it.
In this case, the DNS automation bug was just the first crack in a chain of invisible contracts that everyone assumed was safe.
AWS’s DNS automation followed instructions perfectly, as automation does, otherwise why would we automate it? The problem was that the instructions were out of date.
There’s a reason we automate things: automation is great at doing things quickly. That’s an obvious statement. Here’s another one: automation is terrible at deciding whether a thing should still be done. Otherwise it would be autonomy.
Across large complex systems, we see this dynamic repeatedly. As a matter of fact, Anthropic published a similar retrospective only days ago.
When every safeguard is automatic, you lose the pauses where intuition normally kicks in.
The result isn’t chaos, it’s confidence that everything must be working because no one has said otherwise.
In AWS’s timeline, DynamoDB errors appeared hours before EC2 and NLB issues were connected.
At scale, no single team owns the entire picture.
Each service has its own alerts, escalation policies, and vocabulary.
From inside DynamoDB, it looked like increased error rates.
From inside EC2, provisioning delays.
From NLB, unhealthy targets.
Every team was right. It was just incomplete and missing context.
The coordination overhead of discovering that everyone is actually working on the same problem is massive.
I’ve heard endless stories about organizations spending more time figuring out who should respond rather than fixing what’s actually wrong. That’s not incompetence. Some of the smartest people in the world work at AWS and other large complex companies. It’s just what happens when visibility is local and failure is global.
AWS actually fixed the race condition quickly, but the region didn’t return to steady state for hours.
In my humble opinion and experience that’s normal. Distributed systems don’t snap back, they tend to drift toward normal states.
If you’re ever part of an outage like this, temper your expectations: recovery isn’t linear, and your systems aren’t waiting on you; they are re-learning what “healthy” means.
The question after every incident shouldn’t be “How did this happen?”, it should be “How do we recognize it faster next time?”
AWS’s transparency helps remind everyone that even at hyperscale, the fundamentals are the same: boundaries drift, context fragments, automation repeats mistakes perfectly.
Reliability isn’t about stopping that, it’s about building the reflexes to see it sooner, talk about it clearly, and learn from it completely.
https://redd.it/1oej8pw
@r_devops
Finding the Right Audience Without Feeling “Salesy” or Pushy
I’ve been thinking a lot lately about how to genuinely connect with the right audience — whether it’s for a creative project, small business, content channel, or personal brand. There’s so much advice out there about “target demographics” and “Individual DM's,” but sometimes it feels like that turns people into metrics instead of humans.
How do you find and attract the audience who actually resonates with what you do without coming across as pushy or overly promotional?
https://redd.it/1oeksjg
@r_devops
New to Devops - Why Is Everything Structured Differently?
I’m currently transitioning from IT to DevOps at my workplace. So far, it’s been going okay, but one thing that confuses me is encountering code that’s structured differently from other code; it’s hard to find consistency. I’m not sure if it’s because I work at a startup, but I constantly have to dig to figure out why one thing has a certain feature enabled while another doesn’t. There are a lot of these "context-specific decisions" in our code base, and there are so many namespaces and so many models that it gets difficult to understand. Is this normal?
https://redd.it/1oejuje
@r_devops
Scheduling ML Workloads on Kubernetes
Hey guys. This article covers NVIDIA Kai-Scheduler, including gang scheduling, bin packing, consolidation, and queue features:
https://martynassubonis.substack.com/p/scheduling-ml-workloads-on-kubernetes
https://redd.it/1oehdnd
@r_devops
Suggestions of tools to improve life quality of a devops engineer
I'm looking for suggestions that will improve my day-to-day operations as a DevOps engineer across the whole stack. For example, a tool or IDE that helps visualize and interact with a k8s cluster - I'm aware of something called Lens IDE but haven't looked too much into it. Or autocompletion/suggestions for Dockerfiles, etc. Anything, really. What is something you are using that you would never go back to not using?
https://redd.it/1oebaei
@r_devops
Anyone else feel AI is making them a faster typist, but a dumber developer? 😩
I feel like I'm not programming anymore, I'm just auditing AI output.
Copilot/Cursor is great for boilerplate. It’ll crank out a CRUD endpoint in seconds. But then I spend 3x the time trying to spot the subtle, contextual bug it slipped in (e.g., a tiny thread-safety issue, or a totally wrong way to handle an old library).
It feels like my brain’s problem-solving pathways are atrophying. I trade the joy of solving a hard problem for the anxiety of verifying a complex, auto-generated one. This isn't higher velocity; it's just a different, more draining kind of work.
Am I alone in feeling this cognitive burnout?
https://redd.it/1oepjg3
@r_devops