Looking for feedback on my AWS TUI tool
I built a terminal UI for AWS resource management (think k9s but for AWS). Would love feedback from people who actually manage AWS infrastructure daily.
GitHub: https://github.com/clawscli/claws
Main features:
Query multiple profiles × regions at once
Vim-style navigation
60+ services, 160+ resource types
Read-only mode for safe exploration
Specifically interested in:
What services/resources are missing that you'd actually use?
Any UX pain points?
https://redd.it/1q92v5s
@r_devops
A practical 2026 roadmap for production observability & debugging
I kept seeing observability content that stops at “add metrics + dashboards” and still leaves teams blind during real incidents.
I put together a roadmap that reflects how production observability actually works in distributed systems:
– monitoring vs observability (signals vs symptoms)
– metrics, logs, traces as a system, not silos
– context propagation across async and service boundaries
– instrumentation strategy (what not to instrument)
– sampling & cost reality (debugging without full fidelity)
– latency without errors, errors without load, silent failures
– incident debugging playbooks
– cascading failure patterns & partial outages
– alerting, SLOs, and operational feedback loops
The focus is how to think during production incidents, not tools or vendors.
Language- and stack-agnostic by design.
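To make the "context propagation across async and service boundaries" item concrete, here is a minimal sketch using Python's standard `contextvars` and `asyncio` modules. It is not taken from the roadmap: the `trace_id` variable, the request IDs, and the function names are all made up for illustration. It only shows the idea that a request-scoped ID should follow the work across task boundaries without being passed by hand.

```python
import asyncio
import contextvars

# Hypothetical request-scoped trace ID; the variable name is illustrative.
trace_id = contextvars.ContextVar("trace_id", default="unset")

def log(sink, message):
    # Every log line is stamped with the trace ID of the current context.
    sink.append(f"trace={trace_id.get()} {message}")

async def charge_payment(sink):
    await asyncio.sleep(0)  # yield to the event loop, as a real I/O call would
    log(sink, "payment charged")

async def handle_request(request_id, sink):
    trace_id.set(request_id)
    log(sink, "request received")
    # create_task copies the current context, so the child task sees the same ID.
    await asyncio.create_task(charge_payment(sink))

async def main():
    sink = []
    # gather wraps each coroutine in its own task, so the two IDs never mix.
    await asyncio.gather(handle_request("req-a", sink), handle_request("req-b", sink))
    return sink

if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

Across process or service boundaries the same ID would have to travel in a header or message attribute, which this in-process sketch does not cover.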
Roadmap image + interactive version here:
👉 https://nemorize.com/roadmaps/production-observability-from-signals-to-root-cause-2026
Curious what people think is missing, overkill, or ordered incorrectly.
https://redd.it/1q94jzi
@r_devops
I need advice on meaningful personal projects (developer + DevOps, tool-building focus)
I'm trying to decide what kind of personal project to build that will be meaningful for learning and possibly useful for job applications, but learning comes first. I've made many small projects while building my homelab setup, but I am looking for something more like actually creating my own tools.
I'm aiming for something that sits between developer and DevOps.
I want to improve my coding skills and understand DevOps tools on a deeper level. I'm kind of sick of just using tools and not creating my own, if that makes sense.
Maybe I have the wrong take on this, but a comment I always get from older-generation engineers is how much they learned when they had to create their own tools, so I thought it would be cool to try too.
I would be grateful for any guidance on this topic. If my thinking is off, I'm open to hearing what I should focus on instead.
Some additional context: I've been a DevOps engineer for 4 years and recently became unemployed. I want to start a project, but everything I've seen online feels like something I've already done a better version of in real production environments.
https://redd.it/1q99x3q
@r_devops
Why the hell are devs still putting passwords in AI prompts? It's 2026!
Writing this because I keep seeing devs paste API keys and passwords directly into prompts during code reviews. Assume your LLM provider logs prompts, that prompts get cached, and that, depending on the provider's data policy, your secrets can end up in training data.
Use environment variables. Use secret managers. Sanitize inputs before they hit the model.
This should be basic security hygiene by now but apparently it needs saying.
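As a rough illustration of "sanitize inputs before they hit the model", here is a minimal sketch that redacts obvious secrets from text before it is used as a prompt. The two regex patterns, the `LLM_API_KEY` variable name, and the function names are assumptions made up for this example; a real setup would more likely rely on a dedicated secret scanner and a secret manager.

```python
import os
import re

# Illustrative patterns only; they catch obvious cases, not every secret format.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")  # shape of an AWS access key ID
KEY_VALUE = re.compile(
    r"(?i)\b(password|passwd|api[_-]?key|token|secret)(\s*[:=]\s*)(\S+)"
)

def redact(text):
    """Replace likely secrets with placeholders before the text reaches a model."""
    text = AWS_KEY_ID.sub("[REDACTED_AWS_KEY_ID]", text)
    # Keep the key name and separator so the prompt still reads naturally.
    return KEY_VALUE.sub(r"\1\2[REDACTED]", text)

def load_api_key(env=os.environ):
    # Hypothetical variable name. The key authenticates the API call;
    # it is read from the environment and never placed in the prompt.
    return env.get("LLM_API_KEY", "")

if __name__ == "__main__":
    snippet = "db password = hunter2 and key AKIAABCDEFGHIJKLMNOP"
    print(redact(snippet))
```

Pattern-based redaction is a last line of defense, not a substitute for keeping secrets out of the code under review in the first place.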
https://redd.it/1q9cw8r
@r_devops
Vendor selection: enterprise vs startup vs build your own?
Hey! Solopreneur here who just launched an observability SaaS.
Need honest feedback on how you make vendor decisions.
Three options with identical SLA and infrastructure:
Enterprise with high prices ($$$)
Small company/solo founder with moderate prices ($$)
Build your own (Prometheus, Grafana, Loki) ($)
Which do you choose and why?
Key questions:
How much does brand recognition matter (to you vs management)?
Hard requirements on vendor stability/longevity? Is support team size important?
Build vs Buy: what tips the scale - control/customization or time-to-market/maintenance?
If self-hosted: how many FTEs maintaining your stack?
On integrations:
Unified dashboard - deal breaker or nice-to-have?
Alert integrations (PagerDuty, Slack)?
API access?
Appreciate any feedback, especially recent vendor selection or migration experiences
https://redd.it/1q9bu87
@r_devops
[Open Source] Built a self-hosted PAM system - Looking for feedback
Hey r/devops!
I've been building Orion-Belt, an open-source Privileged Access Management system, and would love your feedback from folks who've dealt with SSH access at scale.
The problem we're solving:
After getting quoted $50k-$200k/year for commercial PAM solutions as a startup, we decided to build a self-hosted alternative that doesn't require enterprise budgets.
What it does:
- Zero inbound firewall rules: Agents use reverse SSH tunneling to dial out to the gateway
- Fine-grained access control: Specify which users can access which machines as which remote users (e.g., "Jane can SSH to prod-db as postgres")
- Session recording & audit trails: Full compliance logging for SOC2/ISO27001
- Temporary access workflows: Time-limited access with admin approval
- Standard SSH compatibility
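The access rules described above ("Jane can SSH to prod-db as postgres", plus time-limited grants) can be pictured with a small sketch. This is not Orion-Belt's code and does not use the OpenFGA API; the `Grant` type, the `is_allowed` function, and the sample users are hypothetical and only model the (user, host, remote user, expiry) relationship in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class Grant:
    """Hypothetical access grant: who may log in where, and as which remote user."""
    user: str
    host: str
    remote_user: str
    expires_at: Optional[datetime] = None  # None means a standing grant

def is_allowed(grants, user, host, remote_user, now):
    """Return True if a matching grant exists and has not expired at `now`."""
    for grant in grants:
        if (grant.user, grant.host, grant.remote_user) != (user, host, remote_user):
            continue
        if grant.expires_at is None or now < grant.expires_at:
            return True
    return False

if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    grants = [
        Grant("jane", "prod-db", "postgres"),
        # Temporary access: approved for one hour only.
        Grant("bob", "prod-db", "root", expires_at=now + timedelta(hours=1)),
    ]
    print(is_allowed(grants, "jane", "prod-db", "postgres", now))
    print(is_allowed(grants, "bob", "prod-db", "root", now + timedelta(hours=2)))
```

A relationship-based system would typically derive such grants from group and role relations instead of listing them one by one, which this flat list does not attempt.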
Tech stack:
- Backend: Go (Gin framework, golang.org/x/crypto/ssh)
- Permissions: ReBAC with OpenFGA
- Storage: PostgreSQL
- Deployment: Docker + systemd, multi-distro support
Current state: Core functionality working, deployed in production in our homelab/staging environments.
Why I'm posting: Before building more features, I want to validate we're solving real problems.
Questions for the community:
1. What's your current SSH access management strategy?
(SSH keys everywhere? Jump hosts? Commercial PAM? Something else?)
2. If you've looked at commercial PAM solutions, what stopped you from adopting them?
(Cost? Complexity? Vendor lock-in?)
3. What would make a tool like this worth adopting in your environment?
(Specific features? Integration points? Deployment model?)
GitHub: https://github.com/zrougamed/orion-belt
Looking for:
- Beta testers: Deploy it, break it, tell me what's missing
- Contributors: Go backend developers and Frontend/UI folks (currently no UI - WIP)
- Feedback: Honest criticism about architecture, features, docs
Happy to answer technical questions about the reverse tunneling implementation, session recording, or anything else!
https://redd.it/1q9k1kk
@r_devops
SMS as an alerting channel: who do you actually trust?
If SMS is your last-resort alert channel, which providers have actually been reliable for you in production?
https://redd.it/1q9af9l
@r_devops
Anyone else finding AI code review tools useless once you hit 10+ microservices?
We've been trying to integrate AI-assisted code review into our pipeline for the last 6 months. Started with a lot of optimism.
The problem: we run ~30 microservices across 4 repos. Business logic spans multiple services: a single order flow touches auth, inventory, payments, and notifications.
Here's what we're seeing:
- The tool reviews each service in isolation. Zero awareness that a change in Service A could break the contract with Service B.
- It chunks code for analysis and loses the relationships that actually matter. An API call becomes a meaningless string without context from the target service.
- False positives are multiplying. The tool flags verbose utility functions while missing actual security issues that span services.
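To show the kind of cross-service break described in the first bullet, here is a minimal sketch of a contract check between a caller and the service it calls. The contract, the field names, and the payloads are hypothetical, not taken from the post; the point is only that the check needs both sides of the call in view, which a per-service review does not have.

```python
# Hypothetical: the fields a payments service requires on an incoming charge request.
PAYMENTS_CHARGE_CONTRACT = {"order_id": str, "amount_cents": int, "currency": str}

def contract_violations(payload, contract):
    """List the ways a caller's payload fails the callee's contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems

if __name__ == "__main__":
    # What the orders service sent before a refactor.
    before = {"order_id": "o-1", "amount_cents": 1999, "currency": "EUR"}
    # After the refactor the field was renamed; locally the change looks harmless.
    after = {"order_id": "o-1", "amount": 19.99, "currency": "EUR"}
    print(contract_violations(before, PAYMENTS_CHARGE_CONTRACT))
    print(contract_violations(after, PAYMENTS_CHARGE_CONTRACT))
```

Running a check like this in the caller's pipeline against the callee's published contract is one way to catch the break without relying on the review tool to infer it.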
We're not using some janky open-source wrapper—this is a legit, well-funded tool with RAG-based retrieval.
Starting to think the fundamental approach (chunking + retrieval) just doesn't work for distributed systems. You can't understand a microservices codebase by looking at fragments.
Anyone else hitting this wall? Curious if teams with complex architectures have found tools that actually trace logic across service boundaries.
https://redd.it/1q9tup1
@r_devops
[Showcase] High-density architecture: Running 100+ containers on a single VPS with Traefik and FrankenPHP
Hi everyone,
I wanted to share a breakdown of the infrastructure I just built for a new SaaS project (a dependency health monitor).
As a DevOps consultant, I usually deal with K8s clusters, but for this project, I wanted to see how much performance I could squeeze out of a single multi-site VPS using a Docker Compose stack.
The Architecture:
Currently running ~30 projects and close to 100 containers on one node with high availability.
Ingress/Routing: Traefik (Auto-discovery of new Docker containers is a lifesaver).
Runtime: FrankenPHP + Laravel Octane. This runs the app as a long-running Go process rather than traditional PHP-FPM, keeping the application bootstrapped in memory.
Caching: 2-hour aggressive Edge caching via Cloudflare to minimize the number of requests that reach the backend.
Storage: Redis for queues/cache.
The Workflow:
User Request -> Cloudflare (Edge) -> Traefik (VPS Ingress) -> FrankenPHP (App Container)
I wrote a blog post detailing the specific setup and how this stack handles the traffic:
https://danielpetrica.com/how-i-built-a-high-performance-directory-with-laravel-octane-and-filament/
Curious to hear your thoughts on pushing vertical scaling/Docker Compose this far versus moving to a small K8s cluster/Nomad setup. At what point do you usually force the switch?
https://redd.it/1q9y4tc
@r_devops
How liable are DevOps for redundancies in acquisitions (UK)?
Hi folks!
As the title says, my current company was acquired in the last week. While this is financially an acquisition, in practice it will be a merger, i.e. our company merging into theirs.
The next step in the integration phase, AFAIK, is a company restructure, and from what I have read, employees of the acquired company are more at risk than the acquirer's employees. That would put me more at risk.
The DevOps team I am in is 7 DevOps engineers, 1 Tech lead DevOps and 1 Team lead.
I believe on their side it is 4/5 DevOps engineers.
We host our product heavily on AWS, and from what I can see they use Azure.
My main questions are:
1. Has anyone been in a similar situation?
2. If so, what happened? What side of the table were you on?
3. How "At Risk" are DevOps engineers in a merger compared to other areas of business?
4. Any other things / pointers you can give me? It is my first time in this situation.
I know that it differs from company to company, but if I could get a general consensus from others' past experience, I can come to my own conclusion on whether or not I am highly at risk.
Any comments are appreciated.
Thanks!
https://redd.it/1q9yrii
@r_devops
Headless browser sessions keep timing out after ~30 minutes. Has anyone managed to fix this?
I’ve been automating dashboard logins and data extraction using Puppeteer and Selenium for a while now. Single runs are solid, but once I scale to multiple tabs or let jobs run for hours, things start falling apart. Sessions randomly expire, cookies disappear, tabs lose state, and accounts get logged out mid-flow. I’ve tried rotating proxies, custom user agents, persisted cookies, and even moved to headless=new. It helped a bit, but it is still not reliable enough for production workloads. At this point I’m trying to understand what’s actually causing this instability. Is it session isolation, anti-automation defenses, browser lifecycle issues, or something else entirely? Looking for approaches or tools that support long-lived, multi-account browser workflows without constant monitoring. Any real-world experience appreciated.
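One hypothesis worth ruling out for a timeout that lands at roughly the same point every time is a short-lived session cookie or a server-side idle timeout. The sketch below lists cookies that are about to expire. It assumes cookie dictionaries shaped like the output of Selenium's Python `driver.get_cookies()` (a `name` key and an optional `expiry` in epoch seconds); the function name and sample cookies are hypothetical, and Puppeteer reports expiry under a different key.

```python
import time

def expiring_cookies(cookies, within_seconds, now=None):
    """Return (name, seconds_left) for cookies expiring within the window, soonest first."""
    now = time.time() if now is None else now
    soon = []
    for cookie in cookies:
        expiry = cookie.get("expiry")  # absent for browser-session cookies
        if expiry is not None and expiry - now <= within_seconds:
            soon.append((cookie["name"], max(0, int(expiry - now))))
    return sorted(soon, key=lambda item: item[1])

if __name__ == "__main__":
    now = 1000  # fixed clock so the example is reproducible
    cookies = [
        {"name": "sid", "expiry": now + 1800},    # 30-minute session cookie
        {"name": "csrf", "expiry": now + 60},
        {"name": "prefs"},                        # no expiry: session-scoped
        {"name": "long", "expiry": now + 86400},
    ]
    print(expiring_cookies(cookies, within_seconds=3600, now=now))
```

If no cookie is close to expiring when the logout happens, the cause is more likely server-side (idle timeout, IP change from proxy rotation, or bot detection) than cookie handling.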
https://redd.it/1qa1uvy
@r_devops
Self-host GitLab (GitOps) in k8s, or standalone?
Hi! Linux sysadmin and hobby programmer here. I'm learning IaC by converting my home infra to OpenTofu against Proxmox. I use workspaces to launch stages such as dev (and staging etc. in the future). I figured it would be cool to orient everything around that, but as I'm going to learn/use Talos k8s next, I can't figure out how to deploy apps with the same workspace approach in mind, to avoid repeating myself.
I've never automated via GitLab before, but as I understand it, GitOps is used for automation and is baked into GitLab. So what I can't figure out is whether I should set up GitLab in k8s or standalone. The first means HA, but if k8s breaks then GitOps goes down, I assume. The latter skips the k8s dependency, but has no HA.
Idk, maybe I'm overthinking this at such an early stage, but I would appreciate some insight into how others set up their self-hosted, IaC-based IT.
Cheers!
https://redd.it/1qa67nj
@r_devops