I’m building runtime “IAM for AI agents”: policies, mandates, hard enforcement. Does this problem resonate?
I’m working on an MVP that treats AI agents as **economic actors**, not just scripts or prompts, and I want honest validation from people actually running agents in production.
The problem I keep seeing
Agents today can:
* spend money (LLM calls, APIs)
* call tools (email, DB, infra, MCP servers)
* act repeatedly and autonomously
But we mostly “control” them with:
* prompts
* conventions
* code
There’s no real concept of:
* agent identity
* hard authority
* budgets that can’t be bypassed
* deterministic enforcement
If an agent goes rogue, you usually find out **after** money is spent or damage is done.
What I’m building
A small infra layer that sits **outside** the LLM and enforces authority mechanically.
Core ideas:
* **Agent** = stable identity (not a process)
* **Policy** = static, versioned authority template (what **could** be allowed)
* **Rule** = context-based selection (user tier, env, tenant, etc.)
* **Mandate** = short-lived authority issued per invocation
* Enforcement = allow/block tool/MCP + LLM calls at runtime
No prompt tricks. No AI judgment. Just deterministic allow / block.
Examples:
* Free users → agent can only read data, $1 budget
* Paid users → same agent code, higher budget + more tools
* Kill switch → instantly block all future actions
* All actions audited with reason codes
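The mandate check described above is, at its core, a small deterministic function. A minimal Python sketch of the idea (names and types are illustrative, not the actual repo's API):

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """Static, versioned authority template: what *could* be allowed."""
    name: str
    allowed_tools: frozenset
    max_budget_usd: float

@dataclass
class Mandate:
    """Short-lived authority issued per invocation from a Policy."""
    agent_id: str
    policy: Policy
    expires_at: float
    spent_usd: float = 0.0
    revoked: bool = False  # the kill switch

    def check(self, tool: str, cost_usd: float) -> tuple[bool, str]:
        """Deterministic allow/block with a reason code. No prompts, no AI judgment."""
        if self.revoked:
            return False, "REVOKED"
        if time.time() > self.expires_at:
            return False, "EXPIRED"
        if tool not in self.policy.allowed_tools:
            return False, "TOOL_NOT_IN_POLICY"
        if self.spent_usd + cost_usd > self.policy.max_budget_usd:
            return False, "BUDGET_EXCEEDED"
        self.spent_usd += cost_usd  # spend is recorded only on allow
        return True, "OK"
```

The free-tier example is then just `Policy("free", frozenset({"read_data"}), 1.00)`, and the kill switch is flipping `revoked`.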
What this is NOT
* Not an agent framework
* Not AI safety / content moderation
* Not prompt guardrails
* Not model alignment
It’s closer to IAM / firewall thinking, but for agents.
Why I’m unsure
This feels **obvious** once you see it, but also very infra-heavy.
I don’t know if enough teams feel the pain **yet**, or if this is too early.
I’d love feedback on:
1. If you run agents in prod: what failures scare you most?
2. Do you rely on prompts for control today? Has that burned you?
3. Would you adopt a hard enforcement layer like this?
4. What would make this a “no-brainer” vs “too much overhead”?
I’m not selling anything, just trying to validate whether this is a real problem worth going deeper on.
github repo for mvp (local only): [https://github.com/kashaf12/mandate](https://github.com/kashaf12/mandate)
https://redd.it/1pvat3j
@r_devops
How do you automate license key delivery after purchase?
I’m selling a desktop app with one-time license keys (single-use). I already generated a large pool of unique keys and plan to sell them in tiers (1 key, 5 keys, 25 keys).
What’s the best way to automatically:
* assign unused keys when someone purchases, and
* email the key(s) to the buyer right after checkout?
I’m open to using a storefront platform + external automation, but I’m trying to avoid manual fulfillment and exposing the full key list to customers.
If you’ve done this before or have a recommended stack/workflow, I’d love to hear what works well and what to avoid.
Also, is this by chance possible on FourthWall?
https://redd.it/1pvhb5a
@r_devops
Seeking a Mentor in DevOps for Guidance on Projects, Production Environments, and Managing Complexity
Hello, fellow DevOps enthusiasts!
I am actively looking for a mentor who can guide me through the intricacies of DevOps, particularly when it comes to managing real-world production environments and tackling the complexities that come with them. I’ve been exploring DevOps tools and concepts, but I feel that having someone with hands-on experience would greatly accelerate my learning.
Specifically, I'm looking for guidance on:
* Managing production environments at scale
* Optimizing CI/CD pipelines for larger projects
* Understanding and mitigating the complexities of infrastructure
* Best practices for automation, monitoring, and security in production
* Working on and improving existing projects with a focus on reliability and efficiency
If you have experience in these areas and would be willing to help me navigate the challenges, I would greatly appreciate your mentorship. I'm eager to learn, share ideas, and work on real-world projects that will enhance my skills.
Feel free to message me if you’re open to a mentorship opportunity, and I look forward to connecting with some of you!
Thanks in advance!
https://redd.it/1pvdqry
@r_devops
Is there a book that covers every production-grade cloud architecture in use, or at least the most common ones?
Is there a recipe book that covers every production-grade cloud architecture or the most common ones? I stopped taking tutorial courses, because 95% of them are useless and cover things I already know, but I am looking for a book that features complete end-to-end IaC solutions you would find in big tech companies like Facebook, Google and Microsoft.
https://redd.it/1pvk0ni
@r_devops
Would you consider putting an audit proxy in front of Postgres/MySQL?
Lately I've been dealing with compliance requirements for an on-prem database (Postgres). One of those is providing audit logs, but enabling the slow query log for every query (i.e. log_min_duration_statement=0) is not recommended for production databases, and pgAudit seems to consume too much I/O.
I'm writing a simple proxy that passes authentication and connection setup through untouched, then parses every message and logs all queries. Since the proxy is stateless, it's easy to scale and doesn't eat the precious resources of the primary database. Parsing and logging happen asynchronously from the proxying.
So far it's working well. I still need to hammer it with more load tests and do some edge-case testing (e.g. behavior when the database is extremely slow). I wrote the same thing for MySQL, with the idea of open-sourcing it.
I'm not sure if other people will be interested in utilizing such proxy, so here I am asking about your opinion.
Edit: Grammar
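A proxy like this mostly comes down to framing and parsing Postgres wire-protocol messages off the stream. A rough sketch of that parsing layer, covering simple-query and extended-protocol Parse messages (not the author's actual implementation):

```python
import struct

def parse_frontend_message(buf: bytes):
    """Parse one PostgreSQL frontend message (post-startup framing:
    1 type byte + 4-byte big-endian length that includes itself).
    Returns (msg_type, payload, rest) or None if the buffer is incomplete."""
    if len(buf) < 5:
        return None
    msg_type = chr(buf[0])
    (length,) = struct.unpack("!I", buf[1:5])
    if len(buf) < 1 + length:
        return None
    payload = buf[5:1 + length]
    return msg_type, payload, buf[1 + length:]

def extract_query(msg_type: str, payload: bytes):
    """Return SQL text worth audit-logging, or None for other messages."""
    if msg_type == "Q":  # simple query: one NUL-terminated SQL string
        return payload.rstrip(b"\x00").decode("utf-8", "replace")
    if msg_type == "P":  # extended-protocol Parse: name\0 query\0 param-types
        _name, _, rest = payload.partition(b"\x00")
        query, _, _ = rest.partition(b"\x00")
        return query.decode("utf-8", "replace")
    return None
```

The extracted text can then be pushed onto a queue and written out by a separate task, keeping the logging off the proxying hot path.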
https://redd.it/1pvm6qv
@r_devops
Versioning cache keys to avoid rolling deployment issues
During rolling deployments, we had multiple versions of the same service running concurrently, all reading and writing to the same cache. This caused subtle and hard-to-debug production issues when cache entries were shared across versions.
One pattern that worked well for us was versioning cache keys: new deployments write to new keys, while old instances continue using the previous ones. This avoided cache poisoning without flushing Redis or relying on aggressive TTLs.
I wrote up the reasoning, tradeoffs, and an example here:
https://medium.com/dev-genius/version-your-cache-keys-to-survive-rolling-deployments-a62545326220
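The pattern is tiny in code. A hedged sketch, with an in-memory dict standing in for Redis and illustrative version strings:

```python
class VersionedCache:
    """Wrap any key-value store so each deploy reads and writes its own namespace."""
    def __init__(self, store, schema_version: str):
        self.store = store          # e.g. a Redis client; a dict works for the sketch
        self.version = schema_version

    def _key(self, key: str) -> str:
        return f"{self.version}:{key}"

    def get(self, key: str):
        return self.store.get(self._key(key))

    def set(self, key: str, value):
        self.store[self._key(key)] = value

# Rolling deploy: v7 (old) and v8 (new) instances share one store but
# never see each other's entries -- no flush, no aggressive TTLs.
shared = {}
old = VersionedCache(shared, "v7")
new = VersionedCache(shared, "v8")
old.set("user:42", {"name": "Ada"})
new.set("user:42", {"name": "Ada", "tier": "pro"})  # new schema adds a field
```

Once the deploy completes, the old version's keys simply age out; with Redis you'd set a normal TTL on write.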
How are others handling cache consistency during rolling deploys? TTLs? blue/green? dual writes?
https://redd.it/1pvow88
@r_devops
🛡️ Built MCP Guard - a security proxy for Cursor/Claude agents (I'm the dev)
Hey everyone! 👋
I've been working on something for the past few weeks and wanted to share it here.
The problem I faced:
I use Cursor with MCP to interact with my databases. One day, I accidentally let my agent run with full read/write/delete access. I watched in horror as it started building queries... and I realized I had zero control over what it could do.
What if it runs `DROP TABLE users` instead of `SELECT *`?
What I built:
MCP Guard - a lightweight security proxy that sits between your AI agent and your MCP servers.
Features:
* Block dangerous commands (DROP, DELETE, TRUNCATE, etc.)
* Generate API keys with rate limits and RBAC
* Full audit logs of every agent interaction
* Sub-3ms latency
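The "block dangerous commands" piece presumably boils down to a deterministic denylist check before forwarding, which is why the latency can stay so low. A rough sketch of that kind of check (not MCP Guard's actual code):

```python
import re

# Hypothetical denylist of statement keywords the proxy refuses to forward.
DENYLIST = re.compile(r"^\s*(DROP|DELETE|TRUNCATE|ALTER|GRANT|REVOKE)\b",
                      re.IGNORECASE)

# Leading SQL comments, so `/* hi */ DROP ...` doesn't sneak through.
COMMENT_PREFIX = re.compile(r"^(\s*(--[^\n]*\n|/\*.*?\*/))*", re.DOTALL)

def check_sql(query: str) -> tuple[bool, str]:
    """Deterministic allow/block with a reason code; no model in the loop."""
    stripped = COMMENT_PREFIX.sub("", query, count=1)
    m = DENYLIST.match(stripped)
    if m:
        return False, f"BLOCKED:{m.group(1).upper()}"
    return True, "ALLOWED"
```

Worth noting that pattern denylists are bypassable (CTEs, nested statements, dialect quirks), so least-privilege database credentials for the agent still matter regardless of the proxy.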
Why I'm posting here:
I'm launching the beta on Dec 28 and looking for feedback from actual users. Not trying to sell anything - the free tier gives you 1,000 requests/month with no credit card.
If you're using MCP with Cursor/Claude and have thoughts on security, I'd love to hear from you.
Link: https://mcp-shield.vercel.app
Happy to answer any questions! I'm the sole developer behind this, so AMA about how it works. 🔥
https://redd.it/1pvw25o
@r_devops
what actually helps once ai code leaves the chat window?
ai makes it easy to spin things up, but once that code hits a real repo, that’s where i slow down. most of my time goes into figuring out what depends on what and what i’m about to accidentally break.
i still use chatgpt for quick thinking and cosine just to trace logic across files. nothing fancy.
curious what others lean on once ai code is real.
https://redd.it/1pvwxey
@r_devops
Securing the frontend application and backend APIs
Hi all,
I am looking for a reliable solution to secure the frontend URL and backend APIs so they are only accessible to people who have our VPN. Is that possible? I am currently using AWS; how can I do this reliably? Please help!
https://redd.it/1pvwujj
@r_devops
Throwback 2025 - Securing Your OTel Collector
Hi there, Juraci here. I've been working with OpenTelemetry since its early days and this year I started Telemetry Drops - a bi-weekly ~30 min live stream diving into OTel and observability topics.
We're 7 episodes in since we started four months ago. Some highlights:
* AI observability and observability with AI (two different things!)
* The isolation forest processor
* How to write a good KubeCon talk proposal
* A special about the Collector Builder
One of the most-watched so far is this walkthrough of how to secure your Collector - based on a blog post I've been updating for years as the Collector evolves.
https://youtube.com/live/4-T4eNQ6V-A
New episodes drop ~every other Friday on YouTube. If you speak Portuguese, check out Dose de Telemetria, which I've been running for some years already!
Would love feedback on what topics would be most useful - what OTel questions keep you up at night?
https://redd.it/1pw15e0
@r_devops
Your localhost, online in seconds. Public URL for everyone
Tired of complex tunneling tools?
The Portex CLI exposes your localhost to the internet in seconds. It features secure tunnels with end-to-end encryption, one-click PIN protection for client demos, and a built-in traffic inspector to debug webhooks instantly.
No forced logins, just `portex start`. Works on macOS, Windows & Linux.
https://portex.space/
https://redd.it/1pvyzeb
@r_devops
Cache npm dependencies
I am trying to cache my npm dependencies so that every time my GitHub Actions workflow runs, it pulls the dependencies from the cache unless package-lock.json changes. I tried the code below, but it does not work (the npm install is still happening on every run):
```yaml
build:
  runs-on: ubuntu-latest
  needs: security
  steps:
    - uses: actions/checkout@v3
    - name: Set up Node.js version
      uses: actions/setup-node@v4
      with:
        node-version: '14.17.6'
        cache: 'npm'
    - name: Cache node modules
      uses: actions/cache@v3
      with:
        path: ~/.npm
        key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
        restore-keys: |
          ${{ runner.os }}-node-
    - name: npm install and build
      run: |
        export NODE_OPTIONS="--max-old-space-size=4096"
        npm ci
        npm run build
      env:
        CI: false
        REACT_APP_ENV: dev
    - name: Zip artifact for deployment
      run: cd build && zip -r ../release.zip *
    - name: Upload artifact for deployment job
      uses: actions/upload-artifact@v4
      with:
        name: node-app
        path: release.zip
```
https://redd.it/1pw5ars
@r_devops
What 2025 Taught Me About DevOps
As 2025 comes to an end, I’ve been reflecting on what actually changed for me as a DevOps Engineer.
* Not the tools I use.
* Not the certifications I hold.
But how I think about systems, failures, and tradeoffs.
This year brought up patterns I’ve now seen repeatedly across interviews, production incidents, and mentoring other engineers.
Here are the 10 lessons that stood out, starting with the one that matters most:
* Most outages come from small configuration changes, not big mistakes (e.g. the AWS and Cloudflare outages).
* Tools are just tools. Understanding systems is what separates engineers.
* Platform engineering is becoming the default way teams scale safely.
* GitOps is no longer optional in serious environments.
* Certifications stopped being reliable signals on their own.
* AI increases leverage, but only if fundamentals are solid.
* Boring, proven technology continues to outperform trendy alternatives.
* Cloud cost is now an engineering responsibility, not just a finance concern.
* Fundamentals (Linux, networking, Git, communication) outlast trends.
* Protect your peace, or this field will eat you alive.
None of these lessons came from theory.
They came from actual systems, real interviews, real failures, and real conversations with engineers at different stages of their careers.
If you’re heading into 2026 trying to decide what to focus on next, my advice is simple: Understand fundamentals, understand systems and services before tools, learn to communicate well, and remember to protect your peace.
Happy holidays, and may your deployments be ever in your favour. 🎄🧑🏽🎄
https://redd.it/1pw4ury
@r_devops
What checks do you run before deploying that tests and CI won’t catch?
Curious how others handle this.
Even with solid test coverage and CI in place, there always seem to be a few classes of issues that only show up after a deploy: misconfigured env vars, expired certs, health endpoints returning something unexpected, missing redirects, or small infra or config mistakes.
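A couple of those (env vars, cert expiry) are easy to fold into a pre-deploy script. A minimal Python sketch; the required-variable list and hostnames are hypothetical:

```python
import datetime
import os
import socket
import ssl

def missing_env(required, environ=os.environ):
    """Names from `required` that are unset or empty in the target environment."""
    return [name for name in required if not environ.get(name)]

def cert_days_left(hostname: str, port: int = 443) -> float:
    """Days until the TLS certificate presented by hostname:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - datetime.datetime.now().timestamp()) / 86400

# Hypothetical pre-deploy gate:
#   assert not missing_env(["DATABASE_URL", "REDIS_URL"])
#   assert cert_days_left("api.example.com") > 14
```

A health-endpoint or redirect check follows the same pattern: a small assertion run before (or immediately after) the rollout, not a unit test.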
I’m interested in what manual or pre-deploy checks people still rely on today, whether that’s scripts, checklists, conventions, or just experience.
What are the things you’ve learned to double check before shipping that tests and CI don’t reliably cover?
https://redd.it/1pw5phr
@r_devops
Scaling beyond basic VPS+nginx: Next steps for a growing Go backend?
I come from a background of working in companies with established infrastructure where everything usually just works. Recently, I've been building my own SaaS and micro-SaaS projects using Go (backend) and Angular. It's been a great learning experience, but I’ve noticed that my backends occasionally fail—nothing catastrophic, just small hiccups, occasional 500 errors, or brief downtime.
My current setup is as basic as it gets: a single VPS running nginx as a reverse proxy, with a systemd service running my Go executable. It works fine for now, but I'm expecting user growth and want to be prepared for hundreds of thousands of users.
My question is: once you’ve outgrown this simple setup, what’s the logical next step to scale without overcomplicating things? I’m not looking to jump straight into Kubernetes or a full-blown microservices architecture just yet, but I do need something more resilient and scalable than a single point of failure.
What would you recommend? I’d love to hear about your experiences and any straightforward, incremental improvements you’ve made to scale your Go applications.
Thanks in advance!
https://redd.it/1pw5uwx
@r_devops
Migrating legacy GCE-based API stack to GKE
Hi everyone!
Solo DevOps looking for a solid starting point
I’m starting a new project where I’m essentially the only DevOps / infra guy, and I need to build a clear plan for a fairly complex setup.
Current architecture (high level)
* Java-based API services
* Running on multiple Compute Engine Instance Groups
* A dedicated HAProxy VM in front, routing traffic based on URL and request payload
* One very large MySQL database running on a GCE VM
* Several smaller Cloud SQL MySQL instances replicating selected tables from the main DB (apparently to reduce load on the primary)
* One service requires outbound internet access, so there’s a custom NAT solution backed by two GCE VMs (Cloud NAT was avoided due to cost concerns)
Target direction / my ideas so far
* Establish a solid IaC foundation using Terraform + GitHub Actions
* Design VPCs and subnetting from scratch (first time doing this for a high-load production environment)
* Build proper CI/CD for the APIs (Docker + Helm)
* Gradually migrate services to GKE, starting with the least critical ones
My concerns/open questions:
* What’s a cost-effective and low-maintenance NAT strategy in GCP for this kind of setup?
* How would you approach eliminating HAProxy in a GKE-based architecture (Ingress, Gateway API, L7 LB, etc.)?
* Any red flags in the current DB setup that should be addressed early?
* How would you structure the migration to minimize risk, given there’s no existing IaC?
If you’ve done a similar GCE → GKE migration or built something like this from scratch:
* What would you tackle first?
* Any early decisions you wish you had made differently?
* Any recommended starting point, reference architecture, or pitfalls to watch out for?
Appreciate any insights 🙏
https://redd.it/1pw7evz
@r_devops
Building my Open-Source 11labs Ops Tool: Secure Backups + Team Access
I am building an open-source, free tool to help teams manage and scale ElevenLabs voice agents safely in production.
I currently run 71 agents in production for multiple clients, and once you hit that level, some things become painful very fast: collaboration, QA, access control, backups, and compliance.
This project is my attempt to solve those problems in a clean, in-tenant way.
Advanced workflow optimization and QA: Run staging versions of agents and workflows, replay real conversations, do controlled A/B testing with real conversation QA, compare staging vs. production, and deploy changes with a proper review and approval process.
Granular conversation access for teams: Filter and scope access by location, client, case type, etc. Session-backed permissions ensure people only see what they are authorized to see.
Incremental backups and granular restore: Hourly, daily, or custom schedules. Restore only what you need, for example workflow or KB for a specific agent.
Agent and configuration migration: Migrate agents between accounts or batch-update settings and KBs across many agents.
Full in-tenant data sovereignty: Configs, workflows, backups, and conversation history stay in your cloud or infrastructure. No third-party egress.
Flexible deployment options: Terraform or Helm/Kubernetes; self-hosted Docker (including bare metal with NAS backups); optional 100 percent Cloudflare Workers and Workers AI deployment.
Demo (rough but shows the core inspector, workflow replay, permissions, backups, etc.):
Video: https://www.youtube.com/watch?v=Pzu2CVWnpl8
I'll push the code to GitHub early January 2026. Project name will change soon (current temp name conflicts with an existing "Eleven Guard" SSL monitoring company).
I am building this primarily for my own use, but I suspect others running ElevenLabs at scale may run into the same issues. If you have feature requests, concerns, or feel there are tools missing to better manage ElevenLabs within your company, I would genuinely love to hear about them. 😄
https://redd.it/1pwb44y
@r_devops
Guidance for my DevOps journey
Hello everyone! I'm interested in getting into DevOps but I don't know where to start. I'm currently at a private university in Berlin, Germany, pursuing a bachelor's in computer science; my studies started 3 months ago. I just want to get a head start on DevOps early. My questions are:
1- Is there any master's field that's preferred for getting into DevOps?
2- I keep seeing people say it's hard to land junior DevOps jobs, so most start out in other roles like system administrator or cloud-related jobs. Which of those would be the best path into DevOps?
3- Which languages are best for the DevOps field?
4- Do people work in DevOps-related jobs before being promoted to DevOps engineer, or do they work those jobs and then apply to different companies, counting them as relevant experience?
5- Which skills would I need for DevOps?
6- Do I need certificates for every skill, or is job experience in a related field enough?
Any other advice given would be helpful too
https://redd.it/1pw998h
@r_devops
The State of DevOps Jobs in H2 2025
Hi guys, since I did a 2025 H1 report, a follow-up was in order for the H2 period.
I'm not an expert in data analysis and I'm just getting started with it, but I hope this will benefit you a bit and give you a sense of how the second half of this year went for the DevOps market.
https://devopsprojectshq.com/role/devops-market-h2-2025/
https://redd.it/1pwf717
@r_devops
I'm creating a new app that will help new DevOps engineers better understand DevOps concepts and how they work
So, I'm a passionate developer based in Lithuania, and I'm starting my own project to help others better understand and use DevOps, CI/CD, and Docker.
Here's the concept: the name is PipeViz, and it will visualize your ideas, schemas, and CI/CD pipelines as they actually are. I'm also building GitHub, GitLab, and Google auth for further implementation.
What would you add to the project? What ideas could I realize? I know, the design may suck, but I'm still at the beginning of it!
Right now I'm working on full e2e auth with GitHub/GitLab/Google/Apple for further work on pipelines. I hope this project has a future and that you'll love it!
I'd appreciate all ideas and fixes from the DevOps community! I hope this will be my step into real-world programming!
https://redd.it/1pwesig
@r_devops
How to leverage HashiCorp Packer to automatically provision VM templates for Proxmox
Hey, my fellow engineers
I recently published a post (on Medium) about using HashiCorp's Packer to automatically provision VM templates for Proxmox. I would greatly appreciate your feedback.
Here is the link
Thank you, and happy holidays.
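(For readers who haven't tried Packer yet, the CLI loop a post like this builds on is short; the template filename below is a hypothetical example:)

```shell
# Download the plugins declared in the template's required_plugins block.
packer init .

# Check the template for syntax and semantic errors without building.
packer validate .

# Run the build; for Proxmox this typically drives the proxmox-iso builder
# defined in the template (ubuntu.pkr.hcl is a placeholder name).
packer build ubuntu.pkr.hcl
```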
https://redd.it/1pwjehy
@r_devops