Reddit DevOps – Telegram
Python for Automating stuff on Azure and Kafka

Hi,

I need some suggestions from the community here. I've been working with bash for scripting CI/CD pipeline jobs, with minimal exposure to Python in the automation pipelines.

I'm looking to start focusing on developing my Python skills and to get some hands-on experience with the Azure Python SDK and Kafka libraries, so I can start using Python at my workplace.

I need some suggestions on online learning platforms and books to get started. I'm looking to invest about 10-12 hours each week in learning.

https://redd.it/1oyctub
@r_devops
Manage Vault in GitOps way

Hi all,

In my home cluster I'm introducing Vault and the Vault operator to handle secrets within the cluster.
How do you guys manage Vault in an automated way? For example, I'd like to create KV engines and policies declaratively, perhaps managed with Argo CD.
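Not an answer, but to make the ask concrete: a declarative spec committed to Git can be reduced to plain Vault API calls. A minimal stdlib sketch; the mount names and policy are hypothetical, and in practice you'd reach for the Terraform Vault provider or a Vault config operator synced by Argo CD rather than hand-rolled code like this:

```python
# Sketch: render a declarative desired state (the kind you'd commit to
# Git and have Argo CD reconcile) into Vault HTTP API calls.
# Mount names and the policy below are made up for illustration.

DESIRED_STATE = {
    "kv_mounts": ["apps", "infra"],
    "policies": {
        "app-reader": 'path "apps/data/*" { capabilities = ["read", "list"] }',
    },
}

def plan(state):
    """Turn the desired state into (method, path, payload) API calls."""
    calls = []
    for mount in state["kv_mounts"]:
        # POST /v1/sys/mounts/:path enables a KV v2 secrets engine.
        calls.append(("POST", f"/v1/sys/mounts/{mount}",
                      {"type": "kv", "options": {"version": "2"}}))
    for name, hcl in state["policies"].items():
        # PUT /v1/sys/policies/acl/:name creates or updates an ACL policy.
        calls.append(("PUT", f"/v1/sys/policies/acl/{name}", {"policy": hcl}))
    return calls

for method, path, _payload in plan(DESIRED_STATE):
    print(method, path)
```

Applying the plan idempotently on every sync is what turns this into GitOps: the spec in Git is the source of truth and the reconciler only issues the calls.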

Any suggestions?

https://redd.it/1oygbil
@r_devops
Productizing LangGraph Agents

Hey,
I'm trying to understand which option is better based on your experience.

I want to deploy enterprise-ready agentic applications; my current agent framework is LangGraph.

To be production-ready, I need horizontal scaling and durable state so that if a failure occurs, the system can resume from the last successful step.

I’ve been reading a lot about Temporal and the LangSmith Agent Server; both seem to offer similar capabilities and promise durable execution for agents, tools, and MCPs.
I'm not sure which one is more recommended.

I did notice one major difference: in LangGraph I need to explicitly define retry policies in my code, while Temporal handles retries more transparently.
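For illustration, this is the shape of the explicit, in-code retry policy being contrasted with Temporal's engine-level retries. A stdlib sketch only, not LangGraph's actual RetryPolicy API:

```python
import time

def with_retry(max_attempts=3, backoff=0.1):
    """Explicit retry policy, the kind you declare per-step in app code.

    Retries on any exception with exponential backoff; a durable-execution
    engine like Temporal applies an equivalent policy outside your code.
    """
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # policy exhausted; surface the failure
                    time.sleep(backoff * 2 ** (attempt - 1))
        return wrapper
    return decorator

calls = []

@with_retry(max_attempts=3, backoff=0)
def flaky_tool():
    # Simulated tool call that fails twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

The practical difference is where this logic lives: in your codebase (and under your maintenance) versus in the workflow engine's configuration.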

I’d love to get your feedback on this.

https://redd.it/1oyh93l
@r_devops
Trouble sharing a Windows Server 2022 AMI between AWS accounts (no RDP password, no SSM connection)



Hello everyone,

I've been trying for the last two days to share a custom Windows Server 2022 AMI from Account A to Account B, but without success.
The source AMI is based on the official Windows_Server-2022-English-Full-Base image, and I installed a few internal programs and agents on it.

After creating and sharing the AMI, I can successfully launch instances from it in the target account (Account B), but:

I cannot retrieve the Windows password via “Get Windows password” (it says “This instance was launched from a custom AMI...”);

The SSM Agent doesn’t start or connect to Systems Manager;

The instance shows 3/3 health checks OK, but remains inaccessible over RDP or SSM.



---

🔹 What I have tried so far

1. Standard AMI creation:

Created the image via EC2 console → Create image.

Shared both the AMI and its snapshot with the target AWS account (including Allow EBS volume creation).
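As a reference point, this sharing step can be scripted against the EC2 API. A sketch using boto3's `modify_image_attribute` and `modify_snapshot_attribute` calls; the client is passed in (so it can be stubbed in tests) and the IDs and account number are placeholders:

```python
def share_ami(ec2, image_id, snapshot_id, target_account):
    """Share an AMI and its backing snapshot with another AWS account.

    `ec2` is a boto3 EC2 client, injected so this sketch can be tested
    with a stub; all IDs are placeholders.
    """
    # Grant launch permission on the AMI itself.
    ec2.modify_image_attribute(
        ImageId=image_id,
        LaunchPermission={"Add": [{"UserId": target_account}]},
    )
    # Also share the snapshot, so the target account can create
    # EBS volumes from it ("Allow EBS volume creation" in the console).
    ec2.modify_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=[target_account],
    )
```

Note this only covers the sharing mechanics; it doesn't address the password/sysprep problem described below.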



2. First attempt (no sysprep):

The image worked but AWS couldn’t decrypt the Windows password.

Expected behavior, since Windows wasn’t generalized.



3. Second attempt (sysprep with /oobe /generalize /shutdown):

Ran from SSM:

Start-Process "C:\Windows\System32\Sysprep\sysprep.exe" -ArgumentList "/oobe /generalize /shutdown" -Wait

Result: instance stopped correctly, but when launching from this AMI the system got stuck on the “Hi there” screen (OOBE GUI), so no EC2Launch automation, no RDP, no SSM.



4. Third attempt (sysprep with /generalize /shutdown only):

Based on the AWS official documentation, /oobe should not be used — EC2LaunchV2 handles first boot automatically.

However, the AMI was based on an older image that had EC2Launch v1, not EC2LaunchV2, so I verified this via:

Get-Service | Where-Object { $_.Name -like "EC2Launch*" }

and confirmed it was the legacy EC2Launch service.

Started the service:

Set-Service EC2Launch -StartupType Automatic
Start-Service EC2Launch

Re-ran:

Start-Process "C:\Windows\System32\Sysprep\sysprep.exe" -ArgumentList "/generalize /shutdown" -Wait

The process completed and the instance shut down, but in the new account I still couldn’t decrypt the Windows password (AWS said custom AMI).



5. Tried reinstalling EC2LaunchV2 manually:

Using:

Invoke-WebRequest "https://ec2-launch-v2.s3.amazonaws.com/latest/EC2LaunchV2.msi" -OutFile "$env:TEMP\EC2LaunchV2.msi"
Start-Process msiexec.exe -ArgumentList "/i $env:TEMP\EC2LaunchV2.msi /quiet" -Wait

However, the service didn’t register, likely because the image is built on a base that doesn’t support EC2LaunchV2 natively (Windows Server 2022 + legacy AMI lineage).



https://redd.it/1oyh932
@r_devops
Is there a standard list of all potential metrics that one can / should extract from technologies like HTTP / gRPC / GraphQL server & clients? Or for Request Response systems in general?

We all deal with developing and maintaining servers and clients. With observability playing its part, I'm trying to figure out: shouldn't we have standardized metrics that one can use by default for such servers?

If so is there actually a project / foundation / tool that is working on it?

E.g., for a server there could be Prometheus metrics for requests and responses; for a client, something similar. Developers can choose the metrics they deem useful, but having a list of all potentially available metrics would be a much better strategy IMHO.

I don't know if OpenTelemetry solves this; from what I understand, it provides the tooling to obtain metrics, traces, and logs, but doesn't define a definitive set of what these standard protocols can provide.
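For what it's worth, OpenTelemetry's semantic conventions do standardize names and attributes for HTTP server/client metrics (e.g. `http.server.request.duration` keyed by `http.request.method`, `http.route`, `http.response.status_code`). A toy stdlib sketch of recording one such metric; the `Histogram` class here is a stand-in for a real metrics SDK:

```python
from collections import defaultdict

# The metric name and attribute keys follow OpenTelemetry's HTTP
# semantic conventions; the recorder is a toy stand-in for an SDK.

class Histogram:
    def __init__(self):
        self.points = defaultdict(list)

    def record(self, value, attributes):
        # One time series per unique attribute combination.
        key = tuple(sorted(attributes.items()))
        self.points[key].append(value)

request_duration = Histogram()  # "http.server.request.duration", unit: s

def handle_request(method, route, status, duration_s):
    request_duration.record(duration_s, {
        "http.request.method": method,
        "http.route": route,
        "http.response.status_code": status,
    })

handle_request("GET", "/users/{id}", 200, 0.012)
handle_request("GET", "/users/{id}", 200, 0.034)
```

The value of such conventions is exactly what the post asks for: a shared, default catalogue of metric names and labels, so every HTTP server exports comparable series.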

https://redd.it/1oylwuc
@r_devops
How do you handle infrastructure audits across multiple monitoring tools?

Our team just went through an annual audit of our internal tools.

Some of the audits we do are the following:

1. Alerts - We have alerts spanning CloudWatch, Splunk, Chronosphere, Grafana, and custom cron jobs. We audit for things like whether we still need the alert, whether it's still accurate, etc.
2. ASGs - We went through all the AWS ASGs that we own and ensured they have appropriate resources (not too much or too little), that our team still owns them, etc.

That’s just a small portion of our audit.

Often these audits require the auditor to go to different systems and pull data to get an idea of the current status of the infrastructure or tool in question.

All of this data is put into a spreadsheet and different audits are assigned to different team members.
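For anyone sketching tooling beyond the spreadsheet, the collection step is essentially "normalize rows from each source, write one sheet". A stdlib sketch with hypothetical sources and fields:

```python
import csv
import io

# Toy sketch: normalize audit findings from several monitoring tools
# into one CSV. The source names, alerts, and columns are hypothetical;
# in practice each row would come from that tool's API.

def collect_alert_rows():
    return [
        {"source": "cloudwatch", "name": "HighCPU", "still_needed": "yes"},
        {"source": "grafana", "name": "DiskFull", "still_needed": "review"},
    ]

def write_audit_csv(rows):
    """Render normalized audit rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "name", "still_needed"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Once the rows are machine-generated, assigning audits and tracking "still needed?" answers becomes a diff against last year's sheet rather than a manual crawl.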

Curious on a few things:
- Are you auditing your infra/tools regularly?
- Do you have tooling for this? Something beyond simple spreadsheets.
- How long does it take you to audit?

Looking to hear what works well for others!



https://redd.it/1oyomjm
@r_devops
FREE Security audit for your code in exchange for 10 min feedback

Hey everyone,



I'm building a security analyzer called CodeSlick.dev that detects OWASP Top 10 vulnerabilities in JavaScript, Python, Java, and TypeScript.

To improve it, I'm offering free security audits in exchange for honest feedback.



What you get:

- Instant security analysis (<3 seconds)
- AI-powered fix suggestions with one-click apply
- CVSS severity scoring
- Downloadable HTML report

What I need:

- A 10-minute feedback survey after you see results
- Your honest thoughts on what worked/what didn't

Zero friction:

- No signup required
- No installation
- Just paste code → Get report → Share feedback

Interested? Please feel free to comment below or DM me.



https://redd.it/1oyq3gi
@r_devops
Offline Scalable CICD Platform Recommendations

Hello all,

I was wondering if anyone could recommend any scalable platforms for running CICD in an offline environment. At present we have a bunch of VMs with GitLab runners on them, but due to mixed use of the VMs (like users logging in to do other stuff) it’s quite hard to manage security and keep config consistent.

Unfortunately a lot of the VMs need to be Windows-based because that's the target environment. Most of the small jobs are Python; the larger jobs are Java, C++, etc. The Java stuff is super simple, but the other languages tend to be trickier. This network has about 40 proper devs and 60 Python bandits.

We’re looking for a solution that can be purchased to run on an air gapped network that can do load balancing, re-base-lining etc without much manual maintenance.

I’d suggested doing it with Kubernetes ourselves, but we're time-restricted and have some budget to buy something. One of my colleagues saw a VMware Tanzu demo that looked good, but anyone with hands-on experience would be more useful than a conference sales pitch.

Any suggestions would be appreciated, and I can provide more info if needed. We have about £200k budget for both the compute and the management platform.

Just in case anyone tries to sell me something directly, I won’t be the one making the decision or purchase.

Thanks in advance

https://redd.it/1oytnsx
@r_devops
what’s the one type of alert that ruins your sleep the most?

just trying to understand how bad on-call life really is outside my bubble.
Last night a friend got woken up at 3AM… for an alert that turned out to be nothing.

Curious:
• What alert always turns out to be noise?
• What’s the dumbest 3AM wake-up you’ve had?
• If you could delete one alert type forever, which one would it be?

https://redd.it/1oyv1lx
@r_devops
System Design interview for DevOps roles

For about a year now, system design interviews have taken their place in the interview process for DevOps roles; at least, that's what I've been seeing.

In each interview, I was asked to design different systems (API design and database design) to meet different requirements. These interviews always seem to focus on the software itself, rather than infrastructure, operating systems, or cloud. Personally, I feel they're judging a fish on whether it can fly.

Have you seen the same? What’s your opinion?

https://redd.it/1oywe81
@r_devops
want to build a microservice containing a mixture of open-source IAM and RBAC

I'm trying to build a microservice to handle auth and RBAC for a project I'm starting. I don't want to waste my time on it, though, and would rather use some open-source solutions to cover the requirements:


Authentication:

- JWT + OAuth2 password flow
- Access tokens + refresh tokens
- Token revocation, password reset, user invitations
- bcrypt password hashing...

Multitenancy:

- Database-per-tenant architecture
- Shared schema (super_admins, entities) + tenant schemas
- Complete data isolation between entities

RBAC:

- 3 fixed roles: Super Admin, Admin, User
- Profile-based permissions for Users
- Granular permissions in resource.action format (e.g., example.create, billing.*)
- Admin creates custom profiles with specific permissions
- Entity-level feature toggles
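The resource.action wildcard check described above is small enough to sketch with the stdlib; `fnmatch` is one simple way to support patterns like `billing.*` (the profile below is hypothetical):

```python
from fnmatch import fnmatch

# Sketch of the resource.action permission check described above.
# Granted permissions may contain wildcard patterns like "billing.*";
# fnmatch treats "*" as a wildcard and "." as a literal character.

def is_allowed(granted, required):
    """True if any granted permission pattern covers the required one."""
    return any(fnmatch(required, pattern) for pattern in granted)

# Hypothetical custom profile an Admin might create.
admin_profile = {"billing.*", "example.create"}
```

Dedicated engines like OpenFGA exist precisely because checks grow beyond this (relationships, inheritance, tenant scoping), but for three fixed roles plus profile permissions, the core test really is this small.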

Initially I picked Hanko ("great solution"), but it doesn't align with my system requirements and would need a lot of customization. Then I thought about using Keycloak, or Ory Kratos, with OpenFGA for RBAC.


But I wonder: what would be the best combination for these requirements, or am I on a completely wrong track?

https://redd.it/1oyw32n
@r_devops
I need help, I'm trying to build my first web application

Hello everyone.
I'm trying to deploy my first real app and honestly I'm already losing perspective after two days stuck on this. I'd like to ask you two things:

1. Recommendations on how I should deploy my stack correctly (Docker Compose).


2. Help to understand why Dokploy/Coolify/DigitalOcean are returning me such weird errors.




---

My stack (everything runs locally with docker compose up without problems):

Backend: Django + Django REST Framework

Tasks: Celery + Celery Beat

Messaging: Redis

Database: PostgreSQL (on DigitalOcean)

Frontend: React with Vite

Everything runs dockerized.

Locally it works perfectly, including Celery, Beat and Redis.

1) DigitalOcean App Platform

I tried it first.
My backend worked and connected fine to the external DO database, but App Platform doesn't support Celery, Celery Beat, or Redis as separate services (at least not simply, without costing an arm and a leg).
For my project they are essential, so I discarded them.


---

2) Coolify

I tried it… but I honestly felt like I was going in circles and not moving forward.
I couldn't get my complete compose stack up.
I got lost between pipelines, resources, static sites, and failing builds.
I gave up.


---

3) Dokploy

Now I am here because in theory it is the clearest option and with the best feedback.

I like that it lets me see logs, connections, containers, etc.
But I have several problems that I don't even know where to attack:


---

Problem 1: Backend goes up, but Django admin gives 404 or Bad Gateway

Dokploy builds my container without errors.

It connects perfectly to my DigitalOcean database.


Buuut... when I open /admin/ or any route I get:

404

or Bad Gateway



Random. I don't understand.


---

Problem 2: I bought a domain, associated it with Dokploy... and now Chrome says that “the connection is not private”

The DNS is correctly configured according to Dokploy (it shows everything green).

But when entering the URL:

> “An attacker may be trying to steal information…”


And below it shows that my site uses "HSTS" (I don't even know what that is 💀).

I don't know if it's a certificate failure, the proxy, misconfigured HTTPS, or if something else has to happen before it works. Maybe an Our Father.


---

What exactly am I looking for?

1. Realistic and direct advice:
What is the most practical and stable way to deploy a stack like this using Docker Compose?
Backend + React + Redis + Celery + Celery Beat.


2. If someone uses Dokploy:

How do you set up domains and certificates without Chrome saying a hacker wants to steal from me?

Why can a Django app that builds fine throw 404 or Bad Gateway only on /admin/?



3. Alternative options:

Should I go back to DigitalOcean Droplets and do a classic deploy with manual docker-compose?

Or was Coolify the right route and I was the problem?
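Not a full answer, but the /admin/ 404s and the HTTPS warnings behind a reverse proxy often come down to a handful of Django settings. A hedged sketch (example.com is a placeholder; adjust for your domain and proxy):

```python
# settings.py fragments that commonly matter behind a TLS-terminating
# reverse proxy such as Dokploy's Traefik. "example.com" is a placeholder.

DEBUG = False

# 404s/400s on /admin/ can come from Django rejecting the Host header.
ALLOWED_HOSTS = ["example.com", "www.example.com"]

# Needed for POSTs over HTTPS (e.g. the admin login form) on Django 4+.
CSRF_TRUSTED_ORIGINS = ["https://example.com"]

# Tell Django the proxy terminated TLS, so it builds https:// URLs
# and doesn't loop on its own HTTPS redirects.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")

# HSTS is the header behind the "connection is not private" lockout:
# once sent, the browser refuses plain HTTP for the domain.
# Enable it only after certificates actually work.
SECURE_HSTS_SECONDS = 0  # e.g. 31536000 once HTTPS is stable
```

Separately, check that WhatsApp-green DNS in Dokploy actually issued a certificate (Let's Encrypt needs port 80/443 reachable), and that static files for /admin/ are collected and served; a missing staticfiles setup is a classic source of admin 404s.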





---

I close with this:

I've been stuck digging through logs for two days.
If anyone can give me a clear direction, I would greatly appreciate it 🙏

https://redd.it/1oz1pe4
@r_devops
Cloud Infrastructure Engineer

Are there any cloud infrastructure engineers in here who can share their interview experience?

https://redd.it/1oz2y6e
@r_devops
How I'm using Infisical to secure my secrets in my pyATS/NetBox agent.

Hey everyone, just wanted to share a use case I'm really happy with. I'm building a multi-container AI agent for network automation (pyATS, NetBox, Streamlit) and was dreading how to manage all the device passwords, database strings, and API keys. Infisical was the perfect solution.

My docker_startup.sh script just fetches the Machine Identities, and then each container's `entrypoint.sh` uses `infisical run` to wrap the app (like a secure bubble). This injects all 35+ secrets as environment variables. The best part is that my Python code is totally clean: it just uses os.getenv() and has no idea Infisical even exists. It's a fantastic way to keep credentials out of my Docker files. Here's the link to the video I made: https://youtu.be/JBJOj8EE-JE
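The app-side half of that pattern is tiny: read everything from the environment and fail fast if something's missing, so a misconfigured `infisical run` wrapper surfaces at startup instead of mid-request. A sketch with hypothetical variable names:

```python
import os

# Sketch of the pattern described above: the app only reads environment
# variables and never knows where they came from (Infisical, a .env
# file, CI). The variable names below are hypothetical.

REQUIRED = ["NETBOX_URL", "NETBOX_TOKEN", "DEVICE_PASSWORD"]

def load_config():
    """Read required settings from the environment, failing fast if any are missing."""
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"missing required secrets: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}
```

Failing fast here is the design point: the container crashes loudly at boot when the secrets injector isn't wired up, which is much easier to debug than a pyATS login failure later.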

https://redd.it/1oz3gr6
@r_devops
Decoding DevOps

I'm a software specialist with a DevOps background, and I'm thinking of taking this course: Decoding DevOps – From Basics to Advanced Projects with AI by Imran Teli, to strengthen my portfolio and CV and land a mid-to-senior DevOps position ASAP. Would it help, or are there better options?

https://redd.it/1oz17ab
@r_devops
Our production crashed for 48 hours because of a version mismatch

ClickHouse migration went wrong. Old region: v22.8. New region: v23.3. Nobody noticed.

Two days of debugging with premium support. Zero results.

Finally caught it ourselves after 48 hours.

Building a tool now to prevent these config nightmares. Lesson learned: always verify versions across environments.
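The "verify versions across environments" check is cheap to automate: collect the version string each environment reports and diff them before cutover. A stdlib sketch (the environment names mirror the incident; in practice the versions would come from e.g. `SELECT version()`):

```python
# Sketch: compare version strings reported by each environment and
# surface any mismatched pairs before a migration cutover.

def parse_version(v):
    """'22.8' -> (22, 8), so comparison is numeric, not lexicographic."""
    return tuple(int(part) for part in v.split("."))

def version_mismatches(reported):
    """Return pairs of environments whose reported versions differ."""
    items = sorted(reported.items())
    return [(a, b)
            for i, (a, va) in enumerate(items)
            for (b, vb) in items[i + 1:]
            if parse_version(va) != parse_version(vb)]

reported = {"old-region": "22.8", "new-region": "23.3"}
```

Wired into CI or a pre-migration checklist, an empty result becomes a gate; a non-empty one is the 48-hour debugging session caught in seconds.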

https://redd.it/1oz7rcs
@r_devops
How to send Supabase Postgres logs to New Relic on Pro (cloud, not self-hosted)?

Hey everyone,

I’m trying to figure out a clean way to get Supabase Postgres logs into New Relic without changing my whole setup or upgrading plans.

My situation:

- I’m using Supabase Cloud, not self-hosted
- I’m currently on the Pro plan
- I don’t want to upgrade to Team just to get log drains
- I’ve already successfully integrated New Relic with my Supabase Edge Functions (Node/TypeScript), and that part is working fine
- What I’m missing is Postgres/DB logs (slow queries, errors, etc.) inside New Relic

From what I’ve seen, the “proper” / official way seems to be using log drains, which are only available on the higher tiers. Since I’m on Pro, I’m looking for any of the following:

- Has anyone found a workaround to get Postgres logs or query data from Supabase Cloud → New Relic while staying on Pro?
- Is there any way to forward logs via webhooks, or some pattern like:
- Supabase → Function / Trigger → HTTP → New Relic ingest endpoint?
- Or maybe using database triggers / audit tables + a job that pushes data into New Relic in some structured way?
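The trigger-to-HTTP idea is workable in outline. A sketch of shaping a Postgres log line into a payload for New Relic's Log API (`log-api.newrelic.com/log/v1` with an `Api-Key` header is New Relic's documented log ingest endpoint; getting the line out of Supabase Cloud on Pro, e.g. via an audit table plus a scheduled Edge Function, is the part with no official path):

```python
import json

# Sketch: shape a Postgres log record into a New Relic Log API payload.
# The endpoint and header follow New Relic's log ingest API; the
# attribute names under "attributes" are up to you.

NR_LOG_ENDPOINT = "https://log-api.newrelic.com/log/v1"  # POST, "Api-Key" header

def to_newrelic_payload(pg_log, timestamp_ms):
    return {
        "timestamp": timestamp_ms,
        "message": pg_log["message"],
        "attributes": {
            "service": "supabase-postgres",
            "severity": pg_log.get("severity", "INFO"),
        },
    }

# Example: a slow-query line as Postgres would log it.
body = json.dumps([to_newrelic_payload(
    {"message": "duration: 2001 ms  statement: SELECT ...", "severity": "LOG"},
    1700000000000,
)])
```

The POST itself is a plain HTTPS request from wherever you already run code (an Edge Function fits, since that integration is working for you); the open question on Pro remains how completely you can capture DB logs without the Team-tier log drains.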


If anyone has:
- A working setup
- Even a partial solution (e.g. just errors or slow queries)
- Or can confirm that it’s basically impossible without Team / Enterprise

…I’d really appreciate the details.

Thanks in advance.

https://redd.it/1oza164
@r_devops
How can I start learning AWS or Azure without a credit/debit card?


I'm trying to get into cloud computing, but I'm stuck at the very first step. I don't have a credit or debit card, and my college ID isn’t eligible for the Azure for Students offer. Because of that, I can’t sign up for the free tiers on AWS or Azure.

For anyone who’s been in a similar situation — how did you start learning? Are there any alternatives, free resources, sandbox environments, or training platforms I can use without needing a card? I really want to get hands-on practice instead of only watching videos.

Any suggestions would be really appreciated!


https://redd.it/1oz9wrh
@r_devops
Anyone else tired of juggling SonarQube, Snyk, and manual reviews just to keep code clean?

Our setup has become ridiculous. SonarQube runs nightly, Snyk yells about vulnerabilities once a week, and reviewers manually check for style and logic. It's all disconnected: different dashboards, overlapping issues, and zero visibility into whether we're actually improving. I've been wondering if there's a sane way to bring code quality, review automation, and security scanning into a single workflow. Ideally something that plugs into GitHub so we stop context-switching between five tabs every PR.

https://redd.it/1ozc6lj
@r_devops