Reddit DevOps – Telegram
Need help for suggestions regarding SDK and API for Telemedicine application

Hello everyone,

So currently our team is planning to build a telemedicine application. Like any telemedicine app, it will have chat and video conferencing features.

The backend (Node.js and Firebase) is almost ready, but we are not able to decide which real-time communication SDK and API to use.
We're torn between ZEGOCLOUD and Twilio. If anyone has used either before, kindly share your experience. Any other suggestions are also welcome.

TIA.

https://redd.it/1o5h6xs
@r_devops
Which internship should I choose?

I'm currently a Year 1 student trying to break into the field of DevOps.

In your opinion, if given a choice, which internship would you choose: Platform Engineer or DevOps?

I currently have 2 internship options but am unsure which to choose. Any suggestions to help me decide would be greatly appreciated. I've learned technologies from KodeKloud (GitHub Actions CI/CD, AWS, Terraform, Docker, and Kubernetes) and understand that both internships provide valuable opportunities to learn.

Option 1: Platform Engineer Intern
Company: NETS (Slightly bigger company, something like VISA but not on the same scale)
Tech: Python, Bash Scripting, VM, Ansible

Option 2: DevOps Intern
Company: (SME)
Tech: CICD, Docker, Cloud, Containerization

Really don't know what to expect from either, so maybe someone with more experience can point me in a direction :)

https://redd.it/1o5gk7d
@r_devops
Our Disaster Recovery "Runbook" Was a Notion Doc, and It Exploded Overnight

The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.

**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.

We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.

* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.

By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:

* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds
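The DNS-weight step above is the kind of thing worth keeping as tested code rather than prose in a runbook. A minimal Python sketch (instance names, statuses, and weight values are illustrative, not from the actual incident):

```python
def dns_weights(instances, healthy_weight=100, degraded_weight=10):
    """Assign Route 53-style weights: healthy instances get the bulk of
    traffic, degraded ones a trickle, dead ones zero."""
    weights = {}
    for name, status in instances.items():
        if status == "healthy":
            weights[name] = healthy_weight
        elif status == "degraded":
            weights[name] = degraded_weight
        else:  # "down" or unknown: take it out of rotation entirely
            weights[name] = 0
    return weights

fleet = {"api-1": "healthy", "api-2": "degraded", "api-3": "down"}
print(dns_weights(fleet))  # {'api-1': 100, 'api-2': 10, 'api-3': 0}
```

The output maps directly onto weighted record sets, so a 2 a.m. responder only has to apply the numbers, not invent them.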

We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.

**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?

https://redd.it/1o5mdjd
@r_devops
How much of this AWS bill is a waste?

Started working with a big telecom provider here in Canada, and these guys are wasting so much on useless shit it boggles my mind.

The monthly bill for their cutting-edge "tech innovation department" (the in-house tech accelerator) clocks in at $30k/month.

The department is supposed to be leading the charge on using AI to reduce cost, use the best stuff AWS can offer, and "deliver the best experience for the end user".

First-day observations:

EC2 is over-provisioned by 50%: the current 50 instances could be cut in half to 25. No CloudWatch, no logging, no monitoring enabled, and no one can answer "do we need it?" questions.
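To put a number on the right-sizing alone, here's a back-of-the-envelope sketch. Only the $30k/month bill and the 50 → 25 instance cut come from the post; the EC2 share of the bill is an assumed figure:

```python
# Back-of-the-envelope right-sizing estimate. All numbers hypothetical
# except the $30k/month bill and the 50 -> 25 instance cut.
monthly_bill = 30_000          # total AWS spend, $/month (from the post)
ec2_share = 0.5                # ASSUMED fraction of the bill that is EC2
instances_now, instances_needed = 50, 25

ec2_cost = monthly_bill * ec2_share
rightsized = ec2_cost * instances_needed / instances_now
savings = ec2_cost - rightsized
print(f"Estimated savings: ${savings:,.0f}/month (${savings * 12:,.0f}/year)")
```

Even under that conservative assumption, halving the fleet pays for a lot of engineer-hours spent on the usage analysis nobody has done.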

No one has done any usage analysis in the past 18 months, let alone followed the best practice of evaluating every 3–6 months.

There's no performance baseline and no SLAs for any of the services. No uptime guarantee (and they wonder why everyone hates them), no load/response-time monitoring, no cost-impact analysis.

NO infrastructure as code (i.e., Terraform), no auto-scaling policies, and definitely no red-teaming/resilience testing.

I spoke to a handful of architects and no one could point me to the FinOps team in charge of cost optimization. So basically the budget keeps growing and they keep getting upsold.

I honestly don't know why I'm here.

https://redd.it/1o5toxi
@r_devops
Do homelabs really help improve DevOps skills?

I’ve seen many people build small clusters with Proxmox or Docker Swarm to simulate production. For those who have tried it, which homelab projects actually improved your real-world DevOps work, and which ones were just fun experiments?

https://redd.it/1o5w3sv
@r_devops
How do you keep IaC repositories clean as teams grow?

Our Terraform setup began simple, but now every microservice team adds its own modules and variables. It’s becoming messy, with inconsistent naming and ownership. How do you organize large IaC repos without forcing everything into a single centralized structure?
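One low-friction option is to enforce conventions in CI rather than centralizing the repo. A minimal sketch of such a check — the `team_purpose` naming convention and the example paths are hypothetical, not a standard:

```python
import re

# Hypothetical convention: modules/<team>_<purpose>, lowercase with
# underscores, so ownership is readable from the path alone.
MODULE_NAME = re.compile(r"^[a-z0-9]+_[a-z0-9_]+$")

def check_module_names(paths):
    """Return the module paths that violate the naming convention."""
    violations = []
    for path in paths:
        name = path.rstrip("/").split("/")[-1]
        if not MODULE_NAME.match(name):
            violations.append(path)
    return violations

modules = ["modules/payments_vpc", "modules/SearchCluster", "modules/billing_rds"]
print(check_module_names(modules))  # ['modules/SearchCluster']
```

Run something like this as a pre-merge check and naming drift gets caught at review time instead of accumulating; each team keeps its own modules, but the shape stays consistent.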

https://redd.it/1o5w3di
@r_devops
Anyone else experimenting with AI-assisted on-call setups?

We started testing a workflow where alerts trigger a small LLM agent that summarizes logs and suggests a likely cause before a human checks it. Sometimes it helps a lot; other times it makes mistakes. Has anyone here tried something similar or added AI triage to their DevOps process?
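For anyone picturing the flow, here is a minimal sketch of that triage step, with a stub standing in for the LLM call (all names and the summary format are illustrative, not from the poster's actual setup):

```python
def summarize_logs(log_lines):
    """Stub: a real implementation would send these lines to an LLM
    and return its summary. Here we just count ERROR lines."""
    errors = [line for line in log_lines if "ERROR" in line]
    if not errors:
        return "no errors found"
    return f"{len(errors)} error line(s); first: {errors[0]}"

def triage(alert, log_lines):
    """Attach an AI-generated summary to the alert for the human
    responder. The human still decides; the summary is only a hint."""
    return {
        "alert": alert,
        "summary": summarize_logs(log_lines),
        "needs_human_review": True,  # always: the agent only suggests
    }

logs = ["INFO boot ok", "ERROR db connection refused", "ERROR retry exhausted"]
print(triage("api-5xx-spike", logs)["summary"])
# 2 error line(s); first: ERROR db connection refused
```

The key design choice is that `needs_human_review` is unconditional: the agent narrows the search space but never closes an incident on its own, which limits the damage when it is wrong.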

https://redd.it/1o5w30f
@r_devops
Who is responsible for owning the artifact server in the software development lifecycle?

So the company I work at is old but brand new to internal software development. We don’t even have a formal software engineering team, but we have a Sonatype Nexus artifact server. Currently, we can pull packages from all of the major repositories (PyPI, npm, NuGet, Docker Hub, etc.).

Our IT team doesn’t develop any applications, but they are responsible for the “security” of this server. I feel like they have the settings cranked as high as possible. For example, all Linux Docker images (slim-bookworm, alpine, etc.) are quarantined for things like glibc vulnerabilities where “a remote attacker can do something with the stack”… or Python’s pandas is quarantined for deserializing remote pickle files, SQLAlchemy for its loads methods, everything related to AI like LangChain… all of npm is quarantined because it is a package manager that “allows you to install malicious code”. I’ll reiterate: we have no public-facing software. Everything is hosted on premises and inside our firewalls.

Do all organizations with an internal artifact server just have to deal with this? Find other ways to do things? Who typically creates the policies that say package x or y should be allowed? If you have had to deal with a situation like this, what strategies did you implement to create a more manageable developer experience?
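One strategy some teams adopt is replacing blanket quarantine with a context-aware policy that the development side can read and debate. A hypothetical sketch of such a decision function — the thresholds and fields are illustrative, not a recommendation:

```python
# Hypothetical policy sketch: instead of quarantining on every CVE,
# score each package against deployment context.
def quarantine_decision(pkg):
    """Return (quarantine?, reason). Thresholds are illustrative only."""
    if pkg["cvss"] >= 9.0:
        return True, "critical severity: block regardless of exposure"
    if pkg["cvss"] >= 7.0 and pkg["internet_facing"]:
        return True, "high severity on an exposed service"
    return False, "below threshold for internal-only use; log and monitor"

pandas_pkg = {"name": "pandas", "cvss": 7.5, "internet_facing": False}
print(quarantine_decision(pandas_pkg))
# (False, 'below threshold for internal-only use; log and monitor')
```

The point is less the specific numbers than that the policy lives in one reviewable place, jointly owned by IT security and whoever develops against the proxy, instead of being an opaque slider in the Nexus UI.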

https://redd.it/1o5zv57
@r_devops
Would a self-hosted AI analytics tool be useful? (Docker + BYO-LLM)

I’m the founder of Athenic AI (a tool to explore/analyze data with natural language). I'm toying with the idea of a self-hosted community edition and wanted to get input from people who work with data...

The community edition would be:

- Bring-your-own-LLM (use whichever model you want)
- Dockerized, self-contained, easy to deploy
- Designed for teams who want AI-powered insights without relying on a cloud service

If you're interested, please let me know:

- Would a self-hosted version be useful?
- What would you actually use it for?
- What must-have features or challenges should we consider?

https://redd.it/1o5voxu
@r_devops
Rundeck Community Edition

It's been a while since I looked at Rundeck, and to no surprise, PagerDuty is pushing people to purchase a commercial license. Looking at the comparison chart, I wonder if the CE is useless. I don't care about support and HA, but not being able to schedule jobs is a deal-breaker for us. Is anyone using Rundeck who can vouch that the free edition is still useful? Are plugins available?

What we need:
- a self-service center for ad-hoc jobs
- scheduled jobs
- retrying failed jobs
- firing off multiple worker nodes (ECS containers) to run multiple jobs independently of one another

https://redd.it/1o6344v
@r_devops
Need advice — Should I focus on Cloud, DevOps, or go for Python + Linux + AWS + DevOps combo?

Hey everyone,

I’m currently planning my long-term learning path and wanted some genuine advice from people already working in tech.

I’m starting from scratch (no coding experience yet), but my goal is to get into a high-paying and sustainable tech role in the next few years. After researching a bit, I’ve shortlisted three directions:
1. Core Cloud Computing (AWS, Azure, GCP, etc.)
2. Core DevOps (CI/CD, Docker, Kubernetes, automation, etc.)
3. A full combo path — Python + Linux + AWS + basic DevOps

I’ve heard that the third path gives the best long-term flexibility and salary growth, but it’s also a bit longer to learn.
What do you guys think?
• Should I specialize deeply in Cloud or DevOps?
• Or should I build the full foundation first (Python + Linux + AWS + DevOps) even if it takes longer?
• What’s best for getting a high-paying, stable job in 4–5 years?

Would love to hear from professionals already in these roles.

https://redd.it/1o64ct8
@r_devops
DevOps experts: What’s costing teams the most time or money today?

What’s the biggest source of wasted time, money, or frustration in your workflow?
Some examples might be flaky pipelines, manual deployment steps, tool sprawl, or communication breakdowns — but I’m curious about what you think is hurting productivity most.


Personally, coming from a software background and recently joining a DevOps team, I find the cognitive load of learning all the tools overwhelming — but I’d love to hear if others experience similar or different pain points.

https://redd.it/1o672nn
@r_devops
Need advice — Physics grad but confused between DevOps, ML, or CFA

Hey everyone,
I graduated this year with a degree in Physics from a good college. I’ve been into coding since childhood — used to mess around on XDA Developers about 10 years ago, making random projects and tinkering with stuff.

This year I took a drop to work on a startup with my friends — we’re building a VM provisioning system, and I wrote most of the backend and part of the frontend. Before that, around 3 years ago, I even tried starting something in cybersecurity.

Now I’m kind of stuck deciding where to go next. A few options I’ve been thinking about:
• Doing a Master’s in Physics from IIT (I actually love the subject).
• Doing BCA again, just to strengthen my theoretical CS fundamentals.
• Getting deeper into DevOps, because I really enjoyed working with stuff like Firecracker and Kubernetes during our project.
• Going into Machine Learning, since I already have a good math background and love problem-solving.
• Or maybe even pursuing CFA, because I’ve always been interested in finance and markets too.

I know these fields are pretty different, but they all genuinely interest me in different ways.
What do you guys think — where should I focus next or double down?


https://redd.it/1o67ka8
@r_devops
Migrating from Lightsail to EC2 for Terraform experience?

Hey everyone! I’m currently handling DevOps for our company, and we’ve been using AWS Lightsail for most of our projects. It’s been great in terms of simplicity and cost savings, but as the number of projects and servers grows, it’s getting harder to manage.

We use Docker Swarm to deploy stacks (1 stack = 1 app), and we host dev/test/prod environments together on some servers.

I'm planning to slowly migrate to EC2 so I can adopt Terraform for infrastructure management, and I also want to personally grow and learn it. But EC2 is more expensive, and since we’re a startup, I need to justify the cost difference before suggesting it to management.

Would it be possible to do it without increasing our cost to run the servers, or even save more? Has anyone here gone through this transition? Would love to hear your insights. Thanks!



https://redd.it/1o6a77i
@r_devops
Tool for productivity: notes, links, pass

Hi

Do you use any tool to track notes, links, credentials, files, etc., for your work?

I am working on multiple projects that are vastly different, and I have multiple sources of notes. Some things are in Git, some online in Jira, some notes from development are in text files, and there are scripts everywhere. It's like that for every project, and I'm having a hard time searching for relevant info.

I would like some tool where I can create main 'folders' and, under those, subfolders that can hold a password manager, links, system files, notes, etc.

Also, I use only Linux. Any ideas?

https://redd.it/1o6a22f
@r_devops
HackerRank devops assessment of Arcesium

Hi everyone! I have been shortlisted for the SSE Infrastructure role at Arcesium. HR has shared a HackerRank assessment link that needs to be completed within the next 48 hours. This will be my first time attempting a HackerRank test. If anyone has taken it, it would be very helpful to know what kind of questions are usually asked.

https://redd.it/1o6c7hu
@r_devops
KubeGUI - release v1.8

🎉 [Release] KubeGUI v1.8 — lightweight desktop client for managing Kubernetes clusters

Hey folks 👋

Just released KubeGUI v1.8, a free desktop app for visualizing and managing Kubernetes clusters without server-side or other dependencies. You can use it for any personal or commercial needs.

Highlights:

🤖You can now configure and use AI (like Phind or OpenAI-compatible APIs) to get fix suggestions directly inside the application, based on error message text.

🩺Live resource updates (pods, deployments, etc.)

📝Integrated YAML editor with syntax highlighting and validation.

💻Built-in pod shell access directly from app.

👀Aggregated (multiple or single containers) live log viewer.

🍱CRD awareness (example generator).

Faster UI and lower memory footprint.

Runs locally on Windows & macOS - just point it at your kubeconfig and go.

👉 Download: https://kubegui.io

🐙 GitHub: https://github.com/gerbil/kubegui (your suggestions are always welcome!)

💚 To support the project: https://ko-fi.com/kubegui

Would love to hear your thoughts or suggestions — what’s missing, what could make it more useful for your day-to-day ops?

https://redd.it/1o6e7om
@r_devops
Best solution to automate Docker bundle backups?

Hi. I have been scratching my head over this one for a while, with multiple back-and-forths with AI too, but in the end I can never decide. I thought asking DevOps folks might be better...

My OS is Ubuntu 24.04 Pro.
I'm using Docker to self-host a bunch of services, with a mix of named volumes and bind mounts for persistent storage. Some services use Postgres/Supabase, and n8n handles automations, so it is better not to interrupt them for long (or at all), generally speaking.

I am basically unsure what the most straightforward/easy solution is for a periodic auto backup of everything (the data for all containers), just in case my server dies (it's an old PC; I use it for experimenting).

I'd like the backup to be auto uploaded to the cloud.

I initially thought I'd use Ubuntu's "Online Accounts" feature, which integrates a Google account, so I could just use Déjà Dup backups + only bind mounts for containers, and upload a folder of everything to Google Drive weekly.

The problem is that this is not acceptable for a Postgres DB; instead, I should do a proper pg_dump first. I haven't even downloaded the Supabase CLI or the pg_dump/pg_restore tools yet.
Copying and pasting a folder with all the bind mounts is not a valid way to back up a running database.
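One way to script that split is to dump the database properly and tar only the non-database bind mounts. A rough sketch that just builds the command list (the container name, the `postgres` user/database, and the paths are placeholders for your actual setup):

```python
from datetime import date

def backup_commands(pg_container, bind_dirs, dest="/backups"):
    """Build the shell commands for one backup run (dry-run friendly:
    print them first, execute them from cron once they look right)."""
    stamp = date.today().isoformat()
    cmds = [
        # pg_dump inside the container gives a consistent snapshot
        # without stopping the service
        f"docker exec {pg_container} pg_dump -U postgres -Fc"
        f" -f /tmp/db_{stamp}.dump postgres",
        f"docker cp {pg_container}:/tmp/db_{stamp}.dump {dest}/",
    ]
    # bind mounts of non-database services can be tarred as-is
    for d in bind_dirs:
        name = d.rstrip("/").split("/")[-1]
        cmds.append(f"tar czf {dest}/{name}_{stamp}.tar.gz {d}")
    return cmds

for cmd in backup_commands("postgres", ["/srv/n8n", "/srv/files"]):
    print(cmd)
```

From there, a weekly cron job can run the commands and hand the `/backups` folder to whatever cloud-upload tool you settle on; the Postgres data never gets copied as raw files.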

-------

I recently discovered and installed Coolify, so I don't know if you'd recommend leveraging its features for this, or whether there is an even better way?

I have no formal engineering degree, by the way. I'm keen to dig into the technical details, but generally speaking, I obviously prefer a solution that involves less complexity.

Thanks in advance



https://redd.it/1o6gbyv
@r_devops
I have a DevSecOps intern interview tomorrow. What to expect?

As the noscript suggests, I have a DevSecOps intern interview tomorrow and would really like to secure this internship. Considering it is an internship, what do you think I'm expected to know? They did say my resume caught their attention, hence the interview opportunity.

https://redd.it/1o6he7h
@r_devops
Are you running your tests in Argo CD? If so, how are you getting the reports out?

We're running applications with GitOps using Argo CD and looking at post-sync test jobs for running E2E tests.

I got my POC running before realizing I have no good way of getting the report out and in front of devs.

How are you exposing test results from jobs in Argo CD?
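For reference, a common pattern is to make the post-sync Job itself push the report somewhere devs already look. A hedged sketch of such a hook — the image and the `run-tests`/`upload-report` commands are placeholders for your own tooling, though the hook annotations themselves are standard Argo CD:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: e2e-tests-
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: e2e
          image: registry.example.com/e2e-runner:latest  # placeholder image
          # run the suite, then push the JUnit XML somewhere devs can see
          # it: an S3 bucket, a report portal, or a PR comment via your
          # CI system's API
          command: ["sh", "-c", "run-tests && upload-report report.xml"]
```

The upload step is the part Argo CD won't do for you; treating the report as an artifact the Job ships out (rather than something to scrape from pod logs) is what gets it in front of devs.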

https://redd.it/1o6k6px
@r_devops