Reddit DevOps – Telegram
149.100.11.243 - GET / HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
2025-10-13T02:07:14+00:00 - 200 - 190.12.104.161 - GET /cmake-common/plain/.clang-format?h=v3.2&id=0282c2b54f79fa9063e03443369adfe1bc331eaf HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36
2025-10-13T02:07:16+00:00 - 200 - 179.222.178.65 - GET /cmake-common/commit/toolchains/boost?h=v3.4&id=37b051e99fc6b0706f5dc4b2f01dbbbb9b96355a HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/79.0.3945.88 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 66.249.79.193 - GET /cgitize/diff/?h=v2.1.0&id=8d2422274ae948f7412b6960597f5de91f3d8830 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:07:17+00:00 - 200 - 179.49.32.156 - GET /config-links/diff/debian/changelog?h=debian%2Fv2.0.3-5&id=0a4df2ead72546cca8328581b1b41b172b83e769 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 14.231.40.70 - GET /vk-noscripts/commit/vk/utils?h=v1.0.1&id=ee7a170df79287aac3bccfead716377ec8600c5c HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36
2025-10-13T02:07:18+00:00 - 200 - 113.177.166.37 - GET /wireguard-config/plain/.ruby-version?id=ab97b021462809453a38b4f6b87944acd00d51b9 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/84.0.4147.125 Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 177.141.68.37 - GET /infra-terraform/log/.gitattributes?follow=1&h=v1.2.0&id=78dd4f3cc9d408df69fac270860b283e310fe379 HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4950.0 Iron Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 124.243.188.173 - GET /sorting-algorithms/commit/Gemfile?h=migration&id=9b3e6d409340369a6b450e997723f773f0aa3505&follow=1 HTTP/2.0 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47

(The log format I use is customized; I don't like the default one. The Google bot is fine.) Any tips? Like setting up a reCAPTCHA or something?
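Before reaching for a CAPTCHA, it can help to quantify who is actually hitting the site. A minimal sketch of parsing this custom format to surface heavy clients (the regex is derived from the sample lines above; the field order is an assumption):

```python
import re
from collections import Counter

# Matches the custom access-log format shown above:
# "<timestamp> - <status> - <ip> - <request line> - <user agent>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) - (?P<status>\d{3}) - (?P<ip>\S+) - "
    r"(?P<request>.+? HTTP/[\d.]+) - (?P<ua>.+)$"
)

def top_clients(lines, n=5):
    """Count requests per IP so heavy scrapers stand out."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            hits[m.group("ip")] += 1
    return hits.most_common(n)
```

IPs with an outsized request rate are then candidates for rate limiting (e.g. at the reverse proxy), which tends to be cheaper than a CAPTCHA for a git web frontend.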

https://redd.it/1o57jqx
@r_devops
Ever heard of KubeCraft?

I was looking for resources and saw someone on this sub mention it. $3500 for a one-year bootcamp? I’m skeptical because I can’t find many reviews of it.

For some additional background: I currently work in cyber (OT Risk Management with some AWS vuln management responsibilities) and I’m looking to make the transition into a cloud engineering role. My company gives us an L&D stipend and so far I’ve used it to get Adrian Cantrill’s AWS SAA course and an annual subscription to KodeKloud. I’ve still got a good amount left and was going to use it for Nana’s DevOps course and homelab equipment.

https://redd.it/1o570k3
@r_devops
Is cost a metric you care about?

Trying to figure out whether DevOps or software engineers should care about building efficient software (AI or not), optimized both for scalability/performance and for cost.

It seems that in the age of AI we're myopically focused on increasing output, not even outcome. Think about it: say you increase productivity, you have a way to measure it, and you decide: yes, it's up. Is anyone looking at costs as well, just to put things into perspective?

Or is the predominant mindset of companies that cost is a “tomorrow” problem, and growth comes first?

When does a cost become a problem and who’s solving it?

🙏🙇

https://redd.it/1o51juz
@r_devops
Simplifying OpenTelemetry pipelines in Kubernetes

During a production incident last year, a client’s payment system failed and all the standard tools were open. Grafana showed CPU spikes, CloudWatch logs were scattered, and Jaeger displayed dozens of similar traces. Twenty minutes in, no one could answer the basic question: which trace is the actual failing request?

I suggested moving beyond dashboards and metrics to real observability with OpenTelemetry. We built a unified pipeline that connects metrics, logs, and traces through shared context.

The OpenTelemetry Collector enriches every signal with Kubernetes metadata such as pod, namespace, and team, and injects the same trace context across all data. With that setup, you can click from an alert to the related logs, then to the exact trace that failed, all inside Grafana.
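The mechanics of that shared context can be illustrated with a small, self-contained sketch. In the real pipeline the OpenTelemetry SDK and Collector manage this; here a `ContextVar` stands in for the active span, and the core idea is simply that every log line carries the active trace ID, so logs and traces join on one key. All names are illustrative:

```python
import logging
import uuid
from contextvars import ContextVar

# Stand-in for the active span's trace ID (the OTel SDK manages this for real).
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Inject the active trace ID into every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

def handle_request(logger, payload):
    # One trace ID per request: every log line it emits carries the same ID,
    # so you can pivot from a log entry straight to the failing trace.
    current_trace_id.set(uuid.uuid4().hex)
    logger.info("processing payload=%s", payload)
```

With the Collector doing the equivalent enrichment for metrics and traces as well, all three signal types share the same pod/namespace/trace keys, which is what makes the click-through in Grafana possible.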

The full post covers how we deployed the Operator, configured DaemonSet agents and a gateway Collector, set up tail-based sampling, and enabled cross-navigation in Grafana: OpenTelemetry Kubernetes Pipeline

If you are helping teams migrate from kube-prometheus-stack or dealing with disconnected telemetry, OpenTelemetry provides a cleaner path. How are you approaching observability correlation in Kubernetes?

https://redd.it/1o5c3bk
@r_devops
Are self-destructing secrets a good approach to authenticating a self-hosted GitHub Actions runner securely?

I created a custom self-hosted, Oracle Linux-based GitHub runner Docker image. The entrypoint script supports three methods of authentication:

* short-lived registration token from the web UI
* PAT
* GitHub App auth -> .pem key + installation ID + app ID

Now, the first option is pretty safe to use even as a container env var because it's short-lived. I'm more concerned about the other two. My main gripe is that the container user which runs the GitHub connection service is the same user used for running pipelines, so anyone who uses the pipelines can use them to see the .pem or the PAT. Yes, you could use GitHub secrets to "obfuscate" the strings, but you always have to remember to do it, and there are other ways to extract them anyway.

So I created a self-destructing secrets mechanism: Docker mounts a local folder as a volume (the container needs full RW permissions on it), and you can place private-key.pem or pat.token files there. When the entrypoint.sh script runs, it uses either of them to authenticate the runner, clears the folder, and then starts the main service. If it can't delete the files, it will not start.
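As a rough illustration, the read-then-wipe step of such an entrypoint could look like this in Python (the mount path and file names are assumptions; the actual entrypoint is a shell script):

```python
import sys
from pathlib import Path

SECRETS_DIR = Path("/run/runner-secrets")  # the mounted volume (path assumed)

def load_and_destroy(secrets_dir=SECRETS_DIR):
    """Read any secret files, then delete them; refuse to start otherwise."""
    secrets = {}
    for path in secrets_dir.glob("*"):
        if path.is_file():
            secrets[path.name] = path.read_text().strip()
    # Wipe the directory so pipeline jobs running as the same user
    # can never read the credentials back off disk.
    for path in list(secrets_dir.glob("*")):
        try:
            path.unlink()
        except OSError:
            sys.exit(f"refusing to start: could not delete {path}")
    return secrets
```

Note this only removes the at-rest copy; the credential still lives in the runner process's memory, which is why separating the auth user from the pipeline user would be the stronger fix.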

But I feel this is a problem that has already been solved another way. Even though I couldn't find any info on using two different users (one for runner authentication and one for pipelines), this security flaw feels too large for there not to be a better (and more appropriate) way to do it.

https://redd.it/1o5ctbh
@r_devops
What are the best integrations for developers?

I’ve just started using monday dev for our dev team. What integrations do you find most useful for dev-related tools like GitHub, Slack or GitLab?

https://redd.it/1o5c74n
@r_devops
monday dev vs ClickUp, why did you make the switch?

We moved from ClickUp to monday dev for its simpler interface and better automation. Curious about others’ experiences.

https://redd.it/1o5fjds
@r_devops
Built a 3 tier web app using AWS CDK and CLI

Hey everyone!

I’m a beginner on AWS and I challenged myself to build a production-grade 3-tier web infrastructure using only AWS CDK (Python) and AWS CLI.

**Stack includes:**

* VPC (multi-AZ, 3 public + 3 private subnets, 1 NAT Gateway)
* ALB (public-facing)
* EC2 Auto Scaling Group (private subnets)
* PostgreSQL RDS (private isolated)
* Secrets Manager, CloudWatch, IAM roles, SSM, and billing alarms

Everything was done code-only, no console clicks except for initial bootstrap and billing alarm testing.

**Here’s what I learned:**

* NAT routing finally clicked for me.
* CDK’s abstraction makes subnet/route handling a breeze.
* Debugging AWS CLI ARN capture taught me about stdout/stderr redirection.
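That last point generalizes: the AWS CLI writes results to stdout and diagnostics to stderr, so capturing an ARN cleanly means keeping the two streams separate. A small Python sketch of the principle, using a stand-in child process (and an invented ARN) rather than the real CLI:

```python
import subprocess
import sys

# Stand-in for an `aws` invocation: result goes to stdout, noise to stderr.
child = (
    "import sys; "
    "print('arn:aws:iam::123456789012:role/demo'); "      # result -> stdout
    "print('warning: something noisy', file=sys.stderr)"  # noise  -> stderr
)
result = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True, text=True, check=True,
)
arn = result.stdout.strip()  # clean value, no interleaved warnings
```

The shell equivalent is redirecting only stderr (`2>/dev/null` or `2>err.log`) while command substitution captures stdout, which is presumably what the debugging session above converged on.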

**Looking for feedback on:**

* Cost optimization
* Security best practices
* How to read documentation to refactor the CDK app

**GitHub Repo:** [**https://github.com/asim-makes/3-tier-infra**](https://github.com/asim-makes/3-tier-infra)

https://redd.it/1o5gyvr
@r_devops
Need help for suggestions regarding SDK and API for Telemedicine application

Hello everyone,

So currently our team is planning to build a telemedicine application. Like any telemedicine app, it will have chat and video conferencing features.

The backend (Node.js and Firebase) is almost ready, but we are not able to decide which real-time communication SDK and API to use. We're torn between ZEGOCLOUD and Twilio. If anyone has used either before, kindly share your experience. Any other suggestions are also welcome.

TIA.

https://redd.it/1o5h6xs
@r_devops
Which internship should I choose?

Currently just a student in Year 1 trying to break into the field of devops.

In your opinion, if given a choice, which internship would you choose? Platform Engineer or Devops?

I currently have 2 internship options but am unsure which to choose. Any suggestions to help me decide will be greatly appreciated. I have learned technologies from KodeKloud such as GitHub Actions CI/CD, AWS, Terraform, Docker, and K8s, and understand that both internships provide a valuable opportunity to learn.

Option 1: Platform Engineer Intern
Company: NETS (Slightly bigger company, something like VISA but not on the same scale)
Tech: Python, Bash Scripting, VM, Ansible

Option 2: DevOps Intern
Company: (SME)
Tech: CICD, Docker, Cloud, Containerization

Really don't know what to expect from either; maybe someone with more experience can point me in the right direction :)

https://redd.it/1o5gk7d
@r_devops
Our Disaster Recovery "Runbook" Was a Notion Doc, and It Exploded Overnight

The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.

**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.

We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.

* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.

By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:

* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds

We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.
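One way to keep such a runbook "living" is to make the quarterly drills mechanically compare what the runbook pins against what is actually deployed, so drift like the dead OIDC provider or the undocumented CNI surfaces before 2 a.m. A conceptual sketch (component names are invented for illustration):

```python
def runbook_drift(pinned, live):
    """Compare versions pinned in the runbook against what's running.

    Returns a list of human-readable discrepancies; an empty list means
    the runbook still matches reality.
    """
    problems = []
    for component, version in pinned.items():
        actual = live.get(component)
        if actual is None:
            problems.append(f"{component}: in runbook but not deployed")
        elif actual != version:
            problems.append(f"{component}: runbook pins {version}, live is {actual}")
    for component in live.keys() - pinned.keys():
        problems.append(f"{component}: deployed but missing from runbook")
    return sorted(problems)
```

Fed from Helm release metadata on one side and the runbook's pinned manifest on the other, a nonempty result can fail a scheduled CI job, which turns "someone should update the doc" into an alert with an owner.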

**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?

https://redd.it/1o5mdjd
@r_devops
How much of this AWS bill is a waste?

Started working with a big telecom provider here in Canada, these guys are wasting so much on useless shit it boggles my mind

Monthly bill for their cutting edge "tech innovation department" (the in-house tech accelerator) clocks in at $30k/m.

The department is supposed to be leading the charge on using AI to reduce cost, using the best stuff AWS can offer, and "delivering the best experience for the end user".

First day observations:

EC2 is over-provisioned by 50%: the current 50 instances could be cut to 25. No CloudWatch, no logging, no monitoring enabled, and no one can answer "do we need it?" questions.

No one has done any usage analysis over the past 18 months, let alone followed the best practice of re-evaluating every 3-6 months.

There's no performance baseline and no SLAs for any of the services. No uptime guarantee (and they wonder why everyone hates them), no load/response-time monitoring, no cost impact analysis.

No infrastructure as code (i.e., Terraform), no auto-scaling policies, and definitely no red teaming/resilience testing.

I spoke to a handful of architects and no one can point me in the direction of a FinOps team in charge of cost optimization. So basically the budget keeps growing and they keep getting sold to.

I honestly don't know why I'm here.

https://redd.it/1o5toxi
@r_devops
Do homelabs really help improve DevOps skills?

I’ve seen many people build small clusters with Proxmox or Docker Swarm to simulate production. For those who tried it, which homelab projects actually improved your real world DevOps work and which ones were just fun experiments?

https://redd.it/1o5w3sv
@r_devops
How do you keep IaC repositories clean as teams grow?

Our Terraform setup began simple but now every microservice team adds their own modules and variables. It’s becoming messy with inconsistent naming and ownership. How do you organize large IaC repos without forcing everything into a single centralized structure?
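One lightweight option that stops short of centralizing everything is a CI check that enforces only naming, not structure, so each team keeps its own modules but the repo stays navigable. A sketch, assuming a hypothetical `<team>-<purpose>` convention:

```python
import re

# Hypothetical convention: modules live under modules/<team>-<purpose>,
# lowercase with hyphens (e.g. modules/payments-vpc). Enforcing just this
# in CI keeps naming consistent without dictating module internals.
MODULE_NAME_RE = re.compile(r"^[a-z0-9]+-[a-z0-9-]+$")

def invalid_module_names(module_dirs):
    """Return module directory names that break the convention."""
    return [name for name in module_dirs if not MODULE_NAME_RE.match(name)]
```

The same idea extends to ownership: a required CODEOWNERS entry per module directory gives each team authority over its own modules while making "who owns this?" answerable.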

https://redd.it/1o5w3di
@r_devops
Anyone else experimenting with AI assisted on call setups?

We started testing a workflow where alerts trigger a small LLM agent that summarizes logs and suggests a likely cause before a human checks it. Sometimes it helps a lot, other times it makes mistakes. Has anyone here tried something similar or added AI triage to their DevOps process?
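In our experience it helps to keep the agent as a thin, swappable layer so its mistakes stay cheap. A sketch of such a triage step with the log fetcher and LLM call injected as plain functions (all names here are hypothetical, not from any specific product):

```python
def triage_alert(alert, fetch_logs, summarize):
    """Assemble an alert-triage note before a human looks.

    `fetch_logs` and `summarize` are injected; in a real setup `summarize`
    would wrap an LLM call, but the wiring here is illustrative.
    """
    logs = fetch_logs(alert["service"], window_minutes=15)
    summary = summarize(
        f"Alert: {alert['name']} on {alert['service']}\n"
        "Recent logs:\n" + "\n".join(logs[-50:])
    )
    return {
        "alert": alert["name"],
        "suggested_cause": summary,
        "needs_human": True,  # the agent only suggests; a human confirms
    }
```

Keeping the LLM behind a function boundary also makes the triage step testable with fakes, which matters given how often the suggestions are wrong.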

https://redd.it/1o5w30f
@r_devops
Who is responsible for owning the artifact server in the software development lifecycle?

So the company I work at is old but brand new to internal software development. We don’t even have a formal software engineering team, but we have a Sonatype Nexus artifact server. Currently, we can pull packages from all of the major repositories (PyPI, npm, NuGet, Docker Hub, etc.).

Our IT team doesn’t develop any applications, but they are responsible for the “security” of this server, and I feel like they have the settings cranked as high as possible. For example, all Linux Docker images (slim bookworm, Alpine, etc.) are quarantined for things like glibc vulnerabilities where “a remote attacker can do something with the stack”; Python’s pandas is quarantined for deserializing remote pickle files, SQLAlchemy for its loads methods, and everything related to AI like LangChain; all of npm is quarantined because it is a package that allows you to “install malicious code”. I’ll reiterate: we have no public-facing software. Everything is hosted on premise and inside our firewalls.

Do all organizations with an internal artifact server just have to deal with this? Find other ways to do things? Who typically creates the policies that say package x or y should be allowed? If you have had to deal with a situation like this, what strategies did you implement to create a more manageable developer experience?

https://redd.it/1o5zv57
@r_devops
self-hosted AI analytics tool useful? (Docker + BYO-LLM)

I’m the founder of Athenic AI (a tool to explore/analyze data with natural language). I’m toying with the idea of a self-hosted community edition and wanted to get input from people who work with data...

The community edition would be:

* Bring-your-own-LLM (use whichever model you want)
* Dockerized, self-contained, easy to deploy
* Designed for teams who want AI-powered insights without relying on a cloud service

If interested, please let me know:

* Would a self-hosted version be useful?
* What would you actually use it for?
* Any must-have features or challenges we should consider?

https://redd.it/1o5voxu
@r_devops
Rundeck Community Edition

It's been a while since I've looked at Rundeck and, not to my surprise, PagerDuty is pushing people to purchase a commercial license. Looking at the comparison chart, I wonder if the CE is useless. I don't care about support and HA, but not being able to schedule jobs is a deal-breaker for us. Is anyone using Rundeck who can vouch that it is still useful in the free edition? Are plugins available?

What we need
- a self-service center for ad hoc jobs
- job scheduling
- retrying failed jobs
- firing off multiple worker nodes (ECS containers) to run multiple jobs independent of one another

https://redd.it/1o6344v
@r_devops
Need advice — Should I focus on Cloud, DevOps, or go for Python + Linux + AWS + DevOps combo?

Hey everyone,

I’m currently planning my long-term learning path and wanted some genuine advice from people already working in tech.

I’m starting from scratch (no coding experience yet), but my goal is to get into a high-paying and sustainable tech role in the next few years. After researching a bit, I’ve shortlisted three directions:
1. Core Cloud Computing (AWS, Azure, GCP, etc.)
2. Core DevOps (CI/CD, Docker, Kubernetes, automation, etc.)
3. A full combo path — Python + Linux + AWS + basic DevOps

I’ve heard that the third path gives the best long-term flexibility and salary growth, but it’s also a bit longer to learn.
What do you guys think?
• Should I specialize deeply in Cloud or DevOps?
• Or should I build the full foundation first (Python + Linux + AWS + DevOps) even if it takes longer?
• What’s best for getting a high-paying, stable job in 4–5 years?

Would love to hear from professionals already in these roles.

https://redd.it/1o64ct8
@r_devops