Is cost a metric you care about?
Trying to figure out whether DevOps or software engineers should care about building efficient software (AI or not), in the sense of software optimized both for scalability/performance and for cost.
It seems that in the age of AI we're myopically focused on increasing output, not even outcomes. Think about it: suppose you increase productivity, and you have a way to measure it and decide yes, it's up. Is anyone looking at costs as well, just to put things into perspective?
Or is the predominant mindset of companies that cost is a "tomorrow" problem, and growth comes first?
When does cost become a problem, and who's solving it?
🙏🙇
https://redd.it/1o51juz
@r_devops
Simplifying OpenTelemetry pipelines in Kubernetes
During a production incident last year, a client’s payment system failed and all the standard tools were open. Grafana showed CPU spikes, CloudWatch logs were scattered, and Jaeger displayed dozens of similar traces. Twenty minutes in, no one could answer the basic question: which trace is the actual failing request?
I suggested moving beyond dashboards and metrics to real observability with OpenTelemetry. We built a unified pipeline that connects metrics, logs, and traces through shared context.
The OpenTelemetry Collector enriches every signal with Kubernetes metadata such as pod, namespace, and team, and injects the same trace context across all data. With that setup, you can click from an alert to the related logs, then to the exact trace that failed, all inside Grafana.
The full post covers how we deployed the Operator, configured DaemonSet agents and a gateway Collector, set up tail-based sampling, and enabled cross-navigation in Grafana: OpenTelemetry Kubernetes Pipeline
If you are helping teams migrate from kube-prometheus-stack or dealing with disconnected telemetry, OpenTelemetry provides a cleaner path. How are you approaching observability correlation in Kubernetes?
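For reference, here is a minimal sketch of the kind of Collector configuration the post describes, using the k8sattributes and tail_sampling processors from opentelemetry-collector-contrib. The backend endpoint and the team label are placeholders, not the author's actual setup:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  # Enrich every signal with Kubernetes metadata (pod, namespace, team label).
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
      labels:
        - tag_name: team
          key: team
          from: pod
  # Tail-based sampling: keep all failing traces, 10% of the rest.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlphttp:
    endpoint: https://tempo.example.internal:4318  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, tail_sampling]
      exporters: [otlphttp]
```

Note that tail-based sampling needs all spans of a trace to reach the same Collector instance, which is one reason the post routes agents through a gateway Collector.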
https://redd.it/1o5c3bk
@r_devops
Are self-destructing secrets a good approach to securely authenticating a self-hosted GitHub Actions runner?
I created a custom self-hosted Oracle Linux based GitHub runner Docker image. The entrypoint script supports 3 ways of authentication:
* a short-lived registration token from the web UI
* a PAT
* GitHub App auth -> .pem key + installation ID + app ID
Now, the first option is pretty safe to use even as a container env var because it's short-lived. I'm more concerned about the other two. My main gripe is that the container user that runs the GitHub connection service is the same user that runs the pipelines, so anyone who uses the pipelines can use them to read the .pem or the PAT. Yes, you could use GitHub secrets to "obfuscate" the strings, but you always have to remember to do it, and there are other ways to extract them anyway.
So I created a self-destructing secrets mechanism: Docker mounts a local folder as a volume (with full RW permissions in it), and you place private-key.pem or pat.token files there. When the entrypoint.sh script runs, it uses either of them to authenticate the runner, clears the folder, and then starts the main service. If it can't delete the files, it refuses to start.
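A minimal sketch of that entrypoint logic, assuming a hypothetical mount path and a hypothetical get_reg_token helper for the token exchange (config.sh and run.sh are the standard scripts that ship with the GitHub runner):

```bash
#!/usr/bin/env bash
set -euo pipefail

SECRETS_DIR=/run/runner-secrets   # host folder mounted RW into the container

# Exchange whichever credential is present for a short-lived registration
# token (the API call is elided; get_reg_token is a hypothetical helper).
if [[ -f "$SECRETS_DIR/private-key.pem" ]]; then
  REG_TOKEN=$(get_reg_token --app-key "$SECRETS_DIR/private-key.pem")
elif [[ -f "$SECRETS_DIR/pat.token" ]]; then
  REG_TOKEN=$(get_reg_token --pat "$(<"$SECRETS_DIR/pat.token")")
else
  echo "no credentials found in $SECRETS_DIR" >&2
  exit 1
fi

# Register the runner, then self-destruct the secrets before any pipeline
# code can run. If the files cannot be removed, refuse to start.
./config.sh --url "https://github.com/example-org" --token "$REG_TOKEN" --unattended
rm -- "$SECRETS_DIR"/* || { echo "could not clear $SECRETS_DIR, aborting" >&2; exit 1; }

exec ./run.sh
```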
But I feel this is a problem that has already been solved some other way. Even though I couldn't find any info on how to use two different users (one for runner authentication, one for pipelines), this security flaw seems too common for there not to be a better (and more appropriate) way to handle it.
https://redd.it/1o5ctbh
@r_devops
What are the best integrations for developers?
I’ve just started using monday dev for our dev team. What integrations do you find most useful for dev-related tools like GitHub, Slack or GitLab?
https://redd.it/1o5c74n
@r_devops
monday dev vs clickup, why did you make the switch?
We moved from clickUp to monday dev for its simpler interface and better automation. Curious about others’ experiences?
https://redd.it/1o5fjds
@r_devops
Built a 3 tier web app using AWS CDK and CLI
Hey everyone!
I’m a beginner on AWS and I challenged myself to build a production-grade 3-tier web infrastructure using only AWS CDK (Python) and AWS CLI.
**Stack includes:**
* VPC (multi-AZ, 3 public + 3 private subnets, 1 NAT Gateway)
* ALB (public-facing)
* EC2 Auto Scaling Group (private subnets)
* PostgreSQL RDS (private isolated)
* Secrets Manager, CloudWatch, IAM roles, SSM, and billing alarms
Everything was done code-only, no console clicks except for initial bootstrap and billing alarm testing.
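For readers who haven't used CDK: the VPC layer of a stack like this is only a few lines. Here is a sketch in CDK v2 Python (construct IDs and CIDR masks are illustrative, not taken from the repo):

```python
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class NetworkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Multi-AZ VPC with one subnet per tier per AZ, sharing a single
        # NAT Gateway to keep costs down.
        self.vpc = ec2.Vpc(
            self, "AppVpc",
            max_azs=3,
            nat_gateways=1,
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
                ec2.SubnetConfiguration(
                    name="app", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, cidr_mask=24),
                ec2.SubnetConfiguration(
                    name="db", subnet_type=ec2.SubnetType.PRIVATE_ISOLATED, cidr_mask=24),
            ],
        )
```

CDK derives the route tables, NAT routing, and subnet allocation from this one construct, which is likely why the author found subnet/route handling "a breeze".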
**Here’s what I learned:**
* NAT routing finally clicked for me.
* CDK’s abstraction makes subnet/route handling a breeze.
* Debugging AWS CLI ARN capture taught me about stdout/stderr redirection.
**Looking for feedback on:**
* Cost optimization
* Security best practices
* How to read documentation to refactor the CDK app
**GitHub Repo:** [**https://github.com/asim-makes/3-tier-infra**](https://github.com/asim-makes/3-tier-infra)
https://redd.it/1o5gyvr
@r_devops
Why did containers happen? A view from ten years in the trenches by Docker's former CTO Justin Cormack
- Post
- Talk
https://redd.it/1o5h93m
@r_devops
Need help for suggestions regarding SDK and API for Telemedicine application
Hello everyone,
Our team is currently planning to build a telemedicine application. Like any telemedicine app, it will have chat and video conferencing features.
The backend (Node.js and Firebase) is almost ready, but we haven't been able to decide which real-time communication SDK and API to use.
We can't decide between ZEGOCLOUD and Twilio. If anyone has used either before, kindly share your experience. Any other suggestions are also welcome.
TIA.
https://redd.it/1o5h6xs
@r_devops
Which internship should i choose?
Currently just a Year 1 student trying to break into the field of DevOps.
In your opinion, if given a choice, which internship would you choose: Platform Engineer or DevOps?
I currently have 2 internship options but am unsure which to choose. Any suggestions to help me decide would be greatly appreciated. I have learned technologies from KodeKloud such as GitHub Actions CI/CD, AWS, Terraform, Docker and K8s, and understand that both internships provide a valuable opportunity to learn.
Option 1: Platform Engineer Intern
Company: NETS (Slightly bigger company, something like VISA but not on the same scale)
Tech: Python, Bash Scripting, VM, Ansible
Option 2: DevOps Intern
Company: (SME)
Tech: CICD, Docker, Cloud, Containerization
Really don't know what to expect from either, so maybe someone with more experience can point me in a direction :)
https://redd.it/1o5gk7d
@r_devops
Our Disaster Recovery "Runbook" Was a Notion Doc, and It Exploded Overnight
The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.
**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.
We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.
* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.
By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:
* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds
We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.
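In that spirit, here is a sketch of what the first few cheat-sheet entries might look like for an EKS cluster (cluster name and region are placeholders; the real list depends on the stack):

```bash
# 1. Get a kubeconfig for the cluster (or confirm it no longer exists)
aws eks update-kubeconfig --name prod-cluster --region ca-central-1

# 2. Are nodes joining, and are they Ready?
kubectl get nodes -o wide

# 3. Does the node role mapping still point at a live identity?
kubectl -n kube-system get configmap aws-auth -o yaml

# 4. Is the OIDC provider that IRSA depends on still registered?
aws iam list-open-id-connect-providers

# 5. Which critical pods are flapping?
kubectl get pods -A --field-selector=status.phase!=Running
```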
**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?
https://redd.it/1o5mdjd
@r_devops
The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.
**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.
We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.
* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.
By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:
* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds
We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.
**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?
https://redd.it/1o5mdjd
@r_devops
Reddit
From the devops community on Reddit
Explore this post and more from the devops community
Resume Suggestions
I am applying for Cloud Intern / DevOps Intern roles for Summer 2026. This is my resume. Please provide suggestions.
Also, please let me know if any internships are open in your company.
https://redd.it/1o5nscv
@r_devops
How much of this AWS bill is a waste?
Started working with a big telecom provider here in Canada. These guys are wasting so much on useless shit it boggles my mind.
The monthly bill for their cutting-edge "tech innovation department" (the in-house tech accelerator) clocks in at $30k/month.
The department is supposed to be leading the charge on using AI to reduce costs, using the best stuff AWS can offer, and "delivering the best experience for the end user".
First-day observations:
EC2 is over-provisioned by 50%: the current 50 instances could be halved to 25. No CloudWatch, no logging, no monitoring enabled, and no one can answer the "do we need it?" questions.
No one has done any usage analysis over the past 18 months, let alone followed the best practice of re-evaluating every 3-6 months.
There's no performance baseline and no SLAs for any of the services. No uptime guarantee (and they wonder why everyone hates them), no load/response-time monitoring, no cost impact analysis.
No infrastructure as code (i.e. Terraform), no auto-scaling policies, and definitely no red teaming/resilience testing.
I spoke to a handful of architects and no one can point me toward a FinOps team in charge of cost optimization. So basically the budget keeps growing and they keep getting sold to.
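For what it's worth, basic EC2 metrics land in CloudWatch even with no agent installed, so a first pass at that missing usage analysis can be a one-liner per instance (the instance ID is a placeholder; the date syntax is GNU):

```bash
# Average and peak CPU for one instance over the last 14 days.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average Maximum
```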
I honestly don't know why I'm here.
https://redd.it/1o5toxi
@r_devops
Do homelabs really help improve DevOps skills?
I’ve seen many people build small clusters with Proxmox or Docker Swarm to simulate production. For those who have tried it, which homelab projects actually improved your real-world DevOps work, and which ones were just fun experiments?
https://redd.it/1o5w3sv
@r_devops
How do you keep IaC repositories clean as teams grow?
Our Terraform setup began simple but now every microservice team adds their own modules and variables. It’s becoming messy with inconsistent naming and ownership. How do you organize large IaC repos without forcing everything into a single centralized structure?
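One pattern that avoids full centralization, sketched here with a hypothetical org and repo name: modules live in team-owned repos and consumers pin them by release tag, so naming and ownership are enforced at the module boundary rather than in one monorepo:

```hcl
module "payments_service" {
  # Team-owned module repo, pinned to a tag so the owning team can
  # evolve it without breaking downstream consumers.
  source = "git::https://github.com/example-org/terraform-payments.git//modules/service?ref=v1.4.0"

  name        = "payments"
  environment = "prod"
}
```

Ownership then lives with each module repo (CODEOWNERS plus release tags), and the shared part shrinks to a thin naming and tagging standard.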
https://redd.it/1o5w3di
@r_devops
Anyone else experimenting with AI assisted on call setups?
We started testing a workflow where alerts trigger a small LLM agent that summarizes logs and suggests a likely cause before a human checks it. Sometimes it helps a lot, other times it makes mistakes. Has anyone here tried something similar or added AI triage to their DevOps process?
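For shape, here is a minimal sketch of such a triage hook; fetch_recent_logs, llm_complete, and post_to_incident_channel are hypothetical stand-ins for whatever log store, model, and chat tool you run, not real APIs:

```python
# Alert-triggered triage: summarize recent logs and suggest a likely
# cause before the on-call human looks. The three helpers below are
# hypothetical stand-ins to be wired to your own backends.

def fetch_recent_logs(service: str, since_minutes: int, limit: int) -> str:
    raise NotImplementedError("query your log backend here")

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call whatever model you run here")

def post_to_incident_channel(channel: str, text: str) -> None:
    raise NotImplementedError("post to your chat tool here")

def triage(alert: dict) -> str:
    logs = fetch_recent_logs(alert["service"], since_minutes=15, limit=500)
    prompt = (
        f"Alert: {alert['title']}\n"
        f"Recent logs:\n{logs}\n\n"
        "Summarize the failure and suggest the single most likely cause. "
        "Answer 'unclear' if the logs don't support a conclusion."
    )
    summary = llm_complete(prompt)
    # Posted as context for the human, never as auto-remediation.
    post_to_incident_channel(alert["channel"], summary)
    return summary
```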
https://redd.it/1o5w30f
@r_devops
Who is responsible for owning the artifact server in the software development lifecycle?
So the company I work at is old, but brand new to internal software development. We don’t even have a formal software engineering team, but we have a sonatype nexus artifact server. Currently, we can pull packages from all of the major repositories (pypi, npm, nuget, dockerhub, etc…).
Our IT team doesn’t develop any applications, but they are responsible for the “security” of this server. I feel like they have the settings cranked as high as possible. For example, all Linux Docker images (slim-bookworm, alpine, etc.) are quarantined for things like glibc vulnerabilities where “a remote attacker can do something with the stack”… Python’s pandas is quarantined for deserializing remote pickle files, sqlalchemy for its loads methods, everything related to AI like langchain… all of npm is quarantined because it is a package that allows you to “install malicious code”. I’ll reiterate: we have no public-facing software. Everything is hosted on premises, inside our firewalls.
Do all organizations with an internal artifact server just have to deal with this? Find other ways to do things? Who typically creates the policies that say package x or y should be allowed? If you have had to deal with a situation like this, what strategies did you implement to create a more manageable developer experience?
https://redd.it/1o5zv57
@r_devops
self-hosted AI analytics tool useful? (Docker + BYO-LLM)
I’m the founder of Athenic AI (a tool to explore and analyze data with natural language). I'm toying with the idea of a self-hosted community edition and wanted to get input from people who work with data...
The community edition would be:
* Bring-your-own-LLM (use whichever model you want)
* Dockerized, self-contained, easy to deploy
* Designed for teams who want AI-powered insights without relying on a cloud service
If interested, please let me know:
* Would a self-hosted version be useful?
* What would you actually use it for?
* Any must-have features or challenges we should consider?
https://redd.it/1o5voxu
@r_devops
Rundeck Community Edition
It's been a while since I last looked at Rundeck and, not to my surprise, PagerDuty is pushing people to purchase a commercial license. Looking at the comparison chart, I wonder if the CE is useless. I don't care about support and HA, but not being able to schedule jobs is a deal breaker for us. Is anyone using Rundeck who can vouch that it is still useful in the free edition? Are plugins available?
What we need:
- a self-service center for ad-hoc jobs
- scheduled jobs
- retrying failed jobs
- firing off multiple worker nodes (ECS containers) to run multiple jobs independently of one another
https://redd.it/1o6344v
@r_devops
Need advice — Should I focus on Cloud, DevOps, or go for Python + Linux + AWS + DevOps combo?
Hey everyone,
I’m currently planning my long-term learning path and wanted some genuine advice from people already working in tech.
I’m starting from scratch (no coding experience yet), but my goal is to get into a high-paying and sustainable tech role in the next few years. After researching a bit, I’ve shortlisted three directions:
1. Core Cloud Computing (AWS, Azure, GCP, etc.)
2. Core DevOps (CI/CD, Docker, Kubernetes, automation, etc.)
3. A full combo path — Python + Linux + AWS + basic DevOps
I’ve heard that the third path gives the best long-term flexibility and salary growth, but it’s also a bit longer to learn.
What do you guys think?
• Should I specialize deeply in Cloud or DevOps?
• Or should I build the full foundation first (Python + Linux + AWS + DevOps) even if it takes longer?
• What’s best for getting a high-paying, stable job in 4–5 years?
Would love to hear from professionals already in these roles.
https://redd.it/1o64ct8
@r_devops
DevOps experts: What’s costing teams the most time or money today?
What’s the biggest source of wasted time, money, or frustration in your workflow?
Some examples might be flaky pipelines, manual deployment steps, tool sprawl, or communication breakdowns — but I’m curious about what you think is hurting productivity most.
Personally, coming from a software background and recently joining a DevOps team, I find the cognitive load of learning all the tools overwhelming — but I’d love to hear if others experience similar or different pain points.
https://redd.it/1o672nn
@r_devops
Need advice — Physics grad but confused between DevOps, ML, or CFA
Hey everyone,
I graduated this year with a degree in Physics from a good college. I’ve been into coding since childhood — used to mess around on XDA Developers about 10 years ago, making random projects and tinkering with stuff.
This year I took a year off to work on a startup with my friends: we're building a VM provisioning system, and I wrote most of the backend and part of the frontend. Before that, around 3 years ago, I even tried starting something in cybersecurity.
Now I’m kind of stuck deciding where to go next. A few options I’ve been thinking about:
• Doing a Master’s in Physics from IIT (I actually love the subject).
• Doing BCA again, just to strengthen my theoretical CS fundamentals.
• Getting deeper into DevOps, because I really enjoyed working with stuff like Firecracker and Kubernetes during our project.
• Going into Machine Learning, since I already have a good math background and love problem-solving.
• Or maybe even pursuing CFA, because I’ve always been interested in finance and markets too.
I know these fields are pretty different, but they all genuinely interest me in different ways.
What do you guys think — where should I focus next or double down?
https://redd.it/1o67ka8
@r_devops