Reddit DevOps – Telegram
Linux Sysadmin Competency

Hey all! I’ve recently started working in DevOps as a junior engineer. I’ll be handling GHE administration and creating/administering CI/CD workflows, which have priority, and then some basic K8s stuff after those two.

My background: I’m currently on a career switch and took a course on cloud & DevOps.
What can I do to quickly gain the skill set and competency level for a Linux sysadmin role? Which exams should I consider? Which Udemy courses are useful? I’ll be getting a KodeKloud subscription once I’m proficient and moving on to Kubernetes.
I’ll be working in a secure, air-gapped environment.

https://redd.it/1oe58ra
@r_devops
Which bullets are the most impressive?

Which 5-7 of these accomplishments would you prioritize for a senior/lead engineer? I have limited space and want to highlight what's most impressive to hiring managers and technical leaders.

* **Serverless architecture processing 1M+ transformations/month at 300ms latency** \- Built high-performance async content pipeline using AWS Lambda, S3, CloudFront, and httpx
* **Complete product economics infrastructure** \- Designed token-based pricing, gamified leaderboards, affiliate referral system, and usage-based metered billing handling 30K+ API calls/month
* **Multi-tenancy PostgreSQL database design** \- Implemented UUID-based multi-tenancy with SQLAlchemy ORM and Alembic migrations on AWS RDS
* **OAuth2 authentication system** \- Integrated Clerk provider with async httpx client for secure cross-platform identity management
* **£0 to $6.4K monthly revenue in 6 months** \- Architected and monetized the entire platform from scratch
* **34% churn reduction** \- Used behavioral cohort analysis and DynamoDB event tracking to drive data-driven product decisions
* **Stripe payment integration** \- Built complete billing infrastructure with webhook handlers triggering Lambda functions via API Gateway and SQS queues
* **73% deployment time reduction** \- Built automated IaC CI/CD pipelines using AWS CDK, Terraform, and Nx distributed caching across multi-stage environments
* **Production-grade Nx Python monorepo** \- Evolved codebase with clean separation of concerns, dependency injection, and modular boundaries
* **Comprehensive testing suite** \- Unit, integration, and E2E tests with IaC deployment enabling continuous delivery across dev/staging/prod
* **Scaled team from 1 to 5 developers** \- Established technical hiring process and onboarded developers while maintaining code quality
* **Developer experience infrastructure** \- Built Docker containerization and local testing suites enabling team to ship production features
* **GenAI video/image editing automation** \- Implemented AI-powered content pipeline serving production workloads

Over the past 2 years I have been running a bootstrapped company, just adding to it each day; these are the main accomplishments. Which should I include on my resume?

https://redd.it/1oed073
@r_devops
New DevOps engineer — how do you track metrics to show impact across multiple clients/projects?

Hey folks,

I’ve recently been promoted to a DevOps Engineer at a large IT outsourcing company. My team works on a wide range of projects — anything from setting up CI/CD pipelines with GitHub Actions, to managing Rancher Kubernetes clusters, to creating Prometheus/Grafana dashboards. Some clients are on AWS, others on GCP, and most are big enterprises with pretty monolithic and legacy setups that we help modernize.

I love the variety (it’s a great place to learn), but I’m trying to be proactive about tracking my performance and impact — both for internal promotions and for future job opportunities.

The challenge is that since I jump between projects for different clients, it’s hard to use standardized metrics. A lot of these companies don’t track things like “deployment frequency” or “lead time to production,” and I’m not sure what’s realistic for me to track personally.

So I’d really appreciate your help:

What DevOps metrics or KPIs do you personally track to demonstrate your impact?

How do you handle this when working across multiple clients or short-term projects?

Any tips on what to log or quantify so it’s useful later (e.g., for a performance review or a resume)?

I want more oomph than things like “implemented GitHub Actions CI/CD for X project” or “migrated on-prem app to GCP”, a way to make my future work appear more impactful.

Thanks in advance

https://redd.it/1oeeiuu
@r_devops
How is AI changing DevOps?

Hey everyone,

Most of us have been using AI tools in our DevOps work for a while now, and I think we're at an interesting point to reflect on what we're actually learning.

I'm curious to hear from the community:

What's working well? Which AI tools have genuinely improved your workflow? What use cases have been most valuable?

Where are the gaps? What hasn't lived up to the hype? Where do these tools still fall short?

How is the role changing? Are you noticing shifts in where you spend your time or what skills are becoming more important?

Best practices emerging? Have you developed any strategies or approaches that others might benefit from?

I suspect many of us are navigating similar questions about how to stay effective and relevant as the landscape evolves. Would be great to hear what you're all experiencing and how you're thinking about it.

Looking forward to the discussion!

https://redd.it/1oefomy
@r_devops
How do you handle configuration drift in your environments?

We've been facing issues with configuration drift across our environments lately, especially with multiple teams deploying changes. It’s becoming a challenge to keep everything in sync and compliant with our standards.

What strategies do you use to manage this? Are there specific tools that have helped you maintain consistency? I'm curious about both proactive and reactive approaches.
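On the proactive side, one common pattern is a scheduled job that runs `terraform plan` against each environment and flags drift via its exit code. A minimal sketch, assuming Terraform-managed environments under hypothetical `envs/<name>` directories:

```shell
#!/usr/bin/env bash
# `terraform plan -detailed-exitcode` exits 0 when live infrastructure
# matches the declared state, 2 when drift is detected, 1 on error.
set -u

check_drift() {
  # $1: directory containing a Terraform root module
  terraform -chdir="$1" plan -detailed-exitcode -input=false >/dev/null 2>&1
  case $? in
    0) echo "OK: $1 matches declared state" ;;
    2) echo "DRIFT: $1 has diverged from declared state" ;;
    *) echo "ERROR: plan failed for $1" ;;
  esac
}

for env in dev staging prod; do
  check_drift "envs/$env"
done
```

Run it nightly in CI and page (or auto-apply, if you trust it) on the DRIFT lines; the reactive complement is the same command run by hand during incident triage.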

https://redd.it/1oe4q90
@r_devops
AWS us-east-1 outage postmortem

AWS’s retrospective on the DynamoDB disruption in US-East-1 isn’t remarkable because something broke; things break every day.

What stands out is how long it took to see the full extent of the picture, and how predictable that delay was.

A small defect in DNS automation quietly rewrote endpoint records.

To be clear, this wasn’t DNS. It was a latent race condition that surfaced through DNS.

At AWS scale, even something as simple as “which IP should this endpoint resolve to” is managed by layers of automation. DynamoDB’s routing is backed by thousands of load balancers across multiple AZs, with automated systems continuously adjusting DNS records.

That one race condition broke the implicit contract every AWS service in us-east-1 relied on: that DynamoDB would always be reachable.

Everything downstream continued behaving as if everything was still consistent: DynamoDB calls timed out, EC2 provisioning stalled, Network Load Balancers reported bad health checks.

There were no alerts paging that “DNS is down.” But there were a lot of individual alerts paging for other reasons.

At Rootly, we see this pattern everywhere. The hardest part of a major incident isn’t the fix; it’s realizing that ten small, unrelated failures are all symptoms of the same root cause.

Every distributed system runs on invisible contracts: this record will resolve, this endpoint will respond, this region will behave like the others. These invisible contracts are baked into how teams and systems reason about the digital world.

When one breaks silently, the failure can hide behind normal behaviour. Systems keep doing exactly what they were programmed to do, just on the wrong assumptions.

By the time patterns become visible, the real question isn’t “what failed?”; it’s “how many other systems still trust it?”

In this case, the DNS automation bug was just the first crack in a chain of invisible contracts that everyone assumed was safe.

AWS’s DNS automation followed instructions perfectly, as automation does, otherwise why would we automate it? The problem was that the instructions were out of date.

There’s a reason we automate things: automation is great at doing things quickly. That’s an obvious statement. Here’s another one: automation is terrible at deciding whether something should still be done. Otherwise it would be autonomy.

Across large complex systems, we see this dynamic repeatedly. As a matter of fact, Anthropic published a similar retrospective only days ago.

When every safeguard is automatic, you lose the pauses where intuition normally kicks in.

The result isn’t chaos, it’s confidence that everything must be working because no one has said otherwise.

In AWS’s timeline, DynamoDB errors appeared hours before the EC2 and NLB issues were connected to them.

At scale, no single team owns the entire picture.

Each service has its own alerts, escalation policies, and vocabulary.

From inside DynamoDB, it looked like increased error rates.

From inside EC2, provisioning delays.

From NLB, unhealthy targets.

Every team was right. Each picture was just incomplete and missing context.

The coordination overhead of discovering that everyone is actually working on the same problem is massive.

I’ve heard endless stories about organizations spending more time figuring out who should respond rather than fixing what’s actually wrong. That’s not incompetence. Some of the smartest people in the world work at AWS and other large complex companies. It’s just what happens when visibility is local and failure is global.

AWS actually fixed the race condition quickly, but the region didn’t return to steady state for hours.

In my humble opinion and experience, that’s normal. Distributed systems don’t snap back; they drift toward normal states.

If you’re ever part of an outage like this, temper your expectations: recovery is not linear. Your systems aren’t waiting on you; they are re-learning what “healthy” means.

The question after every incident shouldn’t be “How did this happen?”, it should be “How do we recognize it faster next time?”

AWS’s transparency helps remind everyone that even at hyperscale, the fundamentals are the same: boundaries drift, context fragments, automation repeats mistakes perfectly.

Reliability isn’t about stopping that, it’s about building the reflexes to see it sooner, talk about it clearly, and learn from it completely.

https://redd.it/1oej8pw
@r_devops
Finding the Right Audience Without Feeling “Salesy” or Pushy

I’ve been thinking a lot lately about how to genuinely connect with the right audience — whether it’s for a creative project, small business, content channel, or personal brand. There’s so much advice out there about “target demographics” and “Individual DM's,” but sometimes it feels like that turns people into metrics instead of humans.

How do you find and attract the audience who actually resonates with what you do without coming across as pushy or overly promotional?

https://redd.it/1oeksjg
@r_devops
New to Devops - Why Is Everything Structured Differently?

I’m currently transitioning from IT to DevOps at my workplace. So far it’s been going okay, but one thing that confuses me is encountering code that’s structured differently from other code; it’s hard to find consistency. I’m not sure if it’s because I work at a startup, but I constantly have to dig to figure out why one thing has a certain feature enabled while another doesn’t. There are a lot of these “context-specific decisions” in our code base, and with so many namespaces and so many models, it gets difficult to understand. Is this normal?

https://redd.it/1oejuje
@r_devops
Scheduling ML Workloads on Kubernetes

Hey guys. This article covers the NVIDIA Kai-Scheduler, including gang scheduling, bin packing, consolidation, and queue features:

https://martynassubonis.substack.com/p/scheduling-ml-workloads-on-kubernetes

https://redd.it/1oehdnd
@r_devops
Suggestions of tools to improve life quality of a devops engineer

I'm looking for suggestions that will improve my day-to-day operations as a DevOps engineer across the whole stack. For example, a tool or IDE that helps visualize and interact with a K8s cluster. I'm aware of something called Lens IDE but haven't looked too much into it. Or autocompletion/suggestions for Dockerfiles, etc. Anything, really. What is something you are using and would never go back to not using?

https://redd.it/1oebaei
@r_devops
Anyone else feel AI is making them a faster typist, but a dumber developer? 😩

I feel like I'm not programming anymore, I'm just auditing AI output.

Copilot/Cursor is great for boilerplate. It’ll crank out a CRUD endpoint in seconds. But then I spend 3x the time trying to spot the subtle, contextual bug it slipped in (e.g., a tiny thread-safety issue, or a totally wrong way to handle an old library).

It feels like my brain’s problem-solving pathways are atrophying. I trade the joy of solving a hard problem for the anxiety of verifying a complex, auto-generated one. This isn't higher velocity; it's just a different, more draining kind of work.

Am I alone in feeling this cognitive burnout?

https://redd.it/1oepjg3
@r_devops
Spent 40k on a monitoring solution we never used.

The purchase decision:
\- Sales demo looked amazing
\- Promised AI-powered anomaly detection
\- Would solve all our monitoring problems
\- Got VP approval for 40k annual contract

What happened:
\- Setup took 3 months
\- Required custom instrumentation
\- AI features needed 6 months of data
\- Dashboard was too complex
\- Team kept using Grafana instead

One year later:
\- Login count: 47 times
\- Alerts configured: 3
\- Useful insights: 0
\- Money spent: $40,000

Why it failed:
\- Didn't pilot with smaller team first
\- Bought for features, not current needs
\- No champions within the team
\- Too complex for our maturity level
\- Existing tools were good enough

Lesson: Enterprise sales demos show what's possible, not what you need. Start with free tools and upgrade when you feel the pain.

(https://x.com/brankopetric00/status/1981484857440993523)

https://redd.it/1oeqkvs
@r_devops
Auto scaling RabbitMq

I am busy working on a project to replace our AWS managed RabbitMQ service with RabbitMQ hosted on an EC2 instance. We want to move away from the managed service due to the mandatory maintenance window imposed by AWS.

We are a startup, so money is tight, and I am looking to do this in the most cost-effective manner.

My current thinking is to have one dedicated reserved instance that runs 24/7, then an ASG that can spin up a spot instance or two when we have a message storm.
We are an IoT company, and when the APN blips, all our devices reconnect at once, causing our current RabbitMQ service's CPU to spike.

So I would like an extra node to spin up, assist the primary node with processing, and then gracefully scale down again, leaving us with a single-instance Rabbit.

Is Rabbit built to handle this type of thing? I am getting contradictory information and am looking to hear from someone else who has gone down this route before.

Any advice or experience is welcome.
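Clustering-wise, the join/leave lifecycle is scriptable with `rabbitmqctl`. A rough sketch of what the spot node's user-data and the scale-in hook might run, assuming a shared Erlang cookie baked into the AMI (node names below are placeholders):

```shell
#!/usr/bin/env bash
# Join a freshly launched spot node to the cluster, and retire it cleanly
# on scale-in so the primary is left as a single-node cluster again.

join_cluster() {        # run on the new spot node at boot (user-data)
  local primary="$1"    # e.g. rabbit@rabbit-primary
  rabbitmqctl stop_app
  rabbitmqctl join_cluster "$primary"
  rabbitmqctl start_app
}

retire_node() {         # run from the primary when the ASG scales in
  local node="$1"       # e.g. rabbit@rabbit-spot-1
  rabbitmqctl -n "$node" stop_app
  rabbitmqctl forget_cluster_node "$node"
}
```

An ASG lifecycle hook on termination gives you a window to call `retire_node` before the instance dies. Note that queues hosted on the departing node need to drain or be replicated (e.g. quorum queues) for this to be truly graceful, and spot interruptions only give a two-minute warning.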


https://redd.it/1oeqo8r
@r_devops
A fast, private, secure, open-source S3 GUI

Since the web interfaces for Amazon S3 and Cloudflare R2 are a bit tedious, a friend of mine and I decided to build nicebucket, an open-source alternative using Tauri and React, released under the GPLv3 license.

I think it is useful for anyone who works with S3, R2, or any other S3 compatible service. We do not track any data and store all credentials safely via the native keychains.

We are still quite early so feedback is very much appreciated!

https://redd.it/1oeql17
@r_devops
Built a desktop app for unified K8s + GitOps visibility - looking for feedback

Hey everyone,

We just shipped something and would love honest feedback from the community.

What we built: Kunobi is a new platform that brings Kubernetes cluster management and GitOps workflows into a single, extensible system — so teams don’t have to juggle Lens, K9s, and GitOps CLIs to stay in control.

We make it easier to use Flux and Argo by enabling seamless interaction with GitOps tools. We’ve focused on addressing pain points we’ve faced ourselves — tools that are slow, memory-heavy, or just not built for scale.

Key features include:

Kubernetes resource discovery
Full RBAC compliance
Multi-cluster support
Fast keyboard navigation
Helm release history
Helm values and manifest diffing
Flux resource tree visualization

[Here’s a short demo video for clarity.](https://youtu.be/y0m5L_XqGps?si=CSKS5Dqby-NqIixH)

Who we are: Kunobi is built by Zondax AG, a Swiss-based engineering team that’s been working in DevOps, blockchain, and infrastructure for years. We’ve built low-level, performance-critical tools for projects in the CNCF and Web3 ecosystems - Kunobi started as an internal tool to manage our own clusters, and evolved into something we wanted to share with others facing the same GitOps challenges.

Current state: It’s rough and in beta, but fully functional. We’ve been using it internally for a few months.

What we’re looking for:

Feedback on whether this actually solves a real problem for you
What features/integrations matter most
Any concerns or questions about the approach

Fair warning — we’re biased since we use this daily. But that’s also why we think it might be useful to others dealing with the same tool sprawl.

Happy to answer questions about how it works, architecture decisions, or anything else.

🔗 https://kunobi.ninja — download the beta here.

https://redd.it/1oetwyc
@r_devops
MongoDB pod doesn't create user inside container

This is my MongoDB manifest YAML file. When the pod runs successfully, I check inside the MongoDB container and my user has not been created, despite adding mono-init.js to the docker-entrypoint-initdb.d folder.
When I do the same with docker-compose, everything is OK!
How do I fix this issue? Please help me.
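Without seeing the manifest it's hard to be sure, but a very common cause: the official mongo image only runs /docker-entrypoint-initdb.d scripts when the data directory starts out empty, and a reused PersistentVolumeClaim defeats that, so the init script is silently skipped. A hypothetical checking helper (pod name is a placeholder):

```shell
# Verify the init script is actually mounted, and whether /data/db was
# already populated before the entrypoint ran (if so, init scripts are
# skipped by the official mongo image).
check_mongo_init() {
  local pod="$1"   # e.g. mongodb-0
  kubectl exec "$pod" -- ls /docker-entrypoint-initdb.d/
  kubectl exec "$pod" -- ls /data/db
}
```

If the volume already had data, in dev you can delete the PVC (it wipes the data) and recreate the pod so the entrypoint runs the script; otherwise create the user manually with `mongosh`.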

https://redd.it/1oeuvm5
@r_devops
Real-world production experience for Ansible on a CV

Hi all,

I have a network engineering background.
I have done playbooks for network devices, mainly F5.
But I was contacted for an Ansible job, so I need to add more "system" or DevOps kinds of projects.
Can you give me ideas of what you are doing in production, so I can do it myself and put it on my CV?
Would an Ansible certificate be useful? I have the basics.



https://redd.it/1oetwcf
@r_devops
Only allow specific country IP range to SSH

Hi, may I know the simplest way to allow only a specific country's IP ranges to access SSH on my VPS?

I prefer using UFW rather than iptables, because I am a newbie and afraid that drilling down into iptables will mess things up.

I am reading this post, but I'm not sure if it's valid for Ubuntu:

https://blog.reverside.ch/UFW-GeoIP-and-how-to-get-there/
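The approach in that post can be sketched without touching iptables directly: fetch a per-country list of CIDR blocks and turn each into a UFW allow rule, with a catch-all deny last. A minimal sketch, assuming a zone file with one CIDR per line (ipdeny.com publishes these; the country code and URL are assumptions, and verify any list source before trusting it):

```shell
#!/usr/bin/env bash
# Emit the UFW commands instead of running them, so they can be reviewed
# first; pipe to `sudo sh` when happy. Zone file: one CIDR per line.
emit_ssh_rules() {
  local zonefile="$1"
  while IFS= read -r cidr; do
    [ -n "$cidr" ] && echo "ufw allow from $cidr to any port 22 proto tcp"
  done < "$zonefile"
  # UFW matches rules in the order added, so the catch-all goes last
  echo "ufw deny 22/tcp"
}

# Example usage:
# curl -fsSL https://www.ipdeny.com/ipblocks/data/countries/sg.zone -o sg.zone
# emit_ssh_rules sg.zone | sudo sh
```

One caveat: don't apply the deny rule over an active SSH session from an IP outside the list, and remember geo-IP data drifts, so refresh the rules periodically.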

https://redd.it/1oexn4l
@r_devops
our postmortem from last week just identified the same root cause from june

had database connection pool exhaustion issue last tuesday. took three hours to fix. wrote the postmortem yesterday and vp pointed out we had the exact same issue in june.

pulled up that postmortem. action items were increase pool size and add better monitoring. neither happened because we needed to ship features to stay competitive.

so we shipped features for four months while the known prod issue sat unfixed. then it broke again and leadership acted shocked.

now they want to know why we keep having repeat incidents. maybe because postmortem action items go into backlog behind feature work and nobody looks at them until the same thing breaks again.

third time this year we've had a repeat incident where the fix was documented but never implemented. starting to wonder why we even write postmortems if nothing changes.

how do you actually get action items prioritized or is this just accepted everywhere?

https://redd.it/1oeyqqd
@r_devops
Database branches to simplify CI/CD

Careful, some self-promo ahead (but I genuinely think this is an interesting topic to discuss).

In my experience, failed migrations and database differences between environments are among the most common causes of incidents. I have had failed deployments, half-applied migrations, and even full-blown outages because someone didn't consider the legacy null values that were present in production but not in dev.

Many devs think "down migrations" are the answer to this, but they are hard to get right, since a rollback of the code usually also removes the migration code from the container.

I work at Tiger Data (formerly Timescale), and this week we released a feature to fork an existing database. I wasn't involved in the development of the underlying tech, but it uses a copy-on-write mechanism that makes the process complete in under a minute. IMO these kinds of features are a great way to simplify CI/CD and prevent issues such as the ones I mentioned above.

Modern infrastructure like this (e.g. Neon also has branches) actually offer a lot of options to simplify CI/CD. You can cheaply create a clone of your production database and use that for testing your migrations. You can even get a good idea of how long it will take to run your migrations by doing that.

Of course, you'll also need to clean up again and figure out whether the additional cost of automatically running a DB instance in your workflow is worth it. You could in theory go even further and use the mechanism to spin up a complete test environment for each PR a developer creates, similar to how this is often done for frontend changes, in my experience.
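A per-PR migration check along those lines might look roughly like this in CI. `dbctl` is a stand-in for whatever branching CLI the provider ships (Neon's and Tiger Data's real commands differ), and it assumes the migration tool reads `DATABASE_URL`:

```shell
# Fork production, run migrations against the fork, then tear it down.
test_migrations_on_branch() {
  local pr="$1"
  local branch="migration-test-pr-${pr}"

  dbctl branch create --from production --name "$branch"
  local url
  url="$(dbctl branch connection-string "$branch")"

  # runs against production-shaped data, so legacy nulls and rough
  # timing estimates surface here instead of during the deploy
  DATABASE_URL="$url" alembic upgrade head

  dbctl branch delete "$branch"
}
```

Wiring this into the PR pipeline also solves the cleanup question: delete the branch when the check finishes or the PR closes.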

In practice, a lot of the CI/CD setups I have worked with at other companies are really dusty and do not take advantage of the capabilities of the available infrastructure. It's also often hard to get buy-in from decision makers to invest time in this kind of automation. But when it works, it is downright beautiful.

https://redd.it/1of09uc
@r_devops