Reddit DevOps – Telegram
Question: Version Bumping and Automating Releases

I work at a small company (2 person dev team) and there are no real protocols in place for version control or CI/CD. It's basically very smart scientists creating tools to aid R&D and QA on our product.

I don't want to re-invent the wheel, but I also want to take advantage of the freedom I have at work to learn how these processes and tools come about.

Our entire tech stack is basically Python using PyQt to make Windows desktop applications (yes, I'm developing entirely on Windows).

The workflow I've come up with is the following:
- Versions tracked in a .py file
- referenced by my PyInstaller .spec file and my main.py, to update the title bar version and the file name version after compiling
- I have a script that bumps the version on dev when I'm ready to put out a new release (rough sketch below)
- it accepts major, minor, or patch as input to determine how the version is bumped
- The script pushes the tag to main, which then triggers a GH Actions workflow
- the GH Actions workflow compiles the app and creates a release with a changelog generated from commits between version tags
- (e.g. a summary of commits between v1.0.0..v1.1.0)
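
For reference, here's a rough sketch of what that bump script looks like; the file name, version format, and git commands are simplified placeholders rather than my exact script:

# bump_version.py - simplified sketch of the version bump + tag flow described above
import re
import subprocess
import sys

VERSION_FILE = "version.py"  # assumed to contain a line like: __version__ = "1.2.3"

def bump(part: str) -> str:
    text = open(VERSION_FILE).read()
    match = re.search(r'__version__\s*=\s*"(\d+)\.(\d+)\.(\d+)"', text)
    if not match:
        sys.exit(f"could not find a __version__ string in {VERSION_FILE}")
    major, minor, patch = map(int, match.groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        sys.exit("usage: bump_version.py [major|minor|patch]")
    new_version = f"{major}.{minor}.{patch}"
    with open(VERSION_FILE, "w") as f:
        f.write(f'__version__ = "{new_version}"\n')
    return new_version

if __name__ == "__main__":
    version = bump(sys.argv[1] if len(sys.argv) > 1 else "patch")
    # Commit the bump and push an annotated tag; the tag push is what the
    # GH Actions release workflow triggers on.
    subprocess.run(["git", "commit", "-am", f"Bump version to {version}"], check=True)
    subprocess.run(["git", "tag", "-a", f"v{version}", "-m", f"Release v{version}"], check=True)
    subprocess.run(["git", "push", "--follow-tags"], check=True)
    print(f"Tagged v{version}")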

I'm trying to implement a git flow branching system, but have not incorporated release branches yet.

Here's some ASCII art from Claude (with a review and edits from me) attempting to demonstrate my release workflow from what I described (reading bottom to top, like git log):
*   Merge main back into dev - sync release v1.2.0                   (HEAD -> dev)
|\
| * v1.2.0 - release tagged on main (release created on GH here) (tag: v1.2.0, main)
| |\
| | * Merge dev into main for release v1.2.0
| |/
| * QA complete on dev (dev)
| * Merge feat/fix into dev
| |\
| | * Implement feature X (feat/fix)
| | * Branch feat/fix created from dev
| |/
* Dev baseline before feature work


I know the workflow is missing release branches, where I would ideally go like the following:
feat -> dev -> release -> dev    (release branch merged back into dev)
                   `----> main   (GH release created from main)
                             `-> hotfix (if needed)


My question is mostly about the automation of all the above workflows. How are people managing versions? Given my stack, is a .py file a reasonable/professional approach?

Could I offload more of this process to GH Actions, for example, and have, say, a script just called release.py or release.sh that triggers this entire process?
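
For the changelog half, the piece I'd hope a release.py could own is basically just this (the tag names are examples; a real script would discover the previous tag itself):

# changelog.py - sketch of building release notes from commits between two tags
import subprocess

def changelog(previous_tag: str, new_tag: str) -> str:
    # One bullet per commit subject between the two tags, oldest first.
    log = subprocess.run(
        ["git", "log", "--reverse", "--pretty=format:- %s", f"{previous_tag}..{new_tag}"],
        capture_output=True, text=True, check=True,
    )
    return log.stdout

if __name__ == "__main__":
    print(changelog("v1.0.0", "v1.1.0"))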

https://redd.it/1of1sa6
@r_devops
Outsider Curiosity - Outages

I sat through the Alaska Airlines “IT outage” yesterday and it got me very curious about how these situations get managed behind the scenes.

I’m very curious to know how many people are involved in troubleshooting/debugging something like that. Is there a solid staff that’s scheduled around the clock that can be trusted? Or does the company have to call in the savant no matter what time of day it is? Intuitively I feel like this could potentially be a “too many cooks in the kitchen” situation if the task isn’t handed over to a select group.

Are you clocking overtime during these situations, or is everyone salaried and just has to suck it up? Are the suits breathing down your neck during an outage, or do they give you some space to work?

I feel like there must be some good insider stories here that I haven’t heard/read before. Feel free to link me any reading. Apologies if this is a common post in this sub, it’s just been on the front of my mind since last night.

https://redd.it/1of4qje
@r_devops
What Happened This Week at AWS – Full Technical Breakdown

Step 1: Routine Update and Rare Automation Failure

During the night between October 19 and 20, a routine update was performed on the core DynamoDB API system in the us-east-1 region.
The update triggered an internal automated process responsible for synchronizing DNS records between servers in different data centers.
During this synchronization, a rare race condition occurred, where two systems simultaneously wrote conflicting information to the internal DNS table.
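
As a purely conceptual illustration (this is not AWS's actual implementation), that kind of race boils down to two uncoordinated writers updating the same record, with the surviving value depending only on scheduling order:

# toy_race.py - conceptual last-write-wins race between two DNS "planners";
# names and values are made up, not AWS internals.
import threading

dns_table = {}  # shared record store with no locking or versioning

def writer(value: str):
    # Each writer believes it holds the latest plan and overwrites the record;
    # whichever thread runs last silently clobbers the other's update.
    dns_table["dynamodb.internal"] = value

threads = [
    threading.Thread(target=writer, args=("10.0.0.1",)),
    threading.Thread(target=writer, args=("10.0.0.2",)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(dns_table)  # which value "wins" depends on thread scheduling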

Step 2: DNS Failure – Loss of Synchronization Between Servers

Because of this error, one of AWS’s internal DNS management systems lost synchronization with several regional DNS servers.
As a result, services trying to access DynamoDB using internal domain names were unable to resolve them to IP addresses, causing all calls to the DynamoDB API to fail.
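
From a client's point of view, that failure mode is simply name resolution erroring out before any API call is even attempted; a minimal illustration (the hostname below is a made-up placeholder, not a real AWS endpoint):

# dns_failure_demo.py - what a name-resolution outage looks like from the client side
import socket

endpoint = "dynamodb.us-east-1.example.internal"  # hypothetical internal name

try:
    socket.getaddrinfo(endpoint, 443)
except socket.gaierror as exc:
    # With no DNS answer, every SDK call against this endpoint fails at
    # connection time, regardless of whether the service itself is healthy.
    print(f"cannot resolve {endpoint}: {exc}")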

Step 3: A Cross-Service Chain Reaction

DynamoDB is a foundational service relied upon by many other components, including Lambda, EC2, SQS, Step Functions, and Redshift.
When DynamoDB stopped responding, massive backlogs and timeouts started to build up in dependent services.
This created abnormal loads and partial outages across multiple cloud systems.

Step 4: Impact on the Internal Network Fabric

At the same time, the NLB Health Monitor, which is responsible for detecting healthy servers and routing internal traffic, received faulty data due to the DNS failure.
As a result, active servers were mistakenly marked as unavailable and removed from the network fabric, worsening the incident and increasing pressure on the remaining infrastructure.

Step 5: Detection and Containment

At 02:01 AM PDT (12:01 PM Saudi Arabia time), AWS teams identified that the issue originated from DNS resolution of the DynamoDB API.
By 02:24 AM, the chain reaction was contained, the DNS records were fixed, and the API returned to normal operation.
However, dependent services like Lambda and SQS required several more hours to rebuild queues and stabilize.

Step 6: Recovery and Lessons Learned

As a precautionary measure, AWS temporarily limited new EC2 launches to stabilize the system and prevent further cascading failures.
By the evening (US time), AWS reported that all services had fully recovered.
According to AWS, the root cause was faulty internal automation that led to DNS desynchronization, not a cyberattack.

Root Cause Summary

A faulty automation process and a race condition in the internal DNS synchronization system caused loss of name resolution for the DynamoDB API in us-east-1.
This triggered a widespread chain reaction, impacting many other services in the cloud infrastructure.
The incident lasted approximately nine hours, leading to outages or performance degradation in major applications like Snapchat, Reddit, and Fortnite, along with over a hundred additional services.
AWS announced plans to strengthen its monitoring and automation mechanisms to reduce the likelihood of similar incidents in the future.


https://redd.it/1of5763
@r_devops
I have an interview lined up for devops engineer 1 need guidance

Hey folks, I have a DevOps engineer interview lined up (tech stack is GCP and GKE). I have 1 year of experience as an SRE and no experience with cloud, as my current org is on-prem.
I am not sure how to approach the preparation: should I be honest and say I don't have hands-on experience with cloud tools but am familiar with the concepts, and just revise my basics?
Or
Should I try some hands-on experiments with these tools? I only have about 1 week until the interview.
Anyone with similar experience of switching from on-prem to cloud, please let me know how you approached it.

Any relevant study material is highly appreciated

https://redd.it/1of4u8w
@r_devops
Adding my on-call shifts into my private calendar? Looking for best practices

Hey all,

Are you pushing your on-call shifts from your incident response tool (e.g. PagerDuty/Opsgenie/FireHydrant) into your personal calendar, or do you keep it 100% in your professional calendar?

Asking for best practices from the community. Adding it to my personal calendar feels like work will completely take over my private life. But I guess that's just the way it is?

https://redd.it/1of811q
@r_devops
Is RHCE enough for jr DevOps?

Sorry, I've been depressed due to family circumstances, so I'm just trying to find motivation to push forward, since my Red Hat cert expires on November 15th. I started in support at an MSP in 2020, then spent a year earning the CCNA, 2 years for the RHCSA, and put in around 6 months for CCNP ENCOR until I realized I was going in 2 different directions. I use GNS3 to lab everything to memory, since COVID allowed remote work.

But I didn't find a lot of opportunities; it seems the Linux role became DevOps operations, so I decided to go for the RHCE. I feel I'm close, though I've been on this certificate wheel for so long, while my sister will be graduating with a bachelor's in registered nursing soon. I couldn't afford college since I had to support my family, but I loved learning; in fact, my curiosity from my practice labs led me to encounter linting (hence why CI/CD is needed), which Cisco encourages under DevNet, so that was something on the roadmap. Now it does feel like I just wasted my 20s, when so many HR filters screen you out for degrees anyway. Anyway, besides that rant, it seems like it's never enough, at least to leave the proverbial helpdesk.

So I want to check: would the RHCE be the turning point to begin? I don't know how hard finding entry-level DevOps roles would be, but I don't know where I'll be in the next few months, whether I'll be living alone or under a bridge. I'm not asking for a 7-figure role, just somewhere I could progress and feel there's something to push toward.

https://redd.it/1ofalho
@r_devops
List of my job interview experiences

A while ago I found myself in the sudden predicament of needing to find a new role. I interviewed for multiple Platform Engineer roles at companies in London and wish to share my experiences. Feel free to add any of your anonymous experiences in the comments:

- Loadsure - recruiter call, ghosted, role was filled

- Checkatrade - final stage, senior engineer had attitude issues, feedback was word spaghetti.

- Lifi - ghosted

- GSS - nice call, comp too low

- Appvia - weird; recruiter call, rejected due to "not using AWS enough recently". I've split the last decade across all 3 main providers... a good engineer can adapt?

- FDM - passed tech test, comp too low

- Mubi - more of an architectural tech test, felt good vibes, ghosted

- Zyte - ghosted

- NTT Data - comp too low

- Lightricks - 5 stages + take home, lowball comp, mega waste of time

- Citibank - surprisingly nice folk, 3 stages, ghosted, big fans of Golang

- WWT - good interview, job freeze

- anon trading fintech- 4 stages, offer, deep interview but fair

- brutal fintech - harsh grilling, immediate offer

- Trailmix games - comp too low

- Blackrock - offer, very deep interview

- Mastercard - offer, nice folk

- Balyasny - hedgefund lottery, talk to 5 people, ghosted

- JP Morgan - Senior VP with huge attitude problems. Staring at different screens and sighing. Worst of them all by far. Felt like a lecture, should we all just memorise ciphersuites and talk about multicasting? Ego trip

- Lloyds bank, fun but too long drawn out, comp lowball

- Synechron, good vibe, ghost

- Fasanara, hedge fund, brutal multi-round in-person interview; feedback: wanted CDK experience... but tested me on Terraform? Circus

- Zencore, perfect match, comp too low

- Nucleus security, good vibe, ghosted

- MUFG, ghosted

- Palantir - auto rejection email

- US Bank - auto rejection email

- BCG - auto rejection email

- Vitol - auto rejection email

- DRW - hire freeze

- PA Consulting - hire freeze

- IG Group - auto rejection email

A couple I can't mention, but in the end the offer I accepted ended up being from the nicest interview process. Interviewing is exhausting, and frankly in 2020 I'd have walked into a role. Stay strong to all those on their search.

Advice to companies: you don't realise it, but you might be the candidate's 7th interview of the week. Cut to the chase and make hiring processes short and to the point... and pay if you want talent.

https://redd.it/1ofgpal
@r_devops
Which job should I take?

Long story short I was made redundant 3 months ago and finally got a job offer on Wednesday only to then get another offer yesterday.

Company A is a smaller startup who offered me the same salary I was on in my previous role. It’s the first job of its type in Europe and has a lot of potential to move into a team lead/management role which is something that would interest me. When I told them I had a second offer they didn’t increase theirs (yet). I got a phone call from the guy that would be my manager and he was totally understanding about the situation.

Company B offered me 20% more and is a huge global consultancy firm. The work would probably be easier and they would be sponsoring me to get security clearance. When I told them I already had another offer I was planning to take they wouldn’t take no as an answer and kept calling me constantly throughout the day to ask if I would accept, being really quite rude at times.

Am I stupid for thinking about taking the more difficult job which would pay me 20% less? I just feel like if I take the easy job, I'll likely still be doing the same thing in 10 years if I'm still there, whereas at the smaller company I'd have a lot more impact and ownership, with more potential to grow in my career. Their responses to the competing offers are pushing me towards company A as well.

But 20% is a lot of money; not life-changing, but when you’ve been out of a job for 3 months it makes it very tempting.


https://redd.it/1oflpyl
@r_devops
Istio external login

Hello, I have a Kubernetes cluster and I am using Istio. I have several UIs such as Prometheus, Jaeger, Longhorn UI, etc. I want these UIs to be accessible, but I want to use an external login via Keycloak.

When I try to access, for example, Prometheus UI, Istio should check the request, and if there is no token, it should redirect to Keycloak login. I want a global login mechanism for all UIs.

In this context, what is the best option? I have looked into oauth2-proxy. Are there any alternatives, or can Istio handle this entirely on its own? Based on your experience with similar systems, can you explain the best approach and the important considerations?
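
Conceptually, the behaviour I'm describing is the check below, whether it's Istio's external authorization plus oauth2-proxy or something else that performs it; all URLs, realms, and client IDs here are placeholders:

# auth_gate.py - toy sketch of the "no token -> redirect to Keycloak" decision
from urllib.parse import urlencode

KEYCLOAK_AUTH = "https://keycloak.example.com/realms/infra/protocol/openid-connect/auth"
CLIENT_ID = "ui-gateway"                                  # hypothetical OIDC client
CALLBACK = "https://auth.example.com/oauth2/callback"     # hypothetical callback URL

def check_request(headers: dict, original_url: str):
    """Return (status, headers): 200 to pass the request through, 302 to go log in."""
    token = headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if token:
        # A real implementation would also validate the JWT signature,
        # issuer, audience, and expiry before letting the request through.
        return 200, {}
    query = urlencode({
        "client_id": CLIENT_ID,
        "response_type": "code",
        "scope": "openid",
        "redirect_uri": CALLBACK,
        "state": original_url,   # so the user lands back on the UI they asked for
    })
    return 302, {"Location": f"{KEYCLOAK_AUTH}?{query}"}

print(check_request({}, "https://prometheus.example.com/graph"))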

https://redd.it/1ofmicc
@r_devops
Tool for file syncing

I just joined a company and they have an NFS server that has been running for over 10 years. It contains files for thousands of sites they serve. Basically, the docroot of NGINX (another server) uses this NFS to find the root of the sites.

The server also uses ZFS (but no mirror).

It gets restarted maybe 3-5 times a year and no apparent downtime.

Unfortunately the server is getting super full and free space is approaching 10%. Deleting old snapshots no longer solves the problem, as we need to keep 1 month's worth of snapshots (it used to be 12 months, gradually reduced because no one wanted to address this issue until now).

They need to keep using NFS. The Launch Template (used by an AWS ASG) uses user data to bring ZFS back up with the existing EBS volume. If I manually add more volumes, that'll be lost during the next restart. The system is so old I can't install the same versions of the tools to create a new golden image, not to mention the user data also uses the AWS CLI to reuse the IP, etc.

So my question is: would it be a good idea to provision a new, larger NFS setup, but this time with 3 instances? I was thinking of using GlusterFS (it's the only tool I know for this) to keep replicas of the files, because I'm concerned about this being a single point of failure. ZFS snapshots would help with data recovery to some point, but they won't deal with NFS, Route 53, etc., and I'm not sure whether snapshots from a very old ZFS version work with newer versions.

My idea is having 3 NFS instances, in different AZs, equally provisioned (using ZFS too, for snapshots), but with 2 in standby. If one fails, I update the internal DNS to point at one of the standby ones (rough sketch below). No more logic in user data.
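
Roughly, that failover step would be a single Route 53 UPSERT; a sketch with boto3, where the zone ID, record name, and IP are all placeholders:

# nfs_failover.py - sketch of pointing the internal NFS name at a standby instance
import boto3

route53 = boto3.client("route53")

def point_nfs_at(standby_ip: str):
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",     # hypothetical private hosted zone
        ChangeBatch={
            "Comment": "Fail over NFS to standby instance",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "nfs.internal.example.com",
                    "Type": "A",
                    "TTL": 60,   # keep the TTL low so clients pick up the change quickly
                    "ResourceRecords": [{"Value": standby_ip}],
                },
            }],
        },
    )

if __name__ == "__main__":
    point_nfs_at("10.0.2.15")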

To keep the files in sync I'd use GlusterFS, but with 1,200 GB of many small files in a ton of folders with a deep tree, I'm not sure whether there's a better tool for replication or whether I should try block replication.

I also used it long ago; I can't remember if I can only replicate in one direction (server A to B, B to C) or if I can keep A to B and C, B to A and C, and C to A and B. That would probably help if I ever change the DNS for the NFS.

They also prefer to avoid the vendor lock-in of EBS-related solutions like multi-AZ.

Am I too far from a good solution?

Thanks.

https://redd.it/1ofmib1
@r_devops
validate idea for portfolio project

For a portfolio at 4 YOE, does a project like a bank seem too childish?

Like I am trying to build a bank simulator, where people can do dummy transactions but apart from money everything is real.

Currently, I don't have any projects that I can showcase.

DevOps things
- 5 microservices in different languages
- Harden it as much as possible with APIM and App Gateway, and deploy it onto AKS
- CI/CD pipelines, probably using templates and multi-architecture builds
- proper monitoring

PS - trying to build everything from scratch

https://redd.it/1ofm3ql
@r_devops
Need a mentor or partner to learn devops

Hey, I am looking for someone to be my mentor or partner to learn DevOps. I am a beginner; if anyone can DM me, we can get connected.

https://redd.it/1ofpqyz
@r_devops
Who are the most dependable enterprise software development companies in North America?

I’m doing some research to help a mid-sized company find a partner for a custom enterprise build, something beyond a basic web app.

The challenge is that tons of agencies say they build enterprise systems, but when you dig in, most don’t actually have experience with complex integrations, scaling, or long-term maintenance.

If you’ve worked with a team that genuinely delivered on enterprise quality, solid architecture, documentation, and post-launch support, who would you recommend?

Open to both US-based and nearshore teams that have proven experience with enterprise-scale work.

https://redd.it/1ofv2or
@r_devops
What is AWS Amplify?

It seems like a very packaged service, and I usually don't like those: they're good for the first 2 weeks, but then when you need anything more custom, it gets in the way of what you can build.

What is another option for deploying React/Next.js front ends?


Edit: I am using AWS CDK - everything via IaC.

https://redd.it/1ofz51c
@r_devops
Spent too much time stripping down a base image to reduce CVEs and now it breaks on every update. How do you maintain custom containers long-term?

So I went down the rabbit hole of manually removing packages from ubuntu:latest to cut down our CVE count. Got it from 200+ vulns to like 30. Felt pretty good about myself.

Fast forward 2 weeks and every apt update breaks something different. Missing deps, broken symlinks, you name it. Now I'm spending more time babysitting this thing than I saved.

Anyone know a better way to do this? I see people talking about distroless but not sure if that fits our use case. What's your approach for keeping images lean without the maintenance nightmare?

https://redd.it/1og11pb
@r_devops
How do smaller teams manage observability costs without losing visibility?

I’m very curious how small teams, or those without an enterprise budget, handle monitoring and observability trade-offs.

For example, tools like Datadog, New Relic, or CloudWatch can get pricey once you start tracking everything, but when I start trimming metrics, it always feels risky.


For those of you running lean infra stacks:

• Do you actively drop/sample metrics, logs, or traces to save cost?

• Have you found any affordable stacks (e.g. Prometheus + Grafana + Loki/Tempo, or self-hosted OTel setups) that still give you enough visibility?

• How do you decide what’s worth monitoring vs. what’s “nice to have”?

I'm not promoting anything. I'm just curious how different teams balance observability depth vs. cost in real-world setups.
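
For context, the kind of trace sampling I mean is basically a one-line sampler swap in a self-hosted OTel setup; the 10% ratio and names below are arbitrary examples:

# sampled_tracing.py - head-sampling traces to cut volume with the OpenTelemetry SDK
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1 in 10 traces; child spans follow their parent's decision,
# so sampled traces stay complete instead of being partially dropped.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_request"):
    pass  # instrumented work goes here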

https://redd.it/1og42rk
@r_devops