Reddit DevOps – Telegram
Which AWS "group buying" experience should I go with?

So last week I posted about looking at either signing a term to get locked in for a year or two to save 40% on AWS costs. We're running about $13k/month and client is breathing down my neck to figure out the best way to save on this cost.

At first I was like, awesome, volume discounts + guaranteed savings + hands-off management = profit, right?

* They want to **transfer ownership** of our AWS account to them
* We'd get invoices from TWO places (their company + AWS)
* One Redditor literally said "it's like having an MSP ex-gf who won't ever let you go"
* Stories of people **losing their entire AWS account** when the third-party stopped paying Amazon
* Some poor soul had to spend 6 months recreating their account from scratch (my condolences)

So I pulled out all the conversations in the comments + my DMs, loaded them into Claude and got it to break it all down for me.

\*if I've made any factual mistakes in this post, please feel free to leave a comment and I'll make the adjustment.

First, the Redditor-recommended implementation strategy:

1. Start with AWS native tools (Cost Explorer, Savings Plans)
2. Implement proper tagging and cost attribution
3. Avoid third-party account management

OK, #3 is heard loud and clear, but unfortunately that's against my client's directive, so I dug deeper.

The three leading solutions that address AWS commitment optimization without account transfer are:

Commitment Models Comparison (more detailed comparison below, compiled by Claude from websites, call transcripts and DMs)

|Feature|MilkStraw AI|Archera|Opsima|
|:-|:-|:-|:-|
|**Core Innovation**|"Fluid savings" without commitments|Insurance-backed 30-day commitments|AI-powered with loss guarantee|
|**Term Flexibility**|No commitments required|30-day to 3-year terms|Flexible with guarantee protection|
|**Risk Mitigation**|Zero commitment risk|Insurance backing|Contractual loss guarantee|
|**Multi-Cloud**|AWS focused|AWS + Azure + GCP|Primarily AWS|
|**Pricing Model**|Not specified|Free platform + commitment fees|Simulation available|
|**Enterprise Focus**|Startups to enterprise|Enterprise-focused|Mid to large enterprise|
|**Certifications**|Not specified|ISO 27001, AWS Advanced Partner|AWS compliance mentioned|
|**Platform Access**|Read-only cross-account|Commitment management only|Cost reports + commitment rights|

MilkStraw's and Opsima's offers are very similar; both are almost no-brainers. I think the tie-breaker will come down to how easy the onboarding is, and from what I've seen so far, MilkStraw has a slightly easier onboarding setup. But please, correct me if I'm wrong here.

Archera's model is insurance/rebate, so it's financially different from the other two.

At our spend level, I'm starting to think this is more of a political/organizational problem than a technical one anyway. If I reason from first principles, the whole reason I'm doing this is that the DevOps director doesn't want the responsibility of handling the cost savings and wants to offload it to a third party, which would then deal with finance directly.

Either way, I will present all the options to my client as well as I can, and leave the choice to them.

ps. detailed comparison of all services, feel free to skip this part.

|Solution|Account Ownership|Billing Relationship|Exit Complexity|Savings Focus|Community Sentiment|
|:-|:-|:-|:-|:-|:-|
|**MilkStraw AI**| Keep full control| Direct AWS billing| Leave anytime|Commitment optimization|🟢 Positive|
|**Opsima**| Limited IAM role| Direct AWS billing| Contractual guarantee|Commitment management|🟢 Innovative approach|
|**Archera**| Keep full control| Direct AWS billing| 30-day terms|Insured commitments|🟢 Enterprise-focused|
|[**Vantage.sh**](http://Vantage.sh)| Keep full control| Direct AWS billing| Easy exit|Cost attribution|🟢 Highly recommended|
|**Duckbill Group**| Consulting only| Direct AWS billing| Consulting model|Architecture + negotiation|🟢 Trusted expert|
|[**Spot.io**](http://Spot.io)|⚠️ Instance management| Direct AWS billing|🟡 Medium complexity|Spot optimization|🟡 Use case specific|
|**Group Buy Services**| Account transfer| Dual billing| Very difficult|Volume discounts|🔴 Strongly avoid|
|**Resellers/MSPs**| Account transfer| Reseller billing| Very difficult|Various|🔴 Never recommended|

**MilkStraw AI** **Model:** Commitment optimization without actual commitments

* **Key Feature:** "Fluid savings" - get commitment pricing without commitment risk
* **Account Control:** Keep full AWS account ownership
* **Savings:** Up to 55% on EC2, 45% on Fargate, 35% on RDS
* **Access Required:** Read-only cross-account role, no billing migration
* **Risk:** Zero risk, leave anytime
* **Coverage:** EC2, Fargate, Lambda, SageMaker, RDS, OpenSearch, ElastiCache, RedShift
* **Billing:** Keep existing AWS billing relationship
* **Community Notes:** Sourced from incoming DM

**Opsima** **Model:** AI-powered commitment management with guarantees

* **Key Feature:** No money loss contractual guarantee
* **Account Control:** Manage commitments via IAM role, no infrastructure access
* **Savings:** Based on forecasting and optimization algorithms
* **Access Required:** Cost/usage reports + commitment management rights only
* **Risk:** Contractual guarantee against over-commitment
* **Prohibited:** Not a group buying service (complies with AWS June 2025 policy)
* **Community Notes:** Offers simulation without subscription

**Archera** **Model:** Insured Commitments with flexible terms

* **Key Feature:** Short-term (30-day) commitments with 1-3 year commitment pricing
* **Account Control:** No infrastructure access, commitment management only
* **Savings:** 1-3 year commitment discounts with 30-day flexibility
* **Access Required:** Commitment purchasing and management permissions
* **Risk:** Insurance-backed commitments reduce over-commitment risk
* **Multi-Cloud:** Supports AWS, Azure, and Google Cloud
* **Coverage:** All AWS reservable services, Savings Plans, Reserved Instances
* **Certifications:** ISO/IEC 27001:2022, AWS Advanced Partner, AWS Qualified Software
* **Platform:** Free multicloud commitment lifecycle management
* **Community Notes:** Sourced from incoming DM

https://redd.it/1nlgkxq
@r_devops
What's your deployment process like?

Hi everyone, I've been tasked with proposing a redesign of our current deployment process/code promotion flow and am looking for some ideas.

Just for context:

Today we use argocd with Argo rollouts and GitHub actions. Our process today is as follows:

1. Developer opens PR
2. GitHub Actions workflow triggers with a build and allows them to deploy their changes to an Argo CD ephemeral/PR app that spins up so they can test there
3. PR is merged
4. New GitHub workflow triggers from main branch with a new build from main, and then stages of deployment to QA (manual approvals) and then to prod (manual approval)

I've been asked to simplify this flow and remove many of these manual deploy steps, while also focusing on fast feedback loops so a user knows at all times where their PR has been deployed. This is in an effort to encourage higher velocity and also ease of rollback.

Our qa and prod eks clusters are separate (along with the Argocd installations).

I've been looking at Kargo and the Argo CD hydrator and promoter plugins as well, but I'm still a little undecided on the approach to take here. Also, it would be nice to not have to build twice.

Curious on what everyone else is doing or if you have any suggestions.


Thanks.

https://redd.it/1nla325
@r_devops
Struggling with skills that don't pay off (OpenStack, Istio, Crossplane, ClusterAPI, now AI?)

I've been doing devops and cloud stuff for over a decade. In one of my previous roles I got the chance to work with Istio, Crossplane and ClusterAPI. I really enjoyed those stacks so I kept learning and sharpening my skills in them. But now that I'm back on the market (although I am currently employed), most JDs only list those skills as 'nice to have', and here I am, the clown who spent nights and weekends mastering them like it was the Olympics. It hasn't helped me stand out from the marabunta of job seekers; I'm just another face in the Kubernetes-flavored zombie horde.

This isn't the first time it's happened to me. The same thing happened with OpenStack: it was heavily advertised and looked like 'the future', only for me to watch the demand fade away.

Now I feel the same urge with AI. Yes, I like learning, but I also want to see ROI, and another part of me worries it could be another OpenStack situation.

How do you all handle these urges to learn emerging technologies, especially when it's unclear they'll actually give you an advantage in the job market? Do you just follow curiosity or do you strategically hold back?

https://redd.it/1nl93ax
@r_devops
Ran 1,000 line script that destroyed all our test environments and was blamed for "not reading through it first"

Joined a new company that only had a single devops engineer who'd been working there for a while. I was asked to make some changes to our test environments using this script he'd written for bringing up all the AWS infra related to these environments (no Terraform).

The script accepted a few parameters like environment, AWS account, etc. that you could provide. Nothing in the script's name indicated it would destroy anything; it was something like 'configure_test_environments.sh'

Long story short, I ran the script and it proceeded to terminate all our test environments, which caused several engineers to ask in Slack why everything was down. Apparently there was a bug in the script which caused it to delete everything when you didn't provide a filter. Devops engineer blamed me and said I should have read through every line in the script before running it.
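The failure mode described here (no filter means "everything") can be guarded against inside the script itself. Below is a minimal sketch, not the original script: the function names `require_filter` and `run_or_print` and the `DRY_RUN` variable are hypothetical, and the idea is simply to refuse an empty filter and to default to printing destructive commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical guards for a script like configure_test_environments.sh.
# Function and variable names are illustrative, not from the post.

# Refuse to continue when no environment filter was supplied.
require_filter() {
  local filter="${1:-}"
  if [ -z "$filter" ]; then
    echo "refusing to run: no environment filter given" >&2
    return 1
  fi
  echo "$filter"
}

# Only execute the given command when DRY_RUN=0; otherwise just print it.
# Dry-run is the default so a bare invocation cannot destroy anything.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}
```

With guards like these, a destructive step would be written as `run_or_print <command>` after `require_filter "$1"` has succeeded, and someone running the script for the first time would see what it intends to do before anything is terminated.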

Was I in the wrong here?

https://redd.it/1nllqf4
@r_devops
AWS Cloud Associate (Solutions Architect Associate, Developer Associate, SysOps, Data Engineer Associate, Machine Learning Associate) Vouchers Available

Hi all,

I have AWS Associate vouchers available with me. If any one requires, dm me

https://redd.it/1nlpa8p
@r_devops
I almost lost my best employee to burnout - manager lessons which I learned from the Huberman Lab & APA


A few months ago, I noticed one of my top engineers start to drift. They stopped speaking up in standups. Their commits slowed. Their energy just felt… off. I thought maybe they were distracted or just bored. But then they told me: “I don’t think I can do this anymore.” That was the wake-up call. I realized I’d missed all the early signs of burnout. I felt like I failed as a lead. That moment pushed me into a deep dive—reading research papers, listening to podcasts, devouring books, to figure out how to actually spot and prevent burnout before it’s too late. Here’s what I wish every manager knew, backed by real research, not corporate fluff.

Burnout isn’t laziness or a vibe. It’s actually been classified by the World Health Organization as an occupational phenomenon with 3 clear signs: emotional exhaustion, depersonalization (a.k.a. cynicism), and reduced efficacy. Psychologist Christina Maslach developed the framework most HR teams use today (the Maslach Burnout Inventory), and it still holds up. You can spot it before it explodes, but only if you know where to look.

First, energy drops usually come first. According to ScienceDirect, sleep problems, midday crashes, and the “Sunday Scaries” creeping in earlier are huge flags. One TED Talk by Arianna Huffington even reframed sleep as a success tool, not a luxury. At Google, we now talk about sleep like we talk about uptime.

Then comes the shift in social tone. Cynicism sneaks in. People go camera-off. They stop joking. Stanford’s research on Zoom fatigue shows why this hits harder than you’d think, especially for women and junior folks. It’s not about introversion, it’s about depletion.

Quality drops next. Not always huge errors. Just more rework. More “oops” moments. Studies from Mayo Clinic and others found that chronic stress literally impairs prefrontal cortex function, so decision-making and focus tank. It’s not a motivation issue. It’s a brain function issue.

One concept that really stuck with me is the Job Demands Control model. If someone has high demands and low control, burnout skyrockets. So I started asking in 1:1s, “Where do you wish you had more say?” That small question flipped the power dynamic. Another one: the Effort Reward Imbalance theory. If people feel their effort isn’t matched by recognition or growth, they spiral. I now end the week asking, “What’s something you did this week that deserved more credit?”

After reading Burnout by the Nagoski sisters, I understood how important it is to close the stress cycle physically. It’s an insanely good read, half psychology, half survival guide. They break down how emotional stress builds up in the body and how most people never release it. I started applying their techniques like shaking off stress post-work (literally dance-breaks lol), and saw results fast. Their Brene Brown interview on this still gives me chills. Also, one colleague put me onto BeFreed, an AI personalized learning app built by a team from Columbia University and Google that turns dense books and research into personalized podcast-style episodes. I was skeptical. But it blends ideas from books like Burnout by Emily and Amelia Nagoski, talks from Andrew Huberman, and Surgeon General frameworks into 10- to 40-minute deep dives. I chose a smoky, sarcastic host voice (think Samantha from Her) and it literally felt like therapy meets Harvard MBA. One episode broke down burnout using Huberman Lab protocols, the Maslach inventory, and Gallup’s 5 burnout drivers, all personalized to me. Genuinely mind-blowing.

Another game-changer was the Huberman Lab episode on “How to Control Cortisol.” It gave me a practical protocol: morning sunlight, consistent wake time, caffeine after 90 minutes, NSDR every afternoon. Sounds basic, but it rebalanced my stress baseline. Now I share those tactics with my whole team.

I also started listening to Cal Newport’s Slow Productivity approach. He explains how our brains aren’t built for constant sprints. One thing he said stuck: “Focus is a skill. Burnout is what happens when we treat it like a faucet.” This helped me rebuild our work cycles.

For deeper reflection, I read Dying for a Paycheck by Jeffrey Pfeffer. This book will make you question everything you think you know about work culture. Pfeffer is a Stanford professor and backs every chapter with research on how workplace stress is killing people, literally. It was hard to read but necessary. I cried during chapter 3. It’s the best book I’ve ever read about the silent cost of overwork.

Lastly, I check in with this podcast once a week: Modern Wisdom by Chris Williamson. His burnout episode with Johann Hari (author of Lost Connections) reminded me how isolation and meaninglessness are the roots of a lot of mental crashes. That made me rethink how I run team rituals—not just productivity, but belonging.

Reading changed how I lead. It gave me language, tools, and frameworks I didn’t get in any manager training. It made me realize how little we actually understand about the human brain, and how much potential we waste by pushing people past their limits.

So yeah. Read more. Listen more. Get smart about burnout before it costs you your best people.

https://redd.it/1nlqgo8
@r_devops
What's the biggest pain point you're facing right now?

What's up, fellow students and DevOps pros!

I'm a first-year MCA student, and I'm looking for a project idea for this semester. Instead of doing something boring, I really want to build a tool that solves a real problem in the DevOps world.

I've been learning about the field, but I know there are a ton of issues that you only run into on the job. So, I need your help.

What's the one thing that annoys you the most in your daily work? What's that one problem you wish there was a tool for?

Could be something with:

* CI/CD pipelines being slow
* Managing configurations
* Dealing with security stuff
* Trying to figure out why something broke
* Cloud costs getting out of control

Basically, what's a small-to-medium-sized pain point that a project could fix? I'm hoping to build something cool and maybe even open source it later.

Thanks for any ideas you have!

https://redd.it/1nluhr9
@r_devops
Struggling to send logs from Alloy to Grafana Cloud Loki.. stdin gone, only file-based collection?

I’ve been trying to push logs to Loki in Grafana Cloud using Grafana Alloy and ran into some confusing limitations. Here’s what I tried:

* Installed the [latest Alloy](https://github.com/grafana/alloy/releases/tag/v1.10.2) (`v1.10.2`) locally on Windows. Works fine, but it doesn’t expose any `loki.source.stdin` or “console reader” component anymore. When running `alloy tools`, the only tool it has is:

    Available Commands:
      prometheus.remote_write  Tools for the prometheus.remote_write component

* Tried the `grafana/alloy` Docker container instead of the local install, but same thing. No stdin log source.
* Docs (like [Grafana’s tutorial](https://grafana.com/docs/grafana-cloud/send-data/logs/collect-logs-with-alloy/)) only show file-based log scraping:
  * `local.file_match` \-> `loki.source.file` \-> `loki.process` \-> `loki.write`.
  * No mention of console/stdout logs.
* `loki.source.stdin` is no longer supported. Example I'm currently testing:

    loki.source.stdin "test" {
      forward_to = [loki.write.default.receiver]
    }

    loki.write "default" {
      endpoint {
        url = env("GRAFANA_LOKI_URL")
        tenant_id = env("GRAFANA_LOKI_USER")
        password = env("GRAFANA_EDITOR_ROLE_TOKEN")
      }
    }

**What I learned / Best practices (please correct me if I’m wrong):**

* **Best practice today** is *not* to send logs directly from the app into Alloy with stdin (otherwise Alloy would have that command, right? RIGHT?). If I'm wrong, what's the best practice if I just need Collector/Alloy + Loki?
* So basically, Alloy right now **cannot read raw console logs directly**, only from files/API/etc. If you want console logs shipped to Loki Grafana Cloud, what’s the clean way to do this??
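One workaround that fits the file-based pipeline from the docs is to append the app's console output to a log file and let `local.file_match` -> `loki.source.file` tail it. A minimal sketch, assuming a wrapper function named `run_with_logfile` and a log path of your choosing (both hypothetical); for containerized apps, Alloy's Docker log source may be a better fit, though I have not verified that for this setup.

```shell
#!/usr/bin/env bash
# Sketch: append a command's stdout and stderr to a log file so a
# file-based collector pipeline can tail it.
# The function name and the log path are assumptions for illustration.

run_with_logfile() {
  local logfile="$1"
  shift
  # Create the log directory if it does not exist yet.
  mkdir -p "$(dirname "$logfile")"
  # Append both streams; the function returns the command's exit status.
  "$@" >> "$logfile" 2>&1
}

# Usage (hypothetical app and path):
#   run_with_logfile /var/log/myapp/app.log ./myapp --serve
```

The file path passed here would then be the one matched by `local.file_match` in the Alloy config.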

https://redd.it/1nloznd
@r_devops
Bytebase vs flyway & liquibase

I’m looking for a DB versioning solution for a small team (< 10 developers). However, this solution will be multi-tenant, and we are expecting the number of databases (one per tenant) to grow, plus non-production databases for developers. The overall number of tenants would be small initially. Feature-wise, I believe Liquibase is the more attractive product.

Features needed.
- maintaining versions of a database.
- migrations.
- roll-back.
- drift detection.

Flyway:
- migration format: SQL/Java.
- most of the above in paid versions except drift detection.

Pricing: It looks like Flyway Teams isn’t available (not advertised) and with enterprise the price is “ask me”, though searching suggests $5k/10 databases.

Liquibase
- appears to have more database-agnostic configuration vs SQL scripts.
- migration format: XML/YAML/JSON.
- advanced features: Diff generation, preconditions, contexts.

Pricing: “ask sales”. $5k/10 databases?

Is anyone familiar with Bytebase?

Thank you.

https://redd.it/1nlw9ug
@r_devops
Flutter backend choice: Django or Supabase + FastAPI?

Hey folks,

I’m planning infra for a mobile app for the first time. My prior experience is Django + Postgres for web SaaS only, no Flutter/mobile before. This time I’m considering a more async-oriented setup:

Frontend: Flutter
Auth/DB: self-hosted Supabase (Postgres + RLS + Auth)
Custom endpoints / business logic: FastAPI
Infra: K8s

Questions for anyone who’s done this in production:

How stable is self-hosted Supabase (upgrades, backups, HA)?
Your experience with Flutter + supabase-dart for auth (email/password, magic links, OAuth) and token refresh?
If you ran FastAPI alongside Supabase, where did you draw the line between DB/RPC in Supabase vs custom FastAPI endpoints?
Any regrets vs Django (admin, validation, migrations, tooling)?

I’m fine moving some logic to the client if it reduces backend code. Looking for practical pros/cons before I commit.

Cheers.

https://redd.it/1nlx5jw
@r_devops
Terraform CI/CD for solo developer

Background

I am a software developer at my day job but not very experienced in infrastructure management. I have a side project at home using AWS and managing with Terraform. I’ve been doing research and slowly piecing together my IaC repository and its GitHub CI/CD.

For my three AWS workload accounts, I have a directory based approach in my terraform repo: environments/<env> where I add my resources.

I have a modules/bootstrap for managing my GitHub Actions OIDC, Terraform state, the Terraform roles, etc. If I make changes to bootstrap ahead of adding new resources in my environments, I run Terraform locally with IAM permissions to add the new policy to my Terraform roles. For example, if I am planning to deploy an ECR repository for the first time, I need to bootstrap the GitHub Terraform role with the necessary ECR permissions. This is a pain for one person and multiple environments.

For PRs, a planning workflow is run. Once a commit to main happens, dev deployment happens. Staging and production are manual deployments from GitHub.

My problems

I don’t like running Terraform locally when I make changes to bootstrap module. But I’m scared to give my GitHub actions terraform roles IAM permissions.

I’m not fully satisfied with my CI/CD. Should I do tag-based deployments to staging and production?

I also don’t like the directory based approach. Because there are differences in the directories, the successive deployment strategy does not fully vet the infrastructure changes for the next level environment.

How can I keep my terraform / infrastructure smart and professional but efficient and maintainable for one person?
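One way to picture tag-based promotion combined with a single shared root module is sketched below. It is only an illustration under assumptions: the tag convention (`vX.Y.Z-rc.N` for staging, `vX.Y.Z` for production), the `envs/<env>.tfvars` layout, and the function names are all hypothetical, and each environment would still need its own state/backend configuration.

```shell
#!/usr/bin/env bash
# Sketch: map a Git ref (as in GITHUB_REF) to a target environment.
# The tag patterns are an assumed convention, not from the post.

env_for_ref() {
  local ref="$1"
  case "$ref" in
    refs/heads/main)   echo "dev" ;;
    refs/tags/v*-rc.*) echo "staging" ;;     # release candidates
    refs/tags/v*)      echo "production" ;;  # final release tags
    *)                 return 1 ;;           # anything else deploys nowhere
  esac
}

# Print the plan command for one shared root module plus a per-environment
# variable file (run from the root module directory).
plan_command() {
  local env="$1"
  echo "terraform plan -var-file=envs/${env}.tfvars"
}
```

Because every environment goes through the same root module and differs only in its variable file, a change that applied cleanly in dev is the same code that staging and production will run, which is the property the directory-per-environment layout does not give.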

https://redd.it/1nlzdrv
@r_devops
Reduced deployment failures from weekly to monthly with some targeted automation

We've been running a microservices platform (mostly Node.js/Python services) across about 20 production instances, and our deployment process was becoming a real bottleneck. We were seeing failures maybe 3-4 times per week, usually human error or inconsistent processes.

I spent some time over the past quarter building out better automation around our deployment pipeline. Nothing revolutionary, but it's made a significant difference in reliability.

**The main issues we were hitting:**

* Services getting deployed when system resources were already strained
* Inconsistent rollback procedures when things went sideways
* Poor visibility into deployment health until customers complained
* Manual verification steps that people would skip under pressure

**Approach:**

Built this into our existing CI/CD pipeline (we're using GitLab CI). The core improvement was making deployment verification automatic rather than manual.

Pre-deployment resource check:

    #!/bin/bash

    # Sum of per-process CPU percentages; can exceed 100 on multi-core hosts
    cpu_usage=$(ps -eo pcpu | awk 'NR>1 {sum+=$1} END {print sum}')
    # Rounded to an integer so the [ -gt ] comparison below is valid
    memory_usage=$(free | awk 'NR==2{printf "%.0f", $3*100/$2}')
    disk_usage=$(df / | awk 'NR==2{print $5}' | sed 's/%//')

    if (( $(echo "$cpu_usage > 75" | bc -l) )) || [ "$memory_usage" -gt 80 ] || [ "$disk_usage" -gt 85 ]; then
        echo "System resources too high for safe deployment"
        echo "CPU: ${cpu_usage}% | Memory: ${memory_usage}% | Disk: ${disk_usage}%"
        exit 1
    fi

The deployment script handles blue-green switching with automatic rollback on health check failure:

    #!/bin/bash

    SERVICE_NAME=$1
    NEW_VERSION=$2
    # SERVICE_PORT must be provided by the environment
    : "${SERVICE_PORT:?SERVICE_PORT must be set}"
    HEALTH_ENDPOINT="http://localhost:${SERVICE_PORT}/health"

    # Start new version on alternate port
    docker run -d --name "${SERVICE_NAME}_staging" \
        -p "$((SERVICE_PORT + 1)):${SERVICE_PORT}" \
        "${SERVICE_NAME}:${NEW_VERSION}"

    # Wait for startup and run health checks
    sleep 20
    for i in {1..3}; do
        if curl -sf "http://localhost:$((SERVICE_PORT + 1))/health"; then
            echo "Health check passed"
            break
        fi
        if [ "$i" -eq 3 ]; then
            echo "Health check failed, cleaning up"
            docker stop "${SERVICE_NAME}_staging"
            docker rm "${SERVICE_NAME}_staging"
            exit 1
        fi
        sleep 10
    done

    # Switch traffic (we're using nginx upstream)
    sed -i "s/localhost:${SERVICE_PORT}/localhost:$((SERVICE_PORT + 1))/" "/etc/nginx/conf.d/${SERVICE_NAME}.conf"
    nginx -s reload

    # Final verification and cleanup
    sleep 5
    if curl -sf "$HEALTH_ENDPOINT"; then
        docker stop "${SERVICE_NAME}_prod" 2>/dev/null || true
        docker rm "${SERVICE_NAME}_prod" 2>/dev/null || true
        docker rename "${SERVICE_NAME}_staging" "${SERVICE_NAME}_prod"
        echo "Deployment completed successfully"
    else
        # Rollback
        sed -i "s/localhost:$((SERVICE_PORT + 1))/localhost:${SERVICE_PORT}/" "/etc/nginx/conf.d/${SERVICE_NAME}.conf"
        nginx -s reload
        docker stop "${SERVICE_NAME}_staging"
        docker rm "${SERVICE_NAME}_staging"
        echo "Deployment failed, rolled back"
        exit 1
    fi

Post-deployment verification runs a few smoke tests against critical endpoints:

    #!/bin/bash

    SERVICE_URL=$1
    CRITICAL_ENDPOINTS=("/api/status" "/api/users/health" "/api/orders/health")

    echo "Running post-deployment verification..."

    # Every critical endpoint must answer with HTTP 200
    for endpoint in "${CRITICAL_ENDPOINTS[@]}"; do
        response=$(curl -s -o /dev/null -w "%{http_code}" "${SERVICE_URL}${endpoint}")
        if [ "$response" != "200" ]; then
            echo "Endpoint ${endpoint} returned ${response}"
            exit 1
        fi
    done

    # Check response times
    response_time=$(curl -o /dev/null -s -w "%{time_total}" "${SERVICE_URL}/api/status")
    if (( $(echo "$response_time > 2.0" | bc -l) )); then
        echo "Response time too high: ${response_time}s"
        exit 1
    fi

    echo "All verification checks passed"

**Results:**

* Deployment failures down to maybe once a month, usually actual code issues rather than process problems
* Mean time to recovery improved significantly because rollbacks are automatic
* Team is much more confident about deploying, especially late in the day

The biggest win was making the health checks and rollback completely automatic. Before this, someone had to remember to check if the deployment actually worked, and rollbacks were manual.
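A possible refinement of the fixed `sleep 20` before the health checks is to poll until the service answers, so fast-starting services are not held up and slow ones get a bounded wait. This is a sketch only; `wait_until` and its arguments are illustrative names, not part of the pipeline described above.

```shell
#!/usr/bin/env bash
# Sketch: retry a check command until it succeeds or attempts run out.
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (hypothetical port):
#   wait_until 10 3 curl -sf "http://localhost:8081/health" || exit 1
```

The rollback branch stays the same; only the waiting strategy changes, and the worst-case wait is attempts times delay plus the time spent in the checks themselves.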

We're still iterating on this - thinking about adding some basic load testing to the verification step, and better integration with our monitoring stack for deployment event correlation.

Anyone else working on similar deployment reliability improvements? Curious what approaches have worked for other teams.

https://redd.it/1nm0ue1
@r_devops
GO Feature Flag is now multi-tenant with flag sets

GO Feature Flag is a fully open-source feature flag solution written in Go that works really well with OpenFeature.

GOFF lets you manage your feature flags directly in a file you put wherever you want (GitHub, S3, ConfigMaps …). There is no UI; it is a tool for developers that stays close to your actual ecosystem.

The latest version of GOFF introduces the concept of flag sets, which let you group feature flags by team, so you can now be multi-tenant.

I’d be happy to get feedback about flag sets or about GO Feature Flag in general.

https://github.com/thomaspoignant/go-feature-flag

https://redd.it/1nm3zyh
@r_devops
GCP Docs Misleading: AWS RDS Postgres → Cloud SQL Postgres migration doesn’t need Cloud SQL public IP (Configure connectivity using IP allowlists)

When migrating from **AWS RDS Postgres → GCP Cloud SQL Postgres** using **Database Migration Service (DMS)**, the official docs say you must:

* Enable the **Cloud SQL public IP**, and
* Add that Cloud SQL egress public IP to the **AWS RDS security group inbound rules**.

But in practice, this isn’t needed.

* You **don’t have to enable Cloud SQL public IP at all**.
* You only need to allow the **DMS service egress IP(s)** (for your region) in the AWS RDS security group inbound rules.
* With just that, the migration works fine.
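The allowlist step described above could look roughly like the sketch below, which only prints the AWS CLI calls rather than running them. The security group ID and the IP addresses are placeholders (documentation-range IPs), and the function name is made up; the actual DMS egress IPs for a region have to be taken from Google's documentation or the migration job's connectivity settings.

```shell
#!/usr/bin/env bash
# Sketch: print the AWS CLI commands that would allow each DMS egress IP
# on the RDS security group (PostgreSQL port 5432).
# The group ID and IPs passed in are placeholders, not real values.

print_ingress_rules() {
  local sg_id="$1"
  shift
  local ip
  for ip in "$@"; do
    echo "aws ec2 authorize-security-group-ingress --group-id ${sg_id} --protocol tcp --port 5432 --cidr ${ip}/32"
  done
}

# Usage (placeholder values):
#   print_ingress_rules sg-0123456789abcdef0 203.0.113.10 203.0.113.11
```

Printing first makes it easy to review the exact rules before applying them, and the `/32` keeps each rule scoped to a single address.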

This means the documentation is misleading and encourages users to unnecessarily expose Cloud SQL to the public internet, weakening security.

Docs reference: [Configure connectivity using IP allowlists](https://cloud.google.com/database-migration/docs/postgres/configure-connectivity-ip-allowlists)

Last week, I was working on this migration and lost nearly four hours due to misleading documentation, only to realize that enabling the Cloud SQL public IP wasn’t necessary.

I feel like I’m doing more service for Google than many of their customer engineers. I’m essentially providing free feedback to help improve their documentation. Maybe I should be charging for it. Just kidding, I genuinely love Google Cloud.

I have written an [article](https://medium.com/@rasvihostings/simplifying-aws-rds-to-google-cloud-sql-enterprise-migrations-navigating-documentation-challenges-af5914b55570) about it; check it out as well.

https://redd.it/1nm75jh
@r_devops
Trunk Based

Does anyone else find that dev teams within their org constantly complain and want feature branches or GitFlow?

When the real issue is that those teams are terrible at communication and coordination.

https://redd.it/1nm84la
@r_devops
Practical Terminal Commands Every DevOps Should Know

I put together a list of 17 practical terminal commands that save me time every day — from reusing arguments with !$, fixing typos with ^old^new, to debugging ports with lsof.

These aren’t your usual ls and cd, but small tricks that make you feel much faster at the terminal.

Here is the Link

Curious to hear, what are your favorite hidden terminal commands?

https://redd.it/1nma0as
@r_devops
Anyone here trying to deploy resources to Azure using Bicep and running Gitlab pipelines?

Hi everyone!

I am a full-stack developer trying to learn CI/CD and configure pipelines. My workplace uses GitLab with Azure, so I am trying to learn this. I hope this is the right sub to post this.

I have managed to do it through an App Registration, but that means I need to add the AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_CLIENT_SECRET environment variables in GitLab.

Is this the right approach or can I use managed identities for this?

The problem I encounter with managed identities is that I need to specify a branch. Sure, I could configure it with my main branch, but how can I test the pipeline in merge requests? That would mean many different branches, and I would need to create a new managed identity for each one? That sounds ridiculous and illogical.

Am I missing something?
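To make the branch problem concrete, here is a small sketch. It assumes the identity is set up as a federated credential that is compared against the `sub` claim of GitLab's ID token, and that the claim follows the `project_path:...:ref_type:...:ref:...` layout shown in GitLab's OIDC documentation. The function names and the `mygroup/myproject` path are made up for illustration, and the exact-match comparison is my understanding of how the credential is checked, not something confirmed here.

```python
def gitlab_oidc_subject(project_path: str, ref: str, ref_type: str = "branch") -> str:
    """Return the `sub` claim GitLab is assumed to put in a job's ID token."""
    return f"project_path:{project_path}:ref_type:{ref_type}:ref:{ref}"


def credential_matches(configured_subject: str, token_subject: str) -> bool:
    """Assumption: the credential only matches on an exact subject string."""
    return configured_subject == token_subject


if __name__ == "__main__":
    configured = gitlab_oidc_subject("mygroup/myproject", "main")
    for branch in ["main", "feature/login", "fix/typo"]:
        token_sub = gitlab_oidc_subject("mygroup/myproject", branch)
        print(branch, credential_matches(configured, token_sub))
```

Under those assumptions, a credential configured for `main` does not match a job running on any other branch, which is where the "one identity per branch" worry comes from.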

I want to accomplish the following workflow

1. Develop and deploy a full-stack app (frontend React, backend .NET)
2. Deploy Infrastructure as Code with Bicep. I want to deploy my application from a Dockerfile, using Azure Container Registry and Azure Container Apps
3. Run GitLab CI/CD pipelines on merge requests and check that the pipeline succeeds
4. Once the merge request is approved, run the pipeline on main

I have been trying to find tutorials, but most of them use GitLab with AWS, or GitHub. The articles I have tried to follow do not cover everything clearly.

The following pipeline worked, but notice how I have the global before_script and image so they are available to other jobs. Is this okay?

stages:
  - validate
  - deploy

variables:
  RESOURCE_GROUP: my-group
  LOCATION: my-location

image: mcr.microsoft.com/azure-cli:latest

before_script:
  - echo $AZURE_TENANT_ID
  - echo $AZURE_CLIENT_ID
  - echo $AZURE_CLIENT_SECRET
  - az login --service-principal -u $AZURE_CLIENT_ID -t $AZURE_TENANT_ID --password $AZURE_CLIENT_SECRET
  - az account show
  - az bicep install

validate_azure:
  stage: validate
  script:
    - az bicep build --file main.bicep
    - ls -la
    - az deployment group validate --resource-group $RESOURCE_GROUP --template-file main.bicep --parameters @parameters.dev.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

deploy_to_dev:
  stage: deploy
  script:
    - az group create --name $RESOURCE_GROUP --location $LOCATION --only-show-errors
    - |
      az deployment group create \
        --resource-group $RESOURCE_GROUP \
        --template-file main.bicep \
        --parameters @parameters.dev.json
  environment:
    name: development
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
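
As a sanity check on the rules, here is a rough Python model of which stages this pipeline should run for a given trigger. It is only a sketch of my reading of the `rules` blocks, not how GitLab evaluates them internally; the function name is made up, and it relies on merge request pipelines not setting the commit branch variable.

```python
def stages_for_pipeline(pipeline_source, commit_branch=None):
    """Map each stage that would run to how it starts, per the rules above.

    pipeline_source stands in for the pipeline source variable and
    commit_branch for the commit branch variable (None in merge request
    pipelines, where GitLab does not set it).
    """
    stages = {}
    # validate: runs for merge request pipelines or for pipelines on main
    if pipeline_source == "merge_request_event" or commit_branch == "main":
        stages["validate"] = "on_success"
    # deploy: only on main, and only after someone starts it manually
    if commit_branch == "main":
        stages["deploy"] = "manual"
    return stages


if __name__ == "__main__":
    print(stages_for_pipeline("merge_request_event"))
    print(stages_for_pipeline("push", "main"))
    print(stages_for_pipeline("push", "feature/login"))
```

So a merge request only validates the template, a push to main validates and offers a manual deploy, and a plain push to any other branch runs nothing.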

Would really appreciate feedback and thoughts about the code.

Thanks a lot!

https://redd.it/1nm7nuw
@r_devops