Reddit DevOps – Telegram
Building a complete Terraform CI/CD pipeline with automated validation and security scanning

We recently moved our infrastructure team off a laptop-based Terraform workflow. The solution was layered validation in CI/CD. Terraform fmt and validate run in pre-commit hooks. tflint catches quality issues and deprecated patterns during PR checks. tfsec blocks security misconfigurations like unencrypted buckets or overly permissive IAM policies. Then Conftest with OPA enforces organizational policies that used to live in wikis.

One key decision was using OIDC authentication instead of long-lived access keys. GitHub Actions authenticates directly to AWS without storing credentials. Every infrastructure change requires PR review, shows the plan output as a comment, and needs manual approval before apply runs.

Drift detection runs on a schedule and creates issues when it finds manual changes. Infracost posts cost estimates in PRs so expensive mistakes get caught during review. The entire pipeline uses open-source tools and works without Terraform Cloud.
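
A minimal sketch of how a scheduled drift check could decide whether to open an issue. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present. The function names and the `workdir` argument are hypothetical; the post does not show its actual workflow.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_NO_CHANGES = 0
EXIT_ERROR = 1
EXIT_CHANGES = 2


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == EXIT_NO_CHANGES:
        return "in-sync"
    if code == EXIT_CHANGES:
        return "drift"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) and report the drift status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled job could call `check_drift` against the already-applied main branch and create an issue only when the result is `"drift"`, treating `"error"` as a pipeline failure instead.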

Starting advice: don't enable every security rule at once. You'll get 100+ warnings and your team will ignore them. Start with HIGH severity findings, fix those, then tighten gradually.
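
The "start with HIGH, then tighten" idea can be sketched as a small gate over scanner findings. This assumes each finding is a dict with a `severity` field holding one of LOW, MEDIUM, HIGH, or CRITICAL; the rule IDs in the usage example are made up, and the exact JSON shape of a given scanner's output should be checked before relying on this.

```python
from typing import Dict, Iterable, List

# Severity ranking, lowest to highest.
SEVERITY_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def blocking_findings(findings: Iterable[Dict], minimum: str = "HIGH") -> List[Dict]:
    """Return only the findings at or above the `minimum` severity."""
    threshold = SEVERITY_ORDER[minimum]
    return [
        finding
        for finding in findings
        # Unknown or missing severities rank below everything and never block.
        if SEVERITY_ORDER.get(str(finding.get("severity", "")).upper(), -1) >= threshold
    ]


# Hypothetical findings, for illustration only.
sample = [
    {"rule_id": "example-unencrypted-bucket", "severity": "HIGH"},
    {"rule_id": "example-missing-description", "severity": "LOW"},
    {"rule_id": "example-wildcard-iam", "severity": "CRITICAL"},
]
blockers = blocking_findings(sample)
```

Lowering `minimum` to `"MEDIUM"` later is then a one-line change once the HIGH findings are cleared.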

I documented the complete setup with working GitHub Actions workflows and policy examples: Production Ready Terraform with Testing, Validation and CI/CD

What's your approach to Terraform governance and automated validation?

https://redd.it/1pet61x
@r_devops
How did you reduce testing overhead at your startup without sacrificing quality?

Our engineering team is 8 people and we're drowning in testing overhead. Between unit tests, integration tests, and e2e tests, we're spending almost 30% of sprint time on testing-related work (writing, maintaining, fixing flaky tests).

Don't get me wrong, I know testing is important, and we've caught a lot of bugs before production. But the overhead is getting ridiculous: we're moving slower than our competitors because we're spending so much time on test maintenance.

Curious how other startups have tackled this, especially teams that scaled testing without adding dedicated QA headcount. Did you find better tools? Change your testing strategy? Just accept the overhead as the cost of quality?

We're using Playwright right now, which is better than Selenium but still requires constant maintenance. Every UI change breaks tests, even with data-testid attributes. CI times are also getting long, which slows down deployment velocity.

Looking for practical advice from people who've actually solved this, not theoretical best practices. What worked for you?

https://redd.it/1peu9wa
@r_devops
finally cut our CI/CD test time from 45min to 12min, here's how

We had 800 tests running in our pipeline, taking forever and failing randomly. Devs were ignoring test failures and just merging because nobody trusted the tests anymore.

We tried a bunch of things that didn't work: we parallelized more but hit resource limits, split tests into tiers but then missed bugs, and rewrote flaky tests but new ones kept appearing.

What actually worked was rethinking our whole testing approach. We moved away from traditional selector-based testing for most functional tests because those were breaking constantly on UI changes. We kept some integration tests for specific scenarios but changed the approach for the bulk of our test suite.

We also implemented better test selection so we're not running everything on every PR. It's a risk-based approach: we analyze what code changed and run the relevant tests, while the full suite still runs on the main branch and nightly.
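
A minimal sketch of what change-based test selection can look like. The path patterns, suite names, and the "fall back to the full suite on unknown files" rule are all assumptions made for illustration; the post does not describe its actual mapping or tooling.

```python
from fnmatch import fnmatch
from typing import Dict, Iterable, List

# Hypothetical mapping from source path patterns to the suites they affect.
TEST_MAP: Dict[str, List[str]] = {
    "src/billing/*": ["tests/billing"],
    "src/auth/*": ["tests/auth", "tests/e2e/login"],
}
FULL_SUITE = ["tests"]


def select_tests(changed_files: Iterable[str], branch: str) -> List[str]:
    """Pick the suites to run for a change set; main always runs everything."""
    if branch == "main":
        return list(FULL_SUITE)
    selected = set()
    for path in changed_files:
        suites = [s for pattern, mapped in TEST_MAP.items() if fnmatch(path, pattern) for s in mapped]
        if not suites:
            # A file we cannot map is treated as high risk: run the full suite.
            return list(FULL_SUITE)
        selected.update(suites)
    return sorted(selected)
```

For example, `select_tests(["src/auth/token.py"], "feature/x")` returns `["tests/auth", "tests/e2e/login"]`, while a change to an unmapped file such as `README.md` falls back to the full suite.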

The pipeline now runs in about 12 minutes on average, and test failures actually mean something again. Devs trust them enough to investigate instead of just rerunning, and it finally feels like we have a sustainable QA process in CI/CD.

https://redd.it/1pev57u
@r_devops
Yeah, it's Datadog again. How do you cope with that?

So we got a new bill, again over target. I've seen this story over and over on this sub, and each time the advice was:

- check what you don't need
- apply filters
- change retentions, etc.

Maybe this time someone will have some new ideas on how to tackle the issue more broadly?

https://redd.it/1peoe0x
@r_devops
Hosting 20M+ Requests Daily

I’ve been reading the HN comments on the battle between Kubernetes and tools like Uncloud. It reminded me of a real story from my own experience: how I hosted 20M+ daily requests and thousands of WebSocket connections.

Once, some friends reached out and asked me to build a crypto mining pool very quickly ("yesterday"). The idea was that if it took off, we would earn enough to buy a Porsche within a month or two. (We almost made it, but that’s a story for another time.)

I threw together a working prototype in a week and asked them to bring in about 5 miners for testing. About 30 minutes later, 5 miners arrived. An hour later, there were 50. Two hours later, 200. By the next day, we had over 2000, ...

The absolute numbers might not look stunning, but you have to understand the behavior: every client polled our server every few seconds to check if their current block had been solved or if there was new work. On top of that, a single client could represent dozens of GPUs (and there was no batching or anything). All of this generated significant load.
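
A quick back-of-the-envelope check of how polling turns a modest client count into tens of millions of requests a day. The 8-second interval is an assumption picked for illustration (the post only says "every few seconds"), and this ignores the per-GPU multiplier mentioned above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(clients: int, poll_interval_s: float) -> int:
    """Daily request count when every client polls once per interval."""
    return int(clients * SECONDS_PER_DAY / poll_interval_s)


# 2000 clients polling every 8 seconds (assumed interval).
daily = requests_per_day(2000, 8)  # 21,600,000
```

At a 5-second interval the same 2000 clients would generate 34,560,000 requests a day, so the 20M+ figure is plausible from polling alone.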

By the time we hit 200 users, I was urgently writing a cache and simultaneously spinning up socket servers to broadcast tasks. I had absolutely no time for Kubernetes or similar beauty. From the moment of launch, everything ran on tmux ;-)

At the peak, I had 7 servers running. Each hosted a tmux session with about 8-10 panes.

Tmux acted not just as a multiplexer, but as a dashboard where I could instantly see the logs for every app. In case a server crashed, I wrote a custom script to automatically restore the session exactly as it was.

This configuration survived perfectly until the project eventually died.

Are there any lessons or conclusions here? Nope ;-) The whole thing came together by chance and didn't last as long as we’d hoped.

But ever since then, whenever I see a discussion about these kinds of tools, an old grandpa deep inside me wakes up, smiles, and says: "If I may..."



https://redd.it/1peyjbq
@r_devops
Transition from backend to devops/infrastructure/platform

How did you transition from a backend to a platform/infra position?

I find myself really bored with developing backend business stuff. However, I'm really interested in the infrastructure side of things: K8s, containers, monitoring, and observability. And each time I discover new tools, I feel really excited to try them out.

Also, it feels like the infra side of things has a lot of interesting problems, and I gravitate towards these. How would I slowly transition towards these roles? I’m also thinking of studying and getting the CKA cert next year.

https://redd.it/1pf1g3d
@r_devops
Made a nifty helper script for acme.sh

I recently had trouble with user permissions while configuring slapd on Alpine. So I made this little script called apit to "config"fy the installation of certs. It is just 100 lines of pure UNIX sh, and should work everywhere.

Sharing it here in the hopes it might be useful for someone.

https://redd.it/1pf3lab
@r_devops
The Missing Foundation of Non-Human Identity

I’ve been working on an identity/authorization system for machines and kept getting stuck on a basic question: what is machine identity, independent of any one stack (Kubernetes, cloud, OAuth, etc.)?

This post proposes a simple model based on where identity originates (self-proven / attested / asserted), what privileges it has at birth, and how it lives over time (disposable vs durable). I’ve also mapped common systems like SSH, SPIFFE/SPIRE, API keys, IoT, and AI agents into it.
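
The three axes named above can be written down as a small data model. This is only a sketch of the dimensions as described here; the type names are made up, and the example instance is an illustrative guess, not a classification taken from the linked post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Origin(Enum):
    """Where the identity comes from."""
    SELF_PROVEN = "self-proven"
    ATTESTED = "attested"
    ASSERTED = "asserted"


class Lifetime(Enum):
    """How the identity lives over time."""
    DISPOSABLE = "disposable"
    DURABLE = "durable"


@dataclass(frozen=True)
class MachineIdentity:
    name: str
    origin: Origin
    birth_privileges: FrozenSet[str]  # privileges held at creation
    lifetime: Lifetime


# Hypothetical example: a short-lived CI runner that starts with no privileges.
ci_runner = MachineIdentity(
    name="ci-runner",
    origin=Origin.ATTESTED,
    birth_privileges=frozenset(),
    lifetime=Lifetime.DISPOSABLE,
)
```

Writing a real system into this shape is one way to look for counterexamples: anything that needs a fourth origin, or a lifetime that is neither disposable nor durable, would show up as a field that cannot be filled in.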

I’d be very interested in counterexamples, ways this breaks down in real systems, or prior art I’ve missed.

Here's the post: https://www.hessra.net/blog/the-missing-foundation-of-non-human-identity

https://redd.it/1pf76t3
@r_devops
In AI/infra/devtools companies with usage-based pricing, who actually owns “adoption”?

In a lot of AI / infra / devtools products that charge by usage (requests, tokens, build minutes, cluster hours, etc.), there’s this blurry line after the deal is closed:

On paper, it looks like “someone on the post-sales side” owns adoption. But in reality, I keep hearing about Solution Architects, Technical Account Managers, “technical success” folks, field engineers, SREs, and even core engineers getting dragged in when a key account’s usage isn’t where it’s supposed to be.

Sometimes usage is way below what was expected, sometimes it spikes in weird ways, sometimes it’s flat, but everyone feels something is off. And then suddenly there’s a Slack war room and a bunch of people with very different goals looking at the same graphs.

In your org (AI/infra/devtools, usage-based or pay-as-you-go):

When usage is clearly off for an important customer, who actually takes the lead on figuring out what’s going on and what to do about it, and what does that usually look like from your side?

Curious how this plays out in real life vs. how the org chart says it should.

https://redd.it/1pf8mi5
@r_devops
Job Switch

Currently working as a devops engineer and I like it a lot, been doing this for about 7-8 years. I want to switch into more backend/distributed systems but not sure what programming languages are best for this. I see it being split between Python & Go.

For anyone who has transitioned from DevOps to BE/DSE or the other way around: what language would you say is best to learn?

I’m trying to lock in for the next 12 months alongside grad school.

https://redd.it/1pf9974
@r_devops
Looking for developers

Hello Developers,

I’m a co-founder of Dayplay, an upcoming mobile app designed to help people quickly discover things to do—activities, local spots, events, hidden gems, and more. Our goal is to make finding something to do fast, easy, and fun.
We’re looking for a US-based full-stack developer with strong mobile app development skills to join our small founding team. We currently have two in-house devs, but one is going on leave due to personal reasons. Our MVP is 95% complete, and we’ll be launching on TestFlight for beta testers very soon. This role will have a big impact on the final stages of development and our early product growth.

About Dayplay
Dayplay is a mobile app built for quick decision-making. Users can instantly discover new places, activities, and experiences nearby through a clean, fast, and intuitive interface.

Who We’re Looking For
A well-rounded developer who can contribute across the stack and help push the mobile app to launch. Ideally someone with:
- Full-stack experience (frontend + backend)
- Strong mobile app development skills (React Native/Expo preferred)
- Solid understanding of databases, APIs, and modern app architecture
- Ability to move quickly, collaborate with a small team, and own tasks end-to-end
(If you want the full breakdown of the tech stack and responsibilities, feel free to DM me.)

Compensation
Compensation will be discussed directly and will be based on experience and expertise.



https://redd.it/1pf7402
@r_devops
I am so tired of debugging headless Chrome in Docker

I feel like I spend more time fixing my container setup than actually writing automation code. Getting a headless browser to run remotely without crashing from memory leaks is a huge pain. I just want to run my agent and have it work without spending a week on config files.

Has anyone found a way to just sandbox the whole thing? I am looking for something where I can just add a decorator or a simple command to handle the deployment side so I don't have to deal with the infrastructure mess.

https://redd.it/1pfcdam
@r_devops
Introducing localplane: an all-in-one local workspace on Kubernetes with ArgoCD, Ingress and local domain support

Hello everyone,

I was working on some Helm charts and needed to test them locally with ArgoCD, an ingress, and a domain name.

So, I made localplane:

https://github.com/brandonguigo/localplane

Basically, with one command, it’ll:
- create a kind cluster
- launch the cloud-provider-kind command
- configure dnsmasq so every ingress is reachable under *.localplane
- deploy ArgoCD locally with a local git repo to work in (which can be synced with a remote git repository to be shared)
- deliver a ready-to-use workspace that you can destroy / recreate at will

Ultimately, this tool can be used for a lot of things:
- testing a Helm chart
- testing the load response of a Kubernetes HPA config
- providing a universal local dev environment for your team
- much more cool stuff…

If you want to play locally with Kubernetes in a GitOps manner, give it a try ;)

Let me know what you think about it.

PS: it’s a very very wip project, done quickly, so there might be bugs. Any contributions are welcome!


https://redd.it/1pf2nr9
@r_devops
How do you manage multiple chats and focus on your work

Initially I was allocated to a single project and worked only on that. Even for that project there were about 5 chats: a dev chat, a DevOps chat, a support chat (with the support team), and a product chat (with customers), which is fine. But the problem is that people expected a reply within a few minutes, and if I didn't reply for some reason, they would raise a complaint, which is actually toxic.

Now the problem is that I've recently become responsible for replying to chats for a few other projects as well. So there are about 20 team chats, and messages pop up every few minutes. We have 4 team members, but everyone is expected to do the same.

I'm a person who doesn't like frequent context switching and prefers to focus on one task at a time.

But this new approach is driving me crazy. What should I do? These frequent messages are adding more stress.

https://redd.it/1pfijxk
@r_devops
Zerv – Dynamic versioning CLI that generates semantic versions from ANY git commit

TL;DR: Zerv automatically generates semantic version numbers from any git commit, handling pre-releases, dirty states, and multiple formats - perfect for CI/CD pipelines. Built in Rust, available on crates.io: `cargo install zerv`

Hey r/rust! I've been working on Zerv, a CLI tool written in Rust that automatically generates semantic versions from any git commit. It's designed to make version management in CI/CD pipelines effortless.



🚀 The Problem

Ever struggled with version numbers in your CI/CD pipeline? Zerv solves this by generating meaningful versions from **any git state** - clean releases, feature branches, dirty working directories, anything!



Key Features

- `zerv flow`: Opinionated, automated pre-release management based on Git branches
- `zerv version`: General-purpose version generation with complete manual control
- Smart Schema System: Auto-detects clean releases, pre-releases, and build context
- Multiple Formats: SemVer, PEP440 (Python), CalVer, with 20+ predefined schemas and custom schemas using Tera templates
- Full Control: Override any component when needed
- Built with Rust: Fast and reliable



🎯 Quick Examples

# Install
cargo install zerv


# Automated versioning based on branch context
zerv flow


# Examples of what you get:
# → 1.0.0 # On main branch with tag
# → 1.0.1-rc.1.post.3 # On release branch
# → 1.0.1-beta.1.post.5+develop.3.gf297dd0 # On develop branch
# → 1.0.1-alpha.59394.post.1+feature.new.auth.1.g4e9af24 # Feature branch
# → 1.0.1-alpha.17015.dev.1764382150+feature.dirty.work.1.g54c499a # Dirty working tree
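
To make the structure of these version strings easier to read, here is a small sketch that splits one into its SemVer parts: the `major.minor.patch` core, the dot-separated pre-release identifiers after `-`, and the build metadata after `+`. It uses a simplified pattern (it does not enforce every SemVer rule, such as rejecting leading zeros) and is not Zerv's own parser; `parse_semver` and `ParsedVersion` are names invented for this example.

```python
import re
from typing import NamedTuple, Tuple

# Simplified SemVer shape: core, optional "-prerelease", optional "+build".
_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?$"
)


class ParsedVersion(NamedTuple):
    core: Tuple[int, int, int]
    prerelease: Tuple[str, ...]
    build: Tuple[str, ...]


def parse_semver(text: str) -> ParsedVersion:
    """Split a SemVer-style string into core, pre-release, and build parts."""
    match = _SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a SemVer-style version: {text!r}")
    major, minor, patch, pre, build = match.groups()
    return ParsedVersion(
        core=(int(major), int(minor), int(patch)),
        prerelease=tuple(pre.split(".")) if pre else (),
        build=tuple(build.split(".")) if build else (),
    )


parsed = parse_semver("1.0.1-beta.1.post.5+develop.3.gf297dd0")
# parsed.core       -> (1, 0, 1)
# parsed.prerelease -> ("beta", "1", "post", "5")
# parsed.build      -> ("develop", "3", "gf297dd0")
```

Read this way, the develop-branch example carries the pre-release label and post count before the `+`, and the branch name, a counter, and the commit hash after it.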



🏗️ What makes Zerv different?

The most similar tool to Zerv is semantic-release, but Zerv isn't designed to replace it - it's designed to **complement** it. While semantic-release excels at managing base versions (major.minor.patch) on main branches, Zerv focuses on:

1. Pre-release versioning: Automatically generates meaningful pre-release versions (alpha, beta, rc) for feature and release branches - every commit or even in-between commit (dirty state) gets a version
2. Multi-format output: Works seamlessly with Python packages (PEP440), Docker images, SemVer, and any custom format
3. Works alongside semantic release: Use semantic release for main branch releases, Zerv for pre-releases



📊 Real-world Workflow Example

https://raw.githubusercontent.com/wislertt/zerv/main/assets/images/git-diagram-gitflow-development-flow.png

The image from the link demonstrates Zerv's `zerv flow` command generating versions at different Git states:

- Main branch (v1.0.0): Clean release with just the base version
- Feature branch: Automatically generates pre-release versions with alpha pre-release label, unique hash ID, and post count
- After merge: Returns to clean semantic version on main branch

Notice how Zerv automatically:

- Adds `alpha` pre-release label for feature branches
- Includes unique hash IDs for branch identification
- Tracks commit distance with `post.N` suffix (commit distance for normal branches, tag distance for release/* branches)
- Provides full traceability back to exact Git states



🔗 Links

- **GitHub**: https://github.com/wislertt/zerv
- **Crates.io**: https://crates.io/crates/zerv
- **Documentation**: https://github.com/wislertt/zerv/blob/main/README.md



🚧 Roadmap

This is still in active development. I'll be building a demo repository integrating Zerv with semantic-release using GitHub Actions as a PoC to validate and ensure production readiness.



🙏 Feedback welcome!

I'd love to hear
SonarQube and other code quality tools with monorepo support

So we have been using SonarQube for a while, but our dev team feels it's a bit clunky. We're running the self-hosted dev version, but the next jump to enterprise, just to use the AI suggestions, costs 25k USD a year, which is way over my budget.

I have been looking around for alternatives, and some of you might have tested a few. Our requirements are support for self-hosted GitLab, support for monorepos, and some kind of AI suggestions (not AI auto-correct, but AI suggestions), either self-hosted or managed.

The only tool I have ruled out is Qodana, because of JetBrains' nonexistent support.

And yes, I have done Google searches, but most of the tools pretty much say the same thing ("I'm the best"), and there might be better options out there. I'd prefer software that at least looks modern and has a good UI/flow.

If it can integrate with Rider etc., that's a plus (yes, I hate JetBrains support, but the IDE is fine).

https://redd.it/1pfjcs9
@r_devops
can you actually automate end to end testing without coding or is that fantasy?

Non-technical founder here trying to figure out testing for our SaaS product. We have 2 developers and they're focused on building features; they don't have bandwidth to also become testing experts.

I keep seeing ads for tools that claim you can automate testing without writing code: just record what you're doing and it creates tests automatically. Sounds too good to be true, but I figured I'd ask if anyone has actually used these successfully.

The main concern is that we keep shipping bugs to customers and it's embarrassing. We need some way to catch obvious issues before they go live but don't have the budget to hire a QA team yet.

Is no-code test automation legit, or am I gonna waste money on something that doesn't actually work? I would rather pay for a tool than have developers spend weeks learning Selenium if there's a faster option.

https://redd.it/1pfksec
@r_devops
Help for Survey Needed😊

https://forms.office.com/r/E3RGz3Y0B3
Hi all, I’m working on my Final Year Project and I need your help! If you’re a Solution Architect, DevOps Engineer, Cloud Engineer, or anyone who wrangles cloud infrastructure for a living, I’d love to hear from you.



Cloud outages, failovers, DR drills that never happen—if these sound familiar, this survey is for you. I’m researching how teams actually handle cloud reliability and disaster recovery in the real world (not just what the documentation says), and your insights will help shape a practical automated multi-cloud DR/failover solution.



The survey only takes 5–7 minutes, everything is anonymous, and your experience could genuinely influence a tool designed for people like you.



If you have a moment, I’d really appreciate your input—thanks for helping make my FYP a little less painful and a lot more meaningful!

https://redd.it/1pfpnn9
@r_devops
Beginner in AWS: need mock papers resources and project recommendation

Asking again - I’ve been learning AWS for the past 2-3 months, along with Terraform, GitLab, Kubernetes, and Docker through YouTube tutorials and hands-on practice. I’m now looking to work on more structured, real-world projects, possibly even contributing to public cloud-related projects to build practical experience.

I’m also planning to take the AWS Cloud Practitioner exam. Could anyone suggest resources or websites that offer mock tests in an exam-like environment? Also, any recommendations for platforms where I can find beginner-friendly cloud projects to build my portfolio would be greatly appreciated.

https://redd.it/1pfpg1p
@r_devops