Reddit DevOps – Telegram
Considering Chainguard, but how locked-in is it?

We’ve been looking at Chainguard for container image security. From what I’ve seen, it’s high quality, minimal, and secure. They provide SBOMs and reproducible builds, which is great.
That said, a few concerns come to mind:

• Many of their images are built on Chainguard OS / Wolfi, not standard community distros.

• Once you adopt it fully, you might be tied to their ecosystem… tooling, update cadence, and base OS.

• Some advanced features, like hardened or FIPS/STIG-certified images, are part of their paid offering.

• Their packaging is limited to Wolfi or internally maintained packages, which could make migration trickier.

How easy would it be to switch to other CVE or image protection tools if needed? Open to any advice/discussion, and sorry if I asked a stupid question.

Thanks in advance.

https://redd.it/1p2s0xx
@r_devops
How Deployable Is a Model Like Orion-MSP in Real Pipelines?

I came across **Orion-MSP**, which attempts to bring in-context learning to tabular data using multi-scale sparse attention and a Perceiver-style memory unit. From a research standpoint, it’s creative. From a DevOps standpoint, I’m not sure how practical it is.

A few questions I’m struggling with:

* Does a model with multiple attention scales introduce too much complexity for versioning and deployment?
* How would you monitor or debug failure cases in something with hierarchical attention patterns?
* Perceiver-style memory adds flexibility, but does it make observability harder in production pipelines?

Would be interested in hearing from people who’ve deployed Transformer-style tabular models — does something like Orion-MSP feel operationally reasonable, or is the architecture too intricate for most teams?

(Links available in comments if useful.)

https://redd.it/1p2skfg
@r_devops
How do you deploy Laravel on an ASG?

I would love to know how people are managing Laravel deployments running on EC2 in an Auto Scaling group.
I have considered CodeDeploy, but I want something faster, like envoyer.io.
I also need a way to manage updates to the .env file.

https://redd.it/1p2qnua
@r_devops
which ai coding agents did you guys drop because they caused more chaos than help?


i’ve been cycling through a bunch of ai coding agents lately, and honestly, some of them created more mess than they solved. at one point i had aider, cursor, windsurf, cosine, cody, tabnine and continue.dev. a few stuck, but a few absolutely nuked my workflow with weird refactors and random hallucinations.

curious what everyone else has bailed on. which ai tools looked promising at first but ended up causing more chaos than help?

https://redd.it/1p2vqn0
@r_devops
Anybody here work for Rithum / Channel Advisor?

They’ve been hard down for almost 20 hours now. They claim it’s a fuck-up during maintenance, but I’m concerned they got owned and encrypted.

https://status.channeladvisor.com

https://redd.it/1p2vpby
@r_devops
How much time do you actually spend finding root cause vs fixing it?

When I was working at a larger bank, I felt like we spent way too much time on debugging and troubleshooting incidents in production. Even though we had quite a mature tech stack with Grafana, Loki, Prometheus, and OpenShift, I still found myself jumping around tools and code to figure out the root cause and the fix. Is the issue in infra, application code, app deps, an upstream/downstream service, etc.?

What are your experiences, and what does your process look like? Would love to hear how you handle incident management and what tools you use.

I'm exploring building something within this space and would really appreciate your thoughts.

https://redd.it/1p2xs3o
@r_devops
ECS vs Regular EC2 Setup

I'm currently revamping a France-based company's cloud infra. We have a few micro-FEs and a few microservice BEs, all running on Docker, plus Redis and PostgreSQL, with dev, staging, and prod environments. I've been asked to revamp from the ground up and ignore the existing infra setup; the goal is simplification. The current setup is a bit over-engineered because the app only ever gets around 5k daily users max and is not intended to scale significantly.

I'm thinking of using ECS + EC2 with a load balancer, ASG, and Capacity Provider, building and deploying the Docker image via GitHub Actions to ECR, from which ECS pulls. But for this amount of users, would it be better to just set up two EC2 instances per env, one for the FE services and one for the BE services, with generous hardware capacity, and skip ECS/EKS entirely? I don't see the need for load balancing and auto scaling with a user base that isn't expected to rise exponentially.

Some notes: no batch or intense compute, relatively small DB size, dev team of 5. User base majority centered around one region. Application is not critical.

Any thoughts?

https://redd.it/1p2zway
@r_devops
Built a tiny high-performance telemetry/log tailing agent in Zig (epoll + inotify). Feedback & contributors welcome

I’ve been hacking on a little side-project called zail — a lightweight telemetry agent written in Zig that watches directories recursively and streams out newly appended log data in real time.

Think of it like a minimal `tail -F`, but built properly on top of epoll + inotify, with no polling and stable file identity tracking (inode + dev\_id). It’s designed for setups where you want something fast, predictable, and low-CPU to collect logs or feed them into other systems.

# Why I’m posting

I’m looking for early contributors, reviewers, and anyone who enjoys hacking on:

* epoll / inotify internals
* log rotation logic
* output sinks (JSON, TCP/UDP, HTTP, Redis, etc.)
* async worker pipelines
* structured log parsing
* general Zig code quality improvements

The codebase is small, easy to navigate, and friendly for new Zig/system-level contributors.

# Repo

https://github.com/ankushT369/zail

If you like low-level Linux stuff or just want a fun project to tinker with, I’d love your thoughts or contributions!

https://redd.it/1p2zafm
@r_devops
I built a bash script that finds K8s resource waste locally, because installing Kubecost/CastAI agents triggered a 3-month security review

**TL;DR:** I built a bash script that finds K8s resource waste locally because installing Kubecost/CastAI agents triggered a 3-month security review.

**The Problem:**
I've been consulting for Series B startups and noticed a pattern: massive over-provisioning (e.g., 8GB RAM requests for apps using 500MB), but no easy way to audit it. The existing tools are great, but they require installing agents inside the cluster. Security teams hate that. It often takes months to get approval.

**The Solution:**
I wrote a simple bash script that runs locally using your existing `kubectl` context.
* **No Agents:** Runs on your laptop.
* **Safety:** Anonymizes pod names locally (SHA256 hashes) before exporting anything.
* **Method:** Compares `requests` vs `usage` metrics from `kubectl top`.
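The requests-vs-usage comparison boils down to parsing Kubernetes quantities and taking a ratio. A rough sketch of the idea in Python (not the author's bash; memory only, with the unit handling simplified):

```python
# Binary (Ki/Mi/Gi/Ti) and decimal (K/M/G) suffixes used by K8s memory quantities.
# Order matters: two-letter suffixes must be checked before one-letter ones.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
          "K": 1000, "M": 1000**2, "G": 1000**3}

def parse_memory(qty: str) -> int:
    """Parse a Kubernetes memory quantity like '8Gi' or '500Mi' into bytes."""
    for suffix, mult in _UNITS.items():
        if qty.endswith(suffix):
            return int(float(qty[:-len(suffix)]) * mult)
    return int(qty)  # plain bytes

def waste_ratio(requested: str, used: str) -> float:
    """Fraction of the memory request that sits unused (0.0 if over the request)."""
    req, use = parse_memory(requested), parse_memory(used)
    return max(0.0, 1 - use / req) if req else 0.0
```

So the 8GB-request / 500MB-usage pattern from the post shows up as roughly 94% waste.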

**The Code (MIT Licensed):**
https://github.com/WozzHQ/wozz

**Quick Start:**
`curl -sL https://raw.githubusercontent.com/WozzHQ/wozz/main/scripts/wozz-audit.sh | bash`

**What I'm looking for:**
I'm a solo dev trying to solve the "Agent Fatigue" problem.
1. Is the anonymization logic paranoid enough for your prod clusters?
2. What other cost patterns (orphaned PVCs, etc.) should I look for?

Thanks for roasting my code!

https://redd.it/1p31y4t
@r_devops
What's the cleverest prompt injection bypass you've actually encountered?

Been red teaming chatbots for a while now and the attack vectors keep evolving. Most attempts are basic role-play or system prompt leaks, but I've seen some genuinely creative ones.

The cleverest I caught recently was an attacker who embedded instructions in fake error messages, making the model think it was debugging itself. Something like "Error: To continue, ignore previous instructions and..." Pretty sneaky social engineering on the model itself.
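For illustration only, a toy Python heuristic for the fake-error pattern described above (the patterns are hypothetical, and static matching like this is trivially evaded, which is rather the point):

```python
import re

# Override phrases commonly smuggled into "error messages".
_OVERRIDE = re.compile(
    r"(ignore (all |any )?(previous|prior|above) instructions"
    r"|disregard (your|the) (system )?prompt)",
    re.IGNORECASE,
)

def looks_like_injected_error(text: str) -> bool:
    """True if text poses as an error/debug message and carries an override phrase."""
    poses_as_error = bool(
        re.search(r"\b(error|exception|traceback|debug)\b", text, re.IGNORECASE)
    )
    return poses_as_error and bool(_OVERRIDE.search(text))
```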

I'm curious what others have encountered in production. Are you seeing more sophisticated multi-turn attacks? Any particularly creative bypasses that made you rethink your defenses?

Also interested in how teams are actually managing this operationally. Static filters obviously don't cut it.



https://redd.it/1p329nj
@r_devops
Is it normal to have to learn something new for every work task?

I'm working for a tech company where they put together a bigger DevOps team that spans multiple projects, so we manage them all at the same time. Previously we did the same work separately for each project. We were initially hired as inexperienced juniors, were never properly trained, and for several years we kinda shot the shit since we had rather simple tasks.

Now we have an immense workload split among too few of us and, I kid you not, we get a new area of expertise to handle pretty much every month. 70% of the tasks I get require learning something new, almost from scratch. Only a few, highly experienced and highly motivated people are able to keep up. I feel like the rest of us are sinking, but I don't really know, since nobody talks about it.

Is this amount of learning normally expected in DevOps jobs at other companies?

I am extremely exhausted, I feel constantly ashamed of my performance, and I often procrastinate doing the tasks because I have no idea how to do them, nor do I feel like constantly asking questions. A lot of the time, I barely understand the answers, because I haven't been trained in what I'm supposed to do.

Is this situation normal in DevOps? Are you constantly expected to learn new things from scratch, on your own? I don't know if I need to change companies or change my profession altogether.

https://redd.it/1p34525
@r_devops
Jenkins or GitLab Runners for Android apps?

Hey all, I’m in the process of setting up CI/CD at the moment in my company, starting with a few Android apps first.

At the moment, I have scripts to run all of the tests and then build signed releases. It’s okay for now, but I’d like to not have to do this and to have easily accessible builds that are distributed automatically.

We moved from GitHub to running a self-hosted GitLab instance (cheaper for LFS on other projects, plus easier overall for me). I haven’t configured runners yet, but now need to decide between doing that or spinning up a Jenkins server. I’ve used Jenkins in the past for other projects, personally and professionally, so I’m relatively comfortable with it. But I need some more opinions on what you’d do in my situation.

Are there any other tools that might be easier for deployment/maintenance? The less administration the better personally lol. (I’m managing Development and other infrastructure already)

The ability to run our OS builds (AOSP) in the future would also be a nice to have, but not important, they’re a lot less frequent but not having to baby them would be good.

https://redd.it/1p35v5x
@r_devops
Release Engineering vs SRE

Hi all,

Looking for advice on two positions I've been offered at the same company. I had initially gone in for a Platform Engineering role; however, that role has now closed.

The company is still interested in getting me on board, though, and has offered me the choice of an SRE or Release Engineer role. My background has mainly been in small companies where I've taken on more DevOps-y responsibilities, and for the past while I've been in a 'dedicated' DevOps role (though it is more an everything-developer role in practice). I want to get more experience with the parts of DevOps I enjoy: designing and implementing distributed, scalable infrastructure while abstracting complexity away from SWEs in the SDLC, ideally without becoming a sysadmin or losing sight of the SWE-esque day-to-day. Hence I believed PE would be a good fit (please correct me if I'm wrong).

I'm aware each company defines all these roles differently, and no opinion here can give me clarity into that. However the choice involves specialised industry defined roles at a size of company I don't have experience with. I don't have many people in my network I can ask for guidance so any insight to this would be amazing!

PS: I have a knee-jerk avoidance of RE because I think focusing primarily on git, release versioning, and build tools would drive me insane, but I would love to be proved wrong, as I love the idea of collaborating a bunch.

https://redd.it/1p2za1j
@r_devops
TLS MITM environments such as Zscaler: How do you ensure trust when the entire TLS chain is deliberately compromised?

When an organization has decided to implement global TLS inspection via man-in-the-middle proxies, effectively taking a chainsaw to the entire cryptographic trust architecture of TLS that underpins practically all modern computing, how can we still provide valid, real, secure trust, system-to-system and people-to-systems?

I'm going through my own thought experiments now trying to answer the question, "If only basic non-TLS HTTP existed, what would I need to configure and/or build to provide both the trust and secure communications that TLS otherwise ensures?"

On the small scale I'm looking at things like enabling claims encryption for SAML and OIDC authentications, exclusively using FIDO2 hardware tokens (no TOTP, SMS, etc), etc. But while I've worked out securely authenticating to services, the MITM is still able to scrape the JWT bearer tokens, session cookies, etc to hijack sessions even if it can't replay the authentication itself. And even if we solve authentication, there's still the data itself to consider, which is going to require some form of public-key based, application-level encryption, like an SSH data flow only implemented in the web browser (WASM maybe?).

I'm late to the game, but suddenly I'm thrust into understanding exactly the problem space that folks like WhatsApp et al have been trying to solve with full end-to-end encryption. Because I realize now that even if my own organization isn't using MITM TLS inspection, whoever or whatever I'm communicating with on the other side of the conversation may not be so lucky.

---

To be clear I'm not looking for ideas on how to get around Zscaler for my own traffic; I've got more than enough technical chops to route around this asinine security theatre if I cared to.

Rather, I'm looking at this from a systems architecture / DevOps / SDLC perspective: how do I factor a solution to this (new to me) threat vector into my users' security? For example, Zscaler publishes a list of their proxy IP CIDR ranges, which a website/app can match against the "client" IP and, if matched, at least present the user with a warning that any data they enter is absolutely NOT secure, no matter what that little padlock icon in the location bar says (since Zscaler subverts the client's trust store with their own CA).
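A minimal sketch of that CIDR screening idea using Python's stdlib `ipaddress` module; the ranges below are RFC 5737 documentation blocks standing in for the real published list, not Zscaler's actual ranges:

```python
import ipaddress

def behind_known_inspection_proxy(client_ip: str, proxy_cidrs) -> bool:
    """True if client_ip falls inside any published TLS-inspection proxy range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in proxy_cidrs)

# Stand-in ranges for illustration (documentation blocks, not real proxy IPs).
EXAMPLE_PROXY_CIDRS = ["198.51.100.0/24", "203.0.113.0/24"]
```

On match, the app can surface an in-page warning that the TLS session is being inspected regardless of what the padlock claims.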

My customers still need actual security, actual trust, no matter what my insecurity team thinks. So this is just another design requirement to deal with and I'm looking for tips about how others might have approached this problem. Both in application arch itself, but also the full SDLC because how do we deal with trusting supply chains, etc.

https://redd.it/1p3ajr8
@r_devops
I built a tower defense game that teaches cloud architecture (but does anyone actually want this?)

A couple weeks ago, I was once again explaining to a junior dev why his API was crashing under load. I drew diagrams, showed him charts, talked about load balancers and scaling... And I saw that familiar emptiness in his eyes. He was nodding, but I knew he wasn't really feeling the problem.

Then it hit me - what if I made a game where you actually see your architecture collapse in real-time?

What I built

Server Survival is basically tower defense for DevOps. You build cloud infrastructure from blocks (WAF, Load Balancer, EC2, RDS, S3), connect them with arrows, and then watch your creation try to survive waves of incoming traffic.

Full disclosure: this is a rough MVP

I'll be honest - right now this is a prototype hacked together on my knee. I intentionally made the simplest version possible just to validate the idea. There are tons of simplifications, some things don't work exactly like real AWS, the load balancing is sometimes wonky.

But! That's exactly why I'm releasing this open source. I want to understand - is this even interesting to anyone?

I have a ton of ideas for what could be added - different cloud providers (AWS/Azure/GCP), more realistic mechanics, auto-scaling groups, availability zones, monitoring dashboards, multiplayer mode, real-world incident scenarios like Black Friday or security breaches... But before I sink more time into this, I really need to know: does anyone actually need this?

GitHub: https://github.com/pshenok/server-survival

Let me know what you think

https://redd.it/1p3bxnx
@r_devops
Help restructuring a terraform monorepo.

Hello all!

My head's spinning a bit and I need some insights here.

For context: two years ago I was contracted to architect and implement the cloud systems for an application POC that needed a fast turnaround. At the time the application was very small: basic networking, one RDS instance, one API Gateway, etc. You get it, simple. So I put it all in a monorepo and implemented fairly basic GHA CI/CD and branch-based envs coupled with workspaces, as everything was to be set up in one AWS account.

Fast forward to today: I was called back to lend a hand as things have grown. They now have more or less the same networking, but API Gateway -> Load Balancer -> multiple containerized APIs in ECS, multiple DBs, etc. Everything has grown exponentially, still in the monorepo with one state file, and they also want to shift to a multi-account/env setup (music to my ears).

I really do not want to spark the multi- vs. mono-repo debate. But they have a small team of 3-4 devs in charge of the infra and deployment of applications, so they've opted to leave it as a monorepo. Worth noting the application logic is broken out, but at least the first image is deployed with Terraform.

The question now is: how do you break up a jumbo-sized root state file that everything is using, and isolate state so that "services" like the ECS-based API containers can be modified and deployed independently? How do you graduate changes to prod without affecting prod while a service is being changed, target multiple accounts, and avoid drift, all from a monorepo?

My current plan is to switch to the tried-and-true dir-per-env layout: break out each service as a "module" and parametrize its inputs, then stitch the CI/CD so that there's a staged deployment plus granular deployment for isolated updates. Each service gets its own root-level state and can be updated independently within the repo, without a massive plan and apply.
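To make the dir-per-env idea concrete, a hypothetical layout (names illustrative, not the author's actual repo):

```
live/
  dev/
    networking/        # own state: VPC, subnets, API Gateway
    services/
      api-foo/         # own state: one ECS service + task definition
      api-bar/
  staging/             # same tree as dev
  prod/
modules/
  ecs-service/         # shared module, pinned by source path
  networking/
```

Each leaf directory gets its own backend state key, so a `terraform plan` in `live/dev/services/api-foo` touches only that service's state, and promoting a change means applying the same module inputs in `staging/` and then `prod/`.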

Graduating changes to prod and keeping envs in sync seems difficult in a scenario like this, as tagging service modules is pretty much out in a monorepo. So it'd have to lean more towards trunk-based development and semi-manual deployments to the prod env after approval.

Hopefully that all makes sense. Any thoughts here would be greatly appreciated as I usually lean towards multi repo.

Cheers.




https://redd.it/1p3afkm
@r_devops
DevOps internship questions

Hey everyone! I'm a university student in CS. I have an interview for a DevOps internship next week. Looking forward to it, but wanna make sure I'm preparing properly. Here's what I've done so far:
- I have looked at the interviewers’ LinkedIn profiles and checked out what they do or have done at the company
- Reviewed all the technologies, languages, and tools listed in the job posting. For the ones I already know or have on my resume, I refreshed my memory and did a deep dive; for the ones I wasn’t familiar with, I did a quick overview
- Wrote down specific details about the projects and experience listed on my resume so I’m ready for questions like “what was your role?”/“why did you do it this way?”/“can you explain this in more detail?” and so on
- Prepped for some behavioural questions

I'm also thinking about preparing a few questions to ask them, some out of curiosity, some just to keep the interview flowing nicely.

What else should I focus on? I don't get nervous with stuff like this, so I should be able to hold my nerve and have a good interview. Also, since it's an intern position, my guess is that they won't be expecting strong technical skills or expertise; if I'm right, they're looking for someone who is competent, willing to learn, and shows some level of enthusiasm and drive. And my job is to leave a good impression on them to help me stand out.

Any advice and tips are much appreciated.
Also, the job is in Canada, and the company is an enterprise-level company.

https://redd.it/1p3n5k0
@r_devops
CEO "helps" with terraform, rewrites entire product into an unmaintainable frankenstein, now wants to migrate everything

Not my story, thankfully; it was shared with me. Just wanted to share the insanity that's going on rn:

"A customer recently asked us to help them with some terraform to install our app. My CEO casually remarked “hey I’m pretty good with terraform let me take this over”

Now he has a completely re-architected version of our product that only works for that one customer. He added a bunch of new services like Istio, Argo CD, and Vault, rewrote all our CI/CD in Dagger, and ripped out a bunch more required services. It barely works. Nobody is trained on half of this. He vibe-coded this over two months in a vacuum and thinks of himself as some kind of genius; he can’t even explain half the shit.

He is asking me to migrate everything over to his bullshit over the next couple weeks"

https://redd.it/1p3p7pv
@r_devops
What’s enough for a Junior?

I’m about to start applying for a Junior devops and my portfolio is as follows:

- An all-Terraform NAT-less EKS cluster with an ALB ingress and Kyverno admission control based on a KMS key signature and an attestation for an image (I also made a GitLab pipeline that signs an image with cosign, attests it with Trivy, and then pushes it into my private ECR).

- An all-Terraform EKS monitoring stack with kube-prometheus.

- A custom container runtime with OCI image extraction and custom networking supporting multiple containers, NAT, and port forwarding (I actually ran a monitoring stack on this using Prometheus and a node exporter), all written in Go.

- Now I’m about to build an eBPF firewall, and after that I’ll just start applying.

I have no reference point for what a junior applicant pool actually looks like in terms of skill level, and since I originally wanted to do cybersecurity, my idea of a typical junior is about exactly what I have right now.

Is there anybody who works in the industry and has an idea of the junior skill level, and whether that’s enough to land a global remote position?


https://redd.it/1p3qvjf
@r_devops