Reddit DevOps – Telegram
Do you think that justfiles underdeliver everywhere except packing scripts into a single file?

I'm kinda disappointed in justfiles. In the documentation it looks nice; in practice it creates a whole other set of hassle.

I'm trying to automate and document a few day-to-day tasks plus deployment jobs. In my case it's a fairly simple env (dev, stage, prod) + target (app1, app2) combination.

I'd basically want to write something like `just deploy dev app1` or `just tunnel dev app1-db`.

Initially I tried to have some map-like structure and variables, but justfile doesn't support this. Fine, I wrote all the constants manually by convention, like DEV_SOMETHING, PROD_SOMETHING.

Okay, then I figured I needed a way to pick the value conditionally. So as a test I picked this pattern:

```
arg("env", pattern="dev|stage|prod")
arg("target", pattern="app1|app2")
deploy env target:
    {{ if env == "dev" { "instance_id=" + DEV_INSTANCE_ID } else { "" } }}
    {{ if env == "prod" { "instance_id=" + PROD_INSTANCE_ID } else { "" } }}
    ...
```

Which is already ugly enough, but what are my options?

But then I faced the need to pick values based on a combination of env + target conditions, e.g. for port forwarding, where all the ports should be different. At this point I found out that justfile doesn't support AND or OR in if conditions. Parsing and evaluating AND or OR operations isn't much harder than == and != themselves.
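One workaround (not from the original post) is to stop fighting just's expression language and do the combination logic in the recipe's shell body, since recipe lines run in a shell anyway. A minimal sketch, with made-up env/target names and port numbers:

```shell
#!/usr/bin/env sh
# Since each just recipe body runs in a shell, compound env+target
# conditions can live in a case statement instead of {{ }} expressions.
# All names and port numbers here are invented for illustration.
pick_port() {
  env="$1"; target="$2"
  case "$env-$target" in
    dev-app1)   echo 15432 ;;
    dev-app2)   echo 15433 ;;
    stage-app1) echo 25432 ;;
    stage-app2) echo 25433 ;;
    prod-app1)  echo 35432 ;;
    prod-app2)  echo 35433 ;;
    *) echo "unknown combination: $env $target" >&2; return 1 ;;
  esac
}

pick_port dev app1    # prints 15432
pick_port prod app2   # prints 35433
```

The trade-off is that the logic is no longer visible in just's own syntax, but you keep a single file and get real boolean logic for free.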

Alright. Then I thought, maybe I'm approaching this completely wrong; maybe I need to generate all the tasks and treat the justfile as a rendering engine for scripts and tasks? Maybe I need some for loop to generate deploy-{{env}}-{{target}} root-level tasks with fully instantiated script definitions?

But justfile doesn't support that either.

I also thought about implementing some additional functions to simplify it, or some kind of render-time evaluation, but justfile doesn't support custom functions either.

So, at this point I'm quite disappointed in the value proposition of justfile, because honestly, packing the scripts into a single file is about the only value it brings. I know, maybe it's me, maybe I expected too much from it, but what's the point of it then?

I've looked through the GitHub issues; there are things in development, like custom functions and probably loops, but it's been about 3 or 4 years since I first heard about them, and the main limitations are still there. And the only thing I found regarding multiple conditions in if is that, instead of just implementing the simplest operator evaluation, they're thinking about integrating Python as a scripting language. Like, why? You already have an additional tool to set up, "just" itself; bringing in another runtime that gives full programming features, of which you need only the simplest operators and maps, kinda defeats the whole purpose. At this point it seems like reverting completely to plain bash scripts makes more sense than this.

What's your experience with just? All the threads I've seen about justfiles are already 1-3 years old; I want to hear fresher feedback.

https://redd.it/1qdnjhz
@r_devops
Research: how are teams controlling and auditing AI agents in production?

Hey folks,

We are researching how teams running AI agents in production deal with things like cost spikes, access control, and “what did this agent actually do?”

We put together a short anonymous survey (5–7 min) to understand current practices and gaps.

This is not a sales pitch. We are validating whether this is even a real problem worth solving.

Would appreciate honest, even skeptical feedback.

👉 https://forms.gle/yo7xwf6DrAnk2L5x7


https://redd.it/1qdoyc0
@r_devops
How big of a risk is prompt injection for client-facing chatbots or voice agents?

I’m trying to get a realistic read on prompt injection risk, not the “Twitter hot take” version. When people talk about AI agents running shell commands, the obvious risks are clear. You give an agent too much power and it does something catastrophic like deleting files, messing up git state, or touching things it shouldn’t. But I’m more curious about client-facing systems. Things like customer support chatbots, internal assistants, or voice agents that don’t look dangerous at first glance. How serious is prompt injection in practice for those systems?

I get that models can be tricked into ignoring system instructions, leaking internal prompts, or behaving in unintended ways. But is this mostly theoretical, or are people actually seeing real incidents from it?

Also wondering about detection. Is there any reliable way to catch prompt injection after the fact, through logs or output analysis? Or does this basically force you to rethink the backend architecture so the model can’t do anything sensitive even if it’s manipulated?
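For what it's worth, the crudest after-the-fact detection people try is pattern-matching transcripts for common injection phrasing. A throwaway sketch (patterns and file handling invented here; expect plenty of false negatives, since real attacks get paraphrased or encoded):

```shell
#!/usr/bin/env sh
# Naive log scan for prompt-injection phrasing. Illustrative only:
# this catches lazy attacks at best and is no substitute for
# restricting what the model is allowed to do in the first place.
scan_transcript() {
  grep -inE \
    'ignore (all )?(previous|prior) instructions|disregard (the|your) system prompt|reveal your (system )?prompt' \
    "$1"
}
```

Anything phrased differently slips straight through, which is one reason output analysis alone rarely suffices.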

I’m starting to think this is less about “better prompts” and more about isolation and execution boundaries.

Would love to hear how others are handling this in production.

https://redd.it/1qdr4hg
@r_devops
A Friday production deploy failed silently and went unnoticed until Monday

We have automated deployments that run Friday afternoons, and one of them silently failed last week. The pipeline reported green, monitoring did not flag anything unusual, and everyone went home assuming the deploy succeeded.

On Monday morning we discovered the new version never actually went out. A configuration issue prevented the deployment, but health checks still passed because the old version was continuing to run. Customers were still hitting bugs we believed had been fixed days earlier.

What makes this uncomfortable is realizing the failure could have gone unnoticed for much longer. Nothing in the process verified that the running build actually matched what we thought we deployed. The system was fully automated, but no one was explicitly confirming the outcome.

Automation removed friction, but it also removed curiosity. The pipeline succeeded, dashboards looked fine, and nobody thought to validate that the intended version was actually live. That is unsettling, especially since the entire system was designed to prevent exactly this kind of failure.
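A cheap guard against exactly this failure mode is to make the pipeline compare the version actually serving traffic against the version it just shipped. A sketch, assuming a hypothetical /version endpoint that returns the build's git SHA (the endpoint and URL are invented for illustration):

```shell
#!/usr/bin/env sh
# Post-deploy verification sketch: fail the pipeline when the running
# build differs from the one we believe we just deployed.
verify_deploy() {
  expected="$1"; url="$2"
  actual=$(curl -fsS "$url/version") || { echo "version check failed" >&2; return 1; }
  if [ "$actual" = "$expected" ]; then
    echo "deploy verified: $actual"
  else
    echo "deploy mismatch: running $actual, expected $expected" >&2
    return 1
  fi
}
```

Run as the last pipeline stage, this turns "pipeline green" into "the bits we built are the bits that are live," which is the check that was missing here.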

https://redd.it/1qdl5m8
@r_devops
What has been the most painful thing you have faced in recent time in Site Reliability/Devops

I have been working in the SRE/DevOps/support field for almost 6 years.
The most frustrating thing I face is that whenever I try to troubleshoot anything, there are always tracing gaps in the logs. From gut feeling I know that the issue originates from a certain flow, but I can never conclusively prove it.

Is it just me, or has anyone else faced this at other companies as well? So far I have worked at 3 different orgs, all Forbes-top-10 kinds of companies: big players with no "hiring or talent gap."

I also want to understand the perspective of someone working at a startup: how do logging and SRE roles work there in general? Is it more painful because the product hasn't matured, or does leadership cut you slack for the same reason?

https://redd.it/1qdtskm
@r_devops
What's the canonical / industry standard way of collaborating on OpenTofu IaC?

I am a TypeScript/Node backend developer and I am tasked with porting a mono-repository to IaC.
- (1) When using OpenTofu for IaC, how do you canonically collaborate on an infrastructure change (pushing code changes, validating plans, merging, applying)? I've read articles dealing with this topic, but it's not obvious which options are consensus and which aren't. Workflows like Atlantis seem cool, but I'm not sure what caveats and downsides come with using it.
- (2) Why do people seem to need an external backend service? Do we really need to store a central state with a third party, considering OpenTofu can encrypt it? Or could we just track it in CI and devise a way to prevent merges on conflict? (Secret vaults make sense though, since GitHub's secret management isn't suitable for juggling the secrets of multiple apps and environments.)
---

For more context:

The team I work for has a GitHub mono-repository for 4 standalone web applications, hosted on Vercel. We also use third-party services like a NeonDB database, a Digital Ocean storage bucket, OpenSearch, stuff like that.

Our team is still small at 8 developers, and it's not projected to grow significantly in size in the near future.
Vercel itself already offers a simplified CI/CD flow integration, but the reason we are going for IaC is mostly to help with our SOC2 compliance process. The idea is that we would be able to review configurations more easily, and not get bitten by un-auditable manual changes.

From that starting point, my understanding is that the industry standard for IaC is Terraform, and that the currently favored tool is its open source fork OpenTofu.

Then, I understand that in order to enable smooth collaboration and integration with GitHub's PR cycles, teams usually rely on a backend service that will lock/sync state files. Some commercial names that popped up during my research are Scalr, Env0, and Spacelift. These offer a lot of features which, quite frankly, I don't even understand. I also found tools like Atlantis and OpenTacos/Digger, but it's unclear whether these are niche or widely adopted.

If I had to pick a course of action right now, I would go for an Atlantis-like "GitOps" flow, using some sort of code hashing to detect conflicts on stale state when merging PRs. But I imagine that if it were that simple, this is what people would be doing.
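To make the question concrete, the plan-on-PR / apply-on-merge flow most of these tools implement can be approximated with plain GitHub Actions. A hedged sketch (job names and triggers invented; the big caveat is that without a locking state backend, two concurrent applies can still race, which is a large part of what Atlantis and the commercial backends solve):

```yaml
# Hypothetical GitHub Actions workflow: plan on pull requests,
# apply on merge to main. Not production-ready; no state locking shown.
name: opentofu
on:
  pull_request:
  push:
    branches: [main]
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: opentofu/setup-opentofu@v1
      - run: tofu init -input=false
      - run: tofu plan -input=false -out=tfplan
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production   # protected environment gives a manual approval gate
    steps:
      - uses: actions/checkout@v4
      - uses: opentofu/setup-opentofu@v1
      - run: tofu init -input=false
      - run: tofu apply -input=false -auto-approve
```

For an 8-person team, the delta between this and a managed backend is mostly state locking, drift detection, and plan output posted back to the PR.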

https://redd.it/1qdsjmp
@r_devops
Resume Review Request: 4 YOE, Jr. Security Engineer, US

Hello!

Resume here

Could I kindly request a quick glance over my resume? I transitioned into my position, and it's my first time in IT and cybersecurity. My first rotation threw me into the deep end with Linux engineering, automation, networking, and much more. However, I loved it and continued to pursue it.

Once I graduate from this program, I want to apply to software engineering roles that focus on security, DevOps, cloud, and the like.

I've mostly been chucking stuff that I've documented from my monthly reports into LLMs to try and come up with resume bullets, but would really appreciate human insight. Ideally, I'd like to shorten it back down to one page, and if there's any "fluff" please point it out. Would love constructive criticism.

Thanks in advance.

https://redd.it/1qe1je4
@r_devops
Looking for a "pro" perspective on my DevOps Capstone project

Hello everyone,

I’m currently building my portfolio to transition into Cloud/DevOps. My background is a bit non-traditional: I have a Bachelor's in Math, a Master’s in Theoretical CS, and I just finished a second Master’s in Cybersecurity.

My long-term goal is DevSecOps, but I think the best way to get there is through a DevOps, Cloud, SRE, Platform Engineer, or similar role for a couple of years first.

I’ve just completed a PoC based on Rishab Kumar’s DevOps Capstone Project guidelines. Before I share this on LinkedIn, I was hoping to get some "brutally honest" feedback from this community.

The Tech Stack: Terraform, GitHub Actions, AWS, Docker

 Link: https://github.com/camillonunez1998/DevOps-project 

Specifically, I’m looking for feedback on:

1. Is my documentation clear enough for a recruiter?
2. Are there any "rookie" mistakes?
3. Does this project demonstrate the skills needed for a Junior Platform/DevOps role?

Thanks in advance!



https://redd.it/1qdokq6
@r_devops
What to focus on to get back into devops in 2026?

Some context: I worked in DevOps-related positions for the past decade but suffered some serious skill rot the past 4 years while working for the US government-- everything was out of date and I was kept away from most of the important pieces (No Kube exposure despite asking for experience with it, no major project deployments, mostly just small-time automation work.) However the job was *very* comfy and I allowed myself to settle into it -- a fatal error given that my entire team was laid off back in September during the government "cost saving" cuts.

Not taking the time after work to make sure I stayed current was partly my fault and partly severe burnout from the industry. (I have no passion for any work, really, so burnout is unavoidable for me.)

How do I course-correct from here? I will likely need to work a much lower position in IT support (I'm completely out of money and have lost my apartment already; unemployment isn't paying enough for the cost of living here) and study in the evenings, because I can't currently pass an interview; the last several I've had went poorly, and I simply don't have the necessary knowledge. I intend to re-certify as an AWS Solutions Architect Associate after letting it lapse, and may study for the CKA as well.

I am admittedly pretty against AI and have that going against me right now, so I'm trying to focus on other avenues.

https://redd.it/1qe6z9g
@r_devops
Using Cloudflare Workers + WebSockets to replace a SaaS chat tool

I got tired of chat widgets destroying performance.

We were using Intercom and tried a couple of other popular tools too. Every one of them added a huge amount of JavaScript and dragged our Lighthouse score down. All we actually needed was a simple way for visitors to send a message and for us to reply quickly.

So I built a small custom chat widget myself. It is about 5KB, written in plain JavaScript, and runs on Cloudflare Workers using WebSockets. For the backend I used Discord, since our team already lives there. Each conversation becomes a thread and replies show up instantly for the visitor.

Once we switched, our performance score went back to 100 and the widget loads instantly. No third-party scripts, no tracking, no SaaS dashboard, and no recurring fees. Support replies are actually faster because they come straight from Discord.

I wrote a detailed breakdown of how it works and how I built it here if anyone is curious

https://tasrieit.com/blog/building-custom-chat-widget-discord-cloudflare-workers

Genuinely curious if others here have built their own replacements for common SaaS tools or if most people still prefer off the shelf solutions.

https://redd.it/1qedxc1
@r_devops
EMR Spark cost optimization advice

Our EMR Spark costs just crossed $100k per year.

We’re running fully on-demand m8g and m7g instances. Graviton has been solid, but staying 100% on-demand means we’re missing big savings on task nodes.

What’s blocking us from going Spot:

Fear of interruptions breaking long ETL and aggregation jobs
Unclear Spot instance mix on Graviton (m8g vs c8g vs r8g)

We know teams are cutting 60–80% with Spot, and Spark fault tolerance should make this viable. Our workloads are batch only (ETL, ad-hoc queries, long aggregations).

Before moving to Spot, we need better visibility into:

- CPU-heavy stages
- Memory spills
- Shuffle and I/O hotspots
- Actual dollar impact per stage

Spark UI helps for one-off debugging but not production cost ranking.

Questions:

- Best Spot strategy on EMR (capacity-optimized vs price-capacity)?
- Typical split: core on on-demand, task nodes mostly Spot?
- Savings Plans vs RIs for baseline load?
- Any EMR configs for clean Spot fallbacks?
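For reference, a common pattern is exactly the split asked about: master and core fleets stay on-demand, and the task fleet carries Spot with an allocation strategy plus an on-demand fallback. A sketch of a task instance-fleet definition as passed to `aws emr create-cluster --instance-fleets` (capacities, instance types, and weights here are illustrative, not a recommendation):

```json
{
  "Name": "task-fleet",
  "InstanceFleetType": "TASK",
  "TargetOnDemandCapacity": 0,
  "TargetSpotCapacity": 32,
  "LaunchSpecifications": {
    "SpotSpecification": {
      "TimeoutDurationMinutes": 10,
      "TimeoutAction": "SWITCH_TO_ON_DEMAND",
      "AllocationStrategy": "capacity-optimized"
    }
  },
  "InstanceTypeConfigs": [
    { "InstanceType": "m8g.4xlarge", "WeightedCapacity": 16 },
    { "InstanceType": "m7g.4xlarge", "WeightedCapacity": 16 },
    { "InstanceType": "r8g.4xlarge", "WeightedCapacity": 16 }
  ]
}
```

Note that `SWITCH_TO_ON_DEMAND` only covers the provisioning-time case (Spot capacity unavailable within the timeout); it does not protect against mid-job interruptions, which is where Spark's task retries and keeping shuffle-heavy core nodes on-demand come in.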

Looking for real-world lessons from teams who optimized first, then added Spot.



https://redd.it/1qed9dy
@r_devops
Resume Review & Next Steps

This is a sanitized version of my resume:
[https://imgur.com/rjzJZvB](https://imgur.com/rjzJZvB)


General Overview:

* I have 7+ years of total experience in IT
* I have just a tad under 4 years of experience in my last role
* My last role is what I consider to be "DevOps in name only," given that I didn't touch CI/CD or containers for the first 2-3 years. It was closer to a generic Cloud or Infrastructure Engineer role.


I was recently and abruptly let go from my remote job (no PIP, eligible for rehire; the org was restructuring right up to a new CEO). All I really want is 1) a remote job and 2) a job where I can spend most of the day in a code editor.


The remote job isn't me being entitled; I moved to an area away from big cities while I held my last job for 2+ years, so it's either 1) find a remote IT job, 2) bag groceries for a living, or 3) move again with 0 income.



I wanted to see if my resume looks generally okay, as general community sentiment seems to be that your resume shouldn't be longer than 1-page unless you have 10+ years of experience. I opted to omit bullet items for older roles as they are less relevant to roles I'm looking for (DevOps, Platform, Cloud, Infrastructure Engineer).


My resume draws from a Full CV where I have other experiences listed, such as setting up a fully 1-click deployment of a Splunk cluster (using Gitlab CI to orchestrate Terraform for Infra + Ansible for Splunk install/configure, with Splunk ingesting logs from AWS via Kinesis Firehose at the end of this).


There is one point of contention or lack in my experience I was hoping to get feedback on.


I listed "Python", but to be honest it was the most minimal feasible usage of Python: I wrote a simple (less than 200 lines) script to automate Selenium web browser actions. Jira Server is known to have gaps in its API, so I couldn't fully automate the setup (inputting a license key) without using Selenium to interact with the web app. The script didn't really make use of functions or classes. As such, I can't honestly say I'd be able to write a Python script to do anything specific if asked during an interview.


Similarly, my only practical experience with Golang was when I "vibe-coded" alterations to a fork of Snyk/driftctl. I fundamentally don't understand the lower-level concepts of Golang, but as an engineer I was still able to decompose how the program worked (it reached out to 100+ separate AWS Service API endpoints to make a multitude of GET requests, leading to API rate limiting issues) enough to figure out a more practical workaround (e.g. replace all separate API calls with a single API call to AWS Config Configuration Recorder API instead).


Based on the [DevOps.sh](http://DevOps.sh) roadmap, I figured my major "lack" is knowing a programming language, so I figured a good "next step" is to learn Golang. I'm curious if I'm on point about that. It's just that at this point, I'm not sure why you need to learn it and to what extent. Is it mostly for scripting or mini-tooling purposes, or do employers generally expect you to develop microservices like an actual software developer?


I come more from the Ops side of IT.

https://redd.it/1qeqgfo
@r_devops
Cyberhaven's Unified DSPM & DLP Platform Launch - Webinar 2/3

Hey r/devops, wanted to share a webinar next week that's relevant if you're dealing with data security in your environments, especially with AI tools in the mix.

Cyberhaven is launching their unified platform that combines AI security, DLP, insider threat management, and NextGen DSPM. Their CEO and product team will be covering:

- Getting visibility across cloud, on-prem, and endpoints in one place
- Understanding what sensitive data you have, where it lives, and actual risk levels
- How AI adoption is creating new data exposure challenges (shadow AI, ChatGPT usage, etc.)
- Context-rich data visibility to reduce operational blind spots

If you're managing infrastructure where developers are spinning up cloud resources, using AI coding assistants, or moving data across environments - this covers the visibility and security posture challenges that come with it.

Pretty relevant given how many teams are now dealing with data sprawl from AI tools on top of the usual multi-cloud complexity.

Free registration: https://events.cyberhaven.com/winter-2026-launch/

Date: February 3rd, 2:00 PM — 3:00 PM EST

https://redd.it/1qeu7z3
@r_devops