Reddit DevOps – Telegram
EMR Spark cost optimization advice

Our EMR Spark costs just crossed $100k per year.

We’re running fully on-demand m8g and m7g instances. Graviton has been solid, but staying 100% on-demand means we’re missing big savings on task nodes.

What’s blocking us from going Spot:

Fear of interruptions breaking long ETL and aggregation jobs
Unclear Spot instance mix on Graviton (m8g vs c8g vs r8g)

We know teams are cutting 60–80% with Spot, and Spark fault tolerance should make this viable. Our workloads are batch only (ETL, ad-hoc queries, long aggregations).

Before moving to Spot, we need better visibility into:

CPU-heavy stages
Memory spills
Shuffle and I/O hotspots
Actual dollar impact per stage

Spark UI helps for one-off debugging but not production cost ranking.
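A rough way to get a per-stage dollar ranking is to multiply each stage's total task time by a per-vCPU price. The sketch below is only an illustration: `rank_stages_by_cost` is a hypothetical helper, the price is a made-up input, and it assumes one vCPU per task while ignoring memory, idle executors, and EMR fees.

```python
def rank_stages_by_cost(stages, dollars_per_vcpu_hour):
    """Rank stages by estimated cost.

    stages: iterable of (stage_name, total_task_seconds).
    Assumes each task occupies exactly one vCPU (an assumption, not a Spark fact).
    """
    costed = [
        (name, task_seconds / 3600 * dollars_per_vcpu_hour)
        for name, task_seconds in stages
    ]
    # Most expensive stage first.
    return sorted(costed, key=lambda item: item[1], reverse=True)
```

Task-seconds per stage would have to come from Spark event logs or the history server; that extraction is not shown here.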

Questions:

Best Spot strategy on EMR (capacity-optimized vs price-capacity)?
Typical split: core on on-demand, task nodes mostly Spot?
Savings Plans vs RIs for baseline load?
Any EMR configs for clean Spot fallbacks?

Looking for real-world lessons from teams who optimized first, then added Spot.



https://redd.it/1qed9dy
@r_devops
Resume Review & Next Steps

This is a sanitized version of my resume:
https://imgur.com/rjzJZvB


General Overview:

* I have 7+ years of total experience in IT
* I have just a tad under 4 years of experience in my last role
* My last role is what I consider to be "DevOps in name only", given that I didn't touch CI/CD or containers for the first 2-3 years. It was closer to a generic Cloud or Infrastructure Engineer role


I was recently and abruptly let go from my recent Remote job (no PIP, eligible for rehire, Org was restructuring right up to a new CEO). All I really want is 1) a Remote job and 2) a job where I can spend most of the day in a code editor.


The remote job isn't me being entitled. I moved to an area away from big cities when I held my last job for 2+ years, so it's either 1) find a remote IT job, 2) bag groceries for a living, or 3) move again with 0 income.



I wanted to see if my resume looks generally okay, as general community sentiment seems to be that your resume shouldn't be longer than 1-page unless you have 10+ years of experience. I opted to omit bullet items for older roles as they are less relevant to roles I'm looking for (DevOps, Platform, Cloud, Infrastructure Engineer).


My resume draws from a Full CV where I have other experiences listed, such as setting up a fully 1-click deployment of a Splunk cluster (using Gitlab CI to orchestrate Terraform for Infra + Ansible for Splunk install/configure, with Splunk ingesting logs from AWS via Kinesis Firehose at the end of this).


There is one point of contention or lack in my experience I was hoping to get feedback on.


I listed "Python", but to be honest it was the lowest possible feasible usage of Python, where I wrote a simple (less than 200 lines) script to automate Selenium web browser actions. Jira Server is known to have gaps in its API, so I can't fully automate the setup (inputting a license key) without using Selenium to interact with the web app. The script didn't really make use of functions or classes. As such, I can't honestly say I'd be able to write a Python script to do anything specific if asked during an interview.


Similarly, my only practical experience with Golang was when I "vibe-coded" alterations to a fork of Snyk/driftctl. I fundamentally don't understand the lower-level concepts of Golang, but as an engineer I was still able to decompose how the program worked (it reached out to 100+ separate AWS Service API endpoints to make a multitude of GET requests, leading to API rate limiting issues) enough to figure out a more practical workaround (e.g. replace all separate API calls with a single API call to AWS Config Configuration Recorder API instead).


Based on the [DevOps.sh](http://DevOps.sh) roadmap, I figured my major "lack" is knowing a programming language, so I figured a good "next step" is to learn Golang. I'm curious if I'm on-point about that. It's just that at this point, I'm not sure why you need to learn that and to what extent you need to know it. Is it mostly for scripting or mini-tooling purposes, or do employers generally expect you to develop micro-services like an actual Software Developer?


I come more from the Ops side of IT.

https://redd.it/1qeqgfo
@r_devops
Cyberhaven's Unified DSPM & DLP Platform Launch - Webinar 2/3

Hey r/devops, wanted to share a webinar next week that's relevant if you're dealing with data security in your environments, especially with AI tools in the mix.

Cyberhaven is launching their unified platform that combines AI security, DLP, insider threat management, and NextGen DSPM. Their CEO and product team will be covering:

Getting visibility across cloud, on-prem, and endpoints in one place
Understanding what sensitive data you have, where it lives, and actual risk levels
How AI adoption is creating new data exposure challenges (shadow AI, ChatGPT usage, etc.)
Context-rich data visibility to reduce operational blind spots

If you're managing infrastructure where developers are spinning up cloud resources, using AI coding assistants, or moving data across environments - this covers the visibility and security posture challenges that come with it.

Pretty relevant given how many teams are now dealing with data sprawl from AI tools on top of the usual multi-cloud complexity.

Free registration: https://events.cyberhaven.com/winter-2026-launch/

Date: February 3rd, 2:00 PM — 3:00 PM EST

https://redd.it/1qeu7z3
@r_devops
What happened to getport.io?

If I remember correctly, there was some open source internal developer platform project called Port and it was usually compared to Backstage.

Today I was looking for open-source internal developer platform projects and remembered Port.
But there's no trace of it, and getport.io redirects to port.io, which seems to be a completely closed SaaS platform?

Or am I misremembering things?

https://redd.it/1qjbjiz
@r_devops
Networking for DevOps?

Hi everyone,

I want to understand networking concepts properly, the ones that are essential and useful as a DevOps engineer. I couldn't find any suitable tutorials on YouTube. I would like your suggestions on resources/books I can refer to in order to learn and implement networking concepts on cloud and become a good DevOps engineer.

Any suggestions would be appreciated!

Thanks in advance

https://redd.it/1qj01gb
@r_devops
3 hour+ AOSP builds killing dev velocity. Is a 7 month build system migration really the answer?

Our builds take forever. We're in the middle of an AOSP migration and wondering if anyone has migrated to Bazel successfully? We're talking about migrating tens of thousands of build rules, retooling our entire CI/CD pipeline, and retraining our devs to use Bazel. Our timeline keeps growing.

On a clean build, we're looking at 3+ hours for the full AOSP stack. Like I said, it's killing our dev velocity. How has the fix for slow builds become throwing out your entire build system to learn Bazel? It's genuinely useful, but I'm not sure the benefits are worth pulling our engineering resources for a 7-month-long migration.

Are there any alternatives without the need for a complete system overhaul?

https://redd.it/1qj1uke
@r_devops
TFS / DevOps automation, to delete multiple sources, is this possible

Hi all,

I'm trying to create automation to do a mass delete from TFS/DevOps. Is this possible? I'm running TFS in VS2022 for an SSRS project.

From what I learned, I need to :
1. Delete Source1,Source2,Source3...

2. Commit Delete for all objects from #1.

3. Commit project.

Is this possible with the help of any scripting, probably PowerShell?

Thanks



https://redd.it/1qjnj02
@r_devops
Made a simple file watcher for Python automation pipelines

Kept rewriting watchdog boilerplate for different projects — new file lands, process it, move it somewhere. Made a small library to skip that setup.

https://github.com/MichielMe/flowwatch

Just decorators:

@watcher.on_created("*.csv")
def process(event):
    # handle event.path
    ...

Has process_existing=True which scans the folder on startup — useful when your service restarts and needs to catch up on files that landed while it was down.
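The catch-up idea itself can be sketched with the standard library alone. This is not flowwatch's implementation, just an illustration of the pattern; `scan_existing`, `handler`, and `demo` are hypothetical names.

```python
import fnmatch
import tempfile
from pathlib import Path


def scan_existing(folder, pattern, handler):
    """Call handler(path) for every file already in folder that matches pattern."""
    handled = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and fnmatch.fnmatch(path.name, pattern):
            handler(path)
            handled.append(path.name)
    return handled


def demo():
    """Simulate files that landed while the service was down."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.csv", "b.csv", "notes.txt"):
            (Path(tmp) / name).write_text("x")
        processed = []
        scan_existing(tmp, "*.csv", lambda p: processed.append(p.name))
        return processed
```

A real service would run this scan once at startup and only then start the live watcher, so files are not missed in between.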

Nothing fancy, just trying to save some boilerplate. Curious if anyone else deals with this pattern.

https://redd.it/1qjo6vg
@r_devops
How do you use Go as an SRE/DevOps engineer at work?

I have heard a lot about Go but have never used it at work myself, so I'm interested in how people working in DevOps/SRE use it.

https://redd.it/1qjoz9e
@r_devops
Does an MBA background matter when switching DevOps jobs?

Hi everyone,

I have an MBA background and have been working as a DevOps Engineer for the last 2.4 years. I’m currently planning to switch to another company.

Will my MBA (non-CS) background matter during interviews or shortlisting, or will companies mainly focus on my DevOps experience and skills?

Would love to hear from people who’ve faced something similar or are hiring managers.

Thanks!

https://redd.it/1qjr0jz
@r_devops
How are people persisting application or agent state across restarts locally?

I keep running into the same issue across different projects and I’m curious how others are handling it in practice.

When you’re building something stateful, whether that’s agents, long-running workflows, local services, or edge software, in-memory state disappears on restart. Cloud services solve some of this, but they introduce latency, cost, and dependencies that aren’t always acceptable, especially if the system needs to run locally or offline.

The patterns I’ve seen most often are things like Redis with persistence enabled, using a vector database as “memory”, storing state in Postgres or SQLite, writing ad-hoc files or checkpoints, or just rebuilding state on startup and hoping it’s fast enough.
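For concreteness, the SQLite variant of these patterns might look like the minimal sketch below. The table layout and the `open_store`, `save_state`, and `load_state` helpers are assumptions for illustration, and state is assumed to be JSON-serializable.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) a tiny key/value state store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_state(conn, key, value):
    """Persist value under key; the transaction commits or rolls back as a unit."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )


def load_state(conn, key, default=None):
    """Return the stored value for key, or default if nothing was saved."""
    row = conn.execute("SELECT value FROM state WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else default
```

On restart the process calls `open_store` with the same file path and `load_state` to pick up where it left off.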

All of these approaches work to a point, but they start to feel fragile once restarts are frequent, state grows large, latency needs to be predictable, or the system can’t afford a warmup or rebuild phase. At that stage it feels like we’re forcing tools to do jobs they weren’t really designed for.

I’m genuinely unsure whether there’s a clean, widely accepted way to handle this, or whether everyone just lives with the trade-offs and moves on.

How are people here persisting state or “memory” today? What breaks first in your setup? At what point does Redis, a database, or a DIY approach stop being worth it? Are there patterns that actually hold up long-term that I’m missing?

I’m asking because we’re spending time exploring this problem space and trying to understand whether this is a niche annoyance or a real recurring pain for others.

If this maps to something you’re building, let me know. We’ve built something locally that’s meant to address this, and I’m happy to let interested folks try it out or sanity-check whether it actually helps.

https://redd.it/1qjrowv
@r_devops
Someone built an entire AWS empire in the management account, send help!

I recently joined a company where everything runs in the AWS management account, prod, dev, stage, test, all mixed together. No member accounts. No guardrails. Some resources were clearly created for testing years ago and left running, and figuring out whether they were safe to delete was painful. To make things worse, developers have admin access to the management account. I know this is bad, and I plan to raise it with leadership.

My immediate challenge isn’t fixing the org structure overnight, but the fact that we don’t have any process to track:

* who owns a resource
* why it exists
* how long it should live (especially non-prod)

This leads to wasted spend, confusion during incidents, and risky cleanup decisions. SCPs aren’t an option since this is the management account, and pushing everything into member accounts right now feels unrealistic.
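The three questions above map naturally onto a tag policy. The sketch below is one possible shape for an audit over already-fetched resource data; the tag keys (`owner`, `purpose`, `expires-on`) and the `audit` helper are hypothetical, and fetching the resources from AWS is left out.

```python
from datetime import date

REQUIRED_TAGS = ("owner", "purpose", "expires-on")  # hypothetical tag keys


def audit(resources, today):
    """Return {resource_id: [problems]} for resources that violate the tag policy."""
    findings = {}
    for res in resources:
        tags = res.get("tags", {})
        problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
        expires = tags.get("expires-on")
        if expires:
            try:
                if date.fromisoformat(expires) < today:
                    problems.append(f"expired on {expires}")
            except ValueError:
                problems.append(f"bad expires-on value: {expires}")
        if problems:
            findings[res["id"]] = problems
    return findings
```

A policy like this could exempt prod from `expires-on`; that choice is not modeled here.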

For folks who’ve inherited setups like this:

* What practical process did you put in place first?
* How did you enforce ownership and expiry without SCPs?
* What minimum requirements should DevOps insist on?
* Did you stabilise first, or push early for account separation?

Looking for battle-tested advice, not ideal-world answers 🙂

https://redd.it/1qjs2el
@r_devops
evaluating grafana vs signoz... how important is the UI workflow for incidents?

I am fairly new to observability tools and I am given the task of evaluating an OSS observability tool between grafana and signoz. We are a B2B company, just getting started (about 6 customers).

One consistent difference that I have come across is info in new tab vs in the same view and idk how important it is.

Say in log details, grafana opens a new tab if I want to see associated pod metrics but signoz opens a right panel. I see this in the Traces module too.

What difference does it make? Is it a make or break kinda ui feature? How does it help with incident resolving?

https://redd.it/1qjtoo6
@r_devops
CI CD pipeline from a platform perspective

Hi All,
I have a few queries about CI CD best practices when it comes to workflow ownership by platform team.
We are a newly built platform team and are using GitHub Actions. For our first task, we want to provide a basic workflow (test, lint, checks, etc.) to our different teams using Python.

We want to ensure that it's configurable and that the single source of truth is pyproject.toml.
Questions:
1: How do we ensure that developers can run same checks in local as on CI without config drift between local and CI ?
2: Do we have any best practices when it comes to such offerings from a platform team ?
3: Any pitfalls to avoid or take care of ?

Thanks in advance

https://redd.it/1qjttqe
@r_devops
Needs genuine suggestions!!

I passed my AWS Solutions Architect Associate (SAA) exam last week after preparing for 2 months

A bit about me, what I've been doing, and what I learned while preparing for the AWS SAA:

- Do have working knowledge of Linux

- Python: not a pro, but I understand the basics and can read/write scripts

- Built a small AWS cloud project focused on automation and have basic Python projects too

- Basics of Jenkins

- Not currently working, but I do have 1+ year of experience as an L1 Compute Engineer at a well-known company that works with servers

Right now I’m a bit confused about the next steps.

- What should I be focusing on next to break into a cloud role?

- Should I go deeper into AWS (projects, services), improve Python, or start learning DevOps tools like Docker/Terraform? What should be my immediate next focus?

- And most importantly, should I start applying for cloud roles now, or wait until I skill up more? By roles I mean cloud support and more

Any advice, roadmap suggestions, or personal experiences would really help.

https://redd.it/1qjw8vc
@r_devops
DevOps conference

Hello! Genuinely curious if you guys are tired of seeing the Star Wars theme at industry conferences?

I work for a major tech software company, specifically in the QA space, and I am thinking of switching the theme of our swag and booth. I was wondering if anyone might be able to suggest some themes that would actually draw interest and be a little bit more novel. What would you guys like to get when it comes to swag? What would you guys like to see when it comes to a theme that would stand out and catch your attention?

I'm pondering the idea of retro games or games as a whole, things such as Nintendo, or maybe even board games or some fair games.

Thank you in advance!

https://redd.it/1qjvjp9
@r_devops
Built a skill for Opsy that answers "WTF is costing me money on AWS?"

I've been running a few side projects on AWS and got tired of the monthly ritual of opening Cost Explorer, seeing random charges, and thinking "wtf is this?"

So I built aws-wtf - a skill for Opsy (CLI DevOps agent) that:

1. Pulls your cost breakdown via Cost Explorer API
2. Maps charges to actual resources - no more guessing what eipalloc-07fa453a5acbb5651 is
3. Exports everything to CSV with resource names, ARNs, regions, and human-readable explanations
4. Identifies cost offsets like credits and free tier

ex output:

|Resource|Category|Charge|Monthly Cost|
|:-|:-|:-|:-|
|my-app-backend|Container|ECS Fargate vCPU (0.5 vCPU)|$18.51|
|my-app-prod|Networking|Application Load Balancer hourly|$16.42|
|my-app-prod|Database|RDS db.t3.micro PostgreSQL|$12.82|



Run it monthly before your bill arrives, or when onboarding to a new account to understand what's running.

Link: https://github.com/opsyhq/opsy/tree/main/skills/aws-wtf

Would love feedback. What other AWS mysteries would be useful to decode?

https://redd.it/1qjzpbl
@r_devops
What we actually alert on vs what we just log after years of alert fatigue

Spent the last few weeks documenting our monitoring setup and realized the most important thing isn't the tools. It's knowing what deserves a page vs what should just be a Slack message vs what should just be logged.

Our rule is simple. Alert on symptoms, not causes. High CPU doesn't always mean a problem. Users getting 5xx errors is always a problem.

We break it into three tiers. Page someone when users are affected right now. Slack notification when something needs attention today like a cert expiring in 14 days. Just log it when it's interesting but not urgent.
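The three tiers reduce to a small routing decision. A minimal sketch, where the `Tier` enum and the `route` function are illustrative names rather than anything from our actual tooling:

```python
from enum import Enum


class Tier(Enum):
    PAGE = "page"
    SLACK = "slack"
    LOG = "log"


def route(users_affected_now, needs_attention_today):
    """Pick the notification tier for a signal."""
    if users_affected_now:
        return Tier.PAGE  # symptom: users are hurting right now
    if needs_attention_today:
        return Tier.SLACK  # e.g. a cert expiring in 14 days
    return Tier.LOG  # interesting but not urgent
```

User impact is checked first, so a signal that is both user-affecting and "needs attention today" still pages.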

The other thing that took us years to learn is that if an alert fires and we consistently do nothing about it, we delete the alert. Alert fatigue is real and it makes you ignore the alerts that actually matter.

Wrote up the full 10-layer framework we use covering everything from infrastructure to log patterns. Each layer exists because we missed it once and got burned.

https://tasrieit.com/blog/10-layer-monitoring-framework-production-kubernetes-2026

What's your approach to deciding what gets a page vs a notification?

https://redd.it/1qk1qsn
@r_devops
Questions when hiring Juniors

Hey guys,

I am going to hire 2 jrs to the team and I was wondering what kind of questions you all ask. I am more into getting their mindset, as experience, even though preferred, is not required. I am more looking into getting someone who transitioned from development, especially backend, rather than sysadmin. Not sure if I am being fair or not, but instead of supporters, I am more looking for engineers. How do you guys approach this?

Thanks

EDIT: Thanks a lot for the answers. I see that I am thinking the same way as most of you guys. The post may have been misleading, but I am also more interested in their mindset, curiosity, etc. I am not trying to be harsh towards jrs or anything, I am just a mid who is forced to be lead lol

https://redd.it/1qjz4t0
@r_devops
What’s the worst production outage you’ve seen caused by env/config issues?

I’ve seen multiple production issues caused by environment variables:

- missing keys

- wrong formats

- prod using dev values

- CI passing but prod breaking at runtime

In one case, everything looked green until deployment.

How do teams here actually prevent env/config-related failures?

Do you validate configs in CI, or rely on conventions and docs?
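To make "validate configs in CI" concrete, here is a minimal sketch that covers the first three failure modes listed above. The variable names, the regexes in `SCHEMA`, and the `validate` helper are all hypothetical examples, not a recommendation of a specific tool.

```python
import re

# Hypothetical schema: variable name -> regex the whole value must match.
SCHEMA = {
    "DATABASE_URL": r"postgres(ql)?://\S+",
    "PORT": r"\d{1,5}",
    "APP_ENV": r"dev|stage|prod",
}


def validate(env, expected_env):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for name, pattern in SCHEMA.items():
        value = env.get(name)
        if value is None or value == "":
            errors.append(f"{name}: missing")
        elif not re.fullmatch(pattern, value):
            errors.append(f"{name}: wrong format")
    # Catch "prod using dev values" at the coarsest level.
    actual = env.get("APP_ENV")
    if actual in ("dev", "stage", "prod") and actual != expected_env:
        errors.append(f"APP_ENV: expected {expected_env}, got {actual}")
    return errors
```

In a pipeline this would run against the rendered config for the target environment and fail the job if the list is non-empty.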



https://redd.it/1qk4zol
@r_devops
RESUME Review request (7+ YOE, staff Platform Engineering)

This is my current resume : https://imgur.com/a/H9ztGeD

I've recently been laid off due to company wide restructuring.

I took a break and have started rewriting my resume to target Platform Engineering / DevEx roles.

Is there anything that screams red flags on my resume? (I definitely want to rewrite the service discovery bullet point; it comes across as low-impact BS compared to the actual work done, and I want to be concise to keep it to one page.)

I have been getting interview calls and recruiters reaching out, but most of them tend to fall far below my comp range (ideally $200k+ and remote as a baseline, which as it stands is still a sizable pay cut from my previous role). I've restarted the LeetCode grind (which hopefully I won't need to grind hard for serious Platform/DevEx roles) for some of the FAANG-tier postings, but I don't think I'll apply to them for a few more weeks.

Edit: Definitely need to fix grammar in quite a few places

https://redd.it/1qk5b9i
@r_devops