Networking for DevOps?
Hi everyone,
I want to understand networking concepts properly, specifically the ones that are essential and useful to a DevOps engineer. I couldn't find any suitable tutorials on YouTube. I would like your suggestions on resources or books I can use to learn networking concepts, implement them in the cloud, and become a good DevOps engineer.
Any suggestions would be appreciated!
Thanks in advance
https://redd.it/1qj01gb
@r_devops
3 hour+ AOSP builds killing dev velocity. Is a 7 month build system migration really the answer?
Our builds take forever. We're in the middle of an AOSP migration and wondering whether anyone has migrated to Bazel successfully. We're talking about migrating tens of thousands of build rules, retooling our entire CI/CD pipeline, and retraining our devs to use Bazel. Our timeline keeps growing.
On a clean build, we're looking at 3+ hours for the full AOSP stack. Like I said, it's killing our dev velocity. How has the fix for slow builds become throwing out your entire build system to learn Bazel? It's genuinely useful, but I'm not sure the benefits are worth tying up our engineering resources for a 7-month migration.
Are there any alternatives without the need for a complete system overhaul?
https://redd.it/1qj1uke
@r_devops
TFS / DevOps automation to delete multiple sources: is this possible?
Hi all,
I'm trying to automate a mass delete from TFS/DevOps. Is this possible? I'm using TFS from VS2022 for an SSRS project.
From what I've learned, I need to:
1. Delete Source1, Source2, Source3...
2. Commit the delete for all objects from #1.
3. Commit the project.
Is this possible with some scripting, probably PowerShell?
Thanks
https://redd.it/1qjnj02
@r_devops
Made a simple file watcher for Python automation pipelines
Kept rewriting watchdog boilerplate for different projects — new file lands, process it, move it somewhere. Made a small library to skip that setup.
https://github.com/MichielMe/flowwatch
Just decorators:
@watcher.on_created("*.csv")
def process(event):
    # handle event.path
    ...
Has process_existing=True which scans the folder on startup — useful when your service restarts and needs to catch up on files that landed while it was down.
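To make the catch-up idea concrete, here is a minimal standard-library sketch of a startup scan. It is not FlowWatch's implementation or API: `catch_up`, its parameters, and the `seen` set are hypothetical names used only to illustrate the pattern of handling files that landed while the service was down.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Callable, Set


def catch_up(folder: Path, pattern: str,
             handler: Callable[[Path], None], seen: Set[Path]) -> int:
    """Call handler once for each matching file that has not been handled yet.

    Returns the number of files handled in this pass.
    """
    handled = 0
    # Sort for a deterministic processing order across restarts.
    for path in sorted(folder.iterdir()):
        if path.is_file() and fnmatch(path.name, pattern) and path not in seen:
            handler(path)
            seen.add(path)
            handled += 1
    return handled
```

A service would call something like this once at startup, before it begins watching for new events. In a real setup the `seen` set would have to survive restarts too, for example by moving processed files to another folder.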
Nothing fancy, just trying to save some boilerplate. Curious if anyone else deals with this pattern.
https://redd.it/1qjo6vg
@r_devops
How do you use Go as an SRE/DevOps engineer at work?
I have heard a lot about Go but have never used it at work myself, so I'm interested in how people working in DevOps/SRE roles use it.
https://redd.it/1qjoz9e
@r_devops
Does an MBA background matter when switching DevOps jobs?
Hi everyone,
I have an MBA background and have been working as a DevOps Engineer for the last 2.4 years. I’m currently planning to switch to another company.
Will my MBA (non-CS) background matter during interviews or shortlisting, or will companies mainly focus on my DevOps experience and skills?
Would love to hear from people who’ve faced something similar or are hiring managers.
Thanks!
https://redd.it/1qjr0jz
@r_devops
How are people persisting application or agent state across restarts locally?
I keep running into the same issue across different projects and I’m curious how others are handling it in practice.
When you’re building something stateful, whether that’s agents, long-running workflows, local services, or edge software, in-memory state disappears on restart. Cloud services solve some of this, but they introduce latency, cost, and dependencies that aren’t always acceptable, especially if the system needs to run locally or offline.
The patterns I’ve seen most often are things like Redis with persistence enabled, using a vector database as “memory”, storing state in Postgres or SQLite, writing ad-hoc files or checkpoints, or just rebuilding state on startup and hoping it’s fast enough.
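As a reference point for the "storing state in SQLite" option, here is a minimal sketch of a local key-value checkpoint that survives restarts. It uses only the Python standard library; the `StateStore` class, the `state` table, and the default file name are assumptions made up for illustration, not something from this post.

```python
import json
import sqlite3
from typing import Any, Optional


class StateStore:
    """Tiny JSON-over-SQLite checkpoint store (illustrative only)."""

    def __init__(self, path: str = "state.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, key: str, value: Any) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def load(self, key: str, default: Optional[Any] = None) -> Any:
        row = self.conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

    def close(self) -> None:
        self.conn.close()
```

This covers small, JSON-serializable state. It does not address the large-state or predictable-latency cases raised in the next paragraph.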
All of these approaches work to a point, but they start to feel fragile once restarts are frequent, state grows large, latency needs to be predictable, or the system can’t afford a warmup or rebuild phase. At that stage it feels like we’re forcing tools to do jobs they weren’t really designed for.
I’m genuinely unsure whether there’s a clean, widely accepted way to handle this, or whether everyone just lives with the trade-offs and moves on.
How are people here persisting state or “memory” today? What breaks first in your setup? At what point does Redis, a database, or a DIY approach stop being worth it? Are there patterns that actually hold up long-term that I’m missing?
I’m asking because we’re spending time exploring this problem space and trying to understand whether this is a niche annoyance or a real recurring pain for others.
If this maps to something you’re building, let me know. We’ve built something locally that’s meant to address this, and I’m happy to let interested folks try it out or sanity-check whether it actually helps.
https://redd.it/1qjrowv
@r_devops
Someone built an entire AWS empire in the management account, send help!
I recently joined a company where everything runs in the AWS management account: prod, dev, stage, and test, all mixed together. No member accounts. No guardrails. Some resources were clearly created for testing years ago and left running, and figuring out whether they were safe to delete was painful. To make things worse, developers have admin access to the management account. I know this is bad, and I plan to raise it with leadership.
My immediate challenge isn’t fixing the org structure overnight, but the fact that we don’t have any process to track:
* who owns a resource
* why it exists
* how long it should live (especially non-prod)
This leads to wasted spend, confusion during incidents, and risky cleanup decisions. SCPs aren’t an option since this is the management account, and pushing everything into member accounts right now feels unrealistic.
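One way to make the three questions above checkable is to require them as tags and audit for gaps. The sketch below is plain Python over a tag dictionary and makes no AWS calls; the tag keys `owner`, `purpose`, and `expires-on` are hypothetical names chosen for illustration, and feeding in real tags (from whatever inventory you have) is left out.

```python
from datetime import date
from typing import Dict, List

# Hypothetical tag keys: who owns it, why it exists, how long it should live.
REQUIRED_TAGS = ("owner", "purpose", "expires-on")


def tag_problems(tags: Dict[str, str], today: date) -> List[str]:
    """Return a list of human-readable problems for one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    expires = tags.get("expires-on")
    if expires:
        try:
            if date.fromisoformat(expires) < today:
                problems.append(f"expired on {expires}")
        except ValueError:
            problems.append(f"unparseable expires-on: {expires}")
    return problems
```

A report built from this kind of check gives cleanup decisions a paper trail even before any account separation happens.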
For folks who’ve inherited setups like this:
* What practical process did you put in place first?
* How did you enforce ownership and expiry without SCPs?
* What minimum requirements should DevOps insist on?
* Did you stabilise first, or push early for account separation?
Looking for battle-tested advice, not ideal-world answers 🙂
https://redd.it/1qjs2el
@r_devops
Evaluating Grafana vs SigNoz: how important is the UI workflow for incidents?
I am fairly new to observability tools, and I have been given the task of choosing an OSS observability tool between Grafana and SigNoz. We are a B2B company that is just getting started (about 6 customers).
One consistent difference I have come across is whether related info opens in a new tab or in the same view, and I don't know how important that is.
For example, in log details, Grafana opens a new tab if I want to see the associated pod metrics, while SigNoz opens a panel on the right. I see the same thing in the Traces module.
What difference does it make? Is it a make-or-break UI feature? How does it help with incident resolution?
https://redd.it/1qjtoo6
@r_devops
CI/CD pipeline from a platform perspective
Hi All,
I have a few questions about CI/CD best practices when workflows are owned by the platform team.
We are a newly built platform team using GitHub Actions. As our first task, we want to provide a basic workflow (tests, lint, checks, etc.) to our different teams that use Python.
We want it to be configurable, with pyproject.toml as the single source of truth.
Questions:
1: How do we ensure that developers can run the same checks locally as in CI, without config drift between the two?
2: Are there any best practices for this kind of offering from a platform team?
3: Any pitfalls to avoid or watch out for?
Thanks in advance
https://redd.it/1qjttqe
@r_devops
Need genuine suggestions!
I passed my AWS Solutions Architect Associate (SAA) exam last week after preparing for 2 months.
A bit about me, what I have been doing, and what I learnt while preparing for the AWS SAA:
- I have working knowledge of Linux
- Python: not a pro, but I understand the basics and can read/write scripts
- Built a small AWS cloud project focused on automation, and I have some basic Python projects too
- Basics of Jenkins
- Not currently working, but I do have 1+ year of experience as an L1 Compute Engineer at a well-known company that works with servers
Right now I'm a bit confused about the next steps.
- What should I focus on next to break into a cloud role?
- Should I go deeper into AWS (projects, services), improve my Python, or start learning DevOps tools like Docker/Terraform? What should my immediate next focus be?
- And most importantly, should I start applying for cloud roles now, or wait until I have skilled up more? By roles I mean cloud support and similar.
Any advice, roadmap suggestions, or personal experiences would really help.
https://redd.it/1qjw8vc
@r_devops
DevOps conference
Hello! Genuinely curious: are you guys tired of seeing Star Wars themes at industry conferences?
I work for a major tech software company, specifically in the QA space, and I am thinking of switching the theme of our swag and booth. Can anyone suggest themes that would actually draw interest and be a bit more novel? What would you guys like to get when it comes to swag? What kind of theme would stand out and catch your attention?
I'm pondering retro games, or games in general: things like Nintendo, or maybe even board games or fairground games.
Thank you in advance!
https://redd.it/1qjvjp9
@r_devops
Built a skill for Opsy that answers "WTF is costing me money on AWS?"
I've been running a few side projects on AWS and got tired of the monthly ritual of opening Cost Explorer, seeing random charges, and thinking "wtf is this?"
So I built aws-wtf - a skill for Opsy (CLI DevOps agent) that:
1. Pulls your cost breakdown via Cost Explorer API
2. Maps charges to actual resources - no more guessing what eipalloc-07fa453a5acbb5651 is
3. Exports everything to CSV with resource names, ARNs, regions, and human-readable explanations
4. Identifies cost offsets like credits and free tier
Example output:
|Resource|Category|Charge|Monthly Cost|
|:-|:-|:-|:-|
|my-app-backend|Container|ECS Fargate vCPU (0.5 vCPU)|$18.51|
|my-app-prod|Networking|Application Load Balancer hourly|$16.42|
|my-app-prod|Database|RDS db.t3.micro PostgreSQL|$12.82|
Run it monthly before your bill arrives, or when onboarding to a new account to understand what's running.
Link: https://github.com/opsyhq/opsy/tree/main/skills/aws-wtf
Would love feedback. What other AWS mysteries would be useful to decode?
https://redd.it/1qjzpbl
@r_devops
What we actually alert on vs what we just log after years of alert fatigue
Spent the last few weeks documenting our monitoring setup and realized the most important thing isn't the tools. It's knowing what deserves a page vs what should just be a Slack message vs what should just be logged.
Our rule is simple. Alert on symptoms, not causes. High CPU doesn't always mean a problem. Users getting 5xx errors is always a problem.
We break it into three tiers. Page someone when users are affected right now. Slack notification when something needs attention today like a cert expiring in 14 days. Just log it when it's interesting but not urgent.
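The three tiers can be written down as a small routing rule. This is only a sketch of the decision described above, not the framework from the linked write-up; the `Signal` fields and the `Tier` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = "page"    # users are affected right now
    SLACK = "slack"  # needs attention today
    LOG = "log"      # interesting but not urgent


@dataclass
class Signal:
    name: str
    users_affected_now: bool = False
    needs_attention_today: bool = False


def route(signal: Signal) -> Tier:
    """Pick the notification tier, checking the most urgent condition first."""
    if signal.users_affected_now:
        return Tier.PAGE
    if signal.needs_attention_today:
        return Tier.SLACK
    return Tier.LOG
```

For example, a burst of 5xx errors would set `users_affected_now`, a certificate expiring in 14 days would set `needs_attention_today`, and high CPU with no user impact would fall through to the log tier.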
The other thing that took us years to learn is that if an alert fires and we consistently do nothing about it, we delete the alert. Alert fatigue is real and it makes you ignore the alerts that actually matter.
Wrote up the full 10-layer framework we use covering everything from infrastructure to log patterns. Each layer exists because we missed it once and got burned.
https://tasrieit.com/blog/10-layer-monitoring-framework-production-kubernetes-2026
What's your approach to deciding what gets a page vs a notification?
https://redd.it/1qk1qsn
@r_devops
Questions when hiring Juniors
Hey guys,
I am going to hire 2 juniors for the team and I was wondering what kind of questions you all ask. I care more about getting a sense of their mindset, since experience is preferred but not required. I am mostly looking for someone who transitioned from development, especially backend, rather than from sysadmin work. I'm not sure whether that's fair, but I am looking for engineers rather than support staff. How do you guys approach this?
Thanks
EDIT: Thanks a lot for the answers. I see that I am thinking the same way as most of you. The post may have been misleading, but I am also more interested in their mindset, curiosity, etc. I am not trying to be harsh towards juniors or anything; I am just a mid-level engineer who is forced to be a lead lol
https://redd.it/1qjz4t0
@r_devops
What’s the worst production outage you’ve seen caused by env/config issues?
I’ve seen multiple production issues caused by environment variables:
- missing keys
- wrong formats
- prod using dev values
- CI passing but prod breaking at runtime
In one case, everything looked green until deployment.
How do teams here actually prevent env/config-related failures?
Do you validate configs in CI, or rely on conventions and docs?
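For anyone weighing the "validate in CI" option, here is a minimal sketch of what such a check could look like. The variable names (`DATABASE_URL`, `PORT`, `APP_ENV`) and their patterns are made-up examples, not a recommendation for any particular app; the point is only that missing keys, wrong formats, and prod-using-dev values can each be tested before deployment.

```python
import re
from typing import Dict, List, Mapping, Pattern

# Hypothetical rules: variable name -> pattern the whole value must match.
RULES: Dict[str, Pattern[str]] = {
    "DATABASE_URL": re.compile(r"postgres(ql)?://.+"),
    "PORT": re.compile(r"\d{1,5}"),
    "APP_ENV": re.compile(r"dev|stage|prod"),
}


def validate_env(env: Mapping[str, str], expected_env: str) -> List[str]:
    """Return a list of problems; an empty list means the config passed."""
    errors: List[str] = []
    for key, pattern in RULES.items():
        value = env.get(key)
        if not value:
            errors.append(f"{key}: missing")
        elif not pattern.fullmatch(value):
            errors.append(f"{key}: wrong format")
    # Catch a well-formed value that belongs to a different environment.
    app_env = env.get("APP_ENV", "")
    if RULES["APP_ENV"].fullmatch(app_env) and app_env != expected_env:
        errors.append(f"APP_ENV: is {app_env!r}, expected {expected_env!r}")
    return errors
```

A CI step would run this against the rendered config for each target environment and fail the job when the list is non-empty.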
https://redd.it/1qk4zol
@r_devops
RESUME Review request (7+ YOE, staff Platform Engineering)
This is my current resume: https://imgur.com/a/H9ztGeD
I've recently been laid off due to company-wide restructuring.
I took a break and have started rewriting my resume to target Platform Engineering / DevEx roles.
Is there anything that screams red flags on my resume? (I definitely want to rewrite the service discovery bullet point; it comes across as low-impact BS compared to the actual work done, and I want to be concise to keep it to one page.)
I have been getting interview calls and recruiters reaching out, but most of the offers fall far below my comp range (ideally $200k+ and remote as a baseline, which as it stands is still a sizable pay cut from my previous role). I've restarted the leetcode grind (which hopefully I won't need to grind hard for serious Platform/DevEx roles) for some of the FAANG-tier postings, but I don't think I'll apply to them for a few more weeks.
Edit: Definitely need to fix grammar in quite a few places
https://redd.it/1qk5b9i
@r_devops
Server setup - suggestions
We have a beefy server with 2x64 AMD EPYC cores, 1+ TB RAM, multiple Nvidia data center GPUs, etc.
The plan is to use it to train AI models on images, videos, lidar data, etc., and of course maybe host some LLMs as well, possibly adding more GPUs and/or servers later.
So far I have started the setup with Proxmox, configuring everything with Ansible so that it all lives in a git repo. The plan is to run Kubernetes on Talos VMs and use Kubeflow Pipelines, so we can schedule GPUs efficiently where required (possibly with Nvidia MIG) and run ML pipelines easily.
Is this a bad way of doing this?
Any recommendations for this kind of use case?
https://redd.it/1qka15z
@r_devops
Stop trusting your Terraform State file. It’s lying to you.
I've been in a bit of a debate with my platform team this week and wanted to sanity check this with you guys.
We’re doing a massive migration for a Sovereign Cloud environment, so compliance is tight. During the audit, I realized something that scared the hell out of me: we treat our Terraform State file like it's the gospel truth. But it's not. It's just a cached memory of what infrastructure used to look like.
The moment a Junior Admin hotfixes a Security Group in the AWS Console at 2 AM because "prod is down," that State file is technically corrupt. It doesn't match Reality (the Cloud API) anymore.
Most of our pipelines were just running terraform plan, which compares Git vs State. It assumes State is accurate. It completely ignores the fact that someone might have clicked around in the console three days ago.
So, I forced a change: The Hard Drift Gate.
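We added terraform plan -refresh-only -detailed-exitcode before the regular plan. If it returns Exit Code 2 (Drift Detected), the pipeline dies. Hard stop. No deploying new code until you acknowledge or import the manual changes.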
The pushback has been real. Half my team hates it. They say it kills velocity because they can't just "blast out a fix" if there's existing drift from a previous hotfix. They have to clean up the mess first.
My argument: Deploying on top of unknown manual changes isn't "velocity," it's negligence. Especially when a manual change might have exposed a private bucket to the public internet, and your standard apply might just silently overwrite it (or worse, ignore it).
I wrote up the exact bash logic we used to trap the exit codes and how we filter out "noise" vs "actual risks" (like data residency violations). I pinned the full write-up to my profile if anyone wants to steal the noscript, to avoid spamming the sub.
Am I being too strict here? How do you guys handle the "ClickOps" gap? Do you block the pipeline, or just let Terraform bulldoze over the manual changes and hope for the best?
https://redd.it/1qk60ll
@r_devops
I've been in a bit of a debate with my platform team this week and wanted to sanity check this with you guys.
We’re doing a massive migration for a Sovereign Cloud environment, so compliance is tight. During the audit, I realized something that scared the hell out of me: we treat our Terraform State file like it's the gospel truth. But it's not. It's just a cached memory of what infrastructure used to look like.
The moment a Junior Admin hotfixes a Security Group in the AWS Console at 2 AM because "prod is down," that State file is technically corrupt. It doesn't match Reality (the Cloud API) anymore.
Most of our pipelines were just running terraform plan, which compares Git vs State. It assumes State is accurate. It completely ignores the fact that someone might have clicked around in the console three days ago.
So, I forced a change: The Hard Drift Gate.
We added terraform plan -refresh-only -detailed-exitcode before the regular plan. If it returns Exit Code 2 (Drift Detected), the pipeline dies. Hard stop. No deploying new code until you acknowledge or import the manual changes.
The pushback has been real. Half my team hates it. They say it kills velocity because they can't just "blast out a fix" if there's existing drift from a previous hotfix. They have to clean up the mess first.
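To make the gate concrete, here is a minimal sketch of the exit-code handling in Python. It is not the exact script from the write-up, just one way to wire it up. The function names `gate_decision` and `run_drift_gate` are made up for illustration, and the extra `-input=false` and `-no-color` flags are an assumption about what a CI job would want. The exit codes are the ones `-detailed-exitcode` documents: 0 for no changes, 1 for an error, 2 for changes present.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES, ERROR, CHANGES_PRESENT = 0, 1, 2


def gate_decision(exit_code: int) -> str:
    """Map the refresh-only plan's exit code to a pipeline decision."""
    if exit_code == NO_CHANGES:
        return "proceed"
    if exit_code == CHANGES_PRESENT:
        return "block: drift detected"
    # Exit code 1 (or anything unexpected) means the plan itself failed.
    return "block: plan failed"


def run_drift_gate(workdir: str = ".") -> str:
    """Run the refresh-only plan in `workdir` and return the decision."""
    result = subprocess.run(
        ["terraform", "plan", "-refresh-only", "-detailed-exitcode",
         "-input=false", "-no-color"],
        cwd=workdir,
    )
    return gate_decision(result.returncode)
```

In a pipeline step you would call `run_drift_gate()` and exit non-zero unless it returns "proceed", so a failed plan blocks the deploy just like detected drift does.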
My argument: Deploying on top of unknown manual changes isn't "velocity," it's negligence. Especially when a manual change might have exposed a private bucket to the public internet, and your standard apply might just silently overwrite it (or worse, ignore it).
I wrote up the exact bash logic we used to trap the exit codes and how we filter out "noise" vs "actual risks" (like data residency violations). To avoid spamming the sub, I pinned the full write-up to my profile if anyone wants to steal the script.
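As a rough illustration of the "noise" vs "actual risks" split (again, not the logic from the write-up), the sketch below reads the JSON form of a saved plan, as produced by `terraform show -json`, and buckets the entries under its `resource_drift` key by resource type. The `HIGH_RISK_TYPES` set and the `classify_drift` function are hypothetical; which types count as risky is a policy decision for your own environment.

```python
import json

# Hypothetical policy: resource types whose drift should always block.
HIGH_RISK_TYPES = {
    "aws_security_group",
    "aws_s3_bucket_public_access_block",
    "aws_s3_bucket_policy",
}


def classify_drift(plan_json: str) -> dict:
    """Split drifted resource addresses into 'risk' and 'noise' lists."""
    plan = json.loads(plan_json)
    buckets = {"risk": [], "noise": []}
    # `resource_drift` lists resources changed outside of Terraform.
    for item in plan.get("resource_drift", []):
        key = "risk" if item.get("type") in HIGH_RISK_TYPES else "noise"
        buckets[key].append(item.get("address"))
    return buckets


# Hand-written sample input, not real plan output.
sample = json.dumps({
    "resource_drift": [
        {"address": "aws_security_group.web", "type": "aws_security_group"},
        {"address": "aws_instance.app", "type": "aws_instance"},
    ]
})
print(classify_drift(sample))
```

A stricter variant would treat every drifted resource as blocking and use the list only to decide who gets paged.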
Am I being too strict here? How do you guys handle the "ClickOps" gap? Do you block the pipeline, or just let Terraform bulldoze over the manual changes and hope for the best?
https://redd.it/1qk60ll
@r_devops
PM question: what to do when automation becomes just another project?
I sit between product and QA, and lately automation is feeling like a whole project all on its own.
Manual regression is slow and frustrating, but every time we try to automate more it seems to come with a load of headaches: months of setup, new tools to learn, and only one or two people on the team who actually know how it works.
It's making automation hard to justify when timelines are already tight.
For teams that actually made the transition to automated testing, what made it click?
Trying to figure this out before we invest more time into it.
https://redd.it/1qk2h48
@r_devops
Have you used adviser.sh - if so, what were your experiences?
Someone told me about this at work today. Looking at the blurb for https://github.com/adviserlabs/docs/tree/main, it seems to promise a way to run large-scale compute and data workflows without having to know how infrastructure, cloud configuration, or orchestration details work...
I’m generally skeptical of “magic” abstractions, but I’ve spent a fair amount of time dealing with HPC clusters, cloud schedulers, and workflow tooling, so I can see how something like this could be useful.
What’s it actually like in practice?
https://redd.it/1qke9uz
@r_devops