Launch container on first connection
I'm trying to imagine how I could implement Cloud Run's scale-to-zero feature.
Let's say I'm running either containers with CRIU or KVM images; the scenario would be:
- A client starts a request (the protocol might be HTTP, TCP, UDP, ...)
- The node receives the request
- If a container is ready to serve, forward the connection as normal
- If no container is available, first start it, then forward the connection
I can imagine implementing this via a load balancer (eBPF? A custom app?) that would be in charge of terminating connections, but I'm fuzzy on the details.
- Wouldn't the connection possibly time out while the container is starting? I could mitigate this by using CRIU for fast boots
- Are there any projects already covering this?
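The flow above can be sketched as a minimal single-node lazy-start proxy (Python rather than eBPF, purely to illustrate the control flow; `start_fn` is a hypothetical hook standing in for whatever actually boots the container or restores the CRIU image):

```python
import socket
import threading

class LazyBackend:
    """Boots the backend on the first connection, then reuses it."""
    def __init__(self, start_fn):
        self.start_fn = start_fn  # hypothetical hook that cold-starts the container
        self.addr = None          # (host, port) once the backend is up
        self.lock = threading.Lock()

    def ensure_started(self):
        with self.lock:
            if self.addr is None:
                # Cold start happens here: the client's TCP connection is
                # already accepted, so it waits instead of being refused.
                self.addr = self.start_fn()
            return self.addr

def proxy_once(client, backend):
    """Forward one request/response exchange to the (lazily started) backend."""
    upstream = socket.create_connection(backend.ensure_started())
    upstream.sendall(client.recv(4096))
    client.sendall(upstream.recv(4096))
    upstream.close()
```

Because the listener accepts the TCP connection immediately and only then triggers the cold start, the client sees added latency rather than a refused connection; it can still time out at the application level if the boot takes longer than the client's deadline, which is exactly where CRIU-style fast restores help.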
https://redd.it/1pccrxa
@r_devops
Company is starting on-call soon. What should I request from management? Money? Separate phone? etc...
For reference, I'm in the US, a salaried employee, and 100% remote.
We've been working on a greenfield project for a few years that is now just going live and it's going to require our entire department (not just my team) to be on-call.
I guess this is my chance to make some requests from management. Should I ask to be compensated for on-call? Should I ask for a separate phone? Should I pull out my contract and refuse to do on-call if it's not in writing?
https://redd.it/1pcdh3r
@r_devops
Should I give my Fiverr dev my login to my hosting account?
So I am asking because I don’t know if it is safe to give the developer I hired my login with my personal information on the account.
He said, “The work related to Dokan requires scripting in the backend inside the custom files. This cannot be done from the WordPress dashboard. And for scripting, the database also needs to be configured, so without access to the database, how will the work be done?
And if I’m editing but I don’t have hosting permissions, then how will I insert the script?”
So I made him a database dev account on phpMyAdmin, a cPanel FTP account, and an admin account on my WordPress site, but he said that he still needs my login. Is it safe/should I give him my login? He has 5 stars and 178 reviews on Fiverr and is from Bangladesh.
https://redd.it/1pcfx6d
@r_devops
From PSI to kill signal: logic I used so auto-remediation doesn’t kill good workloads
Last week I posted here about using Linux PSI instead of just CPU% for “is this box in trouble?” checks.
This is a follow-up: if you trust PSI as a signal, how do you actually act on it without killing normal spikes?
The naive thing I’ve seen (and done myself before) is:
if CPU > 90% for N seconds -> kill / restart
That works until it doesn’t. Enough times I’ve seen:
- a JVM starting
- image builds
- some heavy batch job
CPU goes high for a bit, everything is actually fine, but the “helper” script freaks out and kills it. So now I use two signals plus a grace period.
Rough rule:
1. CPU is high (example: > 90%)
2. AND CPU PSI is high (example: cpu some avg10 > 40)
3. AND both stay high for N seconds (I use 15s). If any of these drops during the grace period, I reset state and do nothing.
Only if all three stay true for 15 seconds, then I:
- look at per-process stats (CPU, fork rate, short jobs / crash loops)
- pick the top offender
- send kill to that one process
This avoids the classic “kill → restart → kill → restart” loop from pure CPU-based rules. Short normal spikes don’t keep PSI high for 15 seconds. Real runaway jobs usually do.
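The three-condition rule with the grace-period reset can be sketched as a small state machine (Python for brevity; the thresholds mirror the examples above and are assumptions, not tuned values):

```python
import time

CPU_HIGH = 90.0   # percent
PSI_HIGH = 40.0   # cpu "some" avg10
GRACE = 15.0      # seconds both signals must stay high

class Gate:
    """Fires only when CPU and PSI stay high for the full grace period.

    Any dip in either signal resets the window, so short normal
    spikes never trigger remediation."""
    def __init__(self, now=time.monotonic):
        self.now = now        # injectable clock, handy for testing
        self.since = None     # when both signals first went high

    def update(self, cpu_pct, psi_avg10):
        if cpu_pct > CPU_HIGH and psi_avg10 > PSI_HIGH:
            if self.since is None:
                self.since = self.now()   # grace window opens
            return self.now() - self.since >= GRACE
        self.since = None                 # any dip resets state
        return False
```

A caller would poll `update()` once a second and only pick a victim process when it returns `True`.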
I wrote up the Rust code that does this:
- read /proc/pressure/cpu (PSI)
- combine with eBPF events (fork/exec/exit)
- apply the rule above and choose a victim
Write-up + code is here: [https://getlinnix.substack.com/p/from-psi-to-kill-signal-the-rust](https://getlinnix.substack.com/p/from-psi-to-kill-signal-the-rust)
This runs inside Linnix (a small eBPF-based tool I’m hacking on: [github.com/linnix-os/linnix](https://github.com/linnix-os/linnix)), but the idea is generic. Even a bash script checking /proc/pressure/cpu in a loop with a grace period would be safer than plain CPU > 90% -> kill.
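For reference, reading the signal is just parsing `/proc/pressure/cpu`, whose `some` line looks like `some avg10=1.23 avg60=0.50 avg300=0.10 total=12345` (newer kernels add a `full` line). A minimal parser:

```python
def parse_cpu_psi(text: str) -> dict:
    """Parse the 'some' line of /proc/pressure/cpu into a dict of floats,
    e.g. {'avg10': 1.23, 'avg60': 0.5, 'avg300': 0.1, 'total': 12345.0}."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            return {k: float(v) for k, v in (f.split("=") for f in fields[1:])}
    raise ValueError("no 'some' line found in PSI data")

# In a real loop you would read the live file each tick:
#   with open("/proc/pressure/cpu") as f:
#       psi = parse_cpu_psi(f.read())["avg10"]
```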
Curious how people here handle this:
- Do you use a grace period for auto-remediation (k8s, systemd, custom scripts)?
- Do you gate actions on more than one signal (CPU + PSI / latency / queue depth / error rate)?
- Any stories where “CPU threshold → restart/kill” caused more damage than it fixed?
https://redd.it/1pcgk8r
@r_devops
Substack
From PSI to Kill Signal: The Rust Circuit Breaker Inside Linnix
Last week I wrote about why CPU% is not enough, and why I started using PSI (/proc/pressure/*) to judge if a box is really in trouble.
Beginner in AWS: Need Mock Tests and Project Recommendations
I’ve been learning AWS for the past 2-3 months, along with Terraform, GitLab, Kubernetes, and Docker through YouTube tutorials and hands-on practice. I’m now looking to work on more structured, real-world projects - possibly even contributing to public cloud-related projects to build practical experience.
I’m also planning to take the AWS Cloud Practitioner exam. Could anyone suggest resources or websites that offer mock tests in an exam-like environment? Also, any recommendations for platforms where I can find beginner-friendly cloud projects to build my portfolio would be greatly appreciated.
https://redd.it/1pcghje
@r_devops
Any good DevOps podcasts?
Joining a company operating in the DevOps space and I want to keep up to date with current trends.
Any good podcasts you recommend?
https://redd.it/1pcjqlk
@r_devops
RSS Feeds - DevOps, Cloud, etc
Hey everyone!
Kind of a fun post here but wanted to refresh my Reeder app and do some spring cleaning. With the advent of so much this year between AI and all sorts of barely-working-great-ideas from Harvard MBAs, thought it would be nice to see what kinds of RSS feeds people use/watch regularly across a few topics:
* DevOps
* Cloud (AWS, Azure, GCP, whatever)
* Infrastructure tech like CDNs, WAFs, ALBs, etc
* Platform Eng like K8s, Docker, etc.
* Programming, maybe fixated on IaC as a topic
Anyway, my feeds have been a mix of broken or just not great lately, so I thought I'd ask the community.
https://redd.it/1pcio1c
@r_devops
Backstage plugin to update an entity
I have created a Backstage plugin that allows updating a catalog entity from the same scaffolder template it was created with. This enables updating an entity as a self-service from the entity's own page; the values are pre-populated, with conditional steps if needed.
You can check it out here:
Entity scaffolder plugin
https://redd.it/1pcqi8q
@r_devops
GitHub
backstage-plugins/plugins/entity-scaffolder at main · TheCodingSheikh/backstage-plugins
Contribute to TheCodingSheikh/backstage-plugins development by creating an account on GitHub.
Azure Engineer or SRE: which has more of a future?
I am a fresh grad with 1 year of working experience (including an internship) as a backend developer. I am really interested in cloud and DevOps. I recently received 2 interviews: Azure Engineer and SRE. I wonder which path has more of a future?
The Azure Engineer position basically focuses on IaaS, deployment, and writing Terraform. They said I might have a chance to do CI/CD pipelines in the future... I am wondering, is it a good path toward Cloud Engineer / DevOps Engineer? They also mentioned that it is very easy to pick up... I am afraid this is just a simple deployment job. But they did mention that I will design infrastructure etc., and there is a lot to learn.
Or is SRE better? Which path has more of a future?
Hope to seek opinions from you all ..🙏🏻
https://redd.it/1pcrhhd
@r_devops
List of 50 top companies in 2025 that hire DevOps engineers!
https://devopsprojectshq.com/role/top-devops-companies-2025/
https://redd.it/1pcwt6t
@r_devops
DevOps Projects
Top 50 DevOps Companies Hiring DevOps Engineers in 2025
Data-driven ranking of companies hiring DevOps engineers based on active job openings. Find your next DevOps career opportunity.
Is Continuous Exposure Management the true SecDevOps endgame?
We talk a lot about "Shift Left," but the reality is security findings often hit the CI/CD pipeline late, or they are generated by a vulnerability scanner that doesn't understand the context of the running application.
I'm looking at this idea of Exposure Management, which seems like the natural evolution of SecDevOps/SRE practices. It forces security to be integrated and continuous, covering the entire lifecycle: code repos, cloud configurations, deployed application, and user identity. The goal is to continuously assess risk, not just find flaws.
If you are running a mature SecDevOps pipeline, how are you ensuring that security findings from different tools (SAST, DAST, CSPM, etc.) are unified and prioritized to show a single, clear measure of risk, rather than just raw vulnerability counts?
https://redd.it/1pcxb2x
@r_devops
Devops tool builder
Hi. I have 7+ years of DevOps experience and have been building SaaS products for a while. I want to contribute to the DevOps community. Is there a tool that would help DevOps folks? I thought about incident management and auto-resolution, but some companies are already doing those, and AWS also announced its AWS DevOps Agent today. Is there any part of the daily work life of DevOps, SRE, and sysadmin roles that's often overlooked by DevOps tooling companies, with or without AI?
https://redd.it/1pcxab8
@r_devops
Switching to product based company
Question on programming languages and switching to developer role
Just a general question: in product-based companies, does the programming language need to be OOP-based, or is Go fine? Consider both interviews and regular day-to-day work.
The thing is, I have almost 15 years of experience, have never coded in my life, and recently picked up Go. I know it will take a lot of time to develop the skill set, considering I will not have practical exposure. Still, a few questions if anyone can help:
1) I know I can never match or even get an entry into MAANG/FAANG or whatever. But will there be a chance at other product companies?
I don't know how tough the struggle will be in their day-to-day work.
2) In interviews, if I choose Go with no idea about classes or OOP, will that be a reject?
3) I know at this age system design etc. is expected, but again I don't think I can understand it unless I have practical exposure. But if I am ready to accept a lower designation, will that be OK?
https://redd.it/1pd033f
@r_devops
Transparently and efficiently forward connection to container/VM via load balancer
**TLDR**: How can my load balancer efficiently and transparently forward an incoming connection to a container/VM in Linux?
**Problem**: For educational purposes, and maybe to write a patch for liburing in case some APIs are missing, I would like to learn how to implement a load balancer capable of scaling a target service from zero to hero. LB and target services are on the same physical node.
I would like for this approach to be:
* **Efficient**: as little memory copying as possible, as little CPU utilization as possible
* **Transparent**: the target service should not understand what's happening
I looked at systemd socket activation, but it seems it can only scale from 0 to 1 and does not handle further scaling. Also, the socket hand-off code felt a bit hard to follow, but maybe I'm just a noob.
**Current status**: After playing a bit I managed to do this either efficiently or transparently, but not both. I would like to do both.
The load balancer process is written in Rust and uses `io_uring`.
**Efficient approach**:
* LB binds to a socket and fires a multishot accept
* On client connection the LB performs some business logic to decide which container should handle the incoming request
* If the service is scaled to zero, it fires up the first container
* If the service is overloaded, it fires up more instances
* Pass the socket file descriptor to the container via `sendmsg`
* The container receives the FD and fires a multishot receive to handle incoming data
This approach is VERY efficient (no memory copying, very little CPU usage), but the receiving process needs to be aware of what's happening in order to receive and correctly handle the socket FD.
Let's say I want to run an arbitrary Node.js container; then this approach won't work.
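For reference, the FD hand-off in the efficient approach is plain SCM_RIGHTS ancillary data over a Unix domain socket; an io_uring version uses the same mechanism underneath. A Python sketch of the two ends (POSIX-only):

```python
import array
import socket

def send_fd(uds: socket.socket, fd: int) -> None:
    """Pass an open file descriptor to another process via SCM_RIGHTS.

    The one-byte payload is required: ancillary data cannot be sent
    without at least one byte of regular data."""
    uds.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                          array.array("i", [fd]))])

def recv_fd(uds: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = uds.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]  # a new fd number in this process
    raise RuntimeError("no file descriptor in ancillary data")
```

The kernel duplicates the descriptor into the receiving process, so the received fd number differs from the sent one but refers to the same open socket, which is exactly why the receiver must be written to expect it.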
**Transparent approach**:
* LB binds to a socket and fires a multishot accept
* On client connection the LB performs some business logic to decide which container should handle the incoming request
* If the service is scaled to zero, it fires up the first container
* If the service is overloaded, it fires up more instances
* LB connects to the container and fires a multishot receive
* Incoming data gets sent to the container via zero-copy send
This approach is less efficient because:
* The incoming container copies the data once (but this happens also in the efficient case)
* We double the number of active connections; for each connection between client and LB we have a connection between LB and service
The advantage of this approach is that the incoming service is not aware of what's happening.
**Questions**:
* What can I use to efficiently forward the connection from the LB to the container? Some kind of pipe?
* Is there a way to make the container think there is a new accept event even though the connection was already accepted and without opening a new connection between the LB and the container?
* If the connection is TCP, can I exploit the fact that both the LB and the container are on the same physical node and use some kind of lightweight protocol? For example, I could use Unix domain sockets, but then the target app would have to be aware of this, breaking transparency
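On the pipe question: Linux `splice(2)` (exposed as `os.splice` in Python 3.10+) lets the LB shuttle bytes between the client socket and the LB-to-container socket through a kernel pipe without the data ever entering userspace, which keeps the transparent approach cheap even though the connection count doubles. A sketch of one direction (a full proxy would run one relay per direction):

```python
import os

CHUNK = 1 << 16  # bytes moved per splice call

def relay(src_fd: int, dst_fd: int) -> None:
    """Copy src_fd -> dst_fd entirely in kernel space.

    splice(2) requires one end of each call to be a pipe, so we route
    socket -> pipe -> socket. Linux-only; blocks until src_fd hits EOF."""
    r, w = os.pipe()
    try:
        while True:
            n = os.splice(src_fd, w, CHUNK)    # socket -> pipe
            if n == 0:                          # EOF on the source
                break
            while n > 0:
                n -= os.splice(r, dst_fd, n)   # pipe -> socket
    finally:
        os.close(r)
        os.close(w)
```

This keeps the target container fully unaware (it just sees a normal TCP connection from the LB), at the cost of the doubled connection count noted above; io_uring offers the equivalent via `IORING_OP_SPLICE`.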
https://redd.it/1pd14dz
@r_devops
Transitioning from Software Engineer to DevOps
Hello everyone.
In recent years I have been working as a software engineer specializing in backend, and now I want to transition into DevOps.
As a developer I use a lot of the common tools such as CI/CD, Docker, and Python, but unfortunately my workday doesn't really cover all the tooling (I don't work with the cloud at all), so I have to learn everything myself through independent projects.
Moreover, there are more jobs in DevOps than in software development and they can pay better, which is one of the reasons I want to make the transition.
I use AI a lot for the topics and terms I need to know and, of course, to learn how things work.
Has anyone made this transition before?
What jobs should I aim for? I was thinking about mid-level roles.
Tips that can help?
Thank you.
https://redd.it/1pd1yi2
@r_devops
Salt Typhoon: When State-Sponsored Hackers Infiltrate Telecom Infrastructure 📡
https://instatunnel.my/blog/salt-typhoon-when-state-sponsored-hackers-infiltrate-telecom-infrastructure
https://redd.it/1pd3z0z
@r_devops
InstaTunnel
Salt Typhoon: How State-Sponsored Hackers Breached US Telecom
Explore how the Salt Typhoon cyber-espionage campaign infiltrated major U.S. telecom providers to geolocate users and intercept communications
eBPF for the Infrastructure Platform: How Modern Applications Leverage Kernel-Level Programmability
New white paper from the eBPF Foundation
https://ebpf.foundation/new-state-of-ebpf-report-explores-how-modern-infrastructure-teams-are-building-on-kernel-level-programmability/
https://redd.it/1pd516j
@r_devops
So, what do you guys think of the new AWS DevOps Agents?
According to AWS, the agent can identify, investigate, and even “resolve” incidents based on monitoring alerts, significantly reducing the number of incident responses required by an actual DevOps person.
I personally think it’s still a long shot to fully resolve incidents for larger organizations because they have resources spread across multiple clouds, on‑prem servers, and all the complexity that involves. These kinds of agents might be useful as an additional layer of monitoring by acting as a third eye on all the monitoring and observability tools an organization has.
https://aws.amazon.com/devops-agent/
Full article about the Frontier agents, which include a Developer Agent (Kiro), a Security Agent, and a DevOps Agent: https://www.aboutamazon.com/news/aws/amazon-ai-frontier-agents-autonomous-kiro?utm_source=ecsocial&utm_medium=linkedin&utm_term=36
https://redd.it/1pd481u
@r_devops
A Technical Look at Why Moving Back to a Monolith Saved $1,200/Month: Real Benchmarks
A client came to us with a backend that appeared modern but operated poorly. The system relied on 12 microservices, 12 deployments, and 12 separate repositories, which collectively created multiple points of failure. Despite the complexity, the platform was slow, expensive, and increasingly difficult to maintain. Their monthly infrastructure cost had reached $1,900, and each CI/CD cycle required 27 minutes for a single release.
After reviewing logs, traffic patterns, and operational behavior, it became clear that the architecture itself was creating the instability. We consolidated the entire system into a single, modular Node.js and Express monolith running on PM2 and Docker.
The results were immediate:
- Infrastructure cost reduced from $1,900 to $700 per month
- Latency (P95) improved from 240ms to 38ms
- CI/CD time decreased from 27 minutes to 8 minutes
- Deployment failures dropped from 6 per month to 0 or 1 per month
- Debugging time dropped from hours to minutes
The experience highlighted a recurring pattern. Microservices often address organizational scale rather than early product requirements. In this case, a well-structured monolith delivered significantly better performance, lower overhead, and greater operational stability.
This outcome reinforced a broader operational reality: complexity tends to accumulate faster than value when microservices are introduced before the system or the team genuinely requires them.
Much of the instability observed in this project stemmed not from technical limitations but from architectural decisions made prematurely. Architectural choices made too early can introduce operational burdens that ultimately hinder system resilience and efficiency.
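The client's actual stack was Node.js/Express on PM2, but the core idea above is language-agnostic: keep the former services as internal modules behind explicit interfaces inside one deployable process. A minimal sketch of that "modular monolith" shape, with invented module names for illustration:

```python
# Sketch of the "modular monolith" idea: one process, with the former
# microservices kept as in-process modules behind explicit interfaces.
# Module names here are hypothetical; the post's real stack was
# Node.js/Express, this is just an illustrative analogue.

class BillingModule:
    """Former 'billing' microservice, now an in-process module."""
    def invoice_total(self, line_items: list[float]) -> float:
        return round(sum(line_items), 2)

class OrdersModule:
    """Former 'orders' microservice; depends on billing only through
    its public interface, never its internals or its database."""
    def __init__(self, billing: BillingModule):
        self._billing = billing

    def place_order(self, line_items: list[float]) -> dict:
        # What used to be a network hop (HTTP between services) is now
        # an in-process call: no serialization, retries, or
        # partial-failure handling on this path.
        total = self._billing.invoice_total(line_items)
        return {"status": "placed", "total": total}

# One process wires everything together: a single deployment unit,
# a single CI/CD pipeline, a single thing to debug.
billing = BillingModule()
orders = OrdersModule(billing)
print(orders.place_order([19.99, 5.00]))  # → {'status': 'placed', 'total': 24.99}
```

Replacing inter-service network hops with in-process calls is plausibly where much of the latency improvement comes from, and collapsing 12 pipelines into one explains the CI/CD and deployment-failure numbers.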
https://redd.it/1pd70n9
@r_devops
Remote team laptop setup automation - we automate everything except new hire laptops
DevOps team that prides itself on automation. Everything is infrastructure as code:
- Kubernetes clusters: Terraform
- Database migrations: Automated
- CI/CD pipelines: GitHub Actions
- Monitoring: Automated alerting
- Scaling: Auto-scaling groups
- Deployments: Fully automated
New hire laptop setup: "Here's a list of 63 things to install manually, good luck!"
A new DevOps engineer started Monday. It's Friday afternoon and they're still configuring their local environment:
- Docker (with all the WSL complications)
- kubectl with multiple cluster configs
- terraform with authentication
- AWS CLI with MFA setup
- Multiple VPN clients for different environments
- IDE with company plugins
- SSH key management across services
- Local databases for development
- Language version managers
- Company security tools
We can provision entire production environments in 12 minutes but can't ship a laptop ready to work immediately?
This feels like the most obvious automation opportunity in our entire tech stack. Why are we treating developer laptop configuration like it's 2010 while everything else is cutting-edge automated infrastructure?
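One way to start closing that gap is to treat the install checklist as data and diff desired state against the machine, the same way Terraform diffs declared infrastructure against what exists. A minimal sketch, with an example manifest drawn from the tools listed above (not a real company manifest), covering only the "plan" step rather than the installs themselves:

```python
# Sketch: the "63 things to install" checklist as a declarative
# manifest, plus an idempotent plan step that reports what is already
# on PATH and what still needs installing. Tool names are examples
# from the post, not an actual manifest.
import shutil

DESIRED_TOOLS = {
    "docker": "container runtime",
    "kubectl": "Kubernetes CLI",
    "terraform": "IaC CLI",
    "aws": "AWS CLI with MFA profile",
}

def plan(desired: dict[str, str]) -> dict[str, list[str]]:
    """Diff the manifest against the machine: which tools are already
    resolvable on PATH, and which are missing. Running it twice is
    harmless -- it only inspects, never installs."""
    present = [tool for tool in desired if shutil.which(tool)]
    missing = [tool for tool in desired if not shutil.which(tool)]
    return {"present": present, "missing": missing}

if __name__ == "__main__":
    report = plan(DESIRED_TOOLS)
    for tool in report["missing"]:
        print(f"MISSING: {tool} ({DESIRED_TOOLS[tool]})")
```

In practice this diff-then-apply loop is what tools like Ansible, Nix, or an MDM-driven bootstrap script already provide; the point is that handing a new hire one command that runs it would make laptop setup look like the rest of the IaC stack.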
https://redd.it/1pda7cv
@r_devops