sporadic authentication failures occurring in exact 37-minute cycles. all diagnostics say everything is fine. im losing my mind.
yall pls help me
environment:
4 DCs running Server 2019 (2 per site, sites connected via 1Gbps MPLS)
~800 Windows 10/11 clients (22H2/23H2 mix)
Azure AD Connect for hybrid identity
all DCs are GCs, DNS integrated
functional level 2016
for the past 3 months we've been getting tickets about "random" password failures. users swear their password is correct, they retry immediately, it works. this affects maybe 5-10 users per day across both sites.
i finally got fed up and started logging everything so i pulled kerberos events (4768, 4769, 4771), correlated timestamps across all DCs and built a spreadsheet.
the failures occur in exact 37-minute cycles.
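for anyone wanting to reproduce this kind of correlation, here's a minimal python sketch, assuming the 4771 timestamps from all DCs are already merged into one list. the sample times, the `gaps_fit_period` name and the 5-second tolerance are made up for illustration and are not taken from the original spreadsheet. because only some cycles produce a visible failure, it checks that every gap is close to a whole multiple of the candidate period instead of expecting one failure per cycle.

```python
from datetime import datetime, timedelta

def gaps_fit_period(timestamps, period_s, tolerance_s=5):
    """True if every gap between consecutive events is close to a whole multiple of period_s."""
    ordered = sorted(timestamps)
    for earlier, later in zip(ordered, ordered[1:]):
        remainder = (later - earlier).total_seconds() % period_s
        # distance from this gap to the nearest multiple of the period
        if min(remainder, period_s - remainder) > tolerance_s:
            return False
    return True

# made-up failure times: cycles 0, 1, 4, 5 and 11 of a 37-minute period, with a few seconds of jitter
start = datetime(2024, 1, 8, 9, 0, 0)
failures = [
    start,
    start + timedelta(minutes=37, seconds=2),
    start + timedelta(minutes=148, seconds=-1),
    start + timedelta(minutes=185),
    start + timedelta(minutes=407, seconds=3),
]

print(gaps_fit_period(failures, 37 * 60))  # candidate period: 37 minutes
print(gaps_fit_period(failures, 30 * 60))  # candidate period: 30 minutes
```

a period that fits every gap is only a hint, not proof. a shorter period that divides the real one will fit too, so it's worth trying several candidates.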
here's what i've ruled out:
time sync: all DCs within 2ms of each other, w32tm shows healthy sync to stratum 2 NTP
replication: repadmin /showrepl clean, repadmin /replsum shows <15 second latency
kerberos policy: default domain policy, 10 hour TGT, 7 day renewal, 600 min service ticket (standard)
DNS: forward/reverse clean, scavenging configured properly, no stale records
DC locator: nltest /dsgetdc returns correct DC every time
secure channel: Test-ComputerSecureChannel passes on affected machines
clock skew: checked every affected workstation, all within tolerance
GPO processing: gpresult shows clean processing, no CSE failures
37 minutes doesn't match anything i can find:
not kerberos TGT lifetime (10 hours = 600 minutes)
not service ticket lifetime (600 minutes)
not GPO refresh (90-120 minutes with random offset)
not machine account password rotation check (ScavengeInterval = 15 minutes by default)
not the netlogon scavenger thread (900 seconds = 15 minutes)
not OCSP/CRL cache refresh (varies by cert)
not any known windows timer i can find documentation for
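a quick sanity check of that list, as a sketch: it tests whether 37 minutes is a whole multiple or divisor of any of the timers named above. the values come straight from the list (GPO refresh uses its 90-minute base and ignores the random offset), and the `related` helper name is hypothetical.

```python
OBSERVED_MIN = 37

# timer values in minutes, as listed above
known_timers_min = {
    "kerberos TGT lifetime": 600,
    "service ticket lifetime": 600,
    "GPO refresh (base interval)": 90,
    "netlogon scavenger": 15,
}

def related(observed, timer):
    """True if one interval is a whole multiple of the other."""
    big, small = max(observed, timer), min(observed, timer)
    return big % small == 0

matches = [name for name, minutes in known_timers_min.items()
           if related(OBSERVED_MIN, minutes)]
print(matches)
```

37 is prime, so it can only line up with a timer that is itself a multiple of 37 minutes, and none of these are.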
the pattern started the exact day we added DC04 to the environment. i thought okay, something's wrong with DC04. i decommed it, migrated FSMO roles away, demoted it, removed DNS records, cleaned up AD metadata...the 37-minute cycle continued.
i'm three months into this. i've run packet captures, and wireshark shows normal kerberos exchanges. the failure events just happen, and then don't happen, in a perfect 37-minute oscillation.
microsoft premier support escalated to the backend team twice. first response was "have you tried rebooting the DCs?" second response hasn't come in 6 weeks.
at this point i'm considering:
1. the universe is broken
2. i'm in a simulation and the devs are testing my sanity
3. there's some timer or scheduled task somewhere i haven't found
4. something in our environment is doing something every 37 minutes that affects auth
has anyone seen anything like this? any obscure windows timer that runs at 37-minute intervals? third party software that might do this?
i will pay money at this point srs not joking.
EDIT: SOLVEDDDDDDD
it was SolarWinds.
after someone mentioned backup infrastructure, i went down the storage rabbit hole. correlated Pure snapshot times against my failure timestamps - close but not exact. 7-minute offset wasn't consistent enough but it got me thinking about what ELSE runs on schedules that i don't control.
our monitoring team (separate group, different building, we don't talk much) uses SolarWinds SAM. i asked them to pull the probe schedules. there's an "Active Directory Authentication Monitor" probe. it performs a real LDAP bind + kerberos auth test against a service account to verify AD is responding.
the probe runs every 37 minutes. why 37 minutes? because years ago some admin set it to 2220 seconds thinking that's roughly every half hour but offset so it doesn't collide with our other probes. nobody documented it and that admin left in 2019.
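the arithmetic, as a small sketch. the 2220-second setting is from the probe config described above. the 30-minute schedule for the other probes is purely an assumption for illustration, since their real intervals aren't stated.

```python
from math import gcd

PROBE_INTERVAL_S = 2220      # from the probe config described above
OTHER_PROBE_MIN = 30         # assumption: other probes run every half hour

probe_min, leftover_s = divmod(PROBE_INTERVAL_S, 60)
print(probe_min, leftover_s)

# minutes until both schedules fire together again (least common multiple)
realign_min = probe_min * OTHER_PROBE_MIN // gcd(probe_min, OTHER_PROBE_MIN)
print(realign_min, realign_min / 60)
```

with those assumed numbers the two schedules only line up once every 1110 minutes (18.5 hours), which is presumably the kind of offset that admin was going for.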
why did it start when DC04 was added? because DC04's IP got added to the probe's target list automatically via their autodiscovery. the probe was already running against DC01-03 but the auth requests were being load balanced and the brief lock wasn't noticeable. adding a fourth target changed the timing juuust enough that the probe's auth attempt started colliding with real user auth attempts on the same DC at the same millisecond.
why did it persist after DC04 removal? because the probe targets were never cleaned up. it was still trying to auth against DC04's old IP, timing out, then immediately hitting another DC - which shifted the timing window but kept the 37-minute cycle.
disabled the probe. cycle stopped immediately. haven't had a single 4771 in 72 hours. i just mass-deployed kerberos debug logging, built correlation spreadsheets, spent hours in wireshark, and mass-ticketed microsoft premier support twice to resolve a problem caused by a misconfigured monitoring checkbox.
this job is a meme.
thanks everyone for the suggestions - especially the lateral thinking about backup/storage timing. that's what got me looking at things that run on schedules that aren't mine.
https://redd.it/1r4b9qe
@r_systemadmin
Reddit
From the sysadmin community on Reddit
Explore this post and more from the sysadmin community
our 'ai transformation' cost seven figures and delivered a chatgpt wrapper
six months of consulting, workshops, a 47 page roadmap deck. the first deliverable just landed on our desks for testing.
it's chatgpt with our company logo. literally a system prompt that says 'you are a helpful assistant for [company name]'. same hallucinations, same limitations, except now it confidently makes up internal policies that don't exist and everyone in leadership thinks the issue is that we need to 'prompt engineer better'.
the consultants are already pitching phase two.
https://redd.it/1r3wgjt
@r_systemadmin
"Best" printer manufacturer
Which printer manufacturer have you had the best experiences with for use in your company?
https://redd.it/1r4gr7w
@r_systemadmin
ASUS shut down their support portal in Germany and Austria
This is just terrible imo. A court in Munich ruled that ASUS violated Nokia's patents, and now their support portal is inaccessible. Should have saved all drivers for company equipment when I had the chance. I need drivers for a few boards and there's just no way to grab them directly from ASUS (except via VPN, which would be a last resort).
One thing left to say: WTF.
https://redd.it/1r5bd3a
@r_systemadmin
pstop: terminal based system monitor for Windows (htop clone with tree view, process kill, I/O monitoring)
Built a terminal system monitor for Windows that works like htop on Linux.
Why:
Task Manager is fine for GUI, but if you manage Windows servers or spend time in the terminal, having htop available makes life simpler. pstop runs in any terminal with ANSI support.
Install:
cargo install pstop
What it does:
- Per core CPU monitoring with usage bars
- Memory/Swap/Network bars
- Process table with sort by any column
- Tree view (process hierarchy)
- I/O tab (disk read/write rates per process)
- Network tab
- Kill process (F9), priority (F7/F8), CPU affinity
- Search (F3), filter (F4)
- Persistent config
- ~1 MB single binary, zero dependencies
Single ~1 MB binary. No installer. No runtime dependencies. Just run it.
GitHub: https://github.com/marlocarlo/pstop
https://redd.it/1r5evtz
@r_systemadmin
MDU Routers
Anyone out there doing MDU setups? Currently doing this for several properties using Ruckus APs, Ruckus SmartZone and an off-prem Windows DHCP server. It's time to move away from this setup, and I'm curious what a recommendation might be for handling up to 100 VLANs per site and a DHCP server per subnet (just handing out about 30 hosts per VLAN).
And no, please don’t mention Nomadix.
Edit: Added clarity on the DHCP servers.
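For sizing, here is a rough sketch of the address math using Python's standard `ipaddress` module. The `10.20.0.0/20` site block is a hypothetical example, not from the actual deployment. A /27 gives 30 usable addresses, which matches the roughly 30 hosts per VLAN, though the gateway will take one of them.

```python
import ipaddress

# hypothetical per-site address block (assumption, not from the post)
site_block = ipaddress.ip_network("10.20.0.0/20")

# carve the block into /27 subnets, one per VLAN
vlan_subnets = list(site_block.subnets(new_prefix=27))
usable_per_vlan = vlan_subnets[0].num_addresses - 2  # minus network and broadcast

print(len(vlan_subnets))   # how many /27s fit in the /20
print(usable_per_vlan)     # usable host addresses per VLAN
print(vlan_subnets[99])    # subnet for the 100th VLAN
```

A /20 per site leaves some spare /27s beyond the 100 VLANs. If 30 is a hard minimum for client leases, a /26 per VLAN (and a larger site block) would leave more headroom.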
https://redd.it/1r5hpsf
@r_systemadmin
Ivanti Application Control Agent and Autopilot
Does anyone have the Ivanti Application Control Agent deploying successfully during Autopilot? I hope it's not just me, but due to its tight integration with AppSense I keep getting permissions errors when it tries to start the service during install. It only happens on my Autopilot devices, and it's consistent across different versions. I don't have the issue with any of my devices that were deployed via SCCM, so I suspect it's either something in my configuration profiles / scripts or an Autopilot nuance...
https://redd.it/1r5bzzo
@r_systemadmin
Microsoft Purview. What sort of labels did you guys start with?
Hi Everyone.
Hope all is well.
We are starting our implementation of data governance and I'm starting to look at which labels to begin with.
Looking at the documentation and other reading, it says to start with a baseline:
Public
Internal
Confidential
Highly Confidential
But the Microsoft documentation also mentions scoping one label for files/email and a separate one for things like 365 sites and SharePoint sites.
Is this the right approach, based on your past experience?
This is a food manufacturing company that I'm currently working with, and I just want to start with some labels people can understand and apply. Not everyone working there is going to be super technical.
https://redd.it/1r5lsbm
@r_systemadmin
How do you manage user accounts with third party sites if they don't have SSO?
Trying to find a good way to manage user accounts on work-related third-party sites, especially deactivating them when people leave.
https://redd.it/1r5nu7b
@r_systemadmin
Is ServiceNow really this inconvenient to use for everyone, or is it just our implementation?
I don't know if it's just our implementation of ServiceNow that's so annoying and cumbersome, or if everyone's is about the same. It often complicates trivial things.
Here are some small examples that piss me off:
- Made a change to incident 1 and hit 'save'? It automatically moves on to some other random incident 2, as if you're done working on incident 1 because you left one comment on it.
- Need to put in a request of some sort? You get a REQ number, then a RITM number, and then an SCTASK number. So you have 3 different ticket numbers to describe ONE thing you want done. That one thing is often a single line ask, but it generates 3x paperwork. People also give me CS numbers and I need to convert them into INCs to assign to self and work them.
- Adding multiple configuration items to a ticket of different categories = excessive amount of clicking and fumbling.
- Can't search for strings. Well, you can search - it's the finding of the results that doesn't work as expected.
- A CHG request that has child SCTASK doesn't inherit the CIs from the CHG, you gotta enter them again manually.
- No easy batch-assignment of tickets in the queue to a specific person/team. No batch status-changes. I don't know if you ever clicked on 30 tickets one by one, and set them as a child of ticket X, but it's not fun.
- So slow. Refreshes itself without me asking. Slowly.
***
I can't help thinking, employees are a captive audience - they have to use whatever you give them. They're paid to. But if this was a customer-facing tool, people would not want to touch it. I can't imagine any web interface I use on my private time that looks and acts like this.
I know you want to say, "be the change you want to see in the world". I have no admin access to anything on ServiceNow, definitely no API key, I'm just a peon in this context. I don't even have admin access to my own laptop, sadly. Local PowerShell scripts and browser plugins are blocked too, so I can't do much.
https://redd.it/1r61ngu
@r_systemadmin
Why Are People Like This?
Just got assigned to a security review of a client we are on-boarding with several hundred users.
Ran a quick check on AD passwords and found that for the entire organization there are only a handful of different passwords shared between users.
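A check like this can be sketched as grouping accounts by password hash. This is only an illustration: the `shared_hash_groups` helper and the sample data are hypothetical, and it assumes an authorized audit export that maps each username to its hash.

```python
from collections import Counter

def shared_hash_groups(hash_by_user):
    """Return {hash: account_count} for every hash used by more than one account."""
    counts = Counter(hash_by_user.values())
    return {h: n for h, n in counts.items() if n > 1}

# placeholder data standing in for an authorized audit export
audit = {
    "alice": "hashA",
    "bob": "hashA",
    "carol": "hashB",
    "dave": "hashA",
}

print(shared_hash_groups(audit))
```

Identical hashes mean identical passwords only when the hashes are unsalted, so whether this works depends on what the export contains.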
Looking into it further, IT was giving new users passwords in the format "CompanynameYear!" So like "Microsoft2023!" along with instructions to change their password immediately and how to do so (which is already bad, but it's not abjectly awful at least, or so I thought...)
In the entire company, fewer than 10 people ever changed their password. So we had users that were on "Companyname2017!", since 2017.
With the right usernames, this password would give access remotely via VPN to everything the company has. It's a miracle they've survived this long.
So I held an emergency Zoom meeting with the execs saying that before we go any further, EVERYONE needs to change their passwords immediately. And I got pushback saying it will be far too disruptive to operations and many staff won't want to have to remember a new password.
I ended the Zoom meeting and told the account manager (from my company) that I'm not trained in managing psychosis so it's on him now.
Why do people want their lives and company ruined so badly? Why do they hate themselves and any hope of their own survival and success so much that they want to sabotage it at every opportunity? Do MSPs need to start hiring mental health professionals to counsel their clients as a first step before working on the actual IT?!
Edit:
I am actually genuinely curious what people think of my last comment. Should MSPs actually have mental health officers (obviously under a different name so as not to offend clients), whose job is to pave the way for technicians? I feel like I'm creating a dual class D&D character here, the Technician/Psychologist, someone who can go in and handle the mental health crisis first, and then move onto the technical duties.
https://redd.it/1r691da
@r_systemadmin
I've run Docker Swarm in production for 10 years. $166/year. 24 containers. Two continents. Zero crashes. Here's why I never migrated to Kubernetes.
Every week on Reddit someone asks about Docker Swarm and the responses are always the same: "Swarm is dead." "Just use K8s." "Nobody runs Swarm in production."
I've run Swarm in production for a decade. Not a toy setup — multi-node clusters, manager redundancy, 4-6 replicas per service, rolling deployments in batches of two with automatic rollback on healthcheck failure. Zero customer downtime. Over the years I optimized the architecture down to 24 containers across two continents on $166/year total infrastructure.
I finally wrote the article I wish existed when I made my choice ten years ago. 7,400 words. Real production numbers. Working code. No affiliate links. No "it depends" cop-out.
**What's in it:**
* Side-by-side YAML comparison: 27 lines (Compose) → 42 lines (Swarm) → 170+ lines (K8s) for the same app
* Healthcheck comparison table testing 6 failure scenarios — K8s wins 2 out of 6
* A working 150-line autoscaler that's actually smarter than K8s HPA (adaptive polling vs fixed 15s intervals)
* Cost breakdown: $166/year vs $1,584-2,304/year minimum for EKS
* CAST AI 2024 data: 87% idle CPU, 68% of pods overprovisioned 3-8x, $50-500K annual waste per cluster
* Why your Node.js containers are 7x bigger than they need to be and how that drives false demand for autoscaling
* Why you should never expose Node.js directly to the internet (and what to do instead)
The only feature K8s genuinely has that Swarm lacks is autoscaling, and Datadog's own 2023 report shows only ~50% of K8s organizations even use HPA. So half the industry is paying the full complexity tax for a feature they don't use.
Not saying K8s is bad. It's an incredible system for the 1% who need it. But the data shows 99% don't — they're paying 10-100x more for capabilities they never touch while 87% of their CPU does nothing.
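As a sanity check on the multiplier, here is the arithmetic using only the two cost figures quoted in the list above. The variable names are illustrative.

```python
# annual cost figures as quoted in the list above (USD)
swarm_cost = 166
eks_min_low, eks_min_high = 1584, 2304

ratio_low = eks_min_low / swarm_cost
ratio_high = eks_min_high / swarm_cost

print(f"{ratio_low:.1f}x to {ratio_high:.1f}x")
```

On these figures the EKS minimum works out to roughly 9.5x to 13.9x the Swarm cost, which sits at the low end of the 10-100x range.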
[Read Full Web Article Here](https://thedecipherist.com/articles/docker_swarm_vs_kubernetes/?utm_source=reddit&utm_medium=post&utm_campaign=docker-swarm-vs-kubernetes&utm_content=launch-post&utm_term=r-sysadmin)
Happy to answer any questions. I've been running this setup since before K8s hit 1.0.
https://redd.it/1r6i84a
@r_systemadmin
PSA: Develop a healthy suspicion of your fellow /r/sysadmin
Mods, if you don't sticky this, please sticky something. The problem is only going to get worse.
I think most people are aware of the recent bot that posted a hit piece on a developer that rejected its pull request. If you aren't, here's the story: https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
I don't think the majority of people here have really internalized that though. It's a story that you heard, that happened in a place that's not here, to a person that's not you. This isn't the case though, and it's only going to get worse. We know bots are starting to act as their own agents, but most haven't seen it in real time yet.
An AI agent (a bot) posted a story about their docker setup earlier today. They detailed their costs, uptime, CPU usage, etc. and included a "full article" on the setup on their blog. People were thanking them for backing up their choices with real numbers and cost breakdowns, discussing with them how their project does or does not scale well, talking about the pros and cons. The bot was responding in kind with (as far as my DFIR ass can conclude) real enough terminology to be taken somewhat seriously by a fair number. I don't really blame them, [people have always lied on the internet](https://xkcd.com/386/), and now LLM's can lie realistically. Nor do I blame them for not wanting to think critically about every social media post. There's no sarcasm there, we cannot think critically about every moment in life, and all things considered, Reddit is probably one of the first places you might as well turn off critical thinking.
I do think it's worth starting to train yourself to look twice at things though. Even if this isn't something you would actually implement at work, it's only going to get worse. It won't be long, if it hasn't happened already, where bots are posting real-enough looking articles on how to configure active directory or network stacks. I guess that's why I felt the need to write this. For some reason it does bother me that I have to be skeptical if any of you are actually human. It doesn't bother me in any "keeps me up at night" sense, and I didn't trust the lot of you to begin with. It's just... a bit sad that we've reached this point.
The things below are kind of what I noticed as odd, starting with the writing style and em dashes. If something feels a little funny, dig deeper (or just ignore it, it's the internet). Someone might naturally have an odd writing style, but be skeptical and look for several flags to all pop up. These things will change, people will instruct their bots not to use em dashes, or to avoid certain language. [Wikipedia also has a good list](https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing) going. All total it was.. 5, maybe 10 minutes to go through everything here, it doesn't take a ton of work.
* em dashes*, and really any other type of special character. The post in question also used →, how many people actually find the alt code to type that vs -> ? Could be a human copy/pasted special characters from somewhere, just start to look closer when you see them.
* Odd writing styles. This bot used a lot of short 2-3 word sentences to make a point, e.g. "7,400 words. Real production numbers. Working code. No affiliate links. No "it depends" cop-out.". Short. Punchy sentences. That emphasize. Their point.
* Self-aggrandizing. The site they linked to had a 3,200 word life story about what a misunderstood genius they were. It was the type of egotistical self inflating thing only an AI glazing itself could write.
* Account/site/profile age. The DNS records showed the domain was registered two months ago, at the same time as the Reddit account was created. The twitter account was 1 month old. Wayback Machine had it's first scrape just 5 days ago.
* Content amount for its age. New site is one thing, but this one had 5 articles up, 10 projects, resume, music and lifestyle posts. Just too much content in too short a time for a human to create.
* Post frequency. Pretty much the same as amount of content. I didn't bother to count, but I spun the scrollwheel a good bit and only made it to "4 hours ago" on his post history. I'd guess a post/minute or more. And yea, that's not crazy for everyone, but most people don't keep it up for hours and hours.
* Advertisements, but subtle ones. The site had a banner for an AI company at the top, which is really odd because between DNS ad-blocking and browser blocking, I don't see many. For it to be displayed, it almost certainly didn't come from an advertising agency like Google. Sure enough, the images had a relative path to the site. No company is going to pay for a custom ad on a 2 month old site, and I don't know of any sites that would self host the advertisers images. For one thing, the advertiser probably wants to host that image themselves to track impressions, which probably means that company created the site...
* Gaslights when called out. I don't know why this is a thing, but just like the Github bot, this one immediately made several posts and even started new subreddits on how insane the gatekeeping is on <subreddit>. Tons of details on how many orange arrows their post got, what the percentage was, the number of comments, the website impressions, etc. How unfair it was that they got banned for their first post, how confused they were about why, "what this says about reddit mods", how I must be friends with them, etc. etc.
Pass this on to your coworkers and other subs you follow. I'd say something like "report them all so they don't gain ground", but honestly Reddit mods aren't going to win this one. Without some action on the part of Reddit or the greater internet, places are going to get swamped.
\* em dashes, for those that don't know, are the longer version of the.. regular dash I guess? "Hyphen-Minus" technically. - vs — They are grammatically correct so tend to be used by AI, but don't appear naturally on US keyboards (not sure about others) so most people don't actually type them on sites like Reddit.
</psa>
Edit: The number of people that think this is what AI writing looks like perfectly proves my point that half of ya'll aren't actually capable of figuring out what AI writing looks like. To pick apart my own trash:
* Second bullet point, towards the end should be "emphasizes"
* Third bullet point, should be self-inflating
* Fourth bullet point, "its" not "it's".
* Sixth bullet point, scroll wheel is two words.
* Seventh bullet point, 'self-host', hyphenated word. Also advertiser's, I think, it's possessive right?
* Eighth bullet point, GitHub, the H is capital as well
That's just what I noticed right away. Do ya'll really think an AI even reviewed this, much less wrote it?
Edit 2: At least four people have commented that em dashes don't mean AI. No, they don't, but it's one sign because roughly nobody is typing their reply in Word and correcting the grammar before pasting it into a Reddit post. Still, there are people that might, which is why it's not 100% proof. It's just a signal to start looking a bit closer and seeing if anything else is odd. Some people just write different. Some people write 8 paragraphs about watching for AI slop on Monday night. A single thing doesn't mean AI, several things might not even mean AI. When everything says AI though, it's probably AI.
https://redd.it/1r6s2cq
@r_systemadmin
Reddit
From the sysadmin community on Reddit
Explore this post and more from the sysadmin community
Weekly 'I made a useful thing' Thread - February 20, 2026
There is a great deal of user-generated content out there, from scripts and software to tutorials and videos, but we've generally tried to keep that off of the front page due to the volume and as a result of community feedback. There's also a great deal of content out there that violates our advertising/promotion rule, from scripts and software to tutorials and videos.
We have received a number of requests for exemptions to the rule, and rather than allowing the front page to get consumed, we thought we'd try a weekly thread that allows for that kind of content. We don't have a catchy name for it yet, so please let us know if you have any ideas!
In this thread, feel free to show us your pet project, YouTube videos, blog posts, or whatever else you may have and share it with the community. Commercial advertisements, affiliate links, or links that appear to be monetization-grabs will still be removed.
https://redd.it/1r9rdvf
@r_systemadmin
"My husband who works in IT says..."
Anyone else get this gem occasionally?
https://redd.it/1ra3lo0
@r_systemadmin
You're in charge now!
Oh you identified a huge knowledge gap in the company? Oh you took the chance and wrote out a kb for it to benefit the company?
Great!
You are now the be all and end all SME for this FOREVER!
Never mind adding it to the team's general knowledge to spread the love of shared responsibility for general information!
**********************************
\^When did this become the norm? This results in employees not writing up documentation for fear of becoming the "auto-SME". It used to be that when you wrote up something that was needed, it was essentially checked out for the entire team. And yes, if there was an SME, they were listed as a point of contact, etc.
Information is never collected
Every major issue is a circus of figuring out who, what, where, when, and why
End of the day the Helpdesk gets chastised, the Admins end up with hot potato issues, software teams are vacant and lost, and ultimately the Supervisors, Managers, Directors, and Executives get the heat they could have prevented in the first place. I call it the Servicenowification of I.T. Horrible system.
https://redd.it/1ra5bs6
@r_systemadmin
people’s carelessness
What happened to me today—I have to write it down. About people’s carelessness, or incompetence, or I don’t even know what.
Because of a snowstorm we had severe problems with electricity today at our replica DC. So long story short...
In the past year, we invested a large amount of money into the server room with equipment at the replica DC site.
Separate battery systems – UPS units – plus a generator and new automatic transfer switches in case of power outages.
So basically… a system built for IT to survive any kind of power failure. But all the technology in the world doesn’t help when you notice that the diesel tank is only about 50% full. You order the maintenance staff to refill it… and guess what—this maintenance guy goes and pours the fuel into the coolant tank. The generator becomes unusable. I might as well have shut it off. Calling the service technician, etc.
The result? Panic shutdown of all systems and migrating services to another location. Because the battery systems only last about 30 minutes.
The moral of the story… you can have the smartest and most advanced systems, but all it takes is one idiot to cause problems.
https://redd.it/1ra413n
@r_systemadmin
Is “skill issue vs will issue” a common management mindset?
Something a former manager used to say has been on my mind lately.
Whenever we gave feedback about new hires a few months into production, he’d ask one simple question: “Is this a skill issue or a will issue?”
His view was:
If it’s skill — we train, mentor, and give more time. We’ve already invested in the person, so the focus is helping them grow.
If it’s will — there’s only so much you can do, because ownership and drive have to come from the individual.
At the time, it honestly didn’t make much sense to me. My first reaction was: why even differentiate like that?
But looking back now, it feels like a very practical way to decide whether someone needs support or accountability.
Is this how most managers think when evaluating people? Or is this too simplistic compared to how things actually work in teams?
https://redd.it/1rakgiy
@r_systemadmin
I made a 90's JRPG-style animated series about helpdesk horror stories I deal with regularly because therapy is too expensive
Real tickets I've gotten that inspired a short animated series:
\- "The Wi-Fi is down" - router was unplugged
\- "My mouse stopped working" - dead batteries
\- "Nobody can hear me on the call" - was on mute
\- "My laptop is SO slow" - 127 browser tabs open
\- "I can't log in" - typed email in the username field
Every. Single. Time.
I got tired of explaining helpdesk life to people who don't get it, so I started animating them. 90's JRPG style - flat colors, thick outlines, 2D characters that look like they should be saving the world but are instead explaining to Gerry why his mouse needs batteries.
Under 35 seconds each. No voiceover - just captions and the pain we all share.
If anyone's curious it's called IT Panic Room.
https://redd.it/1rafzcd
@r_systemadmin