Dealing with fake traffic on NGINX instance
Hi, didn't know what subreddit to use for this; hopefully, there will be people with relatable experience here.
My nginx instance (reverse-proxying multiple services) was recently hit with a flood of, idk, DDoS attacks? Doesn't make a lot of sense, because my stuff is irrelevant to anybody, but it did cause CPU usage alarms on otherwise calm VPSs. I played with fail2ban, added some filters, and the biggest offenders are now banned.
However, it caused me to look closer at my access.log, and I still don't like what I'm seeing. Requests every 1-2 seconds on average, the IPs are always different and come from all over the world, and they clearly show signs of scraping. Is there a way to get rid of that? I have limit_req set up (but it's tricky: in testing, I haven't been able to distinguish between wget -r and a user hitting F5 a few times, so I'd like to drop it), and User-Agent filtering, but as you can see, these are legit-looking User-Agents:
2025-10-13T02:06:48+00:00 - 200 - 14.188.178.49 - GET /config-links/commit/test/unit/dest/1.txt?follow=1 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
2025-10-13T02:06:49+00:00 - 200 - 66.249.79.206 - GET /cmake-common/tree/project/__init__.py?id=8534a341eba07fba8fe3a3eadfbe0e9be2072065 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:06:49+00:00 - 200 - 201.69.206.43 - GET /math-server/plain/test/benchmarks/lexer.cpp?id=6aac08009254909aab3e0359f3ad7ab4e87a91e9 HTTP/1.1 - Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36
2025-10-13T02:06:50+00:00 - 200 - 45.175.114.54 - GET /windows-home/diff/%25APPDATA%25/ghc/ghci.conf?follow=1&id=e87414387fe6060b81955b31376136ca1cb8a8eb HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36
2025-10-13T02:06:51+00:00 - 200 - 45.234.17.16 - GET /maintenance/tree/inventory.ini?h=old&id=c3af9ee6eafe56c4be78bf6c356c789255d27a08 HTTP/1.1 - Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
2025-10-13T02:06:54+00:00 - 200 - 66.249.79.206 - GET /winapi-common/log/?id=3a75e40fa6d92cea4b908fe537831219186cd0f0 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:06:54+00:00 - 200 - 14.175.66.1 - GET /cmake-common/log/examples?follow=1&h=v0.1&id=795dd9e87e44d1c49f160cd003cdde4113ee8247 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36
2025-10-13T02:06:57+00:00 - 200 - 14.191.94.42 - GET /config-links/log/Makefile?follow=1&h=debian&id=51d1d3010aeadf2bd9da82aaa549bd7a6f2632ed HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36
2025-10-13T02:07:03+00:00 - 200 - 191.219.191.160 - GET /blog/diff/Gemfile?id=59114a1dfa1c71c285443b183a61e9639fb4edff HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/89.0.4389.72 Safari/537.36
2025-10-13T02:07:10+00:00 - 200 - 45.187.141.12 - GET /linux-home/diff/.minttyrc?h=macos&id=0778b117c0f5949dc65340185cc35d0b1db560d9 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36
2025-10-13T02:07:11+00:00 - 200 - 113.176.179.2 - GET /jekyll-docker/log/?id=7d1824a5fac0ed483bc49209bbd89f564a7bcefe HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/88.0.4324.96 Safari/537.36
2025-10-13T02:07:12+00:00 - 301 - 149.100.11.243 - GET / HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
2025-10-13T02:07:14+00:00 - 200 - 190.12.104.161 - GET /cmake-common/plain/.clang-format?h=v3.2&id=0282c2b54f79fa9063e03443369adfe1bc331eaf HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36
2025-10-13T02:07:16+00:00 - 200 - 179.222.178.65 - GET /cmake-common/commit/toolchains/boost?h=v3.4&id=37b051e99fc6b0706f5dc4b2f01dbbbb9b96355a HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/79.0.3945.88 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 66.249.79.193 - GET /cgitize/diff/?h=v2.1.0&id=8d2422274ae948f7412b6960597f5de91f3d8830 HTTP/1.1 - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.7339.207 Mobile Safari/537.36 (compatible; GoogleOther)
2025-10-13T02:07:17+00:00 - 200 - 179.49.32.156 - GET /config-links/diff/debian/changelog?h=debian%2Fv2.0.3-5&id=0a4df2ead72546cca8328581b1b41b172b83e769 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36
2025-10-13T02:07:17+00:00 - 200 - 14.231.40.70 - GET /vk-noscripts/commit/vk/utils?h=v1.0.1&id=ee7a170df79287aac3bccfead716377ec8600c5c HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36
2025-10-13T02:07:18+00:00 - 200 - 113.177.166.37 - GET /wireguard-config/plain/.ruby-version?id=ab97b021462809453a38b4f6b87944acd00d51b9 HTTP/1.1 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/84.0.4147.125 Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 177.141.68.37 - GET /infra-terraform/log/.gitattributes?follow=1&h=v1.2.0&id=78dd4f3cc9d408df69fac270860b283e310fe379 HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4950.0 Iron Safari/537.36
2025-10-13T02:07:19+00:00 - 200 - 124.243.188.173 - GET /sorting-algorithms/commit/Gemfile?h=migration&id=9b3e6d409340369a6b450e997723f773f0aa3505&follow=1 HTTP/2.0 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47
(The log format I use is customized, I don't like the default one. Google bot is fine.) Any tips? Like, set up a reCAPTCHA or something?
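On the limit_req question, one hedged sketch (the zone name, rate, and burst values below are illustrative, not tested against this setup): a burst allowance is usually what separates a user mashing F5 (a short spike, then idle) from wget -r (a sustained rate), because only sustained traffic exhausts the burst and starts getting 429s.

```nginx
# Illustrative only: per-IP sustained limit with a burst for human refreshes.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=30r/m;

server {
    location / {
        # ~1 request per 2s sustained; up to 20 extra requests absorbed
        # immediately (nodelay) before clients start seeing 429s.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        # ... existing proxy_pass config ...
    }
}
```

Distributed scrapers rotating IPs will still slip under any per-IP limit, though, which is why people reach for a proof-of-work or CAPTCHA layer in front of crawl-heavy endpoints.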
https://redd.it/1o57jqx
@r_devops
Ever heard of KubeCraft?
I was looking for resources and saw someone on this sub mention it. $3500 for a 1 year bootcamp? I’m skeptical because I can’t find many reviews on it.
For some additional background: I currently work in cyber (OT Risk Management with some AWS Vuln management responsibilities) and I’m looking to make the transition into a cloud engineering role. My company gives us an L&D stipend, and so far I’ve used it to get Adrian Cantrill's AWS SAA course and an annual subscription to KodeKloud. I’ve still got a good amount left and was going to use it for Nana's DevOps course and homelab equipment.
https://redd.it/1o570k3
@r_devops
Is cost a metric you care about?
Trying to figure out whether DevOps or software engineers should care about building efficient software (AI or not), optimized both for scalability/performance and for cost.
It seems that in the age of AI we're myopically looking at increasing output, not even outcome. Think about it: productivity - let's assume you increase that, you have a way to measure it and decide: yes, it's up. Is anyone looking at costs as well, just to put things into perspective?
Or the predominant mindset of companies is: cost is a “tomorrow” problem, let’s get growth first?
When does a cost become a problem and who’s solving it?
🙏🙇
https://redd.it/1o51juz
@r_devops
Simplifying OpenTelemetry pipelines in Kubernetes
During a production incident last year, a client’s payment system failed and all the standard tools were open. Grafana showed CPU spikes, CloudWatch logs were scattered, and Jaeger displayed dozens of similar traces. Twenty minutes in, no one could answer the basic question: which trace is the actual failing request?
I suggested moving beyond dashboards and metrics to real observability with OpenTelemetry. We built a unified pipeline that connects metrics, logs, and traces through shared context.
The OpenTelemetry Collector enriches every signal with Kubernetes metadata such as pod, namespace, and team, and injects the same trace context across all data. With that setup, you can click from an alert to the related logs, then to the exact trace that failed, all inside Grafana.
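For a concrete shape of that enrichment, here is a minimal, hedged sketch of a Collector config (receiver/exporter names and the pipeline are illustrative; the `team` mapping assumes pods carry a `team` label):

```yaml
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
      labels:
        - tag_name: team     # assumes pods are labeled "team"
          key: team
          from: pod
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```

With the same attributes stamped onto metrics and logs pipelines as well, the Grafana-side correlation described above becomes a matter of matching on those shared keys.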
The full post covers how we deployed the Operator, configured DaemonSet agents and a gateway Collector, set up tail-based sampling, and enabled cross-navigation in Grafana: OpenTelemetry Kubernetes Pipeline
If you are helping teams migrate from kube-prometheus-stack or dealing with disconnected telemetry, OpenTelemetry provides a cleaner path. How are you approaching observability correlation in Kubernetes?
https://redd.it/1o5c3bk
@r_devops
Are self-destructing secrets a good approach to authenticating a self-hosted GitHub Actions runner securely?
I created a custom self-hosted Oracle Linux-based GitHub runner Docker image. The entrypoint script supports 3 ways of authentication:
* short-lived registration token from webui
* PAT token
* github application auth -> .pem key + installation ID + app ID
Now, the first option is pretty safe to use even as a container env var because it's short-lived. I'm more concerned about the other two. My main gripe is that the container user which runs the GitHub connection service is the same user used for running pipelines. So anyone who uses pipelines can use them to read the .pem or PAT. Yes, you could use GitHub secrets to "obfuscate" the strings, but you have to always remember to do it, and there are other ways to extract them anyway.
I created a self-destructing secrets mechanism: Docker mounts a local folder as a volume (it has to have full RW permissions on it), and you can place private-key.pem or pat.token files there. When the entrypoint.sh script runs, it uses either of them to authenticate the runner, clears this folder, and then starts the main service. If it can't delete the files, it will not start.
But I feel this is something that's already been solved another way. Even though I couldn't find info on how to use two different users (one for runner authentication, one for pipelines), this security flaw feels large enough that there has to be a better (and more appropriate) way to do it.
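For concreteness, a minimal sketch of the self-destructing-secrets entrypoint described above (SECRETS_DIR and the echo lines are placeholders; a real script would run the actions-runner registration step where the placeholders are):

```shell
#!/bin/sh
# Sketch only: paths and auth commands are hypothetical stand-ins.
set -eu
SECRETS_DIR="${SECRETS_DIR:-/run/runner-secrets}"

authenticate() {
  if [ -f "$SECRETS_DIR/private-key.pem" ]; then
    echo "auth: GitHub App key"            # placeholder for app-based registration
  elif [ -f "$SECRETS_DIR/pat.token" ]; then
    echo "auth: PAT"                       # placeholder for PAT-based registration
  else
    echo "auth: short-lived registration token from env"
  fi
}

authenticate

# Wipe the mounted secrets before the (shared) pipeline user can read them;
# refuse to start the runner service if the wipe fails.
if ! rm -f "$SECRETS_DIR/private-key.pem" "$SECRETS_DIR/pat.token" 2>/dev/null; then
  echo "could not clear secrets, refusing to start" >&2
  exit 1
fi
# exec ./run.sh   # hypothetical: hand off to the runner service
```

Note the remaining window: the secrets exist on disk between container start and the `rm`, so a pipeline already running in another container on the same volume could still race it.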
https://redd.it/1o5ctbh
@r_devops
What are the best integrations for developers?
I’ve just started using monday dev for our dev team. What integrations do you find most useful for dev-related tools like GitHub, Slack or GitLab?
https://redd.it/1o5c74n
@r_devops
monday dev vs clickup, why did you make the switch?
We moved from ClickUp to monday dev for its simpler interface and better automation. Curious about others’ experiences?
https://redd.it/1o5fjds
@r_devops
Built a 3 tier web app using AWS CDK and CLI
Hey everyone!
I’m a beginner on AWS and I challenged myself to build a production-grade 3-tier web infrastructure using only AWS CDK (Python) and AWS CLI.
**Stack includes:**
* VPC (multi-AZ, 3 public + 3 private subnets, 1 NAT Gateway)
* ALB (public-facing)
* EC2 Auto Scaling Group (private subnets)
* PostgreSQL RDS (private isolated)
* Secrets Manager, CloudWatch, IAM roles, SSM, and billing alarms
Everything was done code-only, no console clicks except for initial bootstrap and billing alarm testing.
**Here’s what I learned:**
* NAT routing finally clicked for me.
* CDK’s abstraction makes subnet/route handling a breeze.
* Debugging AWS CLI ARN capture taught me about stdout/stderr redirection.
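The stdout/stderr lesson can be sketched like this (`get_arn` is a stand-in for a real AWS CLI call such as `aws sns create-topic --query TopicArn --output text`; the ARN value is made up):

```shell
# Command substitution $(...) captures stdout only, so progress/diagnostic
# output sent to stderr still reaches the terminal while the result lands
# in the variable.
get_arn() {
  echo "creating topic..." >&2                              # diagnostics -> stderr
  echo "arn:aws:sns:us-east-1:123456789012:billing-alerts"  # result -> stdout
}

TOPIC_ARN=$(get_arn)     # "creating topic..." is printed, not captured
echo "captured: $TOPIC_ARN"
```

Mixing the two streams (e.g. a tool that prints its banner to stdout) is exactly what corrupts captured ARNs, hence the debugging mentioned above.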
**Looking for feedback on:**
* Cost optimization
* Security best practices
* How to read documentation to refactor the CDK app
**GitHub Repo:** [**https://github.com/asim-makes/3-tier-infra**](https://github.com/asim-makes/3-tier-infra)
https://redd.it/1o5gyvr
@r_devops
Why did containers happen? A view from ten years in the trenches by Docker's former CTO Justin Cormack
- Post
- Talk
https://redd.it/1o5h93m
@r_devops
Need help for suggestions regarding SDK and API for Telemedicine application
Hello everyone,
So currently our team is planning to build a telemedicine application. Like any telemedicine app, it will have chat and video conferencing features.
The backend (Node.js and Firebase) is almost ready, but we haven't been able to decide which real-time communication SDK and API to use.
We can't decide between ZEGOCLOUD and Twilio. Has anyone used either before? Kindly share your experience. Any other suggestions are also welcome.
TIA.
https://redd.it/1o5h6xs
@r_devops
Which internship should i choose?
Currently just a student in Year 1 trying to break into the field of DevOps.
In your opinion, if given a choice, which internship would you choose: Platform Engineer or DevOps?
I currently have 2 internship options but am unsure which to choose. Any suggestions to help me decide will be greatly appreciated. I have learned technologies from KodeKloud (GitHub Actions CI/CD, AWS, Terraform, Docker, and K8s) and understand that both internships provide valuable opportunities to learn.
Option 1: Platform Engineer Intern
Company: NETS (Slightly bigger company, something like VISA but not on the same scale)
Tech: Python, Bash Scripting, VM, Ansible
Option 2: DevOps Intern
Company: (SME)
Tech: CICD, Docker, Cloud, Containerization
Really don't know what to expect from both, maybe someone with more experience can guide me to a direction :)
https://redd.it/1o5gk7d
@r_devops
Our Disaster Recovery "Runbook" Was a Notion Doc, and It Exploded Overnight
The Notion "DR runbook" was authored years ago by someone who left the company last quarter. Nobody ever updated it or tested it under fire.
**02:30 AM, Saturday:** Alerts blast through Slack. Core services are failing. I'm jolted awake by multiple pages from our on-call engineer. At 3:10 AM, I join a huddle as the cloud architect responsible for uptime. The stakes are high.
We realize we no longer have access to our production EKS cluster. The Notion doc instructs us to recreate the cluster, attach node groups, and deploy from Git. Simple in theory, disastrous in practice.
* The cluster relied on an OIDC provider that had been disabled in a cleanup sprint a week ago. IRSA is broken system-wide.
* The autoscaler IAM role lived in an account that was decommissioned.
* We had entries in aws-auth mapping nodes to a trust policy pointing to a dead identity provider.
* The doc assumed default AWS CNI with prefix delegation, but our live cluster runs a custom CNI with non-default MTU and IP allocation flags that were never documented. Nodes join but stay NotReady.
* Helm values referenced old chart versions, and readiness and liveness probes were misaligned. Critical pods kept flapping while HPA scaled the wrong services.
* Dashboards and tooling required SSO through an identity provider that was down. We had no visibility.
By **5:45 AM**, we admitted we could not rebuild cleanly. We shifted into a partial restore mode:
* Restore core data stores from snapshots
* Replay recent logs to recover transactions
* Route traffic only to essential APIs (shutting down nonessential services)
* Adjust DNS weights to favor healthy instances
* Maintain error rates within acceptable thresholds
We stabilized by **9:20 AM**. Total downtime: approximately 6.5 hours. Post-mortem over breakfast. We then transformed that broken Notion document into a living runbook: assign owners, enforce version pinning, schedule quarterly drills, and maintain a printable offline copy. We built a quick-start 10-command cheat sheet for 2 a.m. responders.
**Question:** If you opened your DR runbook in the middle of an outage and found missing or misleading steps, what changes would you make right now to prevent that from ever happening again?
https://redd.it/1o5mdjd
@r_devops
Resume Suggestions
I am applying for Cloud Intern / DevOps Intern roles for Summer 2026. This is my resume. Please provide suggestions.
Also, please let me know if any internships are open in your company.
https://redd.it/1o5nscv
@r_devops
How much of this AWS bill is a waste?
Started working with a big telecom provider here in Canada. These guys are wasting so much on useless shit it boggles my mind.
Monthly bill for their cutting-edge "tech innovation department" (the in-house tech accelerator) clocks in at $30k/month.
The department is supposed to be leading the charge on using AI to reduce cost, use the best stuff AWS can offer, and "deliver best experience for the end user".
First-day observations:
EC2 is overprovisioned by 50%: the current 50 instances could be halved to 25. No CloudWatch, no logging, no monitoring enabled, and no one can answer the "do we need it?" questions.
No one has done any usage analysis over the past 18 months, let alone followed the best practice of evaluating every 3-6 months.
There's no performance baseline and no SLAs for any of the services. No uptime guarantee (and they wonder why everyone hates them), no load/response-time monitoring, no cost impact analysis.
No infra as code (i.e., Terraform), no autoscaling policies, and definitely no red teaming/resilience testing.
I spoke to a handful of architects and no one can point me toward the FinOps team in charge of cost optimization. So basically the budget keeps growing and they keep getting sold to.
I honestly don't know why I'm here.
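For anyone starting a similar audit, here's a minimal sketch of the kind of usage analysis that's missing: given per-instance daily average CPUUtilization datapoints (in practice pulled with boto3's `cloudwatch.get_metric_statistics`), flag downsizing candidates. The function name, instance IDs, and the 20% threshold are all illustrative, not anything from the post:

```python
# Hypothetical right-sizing helper. Decision logic is kept as a pure
# function so it's testable without AWS credentials; the fetch itself
# would use boto3 CloudWatch calls.

def downsize_candidates(cpu_by_instance, threshold=20.0):
    """Return [(instance_id, avg_cpu)] for instances averaging below threshold.

    cpu_by_instance maps an instance id to a list of daily average CPU
    percentages, e.g. the "Average" fields of CloudWatch datapoints.
    """
    flagged = []
    for instance_id, datapoints in cpu_by_instance.items():
        if not datapoints:
            continue  # no metrics at all is its own red flag
        avg = sum(datapoints) / len(datapoints)
        if avg < threshold:
            flagged.append((instance_id, round(avg, 1)))
    # Sort by average CPU so the most idle instances surface first
    return sorted(flagged, key=lambda pair: pair[1])

# Example with made-up numbers: two of three instances look oversized.
usage = {
    "i-0aaa": [12.0, 8.5, 10.2],   # mostly idle
    "i-0bbb": [55.0, 61.3, 58.9],  # actually busy
    "i-0ccc": [3.1, 2.8, 4.0],     # nearly idle
}
print(downsize_candidates(usage))  # → [('i-0ccc', 3.3), ('i-0aaa', 10.2)]
```

Even a crude report like this, run once, would answer the "do we need it?" question the architects couldn't.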
https://redd.it/1o5toxi
@r_devops
Do homelabs really help improve DevOps skills?
I’ve seen many people build small clusters with Proxmox or Docker Swarm to simulate production. For those who tried it, which homelab projects actually improved your real world DevOps work and which ones were just fun experiments?
https://redd.it/1o5w3sv
@r_devops
How do you keep IaC repositories clean as teams grow?
Our Terraform setup began simple but now every microservice team adds their own modules and variables. It’s becoming messy with inconsistent naming and ownership. How do you organize large IaC repos without forcing everything into a single centralized structure?
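One shape that keeps ownership visible without forcing full centralization — sketched here with hypothetical team and module names — is versioned shared modules consumed by tag, with each team owning its own stack directory and state:

```text
infra/
├── modules/              # shared, versioned modules (tagged releases)
│   ├── vpc/
│   └── eks-cluster/
├── teams/
│   ├── payments/         # owned by the payments team via CODEOWNERS
│   │   ├── main.tf       # pins shared modules by source + version tag
│   │   └── backend.tf    # team-scoped remote state
│   └── search/
│       └── main.tf
└── CODEOWNERS            # maps teams/<name>/ to the owning team
```

Teams move at their own pace because module upgrades are opt-in (bump the tag), while naming conventions live in the shared modules rather than in review nitpicks.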
https://redd.it/1o5w3di
@r_devops
Anyone else experimenting with AI assisted on call setups?
We started testing a workflow where alerts trigger a small LLM agent that summarizes logs and suggests a likely cause before a human checks it. Sometimes it helps a lot, other times it makes mistakes. Has anyone here tried something similar or added AI triage to their DevOps process?
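A minimal sketch of that triage shape, with a stubbed model call. Everything here is hypothetical (function names, the `llm` callable); wire `llm` to whatever completion endpoint you actually run:

```python
# Alert -> log tail -> prompt -> model suggestion. A human still verifies
# the answer; the model's output is a lead, not a verdict.

def build_triage_prompt(alert_name, log_lines, max_lines=50):
    """Keep only the tail of the logs and frame them for the model."""
    tail = log_lines[-max_lines:]
    return (
        f"Alert fired: {alert_name}\n"
        "Recent log excerpt:\n"
        + "\n".join(tail)
        + "\nSuggest the single most likely root cause in one paragraph."
    )

def triage(alert_name, log_lines, llm):
    """Return the model's suggestion for the on-call human to check."""
    return llm(build_triage_prompt(alert_name, log_lines))

# Stub standing in for a real completion API.
stub = lambda prompt: "Likely cause: connection pool exhaustion (stub answer)"
print(triage("api-5xx-spike", ["ERROR pool timeout", "ERROR pool timeout"], stub))
```

Capping the log excerpt matters both for token cost and because the most recent lines usually carry the failure signature.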
https://redd.it/1o5w30f
@r_devops
Who is responsible for owning the artifact server in the software development lifecycle?
So the company I work at is old, but brand new to internal software development. We don’t even have a formal software engineering team, but we have a sonatype nexus artifact server. Currently, we can pull packages from all of the major repositories (pypi, npm, nuget, dockerhub, etc…).
Our IT team doesn’t develop any applications, but they are responsible for the “security” of this server, and I feel like they have the settings cranked as high as possible. For example, all Linux Docker images (slim-bookworm, alpine, etc.) are quarantined for stuff like glibc vulnerabilities where “a remote attacker can do something with the stack”… or Python’s pandas is quarantined for serializing remote pickle files, SQLAlchemy for its loads methods, everything related to AI like LangChain… all of npm is quarantined because it is a package that allows you to “install malicious code”. I’ll reiterate: we have no public-facing software. Everything is hosted on premises and inside our firewalls.
Do all organizations with an internal artifact server just have to deal with this? Find other ways to do things? Who typically creates the policies that say package x or y should be allowed? If you have had to deal with a situation like this, what strategies did you implement to create a more manageable developer experience?
https://redd.it/1o5zv57
@r_devops
self-hosted AI analytics tool useful? (Docker + BYO-LLM)
I’m the founder of Athenic AI (a tool to explore/analyze data with natural language). Toying with the idea of a self-hosted community edition and wanted to get input from people who work with data...
The community edition would be:
Bring-Your-Own-LLM (use whichever model you want)
Dockerized, self-contained, easy to deploy
Designed for teams who want AI-powered insights without relying on a cloud service
If interested, please let me know:
Would a self-hosted version be useful?
What would you actually use it for?
Any must-have features or challenges we should consider?
https://redd.it/1o5voxu
@r_devops
Rundeck Community Edition
It's been a while since I last looked at Rundeck and, not to my surprise, PagerDuty is pushing people to purchase a commercial license. Looking at the comparison chart, I wonder if the CE is useless. I don't care about support and HA, but not being able to schedule jobs is a deal breaker for us. Is anyone using Rundeck who can vouch that it is still useful with the free edition? Are plugins available?
What we need:
- self-service center for ad hoc jobs
- schedule jobs
- retry failed jobs
- fire off multiple worker nodes (ECS containers) to run multiple jobs independently of one another
https://redd.it/1o6344v
@r_devops