Reddit DevOps – Telegram
How do you guys handle very high traffic?

I came across a case where we usually get 10-15k requests per minute; on normal spikes it goes up to 40k req/min. But today, for some reason, I encountered huge spikes of 90k req/min multiple times. The servers that handle requests are in an auto scaling group, and it scaled up to 40 servers to match the traffic, but this also resulted in lots of 5xx and 4xx errors while it scaled up.
Architecture is as below

AWS WAF -> ALB -> Auto Scaling EC2

Some of the requests are not that important to us, meaning they can be processed later (slowly).

Need senior-level architecture suggestions to handle this better.

We considered containerization, but at the moment the app is tightly coupled to a local Redis server. Each server needs to run its own Redis server and PHP Horizon.
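
For the deferrable share of the traffic, here is a minimal sketch of the buffer-then-drain pattern, assuming an SQS queue is available (the queue URL is a placeholder, and since the app is PHP/Laravel Horizon, this Python version is purely illustrative of the pattern):

```python
# Illustrative only: accept low-priority requests, enqueue them, and return
# 202 immediately; workers drain the queue at their own pace, so traffic
# spikes don't force the web tier to scale for work that can wait.
import json

import boto3

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/deferred-work"  # placeholder
sqs = boto3.client("sqs")


def handle_low_priority(payload: dict) -> int:
    """Enqueue instead of processing inline."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
    return 202  # Accepted: will be processed asynchronously
```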

https://redd.it/1pddcp8
@r_devops
Building a Platform for Provisioning Infrastructure on Different Clouds - Side Project

Hello, I hope everyone is doing well. These days I have free time because my job is very relaxed, so I decided to build a platform similar to an internal developer tool. It's just a side project for polishing my skills, because I want to learn platform engineering (I am a DevOps engineer). I have a question for all platform engineers: if you were to build such a platform, how would you design the architecture? My current stack is:
Casbin - for RBAC
Pulumi - for infrastructure provisioning
FastAPI - backend API
React - frontend
Celery + Redis - for handling multiple jobs
PostgreSQL - for the database
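
To make the stack concrete, here is a minimal sketch of how three of these pieces could be wired together: FastAPI accepts a request, Celery runs the slow work, and Pulumi's Automation API provisions the stack. All names (provision_stack, platform-demo, the inline S3 program) are invented for illustration, and the pulumi-aws provider is assumed to be installed:

```python
# Hypothetical sketch: FastAPI endpoint hands provisioning off to a Celery
# worker, which drives Pulumi's Automation API.
from celery import Celery
from fastapi import FastAPI
import pulumi
from pulumi import automation as auto
import pulumi_aws as aws

REDIS_URL = "redis://localhost:6379/0"  # assumed broker location

api = FastAPI()
worker = Celery("provisioner", broker=REDIS_URL)


def s3_program() -> None:
    """Inline Pulumi program: a single S3 bucket as a stand-in resource."""
    bucket = aws.s3.Bucket("demo-bucket")
    pulumi.export("bucket_name", bucket.id)


@worker.task
def provision_stack(stack_name: str) -> str:
    # Automation API: create/select the stack and run `up` without the CLI.
    stack = auto.create_or_select_stack(
        stack_name=stack_name,
        project_name="platform-demo",
        program=s3_program,
    )
    result = stack.up()
    return result.summary.result  # e.g. "succeeded"


@api.post("/stacks/{stack_name}")
def create_stack(stack_name: str) -> dict:
    # Return immediately; the Celery worker does the slow Pulumi work.
    task = provision_stack.delay(stack_name)
    return {"task_id": task.id, "status": "queued"}
```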

For cloud provider authentication I am using an identity provider setup to exchange tokens automatically, so there is no need to store service accounts.

I need suggestions: what mistakes do people make when building a platform, and how can they be avoided? Is my current stack good, or does it need to change?
Thanks everyone.

https://redd.it/1pddqo0
@r_devops
My first website

It's basically what the title says: I created my first website (with help from AI) and wanted to get feedback. If you find the idea interesting, feel free to make a donation at the bottom of the page. Note the site is not completely finished yet; there are still some bugs to fix.

Link: jgsp.me

https://redd.it/1pdi42e
@r_devops
Setup to deploy small one-off internal tools without DevOps input?

So,

Our DevOps guy is flooded, and so he is the bottleneck for deploying anything new. My team would like to be able to deploy one-off web apps to AWS without his input, as they are not mission-critical (prototypes, ideas, internal tools), but at the moment it takes weeks to make that happen.

I'm thinking: if we had an EKS cluster for handling these little web apps, is there a setup in which, along with the web-app code, we could include the k8s config YAML for the app and have a CI/CD script (we're using Bitbucket) that picks up this k8s config and deploys to EKS?
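
For what it's worth, a minimal sketch of that CI step in Python, assuming the pipeline runner is handed a kubeconfig for the sandbox cluster; the manifest path and namespace are made up:

```python
# Hypothetical Bitbucket pipeline step: apply the k8s YAML shipped with the
# repo to the cluster.
from kubernetes import client, config, utils

config.load_kube_config()        # kubeconfig injected into the CI runner
api = client.ApiClient()

# Note: create_from_yaml creates objects and errors if they already exist;
# a real pipeline would delete-and-recreate, or shell out to `kubectl apply`.
utils.create_from_yaml(api, "k8s/app.yaml", namespace="team-sandbox")
print("deployed k8s/app.yaml to namespace team-sandbox")
```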

Hopefully without involving the poor DevOps guy, making my team more independent while remaining secure in our VPC.

We had a third party vibe-code a quick app and deploy it to Vercel, which breaks company data privacy for our clients, not to mention the security concerns. But it's a use case we've been told we need to cater to...

Has anyone done something like this?

https://redd.it/1pdio9p
@r_devops
Using ClickHouse for Real-Time L7 DDoS & Bot Traffic Analytics with Tempesta FW

Most open-source L7 DDoS mitigation and bot-protection approaches rely on challenges (e.g., CAPTCHA or JavaScript proof-of-work) or static rules based on the User-Agent, Referer, or client geolocation. These techniques are increasingly ineffective, as they are easily bypassed by modern open-source impersonation libraries and paid cloud proxy networks.

We explore a different approach: classifying HTTP client requests in near real time using ClickHouse as the primary analytics backend.

We collect access logs directly from [Tempesta FW](https://github.com/tempesta-tech/tempesta), a high-performance open-source hybrid of an HTTP reverse proxy and a firewall. Tempesta FW implements zero-copy per-CPU log shipping into ClickHouse, so the dataset growth rate is limited only by ClickHouse bulk ingestion performance - which is very high.

[WebShield](https://github.com/tempesta-tech/webshield/), a small open-source Python daemon:

* periodically executes analytic queries to detect spikes in traffic (requests or bytes per second), response delays, surges in HTTP error codes, and other anomalies (a sketch of such a query follows this list);

* upon detecting a spike, classifies the clients and validates the current model;

* if the model is validated, automatically blocks malicious clients by IP, TLS fingerprints, or HTTP fingerprints.
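
As an illustration of the first step, here is a hypothetical spike-detection query in Python via clickhouse-connect; the access_log table, its columns, and the 5x threshold are invented for this sketch and are not Tempesta FW's actual schema:

```python
# Compare requests-per-second in the last minute against the previous
# hour's baseline, and flag a spike when the ratio is large.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

row = client.query("""
    SELECT
        countIf(timestamp >= now() - INTERVAL 1 MINUTE) / 60.0   AS rps_now,
        countIf(timestamp <  now() - INTERVAL 1 MINUTE) / 3540.0 AS rps_base
    FROM access_log
    WHERE timestamp >= now() - INTERVAL 1 HOUR
""").first_row

rps_now, rps_base = row
if rps_base > 0 and rps_now > 5 * rps_base:   # 5x threshold is arbitrary
    print(f"traffic spike: {rps_now:.1f} rps vs baseline {rps_base:.1f}")
```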

To simplify and accelerate classification — whether automatic or manual — we introduced a new TLS fingerprinting method.

WebShield is a small and simple daemon, yet it is effective against multi-thousand-IP botnets.

The [full article](https://tempesta-tech.com/blog/defending-against-l7-ddos-and-web-bots-with-tempesta-fw/) includes configuration examples, ClickHouse schemas, and queries.


https://redd.it/1pdd2lm
@r_devops
YAML pipeline builder

Is there such a thing as a GUI to at least scaffold multi-stage pipelines? I'm building some relatively simple ones, and it seems to me a GUI would have been able to do what I need.


The Azure DevOps classic builder does a pretty good job, but it only works within a single job.

https://redd.it/1pdluxy
@r_devops
Do you require your team to refactor code or follow design patterns - AI/MLOps?

Hi, it's me again.

Trying to change things for the better in the long run is so painful. Basically, I joined this startup and everything is still messy, as everyone lacks production experience (including me).

I realize that if we want the development process to be correct and efficient, we need to change and refactor everything. For example, we're developing AI core features, but we don't have a CI/CD pipeline, almost no design patterns are applied to the code, values are hard-coded, and prompts are put directly into the code.


So recently I came up with the idea of building CI/CD and using MLflow for tracking everything, for transparency. I know that evaluation, benchmarks, and version tracking are extremely important in AI development. For example, someone on the team changes the prompt, does some shallow testing (on a few samples), picks a good sample result, and says it's better now. No, we need comprehensive testing again: log and show the results on a dashboard, and make sure it's truly better.

As someone who lacks experience in MLOps, I still know (a little) about what we should do to make the development process more reliable. But I also know that this change might be painful for the other devs on my team. Maybe I have to propose a design pattern that everyone else needs to refactor toward and follow? For example, to standardize instruction prompts, we should definitely put them somewhere else and have a prompt management mechanism...
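
To make the tracking idea concrete, here is a minimal sketch of logging a prompt version plus its evaluation result with MLflow; the run name, metric, artifact path, and placeholder evaluation are all hypothetical:

```python
# Each prompt edit becomes an MLflow run logging the prompt text plus eval
# metrics, so "it's better now" claims are backed by a dashboard.
import mlflow

PROMPT = "Summarize the following support ticket in two sentences: {ticket}"


def run_eval(prompt: str) -> float:
    """Placeholder: score the prompt against a fixed benchmark set."""
    return 0.87  # stand-in for a real comprehensive evaluation


with mlflow.start_run(run_name="prompt-v2"):
    mlflow.log_text(PROMPT, "prompts/summarize.txt")  # versioned artifact
    mlflow.log_param("prompt_version", "v2")
    mlflow.log_metric("eval_accuracy", run_eval(PROMPT))
```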


But I also don't know if this is really worth trying or changing. Or, if we're lucky, do we get to make it work 100% and put it into production?


Please share your thoughts. :(



https://redd.it/1pdo9j2
@r_devops
What even am I?

I have the opportunity to “choose” my “role” at my current company. This won't greatly affect my pay unless I can make a case for it. With all the terms, and different companies naming the same roles differently, I'm really just clueless.

Here’s what I do currently at my company: CI/CD + multi-cloud IaC and k8s to infra design and cost optimization. I’m on the ISO committee writing company policies/processes for compliance (ISO/GDPR/SOC2), help QA with tests & load testing, manage access + IT inventory, and lately run AI ops (designing the flow, vector DBs, agents, repo modules)


https://redd.it/1pdsrrc
@r_devops
Enabling Google Consent Mode with OneTrust for Germany

Hello folks, I need your help in setting up Google Consent Mode. We use OneTrust as the CMP on our websites. OneTrust has an option to enable Google Consent Mode, and when it's enabled there are default choices for each storage type. Can someone advise which option to select for each category to set up Google Consent Mode correctly? In case the website address is needed, it's mitdiabetes.de.

https://redd.it/1pdt13e
@r_devops
Kubernetes Secrets/ENV automation

Hey guys! I recently came across a use case where secrets need to be auto-generated and pushed to a secret management tool (Vault, in my case).
Context:
1) Every time we create a new cluster for a new client, we create the secrets manually: API keys and some randomly generated strings (including Mongo or Postgres connection strings). This takes a lot of time and effort.

2) On release day, we manually compare the lower and upper environments to find the newly created secrets.

We have now created a Golang application that automatically generates the secrets based on the context provided to it. But some user intervention is still required via the CLI to confirm the secret type (if it's an API key, it can't be generated randomly, so the user needs to pass it in via the CLI).
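
For illustration, here is the generate-and-push pattern as a minimal Python sketch (the poster's tool is Go). It uses hvac against Vault's KV v2 engine; the env vars, mount point, and path are assumptions:

```python
# Generate random secrets locally, collect the non-generatable ones from
# the operator, and write everything to Vault in one shot.
import os
import secrets

import hvac

client = hvac.Client(
    url=os.environ["VAULT_ADDR"],
    token=os.environ["VAULT_TOKEN"],
)

generated = {
    "app_secret": secrets.token_urlsafe(32),        # random: can be generated
    "postgres_password": secrets.token_urlsafe(24),
}
# An api_key can't be invented, so it still has to come from the operator:
generated["api_key"] = os.environ.get("CLIENT_API_KEY", "CHANGE-ME")

client.secrets.kv.v2.create_or_update_secret(
    mount_point="secret",
    path="clients/new-client",
    secret=generated,
)
print("pushed secrets to secret/clients/new-client")
```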

Does anyone know how we can manage this more effortlessly, like a one-click solution?
Can someone please let me know how you handle this in your organization?

Thank you!

https://redd.it/1pdupoa
@r_devops
Switching to DevOps from frontend/full-stack dev

I have 2 YOE and am planning to switch to DevOps from frontend-heavy full-stack development in the banking/fintech domain. Currently my package is 6.2 LPA in Mumbai, India, and I am targeting a minimum of 25 LPA INR for my next switch. I just wanted your advice on what I should focus on most to get the desired hike and an entry into a DevOps role: getting hands-on with DevOps tools, anything else (maybe soft skills), and how to become the best in the DevOps field. Currently I am following the roadmap from the roadmap site.
Thanks🙌🏻

https://redd.it/1pdv72p
@r_devops
Looking for guidance: Wiz vs Orca vs Upwind

I'm trying to pick a cloud security platform for one of our clients, and I'm kind of stuck. They're growing fast, and we're trying to keep things safe while the security team is still taking shape. Right now our DevOps and SRE folks handle most of it, and they're stretched enough as it is.

We run fully on AWS and use the native tools, but the alerts stack up. We need clearer signals: what's exposed, what's exploitable, what needs attention now, not next month.

We looked at Wiz, Orca, and Upwind. They look similar from the outside: same claims, same style. One talks about runtime data through eBPF, one pushes posture, one pushes simplicity. It's hard to tell what actually changes the day-to-day work.
Price matters. Ease matters, and so does something that helps a small group keep things under control.

Please tell me about your experience with them. Not the demo version, please 🙏.

TIA



https://redd.it/1pdw8bw
@r_devops
Version/Patch Monitoring Service on AWS/GCP/Azure

Hi,

Y'all know how you have hundreds of services deployed in the cloud, each requiring its own upgrade and patch management protocol?

Would there be interest in a small web service that monitors your clusters, DBs, ElastiCache, etc. (with just read permissions on the versions), shows the current version, EOL dates, upcoming patches, and AWS release notes, plus auto-alerts your team and syncs with your calendar?
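
The read-only part is straightforward; here is a hypothetical sketch with boto3 (these describe calls only need read permissions, and nothing here is specific to the proposed service):

```python
# Enumerate engine/platform versions across RDS, ElastiCache, and EKS.
import boto3

rds = boto3.client("rds")
for db in rds.describe_db_instances()["DBInstances"]:
    print(f'RDS {db["DBInstanceIdentifier"]}: '
          f'{db["Engine"]} {db["EngineVersion"]}')

elasticache = boto3.client("elasticache")
for cluster in elasticache.describe_cache_clusters()["CacheClusters"]:
    print(f'ElastiCache {cluster["CacheClusterId"]}: '
          f'{cluster["Engine"]} {cluster["EngineVersion"]}')

eks = boto3.client("eks")
for name in eks.list_clusters()["clusters"]:
    version = eks.describe_cluster(name=name)["cluster"]["version"]
    print(f"EKS {name}: Kubernetes {version}")
```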

This is geared toward SMBs rather than enterprises that have entire teams devoted to this.

https://redd.it/1pduucq
@r_devops
External Service Certification

Something I have observed while working at different companies (working closely with the dev teams) is what happens when developers want or need to work with third-party services.

I've seen this a few times: the team finds an external service that seems to work for a project, but then the questions come from DevOps:

- Where is the data stored?

- How long will this API keep my (and our customers') data?

- Who else is processing or accessing it behind the scenes?

- And does the API even have the certifications needed to keep everything secure and compliant? (Folks working with EU companies will know what I mean here, with GDPR etc.)

In smaller companies and startups, this is often not a big problem: things move fast, and the stakes might feel lower. But in bigger companies, with security and compliance teams and standards, this is not the case (you can't just plug in any API and hope it all works out).

The main scenario I have seen: the security/DevOps teams need some answers and send a (long) questionnaire. If the service provider can't show or demonstrate where the data lives or how it is protected, chances are the service doesn't get approved at all.

Sometimes that process drags on, which delays things and can even force the team to build something new (from scratch).

So I was wondering how we can put all this into practice. It's not the final result yet, but I think it's headed in the right direction.

So we put together a certification scheme to capture (and show), upfront, structured human- and machine-readable information about how APIs handle data (a hypothetical sketch of such a record follows this list):

- Location/region where data is stored

- Retention period (input and output, logs, metadata)

- Third parties that might be involved

- Any standards, and whether they are actually met (and not just implied): this could be GDPR, SOC 2, etc.
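
To illustrate what a machine-readable record of those fields could look like, here is a hypothetical sketch; the field names and values are invented and are not ApyHub's actual schema:

```python
# A certification record as a plain data structure: easy for humans to
# read and for tooling to validate.
from dataclasses import dataclass, field


@dataclass
class ApiCertification:
    api_name: str
    data_region: str                 # where data is stored
    retention: dict                  # per data class (input, output, logs...)
    third_parties: list = field(default_factory=list)
    standards_met: list = field(default_factory=list)


cert = ApiCertification(
    api_name="example-ocr-api",
    data_region="eu-central-1",
    retention={"input": "0 days", "output": "0 days", "logs": "30 days"},
    third_parties=["CDN provider"],
    standards_met=["GDPR", "SOC 2 Type II"],
)
print(cert)
```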

I think having this information can help teams move faster and build features that users (and compliance folks) can trust (or at least not have big objections against, lol).

I'd like to get your take: what do you think about this idea? What extra information would you find useful to see before deciding to move ahead with an external service?

This is currently what our certificates look like (for the APIs we have certified): https://apyhub.com/catalog (you can check the shield icon next to an API).

Nikolas



https://redd.it/1pdvugg
@r_devops
Maintainer Feedback Needed: Why do you run Harbor on Azure instead of using ACR?

Hey all, I am one of the maintainers of CNCF Harbor. I know we have quite a few users who run Harbor on Azure even though ACR exists.

I recently had a discussion with a group of Azure experts, who claimed there is no reason why Harbor would ever be a better fit than ACR.

I was really surprised, because that's not the reality we see. I mean, if ACR fits your needs, go with it; good for you, I am in total agreement with such a decision. ACR is simpler to set up and maintain, and it also integrates nicely into the Azure ecosystem.

From some Harbor users who run on Azure, I know a few arguments for why they favor Harbor over ACR:

- Replication capabilities from/to other registries
- IAM outside Azure (some see that as a benefit)
- Works better as an organization-wide registry
- Better fit for cross-cloud and on-prem


Somehow those arguments didn't resonate at all.

So my question is: are there any other deciding factors that made you choose Harbor instead of ACR?
Thx.

https://redd.it/1pe0269
@r_devops
Need help to improve my skills in GitHub CI/CD

Hi guys, for the past few days I have been learning Linux and Git. Using ChatGPT, I practiced some basic things; now I want to push my level from basic to intermediate.
My goal is to understand the cloud and DevOps world better and improve my skills!
Guidance and help are welcome.


https://redd.it/1pe2dkn
@r_devops
Is Golden Kubestronaut actually worth it? Looking for honest opinions from the community

Hey everyone,

I'm a Senior Cloud Architect (10+ years experience) currently holding Kubestronaut certification along with Azure Solutions Architect and a bunch of other certs. I've been seriously considering going for Golden Kubestronaut but the more I think about it, the more I'm second-guessing myself.

Here's my dilemma:

The Cost Reality:
- 5-6 additional certs to maintain = ₹75,000-1,50,000 just for exams
- Renewal costs every 2-3 years = another ₹50,000+
- Realistically 200-300 hours of study time
- That's time away from actual hands-on work
- I'd have to pay out of my own pocket, as my employer is not covering the cost

Pros I can see:
- Ultimate flex in the K8s community - only ~200 people worldwide have it
- Opens doors for conference speaking and community leadership
- Shows insane dedication and commitment
- Might help with consulting opportunities
- Resume definitely stands out in the pile

Cons I'm worried about:
- The certs I'd need to add (11+) seem less valuable than what I already have (CKA/CKS/CKAD)
- Most hiring managers don't even know the difference between Kubestronaut and Golden Kubestronaut
- Knowledge retention is already a problem - I don't use half the stuff I learned for exams daily
- That ₹1,50,000 could build a sick home lab where I'd actually learn practical skills
- My current Kubestronaut already proves I know K8s deeply
- Salary bump seems minimal - maybe 5-10% at most?

Alternative I'm considering:
Taking that same money and time to either:
1. Build a proper home lab (3-node K8s cluster + NAS) for hands-on practice
2. Get GCP or AWS certification to become multi-cloud
3. Learn Platform Engineering (Backstage, ArgoCD, Crossplane)
4. Focus on FinOps certification (seems to have better ROI)

My real question:
For those who've achieved Golden Kubestronaut - was it actually worth it career-wise? Did it open doors that regular Kubestronaut didn't? Or is it more of a personal achievement thing?

And for hiring managers - does Golden Kubestronaut actually make a candidate significantly more attractive, or is regular Kubestronaut + solid project experience better?

I'm leaning towards skipping it and focusing on practical skills + multi-cloud, but I'd love to hear from people who've been in this position. Especially interested in hearing from people who chose NOT to pursue it after getting Kubestronaut.

Thanks for any insights!

https://redd.it/1pe3ure
@r_devops
Failed KCNA three times

I'm kind of at a loss here... I've gotten a 71/75 exactly... three times. I know Kubernetes relatively well: I have gone through the KodeKloud course multiple times, done practice exams from Udemy, and done all the hands-on labs on KodeKloud multiple times.

All three times I found questions that were not covered in any of those resources: multiple specifics on networking, security, third-party players like ArgoCD, etc. I just feel like each time I prepare, I am not properly prepped for these off-the-wall questions.

Any tips? I have one more retake left...

https://redd.it/1pe5130
@r_devops
Is AI the future, or is it a big scam?

I am really confused. I am a Unity developer, and I see that nowadays 90% of jobs are around AI and agentic AI.

But at the same time, every time I give any AI a coding task, for example how to implement this:
https://github.com/CyberAgentGameEntertainment/InstantReplay?tab=readme-ov-file

I get a lot of NONSENSE: lies, false claims, code that doesn't even compile, etc.

And from what I hear from colleagues, they have the same feeling.

And at the same time, I don't see any real-world application of AI other than casual chatting or coding no more complex than "what is 2+2?"

Can someone clarify this for me? Are there really good uses of AI?

https://redd.it/1pe6ho7
@r_devops
Bitbucket bait-and-switched, now charging $15/month per self-hosted runner

I saw this morning that Bitbucket has announced self-hosted runner v5, which comes with some interesting new features, but they also changed their pricing from no charge for self-hosted runners to $15/month per concurrent build slot. So now, if you're trying to run multiple builds at once or parallelize releases on your own hardware, they want you to pay for the privilege.

This seems crazy to me, as we use self-hosted runners to save money by running builds on our own hardware. We just spent months moving a bunch of our pipelines over to BB, and it seems so wrong that, after all that, they can threaten to make our releases (which rely on parallelized pipelines) take over 10x as long unless we pony up a monthly fee we really can't afford, on top of what we're already paying for users and the hardware or instances that actually run the builds.

GitHub doesn't charge for self-hosted runners. GitLab doesn't either. It looks like CircleCI does, but its included concurrency is higher, or unlimited if you have an enterprise plan. So this feels like a total ripoff and a bait-and-switch, because they know moving to another CI platform is a massive undertaking.

https://www.atlassian.com/blog/bitbucket/announcing-v5-self-hosted-runners

https://redd.it/1pe8wzd
@r_devops