DevOps & SRE notes
12K subscribers
42 photos
19 files
2.5K links
Helpful articles and tools for DevOps&SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All the ways to support the channel: https://telegra.ph/How-support-the-channel-02-19
Terraform 1.5 has been released. Importing existing infrastructure into the Terraform state has become easier thanks to config-driven import.
https://www.hashicorp.com/blog/terraform-1-5-brings-config-driven-import-and-checks
Streaming alert evaluation scales better than the traditional approach of polling a time-series database, overcoming its high-dimensionality/cardinality limitations and enabling more reliable, real-time alerting. The transition to the streaming path has opened the door to more interesting use cases and has allowed multiple platform teams at Netflix to generate and maintain alerts programmatically without affecting other users. The streaming paradigm may also help tackle correlation problems in observability and open new opportunities beyond metrics and events, in verticals such as logs and traces.

https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e
👍1
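To make the contrast with polling concrete, here is a minimal Python sketch of the streaming-evaluation idea: alert state is updated incrementally as each data point arrives, instead of periodically querying a time-series database. This is an illustration only, not Netflix's Atlas implementation; the class name, window size, and threshold are made-up parameters.

```python
from collections import defaultdict, deque

class StreamingAlertEvaluator:
    """Evaluate an alert condition incrementally on an incoming metric stream.

    Illustrative only: a real streaming evaluator also handles time alignment,
    many expressions per stream, and alert de-duplication.
    """

    def __init__(self, threshold: float, window: int = 5):
        self.threshold = threshold
        self.window = window  # number of recent points kept per series
        self.buffers = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, series: str, value: float) -> bool:
        """Update state for one data point and return True if the alert fires."""
        buf = self.buffers[series]
        buf.append(value)
        # Fire only when the rolling average over a full window breaches the
        # threshold, so a single spike does not trigger the alert.
        return len(buf) == self.window and sum(buf) / self.window > self.threshold


evaluator = StreamingAlertEvaluator(threshold=0.9, window=3)
for point in [("cpu.user|host=a", 0.95), ("cpu.user|host=a", 0.97), ("cpu.user|host=a", 0.99)]:
    if evaluator.ingest(*point):
        print("ALERT:", point[0])
```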
In this post, the author discusses potential PostgreSQL pitfalls that may not affect small databases, but can cause issues when databases grow.
https://philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql
Pipedrive Infra manages numerous Kubernetes clusters across different clouds, including AWS and on-premise OpenStack. They had been experiencing intermittently failing pod health checks, which became more frequent over time. After an extensive investigation, the team discovered that the kubelet was initiating TCP sessions to pods using random source ports within the range reserved for Kubernetes NodePorts. As a result, the returning TCP SYN-ACK was redirected to other pods, leading to failed health checks. The fix was a single line of code that disallows using the NodePort range as the source port for outgoing TCP sessions.

https://medium.com/pipedrive-engineering/solving-the-mystery-of-pods-health-checks-failures-in-kubernetes-55b375493d03
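As a hedged illustration of the collision described above (not necessarily the team's exact patch), the sketch below checks whether a Linux node's ephemeral source-port range overlaps the default Kubernetes NodePort range (30000-32767) and prints the standard sysctl that reserves those ports so they are never chosen as source ports. The file paths and the reservation sysctl are standard Linux interfaces; everything else is an assumption for illustration.

```python
# Diagnostic sketch: detect overlap between the kernel's ephemeral port range
# and the default Kubernetes NodePort range. On many distros the default
# ephemeral range (32768-60999) does not overlap; the problem described in the
# post appears when that range has been widened.
NODEPORT_RANGE = (30000, 32767)  # Kubernetes default --service-node-port-range

def ephemeral_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the kernel's ephemeral (local) port range."""
    with open(path) as f:
        lo, hi = map(int, f.read().split())
    return lo, hi

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

if __name__ == "__main__":
    eph = ephemeral_port_range()
    if overlaps(eph, NODEPORT_RANGE):
        print(f"Ephemeral range {eph} overlaps NodePort range {NODEPORT_RANGE}.")
        print("Outgoing connections (e.g. kubelet probes) may pick a NodePort as")
        print("their source port and have replies caught by NodePort NAT rules.")
        print("One mitigation: sysctl -w net.ipv4.ip_local_reserved_ports="
              f"{NODEPORT_RANGE[0]}-{NODEPORT_RANGE[1]}")
    else:
        print("No overlap between ephemeral ports and the NodePort range.")
```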
Efficient GPU utilization is crucial for minimizing infrastructure expenses, especially in large Kubernetes clusters running AI and HPC workloads. NVIDIA MIG enables partitioning GPUs into smaller slices, but using MIG in Kubernetes through the NVIDIA GPU Operator alone has limitations due to static configurations. Dynamic MIG Partitioning addresses these limitations by automating the creation and deletion of MIG profiles based on real-time workload requirements, ensuring optimal GPU utilization. The nos module works alongside the NVIDIA GPU Operator to implement dynamic MIG partitioning, simplifying the management of MIG configurations and reducing operational costs.

https://towardsdatascience.com/dynamic-mig-partitioning-in-kubernetes-89db6cdde7a3
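The core idea behind dynamic partitioning can be sketched as a small packing problem: look at the MIG profiles that pending pods request and compute a per-GPU layout that fits them. The sketch below is a deliberately simplified illustration, not the nos algorithm: it assumes each A100-class GPU exposes 7 compute slices and that a profile named "Ng.XXgb" consumes N slices, ignoring MIG's real memory-slice and placement constraints; all names are hypothetical.

```python
from collections import Counter

GPU_SLICES = 7  # compute slices per A100-class GPU (simplifying assumption)

def slices_needed(profile: str) -> int:
    """'1g.5gb' -> 1, '3g.20gb' -> 3. Simplified: ignores memory-slice geometry."""
    return int(profile.split("g.")[0])

def plan_partitions(requested_profiles, num_gpus):
    """Greedily pack requested MIG profiles onto GPUs, largest profiles first.

    Returns a per-GPU profile count, or raises if the request cannot fit.
    A real dynamic partitioner would also diff against the current layout and
    only create/delete the MIG devices that actually changed.
    """
    gpus = [{"free": GPU_SLICES, "profiles": []} for _ in range(num_gpus)]
    for profile in sorted(requested_profiles, key=slices_needed, reverse=True):
        need = slices_needed(profile)
        gpu = next((g for g in gpus if g["free"] >= need), None)
        if gpu is None:
            raise RuntimeError(f"no GPU has {need} free slices for {profile}")
        gpu["free"] -= need
        gpu["profiles"].append(profile)
    return [Counter(g["profiles"]) for g in gpus]

# Example: pending pods request these profiles on a 2-GPU node.
print(plan_partitions(["1g.5gb", "1g.5gb", "3g.20gb", "2g.10gb", "7g.40gb"], num_gpus=2))
```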
Have you ever heard of a company migrating from a microservices architecture to a monolith?
Moving our service to a monolith reduced our infrastructure cost by over 90%. It also increased our scaling capabilities. Today, we’re able to handle thousands of streams and we still have capacity to scale the service even further. Moving the solution to Amazon EC2 and Amazon ECS also allowed us to use the Amazon EC2 compute saving plans that will help drive costs down even further.
https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90
In this post, the author explores various load balancing algorithms, including round robin, weighted round robin, dynamic weighted round robin, and least connections. The simulations demonstrate how these algorithms perform in different scenarios, highlighting their strengths and weaknesses. Round robin performs well in terms of median latency but struggles with higher percentiles. Least connections offers a good balance between simplicity and performance but may not be optimal in terms of latency. The PEWMA algorithm, which combines techniques from dynamic weighted round robin and least connections, shows significant improvements across all latency percentiles but has additional complexity and may not handle dropped requests as well as least connections. Ultimately, the choice of load balancing algorithm depends on the specific requirements of a workload and the performance characteristics that need to be optimized.

https://samwho.dev/load-balancing/
👍1
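For reference, here is a minimal Python sketch of two of the strategies discussed: plain round robin and least connections. It is a toy model with a fixed server list and an externally maintained in-flight counter, not the simulation code from the post; the class names are illustrative.

```python
import itertools

class RoundRobin:
    """Cycle through servers in order, ignoring how busy each one is."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Send each request to the server with the fewest in-flight requests."""
    def __init__(self, servers):
        self.in_flight = {s: 0 for s in servers}

    def pick(self):
        server = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[server] += 1
        return server

    def done(self, server):
        self.in_flight[server] -= 1

servers = ["a", "b", "c"]
rr, lc = RoundRobin(servers), LeastConnections(servers)
print([rr.pick() for _ in range(5)])  # -> ['a', 'b', 'c', 'a', 'b']

s1, s2 = lc.pick(), lc.pick()         # "a" and "b" now have requests in flight
print(lc.pick())                      # -> 'c', the least-loaded server
lc.done(s1)                           # request to "a" finished
```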
Adrien "ZeratoR" Nougaret's annual charity event, Zevent, returned this year with a new addition called Zevent Place. Inspired by Reddit's r/place, Zevent Place is a collaborative canvas where donors can draw pixels based on the amount they donate. Developers William Traoré and Alexandre Moghrabi built the platform with several features, such as a Pixel Upgrade system and real-time updates, to protect community creations and enhance the user experience.

The team utilized various technologies like GraphQL, NestJS, Redis, and MinIO, and managed to handle massive amounts of updates while maintaining a low CPU and bandwidth footprint. Although there were challenges, such as unexpected rate limit errors with Cloudflare, the event achieved 98.4% uptime, with the downtime being addressed and resolved promptly.

Overall, Zevent Place was a successful project, and valuable lessons were learned throughout its development and implementation.

https://medium.com/@alexmogfr/zevent-place-how-we-handled-100k-ccu-on-a-real-time-collective-canvas-71d3d346e0ab
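The article does not spell out the exact storage layout, so purely as a hypothetical sketch of how an r/place-style canvas is often kept in Redis: one byte per pixel (a palette index) written with SETRANGE, with updates fanned out to websocket workers via PUB/SUB. The key name, channel name, canvas size, and 8-bit palette are assumptions for illustration, not details from the post.

```python
import redis  # pip install redis

WIDTH, HEIGHT = 1000, 1000
CANVAS_KEY = "canvas:pixels"        # hypothetical key: 1 byte per pixel (palette index)
UPDATES_CHANNEL = "canvas:updates"  # hypothetical pub/sub channel for real-time fan-out

r = redis.Redis()

def set_pixel(x: int, y: int, color: int) -> None:
    """Write one pixel and broadcast the change to connected clients."""
    offset = y * WIDTH + x
    r.setrange(CANVAS_KEY, offset, bytes([color]))
    r.publish(UPDATES_CHANNEL, f"{x},{y},{color}")

def get_canvas() -> bytes:
    """Fetch whatever has been drawn so far, e.g. to bootstrap a new client."""
    return r.getrange(CANVAS_KEY, 0, WIDTH * HEIGHT - 1) or bytes(WIDTH * HEIGHT)

if __name__ == "__main__":
    set_pixel(10, 20, 7)
    canvas = get_canvas()
    print(canvas[20 * WIDTH + 10])  # -> 7
```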