DevOps & SRE notes – Telegram
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All ways to support the channel: https://telegra.ph/How-support-the-channel-02-19
This blog post explores how observability was built with Google Cloud services to close critical gaps in resolving production issues. Using OpenTelemetry and Google Cloud Trace Explorer, the author demonstrates how better tracing and logging improve debugging and performance monitoring.

https://punits.dev/blog/building-observability-with-google-cloud-services/
👍3
Yan Cui dives into the complexities of end-to-end testing for microservices spanning bounded contexts, offering practical strategies for balancing team responsibilities and testing approaches. This article highlights tools like Pact, WireMock Cloud, and Cypress to streamline testing while fostering collaboration across diverse organizational structures.

https://theburningmonk.com/2024/12/how-to-e2e-test-microservices-across-bounded-contexts/
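The contract-testing idea behind tools like Pact can be illustrated with a minimal sketch. This is not Pact's API: the `CONTRACT` dictionary, the field names, and the `contract_violations` helper are all hypothetical, and the check only covers field presence and type. The consumer declares what it needs from the provider, and the provider's response is verified against that declaration without running both services end to end.

```python
# Hypothetical consumer-driven contract: the fields the consumer relies on.
# Field names and types are illustrative assumptions, not from the article.
CONTRACT = {"order_id": str, "total_cents": int, "status": str}


def contract_violations(response: dict, contract: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Extra fields are deliberately ignored: the provider may add data
    # without breaking this consumer.
    return problems


# A provider response that satisfies the consumer, plus one extra field.
good = {"order_id": "A-1", "total_cents": 1999, "status": "paid", "note": "gift"}
# A provider response that breaks the consumer in two ways.
bad = {"order_id": "A-1", "total_cents": "19.99"}

print(contract_violations(good, CONTRACT))
print(contract_violations(bad, CONTRACT))
```

Running a check like this in the provider's pipeline lets the providing team catch breaking changes before deployment, which is the trade-off against full cross-team end-to-end tests.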
Mercari's Microservices Platform Network team outlines the company's successful migration from Fastly to Cloudflare, emphasizing the strategies used to keep disruption minimal. The blog post also introduces future initiatives like "CDN as a Service," aimed at empowering developers through tools such as CDN Kit and automated permission systems.

https://engineering.mercari.com/en/blog/entry/20241223-a-smooth-cdn-provider-migration-and-future-initiatives/
This post critiques the casual metaphors used to describe Facebook's 2021 outage, arguing that such language trivializes preventable engineering failures and undermines accountability. By examining historical disasters like the Grover Shoe Factory explosion, the article emphasizes the need for rigorous engineering standards to ensure safety and reliability in critical infrastructure.

https://www.flyingpenguin.com/?p=64164
👍2
This article introduces a cost-effective strategy for combining AWS WAF with reactive infrastructure to block attackers without exceeding budget constraints. By leveraging the WAF-Ja3FingerPrint-Blacklist Terraform module, it dynamically reconfigures WAF rules based on traffic analysis, reducing the expense of advanced rules like Account Theft Protection while maintaining robust security.

https://dev.to/aws-builders/combine-aws-waf-with-reactive-infrastructure-to-block-attackers-and-dont-go-broke-in-the-process-2jpb
👍1
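The "traffic analysis feeds the blocklist" loop described above can be sketched roughly as follows. This is not the logic of the WAF-Ja3FingerPrint-Blacklist module: the `Request` record, the thresholds, and the "mostly 401/403 responses" heuristic are assumptions made for illustration. The sketch only shows the analysis step that would pick JA3 fingerprints to push into a WAF rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical log record: TLS JA3 fingerprint and HTTP status."""
    ja3: str
    status: int


def fingerprints_to_block(requests, min_requests=100, max_error_ratio=0.5):
    """Pick fingerprints that are both noisy and mostly rejected.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    totals = Counter()
    rejected = Counter()
    for req in requests:
        totals[req.ja3] += 1
        if req.status in (401, 403):
            rejected[req.ja3] += 1
    return sorted(
        fp
        for fp, count in totals.items()
        if count >= min_requests and rejected[fp] / count > max_error_ratio
    )


# Synthetic traffic window: one abusive client, one busy legitimate client,
# and one client with too few requests to judge.
window = (
    [Request("aaa", 403)] * 150
    + [Request("bbb", 200)] * 500
    + [Request("ccc", 403)] * 10
)
print(fingerprints_to_block(window))
```

Requiring a minimum request count before blocking keeps a handful of failed logins from a real user off the blocklist.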
This article delves into the "hot shard problem," a common challenge in distributed systems where uneven data distribution leads to resource saturation on specific shards. It outlines practical solutions, including vertical scaling, caching, load balancing, and selecting optimal sharding keys, to ensure system performance and reliability.

https://newsletter.scalablethread.com/p/how-to-handle-hot-shard-problem
👍5
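To make the hot shard problem concrete, here is a minimal sketch of one mitigation on the sharding-key side: salting a hot key so its writes spread across several shards. Key salting is a swapped-in illustration, not necessarily the article's method, and the shard count, key name, and bucket count are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical cluster size


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep placement reproducible.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def salted_key(key: str, sequence: int, buckets: int) -> str:
    """Append a rotating suffix so one logical key becomes several physical keys."""
    return f"{key}#{sequence % buckets}"


def load_per_shard(keys) -> Counter:
    """Count how many writes land on each shard."""
    return Counter(shard_for(k) for k in keys)


# 1000 writes to a single hot key: every write hits the same shard.
plain = load_per_shard("celebrity_user" for _ in range(1000))

# The same 1000 writes with the key split into 16 salted buckets.
salted = load_per_shard(
    salted_key("celebrity_user", i, buckets=16) for i in range(1000)
)

print("unsalted:", dict(plain))
print("salted:  ", dict(salted))
```

The cost of salting is on the read path: a reader has to query every bucket of the key and merge the results, so it suits write-heavy hot keys better than read-heavy ones, where caching is usually the first choice.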
In this post, Rachel explores the frustrations of debugging lag issues in distributed systems, highlighting how subtle timing problems can cascade into larger failures. Through real-world anecdotes, the article underscores the importance of understanding system behavior and addressing latency with precision.

https://rachelbythebay.com/w/2025/01/09/lag/
👍1
The article "Load Testing Kubernetes Clients Without Breaking the Bank" on itnext.io covers cost-effective ways to load test Kubernetes clients. Load testing confirms that applications can handle increased traffic without performance degradation, and keeping its cost down matters for teams with limited budgets.

https://itnext.io/load-testing-kubernetes-clients-without-breaking-the-bank-f43332faa6ce
👍4👏21
This blog post explores deploying large language models (LLMs) with a combination of Google Kubernetes Engine (GKE), Google Gemma, and the Ollama framework, highlighting the benefits of customization, flexibility, and cost-effectiveness. With these tools, users can deploy LLMs efficiently while keeping control over their data and environment.

https://medium.com/google-cloud/gke-gemma-ollama-the-power-trio-for-flexible-llm-deployment-5f1fa9223477
👍4
This tutorial explores the integration of Kluctl with Cluster API, showing how Kluctl's templating and deployment capabilities can manage Kubernetes clusters. Users can manage multiple workload clusters from a unified CLI, and templating simplifies complex deployments without extensive copy-pasting or patching. The tutorial demonstrates setting up a local environment with kind and deploying a workload cluster with Kluctl.

https://kluctl.io/blog/2024/03/13/cluster-api-kluctl/