DevOps & SRE notes – Telegram
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All the ways to support the channel: https://telegra.ph/How-support-the-channel-02-19
Understanding optimistic concurrency control (OCC) and isolation levels is essential for keeping databases consistent. This blog post explains both concepts and how they apply in real systems.

https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html
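
For readers new to OCC, here is a minimal sketch of the general idea, not the code or design from the linked post. It assumes a hypothetical in-memory `VersionedStore` in which every key carries a version number and `commit` runs atomically (a real database would need locking or a single-writer commit path). A transaction reads without taking locks, remembers the versions it saw, and at commit time is rejected if any of those versions has changed.

```python
from dataclasses import dataclass, field


class ConflictError(Exception):
    """Raised when a key in the transaction's read set changed before commit."""


@dataclass
class VersionedStore:
    # Hypothetical store for illustration: key -> (value, version).
    data: dict = field(default_factory=dict)

    def read(self, key):
        # Unknown keys read as (None, 0) so first writes can be validated too.
        return self.data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        # Validation phase: every key we read must still be at the version we saw.
        for key, seen_version in read_versions.items():
            if self.read(key)[1] != seen_version:
                raise ConflictError(key)
        # Write phase: apply the buffered writes and bump each key's version.
        for key, value in writes.items():
            _, version = self.read(key)
            self.data[key] = (value, version + 1)
```

If two transactions both read `"a"` at the same version and then try to commit, the first commit succeeds and the second raises `ConflictError`; the caller would typically re-read and retry.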
Reflecting on the challenges of facilitating post-incident reviews, Will Gallego shares what went wrong during a recent review meeting and how over-reliance on familiar methods can hinder progress. This blog post offers a candid look at the nuances of facilitation and the lessons learned from navigating complex team dynamics.

https://willgallego.com/2025/01/11/an-incident-review-of-an-incident-review/?utm_source=chatgpt.com
This blog post describes how the author built observability on Google Cloud services to close critical gaps in resolving production issues. Using tools like OpenTelemetry and Google Cloud Trace Explorer, the author shows how better tracing and logging speed up debugging and performance monitoring.

https://punits.dev/blog/building-observability-with-google-cloud-services/
Yan Cui dives into the complexities of end-to-end testing for microservices spanning bounded contexts, offering practical strategies for balancing team responsibilities and testing approaches. This article highlights tools like Pact, WireMock Cloud, and Cypress to streamline testing while fostering collaboration across diverse organizational structures.

https://theburningmonk.com/2024/12/how-to-e2e-test-microservices-across-bounded-contexts/
Mercari's Microservices Platform Network team outlines the company's successful migration from Fastly to Cloudflare, emphasizing the strategies used to keep disruption minimal. This blog post also introduces future initiatives like "CDN as a Service," aimed at empowering developers through tools such as CDN Kit and automated permission systems.

https://engineering.mercari.com/en/blog/entry/20241223-a-smooth-cdn-provider-migration-and-future-initiatives/
This post critiques the casual metaphors used to describe Facebook's 2021 outage, arguing that such language trivializes preventable engineering failures and undermines accountability. By examining historical disasters like the Grover Shoe Factory explosion, the article emphasizes the need for rigorous engineering standards to ensure safety and reliability in critical infrastructure.

https://www.flyingpenguin.com/?p=64164
This article introduces a cost-effective strategy for combining AWS WAF with reactive infrastructure to block attackers while staying within budget. The WAF-Ja3FingerPrint-Blacklist Terraform module dynamically reconfigures WAF rules based on traffic analysis, reducing the expense of advanced rules like Account Takeover Prevention while maintaining robust security.

https://dev.to/aws-builders/combine-aws-waf-with-reactive-infrastructure-to-block-attackers-and-dont-go-broke-in-the-process-2jpb
This article delves into the "hot shard problem," a common challenge in distributed systems where uneven data distribution leads to resource saturation on specific shards. It outlines practical solutions, including vertical scaling, caching, load balancing, and selecting optimal sharding keys, to ensure system performance and reliability.

https://newsletter.scalablethread.com/p/how-to-handle-hot-shard-problem
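
As a rough illustration of what choosing a better sharding key can look like, the sketch below uses key salting, a common technique that may or may not match the article's own examples. The constants `NUM_SHARDS` and `SALT_BUCKETS` and all function names are hypothetical. Writes for one hot key are spread across several salted sub-keys, so they can land on different shards; reads then have to query every shard that might hold a piece of the key.

```python
import hashlib

NUM_SHARDS = 8     # assumed cluster size, for illustration only
SALT_BUCKETS = 4   # assumed number of sub-keys per hot key


def shard_for(key: str) -> int:
    """Map a key to a shard using a stable hash (not Python's randomized hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def salted_key(key: str, request_id: int) -> str:
    """Append a salt derived from the request so writes spread across sub-keys."""
    return f"{key}#{request_id % SALT_BUCKETS}"


def shards_for_read(key: str) -> set:
    """Return every shard that may hold part of a salted key."""
    return {shard_for(f"{key}#{i}") for i in range(SALT_BUCKETS)}
```

The trade-off is that writes get cheaper on the hot shard while reads fan out to up to `SALT_BUCKETS` shards and must merge the results, so salting is usually applied only to keys known to be hot.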
In this post, Rachel explores the frustrations of debugging lag issues in distributed systems, highlighting how subtle timing problems can cascade into larger failures. Through real-world anecdotes, the article underscores the importance of understanding system behavior and addressing latency with precision.

https://rachelbythebay.com/w/2025/01/09/lag/