TechLead Bits
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
Welcome to TechLead Bits!

Navigation:

#architecture - everything about software architecture, #systemdesign, #patterns, cross-cutting concerns and trade-offs

#engineering - engineering practices in software development: #codereview, #ci, #refactoring, #testing, #documentation

#usecase - implementation references from real companies

#softskills - soft skills recommendations and topics: #management, #leadership, #communications, #productivity, #creativity

#booknook - books overview

#news - notable news from the industry

#offtop - other materials and news that may not directly relate to technical leadership

👋 Hi, I’m Nelia!

🔹 15+ years in IT
🔹 10+ years in team management
🔹 Now head of a cloud platform division: we cook and deliver cloud platform services like Kubernetes, SQL/NoSQL databases, Kafka, data streaming, service mesh, backup/restore procedures and more.

My focus is on non-functional requirements — things like consistency, availability, latency, security, observability, scalability, and more.

💡 Here I'm writing about:
- Technical leadership
- Architecture and engineering practices
- How to scale systems, processes — and yourself 😉
Code Reviews Like Humans

There's already a ton of stuff out there about code reviews, covering what they're for and how to make them better. But I want to recommend one more: Better Code Reviews FTW!

One of the points I really like here is that a code review is basically just giving feedback. And we're all pretty good at giving feedback in other areas. So why not apply that same constructive approach to code reviews? Just imagine a comment like `Fix typo in successful`. It can be read as `Hey, you made a mistake, but I still think you're smart!` or as `You made a stupid mistake, dumbass`. The impression is quite different, right? It's all about being supportive and positive with each other.

Oh, and there's a cool article that dives into the same topic if you're interested: https://mtlynch.io/human-code-reviews-1/ Check it out!

#engineering #codereview
Scale it automagically!

Currently, there's considerable interest in autoscaling, largely driven by the cost of resources. Let's look at how it works with dynamic load inside Kubernetes. There's a naive belief that the HPA (Horizontal Pod Autoscaler) will deliver the same performance as pre-configured replicas as soon as demand arises.

Consider a scenario where we set the CPU utilization target to 80%. During peak events such as Black Friday or intense invoicing periods, service demand surges, causing CPU utilization to grow to 100%.
The HPA calculates the desired number of replicas using the formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]


For instance, if we have 5 replicas and CPU utilization grows to 100% against the desired 80%, the calculation results in ceil[5 * (100/80)] = 7 replicas as a target. Consequently, the deployment scales up to 7 replicas. However, this scaling process involves a delay: the HPA controller waits for all pods to become ready, which can be time-consuming depending on the size of the service. Additionally, there's a stabilization window with a default of 5 minutes.
After the initial scaling, it's highly likely that the adjustment won't be sufficient (let’s say you need around 50 replicas to handle the load). Therefore, subsequent iterations of scaling will occur at intervals of at least 5 minutes until the desired 80% utilization is achieved or until the maximum allowed number of replicas is reached. This delay might not meet the business requirements.
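To make the slow convergence concrete, here's a minimal simulation of the formula above. It assumes saturated pods report at most ~100% observed CPU and that the true demand would need 50 replicas at the 80% target (both numbers are illustrative, not from any real cluster):

```python
import math
from fractions import Fraction

TARGET = Fraction(80)   # desired CPU utilization, %
LOAD = Fraction(4000)   # total demand in "CPU %" units: 50 replicas * 80%

replicas, steps = 5, 0
while True:
    # Saturated pods can't report much more than 100% utilization,
    # so the HPA underestimates how far behind it is
    observed = min(Fraction(100), LOAD / replicas)
    desired = math.ceil(replicas * observed / TARGET)
    if desired <= replicas:
        break
    replicas, steps = desired, steps + 1

print(steps, replicas)  # 10 scaling rounds to reach 50 replicas
```

At one scaling round per 5-minute stabilization window, that's roughly 50 minutes before the deployment catches up with the peak.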

In summary, autoscaling requires time to adapt to the load. While it functions effectively in scenarios where the load changes gradually, it's less suitable for handling scheduled, heavy loads. In such cases, it's advisable to explore alternative options and proactively prepare the environment for the anticipated load.

#architecture #performance #scalability
Death By Meeting

I finally read "Death by Meeting: A Leadership Fable...About Solving the Most Painful Problem in Business" by P. Lencioni after it had been on my reading list for quite some time. And let me tell you my impressions.

Meetings – the bane of many of our professional lives. Who enjoys them, really? I certainly don't. So, naturally, I expected to find some insights on how to have fewer meetings. Surprisingly, the book advocates the opposite approach: meetings are mandatory!
But not the soul-crushing, unproductive gatherings we've all come to dread. No, the book suggests a complete rebuild of our approach to meetings, transforming them into dynamic, engaging, and yes, even enjoyable experiences.

Think of it this way: movies captivate us for hours on end, holding our attention without fail. You know what makes movies so exciting? Conflict! So why not add some of that spice into our meetings? Inject some healthy debates, encourage interaction, and yeah, even a bit of arguing. Let's make decisions right then and there, turning our meetings into the place where things actually get done.

The examples and stories in the book are built around general business management, but some ideas can be adapted to the context of IT.

#booknook #softskills #management #meetings
11
Reliable Stateful Systems at Netflix

There's quite an interesting long-read article explaining how Netflix ensures the reliability of its stateful services.

To understand what reliability actually means, the author proposes answering three fundamental questions:
- How often does the system fail?
- When it fails, how large is the blast radius?
- How long does it take to recover from an outage?
Ideally, systems don’t fail, have minimal impact when they do, and recover very quickly😀. That is the simple part.

So how is that achieved? That is actually the hard part:
📍Single tenancy. There are no multi-tenant data stores; this minimizes the blast radius.
📍Capacity management. Netflix builds dedicated workload capacity models that generate cluster specifications according to the system requirements and SLOs.
📍Data replication. Data is replicated in 12 availability zones across 4 regions.
📍Overprovisioning. If one region degrades, its traffic is spread across the other regions, so each region must keep an extra 33% capacity reserved for failover.
📍Snapshot restoration. To replace one instance with another, load a snapshot from S3 and then apply the delta.
📍Performance monitoring. Continuous monitoring is essential to detect failures quickly, remediate them, and recover.
📍Cache in front of services. The idea is to cache the results of complex business logic rather than the underlying data. Services with business logic are quite expensive to operate, so this approach shifts load to the cache.
📍Reliable clients. A fairly complex approach to informing clients which timeouts and level of service they can rely on. Better to read the original.
📍Load balancing. Netflix uses an improved choice-of-2 algorithm with weighted requests that takes availability zones, replicas and health state into account.
📍Stateful APIs. All requests are idempotent with built-in work pagination. This approach requires additional idempotency tokens or global transaction IDs.
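The 33% figure in the overprovisioning point is simple arithmetic: if one of N regions fails, the remaining N-1 regions must absorb its share of the traffic. A quick sketch:

```python
def failover_headroom(regions: int) -> float:
    """Extra capacity each region must reserve so that the survivors
    can absorb a single failed region's traffic."""
    return regions / (regions - 1) - 1

# With 4 regions, each must keep ~33% spare capacity
print(round(failover_headroom(4) * 100))  # 33
```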

#architecture #reliability #usecase
2
Load Balancing

📍Round Robin. Simple, stable, well-suited for identical server replicas.
📍Weighted Round Robin. Adds a `weight` for each replica; the weight usually correlates with the server's capacity. Improves resource utilization in heterogeneous infrastructure.
📍Least Connections/Load. Directs traffic to the server with the fewest active connections or the lowest load. It can be really effective with long-lived sessions or tasks.
📍Weighted Least Connections. The previous one enriched with capacity weights.
📍Hash Ring. Each host is mapped onto a circle using its hashed address, and each request is routed by hashing some property of the request: the balancer picks the nearest host clockwise to match a request with a server. It works well if there is a good request attribute to hash.
📍Random Subsetting. Each client randomly shuffles the list of hosts and fills its subset by selecting available backends from the list. In practice the load may be distributed unevenly.
📍Deterministic Subsetting. Google's improvement on Random Subsetting: it adds client assignment rounds and server-side random shuffles, and allows servers to adjust weights back to clients. It gives greater stability and more even distribution.
📍Random Choice of 2. The algorithm picks 2 servers at random and selects the one with the least load. Simple and quite effective in many cases.
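As an illustration, Random Choice of 2 fits in a few lines. This is only a sketch: the server names and the connection-count map are made up, and a real balancer would track load concurrently:

```python
import random

def choice_of_two(active_connections: dict[str, int]) -> str:
    """Pick two servers at random, return the one with fewer active connections."""
    a, b = random.sample(list(active_connections), 2)
    return a if active_connections[a] <= active_connections[b] else b

# With only two servers, the sample always compares both,
# so the lighter one wins deterministically:
print(choice_of_two({"server-a": 12, "server-b": 3}))  # server-b
```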

Let’s go back to the last post and figure out why Netflix wasn't satisfied and had to cook up another load balancing algorithm.
The main issue is that stateful services are asymmetric (Postgres leader and followers, ZooKeeper leader and followers, Cassandra topology with quorums, etc.), so it matters which host you connect to. Plus, the traffic between availability zones was substantial, leading to additional latency.
So what was actually done (using Cassandra client-side balancing as an example):
- Random Choice of 2 is extended to Random Choice of 8
- Node weights are added, based on rack topology, replica info and health state
- Selected nodes are sorted according to their weights
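Those three steps can be sketched roughly like this (the node names and score values are hypothetical; Netflix's real weights combine rack topology, replica placement and health state):

```python
import random

def pick_nodes(nodes: list[str], weight, n: int = 8) -> list[str]:
    """Randomly sample up to n candidate nodes, then order them by weight
    (highest first) to get a connection preference list."""
    candidates = random.sample(nodes, min(n, len(nodes)))
    return sorted(candidates, key=weight, reverse=True)

weights = {"node-1": 0.9, "node-2": 0.2, "node-3": 0.7}  # hypothetical scores
print(pick_nodes(list(weights), weights.get))  # ['node-1', 'node-3', 'node-2']
```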

The improvement reduces latency by up to 40% for Netflix's scenarios, which I believe is a significant achievement. Unfortunately, I couldn't find the original video detailing the implementation, but there's a presentation with measurements available.

#architecture #reliability #network
👍3👀3
The Cafe on the Edge of the World

I stumbled upon 'The Cafe on the Edge of the World: A Story About the Meaning of Life' by John Strelecky while browsing the marketplace for more books to read. Although it wasn't originally on my reading list, I decided to give it a try after it was suggested along with other books I ordered.

The main character finds himself lost on his journey and stumbles upon a café in the middle of nowhere with an unusual name. What's even more intriguing is that the menu includes thought-provoking questions:
- Why are you here? (sometimes it’s “Why am I here?”)
- Do you fear death?
- Are you fulfilled?
Throughout the book, he spends time discussing these questions with the café's inhabitants, uncovering new insights about his own life.
While you shouldn't expect revolutionary ideas or simple answers, the book serves as a great opportunity to pause and reflect on our own lives: what we're doing, why we're doing it, what we truly want to do, and what makes us happy.

It's also a time for reflection for me. Consequently, I've enrolled in a sketchnoting course😲. So, hopefully, in a little while, my summaries will be illustrated😀.

#booknook #productivity #offtop
👏21
The Price of the Decision

Let’s talk about the financial implications of our technical choices. Understanding this aspect is key: the less we spend on our solution, the more funds we retain for the business.

Consider a simple example to illustrate the significant impact of a single decision:

Imagine we're integrating with a Data Lake through Kafka, leveraging Debezium to stream data from the database. Our project is relatively small, with approximately 100 tables to stream. Opting for AWS to meet our infrastructure requirements, we'll use the MSK (managed Kafka) service and its well-defined pricing model in our calculations. We'll also rely on AWS recommendations for broker sizes.

Alright, we'll intentionally skip all other requirements and consider two diametrically opposed implementation options:
1️⃣Stream data from all tables to a single topic
2️⃣Stream each table to a separate topic

Option 1
Let’s take the standard replication factor of 3 and 10 partitions for the topic.
1 env cost:
- 1 topic * 10 partitions * 3 replication factor = 30 partitions
- Take the smallest possible configuration: 3 kafka.t3.small brokers
Price: $4.1 per month
Nice!
Usually we have more than 1 environment (dev, QA, SIT, UAT and other environment types), let’s say 10 envs in total.
10 env cost:
- 10 topics * 10 partitions * 3 replication factor = 300 partitions
- Take the smallest possible configuration: 3 kafka.t3.small brokers
Price: $4.1 per month

Option 2
Let’s take the standard replication factor of 3 and 5 partitions per topic, as we have less data per topic.
1 env cost:
- 100 topics * 5 partitions * 3 replication factor = 1500 partitions
- Take the smallest possible configuration: 6 kafka.t3.small brokers
Price: $8.21 per month
10 env cost:
- 1000 topics * 5 partitions * 3 replication factor = 15 000 partitions
- Take the smallest possible configuration: 6 kafka.m5.4xlarge brokers
Price: $1209.6 per month

$4.1 vs $1209.6
And this price doesn't even account for any load; it's just for having all those partitions.
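The partition arithmetic behind these numbers is trivial to script, which makes it easy to compare options before committing to one:

```python
def total_partitions(topics: int, partitions_per_topic: int,
                     replication_factor: int = 3, envs: int = 1) -> int:
    """Total partition replicas the clusters must host across all environments."""
    return topics * partitions_per_topic * replication_factor * envs

print(total_partitions(1, 10))            # Option 1, 1 env:   30
print(total_partitions(1, 10, envs=10))   # Option 1, 10 envs: 300
print(total_partitions(100, 5))           # Option 2, 1 env:   1500
print(total_partitions(100, 5, envs=10))  # Option 2, 10 envs: 15000
```

Swap in your own topic counts and environments to see how quickly partition counts (and broker bills) grow.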

As tech leads, it's our responsibility to make cost-effective decisions.

#architecture #tradeoffs
1