Code Reviews Like Humans
There's already a ton of stuff out there about code reviews, covering what they're for and how to make them better. But I want to recommend one more: Better Code Reviews FTW!
One of the points that I really like here is that a code review is basically just giving feedback. And we're all pretty good at giving feedback in other areas. So why not apply that same constructive approach to code reviews? Just imagine a comment like `Fix typo in successful`. It can be read as `Hey, you made a mistake, but I still think you're smart!` or as `You made a stupid mistake, dumbass`. The impression is quite different, right? It's all about being supportive and positive with each other.
Oh, and there's a cool article that dives into the same topic if you're interested: https://mtlynch.io/human-code-reviews-1/ Check it out!
#engineering #codereview
YouTube
Better Code Reviews FTW! - Tess Ferrandez-Norlander - NDC London 2024
Scale it automagically!
There's considerable interest in autoscaling these days, largely driven by the cost of resource usage. Let's look at how it behaves under dynamic load inside Kubernetes. There's a naive belief that the HPA will deliver the same performance as pre-configured replicas when demand arises.
Consider a scenario where we set the CPU utilization target at 80%. During peak events such as Black Friday or intense invoicing periods, service demand surges, driving CPU utilization up to 100%.
The HPA calculates the desired number of replicas using the formula:
`desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]`
For instance, if we have 5 replicas and CPU utilization grows to 100% against the desired 80%, the calculation gives ceil(5 * 100/80) = 7 replicas as the target, so the deployment scales up to 7 replicas. However, this scaling involves a delay: the HPA controller waits for all pods to become ready, which can be time-consuming depending on the size of the service. Additionally, there's a stabilization window with a default of 5 minutes.
After the initial scaling, it's highly likely that the adjustment won't be sufficient (let’s say you need around 50 replicas to handle the load). Therefore, subsequent iterations of scaling will occur at intervals of at least 5 minutes until the desired 80% utilization is achieved or until the maximum allowed number of replicas is reached. This delay might not meet the business requirements.
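To get a feel for the timing, here's a back-of-the-envelope simulation of the formula above. It's only a sketch with simplifying assumptions (CPU utilization is capped at 100%, and each scaling round costs a flat 5 minutes for pod readiness plus stabilization), not real controller behavior:
```python
import math

TARGET = 0.80        # HPA target CPU utilization
LOAD = 50 * TARGET   # total demand in "replica-CPUs": enough to need ~50 replicas
DELAY_MIN = 5        # assumed minutes per scaling round (readiness + stabilization)

replicas, minutes = 5, 0
while True:
    # While the deployment is saturated, observed utilization stays pinned at 100%,
    # so each round can grow the replica count by at most 1/TARGET = 1.25x.
    metric = min(1.0, LOAD / replicas)
    desired = math.ceil(replicas * metric / TARGET)   # the HPA formula
    if desired <= replicas:
        break
    replicas, minutes = desired, minutes + DELAY_MIN
    print(f"after {minutes:2d} min -> {replicas} replicas")

# 5 -> 7 -> 9 -> 12 -> 15 -> 19 -> 24 -> 30 -> 38 -> 48 -> 50: roughly 50 minutes
# to absorb a spike that pre-provisioned capacity would have handled instantly.
```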
In summary, autoscaling requires time to adapt to the load. While it functions effectively in scenarios where the load changes gradually, it's less suitable for handling scheduled, heavy loads. In such cases, it's advisable to explore alternative options and proactively prepare the environment for the anticipated load.
#architecture #performance #scalability
Kubernetes
Horizontal Pod Autoscaling
In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Death By Meeting
I finally read "Death by Meeting: A Leadership Fable...About Solving the Most Painful Problem in Business" by P. Lencioni after it had been on my reading list for quite some time. And let me tell you my impressions.
Meetings – the bane of many of our professional lives. Who enjoys them, really? I certainly don't. So, naturally, I expect to find some insights on how to get fewer meetings. Surprisingly, the book advocates for the opposite approach - meetings are mandatory!
But not the soul-crushing, unproductive gatherings we've all come to dread. No, the book suggests a complete rebuild of our approach to meetings, transforming them into dynamic, engaging, and yes, even enjoyable experiences.
Think of it this way: movies captivate us for hours on end, holding our attention without fail. You know what makes movies so exciting? Conflict! So why not add some of that spice into our meetings? Inject some healthy debates, encourage interaction, and yeah, even a bit of arguing. Let's make decisions right then and there, turning our meetings into the place where things actually get done.
The examples and stories in the book are built around general business management, but some ideas can be adapted to the context of IT.
#booknook #softskills #management #meetings
I finally read "Death by Meeting: A Leadership Fable...About Solving the Most Painful Problem in Business" by P. Lencioni after it had been on my reading list for quite some time. And let me tell you my impressions.
Meetings – the bane of many of our professional lives. Who enjoys them, really? I certainly don't. So, naturally, I expect to find some insights on how to get fewer meetings. Surprisingly, the book advocates for the opposite approach - meetings are mandatory!
But not the soul-crushing, unproductive gatherings we've all come to dread. No, the book suggests a complete rebuild of our approach to meetings, transforming them into dynamic, engaging, and yes, even enjoyable experiences.
Think of it this way: movies captivate us for hours on end, holding our attention without fail. You know what makes movies so exciting? Conflict! So why not add some of that spice into our meetings? Inject some healthy debates, encourage interaction, and yeah, even a bit of arguing. Let's make decisions right then and there, turning our meetings into the place where things actually get done.
The examples and stories in the book are built around general business management, but some ideas can be adapted to the context of IT.
#booknook #softskills #management #meetings
❤1✍1
Reliable Stateful Systems at Netflix
There is quite an interesting long-read explaining how Netflix ensures the reliability of its stateful services.
To understand what reliability actually means, the author proposes answering three fundamental questions:
- How often does the system fail?
- When it fails, how large is the blast radius?
- How long does it take to recover from an outage?
Ideally, systems don’t fail, have minimal impact on failure and recover very quickly😀. That is the simple part.
So how is that achieved? That is actually the hard part:
📍Single tenancy. There are no multi-tenant data stores; this minimizes the blast radius.
📍Capacity management. Netflix builds workload-specific capacity models that generate cluster specifications according to the system requirements and SLOs.
📍Data replication. Data is replicated in 12 availability zones across 4 regions.
📍Overprovisioning. If one region degrades, its traffic is spread across the other regions, so each region must keep an extra 33% capacity reserved for failover.
📍Snapshot restoration. To replace one instance with another, load a snapshot from S3 and then apply the delta.
📍Performance monitoring. Continuous monitoring is essential to detect failures quickly, remediate them and recover.
📍Cache in front of services. The idea is to use caching for complex business logic rather than for the underlying data. A service with business logic is quite expensive to operate, so this approach shifts load to the cache.
📍Reliable clients. A fairly involved approach to informing clients about the timeouts and level of service they can rely on. Better to read the original.
📍Load Balancing. Netflix uses an improved choice-of-2 algorithm with weighted requests. It takes availability zones, replicas, and health state into account.
📍Stateful APIs. All requests are idempotent, with built-in work pagination. This approach requires additional idempotency tokens or global transaction IDs (see the sketch below).
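Since the last point is easy to under-appreciate, here's a minimal sketch of idempotent request handling with a client-supplied token. The names and the in-memory store are hypothetical, just to illustrate the idea:
```python
import uuid

class IdempotentService:
    """Toy service: a retried request with the same token is applied only once."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}   # token -> recorded outcome
        self.balance = 0

    def deposit(self, token: str, amount: int) -> str:
        if token in self._results:           # retry: replay the recorded outcome
            return self._results[token]
        self.balance += amount               # first delivery: apply the change
        result = f"ok, balance={self.balance}"
        self._results[token] = result
        return result

svc = IdempotentService()
token = str(uuid.uuid4())                    # the client generates the token once
print(svc.deposit(token, 100))               # ok, balance=100
print(svc.deposit(token, 100))               # retried: still balance=100
```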
#architecture #reliability #usecase
InfoQ
How Netflix Ensures Highly-Reliable Online Stateful Systems
Building reliable stateful services at scale isn’t a matter of building reliability into the servers, the clients, or the APIs in isolation. By combining smart and meaningful choices for each of these three components, we can build massively scalable, SLO…
Load Balancing
📍Round Robin. Simple, stable, well-suited for servers with identical replicas.
📍Weighted Round Robin. Adds a `weight` for each replica; the weight usually correlates with the server capacity. Improves resource utilization in heterogeneous infrastructure.
📍Least Connections/Load. Directs network traffic to the server with the fewest active connections (or the lowest load). It can be really effective with long-lived sessions or tasks.
📍Weighted Least Connection. It’s the previous one enriched with capacity weights.
📍Hash Ring. Each host is mapped into a circle using its hashed address, each request is routed to a host by hashing some property of the request. So the balancer finds the nearest host clockwise to match requests with the server. It can work well if there is a good request attribute to hash.
📍Random Subsetting. Each client randomly shuffles the list of hosts and fills its subset by selecting available backends from the list. In real cases the load may be distributed unevenly.
📍Deterministic Subsetting. Google's improvement on Random Subsetting: it adds client assignment rounds and server-side random shuffles, and lets servers adjust weights back to clients. It gives greater stability and a more even distribution.
📍Random Choice of 2. The algorithm picks 2 servers at random and selects the one with the least load. Simple and quite effective in many cases (see the sketch below).
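To see why that last one works so well, here's a tiny sketch comparing uniform random routing with choice-of-2 (in-memory counters stand in for real load metrics):
```python
import random

def simulate(pick, n_servers=10, n_requests=100_000, seed=42):
    rnd = random.Random(seed)
    loads = [0] * n_servers
    for _ in range(n_requests):
        loads[pick(rnd, loads)] += 1
    return max(loads) - min(loads)   # gap between the busiest and idlest server

def pure_random(rnd, loads):
    return rnd.randrange(len(loads))

def two_choices(rnd, loads):
    a, b = rnd.sample(range(len(loads)), 2)   # probe 2 random servers...
    return a if loads[a] <= loads[b] else b   # ...route to the less loaded one

print("pure random imbalance:", simulate(pure_random))   # typically a few hundred
print("choice-of-2 imbalance:", simulate(two_choices))   # typically ~1
```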
Let’s go back to that last post and figure out why Netflix wasn't satisfied and had to cook up another load balancing algorithm.
The main issue is that stateful services are asymmetric (Postgres leader and followers, ZooKeeper leader and followers, Cassandra topology with quorums, etc.), and it does matter which host you connect to. Plus, the traffic between availability zones was substantial, leading to additional latency.
So here's what was actually done (using Cassandra client-side balancing as the example):
- Random Choice of 2 is extended to Random Choice of 8
- Weight for nodes is added and based on rack topology, replicas info and health state
- Selected nodes are sorted according to the weight
The improvement reduces latency by up to 40% for Netflix scenarios, which I believe is a significant achievement. Unfortunately I couldn't find the original video detailing the implementation, but there's a presentation with measurements available.
#architecture #reliability #network
The Cafe on the Edge of the World
I stumbled upon 'The Cafe on the Edge of the World: A Story About the Meaning of Life' by John Strelecky while browsing the marketplace for more books to read. Although it wasn't originally on my reading list, I decided to give it a try after it was suggested along with other books I ordered.
The main character finds himself lost on his journey and stumbles upon a café in the middle of nowhere with an unusual name. What's even more intriguing is that the menu includes thought-provoking questions:
- Why are you here? (sometimes it’s “Why am I here?”)
- Do you fear death?
- Are you fulfilled?
Throughout the book, he spends time discussing these questions with the café's inhabitants, uncovering new insights about his own life.
While you shouldn't expect revolutionary ideas or simple answers, the book serves as a great opportunity to pause and reflect on our own lives—what we're doing, why we're doing it, what we truly want to do, and what makes us happy.
It's also a time for reflection for me. Consequently, I've enrolled in a sketchnoting course😲. So, hopefully, in a little while, my summaries will be illustrated😀.
#booknook #productivity #offtop
The Price of the Decision
Let's talk about the financial implications of our technical choices. Understanding this aspect is key; the less we spend on our solution, the more funds we retain for the business.
Consider a simple example to illustrate the significant impact of a single decision:
Imagine we're integrating with a data lake through Kafka, using Debezium to stream data from the database. Our project is relatively small, with approximately 100 tables to stream. Opting for AWS for our infrastructure, we'll use the MSK (managed Kafka) service and its well-defined pricing model in our calculations. We'll also rely on the AWS recommendations for broker sizes.
Alright, we'll intentionally skip all other requirements and opt for two diametrically opposed implementation options:
1️⃣Stream data from all tables to a single topic
2️⃣Stream each table to a separate topic
Option 1
Let's take the standard replication factor of 3 and 10 partitions for the topic.
1 env cost:
- 1 topic * 10 partitions * 3 replication factor = 30 partitions
- Take the smallest possible configuration: 3 kafka.t3.small brokers
Price: $4.1 per month
Nice!
Usually we have more than 1 environment (dev, QA, SIT, UAT and other environment types), let’s say 10 envs in total.
10 env cost:
- 10 topics * 10 partitions * 3 replication factor = 300 partitions
- Take the smallest possible configuration: 3 kafka.t3.small brokers
Price: $4.1 per month
Option 2
Let's take the standard replication factor of 3 and 5 partitions per topic, as we have less data per topic.
1 env cost:
- 100 topics * 5 partitions * 3 replication factor = 1500 partitions
- Take the smallest possible configuration: 6 kafka.t3.small brokers
Price: $8.21 per month
10 env cost:
- 1000 topics * 5 partitions * 3 replication factor = 15 000 partitions
- Take the smallest possible configuration: 6 kafka.m5.4xlarge brokers
Price: $1209.6 per month
$4.1 vs $1209.6
And this price doesn't even account for any load; it's just for having all those partitions.
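If you want to rerun the arithmetic with your own numbers, it fits in a few lines. The per-broker partition limit is deliberately an input — take it from the AWS MSK sizing guidance for your instance type:
```python
import math

def total_partitions(topics: int, per_topic: int, replication: int = 3) -> int:
    """Partition replicas the cluster must host."""
    return topics * per_topic * replication

def brokers_needed(partitions: int, per_broker_limit: int, minimum: int = 3) -> int:
    # per_broker_limit comes from the AWS MSK guidance for your broker type
    return max(minimum, math.ceil(partitions / per_broker_limit))

print(total_partitions(10, 10))     # Option 1, 10 envs: 300 partition replicas
print(total_partitions(1000, 5))    # Option 2, 10 envs: 15000 partition replicas
```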
As tech leads, it's our responsibility to make cost-effective decisions.
#architecture #tradeoffs
Queues for Kafka
`Kafka Queues: Now and in the Future` breaks down how consumer groups work today and what updates are coming in future Kafka versions. The future part is actually the most interesting one.
So, Kafka is exceptionally effective for streaming large volumes of data through topics and partitions. Consumer groups manage this data by dividing the workload based on the number of partitions. While this method offers high performance, it also presents notable limitations:
- Consumers are exclusively assigned specific partitions
- Parallel data processing is restricted by the number of partitions in a topic
- There's no option for individual acknowledgment or message redelivery
- Messages cannot be marked as rejected
- Message processing times can vary unevenly
So, KIP-932 has been introduced to enhance Kafka's queue capabilities. The proposal suggests a 'share group' with a `share-partition` abstraction.
Key changes:
- A share-partition can be assigned to any number of consumers, removing the scaling limitation by the number of partitions.
- Each message has a state: available for processing, acquired, acknowledged, or rejected.
- Consumers receive available messages from share-partitions, transitioning the message state to acquired.
- Messages not acknowledged within a specific timeframe are returned to the available state.
- If the number of delivery attempts exceeds a defined threshold, a message is marked as rejected, halting further delivery attempts.
- Consumed messages are not guaranteed to be ordered (toy simulation below).
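The client API isn't released yet, so here's only a toy simulation of that message lifecycle — every name below is made up, none of it is real Kafka API:
```python
from enum import Enum

class State(Enum):
    AVAILABLE = "available"
    ACQUIRED = "acquired"
    ACKNOWLEDGED = "acknowledged"
    REJECTED = "rejected"

MAX_DELIVERIES = 3   # hypothetical delivery-attempt threshold

class Message:
    def __init__(self, value: str):
        self.value, self.state, self.deliveries = value, State.AVAILABLE, 0

def acquire(msg: Message) -> None:
    """A consumer fetches an available message; its state moves to acquired."""
    assert msg.state is State.AVAILABLE
    msg.state, msg.deliveries = State.ACQUIRED, msg.deliveries + 1

def timeout(msg: Message) -> None:
    """No ack within the acquisition window: redeliver or give up."""
    if msg.deliveries >= MAX_DELIVERIES:
        msg.state = State.REJECTED    # stop further delivery attempts
    else:
        msg.state = State.AVAILABLE   # back to the pool for another consumer

msg = Message("invoice-42")
for _ in range(MAX_DELIVERIES):       # a consumer that keeps failing to ack
    acquire(msg)
    timeout(msg)
print(msg.state)                      # State.REJECTED
```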
Currently, the KIP is in the Accepted state and is planned for release in Kafka 4.0.
#news #architecture #kafka
SoftwareMill
Kafka queues: now and in the future | SoftwareMill
Find out how some of the queueing features can be implemented in Kafka today using the KMQ pattern.
Leaders Eat Last
`Leaders Eat Last` by S. Sinek is highly recommended by various management sources, so let's see what it has to offer.
In my experience as a leader, I've often relied on gut feeling for certain practices. For instance, I've long believed in the importance of offline communication—the handshakes, conversations over coffee (or something stronger), and discussions in informal settings like smoking rooms. These interactions enhance collaboration in ways that virtual meetings, even with cameras on, cannot replicate.
How does this relate to the book? It confirms my observations.
Key points:
📌Safety: One of the fundamental human needs is the need for safety. When we feel secure within a team or company, we can work effectively towards achieving its goals. Otherwise, resources are diverted towards personal protection. The author refers to this concept as "The Circle of Safety."
📌Endorphins, dopamine, serotonin, and oxytocin: These chemicals influence our behavior today just as they did thousands of years ago. Understanding how they work enables us to encourage the desired behaviors.
📌Leadership: Every team, whether formal or informal, has a leader. This is inherent in our nature and evolutionary processes. While self-organizing teams are often discussed, there is inevitably a leader within each.
📌Abstraction: We tend to react more strongly to a specific story about an individual death or disaster than to a news headline about hundreds of people dying. This is due to abstraction, which can greatly skew our perception of the world. This section of the book presents shocking psychological research that even scientifically explains one of the most horrific events in modern history—the Holocaust.
📌Based on the information above, the author outlines the following lessons for leaders:
✔️ A good corporate culture can cultivate good leaders, while bad leaders can produce a bad culture
✔️ The values of a leader strongly influence the culture of the team
✔️ Integrity is essential for building trust
✔️ Personal relationships matter
✔️ Lead people, not numbers
✔️ Being a leader is not a reward; it's a responsibility
The book presents many more interesting examples and psychological research on the principles of human behavior. At times it delves into philosophical reflections on the state of modern leadership and its future. Even so, I definitely recommend the book.
#booknook #management #leadership
Multi-Arch Images
Public cloud providers actively promote migrating to ARM instances, offering potential cost savings of up to ~20%. For instance, AWS prices its m7g ARM instances at $8.617 per hour, compared to $10.644 per hour for equivalent AMD instances. While ARM performance may vary depending on the application's load profile, many benchmarks show quite positive outcomes:
- ARM vs Intel on Amazon’s cloud
- Faster and Cheaper: ARM Graviton vs Intel and AMD x86 AWS EC2
- Performance Analysis for Arm vs x86 CPUs
As pragmatic engineers who love cost-effective solutions, we see the value in exploring instances with different architectures to optimize both cost and performance.
The concept sounds promising, but what technical preparations are needed for that?
Let's delve into some theory.
By default, container images are built for the same architecture as the local CPU. As a result, running x86-built containers on ARM nodes, and vice versa, causes errors. This is where multi-arch images come into play.
A multi-arch image looks like a single image with a single tag, but it is actually a collection of images built for multiple architectures. How does this work? Each Docker image is described by a manifest—a JSON file containing all the information about the image: references to each of its layers with their sizes, the hash of the image, its total size, and the platform it's intended for.
For multi-arch images, the manifest contains a list of manifests, letting the container runtime select the appropriate image based on the underlying platform architecture.
Presently, two primary manifest standards exist: the Docker manifest version 2 and the OCI Image Index Specification.
Personally, I prefer the OCI standard as it offers a more generic approach to working with multi-arch images.
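To make the mechanics concrete, here's a sketch of the selection step a runtime performs against an OCI image index. The structure follows the OCI spec, but the digests are fake and the helper is mine:
```python
# A trimmed-down OCI image index: one entry per architecture.
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaa...", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbb...", "platform": {"architecture": "arm64", "os": "linux"}},
    ],
}

def select_manifest(index: dict, arch: str, os: str = "linux") -> str:
    """Pick the manifest digest matching the node's platform."""
    for m in index["manifests"]:
        p = m["platform"]
        if p["architecture"] == arch and p["os"] == os:
            return m["digest"]
    raise LookupError(f"no image for {os}/{arch} in the index")

print(select_manifest(index, "arm64"))   # sha256:bbb... -> pulled on ARM nodes
```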
Many articles explain how to create these images, so I won't cover that here. Just remember that producing multi-arch images requires changes to the build process and a bit more storage in the registry.
In conclusion, using multi-arch images can lower operational costs in public clouds, but it requires making extra changes to integrate them into CI/CD processes. Do you need that in your project? It depends 😉
#engineering
Caching CheatSheet
In the past weeks, there have been two big articles discussing caching strategies:
- Mastering Caching in Distributed Applications
- 9 Caching Strategies for System Design Interviews
So it's a good time to review the patterns we use.
Let’s start with the definition of what caching is. I really like the following one:
"Caching is the action of storing data in a temporary medium where the data is either cheaper, faster, or more optimal to retrieve rather than retrieving it from its original storage."
Caches can be local or distributed.
5 implementation strategies:
✔️Cache-Aside (Lazy-Loading): The app controls reading and writing to the cache. If data's in the cache, it's used; if not, it's fetched from storage and added to the cache (see the sketch after this list).
✔️Write-Through. Both the cache and the database get updated together within the same transaction when you write data. Reading happens only from the cache.
✔️Write-Around. Data is written directly to storage. Usually, it's combined with Cache-Aside for reading.
✔️Write-behind (write-back). Data is first written to the cache and then sent to the datastore asynchronously. The cache product usually handles syncing with the datastore.
✔️Read-Through. The cache is the main place to get data from. If it's not there, it's fetched from storage. Unlike Cache-Aside, the cache product, not the app, decides when and how to fetch data.
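Since Cache-Aside is the one you're most likely to hand-roll yourself, here's a minimal sketch of it combined with TTL-based expiry. The dict and the loader function are stand-ins for a real cache and datastore:
```python
import time

cache: dict[str, tuple[float, str]] = {}   # key -> (expires_at, value)
TTL_SECONDS = 60

def load_from_db(user_id: str) -> str:
    return f"user-record-{user_id}"        # stand-in for the real datastore query

def get_user(user_id: str) -> str:
    entry = cache.get(user_id)
    if entry and entry[0] > time.time():   # hit and still fresh: use the cache
        return entry[1]
    value = load_from_db(user_id)          # miss: the app fetches from storage...
    cache[user_id] = (time.time() + TTL_SECONDS, value)   # ...and fills the cache
    return value

print(get_user("42"))   # miss -> loads from the "db" and caches
print(get_user("42"))   # hit  -> served from the cache until the TTL expires
```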
Cache invalidation can be time-based or event-based.
Cache-eviction strategies:
- Least Recently Used (LRU)
- First In First Out (FIFO)
- Least Frequently Used (LFU)
- Time To Live (TTL)
- Random Replacement
Promising paths for further development in that field:
- AI and Machine Learning-Driven Caching. It can enhance caching mechanisms by predicting data usage patterns and preemptively caching data based on the needs.
- In-Memory Data Grids. They not only cache data but also provide a range of data processing capabilities, real-time analytics, and decision-making directly within the cache layer.
#architecture #patterns
Medium
Mastering Caching in Distributed Applications
If I had a dollar for every time that I came across a bug with an implementation of caching in a software system… I would probably have…
What We Can Learn from UK E-Gate Failure
At the beginning of May, the UK E-Gate system experienced a 4-hour outage. This system handles automatic passport control, with over 270 automated gates spread across 15 airports and rail ports in the UK. These gates use biometric data and advanced facial recognition technology to allow entry into the country. The failure caused major airports to cease operations, leading to delays for passengers attempting to cross the UK border.
The outage was not the result of a cyber attack but rather a network issue. The Home Office was conducting a software update that exceeded the data limits outlined in the contract, prompting the network provider to shut down the service. This network was also used for a connection between the gates and the database used for passport verifications.
Dave Farley addressed potential architectural flaws in the E-Gate system in a recent video and offered recommendations for designing distributed systems with failure in mind.
So how could the E-Gate architecture be improved to prevent such failures?
- Add redundancy (reserved network channel)
- Cache copies of the data
- Limit the ways things can fail (e.g., use a local network instead)
If something can go wrong, it will go wrong. And we as engineers must be prepared for potential failures, designing systems that can handle problems instead of trying to make them perfect.
Here are some general recommendations to consider when building distributed systems:
✔️Assume things can go wrong
✔️ Limit the blast radius of failure
✔️ Use Chaos Engineering techniques
Just a few days ago, on May 30th, the E-Gate system encountered another failure🤦‍♂️, resulting in disruptions at railway stations and leaving passengers queuing for approximately 4 hours. It appears that the fundamental issues with the system remain unresolved.
#architecture #reliability #resilience #usecase
YouTube
Software Engineering F&*K Up Behind The Passport E-gate Failure
The UK passport e-gate failure was a big news story at the beginning of May, a significant software failure that caused disruption and significant delays in travellers entering the UK.
How Instagram Scaled to 14 Million Users
I enjoy studying various real-life scenarios where technologies and patterns are applied to solve practical problems. These examples often inspire me with new ideas for my day-to-day work.
So today we'll review a really nice video that offers valuable insights into Instagram's growth journey and the techniques they used to serve 14 million users.
Key principles of Instagram architecture:
✔️ Keep things simple
✔️ Don’t reinvent the wheel
✔️ Go with already proven and solid technologies
Instagram initially used several Django instances, a PostgreSQL database on EC2 nodes, and Nginx load balancers. When the database grew too large, they split the data into multiple shards.
One of the toughest challenges was generating IDs and determining the correct shard to handle each ID. The Instagram team decided to use the Snowflake ID approach with some modifications (see the sketch below):
- 41 bits for unix time in ms
- 13 bits for shard id
- 10 bits for the auto-increment sequence
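Here's a sketch of how such an ID is composed with plain bit shifts. The epoch and the in-process sequence are simplified stand-ins — Instagram derived the shard from the user ID and kept per-shard sequences in Postgres:
```python
import time

EPOCH_MS = 1_293_840_000_000   # Jan 1, 2011 UTC - an illustrative custom epoch
SHARD_BITS, SEQ_BITS = 13, 10
_seq = 0

def next_id(user_id: int) -> int:
    """Compose a 64-bit ID: 41 bits of time, 13 bits of shard, 10 bits of sequence."""
    global _seq
    _seq = (_seq + 1) % (1 << SEQ_BITS)            # 10-bit rolling counter
    millis = int(time.time() * 1000) - EPOCH_MS    # 41 bits of milliseconds
    shard = user_id % (1 << SHARD_BITS)            # 13 bits: logical shard
    return (millis << (SHARD_BITS + SEQ_BITS)) | (shard << SEQ_BITS) | _seq

new_id = next_id(user_id=31341)
print(new_id, "-> shard", (new_id >> SEQ_BITS) & ((1 << SHARD_BITS) - 1))
```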
Other interesting aspects of Instagram's architecture include:
- Their notification system (e.g., for likes, comments, new posts) is powered by a Gearman job server with around 200 workers to handle these tasks.
- Files are stored on S3 servers located worldwide to keep data close to users
- The technology stack also includes Apache Solr, Memcached, Pingdom, PagerDuty, and Sentry.
To me, the system seems well thought-out and carefully built, focusing on simplicity and scalability. Keeping it simple is a great architectural principle—it makes maintenance easier and the system more resilient.
#architecture #scaling #usecase
YouTube
How Instagram Scaled to 14 Million Users With Only 3 Engineers
In this video, we will explore how Instagram managed to scale so well with only 3 engineers in their early days.
Why Don’t People Do What You Expect
As leaders, at any level, we delegate tasks to our subordinates and track their execution. There's an expectation that everything will be done exactly as envisioned. Surprisingly, this isn't always the case. In fact, some team members consistently fail to complete assigned tasks. This can be frustrating, increasing tension between team members and their leader. In the worst case, it may result in a team member being released from the project or even fired.
But what causes these issues? And is there a way to address them?
Let's consider one of the fundamental assumptions of management: that people always act in the best possible manner (excluding deliberate sabotage, of course). This "best possible manner" depends on numerous factors, ranging from the team atmosphere to personal health and family relationships.
So we must try to understand the reasons behind the unexpected behavior. Generally, these reasons can be grouped into four categories:
1️⃣Don’t understand. The task or its purpose isn't clear, leading to a lack of motivation to do it.
2️⃣Not able to do. There's a lack of skills or competencies necessary to perform the task.
3️⃣Cannot do it. There's a lack of resources, both physical and mental, to carry out the task. Personal issues may decrease the ability to focus on work tasks effectively.
4️⃣Don't want. The task may not align with the individual's competencies or goals, so there's no motivation to do it.
As leaders, our first step is to identify the correct reason. Often, a one-on-one conversation with appropriate questions can shed light on the situation. In most cases, we have the power to address the problem, help our team members and make the team stronger.
#management #leadership
Coupling or Uncoupling?
We all start our careers knowing that our code should be loosely coupled and have high cohesion. But can we ever achieve fully uncoupled code in a system? Not really. Coupling determines how freely we can make changes within a system, and different types of coupling may produce totally different effects.
Michael Nygard dives deep into this topic in his Uncoupling talk, where he explains what coupling is, how to analyze it, and what we as engineers can do to make systems more resilient to change.
So, let's break it down. The author talks about a few types of coupling:
📍Operational. A consumer cannot run without its provider. For example, a service might fail if it can't access the database.
📍Development. Changes in the producer and consumer need to be synchronized and delivered together.
📍Semantic. Changed together because of shared concepts. If a concept changes due to new business requirements, those changes need to be reflected in all downstream systems.
📍Functional. Changed together because of shared responsibility.
📍Incidental. Changed together for no good reason. The E-Gate system failure serves as a prime example of incidental coupling, where an update in one system triggered a failure in another due to a shared network.
By strategically adjusting the levels of different types of coupling, we can effectively manage the impact of our changes.
#architecture #systemdesign
YouTube
Uncoupling • Michael Nygard • GOTO 2018
This presentation was recorded at GOTO Amsterdam 2018. #gotocon #gotoams
The Motive
`Please, don’t be a leader, unless you’re doing it for the right reason, and you probably aren’t!` It may sound provocative, but Patrick Lencioni starts his book 'The Motive: Why So Many Leaders Abdicate Their Most Important Responsibilities' with this very idea. Released in 2020, this is one of his latest works where he challenges traditional views on leadership, exploring why individuals should choose to take on leadership roles.
The author defines two primary motives that drive leaders:
1️⃣ Reward: Leaders driven by rewards primarily seek personal gratification and enjoyment in their position. They avoid mundane tasks, focusing solely on activities that interest them. They actually dislike managerial tasks and would likely find greater satisfaction and effectiveness in other areas.
2️⃣ Responsibility: Leaders with a sense of responsibility prioritize the well-being and growth of their team. They take unpleasant tasks, recognizing their duty to their employees and organization.
While most individuals possess elements of both motives, Lencioni argues that one typically dominates, significantly influencing leadership effectiveness. According to him, responsibility-oriented leaders tend to achieve greater success.
Lencioni identifies five crucial activities that reward-oriented leaders tend to avoid:
1️⃣ Developing the leadership team: Taking personal accountability for the growth and development of team members.
2️⃣ Managing subordinates (and making them manage theirs): Providing guidance, sharing knowledge, and offering mentorship.
3️⃣ Having difficult and uncomfortable conversations: Addressing complex and uncomfortable issues promptly and effectively.
4️⃣ Running great team meetings: Facilitating productive discussions and decision-making.
5️⃣ Communicating constantly and repetitively to employees: Keeping the team informed about organizational vision, goals, decisions, and challenges.
For me, it was a valuable review that brought particular aspects into focus. After looking closely at how I lead, I realized that handling tough conversations (point 3) isn't my strongest skill. So, I marked it as an area to improve upon. But it's nice to know that this challenge isn't unique to me, as the book mentions 😉.
So, revisit those unpleasant tasks, clarify your motives, and be a good leader.
#management #leadership #booknook
Dual-Write Problem
In the distributed world, it's often the case that two external systems need to be synchronized and updated simultaneously to maintain consistency. It’s called the dual-write problem. A classic example of this is when data needs to be stored both in a database and in Kafka.
So what's the best solution for this problem? What are the common pitfalls and how can we avoid them? These questions are addressed in the article 'Solving the Dual-Write Problem' by W. Waldron.
Identified antipatterns:
📍Relying on operation order
📍Wrapping dual writes into a database transaction
📍Retrying failed operations
Possible solutions:
📍Transactional outbox pattern. In this approach, an outbox table is set up in the database. Changes are made to the target tables and the outbox table within the same transaction. A separate process then reads the outbox rows and sends them to Kafka, retrying on failure (see the sketch after this list).
📍Change Data Capture (CDC), if supported by the database. A variation of p.1: instead of polling the outbox table, changes are picked up from the database's change log and streamed to Kafka.
📍Event sourcing. Every change is recorded in the database as an event. Since each event is written to a single row in a single table, transactions are unnecessary. A separate process can then read these events and send them to Kafka.
📍The listen-to-yourself pattern. Any change is sent directly to Kafka. A separate process listens to these events and uses them to update the database, retrying as needed. The database becomes eventually consistent.
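Here is a minimal sketch of the transactional outbox pattern, using SQLite from the Python standard library; send_to_kafka() is a hypothetical stub standing in for a real producer, and the table layout is illustrative.
```python
import json
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, "
             "topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)")

def create_order(order: dict) -> None:
    # The business write and the outbox write commit in the SAME transaction:
    # either both rows exist or neither does -- no dual write.
    with conn:  # sqlite3 connection as context manager = atomic commit/rollback
        conn.execute("INSERT INTO orders (payload) VALUES (?)", (json.dumps(order),))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("orders", json.dumps(order)))

def send_to_kafka(topic: str, payload: str) -> None:
    # Hypothetical stub -- swap in a real producer (e.g. confluent-kafka) here.
    print(f"-> {topic}: {payload}")

def relay_outbox() -> None:
    # Separate relay process: publish unsent rows, mark them only after success.
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, topic, payload in rows:
        send_to_kafka(topic, payload)  # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))

create_order({"order_id": 42, "amount": 99.90})
relay_outbox()
```
Note that the relay gives at-least-once delivery: if marking a row as sent fails after a successful publish, the event goes out again on the next run, so consumers should be idempotent.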
The core concept behind these solutions is to divide writes into two separate processes and establish a dependency between them. While this isn't an exhaustive list of all potential solutions, it provides a solid set of practices to begin with.
#architecture #systemdesign #patterns
Confluent
Understanding the Dual-Write Problem and Its Solutions
The dual-write problem can arise in any distributed system. Fortunately, it has solutions in event sourcing & the transactional outbox & listen-to-yourself patterns.
Flaky Tests Overhaul
If you've got a substantial set of tests, chances are very high that you've encountered situations where some test outcomes fluctuate between runs, even when there are no changes in the code. This inconsistency is called "flakiness".
Teams may find themselves repeatedly retrying pipelines that contain flaky tests (timeouts are also considered a form of flakiness), trying to make the pipeline green. However, this process often wastes engineering hours and CI resources. As the code base and the number of teams grow, managing flaky tests becomes increasingly challenging, leading to more potential issues and accumulating technical debt.
The Uber engineering team recently published an article detailing their approach to improve CI stability and address the issue of test flakiness.
Key points:
📍A separate service, Testopedia, is introduced to visualize test execution history and test performance characteristics.
📍Testopedia is language- and repo-agnostic. It operates on the notion of a ‘test entity’. Each test entity is uniquely identified by a “fully qualified name” (FQN) that usually includes the test’s full address in the repo.
📍Tests can be grouped into realms, each realm owned by a responsible team.
📍Testopedia analyzes test execution stats (including flakiness, reliability, staleness, and execution time), groups problem tests, and triggers a JIRA ticket with a deadline to fix.
📍GenAI integration is a future step to auto-generate fixes for flaky tests; it’s currently under research.
As a result, the authors noted that implementing the Testopedia approach significantly improved the reliability of CI and reduced the number of retries. If this tool were available as an open-source project, I would certainly give it a try, but unfortunately, it's not.
However, in the absence of such a tool, what steps can we take on our own to address this issue? Here are some suggestions:
📍Visualize pipeline health by implementing simple monitoring of CI statistics, including the number of retries, execution time, and other relevant metrics. To improve something, you first have to measure it (a minimal sketch follows this list).
📍Treat problematic tests as work items with clear deadlines for resolution.
📍Prioritize CI issues, recognizing them as critical technical debt that will require attention anyway.
📍Implement measures to make retries more difficult or even impossible (validations, webhooks, etc.).
📍Clearly define roles and responsibilities for maintaining CI stability; otherwise there is a risk of collective irresponsibility.
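As a starting point for the first suggestion, here is a minimal sketch of spotting flakiness in CI history: a test that both passed and failed on the same commit gets flagged. The record shape is hypothetical; adapt it to whatever your CI system's API actually returns.
```python
from collections import defaultdict

# Hypothetical CI records: (test_name, commit_sha, passed)
runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome -> flaky
    ("test_login",    "abc123", True),
    ("test_login",    "def456", True),
]

# Collect the set of distinct outcomes each test produced per commit.
outcomes = defaultdict(set)
for test, sha, passed in runs:
    outcomes[(test, sha)].add(passed)

# A test is flagged flaky if it both passed and failed on the same commit.
flaky = sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})
print(flaky)  # ['test_checkout']
```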
#engineering #ci
Architectural Principles
Whether you set them or not, you and your team already use architectural principles. These principles may not be obvious, but they influence technical decisions, from writing any piece of code to preparing complex designs. The challenge is that each developer may have their own set of principles based on their previous experiences, best practices, or books they have read. This can lead to inconsistent system behavior across different modules, debates during code reviews, and the need for additional approvals to make decisions.
Let’s define what an architectural principle is. Eoin Woods, author of Continuous Architecture in Practice, offers a definition I like: an architectural principle is
a declarative statement made with the intention of guiding architectural design decisions to achieve one or more qualities of a system.
In essence, architectural principles are guidelines that establish a common framework for the team, enabling them to make informed decisions independently of any central authority. These principles not only enhance team responsibility for their actions but also streamline the achievement of business and technical goals. For architectural principles to be effective, it is crucial that the team participates in their development and accepts them.
Relationship Between Goals, Principles, and Decisions:
Goal -> Requirements -> Principles -> Decision
Recommendations for defining Architectural Principles:
📍Simplicity: Ensure that principles are straightforward and do not require additional context for understanding.
📍SMART Criteria: Define principles that are Specific, Measurable, Achievable, Realistic, and Testable.
📍Practical Guidance: Principles should be practical and usable as implementation guidance.
📍Relevance: Focus on the most significant principles that, if not properly defined, could lead to poor decisions.
📍Conciseness: Keep the list short (5-7 items) to reduce cognitive load.
Some examples of good principles:
✔️Degrade Gracefully: Design the system to continue functioning even when some components fail, providing a lower-quality service instead of a total failure.
✔️Self-Healing: Ensure that failed actions are continuously retried until they succeed (see the sketch after this list).
✔️Design to Be Monitored: Build systems that are self-diagnosing and can be easily monitored.
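As a small illustration of the Self-Healing example, here is a minimal retry loop with capped exponential backoff; the parameters and the catch-all except are illustrative, not a production recipe.
```python
import time

def retry_until_success(action, base_delay=1.0, max_delay=60.0):
    # Keep retrying the action until it succeeds, doubling the delay each
    # attempt and capping it at max_delay.
    attempt = 0
    while True:
        try:
            return action()  # success ends the loop
        except Exception as exc:
            delay = min(base_delay * 2 ** attempt, max_delay)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            attempt += 1
```
In practice you would add jitter and route permanently failing actions to a dead-letter queue, so that 'retry forever' cannot silently starve the rest of the system.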
Architectural principles set boundaries and guidelines for teams, allowing the freedom to make independent decisions. This helps maintain consistency, improve decision-making, boost autonomy, and achieve technical and business goals.
#architecture #documentation
