GreenOps: Renewable Energy Trend
Recent news reports that Amazon has fully switched to renewable energy:
All of the electricity consumed by Amazon’s operations, including its data centers, was matched with 100% renewable energy in 2023.
Amazon has invested billions of dollars in more than 500 solar and wind projects globally, which together are capable of generating enough energy to power the equivalent of 7.6 million U.S. homes.
Any data center infrastructure comes with an environmental cost; the IT sector alone is responsible for 1.4% of carbon emissions worldwide. A carbon-aware approach has become a trend not only for Amazon but for all major cloud providers, such as Google and Microsoft. Of course, this shift is largely driven by government regulations, social responsibility, and company reputation.
A new concept, GreenOps, has been introduced to make applications greener. It's mostly an evolution of FinOps, but with a focus on environmentally friendly optimizations: using less energy, being aware of carbon emissions, and utilizing more efficient hardware.
References:
- Using GreenOps to Improve Your Operational Efficiency and Save the Planet
- Amazon Renewable Energy Goal
- What Is GreenOps? Putting a Sustainable Focus on FinOps
#news #engineering
Make Architecture Reliable
Reliability is the top-priority feature of modern systems. Your customers always expect service reliability, even if they don't realize it. Nobody is interested in a super cool feature that cannot be used (because of service unavailability or poor performance, for example).
Reliability can be defined as the ability of a system to carry out its intended function without interruption. A good definition, but not really actionable. I prefer Google's definition:
Your service is reliable when your customers are happy.
The logic is simple: if a system is not reliable, users will not use it. If users don't use it, it's worth nothing. So reliability matters.
Let's check what reliable architecture usually includes:
📍Measurable Reliability Targets: SLOs, SLIs, and an error budget (a small calculation sketch follows this list)
📍High-Availability:
- Redundancy: multiple replicas for the same service
- Self-Healing: the ability to remediate issues without manual interventions
- Graceful Degradation: degrade service levels gracefully when overloaded
- Fail Safe: be ready for unexpected failure, no data or system corruption
- Retriable APIs: make your operations idempotent, allow retries
- Critical Dependencies Minimization: the reliability level of a service is defined by the reliability of its least reliable component or dependency.
- Multiple Availability Zones (AZ): spread instances across multiple AZs to survive an AZ outage
📍Disaster Recovery:
- Multiple Regions: spread instances across multiple regions (each region has multiple AZs) to survive a region failure
- Data Replication Across Regions
📍Scalability: ability to scale for increased workload
📍Observability: code instrumentation, tools for data collection and analysis, fast failure detection
📍Recovery Procedures: rollback strategies, recovery from outages
📍Chaos Engineering: practices to test failures internally
📍Operational Excellence: a fully automated operational experience with minimal manual steps and low cognitive complexity
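To make the error-budget idea concrete, here is a minimal sketch assuming a request-based SLI measured against an availability SLO; the SLO target and request counts are illustrative, not recommendations.

```python
# Minimal error-budget sketch: a request-based SLI measured against an SLO.
# The SLO target and request counts below are illustrative assumptions.

def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare measured reliability (SLI) against the SLO and the error budget."""
    sli = 1 - failed_requests / total_requests            # measured success ratio
    allowed_failures = (1 - slo_target) * total_requests  # failures the SLO allows in the window
    budget_remaining = allowed_failures - failed_requests # failures still allowed
    return {
        "sli": round(sli, 5),
        "slo": slo_target,
        "allowed_failures": int(allowed_failures),
        "budget_remaining": int(budget_remaining),
        "budget_consumed_pct": round(100 * failed_requests / allowed_failures, 1),
    }

# Example: 99.9% availability SLO over 10M requests with 4,200 failures -> 42% of budget burned.
print(error_budget_report(slo_target=0.999, total_requests=10_000_000, failed_requests=4_200))
```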
References:
- Google Cloud Architecture Framework: Reliability
- AWS Well-Architected Framework
- AWS Well-Architected Framework: Reliability Pillar
- Azure Well-Architected Framework: Reliability
#architecture #systemdesign #reliability
S3 for Kafka Storage
In version 3.6.0, Kafka introduced early access to the Tiered Storage feature (KIP-405), which significantly improves the operational experience and decreases cluster costs.
Existing problems with scalability and efficiency that this change is intended to solve:
📍Huge disk capacity required to keep data for a long period of time (retention policies of days, weeks, or even months)
📍Processing speed may be impacted by the large amount of data kept in the cluster
📍Home-grown implementations to copy old data to external storages like HDFS
📍Expensive scaling approach. Kafka is scaled by adding new brokers, which also require RAM and CPU; it is not possible to scale disks only
📍Copying a lot of data in case of node failure, as a new node must copy all the data that was on the failed broker from other replicas
📍High recovery time. The time for recovery and rebalancing is proportional to the amount of data stored locally on a Kafka broker.
Suggested solution:
✏️ Use the Tiered Storage pattern. Split data management into separate tiers based on performance and access requirements and cost considerations. The most commonly used tiers:
- "Hot": local storage that keeps the most critical and frequently accessed data
- "Warm": remote lower-cost storage that keeps less critical or infrequently accessed data
- "Cold": low-cost storage that keeps periodic backup data
✏️ Kafka storage is split into local and remote storage. Local storage works the same way it does in Kafka today. Remote storage is pluggable and can be HDFS, S3, Azure Blob, etc.
✏️ Inactive segments are copied to the remote storage according to the configured retention policy
✏️ Remote and local storage have their own retention policies, so the local retention can be very short, e.g., a few hours.
✏️ Any data that exceeds the local retention threshold will not be removed until successfully uploaded to the remote storage
✏️ Clients can still read older data; it will be served from the remote storage
✏️ The feature is enabled via remote.log.storage.system.enable on the cluster and remote.storage.enable on the topic (a minimal configuration sketch follows this list)
✏️ New metrics are introduced to monitor integration performance with the remote storage
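As a rough illustration, here is a sketch of creating a topic with remote storage enabled, using the confluent-kafka Python client. It assumes brokers running Kafka 3.6+ already started with remote.log.storage.system.enable=true and a remote storage plugin configured; the topic name and retention values are illustrative.

```python
# Sketch: create a topic with tiered storage enabled (confluent-kafka AdminClient).
# Assumes the cluster already has remote.log.storage.system.enable=true and a
# RemoteStorageManager plugin (e.g., for S3) configured on the brokers.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "events-tiered",                    # illustrative topic name
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                  # enable tiered storage for this topic
        "local.retention.ms": str(6 * 60 * 60 * 1000),    # keep only ~6h on broker disks
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),    # total retention ~30 days (remote)
    },
)

# create_topics() is asynchronous and returns {topic_name: future}
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"Topic {name} created with tiered storage enabled")
    except Exception as exc:
        print(f"Failed to create {name}: {exc}")
```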
Current limitations:
* Compacted topics are not supported
* To disable remote storage, the topic must be recreated
To sum up, Tiered Storage allows storage to scale independently from cluster size, which reduces overall usage costs. It is still in an early-access state (as of version 3.8.0) and is not recommended for production use. However, there is significant interest in it: AWS has announced S3 support in their MSK service, and Uber has reported successfully running the feature in production.
#news #architecture #technologies #kafka
Architecture Decision Records
One of the most popular tools for documenting architectural decisions is an Architectural Decision Record (ADR). An ADR is a document that describes a choice made by the team regarding a significant aspect of the software architecture they’re planning to build. "Significant" means that the decision has a measurable impact on the architecture and quality of a software or hardware system.
The collection of ADRs created and maintained for a project is referred to as the project decision log.
A basic ADR typically includes the following parts:
- Title: The name of the change.
- Status: The current status from the ADR lifecycle, such as draft, accepted, rejected, deprecated, etc.
- Context: The purpose of the ADR, the issue it aims to solve, business priorities, team skills, and limitations.
- Decision: A description of the proposed change.
- Consequences: The effects of the change, including what becomes easier or more difficult, as well as any outputs and after-review actions.
It is a good practice to review an ADR after implementation to compare the documented information with what was actually implemented.
Company-specific ADR process examples:
- Github ADR
- AWS ADR Process
- Google Cloud ADR Recommendations
An ADR is a good tool for documenting architectural decisions, improving internal communication, and facilitating knowledge sharing within a team or across the organization.
#architecture #documentation
Almost nobody likes to write docs. But AI can significantly simplify that experience.
I experimented with ChatGPT to generate an ADR according to the template (link) and decision text like "We decided to use PostgreSQL as a main database storage as it has high expertise in our company, team to support it, it's easy and has drivers for mostly all languages. Other options that we verify were Oracle, MySQL".
The result is a well-structured document. Of course, it needs to be carefully reviewed, as AI tends to make up some details 😃.
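If you want to script this, here is a rough sketch using the OpenAI Python client. The template text, model name, and prompt wording are my assumptions, not a fixed recipe, and the output still needs a careful human review.

```python
# Rough sketch: generate an ADR draft from a short decision description.
# The template, model name, and prompt below are illustrative assumptions.
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

ADR_TEMPLATE = """# Title
## Status
## Context
## Decision
## Consequences
"""

decision_text = (
    "We decided to use PostgreSQL as the main database storage: we have strong "
    "expertise in it, a team to support it, and drivers for almost all languages. "
    "Other options we evaluated were Oracle and MySQL."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model works; this one is just an example
    messages=[
        {"role": "system", "content": "You write concise Architecture Decision Records."},
        {"role": "user", "content": f"Fill in this ADR template:\n{ADR_TEMPLATE}\n"
                                    f"Based on this decision:\n{decision_text}"},
    ],
)
print(response.choices[0].message.content)
```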
#architecture #documentation
Distributed Leadership
Distributed Leadership is a video from Phil Haack about practices for effective work with distributed teams. The author worked at GitHub for a long time, and as you may know, GitHub has been fully distributed from the very beginning.
Let's check what practices are recommended:
✏️ Provide context in your messages: never send a message like 'Hey, are you around?'. It does not explain why you are reaching out to that person. Make the message clear so the person can respond without additional clarification.
✏️ Write things down: document decisions, meeting notes, guidelines, etc. Make this information discoverable for everyone (e.g., as Markdown docs in Git)
✏️ Use video calls: they help you see each other as people, not just resources
✏️ Use chats, small talk, emojis, GIFs, jokes: create an environment of camaraderie. The author refers to this as a "distributed water cooler," replicating the experience of casual conversations near the water cooler in a physical office.
✏️ Organize in-person meetings periodically: spend time physically together. GitHub organizes a full-company summit for all employees once a year; team summits can be more frequent.
✏️ Avoid drive-by comments: don't comment if there is nothing valuable to add
✏️ ChatOps: automate as much as you can
✏️ Avoid synchronization points: avoid meetings, prioritize tasks to unblock other teams
✏️ Use decision-making frameworks: RACI (Responsible, Accountable, Consulted, Informed) or DACI (Driver, Approver, Contributor, Informed), clearly assign responsibilities, and align priorities inside the team
✏️ Support work-life balance
The video doesn't provide any revolutionary ideas on managing distributed teams, but it does provide a good summary of working practices. I use most of these practices in my daily work and can confirm that they are really helpful, especially the communication part.
#leadership #management
Video: Distributed Leadership - Phil Haack - NDC Sydney 2024
Addressing Cascading Failures
A cascading failure is a failure that grows over time: failure of one or a few parts of a system triggers a domino effect, leading to the progressive failure of other parts. Cascading failures are one of the biggest challenges to the reliability of distributed systems.
The Google SRE Book has a separate chapter on potential root causes of this type of failure, design recommendations, and immediate steps to fix it.
Let's start with typical causes of the problem:
- Server Overload: there are more requests than the server can handle
- Resource Exhaustion: running out of CPU, RAM, threads, file descriptors
- Service Unavailability: container crash, failed readiness probes, errors
- Slow startup and cold caching
Common triggers of cascading failures:
- Maintenance procedures: updates, new rollouts, planned changes in infrastructure
- Organic growth of the load
- System usage changes: more users, increased non-typical usage scenarios
- Resource limits: clusters are usually overprovisioned, so some heavy operation can occupy resources and impact other services
Design recommendations that can help to avoid cascading failures:
✏️ In case of overload, perform load shedding and graceful degradation: reject requests that cannot be served, serve degraded results (less data, data only from cache, etc.)
✏️ Instrument higher-level systems to reject requests, rather than overload servers
✏️ Perform accurate capacity planning. Capacity planning reduces the probability of triggering a cascading failure, but it is not sufficient for protection
✏️ Smart retries implementation (a sketch follows this list):
- Use randomized exponential backoff when scheduling retries
- Limit retries per request. Don’t retry a given request indefinitely.
- Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request.
- Use clear response codes and consider how different failure modes should be handled. Don’t retry permanent errors or malformed requests in a client, because neither will ever succeed.
✏️ Set request timeouts (the book calls them deadlines). Implement deadline propagation, where each service in the call chain checks whether the deadline has already been exceeded. Based on this, the service can decide whether to proceed or terminate the request.
Immediate steps to fix cascading failures in a problem environment:
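Here is a minimal retry sketch following those recommendations; the attempt limits, the per-minute budget, and the RetryableError/PermanentError placeholders are illustrative assumptions.

```python
# Sketch: retries with randomized exponential backoff, a per-request attempt cap,
# and a server-wide retry budget. The error classes and `call` are placeholders.
import random
import threading
import time

class RetryableError(Exception): ...   # e.g., timeouts, 503s
class PermanentError(Exception): ...   # e.g., malformed requests, 400s

class RetryBudget:
    """Server-wide budget: allow at most `limit_per_minute` retries across all requests."""
    def __init__(self, limit_per_minute: int = 60):
        self.limit = limit_per_minute
        self.used = 0
        self.window_start = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            if now - self.window_start >= 60:          # reset the 1-minute window
                self.window_start, self.used = now, 0
            if self.used >= self.limit:
                return False
            self.used += 1
            return True

BUDGET = RetryBudget(limit_per_minute=60)

def call_with_retries(call, max_attempts: int = 3, base_delay: float = 0.1):
    for attempt in range(max_attempts):
        try:
            return call()
        except PermanentError:
            raise                                       # never retry permanent errors
        except RetryableError:
            if attempt == max_attempts - 1 or not BUDGET.allow():
                raise                                   # out of attempts or budget: fail fast
            # Randomized exponential backoff: jitter avoids synchronized retry storms.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```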
- Increase resources
- Restart servers
- Limit or drop traffic
- Enter degraded mode (should be supported on service level)
- Decrease batch load: Some services have load that is important, but not critical. Consider turning off those sources of load.
When systems are overloaded, something has to give to remedy the situation. If a service reaches its limit, it's better to allow some errors or lower-quality results than to try to serve all requests. Understanding system limits and how the system behaves under load is critical for implementing protection against cascading failures.
#architecture #systemdesign #reliability
Netflix Priority-Based Load Shedding
In the previous blog post we discussed cascading failures and ways to address them. Among the approaches were load shedding and graceful degradation. Today we'll explore how these techniques are used in practice to improve user experience at Netflix.
In November 2020, Netflix introduced the concept of prioritized load shedding at the API gateway level:
✏️ Classify incoming traffic:
- NON_CRITICAL: This traffic does not affect playback or user experience (e.g., logs and background requests)
- DEGRADED_EXPERIENCE: This traffic affects user experience, but not the ability to play videos (e.g., stop and pause markers, language selection in the player, viewing history)
- CRITICAL: This traffic affects the ability to play.
✏️ Categorize the requests into priority buckets at the API gateway level (Zuul)
✏️ Operate as usual under normal conditions
✏️ Drop lower-priority requests if the system is overloaded; higher-priority requests still get served
✏️ Drop traffic progressively, starting with the lowest priority
✏️ Send a signal to clients to indicate how many retries they can perform and what kind of time window they can perform them in. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
That approach helps to shed enough requests to stabilize services without members noticing the degradation, improving overall user experience.
In June 2024, Netflix published an enhancement of their previous prioritized load-shedding approach:
✏️ Add request prioritization logic at the service layer in addition to the logic at the API gateway
✏️ Classify incoming traffic at the service layer into the following buckets:
- CRITICAL: Affect core functionality. These will never be shed until full service failure
- DEGRADED: Affect user experience. These will be progressively shed as the load increases
- BEST_EFFORT: Do not affect the user. These will be responded to in a best effort fashion and may be shed progressively
- BULK: Background work, can be shed
✏️ Categorize the requests based on the upstream client’s priority or other request attributes
✏️ Operate as usual under normal conditions
✏️ Drop lower-priority requests if the system is overloaded; higher-priority requests still get served
✏️ Implement additional logic to support correct autoscaling triggers. Example: shed requests only after hitting the target CPU utilization if the autoscaler is based on a CPU metric (a simplified sketch follows this list)
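A simplified sketch of the idea: the priority buckets come from the article, but the CPU thresholds, the linear shedding curve, and the example values are my assumptions.

```python
# Sketch: progressive, priority-based load shedding triggered by CPU utilization.
# Buckets follow the article; thresholds and the shedding curve are placeholders.
PRIORITIES = ["CRITICAL", "DEGRADED", "BEST_EFFORT", "BULK"]  # highest -> lowest

TARGET_CPU = 0.60   # autoscaling target: below this, never shed
MAX_CPU = 0.90      # above this, everything except CRITICAL is shed

def should_shed(priority: str, cpu_utilization: float) -> bool:
    """Shed progressively: the lower the priority, the earlier it is dropped."""
    if priority == "CRITICAL" or cpu_utilization <= TARGET_CPU:
        return False
    # Overload level in [0, 1] once CPU exceeds the autoscaling target.
    overload = min(1.0, (cpu_utilization - TARGET_CPU) / (MAX_CPU - TARGET_CPU))
    # BULK (lowest priority) is shed first; DEGRADED only near full overload.
    rank = PRIORITIES.index(priority)
    shed_threshold = 1.0 - rank / (len(PRIORITIES) - 1)
    return overload >= shed_threshold

# Example: at 75% CPU, BULK and BEST_EFFORT are shed, DEGRADED and CRITICAL are served.
for p in PRIORITIES:
    print(p, "shed" if should_shed(p, cpu_utilization=0.75) else "served")
```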
According to the article, priority-based load shedding helped keep critical user features highly available during multiple infrastructure outages: they were throttling more than 50% of all requests, but the availability of user-initiated requests remained above 99.4%.
#systemdesign #reliability #usecase
Platform Engineering
I spent several years working in a platform team, and I was really surprised by the hype around the field over the past year. So I decided to understand why platform engineering has become so popular and what is really meant by it nowadays.
According to Wikipedia:
Platform engineering is a software engineering discipline that focuses on building toolchains and self-service workflows for the use of developers. Platform engineering is about creating a shared platform for software engineers using computer code.
Sounds simple. But what problem does it solve?
Over the last 2-3 decades, the complexity of software development has increased significantly. Developers must understand CI/CD pipelines, know how to work with Kubernetes and its components, integrate with public cloud services, and incorporate scaling strategies and observability tools. This complexity increases cognitive load and slows down the delivery of business features. Introducing dedicated platform teams should help shift the focus of product and delivery teams back to implementing business features.
So what do platform teams actually do? (Of course, nobody can do everything, and specialization is required.)
✏️ Help developers to be self-sufficient: prepare starter kits, IDE plugins, "golden path" templates and docs, self-service APIs
✏️ Encapsulate common patterns and practices into reusable building blocks: identity and secret management, messaging, data services (including databases, caches and object storages), observability tools, dashboards and code instrumentation approach
✏️ Automate build and test processes for products and services
✏️ Automate delivery and security verification processes for products and services
✏️ Accumulate expertise about underlying tools and services, optimize their usage
✏️ Provide early advice and feedback on problems or security risks
Platform engineering principles:
✏️ Adopt a product mindset: take ownership of the platform, make it attractive for developers to use
✏️ Focus on user experience
✏️ Make platform services optional and composable: allow product teams to use only the parts of the platform, or replace them with their own solutions when necessary.
✏️ Provide self-service experience with guardrails: empower development teams to make their own decisions within a set of well-defined parameters
✏️ Improve discovery of available tools, patterns and templates
✏️ Enforce automation and an everything-as-code approach
Since the publication of the CNCF Platform White Paper in 2023, the popularity of platform engineering has kept growing. In 2024, there was even a dedicated conference, Platform Conf '24, highlighting the huge interest in and importance of the discipline.
Summing up, platform engineering is a powerful pattern for reducing the cognitive complexity of application development, speeding up the delivery of business features, and providing more reliable and scalable infrastructure.
References:
- CNCF Platforms White Paper
- Google Cloud: How to Become a Platform Engineer
- Microsoft: What is Platform Engineering
#engineering
Shopify’s Modular Monolith
In the age of microservices, exploring real-world examples of alternative architectures is really interesting. Today, we'll check Shopify's architecture through an interview with one of its principal engineers.
Why is it interesting? Shopify employs a modular monolith architecture, and their system seamlessly managed peaks of 60 million requests per minute on Black Friday.
Key points from the interview:
- Shopify's architecture is based on Ruby on Rails, MySQL, Kafka, Elasticsearch, and Memcached/Redis
- Some parts of the system migrated to Rust for better performance
- Certain applications migrated to Vitess for better horizontal data sharding
- Applications are operated within Kubernetes on the Google Cloud platform
- The company actively contributes to open source projects used internally to improve their performance and scaling capabilities
- All shops on the platform are grouped within dedicated sets of database servers to minimize blast radius (the same pattern we saw in Netflix StatefulSet reliability approach)
- The majority of user-facing functionality is served by Shopify Core, a monolith divided into multiple modules focused on different business domains
- Shopify Core can be scaled horizontally, so there are no plans to split it into separate services
- New features are rolled out to production using a canary approach
#architecture #scalability #usecase
Saga Design Pattern
Distributed systems are complex, and handling transactions in distributed systems is even more complex. One way to address the issue is the Saga pattern. A common use case for it is managing a single transaction across multiple services, each with its own database.
Implementation logic (a minimal orchestration sketch follows this list):
- Define a local transaction as the atomic work performed by a service
- Organize local transactions into a sequence, the saga
- After local transaction completion, publish a message or event to trigger the next local transaction
- In case of failure, execute a series of compensating transactions that undo the changes that were made by all previously executed local transactions
- Compensations must be idempotent because they might be called more than once within multiple retry attempts
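As promised above, a minimal orchestration sketch: the step functions and their compensations are placeholders, and a real implementation would also persist saga state and publish messages or events between steps.

```python
# Minimal saga orchestration sketch: run local transactions in order and, on failure,
# run the compensations of the already-completed steps in reverse order.
# The step/compensation functions are placeholders; compensations must be idempotent.
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (local_transaction, compensation)

def run_saga(steps: List[Step]) -> bool:
    completed: List[Callable[[], None]] = []
    for transaction, compensation in steps:
        try:
            transaction()                      # local transaction in one service
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):   # compensate in reverse order
                undo()                         # safe to retry: compensations are idempotent
            return False
    return True

# Illustrative order-placement saga
saga = [
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("create shipment"),   lambda: print("cancel shipment")),
]
run_saga(saga)
```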
Saga coordination options:
- Choreography. An event-based approach where each local transaction publishes events that trigger local transactions in other services. Requires a mature event-driven architecture.
- Orchestration. This approach requires a central orchestrator that tells the services which local transactions to execute or roll back.
Benefits:
- It allows implementing non-blocking, long-running transactions
- Local transactions are fully independent
- Enforces separation of concerns, as participants may not know about each other
Drawbacks:
- Eventual data consistency
- Difficult to troubleshoot as the number of participants grows
- Design and implementation are complex and expensive (need to implement common logic and compensation logic for all steps in the sequence)
From my perspective, the pattern is too complex, and as we know, complex logic tends to bring complex issues. So if you can avoid distributed transactions, please avoid them.
References:
- Sagas
- Saga Pattern
- Data Consistency in Microservices Architecture
#architecture #systemdesign #patterns
Sometimes pattern names produce visual associations, so check mine 😀
#architecture #systemdesign #patterns
Draw to Win
Did you know that two-thirds of our brain activity is devoted to processing visual information? Most of the information about the world is received through our eyes. Visualization is the most powerful way to communicate. So what? We as leaders can use it to educate and persuade people, to share our vision, and to sell our ideas.
But how to do that in practice? That's what Dan Roam's book Draw to Win: A Crash Course on How to Lead, Sell, and Innovate with Your Visual Mind is about.
Why drawing is important:
- Drawing is the oldest 'technology' in the world
- 90% of all information is visual
- Visualization attracts attention and improves clarity in communications
- Visual information is memorable
Knowing that, we can significantly enhance our presentation skills. No special knowledge is required to start drawing: we can easily draw lines, arrows, shapes, and smiles. That's all you need to start drawing to explain or sell your ideas. The accuracy of the resulting pictures is not important.
The author explains that our brain processes visual information to answer the following questions: who? what? how many? when? where? and why?
Organize your ideas into a visual story that provides those answers, and you will need only six pictures or slides to explain everything. The book even includes an example of how to explain a salary increase to a manager using this technique, but I won't spoil it; better to check out the original for the full story 😉!
Additionally, the book contains practical tips for getting started with drawing and using it to improve creative thinking. One that I really like: if you don't know what to draw, start with a circle, name it, and continue adding circles until your idea takes shape.
To sum up, visualization is a powerful tool that can be used to manage, educate, sell, share, collaborate, and innovate. I really enjoyed the book; it's full of interesting facts and practical advice. There are more books by the same author on this topic, and I'll definitely add them to my reading list!
#booknook #softskills #presentationskills
Book cover and some illustrations from the book that show the importance and simplicity of drawing
#booknook #softskills #presentationskills
In one of the previous blog posts we broke down the Saga pattern, and I recommended avoiding it because of its high complexity. However, it's really interesting to explore successful implementations of the pattern. Let's take a look at how HALO scaled to 11.6 million users using the Saga design pattern.
HALO is a very popular shooter game that was initially introduced in 1999. At that time, the game was based on a single SQL database that stored all the game data. Its growth was explosive, and a single database soon became insufficient.
So they set up a NoSQL database and partitioned it, with data for each player kept in a dedicated database partition. This resolved the scaling limitations but brought new issues:
- Data writes are not atomic anymore
- Partitions may have non-consistent information
This means players could see inconsistent game data, which significantly impacts the game experience.
So the HALO team decided to set up a saga:
✏️ Each partition is changed within a local transaction only
✏️ An orchestrator manages the update across all database partitions
✏️ The state of each local transaction is stored in a durable distributed log, which allows them to:
- Track if a sub-transaction failed
- Find compensating transactions that must be executed
- Track the state of compensating transactions
- Recover from failures
✏️ The log is stored outside the orchestrator, which makes the orchestrator stateless
✏️ The orchestrator interacts with the log to identify which local transactions or compensating actions to execute (a rough sketch of such a log follows)
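A rough sketch of what such a durable saga log might track; the record fields and states are my assumptions based on the description above, not HALO's actual schema, and the in-memory dict stands in for durable, replicated storage.

```python
# Sketch: saga state tracked in a durable log, so a stateless orchestrator can
# resume after a crash. Field names and states are illustrative, not HALO's schema.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class StepState(str, Enum):
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    COMPENSATED = "COMPENSATED"

@dataclass
class SagaLog:
    saga_id: str
    steps: Dict[str, StepState] = field(default_factory=dict)  # e.g., partition -> state

    def record(self, step: str, state: StepState) -> None:
        # In production this append would go to durable, replicated storage.
        self.steps[step] = state

    def pending_compensations(self) -> List[str]:
        """On recovery: completed steps that must be undone because some step failed."""
        if StepState.FAILED not in self.steps.values():
            return []
        return [s for s, st in self.steps.items() if st == StepState.COMPLETED]

# Recovery example: partition-2 failed, so partition-1's write must be compensated.
log = SagaLog("player-42-update")
log.record("partition-1", StepState.COMPLETED)
log.record("partition-2", StepState.FAILED)
print(log.pending_compensations())   # ['partition-1']
```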
The introduced technical solution enabled further growth of HALO, which remains a popular Xbox game series with millions of unique users.
#architecture #systemdesign #usecase