Almost nobody likes to write docs. But AI can significantly simplify that experience.
I experimented with ChatGPT to generate an ADR according to the template (link) and decision text like: "We decided to use PostgreSQL as a main database storage as it has high expertise in our company, team to support it, it's easy and has drivers for mostly all languages. Other options that we verify were Oracle, MySQL"
The result is a well-structured document. Of course, it needs to be carefully reviewed, as AI tends to make up some details 😃.
#architecture #documentation
Distributed Leadership
Distributed Leadership is a video from Phil Haack about practices for effective work with distributed teams. The author worked at GitHub for a long time, and as you may know, GitHub has been fully distributed from the very beginning.
Let's check what practices are recommended:
✏️ Provide context in your messages: never send a message like "Hey, are you around?". It doesn't explain why you are reaching out to that person. Make the message clear so the person can respond without additional clarification.
✏️ Write things down: document decisions, meeting notes, guidelines, etc. Make this information discoverable for everyone (e.g., as Markdown docs in Git)
✏️ Use video calls: they help you see each other as people, not just resources
✏️ Use chats, small talk, emojis, GIFs, jokes: this creates an environment of camaraderie. The author refers to this as a "distributed water cooler," replicating the experience of casual conversations near the water cooler in a physical office.
✏️ Organize in-person meetings periodically: spend time physically together. GitHub holds a full-company summit for all employees once a year; team summits can be more frequent.
✏️ Avoid drive-by comments: don't comment if there is nothing valuable to add
✏️ ChatOps: automate as much as you can
✏️ Avoid synchronization points: avoid meetings, prioritize tasks to unblock other teams
✏️ Use decision-making frameworks: RACI (Responsible, Accountable, Consulted, Informed) or DACI (Driver, Approver, Contributor, Informed); clearly assign responsibilities and align priorities inside the team
✏️ Support work-life balance
The video doesn't provide any revolutionary ideas on managing distributed teams, but it does provide a good summary of working practices. I use most of these practices in my daily work and can confirm that they are really helpful, especially the communication part.
#leadership #management
YouTube: Distributed Leadership - Phil Haack - NDC Sydney 2024 (recorded at NDC Sydney, Australia)
Addressing Cascading Failures
A cascading failure is a failure that grows over time: failure of one or a few parts of a system triggers a domino effect, leading to the progressive failure of other parts. Cascading failures are one of the biggest challenges to the reliability of distributed systems.
The Google SRE Book has a separate chapter covering the potential root causes of this type of failure, design recommendations, and immediate steps to fix it.
Let's start with typical causes of the problem:
- Server Overload: there are more requests than the server can handle
- Resource Exhaustion: running out of CPU, RAM, threads, file descriptors
- Service Unavailability: container crash, failed readiness probes, errors
- Slow startup and cold caching
Common triggers of cascading failures:
- Maintenance procedures: updates, new rollouts, planned changes in infrastructure
- Organic growth of the load
- System usage changes: more users, increased non-typical usage scenarios
- Resource limits: clusters are usually overprovisioned, so a heavy operation can occupy resources and impact other services
Design recommendations that can help to avoid cascading failures:
✏️ In case of overload, perform load shedding and graceful degradation: reject requests that cannot be served, serve degraded results (less data, data only from cache, etc.)
✏️ Instrument higher-level systems to reject requests, rather than overload servers
✏️ Perform accurate capacity planning. Capacity planning reduces the probability of triggering a cascading failure, but it is not sufficient for protection
✏️ Implement retries smartly (see the sketch after this list):
- Use randomized exponential backoff when scheduling retries
- Limit retries per request. Don’t retry a given request indefinitely.
- Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request.
- Use clear response codes and consider how different failure modes should be handled. Don’t retry permanent errors or malformed requests in a client, because neither will ever succeed.
✏️ Set request timeouts (the SRE book calls them deadlines). Implement deadline propagation, where each service in the call chain checks whether the deadline has already been exceeded and decides whether to proceed or terminate the request.
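To make the retry and deadline guidance concrete, here is a minimal Python sketch (not from the SRE book; the error classes, the 60-retries-per-minute budget, and the flaky_request helper are illustrative assumptions). It combines randomized exponential backoff, a per-request attempt cap, a process-wide retry budget, and a deadline check:

```python
import random
import time


class TransientError(Exception):
    """Retryable failure (e.g., timeout, 503)."""


class PermanentError(Exception):
    """Non-retryable failure (e.g., malformed request, 400)."""


# Process-wide retry budget: at most 60 retries per minute across all requests
# (the number mirrors the example above; tune it for your service).
_RETRY_BUDGET_PER_MINUTE = 60
_retry_timestamps = []


def _retry_budget_available():
    """True if the process-wide retry budget is not yet exhausted."""
    now = time.monotonic()
    _retry_timestamps[:] = [t for t in _retry_timestamps if now - t < 60]
    return len(_retry_timestamps) < _RETRY_BUDGET_PER_MINUTE


def call_with_retries(do_request, deadline, max_attempts=3):
    """Call do_request() with randomized exponential backoff.

    deadline is an absolute time.monotonic() value propagated from the caller,
    so every hop in the chain can stop working once the caller has given up.
    """
    for attempt in range(max_attempts):
        if time.monotonic() >= deadline:
            raise TimeoutError("deadline exceeded")
        try:
            return do_request()
        except PermanentError:
            raise  # permanent or malformed requests will never succeed: do not retry
        except TransientError:
            if attempt == max_attempts - 1 or not _retry_budget_available():
                raise  # fail fast instead of amplifying the overload
            _retry_timestamps.append(time.monotonic())
            backoff = min(2 ** attempt, 10) * random.uniform(0.5, 1.5)
            time.sleep(min(backoff, max(0.0, deadline - time.monotonic())))


# Tiny demo: the first attempt fails with a transient error, the second succeeds.
_calls = {"n": 0}

def flaky_request():
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise TransientError("upstream timed out")
    return "ok"

print(call_with_retries(flaky_request, deadline=time.monotonic() + 5.0))
```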
Immediate steps to fix cascading failures in a problem environment:
- Increase resources
- Restart servers
- Limit or drop traffic
- Enter degraded mode (should be supported on service level)
- Decrease batch load: Some services have load that is important, but not critical. Consider turning off those sources of load.
When systems are overloaded, something has to give to remedy the situation. If a service reaches its limit, it's better to allow some errors or lower-quality results than to try to serve all requests. Understanding system limits and how the system behaves under load is critical for implementing protection against cascading failures.
#architecture #systemdesign #reliability
Netflix Priority-Based Load Shedding
In the previous blog post we already discussed cascading failures and ways to address them. One of the approaches was load shedding and graceful degradation. Today we'll explore how these techniques are used in practice to improve user experience at Netflix.
In November 2020, Netflix introduced the concept of prioritized load shedding at the API gateway level:
✏️ Classify incoming traffic:
- NON_CRITICAL: This traffic does not affect playback or user experience (e.g., logs and background requests)
- DEGRADED_EXPERIENCE: This traffic affects user experience, but not the ability to play videos (e.g., stop and pause markers, language selection in the player, viewing history)
- CRITICAL: This traffic affects the ability to play.
✏️ Categorize the requests into priority buckets at the API gateway level (Zuul)
✏️ Operate as usual under normal conditions
✏️ Drop lower-priority requests if the system is overloaded; higher-priority requests still get served
✏️ Drop traffic progressively, starting with the lowest priority
✏️ Send a signal to clients to indicate how many retries they can perform and what kind of time window they can perform them in. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
That approach helps to shed enough requests to stabilize services without members noticing the degradation, improving overall user experience.
In June 2024, Netflix published an enhancement of its previous prioritized load shedding approach:
✏️ Add request prioritization logic at the service layer in addition to the logic at the API gateway
✏️ Classify incoming traffic at the service layer into the following buckets:
- CRITICAL: Affects core functionality. These will never be shed until full service failure
- DEGRADED: Affects user experience. These will be progressively shed as the load increases
- BEST_EFFORT: Does not affect the user. These will be responded to in a best-effort fashion and may be shed progressively
- BULK: Background work; can be shed
✏️ Categorize the requests based on the upstream client’s priority or other request attributes
✏️ Operate as usual under normal conditions
✏️ Drop lower-priority requests if the system is overloaded; higher-priority requests still get served
✏️ Implement additional logic to keep autoscaling triggers working correctly. Example: shed requests only after hitting the target CPU utilization if the autoscaler is based on a CPU metric (see the sketch below)
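For illustration, here is a minimal Python sketch of the idea, not Netflix's actual implementation: the bucket names mirror the article, while the thresholds, the should_shed helper, and the progressive-shedding formula are made-up assumptions.

```python
import random
from enum import IntEnum


class Priority(IntEnum):
    """Request buckets, ordered from most to least important."""
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


def should_shed(priority, cpu_utilization, target_cpu=0.60, max_cpu=0.95):
    """Decide whether to reject a request before doing any real work.

    Below the autoscaling target we never shed, so the autoscaler (not the
    shedder) reacts to normal load growth. Between target and max we shed
    progressively: the lowest-priority buckets are dropped first, and
    CRITICAL is only dropped close to full overload.
    """
    if cpu_utilization <= target_cpu:
        return False
    overload = (cpu_utilization - target_cpu) / (max_cpu - target_cpu)  # 0..1
    # Each bucket starts shedding at a different overload level (illustrative values):
    start = {Priority.BULK: 0.00, Priority.BEST_EFFORT: 0.25,
             Priority.DEGRADED: 0.50, Priority.CRITICAL: 0.95}[priority]
    if overload <= start:
        return False
    # Shed a growing fraction of the bucket as overload rises past its start point.
    shed_fraction = min(1.0, (overload - start) / (1.0 - start))
    return random.random() < shed_fraction


# Example: at 80% CPU with a 60% target, lower-priority buckets are shed first,
# while CRITICAL traffic is untouched.
for p in Priority:
    print(p.name, should_shed(p, cpu_utilization=0.80))
```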
According to the article, priority-based load shedding helped keep critical user features highly available during multiple infrastructure outages: more than 50% of all requests were throttled, but the availability of user-initiated requests remained above 99.4%.
#systemdesign #reliability #usecase
Platform Engineering
I spent several years working in a platform team, and I was really surprised by the hype around the field over the past year. So I decided to understand why platform engineering has become so popular and what is really meant by it nowadays.
According to Wikipedia:
Platform engineering is a software engineering discipline that focuses on building toolchains and self-service workflows for the use of developers. Platform engineering is about creating a shared platform for software engineers using computer code.
Sounds simple. But what problem does it solve?
Over the last 2-3 decades, the complexity of software development has increased significantly. Developers are expected to understand CI/CD pipelines, know how to work with Kubernetes and its components, integrate with public cloud services, and incorporate scaling strategies and observability tools. This complexity increases cognitive load and slows down the delivery of business features. Introducing dedicated platform teams should help shift the focus of product and delivery teams back to implementing business features.
So what do platform teams actually do (of course, no single team can do everything, and specialization is required):
✏️ Help developers to be self-sufficient: prepare starter kits, IDE plugins, "golden path" templates and docs, self-service APIs
✏️ Encapsulate common patterns and practices into reusable building blocks: identity and secret management, messaging, data services (including databases, caches and object storages), observability tools, dashboards and code instrumentation approach
✏️ Automate build and test processes for products and services
✏️ Automate delivery and security verification processes for products and services
✏️ Accumulate expertise about underlying tools and services, optimize their usage
✏️ Provide early advice and feedback on problems or security risks
Platform engineering principles:
✏️ Adopt a product mindset: take ownership of the platform, make it attractive for developers to use
✏️ Focus on user experience
✏️ Make platform services optional and composable: allow product teams to use only the parts of the platform, or replace them with their own solutions when necessary.
✏️ Provide self-service experience with guardrails: empower development teams to make their own decisions within a set of well-defined parameters
✏️ Improve discovery of available tools, patterns and templates
✏️ Enforce automation and an everything-as-code approach
Since the publication of the CNCF Platforms White Paper in 2023, the popularity of platform engineering has kept growing. In 2024, there was even a dedicated conference, Platform Conf '24, highlighting the huge interest in and importance of the discipline.
Summing up, platform engineering is a powerful pattern to reduce the cognitive complexity of application development, speed up the delivery of business features, and provide more reliable and scalable infrastructure.
References:
- CNCF Platforms White Paper
- Google Cloud: How to Become a Platform Engineer
- Microsoft: What is Platform Engineering
#engineering
Shopify’s Modular Monolith
In the age of microservices, exploring real-world examples of alternative architectures is really interesting. Today, we'll check Shopify's architecture through an interview with one of its principal engineers.
Why is it interesting? Shopify employs a modular monolith architecture, and their system seamlessly handled peaks of 60 million requests per minute on Black Friday.
Key points from the interview:
- Shopify's architecture is based on Ruby on Rails, MySQL, Kafka, Elasticsearch, and Memcached/Redis
- Some parts of the system migrated to Rust for better performance
- Certain applications migrated to Vitess for better horizontal data sharding
- Applications are operated within Kubernetes on the Google Cloud platform
- The company actively contributes to open source projects used internally to improve their performance and scaling capabilities
- All shops on the platform are grouped within dedicated sets of database servers to minimize blast radius (the same pattern we saw in Netflix StatefulSet reliability approach)
- The majority of user-facing functionality is served by Shopify Core, a monolith divided into multiple modules focused on different business domains
- Shopify Core can be scaled horizontally, so there are no plans to split it into separate services
- New features are rolled out to production using a canary approach
#architecture #scalability #usecase
Saga Design Pattern
Distributed systems are complex, and handling transactions in distributed systems is even more complex. One way to address the issue is the saga pattern. A common use case is managing a single transaction that spans multiple services, each with its own database.
Implementation logic:
- Define a local transaction as the atomic work performed by a single service
- Organize local transactions into a sequence: the saga
- After a local transaction completes, publish a message or event to trigger the next local transaction
- In case of failure, execute a series of compensating transactions that undo the changes made by all previously executed local transactions
- Compensations must be idempotent because they might be called more than once across multiple retry attempts
Saga coordination options:
- Choreography. An event-based approach where each local transaction publishes events that trigger local transactions in other services. Requires a mature event-driven architecture.
- Orchestration. This approach requires a central orchestrator that tells the services which local transactions to execute or roll back (a minimal orchestration sketch follows this list).
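To make the orchestration option concrete, here is a minimal in-process Python sketch. A real saga calls separate services over messaging or HTTP and persists its state; the step names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    """One local transaction plus the compensation that undoes it."""
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]  # must be idempotent


def run_saga(steps):
    """Execute local transactions in order; on failure, compensate in reverse order."""
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception as error:
            print(f"step '{step.name}' failed: {error}; compensating")
            for done in reversed(completed):
                done.compensation()  # idempotent, safe to call again on retry
            raise


def fail(msg):
    raise RuntimeError(msg)


# Hypothetical order-placement saga: the last step fails, so the first two are undone.
try:
    run_saga([
        SagaStep("reserve-stock", lambda: print("stock reserved"),
                 lambda: print("stock released")),
        SagaStep("charge-payment", lambda: print("payment charged"),
                 lambda: print("payment refunded")),
        SagaStep("create-shipment", lambda: fail("shipping service is down"),
                 lambda: print("shipment cancelled")),
    ])
except RuntimeError:
    pass  # the saga failed as a whole, but no partial changes are left behind
```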
Benefits:
- Allows implementing non-blocking, long-running transactions
- Local transactions are fully independent
- Enforces separation of concerns, as participants don't need to know about each other
Drawbacks:
- Eventual data consistency
- Difficult to troubleshoot as the number of participants grows
- Design and implementation are complex and expensive (you need to implement common logic and compensation logic for all steps in the sequence)
From my perspective, the pattern is too complex, and as we know, complex logic tends to bring complex issues. So if you can avoid distributed transactions, please avoid them.
References:
- Sagas
- Saga Pattern
- Data Consistency in Microservices Architecture
#architecture #systemdesign #patterns
Sometimes pattern names produce visual associations, so check mine 😀
#architecture #systemdesign #patterns
Draw to Win
Did you know that 2/3 of our brain activity is occupied by processing visual information? Most of the information about the world is received through our eyes. Visualization is the most powerful way to communicate. So what? We as leaders can use it to educate and persuade people, to share our vision, and to sell our ideas.
But how do we do that in practice? That's what Dan Roam's book Draw to Win: A Crash Course on How to Lead, Sell, and Innovate with Your Visual Mind is about.
Why drawing is important:
- Drawing is the oldest 'technology' in the world
- 90% of all information is visual
- Visualization attracts attention and improves clarity in communications
- Visual information is memorable
Knowing that, we can significantly enhance our presentation skills. No special knowledge is required to start drawing: we can all easily draw lines, arrows, shapes, and smiley faces. That's all you need to start drawing to explain or sell your ideas. The accuracy of the resulting pictures is not important.
The author explains that our brain processes visual information to answer the following questions: who? what? how many? when? where? and why?
Organize your ideas into a visual story that provides those answers, and you will need only six pictures or slides to explain everything. The book even includes an example of how to explain a salary increase to a manager using this technique, but I won't spoil it; better to check out the original for the full story 😉!
Additionally, the book contains practical tips on getting started with drawing and using it to improve creative thinking. One that I really like: if you don't know what to draw, start with a circle, name it, and continue adding circles until your idea takes shape.
To sum up, visualization is a powerful tool that can be used to manage, educate, sell, share, collaborate, and innovate. I really enjoyed the book; it's full of interesting facts and practical advice. There are more books by the same author on this topic, and I'll definitely add them to my reading list!
#booknook #softskills #presentationskills
Book cover and some illustrations from the book that show the importance and simplicity of drawing
#booknook #softskills #presentationskills
In one of the previous blog posts, we broke down the Saga pattern, and I recommended avoiding it because of its high complexity. However, it's really interesting to explore successful implementations of the pattern. Let's take a look at how HALO scaled to 11.6 million users using the Saga design pattern.
HALO is a very popular shooter game that was initially introduced in 1999. At that time, the game used a single SQL database to store all game data. Its growth was explosive, and a single database soon became insufficient.
So the team set up a NoSQL database and partitioned it. Data for each player was kept in a dedicated database partition. This resolved the scaling limitations but brought new issues:
- Data writes are not atomic anymore
- Partitions may have non-consistent information
This means players could see different game data, which significantly impacts the game experience.
So the HALO team decided to set up a saga:
✏️ Each partition is changed within a local transaction only
✏️ Orchestrator manages update within all database partitions
✏️ The state of each local transaction is stored in a durable distributed log, which makes it possible to:
- Track if a sub-transaction failed
- Find compensating transactions that must be executed
- Track the state of compensating transactions
- Recover from failures
✏️ The log is stored outside the orchestrator, which makes the orchestrator stateless
✏️ The orchestrator interacts with the log to identify which local transactions or compensating actions to execute (a minimal sketch of this idea follows below)
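Here is a rough Python sketch of the durable-log idea, not HALO's actual implementation: the file-backed log, the state names, and the recovery helper are all illustrative assumptions (a real system would use a replicated log service).

```python
import json
from pathlib import Path

# A toy "durable distributed log" backed by an append-only local file.
LOG_FILE = Path("saga-log.jsonl")


def record(saga_id, partition, state):
    """Append the state of one sub-transaction (STARTED/COMPLETED/FAILED/COMPENSATED)."""
    with LOG_FILE.open("a") as f:
        f.write(json.dumps({"saga": saga_id, "partition": partition, "state": state}) + "\n")


def pending_compensations(saga_id):
    """After a saga failure, find partitions whose writes completed but were never compensated.

    Because the log (not the orchestrator) holds all state, any stateless
    orchestrator instance can pick up an interrupted saga and finish the rollback.
    """
    latest = {}
    for line in LOG_FILE.read_text().splitlines():
        entry = json.loads(line)
        if entry["saga"] == saga_id:
            latest[entry["partition"]] = entry["state"]
    return [p for p, state in latest.items() if state == "COMPLETED"]


# Example: a player-data update touched two partitions, then the saga failed.
LOG_FILE.unlink(missing_ok=True)  # start the toy log fresh for this demo
record("saga-42", "partition-a", "COMPLETED")
record("saga-42", "partition-b", "FAILED")
print(pending_compensations("saga-42"))  # ['partition-a'] still needs a compensation
```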
The introduced technical solution enabled further growth of HALO, which remains a popular Xbox game series with millions of unique users.
#architecture #systemdesign #usecase
Elastic: Back to Open Source?
This week, I came across the surprising news that Elastic has decided to return Elasticsearch and Kibana to open source.
Let me remind you that three years ago Elastic changed its license from Apache 2.0 to the semi-proprietary Server Side Public License. Teams that actively used the ELK stack remember that. In response, AWS forked the latest open Elasticsearch and Kibana versions, creating the OpenSearch project.
Within a year, OpenSearch had 100 million downloads and had gathered 8,760 pull requests from 496 contributors around the globe. It even launched its own OpenSearch Conference in 2023. The fork became extremely popular and successful.
Now, Elastic has announced the AGPLv3 license for the Elasticsearch and Kibana products. Maybe it relates to the decreasing interest in Elasticsearch as a product. There is also a good article on The New Stack that attempts to explain the reasons for this unexpected decision, which I recommend reading if you're interested in the topic.
The main question is whether teams already using OpenSearch will switch back to Elasticsearch. I don't think so. It's easy to change a license, but much harder to win back community trust.
#news #technologies
ir.elastic.co: Elastic Announces Open Source License for Elasticsearch and Kibana Source Code
Cassandra 5.0 is Officially Released
On September 5, the Cassandra 5 GA release was announced. Why is it important? First, Cassandra doesn't get updated very often; the last major release was in 2021. Second, the end-of-support for the 3.x series was announced at the same time. So, if you're still using 3.x, it's time to start planning an upgrade at least to 4.x.
Key changes:
- Storage Attached Indexes (SAI) (CEP-7). This is a new index implementation that replaces Cassandra secondary indexes, fixing their limitations. It allows creating indexes for multiple columns on the same table, improves query performance, reduces index storage overhead, and supports complex queries (such as numeric range and boolean queries).
- Trie Memtables and Trie SSTables (CEP-19, CEP-25). This is a change of the underlying data structures for the in-memory memtables and on-disk SSTables. These storage formats utilize tries and byte-comparable representations of database keys to improve Cassandra’s performance for reads and modification operations.
- Migration to JDK 17
- Unified Compaction Strategy (UCS) (CEP-26). It combines the tiered and levelled compaction strategies into a single algorithm. UCS has been designed to maximize the speed of compactions, using a unique sharding mechanism that compacts partitioned data in parallel.
- New Aggregation and Math Functions. Cassandra 5 adds new native CQL functions like count, max, min, sum, avg, exp, log, round and others. Users can also create their own custom functions.
- Approximate Nearest Neighbor Vector Search (CEP-30). The feature uses SAI and a new Vector CQL type. Vector is an array of floating-point numbers that show how similar specific objects or entities are to one another. It is a powerful technique for finding relevant content within large document collections and it can be used as a data-layer technology for AI/ML projects.
The new Cassandra release makes significant optimizations to existing functionality and brings some really promising new features. For full details, you can check out the release notes here.
#news #technologies
Manage Your Day
Career growth always means taking on more responsibilities within the team or company. The more responsibilities you have, the more tasks you need to manage. At some point, you may feel like a squirrel on a wheel: constantly responding to incoming requests and issues, one after another, with no time for actual work.
Do not let external requests manage your work. Manage them yourself. This sounds simple, but everyone who has been in this situation knows it's not so easy to do in practice.
Simple tips that might help:
✏️ You don't have to respond immediately to every question you receive. In most cases, nothing bad will happen if you check your messenger or email once every 2-3 hours.
✏️ You don't have to go to every meeting you're invited to. Review invitations carefully and decide which ones are really important. It’s ok to decline or ask to reschedule.
✏️ You don't need to execute a task immediately when you receive it. Ask about priorities and deadlines, estimate the impact on other tasks, then discuss and plan accordingly.
✏️ If a task takes less than 2 minutes, just do it (That's the only principle from Getting Things Done that really works for me).
✏️ Book time on the calendar to work on important tasks. Try to reserve at least a few hours a day for focused work.
✏️ Set task priorities. I like the Covey model, which groups all tasks by importance and urgency. Choose an approach that works for you.
✏️ Do not try to keep everything in your head: write down all important ideas, tasks, agreements, requests, whatever is important to perform your job.
Additionally, I recommend reading Time Management Techniques That Actually Work. The article contains a bunch of useful recommendations on the same topic. Try different tools and methods, and see what works for you.
Also, please feel free to share other recommendations that work for you in the comments.
#softskills #productivity
Canonical Logs
Logging is the oldest tool for troubleshooting software issues. But relevant information is spread across many individual log lines, making it difficult or even impossible to quickly find the right details or perform aggregation and analysis. That's where the canonical logs concept can help.
A canonical log is one long, structured log line emitted at the end of a request (or any other unit of work) that includes fields with the request's key characteristics. Having that data collocated in a single information-dense line makes queries and aggregations over it faster to write and faster to run.
A canonical log can include the following information:
- HTTP verb, path, response code and status
- Authentication related information
- Request ID, Trace ID
- Error ID and error message
- Service info: name, version, revision
- Timing information: operation duration, percentiles, time spent in database queries and others
- Remaining and total rate limits
- Any other useful information for your service
I want to highlight that the log must be structured (key-value, JSON) to make it machine-readable. Structured logs can be easily indexed by many existing tools, providing the ability to search and aggregate collected data.
Simple canonical log sample:
[2019-03-18 22:48:32.999] canonical-log-line alloc_count=9123 auth_type=api_key database_queries=34 duration=0.009 http_method=POST http_path=/v1/charges http_status=200 key_id=mk_123 permissions_used=account_write rate_allowed=true rate_quota=100 rate_remaining=99 request_id=req_123 team=acquiring user_id=usr_123
Good practice is to formalize the log contract across services and applications. As an example, a protobuf structure can be used for that purpose.
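As an illustration, here is a minimal, framework-agnostic Python sketch of emitting such a line at the end of a request; the handler, field names, and values are hypothetical and mirror the sample above.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("canonical")


def handle_request(http_method, http_path):
    """Hypothetical request handler that emits a single canonical log line at the end."""
    canonical = {"http_method": http_method, "http_path": http_path}
    started = time.monotonic()
    try:
        # ... real work happens here; handlers keep enriching the same dict ...
        canonical["database_queries"] = 3
        canonical["user_id"] = "usr_123"
        canonical["http_status"] = 200
    except Exception as error:
        canonical["http_status"] = 500
        canonical["error_message"] = str(error)
        raise
    finally:
        canonical["duration"] = round(time.monotonic() - started, 4)
        # One structured, machine-readable line per request.
        logger.info(json.dumps({"canonical-log-line": True, **canonical}))


handle_request("POST", "/v1/charges")
```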
Canonical logs seem to be a lightweight, flexible, and technology-agnostic technique to improve overall system observability. They are easy to implement and extend existing logging capabilities.
References:
- Using Canonical Log Lines for Online Visibility
- Fast and flexible observability with canonical log lines
- Logs Unchained: Exploring the benefits of Canonical Logs
#engineering #observability