TechLead Bits
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
DR Strategies

When RPO and RTO requirements are defined, it's time to select a DR strategy:

✏️ Backup/Restore. The simplest option, with a significant downtime (RTO) of hours or even days:
- the application runs in a single region only
- regular backups (full and incremental) are sent to another region
- only the active region has the capacity reserved to run the whole infrastructure
- in case of disaster the whole infrastructure must be rebuilt in a new region (in some cases it can be the same region); after that the application is reinstalled and data is restored from backups

✏️ Cold Standby. This option requires less downtime, but it can still take hours to fully restore the infrastructure:
- the application runs in a single region only
- minimal infrastructure is prepared in another region: a copy of the application or data storage may be installed, but it's scaled down or runs with minimum replicas
- regular backups (full and incremental) are sent to another region
- in case of disaster the application is restored from backups and scaled up appropriately

✏️ Hot Standby. The most complex and expensive option with minimal RTO measured in minutes:
- both regions have the same capacity reserved
- all applications are up and running in both regions
- data is replicated between regions in near real-time
- in case of disaster one of the regions continues to operate.

Which option to select usually depends on the availability and business requirements of the services you provide. In any case, the DR plan should be defined and documented so everyone knows what to do in case of disaster. Moreover, it's a good practice to regularly test how to restore the system. Otherwise you may end up in a situation where you have a backup but cannot restore the system, or even worse, where there are no actual backups at all.
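As a rough illustration of how the choice follows from the RTO target, here is my own sketch; the thresholds below are made up and must come from your actual business requirements:

```java
// Illustrative sketch only: maps a target RTO to a DR strategy.
// The thresholds are hypothetical and should come from your business requirements.
import java.time.Duration;

public class DrStrategySelector {

    enum Strategy { BACKUP_RESTORE, COLD_STANDBY, HOT_STANDBY }

    static Strategy select(Duration targetRto) {
        if (targetRto.compareTo(Duration.ofMinutes(30)) <= 0) {
            return Strategy.HOT_STANDBY;      // minutes of downtime -> full capacity in both regions
        } else if (targetRto.compareTo(Duration.ofHours(4)) <= 0) {
            return Strategy.COLD_STANDBY;     // a few hours -> pre-provisioned minimal infrastructure
        }
        return Strategy.BACKUP_RESTORE;       // hours or days -> rebuild and restore from backups
    }

    public static void main(String[] args) {
        System.out.println(select(Duration.ofMinutes(15)));  // HOT_STANDBY
        System.out.println(select(Duration.ofHours(12)));    // BACKUP_RESTORE
    }
}
```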

#architecture #systemdesign
👍3
DR Strategies. My attempt to visualize the main ideas 🙂

#architecture #systemdesign
2
Thinking Like an Architect

What makes a good architect different from other technical roles? If you've ever thought about that, I recommend checking out Gregor Hohpe's talk "Thinking Like an Architect".

Gregor says that architects are not the smartest people; they make everyone else smarter.

And to achieve this, they use the following tools:
✏️ Connect Levels. Architects talk with management in business language and with developers in technical language, so they can translate business requirements into technical decisions and technical limitations into business impacts.
✏️ Use Metaphors. They use well-known metaphors to explain complex ideas in a simple way.
✏️ See More. Architects see more dimensions of the problem and can do a more precise trade-off analysis.
✏️ Sell Options. They estimate and prepare options, and sometimes defer decisions to the future.
✏️ Make Better Decisions with Models. Models shape our thinking. If the solution is simple, the model is good; if it's not, there is probably something wrong with the model.
✏️ Become Stronger with Resistance. Not all people are happy with changes; architects can identify what beliefs people hold that make their arguments seem rational to them. By understanding this, architects can influence how people think and work.

I really like Gregor's talks: they are practical, make you look at familiar things from a different angle, and contain a good dose of humor. So if you have time, I recommend watching the full version.

#architecture
👍3
Really nice illustration from "Thinking Like an Architect" that shows what it means to see more 👍
👍1
Airbnb: Large-Scale Test Migration with LLMs

Amid all the hype about replacing developers with LLMs, I really like reading about practical examples of how LLMs are used to solve engineering tasks. Last week Airbnb published an article, Accelerating Large-Scale Test Migration with LLMs, where they described their experience automating the migration of ~3.5K React test files from Enzyme to React Testing Library (RTL).

Interesting points:
✏️ The migration was built as a pipeline with multiple steps, where a file moves to the next stage only after validation of the previous step has passed (see the sketch after this list)
✏️ If validation fails, the result is sent back to the LLM with a request to fix it
✏️ For small and mid-size files the most effective strategy was brute force: retry the steps multiple times until they pass or a retry limit is reached.
✏️ For huge, complex files the context was extended with the source code of the component, related tests in the same directory, general migration guidelines, and common solutions. The authors note that the main success driver here was not prompt engineering but choosing the right related files.
✏️ The overall result was a successful migration of 97% of the tests; the remaining part was fixed manually.
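Here is a minimal sketch of the retry-until-valid idea as I understood it from the article. The interfaces and names are hypothetical, not Airbnb's actual code:

```java
// Minimal sketch of a "validate, otherwise ask the LLM to fix it" step.
// LlmClient and Validator are hypothetical interfaces, not Airbnb's actual code.
import java.util.List;

interface LlmClient { String rewrite(String file, String instructions, List<String> context); }
interface Validator { boolean isValid(String file); }   // e.g. lint + compile + run the tests

class MigrationStep {
    private static final int MAX_RETRIES = 10;           // brute-force retry budget

    static String migrate(String file, LlmClient llm, Validator validator, List<String> context) {
        String candidate = file;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            candidate = llm.rewrite(candidate, "Migrate from Enzyme to RTL and fix validation errors", context);
            if (validator.isValid(candidate)) {
                return candidate;                        // move the file to the next pipeline stage
            }
        }
        throw new IllegalStateException("Retry limit reached, leave the file for manual migration");
    }
}
```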

The overall story shows huge potential for automating routine tasks. Even with a custom pipeline and some tooling around it, the migration with an LLM was significantly cheaper than doing it manually.

#engineering #ai #usecase
👍4
Balancing Coupling

Today we'll talk about the book Balancing Coupling in Software Design by Vlad Khononov. It's quite a fresh book (2024) that addresses a common architecture problem: how to balance coupling between components so that it stays easy to support new features and technologies without turning the solution into a big ball of mud.

The author defines coupling as a relationship between connected entities. If entities are coupled, they can affect each other. As a result, coupled entities should be changed together.

Main reasons for change:
- Shared Lifecycle: build, test, deployment
- Shared Knowledge: model, implementation details, order of execution, etc.

The author defines 4 levels of coupling:
📍Contract coupling. Modules communicate through an integration-specific contract.
📍 Model coupling. The same model of the business domain is used by multiple modules.
📍Functional coupling. Modules share knowledge of the functionality: the sequence of steps to perform, participation in the same transaction, duplicated logic.
📍Intrusive coupling. Integration through component implementation details that were not intended for integration.

Coupling can be described by the following dimensions:
📍Connascence. Shared lifecycle levels: static (compile-time) or dynamic (runtime) dependencies.
📍Integration Strength. The more knowledge components share, the stronger the integration between them.
📍Distance. The physical distance between components: the same class, the same package, the same lib, etc. The greater the distance, the more effort is needed to introduce a cascading change.
📍Volatility. How frequently the module is changed.

Then the author suggests a model to calculate coupling and other architecture characteristics using values of these dimensions.

For example:
Changes Cost = Volatility AND Distance

This means that the actual cost of changes is high only when both volatility and distance are high.

Coupling balance equation:
Balance = (Strength XOR Distance) OR NOT Volatility

Of course, the scale is relative and quite subjective, but it gives you a framework to assess your architectural decisions, predict their consequences, and adjust solution characteristics to find the right balance.
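Here is how I read these formulas, reduced to a simple high/low flag per dimension (my own simplification for illustration, not the book's exact notation):

```java
// My simplified reading of the book's formulas, with each dimension as a high(true)/low(false) flag.
public class CouplingBalance {

    // Cost of change is high only when a volatile module is also far away.
    static boolean changeCostHigh(boolean volatility, boolean distance) {
        return volatility && distance;
    }

    // Balance = (Strength XOR Distance) OR NOT Volatility:
    // either strong coupling stays close (and weak coupling may be distant),
    // or the module barely changes, so the coupling hardly matters.
    static boolean balanced(boolean strength, boolean distance, boolean volatility) {
        return (strength ^ distance) || !volatility;
    }

    public static void main(String[] args) {
        // Strongly coupled, distant, and volatile modules: unbalanced, the classic "big ball of mud" risk.
        System.out.println(balanced(true, true, true));   // false
        // Strongly coupled but kept close together (same package): balanced even if volatile.
        System.out.println(balanced(true, false, true));  // true
    }
}
```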

My overall impression of the book is very positive: it has no fluff, and it's clear, well structured, and very practical. Definitely recommended.

#booknook #engineering
🔥4👍2
A graphical representation of some concepts from the book

Source: Balancing Coupling in Software Design

#booknook #engineering
Adaptive Integration

Modern solutions typically consist of a mix of services, functions, queues, and databases. To implement an E2E scenario, developers need to build a chain of calls to get the result. And if some API changes, the whole E2E flow may break.

Of course, we have proto specs, OpenAPI, and autogenerated clients, but the problem is that any change brings significant adoption overhead to all dependent consumers.

Marty Pitt in his talk Adaptive Architectures - Building API Layers that Build Themselves presents an attempt to solve the problem and make changes cheap and fully automated.

I like the problem statement part; it really captures the pain of the existing microservice ecosystem: change an API and the integration is broken, change a message format and the integration is broken, change a function... you get the idea, right? So you need to be really careful with any contract change and work with all your consumers to make the migration smooth.

Then the author argues that the root cause of this problem is the lack of business semantics in our API specs. If we add them, the system can automatically generate call chains to perform any requested operation.

The idea can be represented as the following steps (a small sketch of the first step follows the list):
✏️ Add semantics to entities: for example, instead of a plain int id, use an accountId type across all services in the organization
✏️ Register service specs with a special integration service during startup.
✏️ Any service can call the integration layer using a DSL query like "Get balance for the account X with a specified email"
✏️ The integration service automatically generates an execution chain based on the registered specs. After that it orchestrates all the queries and returns the result to the caller.
✏️ If a service changes its API, it simply uploads a new spec version, and the integration service rebuilds the call chain accordingly.
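To illustrate the first step, here is my own plain Java sketch of what "semantic types" mean; the talk itself uses the Taxi language for this, so all names below are purely illustrative:

```java
// My own illustration of "semantic types" in plain Java; the actual approach in the talk uses Taxi.
// The point: AccountId and Email carry business meaning, so an integration layer can match
// "the accountId returned by the accounts service" to "the accountId expected by the balance service".
record AccountId(String value) { }
record Email(String value) { }
record Money(java.math.BigDecimal amount, String currency) { }

interface AccountsService {
    AccountId findAccountByEmail(Email email);
}

interface BalanceService {
    Money getBalance(AccountId accountId);
}

class GetBalanceForEmail {
    // The chain an integration layer could derive automatically from the semantic types:
    // Email -> AccountId -> Money
    static Money resolve(Email email, AccountsService accounts, BalanceService balances) {
        AccountId id = accounts.findAccountByEmail(email);
        return balances.getBalance(id);
    }
}
```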

The author and his team have already implemented the approach in https://github.com/taxilang/taxilang and https://github.com/orbitalapi.

From my point of view, a system that decides at runtime which APIs to call to perform a business transaction looks uncontrollable and difficult to troubleshoot. So I'm not ready to use this approach in real production. But the idea sounds interesting; let's see whether the usage of such tools grows in the future.

#engineering
👍3
Kafka 4.0 Official Release

If you're a fan of Kafka like I am, you might know that Kafka 4.0 was officially released last week. Besides being the first release that operates entirely without Apache ZooKeeper, it also contains some other interesting changes:

✏️ The Next Generation of the Consumer Rebalance Protocol (KIP-848). The team promises significant performance improvements and no more “stop-the-world” rebalances (a config sketch for opting in follows this list).
✏️ Early access to the Queues feature (I already described it earlier)
✏️ An improved transactional protocol (KIP-890) that should solve the problem of hanging transactions
✏️ The ability to whitelist OIDC providers via the org.apache.kafka.sasl.oauthbearer.allowed.urls property
✏️ Custom processor wrapping for Kafka Streams (KIP-1112) that should simplify reuse of common code across different Streams topologies
✏️ Default values for some parameters were changed (KIP-1030). This is effectively a public contract change with potential issues during upgrade, so be careful with it.
✏️ A lot of housekeeping work was done, so the version removes many deprecated pieces:
- v0 and v1 message formats were dropped (KIP-724)
- Kafka client versions <= 2.1 are no longer supported (KIP-1124)
- APIs and configs deprecated prior to version 3.7 were removed
- The old MirrorMaker (MM1) was removed
- Support for old Java versions was removed: clients now require Java 11+, brokers Java 17+
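For example, to the best of my knowledge, opting a Java consumer into the new KIP-848 protocol is a single client setting. This is only a sketch; double-check the property name against the official upgrade notes:

```java
// Sketch of a Kafka 4.0 consumer opting into the new rebalance protocol (KIP-848).
// Based on my reading of the release notes; verify property names against the official docs.
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewProtocolConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processing");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // "consumer" enables the next-generation protocol, "classic" keeps the old behavior.
        props.put("group.protocol", "consumer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(java.util.List.of("payments"));
            // poll loop omitted
        }
    }
}
```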

The full list of changes can be found in the release notes and the official upgrade recommendations.

The new release looks like a significant milestone for the community 💪. As always, before upgrading I recommend waiting for the first patch versions (4.0.x), which will probably contain fixes for the most noticeable bugs and issues.

#engineering #news #kafka
🔥7
Netflixed - The Epic Battle for America's Eyeballs

Recently I visited a bookshop to pick up a pocket book to read during a long flight. I noticed something with the word Netflix on the cover and decided to buy it. It was Netflixed: The Epic Battle for America's Eyeballs by Gina Keating.

Initially I thought it was a book about technology or leadership, but it turned out to be the story of Netflix's path to success. The book was published in 2013, but it's still relevant, as Netflix remains a leader in online streaming today.

The author tells Netflix's history, from an online DVD rental service to online movie streaming. The main part of the book focuses on Netflix's competition with Blockbuster (America's biggest DVD and media retailer at that time). It's really interesting to see how their market and optimization strategies went through different stages of technology evolution.

I won't retell the whole book, but there's one moment that really impressed me. Blockbuster was one step away from beating Netflix and becoming the market leader in online movie services. But at that critical moment, disagreements among Blockbuster's top management led to the company's collapse.

Most board members failed to see that the DVD era was ending and Internet technologies were the future. They fired the executive who drove the online program and brought in a new CEO with no experience in the domain. The new CEO decided to focus on expanding physical DVD stores and didn't want to hear about new technologies at all. That led to Blockbuster's full bankruptcy.

What can we learn from this? Some managers cannot accept the fact that they are wrong, and a single bad manager can ruin a whole business. Good leaders must listen to their teams, understand industry trends, and be flexible enough to adapt to change. For me, the book read like a drama, even though I already knew how it ends.

#booknook #leadership #business
👍1
ReBAC: Can It Make Authorization Simpler?

Security in general, and authorization in particular, is one of the most complex parts of big tech software development. At first glance, it's simple: invent some roles, add some verification at the API level, and to make it configurable, put the mapping somewhere outside the service. Profit!

The real complexity starts at scale, when you need to map hundreds of services with thousands of APIs to hundreds of e2e flows and user roles. Things get even more complicated when you add dynamic access conditions, like time of day, geographical region, or contextual rules. And you have to present that security matrix to the business, validate it, and test it. In my practice, that's always a nightmare 🤯.

So from time to time I check what the industry offers to help simplify authorization management. This time I watched the talk Fine-Grained Authorization for Modern Applications from NDC London 2025.

Interesting points:
✏️ The talk introduces ReBAC, relationship-based access control. This model allows access rules to be calculated and inherited based on relationships between users and objects (a tiny sketch of the idea follows the list)
✏️ To use this approach, a special authorization model should be defined. It's a kind of YAML-like configuration that describes the types of entities and their relationships.
✏️ Once you have a model, you can map real entities onto it and set allow/deny rules.
✏️ The open source tool OpenFGA already implements ReBAC. It even has a playground to test and experiment with authorization rules.
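To make the relationship-based idea concrete, here is my own minimal sketch in plain Java (not the OpenFGA API): access is derived by combining relationship tuples instead of checking a static role.

```java
// Minimal ReBAC sketch (my own illustration, not OpenFGA's API).
import java.util.Set;

record Tuple(String subject, String relation, String object) { }

class ReBacCheck {
    // Example rule: a viewer of a folder is also a viewer of every document inside that folder.
    static boolean canView(String user, String document, String folder, Set<Tuple> tuples) {
        boolean directViewer = tuples.contains(new Tuple(user, "viewer", document));
        boolean inheritedViewer = tuples.contains(new Tuple(user, "viewer", folder))
                && tuples.contains(new Tuple(document, "parent", folder));
        return directViewer || inheritedViewer;
    }

    public static void main(String[] args) {
        Set<Tuple> tuples = Set.of(
                new Tuple("anne", "viewer", "folder:reports"),
                new Tuple("doc:q1-report", "parent", "folder:reports"));
        // anne has no direct access to the document but inherits it through the folder relationship.
        System.out.println(canView("anne", "doc:q1-report", "folder:reports", tuples)); // true
    }
}
```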

The overall idea may sound interesting, but the new concept still doesn't solve the fundamental problem of how to manage security at scale. It's just yet another way to produce thousands of authorization policies.

The author mentioned that the implementation of OpenFGA is inspired by Zanzibar, Google's authorization system. There is a separate whitepaper that describes the main principles of how it works; I added it to my reading list and will probably publish some details in the future 😉.

#architecture #security
👍4
Technology Radar

At the beginning of April, Thoughtworks published a new version of the Technology Radar with the latest industry trends.

Interesting points:

✏️ AI. There is significant growth of agentic AI approaches in technologies and tools, but all of them still work in a supervised fashion, helping developers automate routine work. No surprises there.

✏️ Architecture Advice Process. The architecture decision process moves to a decentralized approach where anyone can make any architectural decision after getting advice from people with the relevant expertise. The approach is based on Architecture Decision Records (ADRs) and advisory forum practices. I posted a short ADR overview earlier.

✏️ OpenTelemetry Adoption. The most popular tools in the observability stack (e.g. Loki, Alloy, Tempo) added native OpenTelemetry support.

✏️ Observability & ML Integration. Major monitoring platforms embedded machine learning for anomaly detection, alert correlation and root-cause analysis.

✏️ Data Product Thinking. With expanding AI adoption, many teams started treating data as a product with clear ownership, quality standards, and a focus on customer needs. Data catalogs like DataHub, Collibra, Atlan, or Informatica are becoming more popular.

✏️ GitLab CI/CD was moved to the Adopt ring.

Of course, there are many more items in the report, so if you're interested, I recommend checking it and finding the trends that are relevant to your tech stack.

Since this post is about trends, I'll share one more helpful tool - StackShare. It shows the tech stacks used by specific companies, and how widely a particular technology is adopted across different companies.

#news #engineering
1👍1
Measuring Software Development Productivity

The more senior your position, the more you need to think about how to evaluate and communicate the impact of your team's development efforts. The business doesn't think in features and test coverage; it thinks in terms of business benefits, revenue, cost savings, and customer satisfaction.

There was an interesting post on this topic in the AWS Enterprise Strategy Blog called A CTO's Guide to Measuring Software Development Productivity. The author suggests measuring development productivity in 4 dimensions:

✏️ Business Benefits. Establish a connection between a particular feature and the business value it brings. Targets must be clear and measurable: for example, “Increase checkout completion from 60% to 75% within three months” instead of “improve sales”. When measuring cost savings from automation, track process times and error rates before and after the change to show the difference.

✏️ Speed To Market. This is the time from requirement to feature delivery in production. One of the tools that can be used here is value stream mapping: you draw your process as a set of steps and then analyze where ideas spend their time, whether in active work or waiting for decisions, handoffs, or approvals. This insight helps you plan and measure future process improvements.

✏️ Delivery Reliability. This dimension is about quality. It covers reliability, performance, and security. You need to translate technical metrics (e.g. uptime, RPS, response time, number of security vulnerabilities) into business metrics like application availability, customer experience, security compliance, etc.

✏️ Team Health. A burnt-out team cannot deliver successful software. The leader should pay attention to teams juggling too many complex tasks, constantly switching between projects, and working late hours. These problems predict future failures. Focused teams are a business priority.

The author's overall recommendation is to start with small steps, dimension by dimension, carefully tracking your results and sharing them with the stakeholders at least monthly. Strong numbers shift the conversation from controlling costs to investing in growth.

From my perspective, this is a good framework that can be used to communicate with the business and talk with them using the same language.

#leadership #management #engineering
👍2🔥1
Are Microservices Still Good Enough?

There was a lot of hype around microservices for many years. Sometimes they are used for good reasons, sometimes without them. But it looks like the era of fast growth has come to an end, and companies have started to focus more on cost reduction. This promotes a more practical approach to architecture selection.

One of the recent articles about this topic is Lessons from a Decade of Complexity: Microservices to Simplicity.

The author starts with the downsides of microservice architecture:
✏️ Too many tiny services. Some microservices become too small.
✏️ Reliability didn't improve. One small failure can trigger a cascading failure of the whole system.
✏️ Network complexity. More network calls produce higher latency.
✏️ Operational and maintenance overhead. Special deployment pipelines, central monitoring, logging, alerting, resource management, upgrade coordination. And this is just a small part of what's needed to operate such an architecture.
✏️ Poor resource utilization. Microservices can be so small that even 10 millicores are not utilized, which makes cluster resource management ineffective.

Recommendations for selecting an architecture:
✏️ Be pragmatic. Don’t get caught up in trendy architecture patterns, select what's really needed for your task and team now.
✏️ Start simple. Keeping things simple saves time and pain in the long run.
✏️ Split only when needed. Split services when there’s a clear technical reason, like performance, resource needs, or special hardware.
✏️ Microservices are just a tool. Use them only when they help your team move faster, stay flexible, and solve real problems.
✏️ Analyze tradeoffs. Every decision has upsides and downsides. Make the best choice for your team.

Additionally, the author shared a story of how he and his team consolidated hundreds of microservices into larger services, reducing the total number from hundreds to fewer than ten. This helped cut down alerts, simplify deployments, and improve infrastructure usage. Supporting the overall solution became easier and less expensive.

I hope that the cost-effectiveness of technical decisions has finally become a new trend in software development 😉.

#engineering #architecture
❤‍🔥4👍3
The Pop-up Pitch

Do you ever have situations when you need to sell your ideas to management? Or to explain your solution to the team? Or to convince someone of a selected approach?
The Pop-up Pitch: The Two-Hour Creative Sprint to the Most Persuasive Presentation of Your Life is a really helpful book from the master of visualization Dan Roam on how to do exactly that (an overview of his book on visualization was posted earlier).

As you can guess from the title, it focuses on creating persuasive presentations. As a basis the author uses storytelling principles, sketching, simplicity, and emotional involvement to capture the audience's attention.

Main Ideas:

✏️ To run a successful meeting you need to define its purpose. The pop-up pitch focuses on meetings to present new ideas and meetings for sales (to request an action).

✏️ Every meeting is about persuasion. The most effective approach is positive persuasion: you don't put pressure on people, but attract and emotionally involve them. Positive persuasion consists of 3 elements:
1. Benefits. The presenter truly believes that the idea is beneficial for the audience.
2. Truth. The idea is something the audience actually wants to get.
3. Achievability. We can do that with a clear step-by-step plan.

✏️ Visual Decoder. You should always start preparation by describing your idea along the following dimensions:
- Title – What’s the story about?
- Who? What? – Main characters and key elements
- Where? – Where things happen and how people/parts interact
- How many? – Key numbers and quantities, measurements
- When? – Timeline and sequence of events
- Lessons Learned – What the audience should remember at the end

✏️ Pitch. It's a 10-minute presentation based on storytelling techniques. Your story should have elements of drama, ups and downs. The whole storyline consists of 10 steps:
1. Clarity. A clear and simple story script.
2. Trust. Establish a connection with the audience, show that you understand their problems.
3. Fear. Problem explanation.
4. Hope. Show what successful results could look like.
5. Sobering Reality. We cannot keep doing the same things; we need to change the approach to achieve different results.
6. Gusto. Offer a solution.
7. Courage. Show the result is achievable with key steps and a clear plan.
8. Commitment. Explain what actions are needed.
9. Reward. Show what the audience can get in the near future.
10. True Aspiration. The big long-term win.

The book is well structured, with step-by-step guidelines for applying its recommendations in practice. It made me rethink my approach to some meetings and keep in mind that the most important thing in a presentation is not what you want to say, but what the audience is ready to hear.

#booknook #softskills #presentationskills #leadership
👍3
Template for storyline of the pitch

Story Sample:
1. Your upcoming presentation deserves to be amazing.
2. Take a deep breath. A big presentation is coming up.
3. But how do you grab people’s attention?
4. Imagine how great it can be.
5. The same old “as usual” presentation doesn't work anymore.
6. Maybe it’s time to try something new?
7. There’s a simple way.
8. You only need three things…
9. It’s the pop-up pitch!
10. What do you have to win?

Source: https://www.danroam.com/

#booknook #softskills #presentationskills #leadership
❤‍🔥4
The Subtle Art of Support

At IT conferences the main focus is usually on how to build a spaceship (or at least a rocket! 😃) with the latest technologies and tools. Everyone enjoys writing new features, but almost nobody is excited about fixing bugs. That's why I was really surprised when I saw a talk about support work - The subtle art of supporting mature products.

The author shared her experience organizing an L4 support team. Most of the recommendations are fairly trivial: organize training sessions, improve documentation, talk with your clients, etc. But the idea of having fully separate support and development teams really confused me.

From my point of view, such a model makes sense only for one-time project delivery: you develop something, deliver it to the customer, hand it over to support, and move on. But it's totally wrong for actively developed products.

In this case, separating support from development breaks the feedback loop (we're talking about L4 product support, of course). You simply cannot improve a product in the right way if you're not in touch with your customers and their pain. Support is a critical activity for the business: nobody cares about new features if existing features don't work.

I prefer a model where teams own a product or component, meaning the team is responsible for both development and support. The better your quality, the more capacity you can spend on feature development. Such a model creates really good motivation to work on internal stability, process optimization, and overall delivery quality.

One of the simplest ways to implement this approach is to rotate people between support and development work for a sprint, a few sprints, or a full release. In my practice, a rotation of 2-3 sprints works quite well.

Of course, I often hear the argument that support requires a lot of routine communication because users just "don't use features correctly", and that's why it should be handled by other people. But for me, that's a sign that something is wrong: the product is hard to use, the documentation is poor, test cases are missing, etc. That's exactly the point where you should perform some analysis and make improvements. And in the era of GenAI, teams can automate a lot of the support routine and focus on making their products really better.

#engineering #leadership
👍4
NATS: The Opensource Story

Open source projects play a key role in modern software development. They are widely used in building commercial solutions: we all know and actively adopt Kubernetes, PostgreSQL, Kafka, Cassandra, and many other really great products. But open source comes with a risk: the risk that one day a vendor will change the license to a commercial one (remember the story around Elasticsearch 😡?).

If a project becomes commercial, what can be done next:
✏️ Start paying for the product
✏️ Migrate to an opensource or home-grown alternative
✏️ Freeze the version and provide critical fixes and security patches on your own
✏️ Fork the project and start contributing to it

The actual cost of the decision will vary depending on the product's importance and complexity. But in any case it means extra costs and effort.

That's why, when choosing open source software, I recommend paying attention to the following:
✏️ Community: check activity in the GitHub repo, response time to issues, release frequency, and the number of real contributors
✏️ Foundation: if a project belongs to the Linux Foundation or CNCF, the risk of a license change is very low

That's why I found the story around NATS (https://nats.io/) really interesting. Seven years ago, the NATS project was donated by Synadia to the CNCF. Since then, the community has grown to ~700 contributors. Of course, Synadia continues to play an important role in NATS development and its roadmap.

But in April, Synadia officially requested to take the project back, with plans to change the license to the Business Source License. From the CNCF blog:
Synadia's legal counsel demanded in writing that CNCF hand over “full control of the nats.io domain name and the nats-io GitHub repository within two weeks”.


This is the first attempt I've seen to take a project back and exit from a foundation entirely. If it succeeded, it would create a dangerous precedent in the industry and kill trust in open source foundations. After all, foundations exist to prevent exactly such cases and provide protection from vendor lock-in.

The good news is that on May 1st the CNCF defended its rights to NATS and reached an agreement with Synadia to keep the project within the CNCF under Apache 2.0. In this story the CNCF demonstrated its ability to protect its projects, so belonging to the CNCF is still a good indicator when choosing open source projects for your needs.

#news #technologies
🤔1