TechLead Bits – Telegram
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
Software Complexity

Have you ever seen a project turn into a monster over time? Hard to understand, difficult to maintain? If so, I highly recommend Peter van Hardenberg's talk - Why Can't We Make Simple Software?

The author explains what complexity is (it's not the same as difficulty!), why software gets so complicated, and what we can actually do about it.

Common reasons for complex software:
✏️ Defensive Code. Code that starts simple, implementing some sunny-day scenario, but grows as more edge cases are handled. Over time, it turns into a mess with too many execution paths.
✏️ Scaling. A system designed for 100 users is really different from one built for 10 million. Handling scale often adds layers of complexity.
✏️ Leaky Abstractions. A well-designed interface should hide complexity, not expose unnecessary details. (A good discussion on this is in Build Abstractions not Illusions post).
✏️ Gap Between Model and Reality. If the software model isn't actually mapped to the problem domain, system complexity grows in ways that are really hard to fix.
✏️ Hyperspace. Problems multiply when a system has to work across many dimensions: different browsers, mobile platforms, OS versions, screen sizes, and more.

Software architecture degrades over time as changes accumulate. Every change can introduce more complexity, so it's critical to keep things simple. Some strategies to do that:
✏️ Start Over. Rebuild everything from scratch. Sometimes, it is the only way forward if the existing architecture can't support new business requirements.
✏️ Eliminate Dependencies. The fewer dependencies a system has, the easier it is to predict its behavior and analyze the impact of changes.
✏️ Reduce Scope. Build only what you actually need now. Avoid premature optimizations and "nice-to-have" features for some hypothetical future.
✏️ Simplify Architecture. No comments 😃
✏️ Avoid N-to-M Complexity. Reduce unnecessary variability to limit testing scope and system interactions.

Complexity starts when interactions appear, so it is really about dynamic system behavior. Watching this talk made me reflect on why systems become so complex and how I can make better design decisions.

#architecture #engineering
Adopting ARM at Scale

Some time ago I wrote about infrastructure cost savings using Multi-Arch Images and the growing ARM-based competition between big cloud providers.

Interestingly, just last week Uber published an article about their big migration from on-premise data centers to the Oracle Cloud and Google Cloud platforms, integrating Arm-based servers for cost efficiency. The main challenge is migrating the existing infrastructure and around 5,000 services to a multi-arch approach.

The Uber team defined the following migration steps:
- Host Readiness. Ensure that host-level software is compatible with Arm.
- Build Readiness. Update build systems to support multi-arch images.
- Platform Readiness. Deployment system changes.
- SKU Qualification. Assess hardware reliability and performance.
- Workload Readiness. Migrate code repositories and container images to support Arm.
- Adoption Readiness. Test workloads on Arm architecture.
- Adoption. The final rollout. The team built an additional safety mechanism that reverts back to x86 if a service is deployed with a single-architecture image.
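As a toy illustration of that last safety mechanism (my own sketch, not Uber's actual code), scheduling can check which architectures an image manifest actually ships and revert to x86 when the image is not truly multi-arch:

```python
# Toy sketch (not Uber's code): pick a target architecture for a deployment,
# reverting to x86 when the image does not actually ship as multi-arch.
def pick_architecture(manifest_archs: set[str], prefer: str = "arm64") -> str:
    """Return the architecture to schedule on, with a safe x86 fallback."""
    if prefer in manifest_archs and len(manifest_archs) > 1:
        return prefer   # real multi-arch image: safe to run on Arm
    return "amd64"      # single-architecture image: revert to x86

print(pick_architecture({"amd64", "arm64"}))  # arm64
print(pick_architecture({"amd64"}))           # amd64
```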

The migration is not fully finished yet, but the first services are already successfully built, scheduled, and running on Arm-based hosts. It looks like a really solid achievement for such a huge infrastructure.
 
#engineering #news
eBPF: What's in it?

If you work with cloud apps, you've probably noticed a growing trend to use eBPF for profiling, observability, security and network tasks. To fully understand the potential and limitations of this technology, it's good to know how it works under the hood.

Let's look at how applications are executed from a Linux system perspective. In simple terms, everything operates in three layers:
1. User Space. It's where our applications run. This is the non-privileged part of the OS.
2. Kernel space. The privileged part of the OS that handles low-level operations. These operations usually provide access to the system hardware (file system, network, memory, etc.). Applications interact with it through system calls (syscalls).
3. Hardware. The physical device layer.

eBPF is a technology that lets you embed a program at the kernel level, where it is triggered on particular system events like opening a file, reading a file, establishing a network connection, etc. In other words, eBPF allows you to monitor what's going on with your applications at the system level without code instrumentation. One of the earliest and most well-known tools based on BPF (the classic predecessor of eBPF) is tcpdump.

Some interesting ways companies use eBPF now:
- Netflix introduced bpftop, a tool to measure how long processes spend in the CPU scheduled state. If processes take too long, it often points to CPU bottlenecks like throttling or over-allocation.
- Datadog shared their experience using eBPF for chaos testing via ChaosMesh.
- Confluent adopted Cilium, an eBPF-based CNI plugin for networking and security in Kubernetes.

Over the past few years I've seen more and more adoption of eBPF-based tools across the industry. And it looks like the trend will continue to grow, especially in observability and profiling.

#engineering
Shift Security Down

Last week CNCF Kubernetes Policy Working Group released a Security "Shift Down" whitepaper. The main idea is to shift the security focus down to the platform layer.

By embedding security directly into the Kubernetes platform, rather than adding it as afterthought, we empower developers, operators, and security teams strengthening the software supply chain, simplifying compliance, and building more resilient and secure cloud-native environments.

said Poonam Lamba, co-chair of the CNCF Kubernetes Policy Working Group and a Product Manager at Google Cloud.

While Shift-Left Security emphasizes developer responsibility for security, Shift-Down Security focuses on integrating security directly into the platform, providing an environment that is secured by default.

Key elements of the Shift-Down Strategy:
✏️ Common security concerns are handled at the platform level rather than by business applications
✏️ Security is codified, automated, and managed as code
✏️ Platform security complements the Shift-Left approach and existing processes

The whitepaper provides a shared responsibility model across developers, operations, and security teams, introduces common patterns for managing vulnerabilities and misconfigurations, promotes automation and simplification, and enforces security best practices at the platform layer.

#engineering #security #news
Responsibility Matrix from Shift-Down Security whitepaper

#engineering #security
DR: Main Concepts

Over the last months I've been working a lot on Disaster Recovery topics, so I decided to summarize the key points and patterns.

Disaster recovery (DR) is the ability to restore access to and functionality of IT services after a disaster event, whether natural or caused by human action (or error).

DR is usually designed in terms of Availability Zones and Regions:
- Availability Zone (AZ) – the minimal, atomic unit of geo-redundancy. It can be a whole data center (physical building) or a smaller part like an isolated rack, floor, or hypervisor.
- Region - a set of Availability Zones within a single geographic area.

The most popular setups:
✏️ Public clouds. Each AZ is a separate data center, and data centers are located within ~100 km of each other. The chance that all data centers fail at the same time is very low, so it's enough to distribute a workload across multiple AZs. Different regions may still make sense, but mostly for load and content distribution.
✏️ On-premise clouds. Here an AZ is usually represented by different floors or racks in the same building, so it's better to have at least 2 regions to cover DR cases.

DR approach is measured by:
✏️ Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of a service and restoration of service. It's how long your service is not available.
✏️ Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point (e.g. backup). It's how much data you can lose in case of failure.
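A quick way to sanity-check these two numbers (my own toy example, not a formal method): with periodic backups, the worst-case data loss is roughly the backup interval, so the interval must not exceed the RPO, and the expected restore time must not exceed the RTO.

```python
# Toy sanity checks for DR targets (illustrative names, my own sketch).
def meets_rpo(backup_interval_min: int, rpo_min: int) -> bool:
    """Worst case, you lose everything written since the last backup."""
    return backup_interval_min <= rpo_min

def meets_rto(expected_restore_min: int, rto_min: int) -> bool:
    """The full restore must fit into the allowed downtime."""
    return expected_restore_min <= rto_min

# Hourly backups against a 4-hour RPO: fine.
print(meets_rpo(backup_interval_min=60, rpo_min=240))   # True
# A 6-hour restore against a 1-hour RTO: not fine.
print(meets_rto(expected_restore_min=360, rto_min=60))  # False
```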

Disaster Recovery architecture is driven by the RTO and RPO requirements for a particular application. They are the first thing you should define before implementing any solution. In one of the next posts we'll look at DR implementation strategies.

#architecture #systemdesign
Region and Availability Zones concepts visualization
DR Strategies

When RPO and RTO requirements are defined, it's time to select DR strategy:

✏️ Backup/Restore. The simplest option, with quite a big downtime (RTO) of hours or even days:
- the application runs on a single region only
- regular backups (full and incremental) are sent to another region
- only active region has reserved capacity to run the whole infrastructure
- in case of disaster, the whole infrastructure has to be rebuilt in a new region (in some cases it can be the same region); after that, the application is reinstalled and data is restored from backups

✏️ Cold Standby. This option requires less downtime, but it can still take hours to fully restore the infrastructure:
- the application runs on a single region only
- minimal infrastructure is prepared in another region: a copy of the application or data storage may be installed, but it's scaled down or runs with minimal replicas
- regular backups (full and incremental) are sent to another region
- in case of disaster the application is restored from backups and scaled up appropriately

✏️ Hot Standby. The most complex and expensive option with minimal RTO measured in minutes:
- both regions have the same capacity reserved
- all applications are up and running on both regions
- data is replicated between regions in near real-time
- in case of disaster one of the regions continues to operate.
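The trade-off between these options can be sketched as a tiny decision helper. This is my own illustration; the RTO figures are rough orders of magnitude, not hard rules.

```python
# Toy helper: pick the cheapest DR strategy that still satisfies the RTO.
# The RTO figures are rough, indicative orders of magnitude (my assumption).
STRATEGIES = [
    # (name, typical worst-case RTO in hours, relative cost)
    ("Backup/Restore", 48, "low"),
    ("Cold Standby", 8, "medium"),
    ("Hot Standby", 0.25, "high"),
]

def pick_strategy(rto_hours: float) -> str:
    """Cheapest options are listed first; return the first one that fits."""
    for name, typical_rto, _cost in STRATEGIES:
        if typical_rto <= rto_hours:
            return name
    raise ValueError("RTO is tighter than any of the listed strategies")

print(pick_strategy(72))  # Backup/Restore
print(pick_strategy(1))   # Hot Standby
```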

What to select usually depends on the availability and business requirements of the services you provide. In any case, a DR plan should be defined and documented so you know what to do in case of disaster. Moreover, it's good practice to regularly test system restoration. Otherwise you may end up in a situation where you have a backup but cannot restore the system, or even worse, there are no actual backups at all.

#architecture #systemdesign
DR Strategies. My attempt to visualize main ideas 🙂

#architecture #systemdesign
Thinking Like an Architect

What makes a good architect different from other technical roles? If you've ever thought about that, I recommend checking out Gregor Hohpe's talk "Thinking Like an Architect".

Gregor said that architects are not the smartest people: they make everyone else smarter.

And to achieve this, they use the following tools:
✏️ Connect Levels. Architects talk with management in business language and with developers in technical language, so they can translate business requirements into technical decisions and technical limitations into business impacts.
✏️ Use Metaphors. They use well-known metaphors to explain complex ideas in a simple way.
✏️ See More. Architects see more dimensions of the problem and can do more precise trade-off analysis.
✏️ Sell Options. Estimate and prepare options, sometimes defer decisions to the future.
✏️ Make Better Decisions with Models. Models shape our thinking. If the solution is simple, the model is good; if it's not, there is probably something wrong with the model.
✏️ Become Stronger with Resistance. Not all people are happy with changes. Architects can identify which beliefs people hold that make their arguments rational. By understanding this, architects can influence how people think and work.

I really like Gregor's talks: they are practical, make you look at familiar things from a different angle, and contain a good dose of humor. So if you have time, I recommend watching the full version.

#architecture
Really nice illustration from "Thinking Like an Architect" that shows what it means to see more 👍
Airbnb: Large-Scale Test Migration with LLMs

Amid all the hype about replacing developers with LLMs, I really like reading about practical examples of how LLMs are used to solve engineering tasks. Last week Airbnb published the article Accelerating Large-Scale Test Migration with LLMs, describing their experience automating the migration of ~3.5K React test files from Enzyme to React Testing Library (RTL).

Interesting points there:
✏️ Migration was built as a pipeline with multiple steps, where a file moves to the next stage only after validation of the previous step passes
✏️ If validation fails, the result is sent to the LLM again with a request to fix it
✏️ For small and mid-size files, the most effective strategy was brute force: retry steps multiple times until they pass or a limit is reached
✏️ For large, complex files, the context was extended with the source code of the component, related tests in the same directory, general migration guidelines, and common solutions. The authors note that the main success driver was not prompt engineering but choosing the right related files
✏️ The overall result was a successful migration of 97% of tests; the remaining part was fixed manually
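The "validate, then retry with feedback" loop described above can be sketched roughly like this. This is a simplification of the idea, not Airbnb's actual pipeline; the model and validator below are fake stand-ins.

```python
# Rough sketch of a validate-then-retry migration step (my own sketch;
# the LLM and validator here are fake stand-ins, not Airbnb's real code).
def migrate_file(source: str, run_llm_step, validate, max_retries: int = 5):
    """Move a file to the next stage only when validation passes."""
    result = run_llm_step(source, feedback=None)
    for _ in range(max_retries):
        ok, errors = validate(result)
        if ok:
            return result                                # passed this stage
        result = run_llm_step(source, feedback=errors)   # ask the LLM to fix
    return None                                          # give up: manual fix

# Fake model: succeeds only once it has seen validation feedback.
def fake_llm(source, feedback):
    return source + " [fixed]" if feedback else source

def fake_validate(result):
    return ("[fixed]" in result, ["assertion failed"])

print(migrate_file("old_test.tsx", fake_llm, fake_validate))
# → old_test.tsx [fixed]
```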

The overall story shows huge potential for automating routine tasks. Even with a custom pipeline and some tooling around it, the migration with an LLM was significantly cheaper than doing it manually.

#engineering #ai #usecase
Balancing Coupling

Today we'll talk about the Balancing Coupling in Software Design book by Vlad Khononov. It's quite a fresh book (2024) that addresses a common architecture problem: how to balance coupling between components so it's easy to support new features and technologies without turning the solution into a big ball of mud.

The author defines coupling as a relationship between connected entities. If entities are coupled, they can affect each other. As a result, coupled entities should be changed together.

Main reasons for change:
- Shared Lifecycle: build, test, deployment
- Shared Knowledge: model, implementation details, order of execution, etc.

The author defines 4 levels of coupling:
📍Contract coupling. Modules communicate through an integration-specific contract.
📍 Model coupling. The same model of the business domain is used by multiple modules.
📍Functional coupling. Modules share knowledge of the functionality: the sequence of steps to perform, sharing the same transaction, logic duplication.
📍Intrusive coupling. Integration through component implementation details that were not intended for integration.

Coupling can be described by the following dimensions:
📍Connascence. Shared lifecycle levels: static - compilation time or dynamic - runtime dependencies.
📍Integration Strength. The more knowledge components share, the stronger the integration between them.
📍Distance. The physical distance between components: the same class, the same package, the same lib, etc. The greater the distance is, the more effort is needed to introduce a cascading change.
📍Volatility. How frequently the module is changed.

Then the author suggests a model to calculate coupling and other architecture characteristics using values of these dimensions.

For example,
Changes Cost = Volatility AND Distance 

It means that if both distance and volatility are high, the actual cost of changes is high.

Coupling balance equation:
Balance = (Strength XOR Distance) OR NOT Volatility
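The boolean formulas above can be sketched in code, treating each dimension as a flag that is either high (True) or low (False). A toy illustration of the book's equations; the function names are mine:

```python
# Toy encoding of the book's boolean formulas (function names are mine).
def change_cost(volatility: bool, distance: bool) -> bool:
    """Changes are expensive when a far-away component changes often."""
    return volatility and distance

def balance(strength: bool, distance: bool, volatility: bool) -> bool:
    """Balance = (Strength XOR Distance) OR NOT Volatility: strong
    integration should stay close (or distant integration stay weak),
    unless the module rarely changes anyway."""
    return (strength != distance) or (not volatility)

# Strongly coupled, far apart, frequently changing -> unbalanced
print(balance(strength=True, distance=True, volatility=True))   # False
# Strongly coupled but kept close together -> balanced
print(balance(strength=True, distance=False, volatility=True))  # True
```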


Of course, the scale is relative and quite subjective, but it gives you a framework to assess your architectural decisions, predict their consequences, and adjust solution characteristics to find the right balance.

My overall impression of the book is very positive: it has no fluff, and it's clear, structured, and very practical. Definitely recommended.

#booknook #engineering
Some graphical representation for concepts from the book

Source: Balancing Coupling in Software Design

#booknook #engineering
Adaptive Integration

Modern solutions typically consist of a mix of services, functions, queues, and DBs. To implement an E2E scenario, developers need to build a chain of calls to get the result. And if some API changes, the whole E2E flow may break.

Of course, we have proto specs, Open API, autogenerated clients, but the problem is that any change brings significant adoption overhead to all its dependencies.

Marty Pitt in his talk Adaptive Architectures - Building API Layers that Build Themselves presents an attempt to solve the problem and make changes cheap and fully automated.

I like the problem statement part; it really describes the pain of the existing microservice ecosystem: change an API and the integration breaks, change a message format and the integration breaks, change a function and... you get the idea, right? So you need to be really careful with any contract change and work with all your consumers to make the migration smooth.

Then the author assumes that the reason for that problem is the lack of business semantics in our API specs. If we add them, the system can automatically generate call chains to perform any requested operation.

The idea can be represented as the following steps:
✏️ Add semantics to entities: for example, instead of int id, use accountId id across all services in the organization
✏️ Services register their specs with a special integration service during startup
✏️ Any service can call the integration using a DSL like Get balance for the account X with a specified email
✏️ The integration service automatically generates an execution chain based on the registered specs, then orchestrates all the queries and returns the result to the caller
✏️ If a service changes its API, it simply uploads a new spec version, and the integration service rebuilds the call chain accordingly
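To make the chain-building step concrete, here is a toy sketch of the idea (all names and the registry shape are mine, not Taxi/Orbital's actual model): services declare which semantic type they consume and produce, and the "integration service" chains them to answer a query.

```python
# Toy sketch of semantic chain resolution (my own illustration;
# the registry shape and names are not from Taxi/Orbital).
from dataclasses import dataclass

@dataclass
class ServiceSpec:
    name: str
    consumes: str   # semantic input type, e.g. "EmailAddress"
    produces: str   # semantic output type, e.g. "AccountId"

REGISTRY = [
    ServiceSpec("accounts", consumes="EmailAddress", produces="AccountId"),
    ServiceSpec("billing", consumes="AccountId", produces="AccountBalance"),
]

def resolve_chain(have: str, want: str) -> list[str]:
    """Build a call chain from the type we have to the type we want."""
    chain, current = [], have
    while current != want:
        step = next((s for s in REGISTRY if s.consumes == current), None)
        if step is None:
            raise LookupError(f"no path from {current} to {want}")
        chain.append(step.name)
        current = step.produces
    return chain

# "Get balance for the account with this email" resolves to two calls:
print(resolve_chain("EmailAddress", "AccountBalance"))  # ['accounts', 'billing']
```

If the billing service later produces the balance directly from an email, it just uploads a new spec and the resolved chain gets shorter, which is exactly the property the talk is after.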

The author and his team have already implemented the approach in https://github.com/taxilang/taxilang and https://github.com/orbitalapi.

From my point of view, a system that decides at runtime which APIs to call to perform a business transaction looks uncontrollable and difficult to troubleshoot, so I'm not ready to use the approach in real production. But the idea sounds interesting; let's see if the usage of such tools grows in the future.

#engineering
Kafka 4.0 Official Release

If you're a fan of Kafka like I am, you might know that Kafka 4.0 was officially released last week. Besides being the first release that operates entirely without Apache ZooKeeper, it also contains some other interesting changes:

✏️ The Next Generation of the Consumer Rebalance Protocol (KIP-848). The team promised significant performance improvements and no “stop-the-world” rebalances anymore.
✏️ Early access to the Queues feature (I already described it here)
✏️ Improved transactional protocol (KIP-890) that should solve the problem with hanging transactions
✏️ Ability to whitelist OIDC providers via the org.apache.kafka.sasl.oauthbearer.allowed.urls property
✏️ Custom processor wrapping for Kafka Streams (KIP-1112) that should simplify common code usage across different streams topologies
✏️ Default values for some parameters were changed. This is effectively a public contract change with potential issues during upgrade, so be careful with it (KIP-1030)
✏️ A lot of housekeeping work was done, so the version removes many deprecations:
- v0 and v1 message formats were dropped (KIP-724)
- Kafka client versions <= 2.1 are no longer supported (KIP-1124)
- APIs and configs deprecated prior to version 3.7 were removed
- The old MirrorMaker (MM1) was removed
- Old Java version support was removed: clients now require Java 11+, brokers Java 17+

Full list of changes can be found in release notes and official upgrade recommendations.

The new release looks like a significant milestone for the community 💪. As always, before any upgrade I recommend waiting for the first patch versions (4.0.x), which will probably contain fixes for the most noticeable bugs and issues.

#engineering #news #kafka
Netflixed - The Epic Battle for America's Eyeballs

Recently I visited a bookshop to pick up a pocket book to read during a long flight. I noticed something with the word Netflix on the cover and decided to buy it. It was Netflixed: The Epic Battle for America's Eyeballs by Gina Keating.

Initially I thought it was a book about technology or leadership, but it turned out to be the story of Netflix's path to success. The book was published in 2013, but it's still relevant, as Netflix remains a leader in online streaming today.

The author tells Netflix's history, from an online DVD rental service to online movie streaming. The main part of the book focuses on Netflix's competition with Blockbuster, America's biggest DVD and media retailer at the time. It's really interesting to see how their market and optimization strategies evolved through different stages of technology.

I won't retell the whole book, but there's one moment that really impressed me. Blockbuster was one step away from beating Netflix and becoming the market leader in online movie services. But at that critical time, disagreements among Blockbuster's top management led the company to collapse.

Most board members failed to see that the DVD era was ending and Internet technologies were the future. They fired the executive who drove the online program and brought in a new CEO with no experience in the domain. The new CEO decided to focus on expanding physical DVD stores; unfortunately, he didn't want to hear about new technologies at all. That led to Blockbuster's complete bankruptcy.

What can we learn from this? Some managers cannot accept the fact that they are wrong, and a bad manager can ruin a whole business. Good leaders must listen to their teams, understand industry trends, and be flexible enough to adapt to change. For me, the book read like a drama, even though I already knew how it ends.

#booknook #leadership #business