TechLead Bits
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
Is Your Cluster Underutilized?

"In clusters with 50 CPUs or more, only 13% of the CPUs that were provisioned were utilized, on average. Memory utilization was slightly higher at 20%, on average," - 2024 Kubernetes Cost Benchmark Report.

This aligns with what I've seen: often there are not enough resources to deploy an app even though the cluster's overall resource usage is below 50%.

The most common reasons for that:

✏️ Incorrect requests and limits. Kubernetes reserves cluster resources based on requests, while limits prevent a service from consuming more than expected. Typical issues are:
- Requests = Limits. Resources are exclusively reserved for peak loads and cannot be shared with other services. This often happens with memory configuration (the standard explanation is preventing OOM), but most modern runtimes can manage memory dynamically (e.g., Java since v17 with the G1 and Shenandoah collectors, Go, and C++ handle this natively).
- Requests > Average Usage. It may seem counterintuitive, but requests should be set below average usage. By the law of large numbers, peak loads are balanced across multiple deployments: the probability that all services hit peak usage at the same time is relatively small.

✏️ Mismatch between resource settings and runtime parameters. Requests and limits should align with language-specific configuration (e.g., GOMEMLIMIT for Go, -Xms/-Xmx for Java); see the sketch after this list.

✏️ High startup resource requirements. Some services need a lot of CPU and memory just to start, even though they consume far less after that. This requires high resource requests just to make deployment possible. A common example is Spring applications, which consume significant resources to load all beans at startup. Using native compilation or more efficient technologies can help.
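To make the runtime-alignment point concrete, here is a minimal Go sketch (assuming cgroup v2 and Go 1.19+; in a real deployment you would more likely just set the GOMEMLIMIT environment variable in the pod spec) that derives the Go soft memory limit from the container's memory limit:

```go
// Minimal sketch: align the Go runtime's soft memory limit with the container's
// memory limit so the GC works harder before the container gets OOM-killed.
// Assumes cgroup v2 and Go 1.19+.
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// containerMemoryLimit reads the cgroup v2 memory limit of the current container.
// It returns 0 if no limit is set ("max") or the file cannot be read.
func containerMemoryLimit() int64 {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return 0
	}
	value := strings.TrimSpace(string(data))
	if value == "max" {
		return 0
	}
	limit, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return 0
	}
	return limit
}

func main() {
	if limit := containerMemoryLimit(); limit > 0 {
		// Leave ~10% headroom for non-heap memory (goroutine stacks, cgo, buffers).
		debug.SetMemoryLimit(limit * 90 / 100)
	}
	// ... start the application ...
}
```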

As you can see, the real problem sits somewhere between developers and deployment engineers. Technical decisions and implementation details directly impact resource utilization and infrastructure costs.

#engineering
Be Curious

To make the right decisions, technical leaders and architects need to understand not only the technical part of the project but also its existing limitations, the business, integrations, and legal and contractual restrictions. I call that the project context.

I usually recommend that all my mentees and more junior colleagues extend this context understanding. In response, they often ask: "Where can I read about this?" The first few times, this question really confused me. The problem is that there's no single document that describes absolutely everything about a project. Information is fragmented, distributed across documents and teams, and a lot of things are not described at all.

So what can you do, and how can you extend your overall project knowledge? I spent some time reflecting on that, and the answer is simple: be curious. Ask questions, request specific documents, talk to people, and be interested in what's happening around you. You can start with your manager, neighboring teams, and colleagues from other departments.

Over time, it will help you build a broad picture of the project, improve your business understanding, perform better trade-off analysis, and choose more efficient technical solutions.

#softskills #architecture
Software Complexity

Have you ever seen a project turn into a monster over time? Hard to understand, difficult to maintain? If so, I highly recommend Peter van Hardenberg's talk - Why Can't We Make Simple Software?

The author explains what complexity is (it's not the same as difficulty!), why software gets so complicated, and what we can actually do about it.

Common reasons for complex software:
✏️ Defensive Code. Code that starts simple, implementing some sunny-day scenario, but grows as more and more edge cases are handled. Over time, it turns into a mess with too many execution paths (see the sketch after this list).
✏️ Scaling. A system designed for 100 users is really different from one built for 10 million. Handling scale often adds layers of complexity.
✏️ Leaky Abstractions. A well-designed interface should hide complexity, not expose unnecessary details. (A good discussion of this is in the Build Abstractions not Illusions post).
✏️ Gap Between Model and Reality. If the software model doesn't actually map to the problem domain, system complexity keeps growing and is really hard to fix.
✏️ Hyperspace. Problems multiply when a system has to work across many dimensions: different browsers, mobile platforms, OS versions, screen sizes, and more.
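A tiny, purely hypothetical Go illustration of the first point (the names and rules are made up): a pricing function that began as the sunny-day scenario and kept absorbing edge cases, so every new flag multiplies the execution paths you have to reason about and test:

```go
// Hypothetical illustration of "defensive code": a function that started as the
// sunny-day scenario and accumulated edge-case branches over time.
package pricing

import "errors"

// Original version: price = base * quantity.
// Each later branch was added for a real incident, and each one multiplies the
// number of paths a reader (and a test suite) has to consider.
func Total(base float64, quantity int, coupon string, legacyCustomer bool) (float64, error) {
	if quantity <= 0 {
		return 0, errors.New("quantity must be positive")
	}
	total := base * float64(quantity)
	if coupon != "" {
		if legacyCustomer {
			// Legacy customers keep the old 15% coupon behaviour.
			total *= 0.85
		} else {
			total *= 0.90
		}
	}
	if legacyCustomer && quantity > 100 {
		// Bulk discount grandfathered in from the old billing system.
		total *= 0.95
	}
	return total, nil
}
```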

Software architecture degrades over time as changes are made. Every change can introduce more complexity, so it's critical to keep things simple. Some strategies to do that:
✏️ Start Over. Rebuild everything from scratch. Sometimes, it is the only way forward if the existing architecture can't support new business requirements.
✏️ Eliminate Dependencies. The fewer dependencies the system has, the easier it is to predict its behavior and analyze the impact of changes.
✏️ Reduce Scope. Build only what you actually need now. Avoid premature optimizations and "nice-to-have" features for some hypothetical future.
✏️ Simplify Architecture. No comments 😃
✏️ Avoid N-to-M Complexity. Reduce unnecessary variability to limit testing scope and system interactions.

Complexity starts when interactions appear, so it's really about dynamic system behavior. Watching this talk made me reflect on why systems become so complex and how I can make better design decisions.

#architecture #engineering
Adopting ARM at Scale

Some time ago I wrote about infrastructure cost savings using Multi-Arch Images and the growing ARM-based competition between big cloud providers.

Interestingly, just last week Uber published an article about their big migration from on-premise data centers to Oracle Cloud and Google Cloud, integrating Arm-based servers for cost efficiency. The main challenge is migrating the existing infrastructure and around 5,000 services to a multi-arch approach.

The Uber team defined the following migration steps:
- Host Readiness. Ensure that host-level software is compatible with Arm.
- Build Readiness. Update build systems to support multi-arch images.
- Platform Readiness. Deployment system changes.
- SKU Qualification. Assess hardware reliability and performance.
- Workload Readiness. Migrate code repositories and container images to support Arm.
- Adoption Readiness. Test workloads on Arm architecture.
- Adoption. The final rollout. The team built an additional safety mechanism that reverts back to x86 if a service is deployed with a single-architecture image.

The migration is not fully finished yet, but the first services are already successfully built, scheduled, and running on Arm-based hosts. It looks like a really solid achievement in migrating such a huge infrastructure.
 
#engineering #news
eBPF: What's in it?

If you work with cloud apps, you've probably noticed a growing trend to use eBPF for profiling, observability, security and network tasks. To fully understand the potential and limitations of this technology, it's good to know how it works under the hood.

Let's look at how applications are executed from a Linux system perspective. In simple terms, everything operates in three layers:
1. User Space. It's where our applications run. This is the non-privileged part of the OS.
2. Kernel space. The privileged part of the OS that handles low-level operations. These operations usually provide access to the system hardware (file system, network, memory, etc.). Applications interact with it through system calls (syscalls).
3. Hardware. The physical device layer.

eBPF is a technology that lets you embed a program at the OS kernel level, where it is triggered by particular system events like opening a file, reading a file, establishing a network connection, etc. In other words, eBPF lets you monitor what's going on with your applications at the system level without code instrumentation. One of the earliest and most well-known tools based on this approach (the original BPF) is tcpdump.

Some interesting ways companies use eBPF now:
- Netflix introduced bpftop, a tool to measure how long processes spend in the CPU scheduled state. If processes take too long, it often points to CPU bottlenecks like throttling or over-allocation.
- Datadog shared their experience using eBPF for chaos testing via ChaosMesh.
- Confluent adopted Cilium, an eBPF-based CNI plugin for networking and security in Kubernetes.

Over the past few years I've seen more and more adoption of eBPF-based tools across the industry, and it looks like the trend will continue to grow, especially in observability and profiling.

#engineering
Shift Security Down

Last week the CNCF Kubernetes Policy Working Group released the Security "Shift Down" whitepaper. The main idea is to shift the security focus down to the platform layer.

"By embedding security directly into the Kubernetes platform, rather than adding it as an afterthought, we empower developers, operators, and security teams, strengthening the software supply chain, simplifying compliance, and building more resilient and secure cloud-native environments,"

said Poonam Lamba, co-chair of the CNCF Kubernetes Policy Working Group and a Product Manager at Google Cloud.

While Shift-Left Security emphasizes developer responsibility for security, Shift-Down Security focuses on integrating security directly into the platform, providing an environment that is secured by default.

Key elements of the Shift-Down Strategy:
✏️ Common security concerns are handled at the platform level rather than by business applications
✏️ Security is codified, automated, and managed as code
✏️ Platform security complements the Shift-Left approach and existing processes

The whitepaper provides a shared responsibility model across developers, operations, and security teams; introduces common patterns for managing vulnerabilities and misconfigurations; promotes automation and simplification; and enforces security best practices at the platform layer.

#engineering #security #news
Responsibility Matrix from Shift-Down Security whitepaper

#engineering #security
DR: Main Concepts

In recent months I've been working a lot on Disaster Recovery topics, so I decided to summarize the key points and patterns.

Disaster Recovery (DR) is the ability to restore access to and functionality of IT services after a disaster event, whether it's natural or caused by human action (or error).

DR is usually designed in terms of Availability Zones and Regions:
- Availability Zone (AZ) – the minimal, atomic unit of geo-redundancy. It can be a whole data center (a physical building) or a smaller part like an isolated rack, floor, or hypervisor.
- Region – a set of Availability Zones within a single geographic area.

The most popular setups:
✏️ Public clouds. An AZ is a separate data center, and data centers are located within ~100 km of each other. The chance that all of them fail at the same time is very low, so it's usually enough to distribute a workload across multiple AZs. Different regions may still make sense, but mostly for load and content distribution.
✏️ On-premise clouds. Here an AZ is usually represented by different floors or racks in the same building, so it's better to have at least two regions to cover DR cases.

A DR approach is measured by:
✏️ Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of a service and its restoration. In other words, how long your service is unavailable.
✏️ Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point (e.g., backup). In other words, how much data you can lose in case of failure.

Disaster Recovery architecture is driven by the RTO and RPO requirements of the particular application. They are the first thing you should define before implementing any solution. In one of the next posts we'll look at DR implementation strategies.
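A toy Go sketch with hypothetical numbers, just to show how these two targets constrain operational choices before any tooling is selected:

```go
// Toy sketch (hypothetical numbers): check operational parameters against
// RPO and RTO targets before choosing any DR tooling.
package main

import (
	"fmt"
	"time"
)

func main() {
	rpoTarget := 15 * time.Minute // max acceptable data loss
	rtoTarget := 1 * time.Hour    // max acceptable downtime

	backupInterval := 6 * time.Hour   // how often backups are taken
	estimatedRestore := 3 * time.Hour // measured time to rebuild and restore

	// Worst case we lose everything written since the last backup.
	if backupInterval > rpoTarget {
		fmt.Printf("RPO violated: backups every %v cannot meet a %v target\n", backupInterval, rpoTarget)
	}
	if estimatedRestore > rtoTarget {
		fmt.Printf("RTO violated: restore takes %v but the target is %v\n", estimatedRestore, rtoTarget)
	}
}
```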

#architecture #systemdesign
Region and Availability Zones concepts visualization
DR Strategies

When the RPO and RTO requirements are defined, it's time to select a DR strategy (a small selection sketch follows the list):

✏️ Backup/Restore. The simplest option, with quite a long downtime (RTO) of hours or even days:
- the application runs in a single region only
- regular backups (full and incremental) are sent to another region
- only the active region has capacity reserved to run the whole infrastructure
- in case of disaster, the whole infrastructure has to be rebuilt in a new region (in some cases it can be the same region); after that, the application is reinstalled and data is restored from backups

✏️ Cold Standby. This option requires less downtime, but it can still take hours to fully restore the infrastructure:
- the application runs in a single region only
- minimal infrastructure is prepared in another region: a copy of the application or data storage may be installed, but it's scaled down or runs with minimum replicas
- regular backups (full and incremental) are sent to the other region
- in case of disaster, the application is restored from backups and scaled up appropriately

✏️ Hot Standby. The most complex and expensive option, with a minimal RTO measured in minutes:
- both regions have the same capacity reserved
- all applications are up and running in both regions
- data is replicated between regions in near real time
- in case of disaster, one of the regions simply continues to operate.
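As a rough illustration (thresholds are made up, not a recommendation), here is a Go sketch of how RTO/RPO targets could map to the strategies above, picking the cheapest option that still meets the requirements:

```go
// Rough sketch: map RTO/RPO requirements to a DR strategy.
// The thresholds are illustrative only.
package dr

import "time"

type Strategy string

const (
	BackupRestore Strategy = "backup/restore"
	ColdStandby   Strategy = "cold standby"
	HotStandby    Strategy = "hot standby"
)

// SelectStrategy picks the cheapest strategy that can still meet the targets.
func SelectStrategy(rto, rpo time.Duration) Strategy {
	switch {
	case rto <= 30*time.Minute || rpo <= time.Minute:
		// Minutes of downtime and near-zero data loss need running replicas
		// and near real-time replication in a second region.
		return HotStandby
	case rto <= 8*time.Hour:
		// Hours of downtime are acceptable: pre-provisioned but scaled-down
		// infrastructure in the second region is enough.
		return ColdStandby
	default:
		// Days of downtime are acceptable: rebuild from backups on demand.
		return BackupRestore
	}
}
```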

What to select usually depends on the availability and business requirements of the services you provide. In any case, a DR plan should be defined and documented so you know what to do in case of disaster. Moreover, it's good practice to regularly test how the system is restored. Otherwise you may end up in a situation where you have a backup but cannot restore the system, or, even worse, there are no actual backups at all.

#architecture #systemdesign
DR Strategies. My attempt to visualize main ideas 🙂

#architecture #systemdesign
Thinking Like an Architect

What makes a good architect different from other technical roles? If you've ever thought about that, I recommend checking out Gregor Hohpe's talk "Thinking Like an Architect".

Gregor says that architects are not the smartest people; they make everyone else smarter.

And to achieve this, they use the following tools:
✏️ Connect Levels. Architects talk to management in business language and to developers in technical language, so they can translate business requirements into technical decisions and technical limitations into business impacts.
✏️ Use Metaphors. They use well-known metaphors to explain complex ideas in a simple way.
✏️ See More. Architects see more dimensions of the problem and can do more precise trade-off analysis.
✏️ Sell Options. Estimate and prepare options, sometimes defer decisions to the future.
✏️ Make Better Decisions with Models. Models shape our thinking. If the solution is simple, the model is good; if it's not, there is probably something wrong with the model.
✏️ Become Stronger with Resistance. Not everyone is happy with change. Architects can identify the beliefs people hold that make their arguments seem rational to them, and by understanding this, they can influence how people think and work.

I really like Gregor's talks: they are practical, make you look at familiar things from a different angle, and contain a good dose of humor. So if you have time, I recommend watching the full version.

#architecture
Really nice illustration from "Thinking Like an Architect" that shows what it means to see more 👍
Airbnb: Large-Scale Test Migration with LLMs

Amid all the hype about replacing developers with LLMs, I really like reading about practical examples of how LLMs are used to solve engineering tasks. Last week Airbnb published an article, Accelerating Large-Scale Test Migration with LLMs, describing their experience automating the migration of ~3.5K React test files from Enzyme to React Testing Library (RTL).

Interesting points there:
✏️ The migration was built as a pipeline with multiple steps, where a file moves to the next stage only after validation of the previous step passes
✏️ If validation fails, the result is sent back to the LLM with a request to fix it
✏️ For small and mid-size files the most effective strategy was brute force: retry the steps multiple times until they pass or a limit is reached
✏️ For large, complex files the context was extended with the source code of the component, related tests in the same directory, general migration guidelines, and common solutions. The authors note that the main success driver here was not prompt engineering but choosing the right related files
✏️ The overall result was a successful migration of 97% of the tests; the remaining part was fixed manually

The overall story shows huge potential for automating routine tasks. Even with a custom pipeline and some tooling around it, the migration with an LLM was significantly cheaper than doing it manually.
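A simplified Go sketch of that retry loop (callLLM and validate are placeholders, not Airbnb's actual tooling):

```go
// Simplified sketch of the retry-until-valid migration step described above.
// callLLM and validate are placeholders for the real model call and checks.
package migrate

import "errors"

const maxAttempts = 10 // brute-force retries worked best for small and mid-size files

// MigrateFile asks the LLM to rewrite a test file and re-prompts with the
// validation error until the result passes or the retry budget is exhausted.
func MigrateFile(source string, callLLM func(prompt string) (string, error), validate func(code string) error) (string, error) {
	prompt := "Migrate this Enzyme test to React Testing Library:\n" + source
	for attempt := 0; attempt < maxAttempts; attempt++ {
		candidate, err := callLLM(prompt)
		if err != nil {
			return "", err
		}
		verr := validate(candidate) // lint, compile, run the migrated test
		if verr == nil {
			return candidate, nil // move the file to the next pipeline step
		}
		// Feed the failure back so the next attempt can fix it.
		prompt = "The previous attempt failed with:\n" + verr.Error() +
			"\nPlease fix the migrated test:\n" + candidate
	}
	return "", errors.New("retry budget exhausted, falling back to manual migration")
}
```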

#engineering #ai #usecase
Balancing Coupling

Today we'll talk about the book Balancing Coupling in Software Design by Vlad Khononov. It's quite a fresh book (2024) that addresses a common architecture problem: how to balance coupling between components so it's easy to support new features and technologies without turning the solution into a big ball of mud.

The author defines coupling as a relationship between connected entities. If entities are coupled, they can affect each other; as a result, coupled entities have to be changed together.

Main reasons for change:
- Shared Lifecycle: build, test, deployment
- Shared Knowledge: model, implementation details, order of execution, etc.

The author defines 4 levels of coupling:
📍Contract coupling. Modules communicate through an integration-specific contract.
📍Model coupling. The same model of the business domain is used by multiple modules.
📍Functional coupling. Modules share knowledge of the functionality: the sequence of steps to perform, the same transaction, duplicated logic.
📍Intrusive coupling. Integration through component implementation details that were not intended for integration.

Coupling can be described by the following dimensions:
📍Connascence. The shared lifecycle level: static (compile-time) or dynamic (runtime) dependencies.
📍Integration Strength. The more knowledge components share, the stronger the integration between them.
📍Distance. The physical distance between components: the same class, the same package, the same library, etc. The greater the distance, the more effort is needed to introduce a cascading change.
📍Volatility. How frequently the module changes.

Then the author suggests a model to calculate coupling and other architecture characteristics using values of these dimensions.

For example,
Changes Cost = Volatility AND Distance 

It means that if both distance and volatility are high, the actual cost of changes is high.

Coupling balance equation:
Balance = (Strength XOR Distance) OR NOT Volatility


Of course, the scale is relative and quite subjective, but it gives you a framework to assess your architectural decisions, predict their consequences, and adjust solution characteristics to find the right balance.
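As a rough illustration, here is a small Go sketch that collapses each dimension to a boolean ("high" = true) and evaluates both expressions; the book uses a more nuanced scale, so treat this only as a way to build intuition:

```go
// Toy model: each dimension is reduced to a boolean where true means "high".
package coupling

// ChangesCost is high when a volatile module is also far away from its dependents
// (Changes Cost = Volatility AND Distance).
func ChangesCost(volatility, distance bool) bool {
	return volatility && distance
}

// Balanced evaluates Balance = (Strength XOR Distance) OR NOT Volatility.
// In Go, != on booleans is XOR.
func Balanced(strength, distance, volatility bool) bool {
	return (strength != distance) || !volatility
}

// Examples:
//   Balanced(true, false, true) == true  // strong coupling kept at a short distance
//   Balanced(true, true, true)  == false // strong coupling across a large distance that changes often
//   Balanced(true, true, false) == true  // even a poor structure is tolerable if nothing ever changes
```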

My overall impression of the book is very positive: it has no fluff, and it's clear, structured, and very practical. Definitely recommended.

#booknook #engineering
Some graphical representation for concepts from the book

Source: Balancing Coupling in Software Design

#booknook #engineering
Adaptive Integration

Modern solutions typically consist of a mix of services, functions, queues, and databases. To implement an E2E scenario, developers need to build a chain of calls to get the result, and if some API changes, the whole E2E flow may break.

Of course, we have proto specs, OpenAPI, and autogenerated clients, but the problem is that any change brings significant adoption overhead for all its dependents.

In his talk Adaptive Architectures - Building API Layers that Build Themselves, Marty Pitt presents an attempt to solve this problem and make changes cheap and fully automated.

I like the problem-statement part; it really captures the pain of the existing microservice ecosystem: change an API and the integration breaks, change a message format and the integration breaks, change a function... you get the idea, right? So you need to be really careful with any contract change and work with all your consumers to make the migration smooth.

The author then argues that the root of the problem is the lack of business semantics in our API specs, and that if we add them, the system can automatically generate call chains to perform any requested operation.

The idea can be represented as the following steps:
✏️ Add semantics to entities: for example, instead of a bare int id, use an accountId type consistently across all services in the organization (see the sketch after this list)
✏️ Register service specs with a special integration service at startup
✏️ Any service can call the integration service using a DSL, like "Get balance for the account X with a specified email"
✏️ The integration service automatically generates an execution chain based on the registered specs, then orchestrates all the queries and returns the result to the caller
✏️ If a service changes its API, it simply uploads a new spec version, and the integration service rebuilds the call chain accordingly
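A hypothetical Go sketch of the first step (this is not Taxi syntax, just the idea expressed with types): give fields semantic types instead of bare primitives so that their meaning becomes machine-readable:

```go
// Hypothetical illustration: semantic types shared across the organization make
// "this string is an account id" explicit instead of tribal knowledge.
package accounts

type AccountID string
type EmailAddress string
type Balance float64

// Before: GetBalance(id int) (float64, error) - the meaning of the arguments is implicit.
// After: the signature itself carries the semantics an integration layer could reason about.
type AccountService interface {
	LookupAccount(email EmailAddress) (AccountID, error)
	GetBalance(id AccountID) (Balance, error)
}
```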

The author and his team have already implemented the approach in https://github.com/taxilang/taxilang and https://github.com/orbitalapi.

From my point of view, a system that decides at runtime which APIs to call to perform a business transaction looks uncontrollable and difficult to troubleshoot, so I'm not ready to use this approach in real production. But the idea sounds interesting; let's see whether usage of such tools grows in the future.

#engineering