TechLead Bits – Telegram
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
Kafka 4.1 Release

At the beginning of September, Kafka 4.1 was released. It doesn't contain any big surprises, but it follows the overall industry direction of improving security and operability.

Noticeable changes:
🔸 Preview state for Kafka Queues (detailed overview here). It's still not recommended for production, but it's a good time to check how it works and what scenarios it really covers.
🔸 Early access to the Streams Rebalance Protocol. It moves rebalance logic to the broker side. Initially the approach was implemented for consumers, and now it's extended to streams (KIP-1071).
🔸 Ability for plugins and connectors to register their own metrics via the Monitorable interface (KIP-877).
🔸 Metric naming unification between consumers and producers (KIP-1109). Previously the Kafka consumer replaced periods (.) in topic names in metrics with underscores (_), while the producer kept the topic name unchanged. Now both producers and consumers preserve the original topic name format. The old metrics will be removed in Kafka 5.0.
🔸 OAuth jwt-bearer grant type support in addition to client_credentials (KIP-1139).
🔸 Ability to enforce explicit naming for internal topics (like changelog and repartition topics). A new configuration flag prevents Kafka Streams from starting if any of its internal topics have auto-generated names (KIP-1111).

The full list of changes can be found in the release notes and the official upgrade recommendations.

#news #technologies
GenAI as a Thought Partner

AI is mostly used to get answers, summarize, generate text, or automate routine tasks. But it can be much more than that if you use it in Thought Partner mode.

What does that mean?
It means you can ask AI to generate ideas, challenge your solutions, offer alternative options, or even play devil’s advocate.

This mode is really helpful for leaders. For example, I use it to challenge my proposals and find alternative options and arguments. It helps me be more prepared for meetings with management and customers.

Basic template:
🔸 Role: "Act as my Strategic Thought Partner"
🔸 Context: situation or problem description, objectives
🔸 Task: what to do

Example:
Act as my Strategic Thought Partner by engaging me in a structured problem-solving process. Here’s the situation: [provide necessary context].
My goal is to [state objective].
Challenge my current assumptions, ask clarifying questions, and help me think through alternative solutions. I’d like you to surface blind spots and uncover insights I may have overlooked.
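To make the template reusable, you can assemble it programmatically. A minimal sketch (the function and field names are mine, not from any library):

```python
def build_thought_partner_prompt(context: str, objective: str) -> str:
    """Assemble a Thought Partner prompt from the Role/Context/Task template."""
    return (
        "Act as my Strategic Thought Partner by engaging me in a "
        "structured problem-solving process.\n"
        f"Here's the situation: {context}\n"
        f"My goal is to {objective}.\n"
        "Challenge my current assumptions, ask clarifying questions, "
        "and help me think through alternative solutions. "
        "Surface blind spots and uncover insights I may have overlooked."
    )

# Hypothetical situation, just to show the shape of the result.
prompt = build_thought_partner_prompt(
    context="we are choosing between a modular monolith and microservices",
    objective="pick an architecture we can operate with a five-person team",
)
print(prompt)
```

You can then paste the result into any chat or send it through whatever API you use.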


More ideas to use:
🔸 Give me 10 unexpected angles to consider for...
🔸 Act as a devil's advocate and challenge my current assumptions about...
🔸 Evaluate the pros and cons for...
🔸 Help me uncover blind spots and overlooked insights related to...

Thought Partner mode is a great tool, but don’t take everything as absolute truth. If you miss any important details, it can give you totally wrong results. And of course, it can still lie, make mistakes, and hallucinate 😵‍💫. Use it with a critical eye.

#ai #tips
A Few Words About Configurations

The ability to change system configuration is a very important aspect of service operability. But too many configuration options can turn system support into a nightmare.

From my experience, dev teams tend to overcomplicate the configs they provide. They try to allow as many options as possible. The common explanation is "We don't know what will really be needed". Then all the configurations are carefully documented in a several-thousand-line guide and delivered to the ops team. Of course, the ops team never ever reads it 😁

There is a really good metaphor from Google SRE Book that illustrates this situation:
A user can ask for “hot green tea” and get roughly what they want. On the opposite end, a user can specify the whole process: water volume, boiling temperature, tea brand and flavor, steeping time, tea cup type, and tea volume in the cup.


Configuration is intended to be used by humans, and it should be designed for humans.

The main principle here is simplicity and reasonable defaults. The less configuration is required, the simpler the system is to operate and maintain.
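As a sketch of what "reasonable defaults" can look like in code (the service, names, and values are illustrative, not from any real system):

```python
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    # The only knobs operators realistically need to touch.
    listen_port: int = 8080
    log_level: str = "INFO"
    # Internal tuning stays out of the operator's way: sane fixed values,
    # no flags in the ops guide, changeable only by the dev team.
    _request_timeout_s: float = 30.0
    _max_connections: int = 100

# "Hot green tea": the operator overrides almost nothing.
config = ServiceConfig(log_level="DEBUG")
```

Every field removed from the public surface is one less row in the ops guide and one less combination to test.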

One more important aspect of configurability is the testing surface. It is quite expensive to check all possible parameters and their combinations. As a result, too much variety increases the risk of errors and human mistakes.

So next time you think about adding a new configuration parameter, keep in mind that the best configuration is no configuration.

#systemdesign #engineering
Is Open Source Free?

Did you know that the term open source was coined in 1998 to replace the term free software? To highlight "free as in freedom, not free as in beer" 😀

Dylan Beattie presented the history of open source and its current trends in the talk Open Source, Open Mind: The Cost of Free Software.

The history itself is very interesting: from pirating computer games and creating the first Linux distros to the evolution of licenses, CLAs, and the current set of limitations on using open software. I recommend watching that part once you have free time; it's really entertaining.

But here I want to highlight the following:
🔸 Open source projects provide us code, nothing more. If you want continuity, support, availability, or convenience, expect to pay. That can mean licenses, managed services, sponsorship, or even your own investment.
🔸 "People who share the source code do not owe you anything". They don’t even promise the software works properly, or works at all.
🔸 Open source projects can change their license to a commercial one at any time. You should just be ready for that (remember Redis, Graylog, Vault, Elasticsearch, etc.).

An example:
You can take Postgres for free; it's fully open.
But can you use it in production? Probably not 😯.
First you need to package it, prepare installation and upgrade procedures, implement HA, configure metrics and monitoring dashboards, provide a backup approach, tune security, teach the operations team, etc.
You can do it on your own or pay another company to do it for you.

So anyone who says open source is free and costs nothing has clearly never run it in production. Open source software is really "open", but not free.

#technologies
Failure Is Always An Option

One more great video from Dylan Beattie: Failure Is Always An Option. This time it's a talk about software reliability and the risks of system misbehavior.

Key ideas:
🔸 Use Systems Thinking. Reliability is not just about software; it's about a holistic view of the system that includes software, hardware, finance, and people.
🔸 Design For Failure. Be ready for failure at all system levels and components.
🔸 Measure Risk by Impact, not Frequency. You might never have had a car accident, but that doesn’t mean you don’t need airbags and seat belts.
🔸 Focus on Results. Define "done" by outcomes, not by executed steps or procedures.
🔸 Expect Surprises. Users are really creative; they can use features in unpredictable ways. Don't be arrogant and say "This is wrong". Learn from them to build awesome stuff together.

The talk is full of interesting examples of building complex reliable systems. The most impressive part for me was the story around the Apollo 13 mission 🚀.

Just imagine: a shuttle, astronauts, space, and some software... The whole mission's success and the astronauts' lives depend on software quality and reliability. Sounds like a horror story, right? 😃
HA & DR for the Shuttle software was implemented using 6 computers:
🔸 4 identical computers to compare results and provide availability.
🔸 A 5th computer performing the same logic but with software written by a different vendor.
🔸 A 6th computer with no software at all; the idea was to use it to install the software from scratch in case of issues with the software on all the other computers. Later the 6th computer was removed from shuttles as it was "never really used".

The video has many more great examples from software engineering history; I watched it in one sitting. And I love Dylan’s presentation style: energetic, with a good dose of humor, engaging, and inspirational. Recommended 👍.

#systemdesign #engineering #reliability
Open Infrastructure is Not Free

There was a piece of news last week that might not be very noticeable, but it's really important for the whole open source community. On Sep 23, open source foundations such as Sonatype (Maven Central), the Open Source Security Foundation (OpenSSF), the Python Software Foundation (PyPI), and others published a joint letter: Open Infrastructure is Not Free: A Joint Statement on Sustainable Stewardship.

The problems they highlighted:
🔸 Open source infrastructure is the foundation of any modern digital infrastructure.
🔸 Users expect this infrastructure to be secure, fast, reliable, and global.
🔸 Public registries are often used to distribute proprietary software (it may have an open source license but work only as part of a paid product).
🔸 Commercial organizations heavily use open source infrastructure as a free CDN and distribution system.
🔸 Open source infrastructure is supported by non-profit foundations and enthusiasts. They don't have enough resources to meet growing expectations.
🔸 Load on the infrastructure grows exponentially; donations grow linearly.
🔸 This situation produces an imbalance: billion-dollar ecosystems run on services built on goodwill, unpaid weekends, and sponsorships.

The problem is obvious: too many companies make money on open source infrastructure without giving a cent back. They profit while the real costs are carried by volunteers and foundation sponsors. The claim is fair enough.

Proposed ideas:
🔸 Commercial Partnership: Fund infrastructure in proportion to usage.
🔸 Tiered Access: free access for individual contributors, paid options for scale and performance for high-volume consumers.
🔸 Additional Capabilities: provide extra capabilities that might be interesting to commercial entities (e.g., statistics or analytics).

The authors say this letter is only the beginning: they will start actively working with foundations, governments, and industry partners to improve the situation. It looks like in 2-3 years we'll have a totally different infrastructure, and, most probably, it will not be free.

#news #technologies
Software Quality: What does it mean?

We all want to build high-quality products. But what do we mean by high quality? Is it high test coverage? A low defect rate? Reliability? Compliance?
Actually, developers, business, and users all mean different things by quality.

There is a really good publication from the Google team on this topic: Developer Productivity for Humans, Part 7: Software Quality.

The authors break down software quality into 4 types:
🔸 Process Quality. It usually includes code reviews, organizational consistency, effective planning, testing strategy, test flakiness, and distribution of work. Typically, higher process quality leads to higher code quality.
🔸 Code Quality. It's the code's testability, complexity, readability, and maintainability. High code quality improves the quality of the system by reducing defects and increasing reliability.
🔸 System Quality. It means high reliability, high performance, security, privacy, and low defect rates.
🔸 Product Quality. It's the type of quality experienced by customers. It includes utility, usability, and reliability. This level also includes other business parameters: brand reputation, costs and overhead, and revenue.

These four types of quality affect each other: process quality affects code quality, which affects system quality, which affects product quality. The end goal is always to improve product quality.

This model also explains why ideas like "we'll improve test coverage to X% and get good quality" rarely work in practice. It might help a little, but the connection to product quality is too indirect.

So if a team is concerned about quality, they need to analyze which type of quality they want to work on and select appropriate metrics.

#engineering #quality
Schrödinger's Backup

Let's imagine that you carefully design your backup strategy (refer to Backup Strategy, Backup Types), deliver it to production, configure a schedule to trigger it regularly, and store backups in another region for DR purposes.
Can you feel safe after it?
No 😱.

The problem is that the backup is there, but not really...
Until you have a process for regularly restoring production data, you have no guarantee that it works. It's not possible to test restoration on the real production environment, so the procedure should restore data into another environment and execute at least basic sanity checks.
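For a Postgres-like database, such a check can be scripted as "restore the latest dump into a scratch database, run sanity queries, drop it". A sketch that only builds the command list (paths, database and table names are made up, and a real procedure needs proper error handling and reporting):

```python
def build_restore_check_plan(dump_path: str, scratch_db: str) -> list[str]:
    """Return shell commands that restore a dump into a scratch database
    and run basic sanity checks; execute them with your scheduler of choice."""
    return [
        f"createdb {scratch_db}",
        f"pg_restore --no-owner --dbname={scratch_db} {dump_path}",
        # Sanity check: a key table exists and is readable.
        f'psql {scratch_db} -c "SELECT count(*) FROM orders;"',
        f"dropdb {scratch_db}",
    ]

plan = build_restore_check_plan("/backups/prod-2024-06-01.dump", "restore_check")
```

Running something like this on a schedule turns "we have backup files" into "we know we can restore".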

With this idea in mind, I decided to check what the industry has: I asked about testing backup procedures in the X DevOps community, checked what public clouds offer, and looked for suggestions on the Internet.

Key findings:
🔸 Most teams have never tested restoration of production backups. They verify only the procedure itself on test environments.
🔸 GCP and Azure recommend testing production restoration, but you have to prepare the e2e procedure on your own (or I was not able to find it quickly).
🔸 AWS offers an automatic testing procedure for its managed storage with the ability to create custom validation workflows.
🔸 Uber has a great article where they share their continuous backup/restore approach.

Surprisingly, there is not much practical information about how to implement regular restoration testing.
Most probably there are 3 reasons for that:
- It's expensive
- The process is very environment- and company-specific
- It may be more relevant for big tech companies where data loss is a critical business risk

So don't assume that no errors and existing backup files mean that you have a backup. You don't really know until a real incident happens.

#engineering #backups
Adizes Leadership Styles

Have you ever worked with a leader who could quickly launch any initiative but was terrible at organizing any process around it? Or with someone who could bring order to any chaos but couldn't drive change? According to Dr. I. Adizes, these are different styles of leadership.

Adizes defines the following styles:
🔸 Producer. Focuses on WHAT should be done (product, services, some KPIs).
🔸 Administrator. Focuses on HOW it should be done (processes, methodologies).
🔸 Entrepreneur. Focuses on WHY and WHEN it should be done (changes, initiatives, new opportunities).
🔸 Integrator. Focuses on WITH WHOM it should be done (people).

The model is named PAEI after the first letters of these types.

The main idea is that most of us embody one or two of these styles, and may pick up some elements of a third, but nobody can master all of them. These styles are shaped by personality type, experience, and the situation.

It's good to know your own type and your manager's type. Different types of managers use different language and are interested in different things. This is especially important in communication with your direct manager.

I first encountered this model more than 10 years ago and identified myself as a strong Integrator. Recently I took the tests one more time and found my main styles to be Entrepreneur and Administrator. So what does that mean? The situation (position), tasks, and experience have changed.

If you want to go into more detail, I recommend reading Management/Mismanagement Styles by I. Adizes. The author actually has many more books, but they are more or less about the same things.

#booknook #softskills #leadership
AWS Aurora Stateful Blue-Green

Most existing blue-green implementations that I know of relate to stateless services. Stateful services rarely become the subject of blue-green deployments. The main reason is the high cost of copying production data between versions.

But in some cases we need strong guarantees that the next upgrade will not break production. AWS tries to solve this issue with Aurora blue-green deployments.

The marketing part sounds good:
You can make changes to the Aurora DB cluster in the green environment without affecting production workloads... When ready, you can _switch over_ the environments to transition the green environment to be the new production environment. The switchover typically takes under a minute with no data loss and no need for application changes.


But let's check how it works:
🔸 A new Aurora cluster with the same configuration and topology is created. It gets a new service name with a green- prefix.
🔸 The new cluster can have a higher version and a different set of database parameters than the production cluster.
🔸 Logical replication is established between the production cluster and the new green cluster.
🔸 The green cluster is read-only by default. Enabling write operations can cause replication conflicts.
🔸 Once the green database is tested, it's possible to perform a switchover. The names and endpoints of the current production environment are assigned to the newly created cluster.

In simple words, it's just a new cluster that copies data from production using the logical replication feature. As a result, it inherits all the restrictions of that feature, such as missing DDL replication, no replication for large objects, lack of extension support, and others. So you need to be very careful when deciding to use this approach.

To me it looks like this solution is suitable only for very basic scenarios with simple data types. Anything more complex won't work.

#engineering #systemdesign #bluegreen
Stateful Service Upgrade Strategy

Last time I wrote about the AWS Aurora Stateful Blue-Green approach. Despite the limitations, it gives us a good pattern that can make upgrade procedures safer, even for on-prem installations.

For simplicity, let's focus on a database example, but the idea is applicable to any stateful service.

The simplified approach is the following:
🔸 Hide the real cluster name behind a specific DNS name (in AWS it's Route 53; in Kubernetes it can be just a service name)
🔸 Perform a backup of the production database
🔸 Restore the production backup as a new database instance
🔸 Execute the upgrade of the production cluster (or any other potentially dangerous operation)
🔸 If the upgrade failed, switch DNS to the backup database created before
🔸 If the upgrade succeeded, just remove the backup database
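The steps above can be sketched as a tiny simulation where "DNS" is just a dict mapping the stable alias to a concrete cluster (all names and the failure flag are illustrative):

```python
# Stable alias that applications use; the value is the real cluster behind it.
dns = {"orders-db": "orders-db-v14"}

def upgrade_with_fallback(dns: dict, alias: str, upgrade_ok: bool) -> str:
    current = dns[alias]
    # Created from a fresh production backup BEFORE touching production.
    backup = f"{current}-backup"
    if upgrade_ok:
        # Upgrade succeeded: the backup instance is no longer needed.
        return f"dropped {backup}"
    # Upgrade failed: point the stable alias at the pre-upgrade copy.
    dns[alias] = backup
    return f"switched {alias} to {backup}"

result = upgrade_with_fallback(dns, "orders-db", upgrade_ok=False)
```

Applications never learn the real cluster name, so the rollback is a single alias update.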

The main trick here is that you create the backup instance before any change to production, so in case of failure you can quickly switch the system back to a working state.
Of course, there is a delta between the database created from the backup and the production database. But in case of a real disaster it can be acceptable (of course, you need to check your RPO and RTO requirements, the allowed maintenance window, etc.).

The backup-based approach is much simpler than logical replication, can be used in different environments, and can give you additional guarantees, especially for major upgrades or huge data migrations.

#engineering #bluegreen #backups
Illustration for the described upgrade approach

#engineering #bluegreen #backups
Platform Speed vs Efficiency

Platform teams are becoming more and more popular. As a reminder, the idea behind them is very simple: move all common functionality to the platform so products can focus on business logic. It allows reusing the same features across products, avoiding implementing things twice, and saving development effort.

Sounds good, but this approach can lead to another issue: product teams generate more requirements than the platform team can implement, so the platform becomes a bottleneck for everyone.

This problem is described in the article Platforms should focus on speed, not efficiency by Jan Bosch:
Although the functionality in the platform might be common, very often product teams find out that they need a different flavor or that some part is missing. The product teams need to request this functionality to be developed by the platform team, which often gets overwhelmed by all the requests. The consequence is that everyone waits for the slow-moving platform team.


This description reflects my own observations: products want to get more and more features for free, the platform team is piled up with requests, and everything gets stuck.

To solve this problem, the author suggests focusing not on platform efficiency but on the speed of extending the platform with the functionality required by products.

He suggests 3 strategies to achieve that:
1. Make the platform optional. In that case the platform team is motivated to earn trust and solve real problems instead of optimizing their own efficiency.
2. Allow product teams to contribute to the platform code.
3. Merge product and platform. Instead of separating “platform” and “products,” create a shared codebase that contains all functionality.

From my experience, p.3 is not always possible, especially for a large codebase. This approach requires significant investment in build and CI infrastructure, which can be too expensive. But the other points look relevant, and they are often mentioned in other resources on platform engineering.

This article is part of a series called "Software Platforms 10 lessons", so I'm planning to read the other lessons soon.

#platformengineering #engineering
Software Platforms 10 lessons

As promised, I read the whole series of articles "Software Platforms 10 lessons" by Jan Bosch. While reading, I had several "Aha!" moments.
You know that feeling when you understand there is a problem but cannot clearly put it into words? The author brings those issues to the table and clears them up, one by one.

Let's check the lessons 🧑‍🎓:

1. Focus on speed, not efficiency. Delivery speed is more important than local efficiency. More details here.

2. Avoid the platform/product dichotomy. Treat them as one configurable system instead of two competing layers. The author suggests having a single codebase to speed up feature development and delivery.

3. Balance architecture and continuous testing. The number of configurations and connections between different parts of the system is so high that it's not possible (or too expensive) to test them all. The best approach here is a clean architecture with strong interfaces and decoupled functionality. It helps simplify testing and move most of the tests to the component level.

4. Don't integrate new functionality too early.
The idea is to provide new functionality as experimental, outside the platform, and include it in the main delivery only when there are active users.

5. Prefer customer-first over customer-unique. Reject functionality that will not be needed by any other customer; it's too expensive to support and maintain.

6. Control variability. Each variation point has a constant ‘tax’ to keep it working, so you need to regularly remove unused variations as part of technical debt management. Of course, that's not so easy if you don't know who uses your features and how. To solve that problem, the author suggests instrumenting the code to collect statistics on platform feature usage and making informed decisions about what to remove.

7. Optimize total cost of ownership. Reduce the cost of keeping features up and running; that leaves time for innovation. Otherwise, the whole R&D effort can be spent on supporting existing functionality only.

8. Instrument the platform for data-driven decisions.
More or less the same as p.6 about variability; this lesson is fully focused on the importance of platform instrumentation.

9. Be careful about opening up to 3rd parties. If you decide to open your platform to other vendors for extensions, you need to carefully manage request priorities. Remember that 3rd parties are focused on building and promoting their own business first.

10. Keep to one stakeholder at a time. Focus on building features to satisfy one group of stakeholders first, then go to the next.

At first glance, platform development looks easy. But in reality there are a lot of common pitfalls that prevent platforms from being really beneficial. I think these 10 lessons are a really good starting point for rethinking your platform development processes and making them more efficient.

#platformengineering #engineering
Machines, Learning, and Machine Learning

An absolutely great talk from the latest NDC Porto: Machines, Learning, and Machine Learning by Dylan Beattie. The author reflects on the current state of AI and tries to answer the most important question: will AI replace software developers in the near future? 😱

Some interesting points from the talk:
🔸 Software is predictable; reality is not. GenAI is non-deterministic by its nature.
🔸 AI doesn't really `think`. It predicts the most suitable tokens according to the provided context. But a lot of people who don't really understand how it works tend to think that current AI has some intellect (it doesn't).
🔸 Vibe coding makes no sense without a human in the middle. Generated code requires review and a person who will take responsibility for the end result.
🔸 Companies that invest in GenAI are interested in creating addiction to AI tools. As you don't know whether your prompt will produce the expected results, you will try more and more times to get what you want. It's very similar to a slot machine: sometimes you succeed, sometimes you don't.
🔸 The less someone understands how AI works, the more convinced they are that it will replace software engineers soon. Right now, it's mostly managers and journalists who talk about that.
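The "predicts tokens, doesn't think" point is easy to illustrate with a toy decoding step: turn scores into probabilities and sample one token, nothing more (the vocabulary and scores below are made up; real models do this over tens of thousands of tokens):

```python
import math
import random

def next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """One toy decoding step: softmax over scores, then weighted sampling."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())
    # Subtract the max for numerical stability before exponentiating.
    exp = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}
    # Sampling proportional to probability - no reasoning involved.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

random.seed(0)
token = next_token({"cat": 2.0, "dog": 1.0, "car": 0.1})
```

The randomness in the last step is exactly why the same prompt can give different answers, which feeds the slot-machine effect above.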

The overall idea is that the current generation of AI tools can be helpful as assistants and copilots, but they can't replace a real human (at least for software development 😉).

#ai #engineering
AI Reading List

I really like reading books, but what I enjoy even more is collecting lists of books to read in the future 😀.
Each time I see someone mention an interesting book, I put it on the list. I have lists for system design and architecture, management, leadership, communications, etc. Unfortunately, the lists grow much faster than my reading capacity.

Today I decided to share my GenAI reading list so you don’t have to wait for my reviews. The topic is booming, and these books can already bring you some value.

The list:
🔸 The AI-Driven Leader: Harnessing AI to Make Faster, Smarter Decisions by G. Woods
🔸 The Coming Wave: Technology, Power, and the Twenty-first Century's Greatest Dilemma by Mustafa Suleyman
🔸 Nexus: A Brief History of Information Networks from the Stone Age to AI by Yuval Noah Harari
🔸 The Alignment Problem: Machine Learning and Human Values by Brian Christian
🔸 Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World by Marco Iansiti
🔸 Human Robot Agent: New Fundamentals for AI-Driven Leadership with Algorithmic Management by Jurgen Appelo
🔸 Agentic Artificial Intelligence: Harnessing AI Agents to Reinvent Business, Work and Life by Pascal Bornet
🔸 How AI Ate the World by Chris Stokel-Walker
🔸 Ingrain AI: Strategy through Execution - The Blueprint to Scale an AI-first Culture by John Munsell
🔸 AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference by Arvind Narayanan

Enjoy! 😎

#booknook
British Post Office Scandal

Have you ever thought about how much we rely on information systems? And how much a mistake can cost?

The British Post Office scandal is one of the most dramatic examples of faulty software. It's also often used in educational programs to show how dangerous technology can be.

So what happened?

In 1999, the UK Post Office launched the Horizon system to automate accounting and stocktaking. It was a large project to deliver the system across all branches in the country. Management reported that the program was a great success.

From the first days, the system showed that some sub-postmasters' accounts had shortfalls. Between 1999 and 2015, around 900 employees were convicted of crimes like theft, fraud, or false accounting.

So it looks like the system surfaced existing organizational problems, right?

Unfortunately, no. The system had critical defects that caused it to show false shortfalls. The situation was even worse because the vendor and management knew about these problems:
Fujitsu was aware that Horizon contained software bugs as early as 1999, the Post Office insisted that Horizon was robust and failed to disclose knowledge of the faults in the system during criminal and civil cases.


Most of the convictions were quashed only in 2021. By then, many lives had already been broken by bankruptcy, stress, illness, and even suicide.

In an era when everyone wants to build the next Apple, Google, or Facebook, it's important to study failures. Software development requires taking real responsibility for what we are doing, and a culture where speaking up about problems is valued, not punished.

#offtop #usecase #failures
Continuous Integration Visualization Technique (CIViT)

Unit testing, component testing, E2E testing... There are different types of tests that we use to check product quality. But are they enough? Are you sure?

CIViT is a model intended to visualize testing activities and identify bottlenecks and missing test coverage. It maps what is tested, when, and at what automation level.
CIViT defines the following types of testing activities:
🔸 Functionality (F). This type checks the functionality that is under development.
🔸 Legacy Functionality (L). This is regression testing that verifies that previously delivered features are not broken.
🔸 Quality Attributes (Q). These are non-functional tests like performance, reliability, security, etc.
🔸 Edge Cases (E). This type covers unusual or unexpected usage scenarios, typically discovered from defects.

Each type of testing is measured by automation level (manual, partial, or full), execution frequency, and coverage (fully covered, partially covered, or not covered). The model often uses colors to indicate the state: e.g., green = fully automated or fully covered, orange = partial, red = manual or not covered.
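A CIViT map can be kept as a simple matrix of (feedback loop, activity type) cells; a sketch with made-up data for one hypothetical team:

```python
# Each cell holds the coverage state of an activity type (F, L, Q, E)
# in one feedback loop, using the usual CIViT states.
civit = {
    "per-commit":  {"F": "full", "L": "partial", "Q": "none",    "E": "none"},
    "nightly":     {"F": "full", "L": "full",    "Q": "partial", "E": "none"},
    "per-release": {"F": "full", "L": "full",    "Q": "partial", "E": "partial"},
}

def gaps(civit: dict) -> list[tuple[str, str]]:
    """Cells with no coverage at all - candidates for testing investment."""
    return [(loop, activity) for loop, cell in civit.items()
            for activity, state in cell.items() if state == "none"]

missing = gaps(civit)
```

Even this crude table immediately answers questions like "where do we have no edge-case tests at all?".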

The model can help to answer the following questions:
🔸 Which level of testing requires the most manual effort?
🔸 Are there any long feedback loops (e.g., regression tests only once per release) causing risk?
🔸 Are there missing tests for quality attributes or edge cases?
🔸 Is there some duplication in testing?
🔸 Where should you invest to increase reliability and improve quality?
 
Actually, I cannot say that the visual representation of the model is really clear (check the next post for samples); probably that's why the model never became popular. But I think it's a good and helpful tool for auditing your testing activities.

#engineering #testing #ci
Visual explanation of the CIViT model and a full diagram sample.

Full diagram source: Boost your digitalization: build and test infrastructure

#engineering #testing #ci