TechLead Bits

About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
Adizes Leadership Styles

Have you ever worked with a leader who could quickly launch any initiative but was terrible at organizing any process around it? Or with someone who could bring order to any chaos but couldn't drive change? According to Dr. I. Adizes, these are different styles of leadership.

Adizes defines the following styles:
🔸 Producer. Focuses on WHAT should be done (product, services, some KPIs).
🔸 Administrator. Focuses on HOW it should be done (processes, methodologies).
🔸 Entrepreneur. Focuses on WHY and WHEN it should be done (changes, initiatives, new opportunities).
🔸 Integrator. Focuses on WITH WHOM it should be done (people).

The model is called PAEI, after the first letters of these types.

The main idea is that most of us embody one or two of these styles and may pick up some elements of a third, but nobody can master all of them at once. The styles are shaped by personality type, experience and the situation.

It's good to know your own type and the type of your manager. Different types of managers speak different languages and care about different things. This is especially important when communicating with your direct manager.

I first encountered this model more than 10 years ago and identified myself as a strong Integrator. Recently I took the tests again and found my main styles are now Entrepreneur and Administrator. So what does that mean? The situation (position), tasks and experience have changed.

If you want to dig deeper, I recommend reading Management/Mismanagement Styles by I. Adizes. The author has written many more books, but they cover more or less the same ideas.

#booknook #softskills #leadership
AWS Aurora Stateful Blue-Green

Most blue-green implementations I know of deal with stateless services. Stateful services rarely become the subject of blue-green deployments. The main reason is the high cost of copying production data between versions.

But in some cases we need strong guarantees that the next upgrade will not break production. AWS tried to solve this problem by introducing Aurora blue-green deployments.

The marketing part sounds good:
You can make changes to the Aurora DB cluster in the green environment without affecting production workloads... When ready, you can _switch over_ the environments to transition the green environment to be the new production environment. The switchover typically takes under a minute with no data loss and no need for application changes.


But let's check how it works:
🔸 A new Aurora cluster with the same configuration and topology is created. It has a new service name with a green- prefix.
🔸 The new cluster can have a higher engine version and a different set of database parameters than the production cluster.
🔸 Logical replication is established between the production cluster and the new green cluster.
🔸 The green cluster is read-only by default. Enabling write operations can cause replication conflicts.
🔸 Once the green database is tested, you can perform a switchover. The names and endpoints of the current production environment are assigned to the newly created cluster.
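
For reference, here is a minimal sketch of driving this flow with boto3 (the cluster ARN, engine version and parameter group below are placeholder values; double-check the parameters against the current RDS API docs before relying on them):

```python
import boto3

rds = boto3.client("rds")

# Create the green environment: a copy of the production (blue) cluster
# kept in sync via logical replication.
deployment = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="aurora-pg-upgrade",  # placeholder name
    Source="arn:aws:rds:eu-west-1:123456789012:cluster:prod-cluster",  # placeholder ARN
    TargetEngineVersion="15.4",  # the version you want to test
    TargetDBClusterParameterGroupName="prod-aurora-pg15",  # placeholder parameter group
)
bg_id = deployment["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# ... test the green cluster here ...

# Switchover: the green cluster takes over the production names and endpoints.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,  # max time in seconds allowed for the switchover
)
```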

In simple words, it's just a new cluster that copies data from production using logical replication. As a result, it inherits all the restrictions of that feature: DDL operations are not replicated, large objects are not replicated, extension support is limited, and so on. So you need to be very careful when deciding to use this approach.

To me, this solution looks suitable only for fairly basic scenarios with simple data types. Anything more complex won't work.

#engineering #systemdesign #bluegreen
Stateful Service Upgrade Strategy

Last time I wrote about the AWS Aurora stateful blue-green approach. Despite its limitations, it provides a good pattern that can make the upgrade procedure safer even for on-prem installations.

For simplicity, let's focus on a database example, but the idea applies to any stateful service.

The simplified approach is the following:
🔸 Hide the real cluster name behind a dedicated DNS name (in AWS it's Route53, in Kubernetes it can simply be a service name)
🔸 Perform a backup of the production database
🔸 Restore the production backup as a new database instance
🔸 Execute the upgrade of the production cluster (or any other potentially dangerous operation)
🔸 If the upgrade fails, switch DNS to the backup database created before
🔸 If the upgrade succeeds, just remove the backup database

The main trick is that you create the backup instance before any change to production, so in case of failure you can quickly switch the system back to a working state.
Of course, there is a delta between the database created from a backup and the production database. But in case of a real disaster that can be acceptable (you still need to check your RPO and RTO requirements, the allowed maintenance window, etc.).
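
To make the flow concrete, here is a minimal orchestration sketch. All the helpers (backup_database, restore_backup, run_upgrade, switch_dns, drop_instance) are hypothetical placeholders that you would wire to your own backup tooling and to Route53 or a Kubernetes Service:

```python
class UpgradeFailed(Exception):
    """Raised by run_upgrade() when the upgrade did not complete cleanly."""

# Hypothetical helpers: implement them on top of your backup and DNS tooling.
def backup_database(cluster: str) -> str: ...
def restore_backup(backup_id: str) -> str: ...
def run_upgrade(cluster: str) -> None: ...
def switch_dns(dns_name: str, target: str) -> None: ...
def drop_instance(instance: str) -> None: ...

def safe_upgrade(prod_cluster: str, dns_name: str) -> None:
    """Upgrade a stateful service while keeping a pre-created fallback instance."""
    backup_id = backup_database(prod_cluster)   # 1. snapshot production first
    fallback = restore_backup(backup_id)        # 2. restore it as a standby instance
    try:
        run_upgrade(prod_cluster)               # 3. the risky operation
    except UpgradeFailed:
        switch_dns(dns_name, target=fallback)   # 4a. failure: point DNS at the fallback
        raise
    else:
        drop_instance(fallback)                 # 4b. success: the fallback is no longer needed
```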

The backup-based approach is much simpler than logical replication, can be used in different environments, and gives you additional guarantees, especially for major upgrades or huge data migrations.

#engineering #bluegreen #backups
Illustration for the described upgrade approach

#engineering #bluegreen #backups
Platform Speed vs Efficiency

Platform teams are becoming more and more popular. As a reminder, the idea behind them is very simple: move all common functionality to the platform so products can focus on business logic. This allows teams to reuse the same features across products, avoid implementing things twice and save development effort.

Sounds good, but this approach can lead to another issue: product teams generate more requirements than the platform team can implement, so the platform becomes a bottleneck for everyone.

This problem is described in the article Platforms should focus on speed, not efficiency by Jan Bosch:
Although the functionality in the platform might be common, very often product teams find out that they need a different flavor or that some part is missing. The product teams need to request this functionality to be developed by the platform team, which often gets overwhelmed by all the requests. The consequence is that everyone waits for the slow-moving platform team.


This description reflects my own observations: products want to get more and more features for free, the platform team is piled up with requests, and everything gets stuck.

To solve this problem, the author suggests focusing not on platform efficiency but on the speed of extending the platform with the functionality products require.

He suggests 3 strategies to achieve that:
1. Make platform optional. In that case the platform team is motivated to earn trust and solve real problems instead of optimizing their own efficiency.
2. Allow product teams to contribute to the platform code.
3. Merge product and platform. Instead of separating “platform” and “products,” create a shared codebase that contains all functionality.

From my experience, point 3 is not always possible, especially for a large codebase. This approach requires significant investment in build and CI infrastructure, which can be too expensive. But the other points look relevant, and they are often mentioned in other resources on platform engineering.

This article is part of the series called "Software Platforms 10 lessons", so I'm planning to read the other lessons soon.

#platformengineering #engineering
Software Platforms 10 lessons

As promised, I read the whole series of articles "Software Platforms 10 lessons" by Jan Bosch. While reading, I had several "Aha!" moments.
You know that feeling when you sense there is a problem but cannot clearly put it into words? The author brings those issues to the table and clears them up, one by one.

Let's check the lessons 🧑‍🎓:

1. Focus on speed, not efficiency. Delivery speed is more important than local efficiency. More details in the previous post.

2. Avoid the platform/product dichotomy. Treat them as one configurable system instead of two competing layers. The author suggests having a single codebase to speed up feature development and delivery.

3. Balance architecture and continuous testing. The number of configurations and connections between different parts of the system is so high that it's not possible (or too expensive) to test them all. The best approach here is a clean architecture with strong interfaces and decoupled functionality. It helps simplify testing and move most of the tests to the component level.

4. Don't integrate new functionality too early.
The idea is to provide new functionality as experimental, outside the platform, and include it in the main delivery only when there are active users.

5. Prefer customer-first over customer-unique. Reject functionality that won't be needed by any other customers; it's too expensive to support and maintain.

6. Control variability. Each variation point carries a constant ‘tax’ to keep it working, so you need to regularly remove unused variations as part of technical debt management. Of course, that's not so easy if you don't know who uses your features and how. To solve this, the author suggests instrumenting the code to collect statistics on platform feature usage and making informed decisions about what to remove (see the sketch after this list).

7. Optimize total cost of ownership. Reduce the cost of keeping features up and running; that leaves time for innovation. Otherwise the whole R&D effort can be spent on just supporting existing functionality.

8. Instrument the platform for data-driven decisions.
More or less the same as point 6 about variability; this lesson fully focuses on the importance of platform instrumentation.

9. Be careful about opening up to 3rd parties. If you decide to open your platform to other vendors for extensions, you need to carefully manage request priorities. Remember that 3rd parties are focused on building and promoting their own business first.

10. Keep to one stakeholder at a time. Build features to satisfy one group of stakeholders first, then go to the next.

At first glance, platform development looks easy. But in reality there are a lot of common pitfalls that prevent platforms from being really beneficial. I think these 10 lessons are a really good starting point to rethink your platform development processes and make them more efficient.
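
As an illustration of lessons 6 and 8, here is a minimal sketch of what such instrumentation could look like (the decorator, names and in-memory counter are my own assumptions, not something from the articles; in production you would report to a metrics backend):

```python
import functools
from collections import Counter

feature_usage = Counter()  # in real life: a metrics backend (Prometheus, StatsD, etc.)

def track_usage(feature: str):
    """Count invocations of a platform feature so unused variation points can be spotted."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            feature_usage[feature] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator

@track_usage("exports.csv")
def export_csv(rows):  # a hypothetical platform variation point we want statistics for
    ...

# After some time in production, feature_usage shows which variation points
# are actually used and which ones are candidates for removal.
```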

#platformengineering #engineering
Machines, Learning, and Machine Learning

An absolutely great talk from the latest NDC Porto: Machines, Learning, and Machine Learning by Dylan Beattie. The author reflects on the current state of AI and tries to answer the most important question: will AI replace software developers in the near future? 😱

Some interesting points from the talk:
🔸 Software is predictable, reality is not. GenAI is not deterministic by nature.
🔸 AI doesn't really `think`. It predicts the most suitable tokens for the provided context. But many people who don't really understand how it works tend to believe that current AI has some intellect (it doesn't).
🔸 Vibe coding makes no sense without a human in the loop. Generated code requires review and a person who takes responsibility for the end result.
🔸 Companies that invest in GenAI are interested in creating an addiction to AI tools. As you don't know whether your prompt will produce the expected results, you try again and again to get what you want. It's very similar to a slot machine: sometimes you succeed, sometimes you don't.
🔸 The less someone understands how AI works, the more convinced they are that it will replace software engineers soon. Right now, it's mostly managers and journalists who talk about that.

The overall idea is that the current generation of AI tools can be helpful as assistants and copilots, but they can't replace a real human (at least in software development 😉).

#ai #engineering
AI Reading List

I really like reading books, but what I enjoy even more is collecting lists of books to read in the future 😀.
Every time I see someone mention an interesting book, I put it on the list. I have lists for system design and architecture, management, leadership, communications, etc. Unfortunately, the lists grow much faster than my reading capacity.

Today I decided to share my GenAI reading list so you don’t have to wait for my reviews. The topic is booming, and these books can already bring you some value.

The list:
🔸 The AI-Driven Leader: Harnessing AI to Make Faster, Smarter Decisions by G. Woods
🔸 The Coming Wave: Technology, Power, and the Twenty-first Century's Greatest Dilemma by Mustafa Suleyman
🔸 Nexus: A Brief History of Information Networks from the Stone Age to AI by Yuval Noah Harari
🔸 The Alignment Problem: Machine Learning and Human Values by Brian Christian
🔸 Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World by Marco Iansiti
🔸 Human Robot Agent: New Fundamentals for AI-Driven Leadership with Algorithmic Management by Jurgen Appelo
🔸 Agentic Artificial Intelligence: Harnessing AI Agents to Reinvent Business, Work and Life by Pascal Bornet
🔸 How AI Ate the World by Chris Stokel-Walker
🔸 Ingrain AI: Strategy through Execution - The Blueprint to Scale an AI-first Culture by John Munsell
🔸 AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference by Arvind Narayanan

Enjoy! 😎

#booknook
British Post Office Scandal

Have you ever thought about how much we rely on information systems? And how much a mistake can cost?

The British Post Office scandal is one of the most dramatic examples of faulty software. It's also often used in educational programs to show how dangerous technology can be.

So what happened?

In 1999, the UK Post Office launched the Horizon system to automate accounting and stocktaking. It was a large project to roll the system out across all branches in the country. Management reported the program as a great success.

From the first days, the system showed shortfalls in some sub-postmasters' accounts. Between 1999 and 2015, around 900 sub-postmasters were convicted of crimes like theft, fraud, or false accounting.

So it looks like the system just revealed existing organizational problems, right?

Unfortunately, no. The system had critical defects that produced false shortfalls. The situation was even worse because the vendor and management knew about these problems:
Fujitsu was aware that Horizon contained software bugs as early as 1999, the Post Office insisted that Horizon was robust and failed to disclose knowledge of the faults in the system during criminal and civil cases.


Most of the convictions were quashed only in 2021. By that time, many lives had already been broken by bankruptcy, stress, illness, and even suicide.

In an era when everyone wants to build the next Apple, Google, or Facebook, it's important to study failures. Software development requires taking real responsibility for what we do, and a culture where speaking up about problems is valued, not punished.

#offtop #usecase #failures
Continuous Integration Visualization Technique (CIViT)

Unit testing, component testing, E2E testing... We use different types of tests to check product quality. But is it enough? Are you sure?

CIViT is a model intended to visualize testing activities and to identify bottlenecks and missing test coverage. It maps what is tested, when, and at what automation level.

CIViT defines the following types of testing activities:
🔸 Functionality (F). This type checks functionality that is under development.
🔸 Legacy Functionality (L). This is regression testing that verifies previously delivered features are not broken.
🔸 Quality Attributes (Q). These are non-functional tests like performance, reliability, security, etc.
🔸 Edge Case (E). This type covers unusual or unexpected usage scenarios, typically discovered from defects.

Each type of testing is measured by automation level (manual, partial or full), execution frequency and coverage (fully covered, partially or not covered). The model often uses colors to indicate the state: e.g., green = fully automated or fully covered; orange = partial; red = manual or not covered.
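
The original model is a visual grid, but to give a feeling of the dimensions, here is how the same information could be captured as data (the field names and the bottleneck check are my own illustration, not part of CIViT):

```python
from dataclasses import dataclass

@dataclass
class TestActivity:
    kind: str        # "F" functionality, "L" legacy, "Q" quality attributes, "E" edge cases
    scope: str       # e.g. "component", "subsystem", "full system"
    automation: str  # "manual" | "partial" | "full"
    frequency: str   # e.g. "per commit", "nightly", "per release"
    coverage: str    # "none" | "partial" | "full"

activities = [
    TestActivity("F", "component", "full", "per commit", "full"),
    TestActivity("L", "full system", "partial", "per release", "partial"),
    TestActivity("Q", "full system", "manual", "per release", "none"),
]

# Long feedback loops and manual work stand out immediately.
bottlenecks = [a for a in activities
               if a.automation == "manual" or a.frequency == "per release"]
```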

The model can help to answer the following questions:
🔸 Which level of testing requires the most manual effort?
🔸 Are there any long feedback loops (e.g., regression tests only once per release) causing risk?
🔸 Are there missing tests for quality attributes or edge cases?
🔸 Is there some duplication in testing?
🔸 Where should you invest to increase reliability and improve quality?
 
Honestly, I cannot say the visual representation of the model is really clear (check the next post for samples); that's probably the reason the model didn't become popular. But I think it's a good and helpful tool for auditing your testing activities.

#engineering #testing #ci
Visual explanation of CIViT model and full diagram sample.

Full diagram source: Boost your digitalization: build and test infrastructure

#engineering #testing #ci
Total Cost of Ownership: The Theory

Do you know the real cost of your features?
We usually focus on initial development costs and, at best, future maintenance costs. But the reality is much more complex.
The Total Cost of Ownership (TCO) model was created to address this complexity.

TCO is the overall cost of a product or service throughout its entire life cycle:
- Direct costs: easily measurable and directly tied to software development, deployment, and maintenance.
- Indirect costs: harder to quantify, but they significantly impact business operations and profitability over time.

Direct costs:
🔸 Initial Development: costs of planning, designing, coding, testing, and deploying the software.
🔸 Infrastructure & Hardware: the cost of on-premise or public cloud infrastructure, equipment, storage and network; it depends on deployment models, scalability patterns and workload predictability.
🔸 Licensing & 3rd party services: costs of 3rd party services (proprietary or open-source), subscriptions, collaboration platforms like Confluence, Jira, Zoom, etc.
🔸 Compliance & Security: expenses for regulatory compliance, certifications, data protection, cybersecurity measures.
🔸 Maintenance & Support: regular updates, performance optimizations, security patching, incident management, bugfixing, customer support and helpdesk services.
🔸 Operations: everything required to keep the system up and running: personnel, data storage and backups, observability, reliability, usability.

Indirect costs:
🔸 Training & Documentation: user training, preparing documentation and keeping it up to date.
🔸 Scaling: costs of scaling the infrastructure and implementing performance optimizations as the business grows.
🔸 Future Enhancements: costs of adding new features and expanding existing ones, plus technical debt management.
🔸 Downtime: the cost of a system outage or data loss. It’s important to calculate all potential damages: lost productivity, users, money.
🔸 Upgrades: regular production upgrades and codebase maintenance, additional costs in case of technology license change or end of technology support.

The main goal of the model is to provide a comprehensive view of all costs associated with a software product. This value can then be used for budgeting, ROI and business value calculations, and strategic decision-making.

#engineering #costoptimization
TCO: How to Sell Your Tech Debt

Now that we know what Total Cost of Ownership is, let's see whether it can be useful in practice.

Let's take technical debt as an example and imagine we need to get a budget approved for it.
Inputs:
✏️ The system consists of 15 microservices, including 1 core service with basic functionality shared across the others.
✏️ At least 6 microservices have direct dependencies on the logic of this core service.
✏️ The core service carries huge technical debt that slows down overall feature development. At least 5 features per quarter impact the core service.
✏️ 1 engineer day = $5,000 per month / 20 working days = $250
✏️ The team estimates around 100 working days for the refactoring: 100 × $250 ≈ $25K

Cost of ownership:

Initial development:
🔸 Average cost per feature: 10 days = 10 × $250 = $2,500
🔸 Without the refactoring, development is 2x slower, so the average cost per feature becomes $5,000
🔸 Yearly extra development cost = $2,500 × 5 features/quarter × 4 quarters = $50K

Maintenance & Support:
🔸 Average time to resolve a bug: 1.5 days = $375
🔸 Bugfixing also slows down 2x, so the average cost per bug becomes 3 days × $250 = $750 (an extra $375 per bug)
🔸 Yearly extra support cost = $375 × 10 bugs/month × 12 months = $45K

Downtime:
🔸 Tech debt in the core service increases the deployment failure rate, incident duration and rollback frequency (if you have real statistics from your project, use them in the calculation)
🔸 1 critical incident per year ≈ $30K

Upgrades:
🔸 Upgrades become riskier and require an extra hour of maintenance window
🔸 That caps maximum availability at 99.5% (which may not satisfy the overall system requirements)

Summary:
🔸 The tech debt overhead costs the company around $95K yearly, adds incident risk and reduces overall availability.
🔸 Cost of the refactoring: $25K
🔸 Payback time: ~3 months
🔸 The refactoring is more beneficial than just continuing to ship features without it.

As you can see the overall strategy is simple: minimize TCO and maximize business benefits.
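
If you want to play with the numbers, the whole calculation fits in a few lines (the figures are just the assumptions from the example above):

```python
ENGINEER_DAY = 250  # $ per engineer day (from the inputs above)

# Extra cost caused by the tech debt (2x slowdown)
extra_per_feature = 10 * ENGINEER_DAY      # each feature costs 10 extra days
extra_dev = extra_per_feature * 5 * 4      # 5 affected features/quarter -> $50K/year
extra_per_bug = 1.5 * ENGINEER_DAY         # each bug takes 1.5 extra days
extra_support = extra_per_bug * 10 * 12    # 10 bugs/month -> $45K/year
incident_risk = 30_000                     # ~1 critical incident per year

yearly_overhead = extra_dev + extra_support        # $95K, excluding the incident risk
refactoring_cost = 100 * ENGINEER_DAY              # ~$25K
payback_months = refactoring_cost / (yearly_overhead / 12)

print(f"overhead: ${yearly_overhead:,.0f}/year (+${incident_risk:,} incident risk)")
print(f"payback: ~{payback_months:.1f} months")    # ~3.2 months
```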

Maybe the approach looks a bit complex, but the common problem between technical experts and management is that they speak different languages.
Business talks in the language of money. Business is not interested in engineering best practices, clean architecture or the size of your technical debt.
But we can use tools like TCO to unify the language, make communication much more productive and, of course, get the required budget 😉.

#engineering #leadership #costoptimization
The Evolution of SRE at Google

Google not only pioneered SRE practices, they also constantly improve their SRE approach to keep Google's large-scale production up and running. SLOs/SLIs, error budgets, isolation strategies and postmortems are all well-known reliability tools.

But Google's team went beyond that and adopted systems theory and control theory. The main idea is to focus on the complex system as a whole rather than on individual elements and their failures.

The new approach is based on the System-Theoretic Accident Model and Processes (STAMP) framework. In complex systems, most accidents are the result of interactions between components that are all functioning well but collectively produce an unsafe state. STAMP shifts the analysis from a traditional linear chain of failures (root cause) to a "control problem" (I think STAMP itself deserves a separate overview).

From a very basic perspective, it analyzes the system in terms of control-feedback loops to find issues both in the control path and in the feedback path. The feedback path is usually less well understood, but it is just as important as the control path from a system safety perspective.

For example, imagine a "resizer" service that changes resource quotas according to resource usage. The logic responsible for calculating resource usage and delivering those values to the resizer is probably even more critical than the logic that changes the quota itself.
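
A toy sketch of that control loop (not Google's implementation, just to show why the feedback path deserves as much attention as the control path; get_usage, set_quota and alert are placeholders):

```python
import time

def resize_loop(get_usage, set_quota, alert, min_quota=1, max_quota=100):
    """Toy control loop: adjust a resource quota based on a usage feedback signal.

    get_usage() returns (utilization in [0..1], measurement timestamp);
    set_quota() and alert() stand in for the real control and alerting paths.
    """
    quota = min_quota
    while True:
        utilization, measured_at = get_usage()            # feedback path
        # Guard the feedback path: a stale or implausible measurement is a hazard
        # of its own, even if the quota-changing logic below is perfectly correct.
        if time.time() - measured_at > 60 or not 0.0 <= utilization <= 1.0:
            alert("resizer: feedback is stale or invalid, keeping the current quota")
        else:
            target = round(quota * utilization / 0.7)     # aim at ~70% utilization
            quota = max(min_quota, min(max_quota, target))
            set_quota(quota)                              # control path
        time.sleep(30)
```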

The Google SRE team performs such analysis regularly for their global services. Even the first results revealed a set of scenarios that could potentially produce hazards in the future. But since they are "predicted" before real incidents happen, the improvements can be planned without rush, like any other feature.

Overall, I think the new approach is a really great example of system theory in practice. Moreover, I’ve discovered there’s a whole area of reliability theory behind this that I plan to explore in more detail in the future.

#engineering #reliability
STAMP Framework

As I wrote previously, I think the STAMP framework deserves its own overview.

The framework was introduced by MIT professor Nancy Leveson in the book "Engineering a Safer World" in 2011.

STAMP (System-Theoretic Accident Model and Processes) is a functional model of controllers that interact with each other through control actions and feedback:
- The system is treated as a control system.
- The control system consists of hierarchical control-feedback loops.
- The control system enforces safety constraints and prevents accidents.

STAMP is based on 2 main methodologies: Systems-Theoretic Process Analysis (STPA) and Causal Analysis Using System Theory (CAST).

Systems-Theoretic Process Analysis (STPA) is a hazard analysis technique performed at the design stage of development (proactive analysis); see the small data sketch right after the steps:
🔸 Define the purpose. For example, meet RPO/RTO requirements, prevent data loss, meet GDPR requirements, etc.
🔸 Model the control structure. Describe and document interactions between the key components for this type of hazard.
🔸 Identify unsafe control actions. For example, deploying a new version before testing, failing to scale up during high load, routing traffic to an unhealthy backend, etc.
🔸 Identify loss scenarios. Find scenarios that could lead to unsafe actions. For example, a failed update causes a service outage, broken autoscaling leads to dropped users, etc.
🔸 Define safety constraints. Create controls and design changes that prevent unsafe control actions. For example, every deployment must have a rollback strategy, service load must be monitored and alerts must fire if resource usage exceeds 80%, etc.
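
A tiny sketch of how the STPA artifacts could be captured for the deployment example (the structure and field names are mine, not part of the standard):

```python
from dataclasses import dataclass, field

@dataclass
class UnsafeControlAction:
    action: str         # the control action that can become unsafe
    context: str        # when / under which conditions it is unsafe
    loss_scenario: str  # what happens if it occurs
    constraint: str     # the safety constraint that prevents it

@dataclass
class StpaAnalysis:
    purpose: str
    control_structure: str  # reference to the documented control model
    unsafe_actions: list[UnsafeControlAction] = field(default_factory=list)

deploy_analysis = StpaAnalysis(
    purpose="Prevent data loss during upgrades (meet RPO/RTO requirements)",
    control_structure="docs/control-loops/deployment.md",  # placeholder reference
    unsafe_actions=[
        UnsafeControlAction(
            action="deploy a new version",
            context="before the test suite has passed",
            loss_scenario="a failed update causes a service outage",
            constraint="every deployment must have a tested rollback strategy",
        ),
    ],
)
```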

Causal Analysis Using System Theory (CAST) is an accident investigation method performed after an incident has occurred (reactive analysis):
🔸 Collect information about the incident.
🔸 Model control structure.
🔸 Analyze each component involved in the loss. Determine why the component didn't prevent the incident.
🔸 Identify control structure flaws: communication and coordination, safety management, culture, environment, etc.
🔸 Create an improvement program. Prepare recommendations for changes that prevent a similar loss in the future.

To sum up, the STAMP framework suggests enforcing safety constraints instead of just trying to prevent component failures. What I really like about this approach is that it incorporates reliability into the system design itself.

P.S. If the topic sounds interesting to you, the "Engineering a Safer World" book is available for free from MIT Press.

#engineering #reliability #systemdesign
Three Layers of Architectural Debt

Technical debt is a very popular topic in software development. Architectural debt is less popular, but I've seen more and more interest in it over the last year.

Today I want to share the article Architectural debt is not just technical debt by Eoin Woods. The author highlights the importance of identifying and managing architectural debt as it can impact the whole organization.

The author defines architectural debt as "structural decisions that come back to bite you six months later". It's not very scientific 😃, but it gives a good feel for what it is.

According to the article, this debt can be split into 3 layers by its impact: the application layer, the business layer and the strategy layer (see the attached picture with the model in the next post).

Application Layer
This is the level of a particular service, its integrations and technologies. Problems here are easy to detect; they directly impact delivery time and day-to-day operations.

Business Layer
This is the level of organizational structure and team topologies, defined ownership and stewardship. Poorly designed structures produce heavy communication flows (Conway's Law, remember?) that can impact the overall system architecture, duplicate functionality, and create conflicts of interest between teams. Issues here multiply the issues on the operational side.

Strategy Layer
Debt at this level may impact the whole organization. A single strategic misstep creates a cascade of misalignment that amplifies at each level:
Strategy debt (wrong capability decisions) -> Wrong Business Assumptions -> Faulty Requirements -> Technical Issues -> Operational Chaos


The architect's responsibility here is to raise a red flag, describe the debt with AS-IS and TO-BE states, and explain to the business the risks of not handling it.

One more important idea from the article is that modern architecture cannot be the responsibility of one person or a small group of people. To be successful, architecture should be a shared activity of understanding and learning, guided by common principles. And at that point it's more about company culture than technical knowledge:
Trust and curiosity allow principles to live and decisions to evolve. This turns architecture from a static artefact into an ongoing activity.


#architecture
ML Nested Learning Approach

Recently Google published quite interesting research: "Introducing Nested Learning: A new ML paradigm for continual learning".
Let's check what it is about and why it can be interesting for the ML community.

Modern LLMs tend to lose performance on old tasks while learning new ones. This effect is called catastrophic forgetting (CF). Researchers and data scientists spend a significant amount of time inventing architectural tweaks or better optimizers to deal with it.

The authors see the root cause of this issue in treating the model's architecture and its optimization algorithm as two different things.

Nested Learning treats an ML model as a system of interconnected, multi-level learning problems that are optimized simultaneously. In that view, the model's architecture and the rules used to train it are fundamentally the same concept.

The overall idea is heavily based on the associative memory concept: the ability to map and recall one thing based on another (like recalling a name when you see someone's face):
🔸 The training process is modeled as an associative memory. The model learns to map a given data point to the value of its local error, which serves as a measure of how "surprising" or unexpected that data point was.
🔸 Key architectural components (like transformers) are also formalized as simple associative memory modules that learn the mapping between tokens in a sequence.
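
Just to illustrate the associative-memory idea itself, here is a textbook linear associative memory with a delta-rule update (a toy example of mine, not the architecture from the paper):

```python
import numpy as np

class LinearAssociativeMemory:
    """Toy associative memory: learns to map key vectors to value vectors."""

    def __init__(self, dim: int, lr: float = 0.1):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def recall(self, key: np.ndarray) -> np.ndarray:
        return self.W @ key

    def update(self, key: np.ndarray, value: np.ndarray) -> float:
        error = value - self.recall(key)           # local error = "surprise" for this pair
        self.W += self.lr * np.outer(error, key)   # delta-rule update
        return float(np.linalg.norm(error))        # how unexpected the data point was

mem = LinearAssociativeMemory(dim=4)
face = np.array([1.0, 0.0, 0.0, 0.0])   # "someone's face"
name = np.array([0.0, 1.0, 0.0, 0.0])   # "their name"
for _ in range(50):
    surprise = mem.update(face, name)    # surprise shrinks as the mapping is learned
print(mem.recall(face).round(2))         # ~ the stored "name" vector
```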

Proof-of-concept tests show superior performance in language modeling and better long-context memory management than in existing models (you can check the measurements in the article or in the full text of the research).

ML keeps evolving. What’s interesting is that the best architecture ideas come from systems theory and attempts to copy human brain behavior.

#ai #architecture
What Changes as You Grow

When I was first an engineer and then a team lead, I used to think that my managers and senior architects always knew what to do and how to do it. I could ask them for direction, bring them problems and request help (I've been lucky to have really great managers throughout my career). And, of course, that was because they were super smart, experienced and kind of "grown-up".

What I've realized is that at every level there are people just like us, with their own problems and fears, who also may not know what to do. Moreover, they also make mistakes.

The key difference is the ability to keep working in situations with a high level of uncertainty: good leaders stay focused, accept risks, take responsibility, choose a way forward and engage the right people.

And this is where we can really help our leaders: with the right details and expertise, new ideas and suggestions. Believe me, it will be really appreciated. I say this as someone who's now expected to always know what to do for their team 😉.

#selfreflection #offtop