TechLead Bits
About software development with common sense.
Thoughts, tips and useful resources on technical leadership, architecture and engineering practices.

Author: @nelia_loginova
I can't stop writing about Pixar. I accidentally opened a non-existent page, and it's the best 404 page I've seen 😍.
Observability 2.0

There is a very charismatic talk by Charity Majors, co-author of "Observability Engineering", called "Is it time for Observability 2.0?". Sounds intriguing, so let's check what's inside.

Key ideas:
📍 Observability 1.0. This is the traditional observability approach based on 3 pillars: logs, metrics and traces. It's complex, expensive and requires skilled engineers to analyze correlations between different sources of data.
📍 Observability 1.0 has significant limitations:
- static dashboards
- lack of important details
- designed to answer pre-defined questions
- multiple sources of truth
- multiple systems to support
📍 The Observability 2.0 paradigm is based on the idea of using wide structured logs that contain all necessary information. This makes it easy to aggregate data and zoom in and out for details when needed.
📍 Observability 2.0 relies on a single source of truth - logs; everything else is just visualization, aggregation and dynamic querying. There is no data duplication, and there is no need to install and maintain a separate set of tools for telemetry. In those terms it's cheaper.

Implementation tips:
📍 Instrument the code using the principles of Canonical Logs (we already covered the concept there); see the sketch after this list
📍 Add Trace IDs and Span IDs to trace request chain execution
📍 Feed the data into a columnar store to move away from predefined schemas and indexes
📍 Use a storage engine that supports high cardinality
📍 Adopt tools with explorable interfaces or dynamic dashboards
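
To make the canonical-log idea more concrete, here is a minimal Python sketch (the handler, field names and counters are hypothetical, just to show the shape of a wide structured event):

import json
import logging
import time
import uuid

logger = logging.getLogger("canonical")

def handle_checkout(user_id, cart):
    # Collect everything that happened during the request into one wide event.
    event = {
        "trace_id": uuid.uuid4().hex,        # normally propagated from the caller
        "span_id": uuid.uuid4().hex[:16],
        "route": "/checkout",
        "user_id": user_id,
        "cart_size": len(cart),
    }
    start = time.monotonic()
    try:
        event["db_calls"] = 3                # counters accumulated while handling the request
        event["cache_hit"] = True
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        logger.info(json.dumps(event))       # one canonical log line per request

Each request produces exactly one such line, which is easy to feed into a columnar store and slice by any field later.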

Our systems have become very complex, so it's critical to have effective observability tools and practices. The approach from the talk looks promising, especially as it doesn't require any new tools to be developed. Let's see if it becomes a new trend in observability implementation.

#engineering #observability
Shipping Threads in 5 Months

As developers, we often prefer writing new code rather than reusing the old one. Of course, the new code will be better, faster, more readable and maintainable, it will use newer tools and frameworks, and it will definitely be great. Spoiler: no 😉

Old battle-tested code can be significantly better: it's already tested, covered by automation, it has common features in place and lots of small learnings encoded, it's free of silly mistakes, and it's mature.

That's why I find the experience shared by the Meta team really interesting: how they reused Instagram code to build the Threads app.

Initially, the team had an ambitious goal to build a service to compete with X (Twitter) in a couple of months. To achieve that, they decided to use existing Instagram code with core features like profiles, posts, recommendations, and followers as the base for the new service. Additionally, Threads was built on the existing Instagram infrastructure with the support of existing product teams. This approach allowed them to deliver a new fully-featured service in 5 months.

Key findings during the process:
✏️ You need deep knowledge of the legacy code to reuse it successfully
✏️ Code readability really matters
✏️ Repurposing and customizing existing code for new requirements brought additional technical debt that will have to be paid off in the future
✏️ Shared infrastructure and product teams can significantly reduce development costs
✏️ Old code is already tested in real conditions and contains fewer issues than new code

So don't rush to rewrite existing code. A thoughtful evolutionary approach can bring more business benefits, reducing time to market and overall development costs.

#engineering #usecase
Put Your Own Mask On First

"Put your own mask on first before assisting your child 😷"—they always say it on the plane before the flight. It sounds clear and familiar for us. But this same rule can apply to other parts of our lives. Metaphorically of course.

What do we usually do to meet deadlines, deal with a pile of issues at work, or handle business pressure? The most common scenario is to work more and more, eventually leading to burnout. But a burned-out leader can't solve problems effectively, help their team survive the storm of difficulties, or achieve business goals.

It may sound counterintuitive, but the more pressure and problems you face at work, the more time and care you need to give yourself: proper nutrition, walks, regular physical activity, full sleep and less overwork.

When you take care of yourself, you can better take care of your team. So put your own mask on first before assisting others.

#softskills #leadership
API Governance

In modern distributed systems, where individual teams manage different services, it's pretty common for each team to create their APIs in different ways. Each team tries to make its APIs unique and impressive. As a result, a company may end up with a lot of APIs that follow different rules and principles and reflect the organizational structure instead of business domains (you remember Conway's Law, right?). This can become a complete mess.

To avoid this, APIs must be properly managed to stay consistent.

API Governance is the set of practices, tools and policies to enforce API quality, security, consistency and compliance. It involves creating standards and processes for every stage in the API lifecycle, from design, development, and testing to deployment, management, and retirement.

API Governance consists of the following elements:
📍 Centralization. A single point where policies are created and enforced.
📍 API Contract. Standard specifications to define APIs like OpenAPI, gRPC, GraphQL, AsyncAPI and others.
📍 Implementation Guidelines. Establish and enforce style guidelines for all APIs. Good examples are the Google Cloud API Guidelines and the Azure API Guidelines.
📍 Security Policies. Define API security standards and policies that protect sensitive data from cyber threats and ensure API compliance with regulations.
📍 Automation. Developers and other roles need to quickly verify that APIs comply with enterprise standards at various stages of the lifecycle (see the sketch after this list).
📍Versioning.
📍Deprecation Policy.
📍API Discovery. Provide a way to easily search for and discover existing APIs.
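
As a rough illustration of the automation element, here is a minimal sketch of a CI check over an OpenAPI file (the rules and file layout are assumptions; real setups usually rely on dedicated linters such as Spectral or Google's API Linter):

import re
import sys
import yaml  # pip install pyyaml

# Two sample style rules: kebab-case path segments and mandatory operationId.
KEBAB_SEGMENT = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$|^\{[a-zA-Z0-9_]+\}$")
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}

def lint(spec_path):
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    problems = []
    for path, operations in spec.get("paths", {}).items():
        for segment in filter(None, path.split("/")):
            if not KEBAB_SEGMENT.match(segment):
                problems.append(f"{path}: segment '{segment}' is not kebab-case")
        for method, op in operations.items():
            if method in HTTP_METHODS and "operationId" not in op:
                problems.append(f"{method.upper()} {path}: missing operationId")
    return problems

if __name__ == "__main__":
    issues = lint(sys.argv[1])            # e.g. python lint_api.py openapi.yaml
    print("\n".join(issues) or "API spec is compliant")
    sys.exit(1 if issues else 0)

A check like this can run on every pull request, so non-compliant APIs are caught before review.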

API Governance provides the guardrails for developing high-quality, consistent APIs within the company. But to make it work, a good level of automation is required.

#engineering #api
API Governance: Versioning

Let's continue today with API management and talk about versioning. To define your versioning policy you need to answer the following questions:
😲 What versioning method will be used?
🤔 When do you need to release a new version?
😬 What naming convention to use?
😵‍💫 How to keep compatibility with the clients?

The most popular versioning strategies:
✏️ No Versioning. Yes, that's also a choice 😀
✏️ Semantic Versioning. A well-known strategy to version anything in the software development world.
✏️ Stability Levels: alpha, beta, stable. The major version is changed on breaking changes. Examples: v1alpha, v2beta, v1alpha3, v2. More details in the Google API Design Guide.
✏️ Release Numbers: Simple sequential versions like v1, v2, v3, updated mainly for breaking changes.
✏️ Product Release Version: Use your product's version for APIs. Example: product version 2024.3, then API version 2024.3. In this case the version changes even if there are no major changes, but it really simplifies tracking compatibility between releases and APIs.

To reduce the impact of API changes the following compatibility strategies can be used:
✏️ Synchronized Updates: Both the API and its clients are updated and delivered together. Simple but fragile. It can be useful if you control and manage all API consumers.
✏️ Client Supports Multiple Versions: One client can work with multiple API versions, but outdated clients may stop working and require updates to match newer APIs.
✏️ API Serves Multiple Versions: A new API version is added in parallel to the existing one on the same server. In this case you may serve as many versions as you need to support all your clients. To reduce API management overhead, the Hub-Spoke pattern can be used: the hub represents the main version, while spoke versions are converted from the hub (a minimal conversion sketch follows this list). This approach is actively used in Kubernetes, so you can read more details in the Kubebuilder Conversion Concepts.
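
Here is a minimal sketch of the hub-spoke idea (a hypothetical user resource in plain Python rather than Kubernetes conversion webhooks): the server implements its logic only against the hub model, and spoke versions are converted at the boundary.

from dataclasses import dataclass

@dataclass
class UserV1:            # spoke: old public API shape
    full_name: str

@dataclass
class UserV2:            # hub: current internal shape
    first_name: str
    last_name: str

def v1_to_hub(user: UserV1) -> UserV2:
    first, _, last = user.full_name.partition(" ")
    return UserV2(first_name=first, last_name=last)

def hub_to_v1(user: UserV2) -> UserV1:
    return UserV1(full_name=f"{user.first_name} {user.last_name}".strip())

# Serving /v1 means converting at the edge, not duplicating business logic.
legacy = UserV1(full_name="Ada Lovelace")
print(hub_to_v1(v1_to_hub(legacy)))   # UserV1(full_name='Ada Lovelace')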

Analyze your requirements and architecture, set clear rules, and define the versioning and compatibility approach. It's really important to document those decisions and communicate them to your clients.

#engineering #api
SMURF Testing

Google introduced a new mnemonic for test quality attributes - SMURF:
📌 Speed: Unit tests are faster than other test types so they can be run more often.
📌 Maintainability: Cost of debugging and maintaining tests.
📌 Utilization: A good test suite optimizes resource utilization, fewer resources cost less to run.
📌 Reliability: Sorting out flaky tests wastes developer time and costs resources in rerunning the tests.
📌 Fidelity: High-fidelity tests come closer to approximating real operating conditions. So integration tests have more fidelity than unit tests.

In many cases improving one quality attribute can affect the others, so be careful and measure your costs and trade-offs.

#engineering #testing
API Governance at Scale

The most difficult part of API Governance is ensuring that developers actually follow the provided guidelines and policies. Without proper controls, the real code will eventually drift from the guidelines; it's only a matter of time. This doesn't happen because developers are bad or unwilling to follow the rules, but because we're human, and humans make mistakes. Mistakes accumulate over time, and as a result you can end up with APIs that are far from the initial recommendations.

In small teams with a small codebase, developer education can work, and trained reviewers can ensure the code follows the rules. But as your team or organization grows, this approach isn't enough. I strongly believe that only automation can maintain policy compliance across a large codebase or multiple teams.

Google recently published API Governance at Scale, sharing their experience and tools for enforcing API guidelines.

They introduced 3 key components:
✏️ API Improvement Proposals (AIPs). A design document providing high-level documentation for API development. Each rule is introduced as a separate AIP that consists of a problem description and a guideline to follow (example: AIP-126).
✏️ API Linter. This tool provides real-time checks for compliance with existing AIPs.
✏️ API Readability Program. An educational program to prepare and certify API design experts, who then perform code reviews for API changes.

While Google developed the AIPs concept, they encourage other companies to adopt the approach. Many of the rules are generic and easily reusable. They even provide a special guide on how to adopt AIPs. The adoption guide is not finished yet, but its status can be tracked via the corresponding GitHub issue.

#engineering #api
Take a Vacation

Last week I was on vacation, so there was a little break in publications 😌. Therefore I'd like to talk a little about vacations and how important they are. A high-quality vacation is not just an opportunity to relax, it's also a prevention mechanism against many serious diseases.

But it's not enough just to take vacations regularly; the way you spend them determines whether you recharge your internal battery or not.

My tips for a good vacation:

✏️ Take Enough Time: Ideally, a vacation should be at least 14 days long (as a single period). If you feel heavily exhausted, it's better to take 21 days. That time is usually enough to recharge.
✏️ Change the Scenery: Travelling to a new place (even a short trip) gives you new impressions and experiences and fills you with new ideas, inspiration and energy. Spending time outside your usual surroundings significantly decreases your overall strain level. This has also been shown by German researchers.
✏️ Digital Detox: Don't touch your laptop, don't open work chats, don't read the news, and minimize social network usage. Give your brain a rest from constant information noise.
✏️ Be Spontaneous: Don't try to plan everything: constantly following a schedule makes a vacation feel more like work and doesn't let you enjoy the moment. Spontaneous activities can provide more fun and satisfaction.
✏️ Do Nothing: Allow yourself to take time for idleness. That's really difficult, as it feels like wasting time that could be spent more effectively 😀. But that's the trick: a state of nothingness rewires the brain and improves creativity and problem-solving capabilities.

So take care of yourself and plan a proper rest during the year.

#softskills #productivity
Google ARM Processor

Last week, Google announced their own custom ARM-based processor for general-purpose workloads. They promised up to 65% better price-performance and up to 60% better energy-efficiency.

Why is it interesting? Until now, only AWS offered a custom cost-optimized ARM processor - AWS Graviton. Now Google has joined the competition. This shows that interest in ARM processors is still growing and will continue to grow in the future.

From an engineering perspective, it's not possible to simply switch a workload from one architecture to another, as images need to be pre-built for a specific architecture. One of the ways to test ARM nodes and migrate smoothly to the new architecture is to use multi-architecture images (I wrote about that here).

#engineering #news
Uber’s Gen AI On-Call Copilot

GenAI continues its march into routine automation. This time Uber shared their experience with Genie, an on-call support automation tool for internal teams.

The issue is very common for large companies with many teams: there are channels (for Uber, it's Slack with ~45,000 questions per month) where teams can ask questions and request help with a service or technology. Of course, there are a lot of docs and relevant articles, but they are fragmented and spread across internal resources. It's really hard for users to find answers on their own. As a result, the number of repetitive questions grows, and the load on support engineers increases.

Key elements of the implemented solution:
✏️ RAG (Retrieval-Augmented Generation) approach to working with the LLM (see the sketch after this list)
✏️ Data Pipeline: Information from wikis, internal Stack Overflow, and engineering docs is scraped daily, transformed into vectors, and stored in an in-house vector database with the source links. Data pipeline is implemented on Apache Spark.
✏️ Knowledge Service: When a user posts a question in Slack, Genie’s backend converts it into a vector and fetches the most relevant chunks from the vector database.
✏️ User Feedback: Users can rank answers as Resolved, Helpful, Not Helpful, or Relevant; these ratings are used to analyze answer quality.
✏️ Source Quality Improvements: There is a separate evaluation process to improve source data quality. The LLM analyzes the docs and returns an evaluation score, explanations of the score, and actionable suggestions for improvement. All this information is collected into an evaluation report for further analysis and fixes.
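
For readers new to RAG, here is a toy, self-contained sketch of the retrieval step (the embed() function and the documents are made up for illustration; Uber uses a real embedding model and an in-house vector database):

import numpy as np

def embed(text):
    # Toy embedding: a normalized bag-of-characters vector, just to keep the sketch runnable.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = [
    "To rotate service credentials, run the secrets-rotation pipeline.",
    "Deployment failures with exit code 137 usually mean the container ran out of memory.",
]
index = [(doc, embed(doc)) for doc in docs]     # the offline data-pipeline step

def build_prompt(question, top_k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    # In the real system this prompt is sent to the LLM and the answer is posted to Slack.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Why does my deploy fail with exit code 137?"))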

Since Genie's launch in September 2023, Uber reports it has answered 70,000 questions with a 48.9% helpfulness rate, saving 13,000 engineering hours 😲. It's impressive! I definitely want to have something similar at my work. Just one small hurdle left: getting the budget and resources for implementation. No big deal, right? 😉

#engineering #usecase #ai
Columnar Databases

Traditional databases store data in a row-oriented layout that is optimized for transactional, single-entity lookups. But if you need to aggregate data by a specific column, the system has to read all columns from disk, which slows down query performance and increases resource usage.
To solve this issue, columnar databases were introduced.

A columnar database is a database that stores the values of each column together on disk.

Imagine the following sample:
| Account | LastName | FirstName | Purchase, $ |
| 0122    | Jones    | Jason     | 325.5       |
| 0123    | Diamond  | Richard   | 500         |
| 0124    | Tailor   | Alice     | 125         |


In a row-oriented database it will be stored as follows:
0122, Jones, Jason, 325.5;
0123, Diamond, Richard, 500;
0124, Tailor, Alice, 125;


In a column-oriented database:
0122, 0123, 0124;
Jones, Diamond, Tailor;
Jason, Richard, Alice;
325.5, 500, 125;
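
A toy Python illustration of why this layout helps aggregation (in-memory lists stand in for on-disk pages):

rows = [
    ("0122", "Jones", "Jason", 325.5),
    ("0123", "Diamond", "Richard", 500.0),
    ("0124", "Tailor", "Alice", 125.0),
]
columns = {
    "Account":   ["0122", "0123", "0124"],
    "LastName":  ["Jones", "Diamond", "Tailor"],
    "FirstName": ["Jason", "Richard", "Alice"],
    "Purchase":  [325.5, 500.0, 125.0],
}

# Row store: every full record is read even though only one field is needed.
total_from_rows = sum(record[3] for record in rows)

# Column store: only the Purchase column is read.
total_from_columns = sum(columns["Purchase"])

assert total_from_rows == total_from_columns == 950.5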


Benefits of the approach:
📍High data compression due to the similarity of data within a column
📍Enhanced querying and aggregation performance for analytical and reporting tasks
📍Reduced I/O load as there is no need to process irrelevant data

The most popular columnar databases:
1. Amazon Redshift
2. Google Cloud BigTable
3. Microsoft Azure Cosmos DB
4. Apache Druid
5. Vertica
6. ClickHouse
7. Snowflake Data Cloud

Columnar databases are well-suited for building data warehouses, real-time analytics, statistics, and storing and aggregating time-series data.

#engineering
Manage Your Energy Level

Recently I wrote about the importance of having high-quality vacations. What I didn't share is that I went on vacation completely drained, with zero internal resources and even a diagnosis from a neurologist 😵‍💫. It is a tough state to be in, and I never want to feel like that again.

So I reflected on how to prevent burning out in the future.

First of all, I understood that this was my own fault - not heavy workload, urgent issues, or company changes. It's the primary responsibility of any leader to maintain their internal resources and energy. That's very important. A leader cannot work without enough energy, as it's impossible to drive anything or meet business goals in that state.

Next, I started to study different recommendations on what to do. The advice is usually very common: exercise, walk, eat well, and make time for hobbies. Unfortunately, I already knew all that, but it didn't help me. My issue is that I don't notice the point where I am completely drained and it's too late to go for a walk.

So I need to monitor my internal state somehow. As technical people, we know that to control something we need to measure it. One resource recommends the Welltory app, which provides personal health analysis based on heart rate variability (Garmin watches have similar features built in). Additionally, it uses info about sleep, steps, stress level, and more from almost any smartwatch. It looks like magic, but there is real science behind it. This isn't an ad, I'm just sharing a tool I found useful 🙂.

I've been using the app for about 2 weeks now. The algorithm is still training (about 35% done), but I'm already using its basic features. I periodically take measurements and check my overall state: green, orange or red. Based on this, I've started taking short recovery breaks at work to avoid hitting zero. I also track the overall health trend to understand whether my daily routine needs additional corrections like more exercise, walks, etc.

Burnout is a very common problem in our industry, and that's why I decided to share my experience on what can help to control your internal state and maintain a good level of motivation and energy. Of course, 2 weeks is not enough to say the approach works. Put likes if the topic is interesting and I'll share my results in 1-2 months.

Stay healthy and take care of yourself.

#softskills #productivity
Cloud Ecosystem Trends

This week CNCF published Emerging Trends in the Cloud Native Ecosystem, a list of trends that will continue to grow in 2025.

Top trends:
🚀 Cloud Cost Optimizations. With growing cloud adoption, businesses focus on controlling cloud costs using tools like Karpenter and OpenCost. The same trend was also highlighted by the FinOps Foundation earlier this year.
🚀 Platform Engineering (I did an overview there). Extend developer experience with platforms for observability, policy as code, internal developer portals, security, CI/CD, and storage to speed up business development.
🚀 AI Synergy. The trend is to support AI training and operations in the cloud. New actively developed projects in this area:
- OPEA: a collection of cloud-native patterns for GenAI workloads
- Milvus: a high-performance vector database
- Kubeflow: a project to deploy machine-learning workflows on Kubernetes
- KServe: a toolset for serving predictive and generative machine-learning models
🚀 Observability Standards Unification. Projects like OpenTelemetry and the Observability TAG unify standards, minimize vendor lock-in, and reduce costs.
🚀 Security. Security is a top-priority topic in CNCF. There are newly graduated projects in this area (like Falco) and a separate TAG-Security group that publishes white papers offering direction to the industry on security topics.
🚀 Sustainability (more about GreenOps there). Sustainability tools (like Kepler and OpenCost) measure the carbon footprint of Kubernetes applications. The area is still under active development, but it already has promising open-source projects and standards.

It's interesting that overprovisioning and high resource waste are still the main problems in modern clouds. According to the Kubernetes Cost Benchmark Report, clusters with 50 or more CPUs used only 13% of their provisioned CPU capacity, and memory utilization was around 20%. This shows a huge opportunity for future optimizations.

#news
I'm introducing a new section on the channel: #aibasics !

Over the past two years, ML has been the top trend in the industry, with huge interest not just in tech but across various business domains. ML helps automate routine tasks and significantly decrease operational costs. And this trend will definitely continue to grow for the next few years, or even longer.

As engineers, we should at least know the fundamentals of this technology. I mean not just using lots of GenAI tools in daily work, but understanding how it works under the hood, its limitations, capabilities and applicability to business and engineering tasks. As for me, I have a significant knowledge gap here, which I plan to close over the next several months.

I plan to start with the following courses (they are absolutely free):
✏️ Machine Learning Crash Course from Google, which received fresh updates in November
✏️ LLM Course by Cohere

I will use those courses as a base and extend them with additional sources on demand.

So I'm starting my AI learning journey and will share my progress and key takeaways here 💪
ML Introduction

Let's start AI basics with the definition of ML and its types.

Definition from Google ML Introduction course:
ML is the process of training a piece of software, called a model, to make useful predictions or generate content from data.


ML Types:
📍 Supervised Learning. The model is trained on lots of data with known correct answers. It's "supervised" in the sense that a human gives the ML system data with the known correct results. This type is used for regression and classification tasks.
📍 Unsupervised Learning. The model makes predictions using data that does not contain any correct answers. A commonly used unsupervised learning model employs a technique called clustering. The difference from classification is that categories are discovered during training and not defined by a human.
📍Reinforcement Learning. The model makes predictions by getting rewards or penalties based on the actions performed. The goal is to find the best strategy to get the most rewards. This approach is used to train robots to execute different tasks.
📍Generative AI. The model creates content (text, images, music, etc.) from user input. These models learn existing patterns in data with the goal of producing new but similar data.

Each ML type has its own purpose, like making predictions, finding patterns, creating content, or automating routine tasks. Among them, Generative AI is the most popular and well-known today.

#aibasics
ML Basic Terms

To be on the same page with AI experts, we need to build a vocabulary of basic terms and concepts:
✏️ Feature - an input parameter for the model. Usually it represents some characteristic of the entity or facts for which the model makes a prediction.
✏️ Label - a known correct answer for the input data. Usually used to train supervised models: the predicted value can be compared with the label to measure the discrepancy.
✏️ Loss - the difference between the predicted value and the label. Different models use different functions to calculate loss.
✏️ Learning Rate - a floating-point number that tells the optimization algorithm the step size for each iteration while moving toward a minimum of the loss function. If the learning rate is too low, the model can take a long time to converge. If the learning rate is too high, the model may never converge.

#aibasics
Examples of a model that converges vs one that doesn't.

#aibasics
Linear Regression

Linear regression is the simplest supervised ML model that finds relationships between features and labels.

Mathematically it looks like:
y' = b + w1*x1 + w2*x2 + ... + wn*xn


where
- y' - predicted value
- b - bias (calculated during training)
- wn - weight for a feature (calculated during training)
- xn - feature value (input to the model)

Loss for this type of model is usually calculated as mean squared error (MSE) or mean absolute error (MAE):
- MSE is sensitive to outliers and adjusts the model toward them.
- MAE minimizes the absolute differences, making it less sensitive to outliers.
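
A quick numeric illustration of that difference (toy values with one outlier):

import numpy as np

labels      = np.array([10.0, 12.0, 11.0, 50.0])   # 50.0 is an outlier
predictions = np.array([10.5, 11.5, 11.0, 12.0])

mse = np.mean((predictions - labels) ** 2)   # blows up because errors are squared
mae = np.mean(np.abs(predictions - labels))  # grows only linearly with the outlier

print(f"MSE={mse:.1f}, MAE={mae:.2f}")       # MSE=361.1, MAE=9.75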

Training steps:
1. Calculate the loss with the current weight and bias.
2. Determine the direction to move the weights and bias that reduce loss.
3. Move the weight and bias values a small amount in the direction that reduces loss.
4. Return to step one and repeat the process until the model can't reduce the loss any further.

Example:
The model needs to predict taxi ride prices based on features like distance and ride duration. Past ride prices can be used as labels.

The model formula:
y' = b + w1*distance + w2*ride_duration


The goal is to find values for b, w1, and w2 that minimize the MSE for the given labels. A well-trained model converges after a limited number of iterations, when the loss cannot be reduced any further. A minimal training sketch follows.
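
Here is a small sketch of these training steps in plain Python/NumPy (toy made-up data and a hand-tuned learning rate; real workflows use libraries such as scikit-learn):

import numpy as np

# Toy taxi rides: columns are distance (km) and duration (min); labels are past prices.
X = np.array([[2.0, 10.0], [5.0, 20.0], [8.0, 35.0], [12.0, 50.0]])
y = np.array([6.0, 11.5, 17.5, 25.0])

w = np.zeros(2)          # w1, w2
b = 0.0
learning_rate = 5e-4

for _ in range(100_000):
    y_pred = X @ w + b                    # y' = b + w1*distance + w2*ride_duration
    error = y_pred - y
    loss = np.mean(error ** 2)            # MSE
    grad_w = 2 * X.T @ error / len(y)     # direction that increases the loss
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w           # small step in the opposite direction
    b -= learning_rate * grad_b

print(f"w1={w[0]:.2f}, w2={w[1]:.2f}, b={b:.2f}, final MSE={loss:.4f}")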

Use Cases:
✏️ Predicting Outcomes. Forecast values based on multiple inputs, e.g., taxi fares, apartment rentals, or flight prices.
✏️ Discovering Relationships. Reveal how variables are related and how changes in one variable affect the whole result.
✏️ Process Optimization. Optimize processes by understanding the relationships between different factors.

Studying linear regression made me realize why I learned linear algebra and statistics at university 😄. I really had some fun with the math and dynamic examples.

References:
- Google ML Crash Course: Linear Regression
- Understanding Multiple Linear Regression in ML

#aibasics
Visualization of how different loss functions can change model training results. As mentioned above, MSE moves the model more toward the outliers, while MAE doesn't.

#aibasics