Engineer Readings – Telegram
[databases]
https://www.uber.com/en-NL/blog/auto-categorizing-data-through-ai-ml/
Data categorization–the process of classifying data based on its characteristics and essence–is a foundational pillar of any privacy or security program. The effectiveness of fine-grained data categorization is pivotal in implementing privacy and security controls, such as access policies and encryption, as well as managing the lifecycle of data assets, encompassing retention and deletion. This blog delves into Uber’s approach to achieving data categorization at scale by leveraging various AI/ML techniques.
[news][ai][hackathon]
Great projects out of the Mistral AI hackathon, which took place in Paris.

https://x.com/alexreibman/status/1796349663710511114?s=46&t=eNN3Y-GKeBSlFyyj1ozvgg
[distributed systems][kafka]

Kora: A Cloud-Native Event Streaming Platform For Kafka

https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf
[memory]

What Every Programmer Should Know About Memory

This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
[learning][distributed systems]
A colleague shared a great resource for studying distributed systems by building them.

https://fly.io/dist-sys/1/
[distributed systems][paper]

Event-Based Programming without Inversion of Control

https://lampwww.epfl.ch/~odersky/papers/jmlc06.pdf
[paper][GC][state machine]
https://arxiv.org/html/2405.11182v1

In this paper, the authors quantify the overhead of running a state machine replication system for cloud systems written in a language with garbage collection (GC). To this end, they (1) design a canonical cloud system—a distributed, consensus-based, linearizable key-value store—from scratch, (2) implement it in C++, Java, Rust, and Go, and (3) evaluate the implementations under update-heavy and read-heavy workloads on AWS with different resource constraints, aiming to maximize throughput while maintaining low tail latency. The results show that GC incurs a non-trivial cost, even with ample memory. With limited memory, languages with manual memory management can achieve an order of magnitude higher throughput than those with GC on the same hardware. A key observation is that if a cloud system is expected to scale significantly, building it in a language with manual memory management, despite the higher development cost, may lead to substantial cloud cost savings in the long run.