Those fucking Zoom meetings are more nerve-racking than actual in-person meetings. It should almost be illegal.
How not to look like a hostage at your next Zoom meeting? Here are some tips from my experience on the topic.
Link
Blog | iamluminousmen
Guidelines for business meetings
Remote meetings have become an essential part of the workflow - or even the only means of communication - for teams across the globe. How do you make them effective?
Abstraction is not OOP
Once I was taught at university that there are only three principles of OOP: encapsulation, inheritance, and polymorphism. Times have changed, and now a fourth principle has been added to Wikipedia: abstraction. I hear it all the time in interviews now, and it drives me crazy.
Abstraction is a powerful programming tool. It is what allows us to build large systems and maintain control over them.
But abstraction is not an attribute of OOP alone, nor of programming in general. The process of creating abstraction levels extends to almost all areas of human knowledge.
Have you heard about Plato's idealism? We always deal with abstractions - models - and "the reality is not available to us". We can easily talk about complex mechanisms, such as a computer, an airplane turbine, or the human body, without recalling the individual details of these entities. We talk about ideas - ideal concepts - not about their flawed implementations.
There have always been abstractions in programming. Splitting up the code into sub-programs. Combining sub-programs into modules and packages. Types? Same idea.
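To make this concrete, here is a minimal Python sketch (the names are mine, purely illustrative): the caller works with the idea of "storage" and never touches the details behind it.

from abc import ABC, abstractmethod

# The abstraction: an ideal concept of "storage", no implementation details.
class Storage(ABC):
    @abstractmethod
    def save(self, key: str, data: bytes) -> None: ...

# One of many possible (and possibly flawed) implementations.
class DiskStorage(Storage):
    def save(self, key: str, data: bytes) -> None:
        with open(key, "wb") as f:
            f.write(data)

# Client code talks to the idea, not the details; encapsulation and
# polymorphism are merely the mechanisms that make this possible.
def backup(storage: Storage) -> None:
    storage.save("state.bin", b"\x00\x01")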
While encapsulation, polymorphism, and inheritance are principles of OOP, abstraction is an element of OOP. It stands above the principles: the OOP principles implement abstraction. But this is more philosophy than a principle...
#dev
OpenAI has shared the results of its research on DALL-E, a new neural network - a further extension of the GPT-3 idea using Transformers, but this time for generating images from text.
DALL-E is a neural network with 12 billion parameters, trained on text-image pairs, that creates pictures from text descriptions. More here
✨Magic
Soft skills thoughts
I feel like in Western countries no one wants a person to know the depths of one particular technology. Knowledge and broad expertise across a whole stack, or even one platform, are valued much higher. That's why all the cloud providers are trying to include a service for every possible use case, and why platforms like Snowflake are flourishing, as its recent IPO showed.
Moreover, those countries lean more and more toward soft skills - communication, teamwork, presentation skills, even sales skills. People there have understood that there is no sense in chasing technologies. It is easier and cheaper to outsource them to a platform where specially trained people do everything for you.
#soft_skills
Gen C — The Covid Generation
'... And in case this isn’t forward-thinking enough for you, BofA notes that the next to come along is Gen C: The Covid generation.
"It is the generation that will have only ever known problem solving through fiscal stimulus and free government money potentially paving the way for universal basic income and health-care access," the strategists said. "Gen C will be unable to live without tech in every aspect of their lives" and "their avatars will protest virtually in the online Total Reality world with their friends on the latest cultural movement.'
Source
Bloomberg.com
Zillennials Are Going to Change Investing Forever, BofA Says
Eating meat is out, flight-shaming is in. Gen Z is transforming the world and investors need to be prepared.
See how you should advertise your product properly: JetBrains downloaded 10,000,000 Jupyter notebooks from GitHub and ran analytics on them. It might seem that there is nothing interesting here and everything is obvious, but a few things caught my eye.
▪️NumPy is the single most used data science library;
▪️NumPy and pandas are the most popular combination;
▪️Keras is wildly popular as a deep learning framework, although PyTorch has seen massive growth recently;
▪️Half of all these notebooks have fewer than four markdown cells;
▪️Over a third of these notebooks may fail if you try to run the cells in order.
Unfortunately, if you have ever worked in data science, especially in ML, you know that 1/3 is an underestimate - the real number should be much higher. Work by regular data scientists and engineers, and even papers from big conferences, is usually at least partially unreproducible. Even some of the Titanic dataset experts cannot make their notebooks reproducible.
A good notebook is like a great conversation: cells should be like sentences. You say something, then add context and arguments.
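The classic failure mode, as a tiny hypothetical sketch: the notebook was saved after its cells were run out of order, so a clean top-to-bottom replay breaks on hidden state.

# Cell 1 (execution count 5 - it ran last in the author's session)
print(answer)   # NameError on a fresh top-to-bottom run
# Cell 2 (execution count 4 - it defined the name the session relied on)
answer = 42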
Link
The JetBrains Blog
We Downloaded 10,000,000 Jupyter Notebooks From Github – This Is What We Learned | The Datalore Blog
Here’s how we used the hundreds of thousands of publicly accessible repos on GitHub to learn more about the current state of data science.
NumPy
This is an open-source library, once separated from the SciPy project. NumPy is based on the LAPACK library, which is written in Fortran. The Fortran-based implementation is what makes NumPy a fast library, and since it supports vectorized operations on multidimensional arrays, it is extremely convenient.
A non-Python alternative to NumPy is MATLAB.
Besides support for multidimensional arrays, NumPy includes a set of packages for solving specialized problems, for example:
▪️numpy.linalg - implements linear algebra operations;
▪️numpy.random - implements functions for dealing with random variables;
▪️numpy.fft - implements the direct and inverse Fourier transform.
A guide to NumPy with many nice illustrations
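A minimal sketch of the vectorized style and those subpackages (the values are mine, purely illustrative):

import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)   # a small multidimensional array
b = a * 2 + 1                                 # vectorized: no Python loops

norm = np.linalg.norm(b)                      # numpy.linalg: linear algebra
rng = np.random.default_rng(seed=42)          # numpy.random: random variables
noise = rng.normal(size=b.shape)
spectrum = np.fft.fft(b[0])                   # numpy.fft: Fourier transform

print(norm, noise.shape, spectrum.shape)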
#python
Medium
NumPy Illustrated: The Visual Guide to NumPy
Brush up your NumPy or learn it from scratch
An interesting article about why Apache Kafka is so fast and popular. For those who work with the technology: read what Kafka has "under the hood" - it explains a lot. Record batching, batch compression, buffered operations, and other tricks. Zero-copy is really cool; I had never heard of it before.
If you don't like paywalls, try opening it in a private tab.
Medium
Why Kafka Is so Fast
Discover the deliberate design decisions that have made Kafka the performance powerhouse it is today.
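For reference, the batching and compression tricks from the article are exposed as plain producer settings. A minimal sketch with the confluent-kafka Python client (the broker address and topic name are my assumptions):

from confluent_kafka import Producer

# Record batching and batch compression: messages are accumulated into
# batches and each batch is compressed as a whole.
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption: a local broker
    "linger.ms": 50,               # wait up to 50 ms to fill a batch
    "batch.size": 64 * 1024,       # accumulate up to 64 KB per batch
    "compression.type": "lz4",     # compress batches, not single records
})

for i in range(1000):
    producer.produce("demo-topic", key=str(i), value=f"event-{i}")
producer.flush()  # block until the buffered messages are delivered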
Testing and validation in ML
Testing is an important part of the software development cycle - perhaps crucial to delivering a good product. As a software project grows, dealing with bugs and technical debt can consume all of the team's time if it doesn't adopt any testing approach. And overall, software testing methodologies seem to me to be well understood.
Machine learning models bring a new set of complexities beyond traditional software. In particular, they depend on data in addition to code. As a result, testing methodologies for machine learning systems are less well understood and less widely applied in practice. Nowadays anyone can call a couple of functions in sklearn and proudly call themselves a data scientist, but relating the results to the real world and validating that the model does reasonable things is quite difficult.
Here is a good talk about the importance of testing in ML, an overview of the types of testing available to ML practitioners, and recommendations on how you can start implementing more robust testing into ML projects.
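As one concrete flavor of this, a minimal pytest-style sketch (the dataset, threshold, and invariance choice are mine, purely illustrative) - behavioral checks instead of a single accuracy number:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def train():
    X, y = load_iris(return_X_y=True)
    return LogisticRegression(max_iter=1000).fit(X, y), X, y

def test_minimum_quality():
    model, X, y = train()
    # Guardrail: changes to code or data must not silently degrade the model.
    assert model.score(X, y) > 0.9

def test_invariance_to_tiny_noise():
    model, X, _ = train()
    noisy = X + np.random.default_rng(0).normal(0.0, 1e-6, X.shape)
    # Behavioral test: predictions should be stable under negligible noise.
    assert (model.predict(X) == model.predict(noisy)).mean() > 0.99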
#ml
YouTube
PyData MTL: "Testing production machine learning systems" by Josh Tobin
Testing is a critical part of the software development cycle. As your software project grows, dealing with bugs and regressions can consume your team if you do not take a principled approach to…
Martin Kleppmann (the guy behind the Designing Data-Intensive Applications book) has made his new 8-lecture university course on distributed systems publicly available.
Link
Forwarded from Data Science, Machine Learning, AI & IOT
A huge repo of courses and resources covering Computer Science, AI, ML, Data Science, Maths, and a lot more
#beginner #machinelearning #datascience #github
@kdnuggets @datasciencechats
https://github.com/Developer-Y/cs-video-courses#math-for-computer-scientist
GitHub
GitHub - Developer-Y/cs-video-courses: List of Computer Science courses with video lectures.
List of Computer Science courses with video lectures. - Developer-Y/cs-video-courses
A little bit of Python. Python has no native empty-set literal, because {} is already reserved for dictionaries. However, you can unpack an empty list and get a sort of empty-set literal with Python >= 3.5 (see PEP 448):
>>> s = {*[]}  # or {*{}} or {*()}
>>> print(s)
set()
#python
Python Enhancement Proposals (PEPs)
PEP 448 – Additional Unpacking Generalizations | peps.python.org
This PEP proposes extended usages of the * iterable unpacking operator and ** dictionary unpacking operators to allow unpacking in more positions, an arbitrary number of times, and in function calls and displays.
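And a quick check of why the literal is taken (nothing new, just the reason):
>>> type({})       # the empty literal gives a dict, not a set
<class 'dict'>
>>> type({*()})    # the unpacking trick from above
<class 'set'>
>>> type(set())    # the explicit constructor, arguably clearer
<class 'set'>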
There are two modern architectures that are commonly used in a lot of Big Data projects today.
The first one - Lambda Architecture, attributed to Nathan Marz - is one of the most common architectures you will see in real-time data processing today. It is designed to handle low-latency reads and updates in a linearly scalable and fault-tolerant way.
The second one - Kappa Architecture - was first described by Jay Kreps. It focuses on processing data only as a stream. It is not a replacement for the Lambda Architecture, except where your use case fits it.
https://luminousmen.com/post/modern-big-data-architectures-lambda-kappa
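The Kappa idea is easiest to see in code: a single streaming code path serves both historical and real-time needs. A minimal PySpark Structured Streaming sketch (the source and sink choices are mine, illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

# Everything is a stream; the built-in "rate" source stands in for Kafka here.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# One code path computes the aggregates - no separate batch layer to keep in sync.
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()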
I want to try a new format. I'm a fucking blogger - I have to act like one. Anyway, the idea is to talk about just one topic during the week.
And this week we're gonna talk about Apache Spark. Nothing specific, just some rambling about the topic, should be interesting though.
Do you use Apache Spark at work?
Anonymous Poll
44%
daily
22%
occasionally
34%
what is Apache Spark?
RDD lineage memory error
Let's start with an interesting problem - a memory error caused by a large RDD lineage.
When scheduling a stage for execution, Spark's scheduler assigns a set of tasks to workers. But before dispatching tasks to workers, Spark's scheduler broadcasts the metadata of the RDD to all workers, which includes the full lineage of that RDD. Moreover, when dispatching a task to a worker, Spark's scheduler sends the worker the task's lineage (the parent RDD partitions that this task depends on) and its ancestors' dependencies.
This can create a significant load on the single-threaded scheduler and may eventually lead to an error. You can identify it by an error message containing something like OutOfMemoryError: Java heap space.
If the lineage is very long and contains a large number of RDDs and stages, the DAG can be huge. Such a problem can easily occur in iterative machine learning algorithms. Since Spark's dependency DAG cannot represent branches and loops, a loop unfolds in the DAG, and optimization steps with model parameter updates are repeated dozens or even hundreds of times.
People say that this can also easily happen in streaming applications, which can run for long periods of time and update the RDD whenever new data arrives. But I have personally never seen that problem.
The solution here is to reduce the lineage size before the .fit() step. For example, use checkpoint() to shorten the lineage - just call it from time to time when you think the lineage is growing significantly. cache() is also useful in a similar context.
#spark #big_data
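A minimal PySpark sketch of this lineage-truncation trick (the loop and the interval are mine, standing in for an iterative algorithm):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")  # required before checkpoint()

rdd = sc.parallelize(range(1_000_000))
for i in range(100):                   # the "unrolled loop" in the DAG
    rdd = rdd.map(lambda x: x + 1)     # every step grows the lineage
    if i % 10 == 0:
        rdd.checkpoint()               # mark the RDD for truncation...
        rdd.count()                    # ...an action materializes the checkpoint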
Spark MLlib
Spark has two machine learning packages - spark.mllib and spark.ml. spark.mllib is the original machine learning API, based on the RDD API (in maintenance mode since Spark 2.0), while spark.ml is a newer API based on DataFrames. Obviously, the DataFrames-based API is much faster thanks to all the Spark optimizations, but if you look into the code you can often see that spark.ml makes a lot of spark.mllib calls.
Nevertheless, we use "MLlib" as an umbrella term for the machine learning library in Apache Spark.
#spark #big_data
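For orientation, the DataFrame-based flavor looks like this. A minimal sketch (the columns and data are mine, purely illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1)],
    ["f1", "f2", "label"],
)

# spark.ml operates on DataFrames: features go into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()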
Speaking of Spark MLlib
To the authors' credit, over the last couple of years Apache Spark has gained more algorithms in its machine learning libraries (MLlib and ML, see above) and greatly improved performance.
But a huge number of algorithms are still implemented using the old RDD interfaces, and some are still not implemented at all (ensembles being an obvious example). Contributing to Spark MLlib today is unrealistic: the company has changed its focus and is trying to support what it already has instead of implementing SOTA algorithms.
It makes sense to me - it's better to write countless integration interfaces to third-party products and a good processing engine than to focus on a bunch of hard-to-implement algorithms. Plus, as the number of Python users grows, they are trying to adapt the tool for the general public, for example by implementing Koalas. Plugins and add-ons (Spark packages) are created to bring in the right algorithms: Microsoft has MMLSpark, and there are also H2O, Horovod, and many others.
A good article on the Spark ML library's weaknesses
#spark #big_data
Medium
Weakness of the Apache Spark ML library
Like everything in the world, the Spark Distributed ML library widely known as MLlib is not perfect and working with it every day you come…