LuminousmenBlog – Telegram
LuminousmenBlog
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
NumPy

This is an open-source library, originally split off from the SciPy project. NumPy is built on the LAPACK library, which is written in Fortran, and that Fortran-based implementation is what makes NumPy fast. And because it supports vectorized operations on multidimensional arrays, it is extremely convenient.

A non-Python alternative to NumPy is MATLAB.

Besides support for multidimensional arrays, NumPy includes a set of packages for solving specialized problems, for example (see the sketch after the list):

▪️numpy.linalg - implements linear algebra operations;
▪️numpy.random - implements functions for dealing with random variables;
▪️numpy.fft - implements direct and inverse Fourier transform.
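
A quick usage sketch of these packages (my own illustrative example, not from the guide below):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# numpy.linalg: solve the linear system A @ x = b
x = np.linalg.solve(A, b)

# numpy.random: draw samples from a standard normal distribution
samples = np.random.default_rng(42).normal(size=1000)

# numpy.fft: forward and inverse Fourier transform round-trip
signal = np.sin(np.linspace(0, 2 * np.pi, 64))
restored = np.fft.ifft(np.fft.fft(signal)).real

print(x, samples.mean(), np.allclose(signal, restored))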

A guide to NumPy with many nice illustrations

#python
An interesting article about why Apache Kafka is so fast and popular. For those who work with the technology, it is worth reading what Kafka has "under the hood" - it explains a lot: record batching, batch compression, buffered operations, and other tricks. Zero-copy is really cool, I had never heard of it before.

If you don't like paywalls, try opening it in a private tab.
Testing and validation in ML

Testing is an important part of the software development cycle. Perhaps crucial to the delivery of a good product. As a software project grows, dealing with bugs and technical debt can consume all of the team's time if no testing approach is implemented. Overall, software testing methodologies seem to me to be well understood.

Machine learning models bring a new set of complexities beyond traditional software. In particular, they depend on data in addition to code. As a result, testing methodologies for machine learning systems are less well understood and less widely applied in practice. Nowadays anyone can call a couple of functions from sklearn and proudly say he's a data scientist, but relating the results to the real world and validating that the model does reasonable things is quite difficult.

Here is a good talk about the importance of testing in ML, an overview of the types of testing available to ML practitioners, and recommendations on how you can start implementing more robust testing into ML projects.

#ml
Martin Kleppmann (the guy behind the Designing Data-Intensive Applications book) has made his new 8-lecture university course on distributed systems publicly available.

Link
A little bit of Python. Python has no native empty-set literal, because {} is already reserved for dictionaries. However, with Python >= 3.5 you can unpack an empty iterable and get a sort of empty-set literal (see PEP 448):

>>> s = {*[]}  # or {*{}} or {*()}
>>> print(s)
set()

#python
There are two modern architectures that are commonly used in a lot of Big Data projects today.

The first one - Lambda Architecture, attributed to Nathan Marz, is one of the most common architectures you will see in real-time data processing today. It is designed to handle low-latency reads and updates in a linearly scalable and fault-tolerant way.

The second one - Kappa Architecture - was first described by Jay Kreps. It focuses on processing data only as a stream. It is not a replacement for the Lambda Architecture, unless your use case fits it.
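
To make the difference concrete, here is a toy sketch of mine (not from the post) of how a query is served in each architecture; all names and numbers are made up:

def lambda_query(batch_view: dict, realtime_view: dict, key: str) -> int:
    # Lambda: a serving layer merges the precomputed batch view with the
    # recent increments from the speed (streaming) layer.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

def kappa_query(stream_view: dict, key: str) -> int:
    # Kappa: a single stream-processing pipeline maintains the only view.
    return stream_view.get(key, 0)

batch = {"page_a": 100}    # recomputed nightly over the full history
speed = {"page_a": 3}      # covers events since the last batch run
stream = {"page_a": 103}   # maintained continuously over the whole event log

print(lambda_query(batch, speed, "page_a"))  # 103
print(kappa_query(stream, "page_a"))         # 103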

https://luminousmen.com/post/modern-big-data-architectures-lambda-kappa
Nice visualization of the Data Science and AI landscape!

Source

#ds
​​I want to try a new format. I'm a fucking blogger - I have to act like one. Anyway, the idea is to talk about just one topic during the week.

And this week we're gonna talk about Apache Spark. Nothing specific, just some rambling about the topic, should be interesting though.
RDD lineage memory error

Let's start with an interesting problem - memory error due to large RDD lineage.

When scheduling a stage for execution, Spark's scheduler assigns a set of tasks to workers. But before dispatching tasks to workers, Spark's scheduler broadcasts the metadata of the RDD to all workers, which includes the full lineage of that RDD. Moreover, when dispatching a task to a worker, Spark's scheduler sends the worker the task's lineage (the parent RDD partitions that this task depends on) and its ancestors' dependencies.

This can create a significant load on the single-threaded scheduler and may eventually lead to an error. You can identify it by an error message containing something like OutOfMemoryError: Java heap space.

If the lineage is very long and contains a large number of RDDs and stages, the DAG can be huge. Such a problem can easily occur in iterative machine learning algorithms. Since Spark's dependency DAG cannot represent branches and loops, a loop unfolds in the DAG, and optimization steps with model parameter updates are repeated dozens or even hundreds of times.

People say this can also easily happen in streaming applications, which can run for long periods of time and update the RDD whenever new data arrives. But I personally have never seen that problem.

The solution here is to reduce the lineage size before the .fit() step, for example by using checkpoint() to shorten the lineage. Just use it from time to time when you think the lineage is growing significantly. Also, cache() is useful in a similar context.
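
A minimal PySpark sketch of the idea (my example; the checkpoint directory and the loop are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical dir

df = spark.range(0, 1000000)

for i in range(100):                           # e.g. an iterative algorithm
    df = df.withColumn("id", F.col("id") + 1)  # every iteration grows the lineage
    if i % 10 == 0:
        df = df.checkpoint()                   # materialize and truncate the lineage
        # df.cache() is a lighter alternative, but it does not cut the lineage
        # the way checkpoint() does.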

#spark #big_data
Spark MLlib

Spark has two machine learning packages - spark.mllib and spark.ml.

spark.mllib is the original, RDD-based machine learning API (it has been in maintenance mode since Spark 2.0), while spark.ml is a newer API based on DataFrames. Obviously, the DataFrames-based API is much faster due to all the Spark optimizations, but if you look into the code you can often see that spark.ml makes a lot of spark.mllib calls.

Nevertheless, we use "MLlib" as an umbrella term for the machine learning library in Apache Spark.
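
A rough side-by-side sketch of the two APIs (my own toy example, tiny made-up data):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression               # DataFrame-based spark.ml
from pyspark.ml.linalg import Vectors
from pyspark.mllib.classification import LogisticRegressionWithLBFGS   # RDD-based spark.mllib
from pyspark.mllib.regression import LabeledPoint

spark = SparkSession.builder.appName("ml-vs-mllib").getOrCreate()

# spark.ml: a DataFrame with "label" and "features" columns
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"],
)
ml_model = LogisticRegression(maxIter=10).fit(train_df)

# spark.mllib: an RDD of LabeledPoint objects
train_rdd = spark.sparkContext.parallelize(
    [LabeledPoint(0.0, [0.0, 1.0]), LabeledPoint(1.0, [1.0, 0.0])]
)
mllib_model = LogisticRegressionWithLBFGS.train(train_rdd, iterations=10)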

#spark #big_data
Speaking of Spark MLlib

To the authors' credit, in the last couple of years Apache Spark has gained more algorithms in its machine learning libraries (MLlib and ML ↑) and greatly improved performance.

But a huge number of algorithms are still implemented using the old RDD interfaces, and some are still not implemented there at all (ensembles are an obvious example). Contributing to Spark MLlib today is unrealistic: the company has changed its focus and is trying to support what it already has instead of implementing SOTA algorithms.

It makes sense to me - it's better to write 100500 integration interfaces to third-party products and a good processing engine than to focus on a bunch of hard-to-implement algorithms. Plus, as the number of Python users grows, they're trying to adapt the tool for the general public, for example by implementing Koalas. Plugins and add-ons (Spark packages) are created to provide the right algorithms: Microsoft has MMLSpark, and there are also H2O, Horovod, and many others.

A good article on the Spark ML library's weaknesses

#spark #big_data
Spark configuration

There are many ways to set configuration properties in Spark. I keep getting confused about which is the best place to put them.

Among all the ways you can set Spark properties, the priority order determines which values will be respected.

Based on the loading order:

▪️Any values or flags defined in the spark-defaults.conf file will be read first

▪️Then the values specified on the command line using spark-submit or spark-shell

▪️Finally, the values set through SparkSession in the Spark application.

All these properties are merged in the Spark application, and duplicate keys are resolved in favor of the higher-priority source. Thus, for example, the values provided on the command line will override the settings in the configuration file, unless they are overridden again in the application itself.
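
For illustration, here is the same property set at all three levels (the values and file names are made up); the application-level value wins:

# 1) spark-defaults.conf:
#      spark.executor.memory  2g
#
# 2) command line:
#      spark-submit --conf spark.executor.memory=4g my_job.py
#
# 3) inside the application - this value takes precedence over the other two:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-priority-demo")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
print(spark.conf.get("spark.executor.memory"))  # 8g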

#spark #big_data
The PySpark documentation will follow the numpydoc style. I do not see why: the current Python docs for Spark were always fine, more readable than any of the Java docs.

So this:

"""Specifies some hint on the current :class:DataFrame.

:param name: A name of the hint.
:param parameters: Optional parameters.
:return: :class:DataFrame


will be something like this:

"""Specifies some hint on the current :class:DataFrame.

Parameters
----------
name : str
    A name of the hint.
parameters : dict, optional
    Optional parameters

Returns
-------
DataFrame


It will probably mean more readable HTML and better linking between pages. We'll see.

#spark #python
PySpark configuration provides the spark.python.worker.reuse option, which can be used to choose between forking a Python process for each task and reusing an existing process. If it equals true, a process pool is created and reused on the executors. It should be useful for avoiding expensive serialization, data transfer between the JVM and Python, and even garbage collection.

Though that is more an impression than the result of systematic tests.
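
Setting it looks like this (just a sketch; true is already the default):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("worker-reuse-demo")
    .config("spark.python.worker.reuse", "true")   # "false" forks a worker per task
    .getOrCreate()
)
print(spark.conf.get("spark.python.worker.reuse"))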

#spark #big_data
partitionOverwriteMode

Sometimes it is necessary to overwrite partitions that Spark failed to process, and you need to run the job again to have all the data written correctly.

There are two options here:

1. process and overwrite all data
2. process and overwrite data for the relevant partition

The first option sounds very dumb - doing all the work all over again. But for the second option you need to rewrite the job. Meh - more code means more problems.
Luckily, Spark has a parameter spark.sql.sources.partitionOverwriteMode with the option dynamic. It only overwrites data for the partitions present in the current batch.

This configuration works well in cases where it is possible to overwrite external table metadata with a simple CREATE EXTERNAL TABLE when writing data to an external data store such as HDFS or S3.
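
A sketch of how it looks in PySpark (the output path, column names and data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-overwrite-demo").getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame(
    [("2021-01-01", 1), ("2021-01-02", 2)],
    ["dt", "value"],
)

# With "dynamic", only the partitions present in df (dt=2021-01-01 and
# dt=2021-01-02) are overwritten; other partitions under the path stay intact.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/events")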

#spark #big_data
I've combined my experience of publishing a book into the post
Boston Dynamics has shown how the robot dog Spot's arm works.

Now that Spot has an arm in addition to its legs and cameras, it can do mobile manipulation. It finds and picks up objects (trash), cleans the living room, opens doors, operates switches and valves, tends the garden, and generally has fun.

https://youtu.be/6Zbhvaac68Y