Big Data Science
3.75K subscribers
65 photos
9 videos
12 files
637 links
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🔥GATO: the new SOTA from DeepMind
On May 19, 2022, DeepMind published a paper on a new generalist agent that goes beyond text outputs. GATO works as a multi-modal, multi-task, multi-embodiment generalist policy: the same network with the same weights can play Atari, caption images, chat, stack blocks, and more, deciding from its context what to output: text, joint torques, button presses, or other tokens.
GATO is trained on a large number of datasets, including the experience of agents in both simulated and real-world environments, in addition to many natural-language and image datasets. During training, data from different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that GATO only predicts action and text targets.
When GATO is deployed, a demonstration prompt is tokenized to form the initial sequence. The environment then emits the first observation, which is also tokenized and appended to the sequence. GATO autoregressively samples the action vector one token at a time. Once all tokens of the action vector have been sampled (determined by the environment's action specification), the action is decoded and sent to the environment, which steps and produces a new observation, and the procedure repeats. The model always sees all observations and actions within its context window of 1024 tokens.
https://www.deepmind.com/publications/a-generalist-agent
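The deployment loop described above can be sketched in a few lines. This is a toy illustration, not DeepMind's code: the model, environment, and tokenizer interfaces below are hypothetical stand-ins.

```python
# Minimal sketch of a GATO-style control loop: tokenize each observation,
# sample the action vector one token at a time, decode, step, repeat.
# All objects passed in here are toy stand-ins, not DeepMind's API.

def control_loop(model, env, tokenize, decode_action, prompt_tokens,
                 action_len, context=1024):
    """Alternate between appending observation tokens and sampling action tokens."""
    seq = list(prompt_tokens)              # tokenized demonstration prompt
    obs, done = env.reset(), False
    while not done:
        seq.extend(tokenize(obs))          # observation joins the sequence
        action = []
        for _ in range(action_len):        # autoregressive, one token at a time
            tok = model.next_token(seq[-context:])  # 1024-token context window
            seq.append(tok)
            action.append(tok)
        obs, done = env.step(decode_action(action))
    return seq
```

The key point the sketch captures is that actions are not emitted in one shot: each sampled token re-enters the context before the next one is drawn.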
Data analytics - the blog of a leading Data Scientist working at Uber, one of the authors of 🔥 Machine Learning. The channel's material will help you truly grow into a data professional.

1 channel instead of thousands of textbooks and courses, subscribe: 👇👇👇

🚀 @data_analysis_ml
#test
A false alarm from a car alarm sensor (no real threat) is an error of
Anonymous Quiz
41%
type II
47%
type I
4%
depends on the statistical significance level
8%
it is not an error
🚀New Python: up to 60% faster!
Released in April 2022, the alpha of Python 3.11 can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with CPython compiled by GCC, showed that Python 3.11 scripts run on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now statically allocates its core code objects, speeding up startup. In addition, every time Python calls one of its own functions, a new frame is created; its internal structure has been streamlined so that it holds only the most essential information, dropping the extra memory-management and debugging data.
Also, as of release 3.11, when CPython encounters a Python function that calls another Python function, it sets up a new frame and jumps straight to the code inside it. This avoids calling the C function that interprets Python functions (previously, each Python function call went through a C-level interpreter call). This change further accelerated the execution of Python scripts.
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0
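Since the speedup largely targets pure-Python function calls, you can observe it yourself with the standard timeit module. A small sketch: run the same script under 3.10 and 3.11 and compare the printed timings (absolute numbers will vary by machine).

```python
# Measure the cost of plain Python-to-Python calls with the stdlib timeit
# module; running this under 3.10 and then 3.11 shows the call overhead drop.
import sys
import timeit

def inner(x):
    return x + 1

def outer(n):
    total = 0
    for _ in range(n):
        total = inner(total)   # the pure-Python call that 3.11 inlines
    return total

elapsed = timeit.timeit(lambda: outer(10_000), number=100)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {elapsed:.3f}s")
```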
Forwarded from Big Data Science [RU]
New Python: now much faster!
💥The best of the Airflow Summit, May 2022!
Top talks from data engineers for data engineers: the most interesting sessions, from the intricacies of the batch orchestrator to best practices for deployment and data management.
https://medium.com/apache-airflow/airflow-summit-2022-the-best-of-373bee2527fa
🪢AI for Robotics: NVIDIA's New Version of Isaac Sim
On June 17, 2022, NVIDIA announced the release of a new version of Isaac Sim, its robotics simulation tool and synthetic data generator. The platform accelerates the development, testing, and training of AI for robotics. Developers use Isaac Sim to create production-quality datasets for AI model development, simulate robot navigation and control, and generate test scenarios for robotics applications.
Isaac Sim includes tools for building collaborative robots (cobots), a GPU-accelerated reinforcement-learning tool, a toolkit for creating synthetic data, APIs and workflows, and new capabilities for building synthetic-data pipelines.
https://developer.nvidia.com/blog/expedite-the-development-testing-and-training-of-ai-robots-with-isaac-sim/
🗣Special for MRI: Nilearn Library
Nilearn is a lightweight Python library for analyzing MRI data. It provides statistical and machine learning tools, detailed documentation, and an open community.
Nilearn now includes the functionality of the Nistats library and extends it with new features. It supports general linear model (GLM) analysis and uses scikit-learn for multivariate statistics, with applications such as predictive modeling, classification, decoding, and connectivity analysis. Nilearn also neatly visualizes analysis results as projections onto the human brain.
The library is available under the BSD license and is actively maintained: version 0.1.1 came out in 2015, and release 0.9.1 followed in June 2022. https://nilearn.github.io/stable/index.html
💥💦🌸TOP-5 Data Science and ML conferences around the world in July 2022:
• Jul 7, Challenging The Status Quo to Avoid AI Failure. Virtual. https://www.cognilytica.com/CLNjc0ODl8MjI
• Jul 12, oneAPI DevSummit for AI. Virtual. https://software.seek.intel.com/oneapi-devsummit-ai-2022
• Jul 13-14, Business of Data Festival. Virtual. https://www.businessofdatafestival.com/
• Jul 13-17, MLDM 2022: 18th Int. Conf. on Machine Learning and Data Mining. New York, NY, USA http://www.mldm.de/mldm2022.php
• Jul 16-21, ICDM 2022: 22nd Industrial Conf. on Data Mining. New York, NY, USA. https://www.data-mining-forum.de/icdm2022.php
👌TOP-5 MLOps frameworks
MLOps is a hot topic today: ML systems keep getting more complex, and building and supporting them involves not only computational algorithms and the rest of data science, but also solid software-engineering practice, including the intricacies of deploying to production. At the same time, you constantly have to monitor drift in the input data and how ML models react to it, while carefully versioning every code correction. MLOps frameworks help establish such a continuous response loop; the most important of them include:
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. https://www.mlflow.org/
Kubeflow is an open-source machine learning platform designed to orchestrate complicated ML workflows running on Kubernetes (e.g. processing data, then training a model with TensorFlow or PyTorch, and deploying it to TensorFlow Serving or Seldon). Kubeflow is based on TensorFlow Extended, Google's internal method for deploying TensorFlow models. https://www.kubeflow.org/
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. It fully supports asynchronous programming and can run with Uvicorn and Gunicorn. https://fastapi.tiangolo.com/
ZenML is an open-source MLOps framework built for ML teams. It provides the abstraction layer to bring Machine Learning Models from research to production as easily as possible. Data scientists don’t need to know the details behind the deployment but gain full control and ownership over the whole pipeline process. ZenML standardizes writing ML pipelines across different MLOps stacks, agnostic of cloud providers, third-party vendors, and underlying infrastructure. https://zenml.io/
Seldon Core is an open-source framework that makes it easier and faster to deploy machine learning models and experiments at scale on Kubernetes. It serves models built in any open-source or commercial framework. You can use powerful Kubernetes features such as custom resource definitions to manage model graphs, and then connect your continuous integration and deployment (CI/CD) tools to scale and update deployments. Seldon handles scaling to thousands of production machine-learning models and provides advanced capabilities out of the box, including advanced metrics, request logging, explainers, outlier detectors, A/B tests, canaries, and more. https://www.seldon.io/solutions/open-source-projects/core
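The input-data drift monitoring mentioned above can start as something very simple: compare summary statistics of a live batch against a reference sample. A toy sketch (the threshold and the mean-shift statistic are illustrative; frameworks like Seldon ship far richer detectors):

```python
# Toy input-drift check: flag a feature when the mean of a live batch
# deviates from the reference mean by more than `threshold` reference
# standard deviations. Purely illustrative, stdlib only.
from statistics import mean, stdev

def drift_report(reference, live, threshold=3.0):
    """reference/live: dicts mapping feature name -> list of values."""
    flagged = {}
    for feature, ref_values in reference.items():
        mu, sigma = mean(ref_values), stdev(ref_values)
        z = abs(mean(live[feature]) - mu) / (sigma or 1.0)
        if z > threshold:
            flagged[feature] = round(z, 2)   # drift score in reference sigmas
    return flagged
```

A pipeline would run such a check on every incoming batch and trigger retraining or alerting when the report is non-empty.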
🪢2 libraries for unit testing Python scripts
Unit testing lets developers verify that their code base works as intended at an atomic level. The point of unit testing is not to show that the code works as a whole, but that each individual function does what it is supposed to do. For writing Python unit tests you can use Pytest and Chispa:
• Pytest is a framework for writing small, readable tests that can scale to support complex functional testing for applications and libraries. It requires Python 3.7+ or PyPy3. https://docs.pytest.org/en/7.1.x/
• Chispa provides fast PySpark test helper methods that output descriptive error messages, making it easy to write high-quality PySpark code. By the way, "chispa" means "spark" in Spanish. https://github.com/MrPowers/chispa
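A pytest test is just a plain function whose name starts with test_ and whose body uses bare assert statements. A minimal sketch (the helper under test is made up for illustration); save it as test_clean.py and run pytest:

```python
# Minimal pytest-style unit tests for a small, illustrative helper.
def normalize_whitespace(s: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(s.split())

def test_collapses_inner_runs():
    assert normalize_whitespace("big  data\t science") == "big data science"

def test_trims_edges():
    assert normalize_whitespace("  spark  ") == "spark"
```

pytest discovers both functions automatically and reports each assert failure with the compared values, which is exactly the atomic-level feedback described above.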
#test
In recurrent neural networks, the output of a neuron depends on
Anonymous Quiz
2%
inputs
21%
inputs and weights
74%
inputs, weights and outputs
2%
weights
💥A Python library for sending PySpark scripts to a Spark cluster via the Livy REST API
Spark developers know that there are two standard ways to programmatically submit jobs to an Apache Spark cluster: spark-submit and spark-shell. Each has limitations when real-time interaction is required.
In practice, however, you sometimes need to submit Spark jobs interactively from a web or mobile app: the Spark cluster is hosted on-premises, but many users must simultaneously run heavy aggregations over its data sources from their phones, web, or desktop applications. Here a service approach, Spark-as-a-Service, helps: either exposing JDBC/ODBC data sources through the Spark Thrift Server, or using Apache Livy, a service that lets you easily interact with an Apache Spark cluster through a REST API.
For Livy, the Python package livyc works well: it sends PySpark scripts dynamically and asynchronously to the Apache Livy server, interacting transparently with the Spark cluster. https://github.com/Wittline/livyc
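Under the hood this boils down to two REST calls that Livy documents: creating an interactive PySpark session, then posting code statements against it. A stdlib-only sketch (the host is a placeholder; livyc wraps this with polling and async handling on top):

```python
# Sketch of the two Livy REST calls behind interactive PySpark submission:
# POST /sessions to open a session, POST /sessions/{id}/statements to run code.
import json
import urllib.request

LIVY = "http://localhost:8998"   # placeholder Livy server address

def livy_request(path, payload):
    """Build a JSON POST request for the Livy server."""
    return urllib.request.Request(
        LIVY + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def submit_pyspark(code):
    # 1) open an interactive PySpark session
    with urllib.request.urlopen(livy_request("/sessions", {"kind": "pyspark"})) as r:
        session = json.load(r)
    # 2) submit the script as a statement against that session
    stmt = livy_request(f"/sessions/{session['id']}/statements", {"code": code})
    with urllib.request.urlopen(stmt) as r:
        return json.load(r)
```

In a real deployment you would also poll the statement URL until its state becomes "available" before reading the result.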
🗣7 speech recognition tools
To develop your own ML speech recognition system, you can use the following frameworks and libraries:
• wav2letter - an open-source toolkit from Facebook AI Research, now merged into a larger library called Flashlight https://github.com/flashlight/wav2letter
• DeepSpeech - powered by Baidu's DeepSpeech research; helps you transcribe audio files with pre-trained models or train on a custom dataset https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
• TensorFlowASR - an open-source TensorFlow package implementing several reference ASR models trained with RNNs and CTC loss https://github.com/TensorSpeech/TensorFlowASR
• OpenSeq2Seq - an NVIDIA research project on sequence-to-sequence problems https://github.com/NVIDIA/OpenSeq2Seq/blob/master/Streaming-ASR.ipynb
• SpeechRecognition - a project providing access to several automatic speech recognition engines, including wrappers for the speech APIs of Google, Microsoft Azure, and IBM https://github.com/Uberi/speech_recognition

We also note 2 ready-made services that provide APIs for everything from speech recognition to generating natural-sounding voices:
• SmartSpeech by SberDevices https://sberdevices.ru/smartspeech/
• Yandex SpeechKit by Yandex https://cloud.yandex.ru/services/speechkit
#test
ACID requirements for transactions are fully implemented in
Anonymous Quiz
60%
relational databases
7%
NoSQL databases
16%
any databases
16%
OLTP-databases
🙌🏼7 Platforms of Federated ML
Federated learning is also called collaborative learning because ML models are trained on multiple decentralized edge devices or servers holding local data samples, without exchanging those samples. This differs both from traditional centralized ML, where all local datasets are uploaded to a single server, and from more classical decentralized approaches that assume identically distributed local data. Today, federated learning is actively used in the defense industry, telecommunications, pharmaceuticals, and IoT platforms.
Federated machine learning was first introduced by Google in 2017 to improve mobile keyboard text prediction using ML models trained on data from many devices. In federated ML, models are trained on multiple local datasets on local nodes without explicit data exchange; instead, parameters such as the weights and biases of a deep neural network are exchanged periodically between nodes to build a shared global model. Unlike distributed learning, which originally aimed at parallelizing computation, federated learning targets training on heterogeneous datasets: local datasets usually differ greatly in size, and the clients, i.e. the end devices where local models are trained, can be unreliable and far more failure-prone than the data-center nodes of distributed learning systems. Therefore, to coordinate the distributed computation and synchronize its results, federated ML requires frequent parameter exchange between nodes.
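The periodic parameter exchange described above is typically done with federated averaging (FedAvg): each client trains on its own data, and the server combines the resulting weights, weighted by local dataset size. A toy pure-Python sketch of the aggregation step only (real frameworks like TFF or FATE handle communication, privacy, and much more):

```python
# Toy FedAvg aggregation: combine per-client weight vectors into a global
# model, weighting each client by its local dataset size. The raw training
# data never leaves the clients; only these weight vectors are exchanged.
def federated_average(client_weights, client_sizes):
    """client_weights: list of equal-length weight lists; client_sizes: ints."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            global_w[i] += w[i] * n / total
    return global_w
```

In a full round, the server would broadcast global_w back to the clients, which continue training locally before the next aggregation.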
Due to its architectural features, federated ML has a number of disadvantages:
• heterogeneity between local datasets - each node's sample is biased relative to the general population, and sample sizes can vary significantly;
• temporal heterogeneity - the distribution of each local dataset changes over time;
• it is necessary to ensure the compatibility of the data set on all nodes;
• hiding training datasets is fraught with the risk of introducing vulnerabilities into the global model;
• lack of access to global training data makes it difficult to identify unwanted biases in training inputs;
• there is a risk of losing updates to local ML models due to failures at individual nodes, which may affect the global model.
Today, federated ML is supported by the following platforms:
• FATE (Federated AI Technology Enabler) https://fate.fedai.org/
• Substra https://www.substra.ai/
• Python libraries PySyft and PyGrid https://github.com/OpenMined/PySyft, https://github.com/OpenMined/PyGrid, https://github.com/OpenMined/pygrid-admin
• OpenFL https://github.com/intel/openfl
• TensorFlow Federated (TFF) https://www.tensorflow.org/federated
• IBM Federated Learning https://ibmfl.mybluemix.net/
• NVIDIA CLARA https://developer.nvidia.com/clara
🙌🏼Computational complexity of ML algorithms
When the amount of data is small, almost any ML algorithm gives acceptable accuracy and suits the task. But when the volume and dimensionality of the data grow large, you need to choose a training algorithm that does not demand excessive computing resources. A simpler or computationally cheaper algorithm is the better choice when its prediction accuracy is similar, or even only slightly worse.
The choice of algorithm depends on the following factors:
• time complexity - the order of time required to run the algorithm, a function of the algorithm itself, the data volume, and the number of features;
• space complexity - the order of memory required while the algorithm runs, a function of the algorithm's parameters, such as the number of features, coefficients, or hidden layers of a neural network. Space complexity includes both the size of the input data and the auxiliary space the algorithm uses during execution.
For example, Mergesort uses 𝑂(𝑛) auxiliary space, while in-place Quicksort needs only 𝑂(log 𝑛) auxiliary space for its recursion stack; both have an average time complexity of 𝑂(𝑛 log 𝑛).
https://medium.com/datadailyread/computational-complexity-of-machine-learning-algorithms-16e7ffcafa7d
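The 𝑂(𝑛 log 𝑛) claim above is easy to check empirically by instrumenting a sort to count comparisons. A small sketch with mergesort (note the merged sublists it allocates are exactly its 𝑂(𝑛) auxiliary space):

```python
# Count comparisons in mergesort to see O(n log n) behaviour empirically:
# for n elements, the count stays below roughly n * log2(n).
def merge_sort(a, counter):
    """Sort list `a`; counter is a one-element list accumulating comparisons."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid], counter)
    right = merge_sort(a[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                   # one comparison per merge step
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]  # O(n) auxiliary lists per level
```

Doubling n should roughly add one extra comparison per element (one more merge level), which is the signature of 𝑛 log 𝑛 growth.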
#test
Which of the following methods helps avoid overfitting an ML model on a large volume of highly noisy input data by selecting the most significant features?
Anonymous Quiz
14%
filtration
47%
L1 regularization
27%
L2 regularization
12%
normalization
👍🏻TOP 4 dbt tips for data analysts and data engineers
dbt (data build tool) is an open-source framework for executing, testing, and documenting SQL queries that lets you automate much of the data-analysis workflow: structuring and describing incoming data, searching it, nested references, rule triggering, documentation, and testing. For example, you can use the dbt CLI or dbt Cloud to ingest, transform, and load data into a warehouse by recomputing models on a schedule. To use dbt tests for schemas, sources, and models more effectively, keep the following in mind:
• The schema.yml file lives in the dbt models folder; there you can define schema tests, e.g. a test that checks a column for nulls.
• dbt data tests have a strict rule: they must return zero rows to pass. Instead of checking for a value such as the count of a particular set of rows, a data test should be written so that it returns zero rows when the results match expectations. So when developing a data test, think about how to return 0 rows in the expected case while still validating the numbers; the != and <= operators are handy for this.
• To speed up testing, increase the number of threads in the project profile, in the profiles.yml file. For example, with 30 tests you can set 40 threads in profiles.yml; 30 data and schema tests can then finish in about 4 seconds.
• Give every test a meaningful name. Although dbt generates test names automatically, it is better to label them yourself. dbt offers little control over running small subsets of tests, so you need to be able to see what is running across the project: just as developers are encouraged to use semantically named functions and variables, tests deserve meaningful names too. Otherwise, it is hard to tell which test passed or failed, since when an error is found, dbt runs all schema and data tests together. It is not easy to target a single directory in the data-tests folder, but names like "dbt test - schema" or "dbt test - data" make it quick to see which tests are which.

https://corissa-haury.medium.com/4-quick-facts-about-dbt-testing-5c32b487b8cd