Data analytics - the blog of a leading Data Scientist at Uber, one of the authors of 🔥 Machine Learning. The channel's material will help you grow into a real data professional.
1 channel instead of thousands of textbooks and courses, subscribe: 👇👇👇
🚀 @data_analysis_ml
#test
A false signal from a car alarm sensor (with no real threat) is an error of which type?
Anonymous Quiz
41%
type II
47%
type I
4%
depends on the statistical significance level
8%
it is not an error
🚀New Python 3.11: up to 60% faster!
Released in April 2022, the alpha of Python 3.11 can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with CPython compiled by GCC, showed that Python 3.11 scripts run on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now statically allocates its code objects, which speeds up the start of execution. Every time Python calls one of its own functions, a new frame is created, and the internal structure of frames has been slimmed down so that they keep only the most essential information, without extra memory-management and debugging data.
Also, as of release 3.11, when CPython encounters a Python function that calls another Python function, it sets up a new frame and jumps straight to the new code inside it. This avoids the C-level interpreter call (previously, each Python function call went through a C function that interpreted it). This change further speeds up the execution of Python scripts.
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0
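To see the frame-related speedup on call-heavy code yourself, a minimal sketch (the file name is illustrative) is to run the same script under both interpreters and compare wall times:
```python
# bench_fib.py - run under python3.10 and python3.11 and compare the timings.
import sys
import time

def fib(n: int) -> int:
    # deliberately naive recursion: dominated by Python function-call overhead
    return n if n < 2 else fib(n - 1) + fib(n - 2)

start = time.perf_counter()
fib(30)
elapsed = time.perf_counter() - start
print(f"{sys.version.split()[0]}: fib(30) took {elapsed:.3f}s")
```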
💥Best of the May 2022 Airflow Summit!
Top talks from data engineers for data engineers: the most interesting sessions, from the internals of the batch orchestrator to best practices for deployment and data management.
https://medium.com/apache-airflow/airflow-summit-2022-the-best-of-373bee2527fa
#test
Statistical power does NOT depend on
Anonymous Quiz
21%
the magnitude of the effect of interest in the population
52%
expected value
9%
the sample size used to detect the effect
18%
the statistical significance criterion used in the test
🪢AI for Robotics: NVIDIA's New Version of Isaac Sim
On June 17, 2022, NVIDIA announced a new release of Isaac Sim, its robotics simulation tool and synthetic data generator. The platform accelerates the development, testing and training of AI for robotics. Developers use Isaac Sim to create production-quality datasets for building AI models, to simulate robot navigation and control, and to generate test environments for robotics applications.
Isaac Sim includes tools for building collaborative robots (cobots), a GPU-accelerated reinforcement learning tool, a set of tools, APIs and workflows for generating synthetic data, and new capabilities for building synthetic data products.
https://developer.nvidia.com/blog/expedite-the-development-testing-and-training-of-ai-robots-with-isaac-sim/
🗣Especially for MRI: the Nilearn Library
Nilearn is a lightweight Python library for analyzing MRI data. It provides statistical and machine learning tools, detailed documentation, and an open community.
Nilearn now includes the functionality of the Nistats library and extends it with new features. It supports general linear model (GLM) analysis and uses scikit-learn for multivariate statistics in applications such as predictive modeling, classification, decoding, and connectivity analysis. Nilearn also produces excellent visualizations of analysis results as projections onto the human brain.
The library is available under the BSD license and is actively maintained: version 0.1.1 came out in 2015, and version 0.9.1 is current as of June 2022. https://nilearn.github.io/stable/index.html
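As a first taste of the library, here is a minimal sketch that plots the standard MNI152 template brain (assumes `pip install nilearn matplotlib`):
```python
# Plot the standard MNI152 template brain with nilearn.
from nilearn import datasets, plotting

template = datasets.load_mni152_template()  # standard MNI brain volume
plotting.plot_anat(template, title="MNI152 template")  # orthogonal slices
plotting.show()
```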
💥💦🌸TOP-5 Data Science and ML conferences around the world in July 2022:
• Jul 7, Challenging The Status Quo to Avoid AI Failure. Virtual. https://www.cognilytica.com/CLNjc0ODl8MjI
• Jul 12, oneAPI DevSummit for AI. Virtual. https://software.seek.intel.com/oneapi-devsummit-ai-2022
• Jul 13-14, Business of Data Festival. Virtual. https://www.businessofdatafestival.com/
• Jul 13-17, MLDM 2022: 18th Int. Conf. on Machine Learning and Data Mining. New York, NY, USA http://www.mldm.de/mldm2022.php
• Jul 16-21, ICDM 2022: 22nd Industrial Conf. on Data Mining. New York, NY, USA. https://www.data-mining-forum.de/icdm2022.php
👌TOP-5 MLOps frameworks
MLOps is all the hype today: ML systems keep getting more complex, and developing and supporting them involves not only computational algorithms and the rest of data science, but also solid software engineering and the intricacies of production deployment. At the same time, you have to continuously monitor drift in the input data and how ML models react to it, while carefully versioning every code change. Building such a continuous response chain is what MLOps frameworks are for; the most important of them are:
• MLflow is an open source platform for managing the end-to-end machine learning lifecycle (see the tracking sketch after this list). https://www.mlflow.org/
• Kubeflow is an open-source machine learning platform for orchestrating complicated ML workflows on Kubernetes with machine learning pipelines (e.g., processing data, training a model with TensorFlow or PyTorch, and deploying it to TensorFlow Serving or Seldon). Kubeflow is based on Google's internal method for deploying TensorFlow models, TensorFlow Extended. https://www.kubeflow.org/
• FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. It fully supports asynchronous programming and can run with Uvicorn and Gunicorn. https://fastapi.tiangolo.com/
• ZenML is an open-source MLOps framework built for ML teams. It provides the abstraction layer to bring Machine Learning Models from research to production as easily as possible. Data scientists don’t need to know the details behind the deployment but gain full control and ownership over the whole pipeline process. ZenML standardizes writing ML pipelines across different MLOps stacks, agnostic of cloud providers, third-party vendors, and underlying infrastructure. https://zenml.io/
• Seldon Core is an open-source framework that makes it easier and faster to deploy machine learning models and experiments at scale on Kubernetes. It serves models built in any open-source or commercial model-building framework. You can use powerful Kubernetes features such as custom resource definitions to manage model graphs, and connect your continuous integration and deployment (CI/CD) tools to scale and update your deployments. Seldon scales to thousands of production machine learning models and provides advanced capabilities out of the box, including advanced metrics, request logging, explainers, outlier detectors, A/B tests, canaries, and more. https://www.seldon.io/solutions/open-source-projects/core
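As a taste of the first item on the list, here is a minimal MLflow tracking sketch (assumes `pip install mlflow scikit-learn`; the dataset and hyperparameters are illustrative). After running it, `mlflow ui` shows the logged run:
```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)      # hyperparameter
    mlflow.log_metric("mse", mse)              # evaluation metric
    mlflow.sklearn.log_model(model, "model")   # versioned model artifact
```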
🪢2 libraries for unit testing Python scripts
Unit testing lets developers verify that their code base works as intended at an atomic level. The point of unit testing is not to show that the code works as a whole, but that each individual function does what it is supposed to do. For writing Python unit tests you can use Pytest and Chispa (see the sketch after this list):
• Pytest is a framework for writing small, readable tests that can scale to support complex functional testing of applications and libraries. It requires Python 3.7+ or PyPy3. https://docs.pytest.org/en/7.1.x/
• Chispa provides fast PySpark test helper methods that output descriptive error messages, making it easy to write high-quality PySpark code. By the way, "chispa" means "spark" in Spanish. https://github.com/MrPowers/chispa
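A minimal pytest sketch (the function under test is illustrative): save it as test_utils.py and run `pytest -q` in the same directory:
```python
def word_count(text: str) -> int:
    """Toy function under test."""
    return len(text.split())

def test_word_count_simple():
    assert word_count("unit tests are great") == 4

def test_word_count_empty():
    assert word_count("") == 0
```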
#test
In recurrent neural networks, the output of a neuron depends on
Anonymous Quiz
2%
inputs
21%
inputs and weights
74%
inputs, weights and outputs
2%
weights
💥A Python library for submitting PySpark scripts to a Spark cluster via the Livy REST API
Spark application developers know there are two standard ways to programmatically submit jobs to an Apache Spark cluster, spark-submit and spark-shell, and each has limitations when real-time interaction is needed.
In practice, however, you sometimes need to submit Spark jobs interactively from a web or mobile app: the Apache Spark cluster is hosted on-premises, yet many users must simultaneously consume data and run heavy aggregations over data sources from their phones, web or desktop applications. In this case a service approach, Spark-as-a-Service, helps: either expose JDBC/ODBC data sources through the Spark Thrift server, or use Apache Livy, a service that lets you interact with an Apache Spark cluster easily via a REST API.
For Livy, the Python package livyc works well: it submits PySpark scripts dynamically and asynchronously to an Apache Livy server, interacting transparently with the Apache Spark cluster. https://github.com/Wittline/livyc
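Under the hood, clients like livyc talk to Livy's documented REST endpoints. Here is a minimal sketch of that flow using plain requests, assuming a Livy server at localhost:8998; for livyc's own higher-level API, see the repository README:
```python
import time
import requests

LIVY = "http://localhost:8998"

# 1. Create an interactive PySpark session
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# 2. Wait until the session is ready to accept statements
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(1)

# 3. Submit a PySpark snippet as a statement
code = "df = spark.range(100)\nprint(df.count())"
stmt = requests.post(f"{session_url}/statements", json={"code": code}).json()

# 4. Poll until the result is available, then print it
stmt_url = f"{session_url}/statements/{stmt['id']}"
while (result := requests.get(stmt_url).json())["state"] != "available":
    time.sleep(1)
print(result["output"])

# 5. Tear the session down
requests.delete(session_url)
```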
🗣7 speech recognition tools
To develop your own ML speech recognition system, you can use the following frameworks and libraries:
• wav2letter - an open-source toolkit from Facebook AI Research, now merged into a larger library called Flashlight https://github.com/flashlight/wav2letter
• DeepSpeech - an engine based on Baidu's Deep Speech research that helps you transcribe audio files with pre-trained models or set up/train a model on a custom dataset https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
• TensorFlowASR - an open-source TensorFlow package implementing several reference models trained with RNNs and CTC https://github.com/TensorSpeech/TensorFlowASR
• OpenSeq2Seq - an NVIDIA research project on sequence-to-sequence problems https://github.com/NVIDIA/OpenSeq2Seq/blob/master/Streaming-ASR.ipynb
• SpeechRecognition - a project that provides access to several automatic speech recognition engines, including wrappers for the speech APIs of Google, Microsoft Azure and IBM (see the sketch after this list) https://github.com/Uberi/speech_recognition
We also note 2 ready-made services that provide APIs for everything from speech recognition to generating natural-sounding voice output:
• SmartSpeech by SberDevices https://sberdevices.ru/smartspeech/
• Yandex SpeechKit by Yandex https://cloud.yandex.ru/services/speechkit
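As a taste of the SpeechRecognition library above, here is a minimal sketch (assumes `pip install SpeechRecognition` and an existing sample.wav, a hypothetical file):
```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:  # open the audio file
    audio = recognizer.record(source)       # read the entire file

# Send the audio to the free Google Web Speech API wrapper
try:
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")
```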
#test
ACID requirements for transactions are fully implemented in
Anonymous Quiz
60%
relational databases
7%
NoSQL databases
16%
any databases
16%
OLTP-databases
🙌🏼7 Platforms of Federated ML
Federated learning is also called collaborative learning because ML models are trained on multiple decentralized edge devices or servers that hold local data samples, without exchanging those samples. This approach differs both from traditional centralized ML, where all local datasets are uploaded to a single server, and from more classical decentralized approaches that assume identically distributed local data. Today, federated learning is actively used in the defense industry, telecommunications, pharmaceuticals and IoT platforms.
The ideas of federated machine learning were first introduced by Google in 2017 to improve mobile keyboard text prediction using models trained on data from many devices. In federated ML, models are trained on multiple local datasets on local nodes without explicit data exchange; instead, parameters such as the weights and biases of a deep neural network are exchanged periodically between local nodes to build a shared global model (a minimal aggregation sketch follows the platform list below). Unlike distributed learning, which originally aimed to parallelize computation, federated learning aims to train on heterogeneous datasets, which typically vary greatly in size. And the clients, i.e. the end devices where local models are trained, can be unreliable and more failure-prone than in distributed learning systems, where the nodes are data centers with powerful computing capabilities. Hence, to coordinate the distributed computation and synchronize its results, federated ML requires frequent data exchange between nodes.
Due to its architectural features, federated ML has a number of disadvantages:
• heterogeneity between local datasets - each node's sample may be biased relative to the overall population, and sample sizes can vary significantly;
• temporal heterogeneity - the distribution of each local dataset changes over time;
• it is necessary to ensure the compatibility of the data set on all nodes;
• hiding training datasets is fraught with the risk of introducing vulnerabilities into the global model;
• lack of access to global training data makes it difficult to identify unwanted biases in training inputs;
• there is a risk of losing updates to local ML models due to failures at individual nodes, which may affect the global model.
Today, federated ML is supported by the following platforms:
• FATE (Federated AI Technology Enabler) https://fate.fedai.org/
• Substra https://www.substra.ai/
• Python libraries PySyft and PyGrid https://github.com/OpenMined/PySyft, https://github.com/OpenMined/PyGrid, https://github.com/OpenMined/pygrid-admin
• Open FL https://github.com/intel/openfl
• TensorFlow Federated (TFF) https://www.tensorflow.org/federated
• IBM Federated Learning https://ibmfl.mybluemix.net/
• NVIDIA CLARA https://developer.nvidia.com/clara
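To make the parameter-averaging idea concrete, here is a minimal FedAvg aggregation sketch in plain NumPy; it is illustrative only and not tied to any of the platforms above:
```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    # Weighted average of client parameters: the core FedAvg aggregation step.
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients train the same linear model locally and send back only weights.
clients = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]  # local dataset sizes; bigger clients get more say

global_weights = fed_avg(clients, sizes)
print(global_weights)  # -> [1.05 0.95], the new global model
```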
🙌🏼Computational complexity of ML algorithms
When the amount of data is small, almost any ML algorithm gives acceptable accuracy and is suitable for the task. But when the volume and dimensionality of the data grow, you need to choose a training algorithm that does not demand too many computing resources. It is better to pick a simple or computationally cheap algorithm over one that requires large computational resources when the prediction accuracy is similar or only slightly worse.
The choice of algorithm depends on the following factors:
• time complexity - the order of time required to run the algorithm, a function of the algorithm itself, the data volume, and the number of features;
• space complexity - the order of memory required while the algorithm runs, a function of the algorithm, e.g. the number of features, coefficients, or hidden layers of a neural network. Space complexity includes both the size of the input data and the auxiliary space the algorithm uses during execution.
For example, Mergesort uses O(n) auxiliary space and O(n) space overall, while Quicksort uses O(1) auxiliary space and O(n) space overall. Both merge sort and quick sort have O(n log n) average time complexity.
https://medium.com/datadailyread/computational-complexity-of-machine-learning-algorithms-16e7ffcafa7d
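As a quick empirical illustration of O(n log n) behavior, the sketch below times Python's built-in sorted() (Timsort, a mergesort variant) on growing inputs; the timings should grow only slightly faster than linearly:
```python
import random
import timeit

for n in (10_000, 100_000, 1_000_000):
    data = [random.random() for _ in range(n)]
    t = timeit.timeit(lambda: sorted(data), number=3) / 3  # average of 3 runs
    print(f"n={n:>9,}  avg sort time: {t:.4f}s")
```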
#test
Which of the following methods helps avoid overfitting an ML model on a large volume of highly noisy input data by selecting only the most significant features?
Anonymous Quiz
14%
filtration
47%
L1 regularization
27%
L2 regularization
12%
normalization
👍🏻TOP 4 dbt testing tips for data analysts and data engineers
dbt (data build tool) is an open-source framework for executing, testing and documenting SQL queries that lets you approach data analytics like software engineering: structuring and describing models, referencing them, nesting calls, triggering rules, documenting and testing. For example, you can use the dbt CLI or dbt Cloud in your data pipeline to extract, transform and load data into a warehouse on a schedule. To test schemas, sources and models more effectively, keep these facts in mind:
• The schema.yml file lives only in the dbt models folder. It lets you declare a schema test, for example one that checks a column for nulls.
• dbt data tests follow a strict rule: a test passes only when its query returns zero rows. So instead of selecting rows that match an expected value, write the data test so that it returns rows only when the results are wrong; operators such as != or <= are handy for expressing such checks.
• To speed up testing, increase the number of threads in the project profile in profiles.yml. With, say, 40 threads specified in profiles.yml, a suite of 30 data and schema tests can finish in about 4 seconds.
• Every test needs a meaningful name. Although dbt generates test names automatically, it is better to label them yourself. dbt gives little control over running small subsets of tests: when a run starts, all schema and data tests execute together, and it is not easy to target a single directory in the data tests folder. Just as developers are encouraged to give functions and variables semantic names, give tests meaningful names, e.g. prefixed with "dbt test - schema" or "dbt test - data", so you can quickly tell which test passed or failed.
https://corissa-haury.medium.com/4-quick-facts-about-dbt-testing-5c32b487b8cd
🔥PyMLPipe: A lightweight MLOps Python Package
PyMLPipe is a lightweight Python package for MLOps processes. It helps to automate:
• Monitoring of models and data schemas
• Versioning of ML models and data
• Model performance comparison
• API deployment in one click
This open-source library supports Scikit-Learn, XGBoost, LightGBM and PyTorch. It has a modular structure: a set of Python functions combined into an API, plus a visual graphical interface. PyMLPipe is great for working with tabular data (see the sketch below).
https://neelindresh.github.io/pymlpipe.documentation.io/
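A hypothetical usage sketch, loosely following the project documentation; treat the class and method names (PyMLPipe, set_experiment, log_metric, scikit_learn.register_model) as assumptions to verify against the docs linked above:
```python
# Hypothetical PyMLPipe sketch - verify names against the official docs.
from pymlpipe.tabular import PyMLPipe
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = PyMLPipe()
mlp.set_experiment("iris-demo")  # experiment name (illustrative)
mlp.set_version(0.1)

with mlp.run():
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    mlp.log_metric("accuracy", acc)                    # track the metric
    mlp.scikit_learn.register_model("logreg", model)   # version the model
```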