🌞Python development according to the Twelve-Factor App principles with the python-dotenv library
ML modelers and data analysts don't always write code like professional programmers. To improve code quality, use the Twelve-Factor App, a simple methodology for developing web applications and SaaS. It recommends that applications:
• use declarative formats for setup automation, to minimize the time and cost for new developers joining the project;
• have a clean contract with the underlying operating system, offering maximum portability between execution environments;
• are suitable for deployment on modern cloud platforms, obviating the need for server and systems administration;
• minimize divergence between development and production, enabling continuous deployment for maximum agility;
• can scale up without significant changes to tooling, architecture, or development practices.
To implement these ideas, the methodology proposes building applications on twelve factors:
1. Codebase: one codebase tracked in version control, many deploys
2. Dependencies: explicitly declare and isolate dependencies
3. Config: store configuration in the environment
4. Backing services: treat backing services as attached resources
5. Build, release, run: strictly separate the build and run stages
6. Processes: execute the app as one or more stateless processes
7. Port binding: export services via port binding
8. Concurrency: scale out via the process model
9. Disposability: maximize robustness with fast startup and graceful shutdown
10. Dev/prod parity: keep development, staging, and production as similar as possible
11. Logs: treat logs as event streams
12. Admin processes: run administrative/management tasks as one-off processes
To implement all this in a Python program, try the open-source python-dotenv library. It reads key-value pairs from a .env file and can set them as environment variables. If an application takes its configuration from environment variables, running it during development can be inconvenient, because the developer has to set those variables manually. Adding python-dotenv to your application simplifies the development process: the library loads the settings from the .env file itself, while the application remains configurable through the environment.
You can also load the configuration without modifying the environment, parse configuration from a stream, and load .env files in IPython. The tool also has a CLI that lets you manipulate a .env file without opening it manually.
https://github.com/theskumar/python-dotenv
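The core idea can be shown in a few lines. With the library installed, `from dotenv import load_dotenv; load_dotenv()` is all it takes; as a hedged sketch, here is a minimal stdlib-only loader that approximates that behavior (no quoting or variable interpolation, unlike the real library):

```python
import os
import tempfile

# Real usage with python-dotenv would be:
#   from dotenv import load_dotenv
#   load_dotenv()                    # reads .env into os.environ
#   db_host = os.getenv("DB_HOST")
# A minimal approximation of what load_dotenv does:

def load_env_file(path):
    """Set KEY=VALUE pairs from `path` as environment variables,
    without overriding variables that are already set."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo: write a temporary .env file and load it.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# local settings\nAPP_DEBUG=1\nDB_HOST=localhost\n")
    env_path = f.name

os.environ["DB_HOST"] = "prod.example.com"  # real environment still wins
load_env_file(env_path)
print(os.getenv("APP_DEBUG"))  # -> 1
print(os.getenv("DB_HOST"))    # -> prod.example.com
```

Note how the existing environment takes precedence over the .env file: that is exactly what keeps the app "configurable through the environment" in production.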
🌞TOP-10 Data Science and ML conferences around the world in June 2022:
1. Jun 13, Machine Learning Methods in Visualisation for Big Data 2022. Rome, Italy. https://events.tuni.fi/mlvis/mlvis2022/
2. Jun 14-15, Chief Data & Analytics Officers, Spring. San Francisco, CA, USA. https://cdao-spring.coriniumintelligence.com/
3. Jun 15-16, The AI Summit London. London, UK. https://london.theaisummit.com/
4. Jun 20-22, CLA2022: The 16th International Conference on Concept Lattices and Their Applications. Tallinn, Estonia. https://cs.ttu.ee/events/cla2022/
5. Jun 19-24, Machine Learning Week, Predictive Analytics World conferences. Las Vegas, NV, USA. https://www.predictiveanalyticsworld.com/machinelearningweek/
6. Jun 20-21, Deep Learning World Europe. Munich, Germany. https://deeplearningworld.de/
7. Jun 21, Data Engineering Show On The Road. London, UK. https://hi.firebolt.io/lp/the-data-engineering-show-on-the-road-london
8. Jun 22, Data Stack Summit 2022. Virtual. https://datastacksummit.com/
9. Jun 28-29, Future.AI. Virtual. https://events.altair.com/future-ai/
10. Jun 29, Designing Flexibility to Address Uncertainty in the Supply Chain with AI. Chicago, IL, USA. https://www.luc.edu/leadershiphub/centers/aibusinessconsortium/upcomingevents/archive/designing-flexible-supply-chains-with-ai.shtml
#test
What is the difference between projection and view in relational databases?
Anonymous Quiz
3%
these terms are the same
11%
these terms are relevant to different context
74%
projection is operation of relational algebra, view is result of query execution
11%
view is operation of relational algebra, projection is result of query execution
🔥LAION-5B: open dataset for multi-modal ML with 5+ billion text-image pairs
On May 31, 2022, LAION, a non-profit organization of AI researchers, presented the largest dataset of 5.85 billion image-text pairs filtered using CLIP. LAION-5B is 14 times larger than its predecessor, LAION-400M, which was previously the world's largest open image-text dataset.
2.3 billion pairs are in English, while the rest of the dataset contains samples in over 100 other languages. The release also includes several nearest-neighbor indices, an improved web interface for exploration and subsetting, and watermark-detection and NSFW scores. The dataset is recommended for research purposes and is not curated.
The full 5-billion-pair dataset is split into 3 subsets, each of which can be downloaded separately. They all share the following column structure:
• URL - the image URL
• TEXT - the caption, in English for the en subset, in other languages for multi and nolang
• WIDTH - image width
• HEIGHT - image height
• LANGUAGE - sample language, laion2B-multi only, detected using cld3
• similarity - cosine similarity between the text and image ViT-B/32 embeddings, CLIP for en, mCLIP for multi and nolang
• pwatermark - the probability that the image contains a watermark, computed with the LAION watermark detector
• punsafe - the probability that the image is unsafe, computed with the LAION CLIP-based detector
pwatermark and punsafe are also available as separate collections that must be joined on a hash of url+text.
Details and links to download: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
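A typical workflow is to load a metadata shard into pandas and filter on these columns before downloading any images. A hedged sketch of that step (the thresholds and the tiny inline sample are made up for illustration; real shards are parquet files read with `pd.read_parquet`):

```python
import pandas as pd

# A tiny stand-in for one LAION-5B metadata shard, using the column
# structure described above.
df = pd.DataFrame({
    "URL": ["http://a.example/1.jpg", "http://a.example/2.jpg",
            "http://a.example/3.jpg"],
    "TEXT": ["a red bicycle", "stock photo watermark", "a mountain lake"],
    "WIDTH": [512, 300, 1024],
    "HEIGHT": [512, 200, 768],
    "similarity": [0.34, 0.29, 0.31],
    "pwatermark": [0.02, 0.97, 0.10],
    "punsafe": [0.01, 0.05, 0.02],
})

# Keep reasonably large, likely watermark-free, safe, well-aligned pairs.
clean = df[
    (df["WIDTH"] >= 256) & (df["HEIGHT"] >= 256)
    & (df["pwatermark"] < 0.5)
    & (df["punsafe"] < 0.5)
    & (df["similarity"] >= 0.3)
]
print(len(clean))               # -> 2
print(clean["TEXT"].tolist())   # -> ['a red bicycle', 'a mountain lake']
```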
🔥GATO: the new SOTA from DeepMind
On May 19, 2022, DeepMind published a paper on a new generalist agent that goes beyond text outputs. GATO works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, manipulate blocks, and handle other tasks depending on its context: generate text, output optimal joint torques, and so on.
GATO is trained on a large number of datasets comprising agent experience in both simulated and real-world environments, in addition to many natural-language and image datasets. During training, data from different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that GATO only predicts action and text targets.
When GATO is deployed, a demonstration prompt is tokenized, forming the initial sequence. The environment then emits the first observation, which is also tokenized and appended to the sequence. GATO autoregressively samples the action vector one token at a time. Once all tokens making up the action vector have been sampled (as defined by the environment's action specification), the action is decoded and sent to the environment, which steps and produces a new observation. The procedure then repeats. The model always sees all prior observations and actions within its context window of 1024 tokens.
https://www.deepmind.com/publications/a-generalist-agent
Data analytics - the blog of a leading Data Scientist working at Uber, one of the authors of 🔥 Machine Learning. The channel's material will help you truly grow into a data professional.
1 channel instead of thousands of textbooks and courses, subscribe: 👇👇👇
🚀 @data_analysis_ml
#test
A false signal from a car alarm sensor (with no real threat) is an error of
Anonymous Quiz
41%
type II
47%
type I
4%
depends of statistical significance level
8%
it is not an error
🚀New Python 3.11: up to 60% faster!
Released in April 2022, the alpha of Python 3.11 can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with CPython compiled by GCC, showed Python 3.11 scripts running on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now lays out its frame objects statically, speeding up execution: every time Python calls one of its own functions a new frame is created, and the internal frame structure has been slimmed down to keep only the most essential information, dropping extra memory-management and debugging data.
Also, as of release 3.11, when CPython encounters a Python function that calls another Python function, it sets up a new frame and jumps directly to the new code inside it. This avoids the C-level call that previously accompanied every Python function call (each call to a Python function used to go through a C function that interpreted it). This "inlined" calling convention further speeds up Python scripts.
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0
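To check the speedup on your own workload, a quick micro-benchmark with the stdlib timeit module is enough: run the same script under 3.10 and 3.11 and compare the timings (the recursive function below is just a stand-in; call-heavy code is exactly what the frame and inlining optimizations target):

```python
import timeit

def fib(n):
    """A deliberately call-heavy workload: every call creates a new
    Python frame, which the 3.11 optimizations make cheaper."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Sanity-check the workload, then time many repetitions of it.
assert fib(15) == 610
seconds = timeit.timeit(lambda: fib(15), number=500)
print(f"fib(15) x500 took {seconds:.3f}s")  # compare across versions
```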
💥The best of May's Airflow Summit 2022!
Top talks from data engineers for data engineers: the most interesting sessions, from the intricacies of the batch orchestrator to best practices for deployment and data management.
https://medium.com/apache-airflow/airflow-summit-2022-the-best-of-373bee2527fa
#test
Statistical power does NOT depend on
Anonymous Quiz
21%
the magnitude of the effect of interest in the population
52%
expected value
9%
the sample size used to detect the effect
18%
the statistical significance criterion used in the test
🪢AI for Robotics: NVIDIA's New Version of Isaac Sim
On June 17, 2022, NVIDIA announced a new release of Isaac Sim, its robotics simulation tool and synthetic data generator. The platform accelerates the development, testing and training of AI for robotics. Developers use Isaac Sim to create production-quality datasets for AI model development, simulate robot navigation and control, and generate test scenarios for robotics applications.
Isaac Sim includes tools for programming collaborative robots (cobots), a GPU-accelerated RL training tool, a synthetic data generation toolkit, APIs and workflows, and new capabilities for building synthetic data pipelines.
https://developer.nvidia.com/blog/expedite-the-development-testing-and-training-of-ai-robots-with-isaac-sim/
🗣Special for MRI: Nilearn Library
Nilearn is a lightweight Python library for MRI data analysis. It provides statistical and machine learning tools, detailed documentation, and an open community.
Nilearn now includes the functionality of the Nistats library and extends it with new features. It supports general linear model (GLM) analysis and uses scikit-learn tools for multivariate statistics, with applications such as predictive modeling, classification, decoding, and connectivity analysis. Nilearn also produces excellent visualizations of analysis results as projections onto the human brain.
The library is available under the BSD license and is actively maintained: version 0.1.1 came out in 2015, and release 0.9.1 arrived in June 2022. https://nilearn.github.io/stable/index.html
💥💦🌸TOP-5 Data Science and ML conferences around the world in July 2022:
• Jul 7, Challenging The Status Quo to Avoid AI Failure. Virtual. https://www.cognilytica.com/CLNjc0ODl8MjI
• Jul 12, oneAPI DevSummit for AI. Virtual. https://software.seek.intel.com/oneapi-devsummit-ai-2022
• Jul 13-14, Business of Data Festival. Virtual. https://www.businessofdatafestival.com/
• Jul 13-17, MLDM 2022: 18th Int. Conf. on Machine Learning and Data Mining. New York, NY, USA http://www.mldm.de/mldm2022.php
• Jul 16-21, ICDM 2022: 22th Industrial Conf. on Data Mining. New York, NY, USA. https://www.data-mining-forum.de/icdm2022.php
👌TOP-5 MLOps frameworks
MLOps is a hot topic today: ML systems are getting more complex, and their development and support involve not only computational algorithms and the rest of data science, but also solid software engineering, including the intricacies of deploying to production. At the same time, you have to continuously monitor drift in the input data and how ML models react to it, while carefully versioning every code change. Dedicated MLOps frameworks help establish such a continuous feedback loop; the most important of them are:
• MLflow is an open source platform for managing the end-to-end machine learning lifecycle. https://www.mlflow.org/
• Kubeflow is an open-source machine learning platform designed to enable using machine learning pipelines to orchestrate complicated workflows running on Kubernetes (e.g. doing data processing then using TensorFlow or PyTorch to train a model, and deploying to TensorFlow Serving or Seldon). Kubeflow is built based on Google’s internal method to deploy TensorFlow models called TensorFlow Extended. https://www.kubeflow.org/
• FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. It fully supports asynchronous programming and can run with Uvicorn and Gunicorn. https://fastapi.tiangolo.com/
• ZenML is an open-source MLOps framework built for ML teams. It provides the abstraction layer to bring Machine Learning Models from research to production as easily as possible. Data scientists don’t need to know the details behind the deployment but gain full control and ownership over the whole pipeline process. ZenML standardizes writing ML pipelines across different MLOps stacks, agnostic of cloud providers, third-party vendors, and underlying infrastructure. https://zenml.io/
• Seldon Core is an open-source framework that makes it easier and faster to deploy machine learning models and experiments at scale on Kubernetes. Seldon Core serves models built in any open-source or commercial model building framework. You can make use of powerful Kubernetes features like custom resource definitions to manage model graphs, and then connect your continuous integration and deployment (CI/CD) tools to scale and update your deployment. Seldon handles scaling to thousands of production machine learning models and provides advanced machine learning capabilities out of the box, including advanced metrics, request logging, explainers, outlier detectors, A/B tests, canaries, and more. https://www.seldon.io/solutions/open-source-projects/core
🪢2 libraries for unit testing Python scripts
Unit testing allows developers to ensure that their code base works as intended at an atomic level. The point of unit testing is not to check that the code works as a whole, but that each individual function does what it is supposed to do. For writing Python unit tests you can use Pytest and Chispa:
• Pytest is a framework for writing small, readable tests that can scale to support complex functional testing of applications and libraries. It requires Python 3.7+ or PyPy3. https://docs.pytest.org/en/7.1.x/
• Chispa provides fast PySpark test helper methods that output descriptive error messages. This library makes it easy to write high-quality PySpark code. By the way, chispa means "spark" in Spanish. https://github.com/MrPowers/chispa
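As a quick illustration of the pytest style, here is a hypothetical function under test together with its tests; saved as a file, `pytest` would discover and run the `test_*` functions automatically, and plain `assert` statements give readable failure messages:

```python
# test_slugify.py -- a hypothetical unit under test plus its pytest tests.
import re

def slugify(title: str) -> str:
    """Turn a title into a URL slug: lowercase, hyphen-separated."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def test_slugify_basic():
    assert slugify("Hello, World!") == "hello-world"

def test_slugify_collapses_separators():
    assert slugify("  PySpark --- rocks  ") == "pyspark-rocks"

# The tests are plain functions, so they can also be run directly:
test_slugify_basic()
test_slugify_collapses_separators()
print("ok")
```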
#test
In recurrent neural networks, the output of a neuron depends on
Anonymous Quiz
2%
inputs
21%
inputs and weights
74%
inputs, weights and outputs
2%
weights
💥Python library for sending PySpark scripts to a Spark cluster via the Livy REST API
Spark app developers know that there are two approaches to programmatically submitting jobs to an Apache Spark cluster, and each has limitations when it comes to real-time interaction: spark-submit and spark-shell.
In practice, however, there are times when you need to submit Spark jobs interactively from a web or mobile app: the Apache Spark cluster is hosted on-premises, but many users need to simultaneously consume and run heavy aggregations over data sources from their mobile phones, web or desktop applications. In this case a service approach, Spark-as-a-Service, helps: either exposing JDBC/ODBC data sources through the Spark Thrift Server, or using Apache Livy, a service that lets you easily interact with an Apache Spark cluster via a REST API.
For Livy, the Python package livyc works well for sending PySpark scripts dynamically and asynchronously to the Apache Livy server, interacting transparently with the Apache Spark cluster. https://github.com/Wittline/livyc
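Under the hood, submitting code through Livy boils down to plain REST calls, which livyc wraps for you. A hedged sketch of the interactive-session workflow (the server URL is hypothetical; the payload shapes follow Livy's documented `/sessions` and `/statements` endpoints):

```python
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical endpoint

def make_session_payload(kind="pyspark", conf=None):
    """Body for POST /sessions -- opens an interactive Livy session."""
    return {"kind": kind, "conf": conf or {}}

def make_statement_payload(code):
    """Body for POST /sessions/{id}/statements -- runs a code snippet."""
    return {"code": code}

session = make_session_payload(conf={"spark.executor.cores": "2"})
stmt = make_statement_payload("df = spark.range(10); df.count()")

# With the `requests` package installed, the actual calls would be:
#   r = requests.post(f"{LIVY_URL}/sessions", data=json.dumps(session),
#                     headers={"Content-Type": "application/json"})
#   sid = r.json()["id"]
#   requests.post(f"{LIVY_URL}/sessions/{sid}/statements",
#                 data=json.dumps(stmt),
#                 headers={"Content-Type": "application/json"})
print(json.dumps(stmt))  # -> {"code": "df = spark.range(10); df.count()"}
```

Statements run asynchronously: you then poll `GET /sessions/{id}/statements/{n}` until the state is `available` and read the result, which is the interactivity that spark-submit lacks.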
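Under the hood, a client like livyc wraps Livy's documented REST endpoints (POST /sessions, POST /sessions/{id}/statements). A minimal standard-library sketch of that interaction, assuming a Livy server on its default port 8998; the helper names and the sample snippet are illustrative, not livyc's API:

```python
import json
import time
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumption: a local Livy server on its default port


def encode_payload(payload):
    """JSON-encode a request body, or return None for body-less requests."""
    return json.dumps(payload).encode() if payload is not None else None


def call(url, payload=None, method="GET"):
    """Send one JSON request to Livy and decode the JSON response."""
    req = urllib.request.Request(url, data=encode_payload(payload), method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_state(url, target):
    """Poll a session or statement until it reaches the target state."""
    while (info := call(url))["state"] != target:
        time.sleep(1)
    return info


def run_pyspark(code):
    """Create a PySpark session, run one snippet, return its output, clean up."""
    session = call(f"{LIVY_URL}/sessions", {"kind": "pyspark"}, "POST")
    session_url = f"{LIVY_URL}/sessions/{session['id']}"
    wait_for_state(session_url, "idle")          # session is ready for code
    stmt = call(f"{session_url}/statements", {"code": code}, "POST")
    output = wait_for_state(f"{session_url}/statements/{stmt['id']}", "available")["output"]
    call(session_url, method="DELETE")           # tear the session down
    return output


if __name__ == "__main__":
    print(run_pyspark("spark.range(100).selectExpr('sum(id)').show()"))
```

Because each statement is just an HTTP call, this is what makes Livy suitable for web and mobile front ends: the client never needs Spark binaries, only network access to the Livy server.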
🗣7 speech recognition tools
To develop your own ML speech recognition system, you can use the following frameworks and libraries:
• wav2letter - an open-source toolkit from Facebook AI Research, now merged into the larger Flashlight library https://github.com/flashlight/wav2letter
• DeepSpeech - Mozilla's open-source engine based on Baidu's Deep Speech research, which helps you decode audio files using pre-trained models or train on a custom dataset https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
• TensorFlowASR - an open-source TensorFlow package implementing several reference models trained with RNNs and CTC loss https://github.com/TensorSpeech/TensorFlowASR
• OpenSeq2Seq - an NVIDIA research project on sequence-to-sequence problems, including streaming speech recognition https://github.com/NVIDIA/OpenSeq2Seq/blob/master/Streaming-ASR.ipynb
• SpeechRecognition - a project that provides access to several automatic speech recognition engines, including wrappers for the speech APIs of Google, Microsoft Azure and IBM https://github.com/Uberi/speech_recognition
We also note two ready-made services that provide APIs for everything from speech recognition to generating natural-sounding speech:
• SmartSpeech by SberDevices https://sberdevices.ru/smartspeech/
• Yandex SpeechKit by Yandex https://cloud.yandex.ru/services/speechkit
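Of the libraries above, SpeechRecognition is the quickest to try: it exposes a uniform Recognizer interface over all its engines. A small sketch; the file name audio.wav is a placeholder, and recognize_google calls Google's free web API, so a network connection is required:

```python
def transcribe(path, language="en-US"):
    """Transcribe a WAV/AIFF/FLAC file with the SpeechRecognition package."""
    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)   # read the whole file into memory
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return "(speech was unintelligible)"
    except sr.RequestError as err:
        return f"(API request failed: {err})"


if __name__ == "__main__":
    print(transcribe("audio.wav"))  # "audio.wav" is a placeholder path
```

Swapping engines is a one-line change: the Recognizer also offers methods such as recognize_sphinx for fully offline recognition via CMU Sphinx.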
#test
ACID requirements for transactions are fully implemented in
Anonymous Quiz
60%
relational databases
7%
NoSQL databases
16%
any databases
16%
OLTP databases
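The leading answer reflects the textbook view: classic relational DBMSs implement ACID in full. Atomicity, for instance, is easy to demonstrate with Python's built-in sqlite3 module, using a toy transfer between two hypothetical accounts:

```python
# Atomicity in a relational database: either both updates commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        (bal,) = conn.execute("SELECT balance FROM accounts "
                              "WHERE name = 'alice'").fetchone()
        if bal < 0:                       # application-level integrity check
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass

# The failed transfer was rolled back in full: both balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Many NoSQL stores relax one or more of these guarantees (often giving only per-document atomicity or eventual consistency) in exchange for horizontal scalability.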
🙌🏼7 platforms for federated ML
Federated learning is also called collaborative learning because ML models are trained on multiple decentralized edge devices or servers holding local data samples, without exchanging those samples. This differs both from traditional centralized ML, where all local datasets are uploaded to a single server, and from classical decentralized approaches, which assume the local data are identically distributed. Today federated learning is actively used in the defense industry, telecommunications, pharmaceuticals and IoT platforms.
The ideas of federated machine learning were first introduced by Google in 2017 to improve mobile-keyboard text prediction using models trained on data from many devices. In federated ML, models are trained on multiple local datasets on local nodes without explicit data exchange; instead, parameters such as deep-neural-network weights and biases are periodically exchanged between local nodes to build a shared global model. Unlike distributed learning, which originally aimed to parallelize computation, federated learning targets training on heterogeneous datasets, which usually vary greatly in size. Moreover, the clients, i.e. the end devices where local models are trained, can be unreliable and more failure-prone than the nodes of distributed-learning systems, which are data centers with powerful computing capabilities. To coordinate the distributed computation and synchronize its results, federated ML therefore requires frequent parameter exchange between nodes.
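The parameter-averaging idea can be shown in a few lines of NumPy. The sketch below is a toy illustration of FedAvg-style aggregation on synthetic linear-regression data, not any particular platform's API: each simulated client runs a few gradient steps on its private data, and the server only ever sees the resulting weights.

```python
# Toy federated averaging: raw data stays on the clients; only weights travel.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])

# Three clients with private datasets of different sizes.
clients = []
for n in (50, 200, 80):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))


def local_update(w, X, y, lr=0.1, epochs=5):
    """A few gradient-descent steps on one client's local data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w


w_global = np.zeros(2)
for _ in range(20):                                   # communication rounds
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    # Server-side aggregation: average weighted by each client's sample count.
    w_global = np.average(local_ws, axis=0, weights=sizes)

print(w_global)  # approaches the true weights [2, -3]
```

Real platforms add what this toy omits: secure aggregation, dropped-client handling, and compression of the exchanged updates.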
Due to its architecture, federated ML has several drawbacks:
• heterogeneity between local datasets: each node's sample is biased relative to the overall population, and sample sizes can vary significantly;
• temporal heterogeneity: the distribution of each local dataset changes over time;
• dataset compatibility must be ensured across all nodes;
• keeping training datasets hidden carries the risk of introducing vulnerabilities into the global model;
• lack of access to the global training data makes it difficult to identify unwanted biases in the training inputs;
• updates to local ML models can be lost due to failures at individual nodes, which may affect the global model.
Today, federated ML is supported by the following platforms:
• FATE (Federated AI Technology Enabler) https://fate.fedai.org/
• Substra https://www.substra.ai/
• Python libraries PySyft and PyGrid https://github.com/OpenMined/PySyft, https://github.com/OpenMined/PyGrid, https://github.com/OpenMined/pygrid-admin
• OpenFL https://github.com/intel/openfl
• TensorFlow Federated (TFF) https://www.tensorflow.org/federated
• IBM Federated Learning https://ibmfl.mybluemix.net/
• NVIDIA CLARA https://developer.nvidia.com/clara