Big Data Science
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
💫AI + quantum computing = quantum memristor
A memristor, or memory resistor, is a kind of building block for an electronic circuit; the first working device was created about a decade ago. It is essentially a switch that remembers its state (on or off) after the power is removed, similar to synapses, the connections between neurons in the human brain whose electrical conductance strengthens or weakens depending on how much charge has passed through them in the past.
In theory, memristors act as artificial neurons capable of both computing and storing data. Neuromorphic (brain-like) computers built on memristors therefore work well with artificial neural networks, i.e. with ML models.
Unlike classical computers, which switch transistors on or off to represent data as 1s and 0s, quantum computers use qubits. Qubits can exist in a superposition that combines 1 and 0. The more qubits a quantum computer links together, the more its computing power grows, potentially exponentially.
The quantum memristor is based on a flow of single photons in superposition, where each photon can travel along waveguide paths written into glass by a laser. One arm of this integrated photonic circuit is used to measure the photon flux, and that data, through a classical electronic feedback circuit, controls the transmission along the other path. The resulting device behaves like a memristor.
Normally, memristive behavior and quantum effects do not mix: memristivity arises from measuring the device's internal state, while quantum effects are extremely fragile in the face of external interference such as measurement. The researchers resolved this contradiction by engineering the interactions within their device to be strong enough to produce memristivity, yet weak enough to preserve quantum behavior.
The advantage of using a quantum memristor in quantum ML over conventional quantum circuits is that the memristor, unlike other quantum components, has memory. The next step is to connect several memristors together, increase the number of photons in each memristor, and scale up the number of measurements.
https://spectrum.ieee.org/quantum-memristor
🌞Development in Python according to the 12-factor (SaaS) principles with the Python-dotenv library
ML modelers and data analysts don't always write code like professional programmers. To improve code quality, you can use the twelve-factor app, a simple methodology for developing web applications delivered as SaaS. It recommends that applications:
• use declarative formats for setup automation, to minimize the time and cost for new developers joining the project;
• have a clean contract with the underlying operating system, offering maximum portability between execution environments;
• are suitable for deployment on modern cloud platforms, obviating the need for server and systems administration;
• minimize divergence between development and production, enabling continuous deployment for maximum agility;
• can scale up without significant changes to tooling, architecture, or development practices.

To implement these SaaS ideas, it is proposed to build applications around 12 factors:
1. One codebase tracked in version control, many deploys
2. Explicitly declare and isolate dependencies
3. Store config in the environment
4. Treat backing services as attached resources
5. Strictly separate build and run stages
6. Execute the app as one or more stateless processes
7. Export services via port binding
8. Scale out via the process model (concurrency)
9. Maximize robustness with fast startup and graceful shutdown
10. Keep development, staging, and production environments as similar as possible
11. Treat logs as event streams
12. Run admin/management tasks as one-off processes

The open-source Python-dotenv library helps implement all this for a Python program. It reads key-value pairs from a .env file and sets them as environment variables. When an application takes its configuration from the environment, running it during development can be inconvenient, because the developer has to set all those environment variables manually. By adding Python-dotenv to your application, you simplify the development process: the library loads the settings from the .env file, while the application remains configurable through the environment.
You can also load the configuration without touching the environment, parse the configuration as a stream, and load .env files in IPython. The tool also has a CLI that lets you manipulate the .env file without opening it manually.
https://github.com/theskumar/python-dotenv
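As an illustration, here is a minimal usage sketch (not from the post); the .env contents and variable names are made up for the example, and the application keeps reading its config from the environment, as factor 3 prescribes.

# .env (example contents):
# DATABASE_URL=postgresql://localhost/dev_db
# API_KEY=changeme

import os
from dotenv import load_dotenv, dotenv_values

load_dotenv()                        # read .env and export its key-value pairs as environment variables
db_url = os.getenv("DATABASE_URL")   # the app still takes its config from the environment

config = dotenv_values(".env")       # alternative: parse .env into a dict without touching os.environ

# The CLI can edit the file without opening it, e.g.: dotenv set API_KEY newvalue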
🌞TOP-10 Data Science and ML conferences all over the World in June 2022:
1. Jun 13, Machine Learning Methods in Visualisation for Big Data 2022. Rome, Italy. https://events.tuni.fi/mlvis/mlvis2022/
2. Jun 14-15, Chief Data & Analytics Officers, Spring. San Francisco, CA, USA. https://cdao-spring.coriniumintelligence.com/
3. Jun 15-16, The AI Summit London. London, UK. https://london.theaisummit.com/
4. Jun 20-22, CLA2022: The 16th International Conference on Concept Lattices and Their Applications. Tallinn, Estonia. https://cs.ttu.ee/events/cla2022/
5. Jun 19-24, Machine Learning Week, Predictive Analytics World conferences. Las Vegas, NV, USA. https://www.predictiveanalyticsworld.com/machinelearningweek/
6. Jun 20-21, Deep Learning World Europe. Munich, Germany. https://deeplearningworld.de/
7. Jun 21, Data Engineering Show On The Road. London, UK. https://hi.firebolt.io/lp/the-data-engineering-show-on-the-road-london
8. Jun 22, Data Stack Summit 2022. Virtual. https://datastacksummit.com/
9. Jun 28-29, Future.AI. Virtual. https://events.altair.com/future-ai/
10. Jun 29, Designing Flexibility to Address Uncertainty in the Supply Chain with AI. Chicago, IL, USA. https://www.luc.edu/leadershiphub/centers/aibusinessconsortium/upcomingevents/archive/designing-flexible-supply-chains-with-ai.shtml
🔥LAION-5B: open dataset of 5+ billion text-image pairs for multi-modal ML
On May 31, 2022, the non-profit organization of AI researchers LAION presented the largest open dataset of image-text pairs: 5.85 billion pairs filtered using CLIP. LAION-5B is 14 times larger than its predecessor, LAION-400M, which was previously the world's largest openly available image-text dataset.
2.3 billion pairs are in English, while the rest of the dataset contains samples in over 100 other languages. The dataset also includes several nearest-neighbor indices, an improved web interface for exploration and subsetting, and watermark-detection and NSFW scores. The dataset is recommended for research purposes and is not curated.
The full 5.85-billion-pair dataset is divided into 3 subsets, each of which can be downloaded separately. They all share the following column structure:
• URL – the image URL
• TEXT – the caption, in English for en, in other languages for multi and nolang
• WIDTH – image width
• HEIGHT – image height
• LANGUAGE – sample language, laion2B-multi only, detected with cld3
• SIMILARITY – cosine similarity between the text and image ViT-B/32 embeddings, computed with CLIP for en and mCLIP for multi and nolang
• PWATERMARK – the probability that the image is watermarked, estimated with the LAION watermark detector
• PUNSAFE – the probability that the image is unsafe, estimated with the LAION CLIP-based detector
PWATERMARK and PUNSAFE are also available as separate collections that must be joined on a url+text hash.
Details and links to download: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
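As a rough illustration (not from the LAION post), one might filter a downloaded metadata shard with pandas, keeping only pairs below chosen pwatermark/punsafe thresholds; the file name and thresholds are made up, and column casing may differ in the actual shards.

import pandas as pd

# Hypothetical local metadata shard downloaded from the links above.
df = pd.read_parquet("laion2B-en-metadata-part-00000.parquet")

# Keep likely safe, unwatermarked, reasonably large images.
subset = df[(df["punsafe"] < 0.1) & (df["pwatermark"] < 0.5) &
            (df["WIDTH"] >= 256) & (df["HEIGHT"] >= 256)]
print(subset[["URL", "TEXT"]].head())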
🔥GATO: the new SOTA from DeepMind
On May 19, 2022, DeepMind published a paper about GATO, a new generalist agent that goes beyond text outputs. GATO operates as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks, and handle other tasks, deciding based on its context whether to output text, joint torques, button presses, etc.
GATO is trained on a large number of datasets containing agent experience in both simulated and real-world environments, in addition to many natural language and image datasets. During training, data from the different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that GATO only predicts action and text targets.
When GATO is deployed, a demonstration prompt is tokenized, forming the initial sequence. The environment then emits the first observation, which is also tokenized and appended to the sequence. GATO autoregressively samples the action vector one token at a time. Once all the tokens making up the action vector have been sampled (as defined by the environment's action specification), the action is decoded and sent to the environment, which steps forward and produces a new observation. The procedure then repeats. The model always sees all previous observations and actions within its context window of 1024 tokens.
https://www.deepmind.com/publications/a-generalist-agent
Data analytics - the blog of a leading Data Scientist working at Uber, one of the authors of 🔥 Machine Learning. The channel's material will help you really grow into a data professional.

1 channel instead of thousands of textbooks and courses, subscribe: 👇👇👇

🚀 @data_analysis_ml
#test
A false signal from a car alarm sensor (with no real threat) is an error of
Anonymous Quiz
41%
type II
47%
type I
4%
depends on the statistical significance level
8%
it is not an error
🚀New Python: up to 60% faster!
Released in April 2022, the alpha of Python 3.11 can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with the interpreter compiled using GCC, showed that Python 3.11 scripts run on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now statically allocates its core code objects, speeding up the start of execution. In addition, every time Python calls one of its own functions, a new frame is created, and the internal structure of that frame has been streamlined so that it keeps only the most essential information, dropping extra memory-management and debugging data.
Also, as of release 3.11, when CPython encounters a Python function that calls another Python function, it sets up a new frame and jumps straight to the new code contained within it. This avoids calling the C function responsible for interpreting the call (previously, each call to a Python function went through a C function that interpreted it). This change further accelerates the execution of Python scripts.
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0
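A quick way to see the effect yourself is to time the same pure-Python workload under both interpreters; this sketch is illustrative, not from the article, and the function and numbers are arbitrary.

import timeit

def fib(n):
    # Deep pure-Python recursion is exactly the kind of code that benefits from cheaper frames.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(timeit.timeit("fib(25)", globals=globals(), number=100))
# Save as bench.py and compare: python3.10 bench.py  vs  python3.11 bench.py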
Forwarded from Big Data Science [RU]
New Python: now much faster!
💥Best of May Airflow Summit 2022!
Top talks from data engineers for data engineers: the most interesting sessions, from the internals of the batch orchestrator to best practices for deployment and data management.
https://medium.com/apache-airflow/airflow-summit-2022-the-best-of-373bee2527fa
🪢AI for Robotics: NVIDIA's New Version of Isaac Sim
On June 17, 2022, NVIDIA announced the release of a new version of Isaac Sim, its robotics simulation tool and synthetic data generator. The platform accelerates the development, testing, and training of AI for robotics. Developers use Isaac Sim to create production-quality datasets for AI model development, simulate robot navigation and control, and generate scenarios for testing robotics applications.
Isaac Sim includes tools for building collaborative robots (cobots), a GPU-accelerated reinforcement learning toolkit, a set of tools for creating synthetic data, APIs and workflows, and new capabilities for building synthetic data products.
https://developer.nvidia.com/blog/expedite-the-development-testing-and-training-of-ai-robots-with-isaac-sim/
🗣Special for MRI: Nilearn Library
Nilearn is a lightweight Python library for analyzing MRI data. It provides statistical and machine learning tools, detailed documentation, and an open community.
Nilearn now includes the functionality of the Nistats library and extends it with new features. It supports general linear model (GLM) analysis and uses scikit-learn tools for multivariate statistics, with applications such as predictive modeling, classification, decoding, and connectivity analysis. Nilearn also visualizes analysis results as projections onto the human brain.
The library is available under the BSD license and is actively maintained: version 0.1.1 was released in 2015, and as of June 2022 the current release is 0.9.1. https://nilearn.github.io/stable/index.html
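For a flavor of the API, here is a minimal sketch (not from the post) that fetches a public group contrast map via one of Nilearn's dataset helpers and renders it as a glass-brain projection; the threshold value is arbitrary.

from nilearn import datasets, plotting

# Download a motor-task contrast map from NeuroVault via Nilearn's sample fetcher.
motor_images = datasets.fetch_neurovault_motor_task()
stat_img = motor_images.images[0]

# Project the statistical map onto a glass brain, hiding weak activations.
plotting.plot_glass_brain(stat_img, threshold=3.0, colorbar=True)
plotting.show()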
💥💦🌸TOP-5 Data Science and ML conferences all over the World in July 2022:
• Jul 7, Challenging The Status Quo to Avoid AI Failure. Virtual. https://www.cognilytica.com/CLNjc0ODl8MjI
• Jul 12, oneAPI DevSummit for AI. Virtual. https://software.seek.intel.com/oneapi-devsummit-ai-2022
• Jul 13-14, Business of Data Festival. Virtual. https://www.businessofdatafestival.com/
• Jul 13-17, MLDM 2022: 18th Int. Conf. on Machine Learning and Data Mining. New York, NY, USA http://www.mldm.de/mldm2022.php
• Jul 16-21, ICDM 2022: 22nd Industrial Conf. on Data Mining. New York, NY, USA. https://www.data-mining-forum.de/icdm2022.php
👌TOP-5 MLOps frameworks
MLOps is a hot topic today: ML systems are getting more complex, and developing and supporting them involves not only computational algorithms and the rest of data science, but also software engineering best practices and the intricacies of deploying to production. At the same time, you constantly have to monitor drift in the input data and how ML models react to it, and promptly fix and version the code. MLOps frameworks help establish such a continuous feedback loop; the most notable of them are listed below (an MLflow usage sketch follows after the list):
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. https://www.mlflow.org/
Kubeflow is an open-source machine learning platform designed to enable using machine learning pipelines to orchestrate complicated workflows running on Kubernetes (e.g. doing data processing then using TensorFlow or PyTorch to train a model, and deploying to TensorFlow Serving or Seldon). Kubeflow is built based on Google’s internal method to deploy TensorFlow models called TensorFlow Extended. https://www.kubeflow.org/
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. It fully supports asynchronous programming and can run with Uvicorn and Gunicorn. https://fastapi.tiangolo.com/
ZenML is an open-source MLOps framework built for ML teams. It provides the abstraction layer to bring Machine Learning Models from research to production as easily as possible. Data scientists don’t need to know the details behind the deployment but gain full control and ownership over the whole pipeline process. ZenML standardizes writing ML pipelines across different MLOps stacks, agnostic of cloud providers, third-party vendors, and underlying infrastructure. https://zenml.io/
Seldon Core is an open-source framework that makes it easier and faster to deploy machine learning models and experiments at scale on Kubernetes. Seldon Core serves models built in any open-source or commercial model-building framework. You can make use of powerful Kubernetes features like custom resource definitions to manage model graphs, and then connect your continuous integration and deployment (CI/CD) tools to scale and update your deployment. Seldon handles scaling to thousands of production machine learning models and provides advanced capabilities out of the box, including advanced metrics, request logging, explainers, outlier detectors, A/B tests, canaries, and more. https://www.seldon.io/solutions/open-source-projects/core
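To give a flavor of how such frameworks are used, here is a minimal experiment-tracking sketch with MLflow (not from the post); the model, parameters, and metric are illustrative.

import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 3}
    mlflow.log_params(params)                      # record hyperparameters for the run
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)             # record the evaluation metric
    mlflow.sklearn.log_model(model, "model")       # store the trained model as a run artifact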
🪢2 libraries for unit testing Python scripts
Unit testing lets developers make sure their codebase works as intended at an atomic level. The point of unit testing is not that the code works as a whole, but that each individual function does what it is supposed to do. For writing Python unit tests you can use Pytest and Chispa (a combined example follows the list):
• Pytest is a framework for writing small, readable tests that can scale to support complex functional testing of applications and libraries. It requires Python 3.7+ or PyPy3. https://docs.pytest.org/en/7.1.x/
• Chispa provides fast PySpark test helper methods that output descriptive error messages. This library makes it easy to write high-quality PySpark code. By the way, chispa means "spark" in Spanish. https://github.com/MrPowers/chispa
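A minimal combined sketch (not from the post): a pytest test for a made-up PySpark transformation, checked with chispa's DataFrame equality assertion.

import pytest
from pyspark.sql import SparkSession, functions as F
from chispa import assert_df_equality

@pytest.fixture(scope="session")
def spark():
    # Local single-threaded Spark session shared by all tests.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def upper_names(df):
    # Hypothetical function under test: upper-cases the "name" column.
    return df.withColumn("name", F.upper("name"))

def test_upper_names(spark):
    source = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    expected = spark.createDataFrame([("ALICE",), ("BOB",)], ["name"])
    assert_df_equality(upper_names(source), expected)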
#test
In recurrent neural networks, the output of a neuron depends on
Anonymous Quiz
2%
inputs
21%
inputs and weights
74%
inputs, weights and outputs
2%
weights
💥Python library for sending PySpark scripts to a Spark cluster via the Livy REST API
Spark application developers know that there are two approaches to programmatically submitting jobs to an Apache Spark cluster, spark-submit and spark-shell, and each has limitations when it comes to real-time interaction.
However, in practice there are times when you need to submit Spark jobs interactively from a web or mobile app. The Apache Spark cluster may be hosted on local infrastructure, while many users need to simultaneously consume data and perform heavy aggregations over data sources from their mobile phones, web, or desktop applications. In this case, a Spark-as-a-Service approach helps: either exposing JDBC/ODBC data sources through the Spark Thrift server, or using Apache Livy, a service that lets you easily interact with an Apache Spark cluster via a REST API.
For Livy, the Python package livyc works well: it sends PySpark scripts dynamically and asynchronously to the Apache Livy server, interacting transparently with the Apache Spark cluster. https://github.com/Wittline/livyc
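For orientation, here is a minimal sketch of talking to Livy's REST API directly with requests (livyc wraps this kind of interaction); the endpoint URL and the submitted snippet are made up.

import json
import time
import requests

LIVY = "http://localhost:8998"                   # assumed Livy endpoint
headers = {"Content-Type": "application/json"}

# 1. Open an interactive PySpark session and wait until it is idle.
session = requests.post(f"{LIVY}/sessions", headers=headers,
                        data=json.dumps({"kind": "pyspark"})).json()
while requests.get(f"{LIVY}/sessions/{session['id']}", headers=headers).json()["state"] != "idle":
    time.sleep(1)

# 2. Submit a PySpark statement to the session.
stmt = requests.post(f"{LIVY}/sessions/{session['id']}/statements", headers=headers,
                     data=json.dumps({"code": "print(spark.range(100).count())"})).json()

# 3. Poll until the statement finishes and read its output.
while True:
    result = requests.get(f"{LIVY}/sessions/{session['id']}/statements/{stmt['id']}",
                          headers=headers).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)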
🗣7 speech recognition tools
To develop your own ML speech recognition system, you can use the following frameworks and libraries:
• wav2letter - an open-source speech recognition toolkit from Facebook AI Research, now merged into a larger library called Flashlight https://github.com/flashlight/wav2letter
• DeepSpeech - an engine based on Baidu's Deep Speech research, which helps you transcribe an audio file using pre-trained models or train your own on a custom dataset https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
• TensorFlowASR - an open-source TensorFlow-based package that implements several reference models trained using RNNs with CTC https://github.com/TensorSpeech/TensorFlowASR
• OpenSeq2Seq - an NVIDIA research project on sequence-to-sequence problems https://github.com/NVIDIA/OpenSeq2Seq/blob/master/Streaming-ASR.ipynb
• SpeechRecognition - a project providing access to several automatic speech recognition engines, including speech API wrappers from Google, Microsoft Azure, and IBM (a usage sketch follows at the end of this post) https://github.com/Uberi/speech_recognition

We also note 2 ready-made services that provide APIs for everything from speech recognition to generating natural-sounding voice:
• SmartSpeech by SberDevices https://sberdevices.ru/smartspeech/
• Yandex SpeechKit by Yandex https://cloud.yandex.ru/services/speechkit
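As promised above, a minimal sketch with the SpeechRecognition library (not from the post); the WAV file name is made up, and the Google recognizer requires internet access.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:       # hypothetical local recording
    audio = recognizer.record(source)            # read the whole file into an AudioData object

# Send the audio to Google's free web speech API and print the transcript.
print(recognizer.recognize_google(audio))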
#test
ACID requirements for transactions are fully implemented in
Anonymous Quiz
60%
relational databases
7%
NoSQL databases
16%
any databases
16%
OLTP-databases