YDB: scalable fault-tolerant NewSQL DBMS from Yandex. Now open source
April 19, 2022. Yandex has published the source code of the distributed NewSQL DBMS YDB, which allows you to create scalable, fault-tolerant services that can support a large operational load. The code is available under the Apache 2.0 license.
YDB is an open-source Distributed SQL Database that combines high availability and scalability with strict consistency and ACID transactions. YDB survives disk, server, rack, and even whole datacenter failures and recovers from them automatically. The reliability of YDB has been tested on Yandex services (Alisa, Taxi, Market, Metrika and almost 500 more projects). You can deploy YDB both on your own hardware and on external infrastructure, including Yandex Cloud or other providers.
https://ydb.tech/
https://github.com/ydb-platform/ydb
🗒Loguru for logging in Python scripts
This library is useful for ML specialists and data engineers who often write in Python. It streamlines logging and debugging and includes a number of convenient features that the standard logging module lacks, while staying compatible with it.
Loguru works out of the box and offers features such as log file rotation, compression, and scheduled retention cleanup. It is also thread-safe and supports colorized output. This open-source library can be combined with notification services to receive emails or other types of messages when something goes wrong.
Finally, Loguru is fully compatible with the standard Python logging module, supports lazy evaluation of expensive log messages, and lets you configure the default logger to your needs.
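A minimal sketch of typical usage (file name and rotation settings are illustrative):

```python
from loguru import logger

# One call configures a file sink with rotation, retention and compression.
logger.add("app.log", rotation="10 MB", retention="10 days", compression="zip")

logger.info("Processing started")
logger.debug("Loaded {} rows", 42)  # lazy, brace-style formatting

@logger.catch  # logs any uncaught exception with a full traceback
def risky_division(x):
    return 1 / x

risky_division(0)  # error is logged instead of crashing the script
```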
Source code: https://github.com/Delgan/loguru
Use case example: https://medium.com/geekculture/python-loguru-a-powerful-logging-module-5f4208f4f78c
#test
What is the main difference between MapReduce operations in Spark and Hadoop?
Anonymous Quiz
41%
Spark is faster
8%
Hadoop is faster
27%
these are the same
24%
the different dataset's scale
🔥TOP 5 features of the new Python 3.11 Alpha 5
At the beginning of 2022, the fifth alpha of the new Python version (3.11) was released. Main features:
• Improved debugging with fine-grained error locations in tracebacks. Python 3.11 annotates tracebacks with markers pointing directly at the expression where the error occurred. Earlier versions offered a more limited form of this that required adding context to the code manually, which complicated things; now the context is provided automatically.
• Exception groups - you can now raise and handle several unrelated exceptions at the same time. Instead of writing one big try/except block with every possible exception name and adding ever more except clauses to it, an exception group bundles many different exceptions together, and a single except* handler is invoked for each matching sub-exception in the group (see the sketch after this list).
• Variadic generics - type hints can now describe a variable number of type parameters. Previously a generic had to fix the number of its type variables up front, forcing a separate signature for each arity. Variadic generics in Python 3.11 let a single generic accept any number of type parameters at once, which is useful for typing things like array shapes across multiple operations.
• CPython performance optimizations. Changes to function calls and dictionary lookups should reduce overhead and avoid round-trips through the C stack, speeding up everything from object-oriented code to dictionary access.
• Easier hosting of other languages, such as JavaScript, on top of Python, thanks to the higher performance and parallel-computing improvements.
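A minimal sketch of the exception-group handling mentioned above (requires Python 3.11+; the URLs and error type are illustrative):

```python
def download_all():
    errors = []
    for url in ["https://a.example", "https://b.example"]:
        try:
            raise ConnectionError(url)  # stand-in for a real request
        except ConnectionError as exc:
            errors.append(exc)
    if errors:
        # Bundle all failures into one exception group (PEP 654).
        raise ExceptionGroup("downloads failed", errors)

try:
    download_all()
except* ConnectionError as group:
    # except* receives a group holding every matching sub-exception.
    print(f"{len(group.exceptions)} downloads failed")
```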
https://morioh.com/p/af7debd024e2
https://medium.com/@Sabrina-Carpenter/python-alpha-5-is-here-5-promising-features-that-will-blow-your-mind-a4abd406d0ad
💫Graph visualization with PyGraphistry
PyGraphistry is a Python library for visual graph analytics that lets you extract, transform, analyze and visualize large graphs together with end-to-end Graphistry server sessions. Graphistry is built specifically for large graphs: its custom WebGL rendering client displays up to 8 million nodes and edges at a time, and most client GPUs smoothly handle between 100,000 and 2 million elements. The server-side GPU analytics engine supports even larger graphs. Graphistry smooths graph workflows across the PyData ecosystem, including Pandas/Spark/Dask dataframes, Nvidia RAPIDS GPU dataframes, DGL/PyTorch graph neural networks, and various data connectors.
PyGraphistry is a streamlined, optimized, PyData-native interface to the language-independent Graphistry REST APIs. You can feed PyGraphistry from Python data sources such as CSV, SQL, Neo4j, Splunk and more.
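A minimal sketch of typical usage, assuming a Graphistry Hub account; the credentials and edge data are placeholders:

```python
import pandas as pd
import graphistry

# Hypothetical account; register once per session.
graphistry.register(api=3, username="my_user", password="my_pass")

edges = pd.DataFrame({
    "src": ["alice", "bob", "carol"],
    "dst": ["bob", "carol", "alice"],
})

# Bind the dataframe's columns as graph edges and open the visual session.
graphistry.edges(edges, "src", "dst").plot()
```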
The PyGraphistry Python client serves different categories of users:
• Data explorer: go from raw data to accelerated visual analysis in a couple of lines, share results as live sessions, and work in Jupyter Notebook and Google Colab.
• Developer: Quickly prototype amazing Python solutions with PyGraphistry, embed in a language-independent way with the REST API, customize colors, icons, templates, JavaScript, and more.
• Analyst: build visual dashboards using interactive search, filters, timelines, bar charts, and more, embedding them in any framework.
https://github.com/graphistry/pygraphistry
#test
The key difference between window and aggregate functions is:
Anonymous Quiz
5%
Aggregate functions operate on a set of values to return a range of values
3%
They are the same, but the window functions are more difficult to write
58%
Window functions operate on a set of values to return a range of values
34%
Aggregate functions return single value for each row from the underlying query
🗣Estimating the information content of data: a new method from MIT
Information and data are not the same thing, and not all data is valuable. How much information can be extracted from a piece of data? This question was first posed in the 1948 paper "A Mathematical Theory of Communication" by MIT Professor Emeritus Claude Shannon. One breakthrough result is Shannon's notion of entropy, which lets us estimate the amount of information inherent in any random object, including the random variables that model all kinds of data. Shannon's results laid the foundation for information theory and modern telecommunications. The concept of entropy has also found its way into computer science and machine learning.
Applying Shannon's formula directly can quickly become computationally intractable: it requires an accurate probabilistic model of the data and a sum over all possible realizations of the data within that model. The problem becomes acute when, for example, a positive result in a diagnostic survey depends on hundreds of interacting unknown variables. With only 10 binary unknowns the data already has about 1,000 (2^10 = 1,024) possible realizations; with many hundreds there are more realizations than atoms in the universe, which makes exact entropy calculation utterly intractable.
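For intuition, here is the direct computation that blows up, as a small sketch (illustrative only; this is the naive summation, not the EEVI method from the paper):

```python
import math

def shannon_entropy(pmf):
    # H(X) = -sum p(x) * log2 p(x), over outcomes with p(x) > 0
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit

# A joint distribution over n binary unknowns has 2**n outcomes,
# so the sum grows exponentially -- the blow-up described above.
n = 10
uniform_joint = [1 / 2**n] * (2**n)  # 1,024 outcomes
print(shannon_entropy(uniform_joint))  # 10.0 bits
```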
MIT researchers have developed a new method for estimating approximations to many information-theoretic quantities, such as Shannon entropy, using probabilistic inference. The work is presented in an AISTATS 2022 conference paper. The key insight is, instead of enumerating all possible explanations of the data, to use probabilistic inference to first infer which explanations are probable, and then use those probable explanations to build high-quality entropy estimates. The authors show that this inference-based approach can be much faster and more accurate than previous approaches.
Estimating entropy and information in a probabilistic model is fundamentally hard, since it often requires solving a high-dimensional integration problem. Past work derived estimates for some special cases, but the new Estimators of Entropy Via Inference (EEVI) are the first approach that can give sharp upper and lower bounds on a wide range of information-theoretic quantities: one number guaranteed below the true value and one above it. The gap between the bounds indicates how confident to be in the estimate, and by spending more computational resources the bounds can be tightened, "squeezing" the true value into an arbitrarily narrow range. The method also lets you quantify how informative different variables in a model are about each other, which makes it particularly useful for probabilistic models in applications such as medical diagnostics.
https://news.mit.edu/2022/estimating-informativeness-data-0425
🙌🏼MEGAscaling with quantum ML
Theoretically, quantum computers could be more powerful than any conventional computer, especially at finding the prime factors of numbers - the mathematical basis of the modern encryption that protects banking and other sensitive data. The more qubits are entangled together in a quantum computer - entanglement lets particles influence each other instantly no matter how far apart they are - the more exponentially its processing power can grow.
One potential application of quantum ML is the simulation of quantum systems, such as chemical reactions, to create new drugs. But the average performance of an ML algorithm depends on how much data it has. The amount of data ultimately limits the performance of machine learning. Therefore, to simulate a quantum system, the amount of training data that a quantum computer might need will grow exponentially as the system being modeled gets larger. This potentially eliminates the advantage of quantum computing over classical computing.
The scientists proposed linking additional qubits to the quantum system that the quantum computer is meant to model. This additional set of "auxiliary" qubits can help the quantum ML circuit interact with many quantum states in the training data simultaneously, so the quantum ML scheme can work even with a relatively small number of auxiliary qubits. In practice, the idea is still quite difficult to implement, but it could be tested within the experiments of CERN, the largest particle physics laboratory in the world.
https://spectrum.ieee.org/quantum-machine-learning
#test
How do Shuffle operations affect the execution speed of a distributed program?
Anonymous Quiz
33%
increase
44%
decrease
23%
there is no any effect
👀Live monitoring of ML and software metrics on one platform
When machine learning systems are deployed, it is important to continuously monitor both the data and the models. Even if the ML model itself stays the same, the nature of the data can change, which can significantly affect user experience. There are many software monitoring platforms on the market that collect various system and business metrics and present the most important of them on dashboards - for example, Grafana, Datadog, Graphite, etc.
There are also dedicated tools for monitoring machine learning systems, such as Neptune, Amazon SageMaker Model Monitor, Censius, and other MLOps environments. But it is also possible to combine ML monitoring with classical software engineering monitoring on a single platform. This is what New Relic offers: a telemetry platform for remote monitoring of mobile and web applications that lets you collect, visualize, and alert on all telemetry data from any source in one place. Thanks to a large number of open-source integrations, New Relic can work with many data sources and sinks.
Sending data from an ML system to New Relic is implemented with the ml-performance-monitoring Python library and its quickstart, available on GitHub (https://github.com/newrelic-experimental/ml-performance-monitoring).
https://towardsdatascience.com/monitor-easy-mlops-model-monitoring-with-new-relic-ef2a9b611bd1
🤷How to evaluate changes in data models quickly? Use Datafold!
You can identify and evaluate changes between different versions of the same data model by writing your own script or by using the data lineage features built into dbt. But for the average business user or novice data analyst this is too much hassle. For such use cases there is Datafold (https://www.datafold.com/), a cloud-based product with useful features including data quality testing, data diffing, monitoring, and alerting. Its column-statistics feature helps evaluate the impact of changes in real conditions - in particular, showing the differences between two datasets down to individual columns and values. For large projects there is integration with dbt. Datafold works over a direct connection to your data warehouse and uses GitHub to compare changes, helping you modify dbt models while protecting data quality.
In practice, Datafold can be useful to product analysts for A/B testing of product features, to data engineers for regression testing of ETL pipelines, and to users of BI systems for reporting.
Use case: https://medium.com/geekculture/what-if-you-could-compare-changes-in-your-data-models-now-you-can-75f039580d08
#test
What statistical test is suitable to check a hypothesis about differences between small matched (dependent) samples?
Anonymous Quiz
31%
Wilcoxon signed-rank test
24%
Student's t-test
25%
Mann–Whitney U test
20%
Fisher's combined probability test
💫AI + quantum computing = quantum memristor
A memristor, or memory resistor, is a kind of building block for electronic circuits, first realized about 10 years ago. It is a switch that remembers its state (on or off) after power is removed - similar to synapses, the connections between neurons in the human brain, whose electrical conductance strengthens or weakens depending on how much charge has passed through them in the past.
In theory, memristors can act as artificial neurons capable of both computing and storing data. That is why neuromorphic (brain-like) computers built from memristors pair well with artificial neural networks, i.e. with ML models.
Unlike classical computers, which switch transistors on or off to represent data as 1s and 0s, quantum computers use qubits. Qubits can exist in a superposition, combining 1 and 0 at the same time. The more qubits are connected together in a quantum computer, the more exponentially its computing power can grow.
The quantum memristor is based on a flow of single photons in superposition, where each photon can travel along waveguide paths laser-written on glass. One of the paths in this integrated photonic circuit is used to measure the photon flux, and that data, through a feedback electronic circuit, controls the transmission along the other path. As a result, the device behaves like a memristor.
Normally, memristive behavior and quantum effects do not mix. Memristors work by measuring their internal state, while quantum effects are notoriously fragile in the face of external interference such as measurement. The researchers overcame this contradiction by engineering the interactions inside the device to be strong enough to enable memristivity, yet weak enough to preserve quantum behavior.
The advantage of using a quantum memristor in quantum ML over conventional quantum circuits is that the memristor, unlike any other quantum component, has memory. The next step is to connect several memristors together and increase the number of photons in each, which increases the number of states available for computation.
https://spectrum.ieee.org/quantum-memristor
🌞Development in Python according to the 12-factor SaaS principles with the Python-dotenv library
ML modelers and data analysts don't always write code like professional programmers. To improve code quality, use the simple twelve-factor methodology for developing web applications delivered as SaaS. It recommends that applications:
• use declarative formats for setup automation, to minimize the time and cost for new developers joining the project;
• have a clean contract with the underlying operating system, offering maximum portability between execution environments;
• are suitable for deployment on modern cloud platforms, obviating the need for server and systems administration;
• minimize divergence between development and production, enabling continuous deployment for maximum agility;
• can scale up without significant changes to tooling, architecture, or development practices.
To implement these SaaS ideas, it is proposed to build applications on 12 factors:
1. One codebase tracked in version control, many deploys
2. Explicitly declare and isolate dependencies
3. Store config in the environment
4. Treat backing services as attached resources
5. Strictly separate build and run stages
6. Execute the app as one or more stateless processes
7. Export services via port binding
8. Scale out via the process model (concurrency)
9. Maximize robustness with fast startup and graceful shutdown
10. Keep development, staging, and production as similar as possible
11. Treat logs as event streams
12. Run admin/management tasks as one-off processes
To implement all this in a Python program, there is the open-source Python-dotenv library. It reads key-value pairs from a .env file and can set them as environment variables. If an application takes its configuration from environment variables, as 12-factor apps do, launching it during development is not very practical, because the developer has to set all those environment variables by hand. Adding Python-dotenv to the application simplifies the development workflow: the library loads settings from the .env file itself, while the app remains configurable through the environment.
You can also load the configuration without modifying the environment, parse the configuration as a stream, and load .env files in IPython. The tool also has a CLI that lets you manipulate the .env file without opening it manually.
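A minimal sketch of the core workflow (file contents and variable names are illustrative):

```python
# .env (placed next to your script):
#   DATABASE_URL=postgres://user:pass@localhost/app
#   DEBUG=true

import os
from dotenv import load_dotenv, dotenv_values

load_dotenv()  # reads .env and exports its pairs into os.environ
print(os.getenv("DATABASE_URL"))

# Or load without touching the process environment at all:
config = dotenv_values(".env")  # plain dict: {"DATABASE_URL": ..., "DEBUG": "true"}
print(config["DEBUG"])
```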
https://github.com/theskumar/python-dotenv
🌞TOP-10 Data Science and ML conferences around the world in June 2022:
1. Jun 13, Machine Learning Methods in Visualisation for Big Data 2022. Rome, Italy. https://events.tuni.fi/mlvis/mlvis2022/
2. Jun 14-15, Chief Data & Analytics Officers, Spring. San Francisco, CA, USA. https://cdao-spring.coriniumintelligence.com/
3. Jun 15-16, The AI Summit London. London, UK. https://london.theaisummit.com/
4. Jun 20-22, CLA2022: The 16th International Conference on Concept Lattices and Their Applications. Tallinn, Estonia. https://cs.ttu.ee/events/cla2022/
5. Jun 19-24, Machine Learning Week, Predictive Analytics World conferences. Las Vegas, NV, USA. https://www.predictiveanalyticsworld.com/machinelearningweek/
6. Jun 20-21, Deep Learning World Europe. Munich, Germany. https://deeplearningworld.de/
7. Jun 21, Data Engineering Show On The Road. London, UK. https://hi.firebolt.io/lp/the-data-engineering-show-on-the-road-london
8. Jun 22, Data Stack Summit 2022. Virtual. https://datastacksummit.com/
9. Jun 28-29, Future.AI. Virtual. https://events.altair.com/future-ai/
10. Jun 29, Designing Flexibility to Address Uncertainty in the Supply Chain with AI. Chicago, IL, USA. https://www.luc.edu/leadershiphub/centers/aibusinessconsortium/upcomingevents/archive/designing-flexible-supply-chains-with-ai.shtml
#test
What is the difference between projection and view in relational databases?
Anonymous Quiz
3%
these terms are the same
11%
these terms are relevant to different context
74%
projection is operation of relational algebra, view is result of query execution
11%
view is operation of relational algebra, projection is result of query execution
🔥LAION-5B: open dataset for multi-modal ML for 5+ billion text-image pairs
On May 31, 2022, the non-profit AI research organization LAION presented the largest open dataset of 5.85 billion image-text pairs, filtered using CLIP. LAION-5B is 14 times larger than its predecessor LAION-400M, previously the world's largest open image-text dataset.
2.3 billion pairs are in English, and the rest of the dataset contains samples in over 100 other languages. The dataset also includes several nearest-neighbor indices, an improved web interface for exploration and subsetting, and watermark-detection and NSFW scores. The dataset is recommended for research purposes and is not curated.
The full 5-billion-pair dataset is divided into 3 subsets, each of which can be downloaded separately. They all share the following column structure:
• URL - image URL
• TEXT - captions, in English for en, in other languages for multi and nolang
• WIDTH - image width
• HEIGHT - image height
• LANGUAGE - sample language, laion2B-multi only, computed with cld3
• SIMILARITY - cosine similarity between text and image embeddings: CLIP ViT-B/32 for en, mCLIP for multi and nolang
• PWATERMARK - probability that the image is watermarked, computed with the LAION watermark detector
• PUNSAFE - probability that the image is unsafe, computed with the LAION CLIP-based detector
pwatermark and punsafe are also available as separate collections that must be joined on a hash of url+text.
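As a sketch of how these columns are typically used, here is a hypothetical filtering pass over one locally downloaded metadata shard (the file name is a placeholder, and the column names are assumed to match the list above; casing may differ between shards):

```python
import pandas as pd

# One metadata shard of the en subset, downloaded locally (illustrative name).
df = pd.read_parquet("laion2B-en-part-00000.parquet")

# Keep reasonably safe, non-watermarked, large-enough images.
subset = df[
    (df["punsafe"] < 0.5)
    & (df["pwatermark"] < 0.8)
    & (df["WIDTH"] >= 256)
    & (df["HEIGHT"] >= 256)
]
print(subset[["URL", "TEXT"]].head())
```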
Details and links to download: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
🔥GATO: the new SOTA from DeepMind
On May 19, 2022, DeepMind published an article about a new generalist agent that goes beyond the realm of text outputs. GATO works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, manipulate blocks, and perform other tasks, deciding from its context whether to output text, joint torques, button presses, or other tokens.
GATO is trained on a large number of datasets comprising the experience of agents in both simulated and real-world environments, in addition to many natural language and image datasets. During training, data from the various tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that GATO only predicts action and text targets.
When GATO is deployed, a demonstration prompt is tokenized, forming the initial sequence. The environment then emits the first observation, which is also tokenized and appended to the sequence. GATO samples the action vector autoregressively, one token at a time. Once all tokens that make up the action vector have been sampled (determined by the environment's action specification), the action is decoded and sent to the environment, which steps and produces a new observation. Then the procedure repeats. The model always sees all prior observations and actions within its context window of 1024 tokens.
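A schematic sketch of that control loop (this is not DeepMind's code; model, env, and the tokenization helpers are hypothetical stand-ins):

```python
def run_episode(model, env, prompt_tokens, action_len, context=1024):
    tokens = list(prompt_tokens)           # tokenized demonstration prompt
    obs = env.reset()
    done = False
    while not done:
        tokens += tokenize(obs)            # append the tokenized observation
        action_tokens = []
        for _ in range(action_len):        # sample the action autoregressively
            nxt = model.sample_next(tokens[-context:])  # 1024-token window
            tokens.append(nxt)
            action_tokens.append(nxt)
        action = detokenize_action(action_tokens)
        obs, done = env.step(action)       # environment steps, new observation
```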
https://www.deepmind.com/publications/a-generalist-agent
Data analytics - the blog of a leading Data Scientist working at Uber, one of the authors of 🔥Machine Learning. The channel's material will help you really grow into a data professional.
1 channel instead of thousands of textbooks and courses, subscribe: 👇👇👇
🚀 @data_analysis_ml
#test
A false signal from a car alarm sensor (without a real threat) is an error of which type?
Anonymous Quiz
41%
type II
47%
type I
4%
depends of statistical significance level
8%
it is not an error
🚀New Python: up to 60% faster!
Released in April 2022, the alpha of Python 3.11 can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with builds compiled by GCC, showed Python 3.11 scripts running on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now statically allocates its core code objects, speeding up execution. In addition, every time Python calls one of its own functions, a new frame is created, and the frame's internal structure has been slimmed down to hold only the most essential information, dropping extra memory-management and debugging data.
Also, as of release 3.11, when CPython encounters a Python function calling another Python function, it sets up a new frame and jumps straight to the new code inside it. This avoids calling the C function that interprets Python calls (previously, each call to a Python function went through a C-level interpreter call). This inlining of Python-to-Python calls further accelerates script execution.
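A quick way to check the speedup yourself: run the same micro-benchmark under both interpreters and compare (illustrative snippet; absolute numbers will vary by hardware):

```python
# Save as bench.py and run with both `python3.10 bench.py` and `python3.11 bench.py`.
import timeit

def fib(n):
    # Recursive Fibonacci: dominated by Python-to-Python function calls,
    # exactly the path that 3.11 optimizes.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(timeit.timeit("fib(20)", globals=globals(), number=500))
```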
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0