Big Data Science

📝Dataframe validation with Pandera
In large DS projects, the Great Expectations framework can be used to validate datasets and check data quality. Smaller tasks, however, call for simpler tools such as Pandera, a lightweight Python library that explicitly validates the contents of dataframes at runtime. Pandera lets you define a data schema once, using a class-based API with pydantic-style syntax, and use it to validate various types of dataframes, including pandas, dask, modin, and pyspark.pandas. You can check the types and properties of columns in a pd.DataFrame or the values in a pd.Series, and perform more complex statistical validation such as hypothesis testing. You can also synthesize data from schema objects for property-based testing with pandas data structures.
Function decorators let you integrate Pandera into existing data analysis and processing pipelines. With lazy validation, you can collect all validation errors for a dataframe in one pass instead of stopping at the first one. Finally, compatibility with other Python tools such as pydantic, fastapi, and mypy makes Pandera a useful tool for the ML developer and data analyst.
Documentation: https://pandera.readthedocs.io/en/stable/
Example: https://towardsdatascience.com/validate-your-pandas-dataframe-with-pandera-2995910e564

Medium

Validate Your pandas DataFrame with Pandera

Make Sure Your Data Matches Your Expectation

👍2

703 views14:41

Big Data Science

💥Why you need Modin: Pandas alternative for fast big data processing
Handling large dataframes with Pandas is slow, and the library does not support data that does not fit in available memory. As a result, Pandas workflows that work well for prototyping on a few MB of data don't scale to real datasets of tens or hundreds of GB. Because it executes operations in a single thread and entirely in RAM, Pandas is poorly suited to processing really large datasets. There is an alternative: Modin, a Python library with a Pandas-like API that scales across all processor cores using the Dask or Ray engine.
Modin supports working with data that won't fit in memory, so you can comfortably work with hundreds of GB without worrying about massive slowdowns or memory errors. With cluster and out-of-core support, Modin offers a DataFrame library with strong single-node performance and high scalability in a cluster.
In local mode (no cluster), Modin creates and manages a local Dask or Ray cluster for execution. There is no need to specify how to distribute the data, or even to know how many cores the system has. In fact, you can keep your existing Pandas code and simply change the import statement from pandas to modin.pandas to get a significant speedup even on a single machine. Modin provides speedups of up to 4x on a laptop with 4 physical cores.
Docs: https://modin.readthedocs.io/en/latest/index.html
Github: https://github.com/modin-project/modin

GitHub

GitHub - modin-project/modin: Modin: Scale your Pandas workflows by changing a single line of code

Modin: Scale your Pandas workflows by changing a single line of code - modin-project/modin

👍3

735 views03:47

Big Data Science

#test
The support-vector machine (SVM) method is used for

Anonymous Quiz

87%

classification and regression analysis

text generation with NLP

recommendation systems

prediction on noisy data

👍4

109 voters690 views06:27

Big Data Science

Z-scoring for simple and fast anomaly detection
Anomaly detection is a fairly common problem that covers many scenarios, from financial fraud to computer network failures. Some problems require complex machine learning models, but often simpler and cheaper methods are sufficient. For example, you may have sales data over a period of time and want to flag days with abnormally high volumes, or highlight customers with abnormally high numbers of credit card transactions for risk review.
For such cases, a simple statistical method of flagging outliers, called Z-scoring, will do. The Z-score is the difference between the current value and the mean, divided by the standard deviation. Z-scoring assumes that the variable is normally distributed. For skewed data, converting raw values to a logarithmic scale can improve the ability of most ML models to discern relationships and the ability of Z-scores to flag outliers.
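The definition above can be sketched in a few lines of standard-library Python. The sales figures and the cutoff of 2.0 are assumptions made for illustration; a suitable threshold depends on the data and on the sample size.

```python
import math
import statistics


def z_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def flag_outliers(values, threshold=2.0, log_scale=False):
    """Return indices whose |z| exceeds the threshold.

    With log_scale=True the values (which must be positive) are
    log-transformed first, which can help with skewed data.
    """
    data = [math.log(v) for v in values] if log_scale else list(values)
    return [i for i, z in enumerate(z_scores(data)) if abs(z) > threshold]


# Hypothetical daily sales volumes; the last day is unusually high.
daily_sales = [100, 102, 98, 101, 99, 103, 97, 100, 250]
outlier_days = flag_outliers(daily_sales)
outlier_days_log = flag_outliers(daily_sales, log_scale=True)
print(outlier_days, outlier_days_log)
```

Note that in a small sample a single extreme value also inflates the mean and standard deviation, so its Z-score is bounded; this is one reason the cutoff here is lower than the commonly quoted value of 3.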

714 views14:13

Big Data Science

In practice, implementing the Z-score is very simple: it can be written as a small script or even a set of SQL queries, so you can quickly get a lightweight MVP and test a hypothesis.
https://towardsdatascience.com/anomaly-detection-in-sql-2bcd8648f7a8

Medium

Anomaly Detection in SQL

How to implement fast, powerful, anomaly detection models directly in the data warehouse

732 views14:13

Big Data Science

🕸✍🏻3 Python-libraries for working with URLs
The task of processing URLs is quite common in practice. For example, you may need to compile a list of the most frequently visited sites, or of sites that may be visited from corporate computers during business hours. To automate such cases, the following Python libraries are useful:
• Yarl - lets you extract components from a URL and provides a convenient class for parsing and modifying web addresses. However, it only works with Python 3 and does not accept boolean values in its API: you need to convert booleans to strings yourself, using whatever convention you need. https://github.com/aio-libs/yarl
• Furl - makes parsing and manipulating URLs easier. The library has a wide range of features, but also some limitations. In particular, the furl object is mutable, so problems can arise when it is passed to other code. https://github.com/gruns/furl
• URLObject - a utility class for manipulating URLs with a clean API that favors well-named methods over operator overloading. The object is immutable, so each change creates a new URL object. However, the library does not perform any encoding or decoding, which the user has to handle on their own. https://github.com/zacharyvoase/urlobject
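As a dependency-free baseline for the "most frequently visited sites" task mentioned above, here is a sketch that uses only the standard library's urllib.parse rather than any of the three libraries listed. The URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(urls, n=3):
    """Count visits per host and return the n most common (host, count) pairs."""
    # .hostname is lower-cased and excludes the port; it is None if absent.
    hosts = (urlsplit(u).hostname for u in urls)
    return Counter(h for h in hosts if h).most_common(n)


# Hypothetical browsing log.
visited = [
    "https://example.com/news?id=1",
    "https://example.com/sport",
    "http://Example.org/",
    "https://docs.example.net/a/b",
    "https://example.org/about",
]
ranking = top_hosts(visited)
print(ranking)
```

The libraries above become more attractive once URLs have to be modified or rebuilt rather than just read.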

GitHub

GitHub - aio-libs/yarl: Yet another URL library

Yet another URL library. Contribute to aio-libs/yarl development by creating an account on GitHub.

👍2

798 views04:16

Big Data Science

#test
Why is multicollinearity of features a problem for ML?

Anonymous Quiz

18%

It is too hard to define dependent variables in the training dataset

60%

It reduces the reliability of results and the speed of calculations as the feature space grows

It reduces confidence intervals

15%

It increases the complexity of ML algorithms

👍2🥰2

89 voters847 views04:23

Big Data Science

👆🏻Something about deduplication with DISTINCT
You can exclude duplicates from a result set by simply adding the DISTINCT keyword to the SQL query. However, this simple solution is not always the right one. To guarantee that there are no duplicates, the DBMS has to compare all rows with each other and filter out the repeats. This takes a lot of CPU and memory, because the rows must be held and compared in memory, even if hashing is used at a low level. In addition, DISTINCT reduces parallelism, which slows down query execution.
DISTINCT removes duplicates, but it does not fix the incorrect joins and filters that most often cause them in practice, for example a CROSS JOIN, or using RANK instead of ROW_NUMBER with a poorly defined window partition. See here for details with code examples: https://jmarquesdatabeyond.medium.com/sql-like-a-pro-please-stop-using-distinct-31bdb6481256
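The RANK versus ROW_NUMBER case can be reproduced with Python's built-in sqlite3 module (window functions require SQLite 3.25 or later). The table, its columns and its rows are hypothetical; the task is "latest order per customer".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT);
    INSERT INTO orders (customer, order_date) VALUES
        ('alice', '2022-04-01'),
        ('alice', '2022-04-05'),
        ('alice', '2022-04-05'),  -- same date as the previous order: a tie
        ('bob',   '2022-04-03');
""")

# Template for "keep the first row of every customer's window".
QUERY = """
    SELECT customer, order_date FROM (
        SELECT customer, order_date,
               {fn}() OVER (
                   PARTITION BY customer
                   ORDER BY order_date DESC{tiebreak}
               ) AS rn
        FROM orders
    ) WHERE rn = 1
    ORDER BY customer
"""

# RANK gives tied rows the same rank, so alice appears twice.
with_rank = conn.execute(QUERY.format(fn="RANK", tiebreak="")).fetchall()

# ROW_NUMBER with a deterministic tie-breaker yields exactly one row each.
with_row_number = conn.execute(
    QUERY.format(fn="ROW_NUMBER", tiebreak=", id DESC")
).fetchall()

print(with_rank)
print(with_row_number)
```

Adding DISTINCT to the RANK query would hide the repeated row in this output, but only because the two selected columns happen to be identical; the underlying window definition would still be wrong.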

Medium

SQL Like a Pro: Please Stop Using Distinct!!

Every time I see a “DISTINCT” I ask the same question: Why??

🔥2

823 views02:07

Big Data Science

💥DataSpell: A professional data science development environment from JetBrains
Missing a comfortable development environment when working in lightweight Jupyter notebooks? Need to write Python code in a reliable IDE with all the DS libraries at hand? Try DataSpell by JetBrains, a professional IDE in the style of PyCharm that combines support for many popular data analysis and machine learning libraries with a powerful set of developer tools.
Released in 2020, today DataSpell is in demand by machine learning developers and data analysts around the world.
https://www.jetbrains.com/ru-ru/dataspell/

JetBrains

JetBrains DataSpell: The IDE for Data Scientists.

JetBrains DataSpell is an IDE for data science with intelligent Jupyter notebooks, interactive Python scripts, and lots of other built-in tools.

🔥1🤯1

878 views05:00

Big Data Science

#test
What can be used to reduce the risk of overfitting an ML model?

Anonymous Quiz

👍1

133 voters703 views04:55

Big Data Science

☀️TOP-15 Data Science and ML conferences all over the World in May 2022:
• 5-6 May - The #1 MLOps Conference on the planet - Marriott Marquis, New York, NY https://rev.dominodatalab.com/
• 5-6 May - Data Innovation Summit 2022 - KISTAMÄSSAN, STOCKHOLM https://datainnovationsummit.com/
• 10-12 May - Wrangle Summit 2022 Virtual https://www.trifacta.com/events/wrangle-summit-2022/
• 11-12 May - Big Data & AI World. Frankfurt, Germany. https://www.bigdataworldfrankfurt.de/
• 12-13 May - The Data Science Conference. Chicago, IL, USA https://www.thedatascienceconference.com/
• 12 May, 9 AM ET - Ontotext Demo-Day. Virtual. https://event.gotowebinar.com/event/bfd3b6ef-828c-46a1-a644-b4e785cece6c
• 15-18 May - FLAIRS-35: Special Track on Neural Networks and Data Mining, Jensen Beach, FL, USA. https://sites.google.com/view/flairs-35-nn-dm-track/home
• 17 May - The data dividend: reimagining data strategies to deepen insight. San Francisco, CA, USA https://events.economist.com/custom-events/the-data-dividend-san-francisco/
• 18 May - Data Science Mini Salon | AI and ML in Retail & E-Commerce. Virtual. https://www.datascience.salon/retail-and-ecommerce
• 23-25 May - TDWI Visualization, Dashboards, and Analytics Adoption https://tdwi.org/events/seminars/may/dashboards-visualization-analytics-adoption/home.aspx
• 24-25 May - Graph + AI Summit. Virtual. https://www.tigergraph.com/graphaisummit
• 24-25 May - Chief Data & Analytics Officers, Insurance US. New York, NY, USA. https://cdaoi.coriniumintelligence.com/
• 25-26 May - Data Reliability Engineering Conference. Virtual https://drecon.org/
• 26 May - Zero Gravity: A Modern Cloud Data Pipeline Event. Virtual. https://www.incorta.com/zerogravity
• 30 May - HeyGrowth - Yerevan, Armenia https://heygrowth.com/yerevan

Dominodatalab

Rev 4: MLOps and Data Science Conference | Powered by Domino

Rev is the largest MLOps and Data Science conference that happens just once a year. Where movers and shakers in the industry gather for two days of unparalleled learnings, mind-expanding conversations, interactive sessions and networking with industry luminaries.

🔥3

1.08K views04:35

Big Data Science

💫Continuous Machine Learning: CML for CI/CD
Need to introduce CI/CD into the development of ML systems? Try CML, an open source CLI tool from Iterative.ai for implementing CI/CD within MLOps. It is suitable for automating ML model development workflows, including provisioning, training and evaluation, comparison of experiments across the project history, and monitoring of changing datasets. CML is based on the following principles:
• GitLab or GitHub for managing ML experiments, and DVC for tracking model training and data changes.
• Automated reports for machine learning experiments, with metrics and graphs on every Git pull request, to support data-driven decisions.
• No additional services: only GitLab, Bitbucket or GitHub, Docker and DVC. Optionally, you can add cloud storage, as well as self-hosted or cloud workers such as AWS EC2 or MS Azure.
CML introduces CI/CD-style automation into the workflow: most of the configurations are defined in the cml.yaml file stored in the repository. This file specifies what actions should be taken when a new feature branch is ready to be merged into the main branch. When a pull request is created, GitHub Actions uses this workflow and performs the actions specified in the configuration file.
Source code: https://github.com/iterative/cml
Documentation: https://cml.dev/doc
Use case example: https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da

GitHub

GitHub - iterative/cml: ♾️ CML - Continuous Machine Learning | CI/CD for ML

♾️ CML - Continuous Machine Learning | CI/CD for ML - iterative/cml

👍1🔥1

920 views03:51

Big Data Science

#test
What method in Apache Spark works with the file system instead of RAM?

Anonymous Quiz

🔥2

94 voters701 views07:59

Big Data Science

YDB: scalable fault-tolerant NewSQL DBMS from Yandex. Now open source
April 19, 2022. Yandex has published the source code of YDB, a distributed NewSQL DBMS that lets you build scalable, fault-tolerant services capable of sustaining a heavy operational load. The code is available under the Apache 2.0 license.
YDB is an open-source Distributed SQL Database that combines high availability and scalability with strict consistency and ACID transactions. YDB provides fault tolerance and automatic recovery in the event of hardware failures or even the failure of a data center. The reliability of YDB has been tested on Yandex services (Alisa, Taxi, Market, Metrika and almost 500 more projects). You can deploy YDB both on your own servers and on third-party ones, including Yandex Cloud or other cloud providers.
https://ydb.tech/
https://github.com/ydb-platform/ydb

ydb.tech

YDB — Beyond Distributed SQL Database

YDB is an AI-powered Distributed SQL DBMS that unifies transactional, analytical, federated, and streaming workloads, delivers strict consistency and high availability, and brings AI capabilities directly to developers.

🔥3

826 views01:57

Big Data Science

🗒Loguru for logging in Python scripts
This library is useful for ML specialists and data engineers who often write in Python. It simplifies logging and debugging. In addition, Loguru includes a number of useful features that address the caveats of the standard loggers.
Loguru works out of the box and offers features such as log file rotation, fast compression of log files, and regular deletion of old ones. It is also thread-safe and supports colored log output. This open source library can be combined with a notification library to receive emails or to send other types of messages.
Finally, Loguru is compatible with Python's standard logging module, so records from a standard logger can be forwarded to Loguru.
Source code: https://github.com/Delgan/loguru
Use case example: https://medium.com/geekculture/python-loguru-a-powerful-logging-module-5f4208f4f78c

GitHub

GitHub - Delgan/loguru: Python logging made (stupidly) simple

Python logging made (stupidly) simple. Contribute to Delgan/loguru development by creating an account on GitHub.

🔥4👍1

822 viewsedited 03:42

Big Data Science

#test
What is the main difference between MapReduce-operations in Spark and Hadoop?

Anonymous Quiz

the different dataset's scale

🔥6

113 voters725 views04:11

Big Data Science

🔥TOP 5 new features in Python 3.11 Alpha 5
At the beginning of 2022, a new pre-release of Python came out: 3.11 Alpha 5. Main features:
• Improved debugging through more precise error messages. Tracebacks in Python 3.11 point directly to the expression where the error occurred. Previously, narrowing down the location required adding context to the code by hand, which made things harder. Now the context is provided automatically.
• Exception groups - several unrelated exceptions can now be raised and handled together. An ExceptionGroup bundles many different exceptions into one, and each except* clause handles the exceptions in the group that match its type. A handler is called only if a matching exception actually occurred.
• Variadic generics - type hints can now describe generic types parameterized by an arbitrary number of types, using TypeVarTuple. This is useful, for example, for annotating the shapes of multi-dimensional arrays.
• CPython performance optimizations. Changes to function calls and dictionary lookups should reduce overhead and C stack usage, speeding up everything from object-oriented code to dictionary access.
• Easier interoperation with other languages such as JavaScript on top of Python, thanks to higher performance and parallel computing.
https://morioh.com/p/af7debd024e2
https://medium.com/@Sabrina-Carpenter/python-alpha-5-is-here-5-promising-features-that-will-blow-your-mind-a4abd406d0ad

Morioh

Python Alpha 5 - 5 Promising Features that will blow your mind 🤯

🔥2

688 views03:04

Big Data Science

💫Graph visualization with PyGraphistry
PyGraphistry is a Python visual graph AI library that lets you extract, transform, analyze and visualize large graphs, especially alongside end-to-end Graphistry GPU server sessions. Graphistry was built specifically for large graphs. The client's custom WebGL rendering engine renders up to 8 million nodes and edges at a time, and most client GPUs smoothly handle between 100,000 and 2 million elements. The server-side GPU analytics engine supports even larger graphs. Graphistry smooths graph workflows across the PyData ecosystem, including Pandas/Spark/Dask dataframes, Nvidia RAPIDS GPU dataframes and GPU graphs, DGL/PyTorch graph neural networks, and various data connectors.
PyGraphistry is a streamlined and optimized PyData-native interface to the language-independent Graphistry REST APIs. You can use PyGraphistry with common Python data sources such as CSV, SQL, Neo4j, Splunk and more.
The PyGraphistry Python client serves different categories of users:
• Data scientist: go from data to accelerated visual exploration in a couple of lines, share live results, build up more complex views over time, and do it all in Jupyter Notebook and Google Colab.
• Developer: quickly prototype impressive Python solutions with PyGraphistry, embed them in a language-independent way with the REST API, and customize colors, icons, layouts, JavaScript, and more.
• Analyst: customize visual dashboards using interactive search, filters, timelines, bar charts, and more, embedding them in any framework.
https://github.com/graphistry/pygraphistry

GitHub

GitHub - graphistry/pygraphistry: PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the…

PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer - graphistry/pygraphistry

🔥3

706 views04:04

Big Data Science

#test
Key difference between window and aggregate functions is

Anonymous Quiz

Aggregate functions operate on a set of values to return a range of values

They are the same, but the window functions are more difficult to write

58%

Window functions operate on a set of values to return a range of values

34%

Aggregate functions return single value for each row from the underlying query
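The distinction the quiz asks about can be checked with Python's built-in sqlite3 module (window functions require SQLite 3.25 or later). The table and values are hypothetical: an aggregate with GROUP BY collapses each group into one row, while a window function keeps every input row and attaches the computed value to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 30), ('south', 5);
""")

# Aggregate function: one output row per group.
aggregated = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window function: one output row per input row, each with its group total.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()

print(aggregated)
print(windowed)
```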

🔥3

77 voters500 views15:02

Big Data Science

🗣Estimating the information content of data: a new method from MIT
Information and data are different things, and not all data are equally valuable. How much information can be obtained from any given piece of data? This question first arose in the 1948 paper "A Mathematical Theory of Communication" by the late MIT Professor Emeritus Claude Shannon. One of its breakthrough results is Shannon's idea of entropy, which makes it possible to quantify the amount of information inherent in any random object, including random variables that model observed data. Shannon's results laid the foundation for information theory and modern telecommunications. The concept of entropy has also found its way into computer science and machine learning.
Using Shannon's formula can quickly become computationally intractable. It requires precisely calculating the probability of the data, which in turn requires accounting for every possible way the data could have arisen under the probabilistic model. Consider, for example, medical testing, where a positive test result is the outcome of hundreds of interacting variables, all of them unknown. With only 10 unknowns, there are already about 1,000 possible explanations for the data. With a few hundred, there are more possible explanations than atoms in the known universe, which makes calculating the entropy exactly an unmanageable problem.
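To make the cost concrete, here is a brute-force sketch of Shannon entropy, H = -Σ p(x) log2 p(x), computed by enumerating every joint outcome. It assumes n independent binary unknowns purely so that the result is easy to check (10 fair coin flips carry 10 bits); in the interacting case the article describes, the same 2^n enumeration is needed but no shortcut exists. This is not the MIT method, only the exhaustive baseline it avoids.

```python
import itertools
import math


def entropy_by_enumeration(probs):
    """Entropy in bits of independent binary variables, by brute force.

    probs[i] is the (assumed) probability that variable i equals 1.
    Returns the entropy and the number of joint outcomes visited.
    """
    total = 0.0
    outcomes = 0
    for bits in itertools.product([0, 1], repeat=len(probs)):
        # Probability of this particular joint outcome.
        p = math.prod(q if b else 1 - q for q, b in zip(probs, bits))
        outcomes += 1
        if p > 0:
            total -= p * math.log2(p)
    return total, outcomes


h, n_outcomes = entropy_by_enumeration([0.5] * 10)
print(h, n_outcomes)
```

Each additional binary unknown doubles the number of outcomes in the loop, which is why exact enumeration stops being feasible long before a few hundred variables.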
MIT researchers have developed a new method for estimating good approximations to many information quantities, such as the Shannon entropy, using probabilistic inference. The work is presented in an AISTATS 2022 conference paper. The key insight is that, instead of enumerating all explanations, you can use probabilistic inference algorithms to first infer which explanations are probable, and then use those probable explanations to build high-quality entropy estimates. The paper shows that this inference-based approach can be much faster and more accurate than previous approaches.
Estimating entropy and information in a probabilistic model is fundamentally difficult, since it often requires solving a high-dimensional integration problem. Much past work has produced estimators of these quantities for certain special cases, but the new estimators of entropy via inference (EEVI) are the first approach that can give sharp upper and lower bounds for a broad set of information-theoretic quantities. Even though the true entropy is unknown, we can get a number that is lower than it and a number that is higher. The gap between the upper and lower bounds indicates how confident we should be in the estimates. By using more computational resources, the gap between the two bounds can be driven toward zero, which "squeezes" the true value with a high degree of accuracy. The bounds can also be combined to estimate how informative different variables in a model are about each other. The new method is particularly useful for querying probabilistic models in areas such as medical diagnostics.
https://news.mit.edu/2022/estimating-informativeness-data-0425

MIT News

Estimating the informativeness of data

MIT researchers discovered a new way to estimate the amount of information contained in a piece of data using probabilistic programming and probabilistic inference. The breakthrough entropy estimators open up new applications in medicine, scientific discovery…

619 views03:10

Big Data Science

🙌🏼MEGAscaling with quantum ML
Theoretically, quantum computers could be more powerful than any conventional computer, especially at finding the prime factors of numbers, the mathematical basis of the modern encryption that protects banking and other sensitive data. The more components known as qubits are linked together in a quantum computer through entanglement, where multiple particles can instantly influence each other no matter how far apart they are, the more its processing power can grow, and it can do so exponentially.
One potential application of quantum ML is the simulation of quantum systems, such as chemical reactions, to create new drugs. But the average performance of an ML algorithm depends on how much data it has, so the amount of data ultimately limits the performance of machine learning. In particular, to simulate a quantum system, the amount of training data that a quantum computer might need grows exponentially as the system being modeled gets larger. This potentially eliminates the advantage of quantum computing over classical computing.
The scientists proposed entangling additional qubits with the quantum system that the quantum computer should model. This additional set of "auxiliary" qubits can help the quantum ML circuit interact with many quantum states in the training data simultaneously. As a result, the quantum ML scheme can work even with a relatively small number of auxiliary qubits. In practice, it is still quite difficult to implement this idea, but it can be tested within the framework of experiments at CERN, the largest particle physics laboratory in the world.
https://spectrum.ieee.org/quantum-machine-learning

IEEE Spectrum

Spooky Action Could Help Boost Quantum Machine Learning

Mysterious quantum links could help lead to exponential scale-up

610 views03:31
