Big Data Science
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🚀Need a visual dashboard quickly? Try Panel!
Every data analyst knows that a dashboard should summarize the most important indicators for decision makers. On the one hand, a dashboard is just an HTML page for quickly viewing graphs and text. On the other hand, making it clear and not overloaded is not so easy. Moreover, not every BI specialist has design taste or the skills to work with HTML components and wire them up with JavaScript. But data specialists do actively use Python.
You can build a dashboard using only Python with Panel, an open-source Python library. It lets you create custom interactive web applications and dashboards, with widgets that connect user input to graphs, images, tables or text. With it, you do not need to know how to create HTML components or wire them up with JavaScript, because everything is written in Python.
Panel supports almost all plotting libraries and works just as well in Jupyter notebooks as on a standalone secure web server, using the same code in both environments. It combines Python with generated HTML/JavaScript, can export applications to static HTML, and can run rich interactive applications without tying domain-specific code to any particular GUI or web toolkit.
Panel makes the following easy:
• using Python tools for data analysis and processing;
• developing in an IDE or notebook environment and then deploying the result as an application;
• rapid prototyping of applications and dashboards, with several polished templates for final deployment;
• deep interlinking of widgets, with interactions and events handled on the client side or in Python;
• transferring large and small data to the front end;
• authentication in the application using built-in OAuth providers.
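As a minimal sketch of how this looks in practice (the data, widget name and smoothing logic below are made up purely for illustration; this is a rough sketch, not Panel's official example), a small dashboard is just a few lines of Python:
import numpy as np
import pandas as pd
import panel as pn

pn.extension()  # load Panel's JS/CSS (needed in notebooks)

# toy data: a noisy cumulative series
df = pd.DataFrame({"y": np.random.randn(200).cumsum()})

window = pn.widgets.IntSlider(name="Smoothing window", start=1, end=30, value=5)

def summary(w):
    # recomputed every time the slider moves
    smoothed = df["y"].rolling(w, min_periods=1).mean()
    return pn.pane.DataFrame(smoothed.describe().to_frame("smoothed y"))

dashboard = pn.Column(
    "## Toy KPI dashboard",    # plain strings are rendered as Markdown
    window,
    pn.bind(summary, window),  # the pane re-renders on widget changes
)
dashboard.servable()  # `panel serve this_script.py` exposes it as a web app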
Usage example: https://medium.com/@jairotunior/advanced-interactive-dashboards-in-python-cc2927dcde07
👍4
🤔PandaSQL: Python Combo for Data Scientist
SQL and Pandas are the most popular tools for managing, processing and analyzing tabular data. They can be used independently or together via PandaSQL, a Python package that brings SQL syntax into the Python environment. PandaSQL allows you to query Pandas dataframes using SQL syntax.
PandaSQL is a great tool for those who know SQL but are not yet familiar with the Pandas syntax. For example, filtering a dataframe or grouping by a column while aggregating several other columns can look confusing in Pandas, unlike in SQL. However, this convenience of using SQL on top of Pandas comes at the cost of redundant runtime. For example, to count the number of rows, PandaSQL needs almost 100 times more execution time than Pandas.
Also, Python already reserves many names as keywords, such as for, while, in, if, else, elif, import, as. SQL adds even more: create, like, where, having.
Thus, PandaSQL is worth trying, but this interesting tool is not suitable for production data analytics pipelines.
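A minimal sketch of the trade-off (the toy dataframe below is made up for illustration): the same aggregation written in SQL via PandaSQL and in native Pandas, which is usually much faster:
import pandas as pd
from pandasql import sqldf  # pip install pandasql

df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 30]})

query = """
    SELECT city, COUNT(*) AS n, SUM(sales) AS total
    FROM df
    GROUP BY city
    HAVING SUM(sales) > 15
"""
print(sqldf(query, locals()))  # SQL syntax over a Pandas dataframe

# equivalent native Pandas, typically much faster
print(df.groupby("city")["sales"].agg(n="count", total="sum").query("total > 15"))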
Usage examples and runtime comparison: https://towardsdatascience.com/the-downsides-of-pandasql-that-no-one-talks-about-9b63c664bef4
👍3
#test
Which statistical term refers to a systematic relationship between two random variables, in which a change in one variable is reflected by a change in the other?
Anonymous Quiz
29%
Covariance
0%
Delta
71%
Correlation
0%
Interpretation
👍5
Visual ETL with VDP
VDP (Visual Data Preparation) is an open-source visual data ETL tool for optimizing the end-to-end visual data processing pipeline. It involves extracting unstructured visual data from pre-built data sources such as cloud/local storage or IoT devices, transforming it into parsable structured data using Vision AI models, and loading the processed data into repositories, applications, or other destinations.
VDP streamlines the end-to-end visual data processing pipeline by eliminating the need for developers to build their own connectors, model serving platforms, and ETL automation tools. With VDP, visual data integration becomes easier and faster. VDP is released under the Apache 2.0 license and is available for local and cloud deployment on Kubernetes. It is built from a data management perspective to optimize the end-to-end flow of visual data, with a transformation component that can flexibly import Vision AI models from different sources. Building an ETL pipeline becomes like assembling ready-made blocks, as in a Lego set. High performance is provided by a Go backend and Triton Inference Server running on NVIDIA GPUs, supporting TensorRT, PyTorch, TensorFlow, ONNX and Python models.
VDP also fits into MLOps, allowing one-click import and deployment of ML/DL models from GitHub, Hugging Face, or cloud storage managed by version control tools such as DVC or ArtiVC. Standardized output formats for CV tasks simplify data warehousing, while pre-built ETL connectors provide advanced data access through integration with Airbyte.
VDP supports different usage scenarios: synchronous for real-time inference and asynchronous for on-demand workloads. The scalable API-based microservice design is developer-friendly and integrates seamlessly with the modern data stack, while No Code/Low Code interfaces lower the barrier to entry, giving data scientists and analysts independence from data engineering.
https://github.com/instill-ai/vdp
👍2
🥒7 Reasons Not to Use Pickle to Save ML Models
A Data Scientist often writes code in notebooks like Jupyter Notebook, Google Colab, or specialized IDEs. To port this code to a production environment, it must be converted to a lightweight interchange format, compressed and serialized, that is independent of the development language. One of these formats is Pickle, a binary version of a Python object for serializing and deserializing its structure, converting a hierarchy of Python objects into a stream of bytes and vice versa. The Pickle format is quite popular due to its lightness. It does not require a data schema and is quite common, but it has a number of disadvantages:
• Unsafe. You should only unpack pickle files that you trust. An attacker can craft malicious data that will execute arbitrary code during unpickling. You can mitigate this risk by signing the data with HMAC to make sure it hasn't been tampered with. The unreliability comes not from the fact that pickles contain code, but from the fact that they create objects by calling the constructors named in the file. Any callable can be used in place of a class name to create objects, so malicious data can use other Python callables as "constructors".
• Code mismatch. If the code changes between the time the ML model is packaged in the Pickle file and when it is used, the objects may not match the code. They will still have the structure created by the old code, but will try to work with the new version. For example, if an attribute was added after the Pickle was created, the objects in the Pickle file will not have that attribute. And if the new version of the code is supposed to process it, there will be problems.
• Implicit serialization. On the one hand, the Pickle format is convenient because it serializes any Python object structure. On the other hand, there is no way to express preferences about how a particular type of data should be serialized. Pickle serializes everything in an object, even data that doesn't need to be serialized, and there is no way to skip a particular attribute. If an object contains an attribute that cannot be pickled, such as an open file handle, Pickle will not skip it; it will insist on trying to serialize it and then throw an exception.
• Lack of initialization. Pickle stores the entire structure of objects. When the pickle module recreates them, it does not call the __init__ method, because the object is considered to have been initialized when it was first created, before the Pickle file was written. But __init__ can do important things, such as opening file objects; in that case the restored objects will be in a state inconsistent with __init__. Or initialization may register information about the object being created; then unpickled objects will not show up in that registry or log (see the sketch after this list).
• Unreadable. A pickle is a stream of binary data, i.e. instructions for an abstract execution machine. If you open a Pickle file as a normal file, you cannot read its contents; to find out what is inside, you have to load it with the pickle module. This can make debugging difficult, since it is hard to find the desired data in binary dumps.
• Binding to Python. Being a Python library, Pickle is specific to this programming language. Although the format itself could be read from other languages, it is hard to find packages that provide such capabilities, and they would be limited to common cross-language structures such as lists and dicts. Pickle happily serializes objects containing callable functions and classes, but the format does not store the code itself, only the name of the function or class. When unpacking the data, those names are used to look up the existing code in the running process.
• Low speed. Finally, compared to other serialization methods, Pickle is much slower.
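A minimal sketch of the initialization point (the Logger class is made up for illustration): unpickling restores attribute state but silently skips __init__ and its side effects:
import pickle

class Logger:
    registry = []                      # pretends to track every created object

    def __init__(self, name):
        self.name = name
        Logger.registry.append(name)   # side effect that unpickling will skip

a = Logger("run-1")
restored = pickle.loads(pickle.dumps(a))

print(restored.name)    # 'run-1' -- the attribute state came back
print(Logger.registry)  # ['run-1'] -- __init__ was NOT called again on load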
👍3
🌴🌳🌲Decision Trees: brief overview
In general, decision trees are built with an algorithmic approach that finds ways to split a data set based on various conditions. They are one of the most widely used and practical non-parametric supervised learning methods, applicable to both classification and regression tasks and to both continuous and categorical output variables. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Entropy is a measure of uncertainty or randomness in a data set; it governs how a decision tree splits the data. Information gain measures the decrease in entropy after the data set is split. The Gini index is used to choose the right variable for splitting a node: it measures how often a randomly chosen element would be incorrectly classified. The root node is always the top node of a decision tree; it represents the entire population or data sample and can be further divided into different subsets. Decision nodes are subnodes that can be split further; they contain at least two branches. A leaf node carries the final result. These nodes, also known as terminal nodes, cannot be split any further.
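A minimal sketch of these impurity measures on a toy binary split (the labels below are made up for illustration):
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])                 # 4 vs 4: maximum impurity
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

# information gain = parent entropy - weighted entropy of the children
gain = entropy(parent) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(round(entropy(parent), 3), round(gini(parent), 3), round(gain, 3))  # 1.0 0.5 0.189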
Decision Tree Applications:
• to determine whether an applicant is likely to default on a loan.
• to determine the odds of an individual developing a specific disease.
• to find customer churn rates.
• to predict whether a consumer is likely to purchase a specific product.
Advantages of Using Decision Trees
• Decision trees are simple to understand, interpret, and visualize
• They can effectively handle both numerical and categorical data
• They can determine the worst, best, and expected values for several scenarios
• Decision trees require little data preparation and no data normalization
• They perform well, even if the actual model violates the assumptions
Disadvantages
• Overfitting is one of the practical difficulties for decision tree models. It happens when the learning algorithm continues developing hypotheses that reduce the training set error but at the cost of increasing test set error. But this issue can be resolved by pruning and setting constraints on the model parameters.
• Decision trees do not handle continuous numerical variables well, since splitting discretizes them into ranges, which can lose information.
• A small change in the data tends to cause a big difference in the tree structure, which causes instability.
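A minimal sketch of training and inspecting a pruned tree with scikit-learn (the dataset and hyperparameter values are chosen only for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth and min_samples_leaf act as pre-pruning constraints against overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=load_iris().feature_names))  # human-readable rules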
https://blog.devgenius.io/decision-tree-regression-in-machine-learning-3ea6c734eb51
👍3
👍ETL and data integration with Airbyte
The key component of any data pipeline is data extraction. Once data is extracted, it needs to be loaded and transformed (ELT). Airbyte, an open-source data integration platform, aims to standardize and simplify the extract-and-load process. Airbyte works in the ELT style, extracting raw data and loading it to destinations, and it also lets you perform data transformations while keeping them separate from the EL phases. It simplifies the process by providing connectors between data sources and destinations, and as a plugin-based system it lets you quickly build your own custom connector using the CDK.
If the 170+ ready-made source connectors and 25+ destinations are not enough, you can develop your own. Airbyte has a built-in scheduler that supports different sync frequencies, and integration with Airflow and dbt is supported. It can run on Kubernetes, includes octavia-cli with YAML templates for deployment, and supports near real-time CDC.
However, the framework is still in alpha and does not support IAM role-based authentication for AWS services. There is no built-in support for Prometheus, although OpenTelemetry support has recently been added. Airbyte does not provide user access control or replaying a specific job execution instance, and performance can degrade with 2000+ simultaneous tasks.
https://airbyte.com/
👍3
💫Need to automate your WhatsApp messages? Use PyWhatKit!
PyWhatKit is a Python library with various useful features. It is easy to use and does not require additional configuration. It allows you to automate the following actions:
• Sending a message to a WhatsApp group or contact
• Send image to WhatsApp group or contact
• Convert image to ASCII Art
• Convert string to handwriting
• YouTube video playback
• Sending emails with HTML code
For example, the following code will send the message "Hello!" to the number +78912345678 at 12:15
import pywhatkit
pywhatkit.sendwhatmsg("+78912345678", "Hello!", 12, 15)
The sending time must be set in advance, at least 2-3 minutes after the current time when the script is launched, otherwise the module will raise an error. Also, before running the script, make sure you are logged into WhatsApp Web in Google Chrome, via the web interface or the desktop application, by scanning the QR code with your mobile phone.
https://pypi.org/project/pywhatkit/
👍5
🤔How to extract tables from PDF? Try Camelot!
The open-source Camelot library helps extract tables from PDF files. Before installing it, you need to install the Tkinter and Ghostscript dependencies. Camelot itself can be installed through the pip or conda package managers:
pip install camelot-py
conda install -c conda-forge camelot-py
Next, you need to, as usual, import the desired module from the library in order to use its methods:
import camelot
tables = camelot.read_pdf('foo.pdf', pages='1', flavor='lattice')
The flavor parameter is set to lattice by default but can be switched to stream. Lattice is more deterministic and is great for parsing tables with ruling lines between cells; it can automatically parse multiple tables present on a page. Lattice converts the PDF page to an image using Ghostscript and then processes it to detect horizontal and vertical line segments by applying a set of morphological transformations with OpenCV.
After extracting the tables from a PDF, use the export() method to write them all to files, to_csv() to save a single table, or the df attribute to work with it as a dataframe:
tables.export('foo.csv', f='csv', compress=True)
tables[0].to_csv('foo.csv') # to a csv file
print(tables[0].df) # to a df
https://camelot-py.readthedocs.io/en/master/user/install.html
👍6
#test
The power of an experiment does NOT depend on which of the following factors?
Anonymous Quiz
28%
Significance level (alpha)
3%
Sample size
12%
Variability
26%
Effect size
31%
Standard deviation
👍4
🍁TOP-10 DS-events all over the world
• Oct 4-5 • Modern DataFest • Virtual https://www.datacated.com/moderndatafest
• Oct 4-6 • NLP Summit 2022 • Virtual https://www.nlpsummit.org/
• Oct 5-6 • Deep Learning World • Berlin, Germany https://deeplearningworld.de/
• Oct 5-6 • Marketing Analytics Summit • Berlin, Germany https://marketinganalyticssummit.de/
• Oct 6-7 • Big Data & AI Toronto • Toronto, ON, Canada https://www.bigdata-toronto.com/
• Oct 10-13 • Chief Data & Analytics Officers, Fall • Boston, MA, USA https://cdao-fall.coriniumintelligence.com/
• Oct 13-14 • Linq Conference • Use the promo code ChernobrovovxLinq for a 5% discount on the ticket price https://linqconf.com
• Oct 17-21 • DeepLearn 2022 Autumn: 7th International School on Deep Learning • Luleå, Sweden https://irdta.eu/deeplearn/2022au/
• Oct 25-26 • IMPACT 2022: The Data Observability Summit • London, UK, New York, NY, Bay Area, CA + Virtual https://impactdatasummit.com/2022
• Oct 27-28 • The Data Science Conference • Chicago, IL, USA https://www.thedatascienceconference.com/home
👍3
🌎 How to create a machine learning project?

You will find out on October 4 at 19:00 at Mathshub's Open Day.

Aira Mongush (founder of Mathshub, teacher of AI programs at MIPT/HSE, ex-Mail.Ru, ex-aitarget.com) will explain in detail how easy it is to create an ML project from scratch:

Choosing an Idea
Blue ocean search
Building a minimal step-by-step plan
ML product release

📍 In addition, the Mathshub team will talk about the school and its teachers: David Dale from AI Research, Igor Slinko (ex-Samsung AI Research) and other specialists with extensive practice in deep learning and teaching.

📍And they will also present the autumn program for creating ML projects.

Who will be interested:
— those who want to put ML into practice
— those who have a product idea with ML
— those who want to create a portfolio in ML
— engineers and product managers who create features with ML algorithms
Sign up at the link
👍2
🚀Pandas features: use at and iat in loops instead of iloc and loc
The Python library Pandas has the loc and iloc accessors for reading dataframe values by row/column position or label. But calling them inside loops is time-consuming. If you replace loc with at, or iloc with iat, the execution time of a for loop can be reduced by a factor of 60!
This difference in speed comes from the nature of these accessors: at and iat access a scalar, i.e. a single element of a dataframe, while loc and iloc access whole sets of elements (rows, columns, subframes) and are designed for vectorized operations.
Because at and iat access a single scalar, they are faster than loc and iloc for element-wise access, which carries extra overhead in both space and time. Therefore, using loc/iloc inside Python loops is slow and they should be replaced with at/iat, which execute faster. However, loc and iloc work fine outside of loops for vectorized operations.
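A minimal sketch of the comparison (the dataframe and timings below are illustrative; exact speedups depend on data size and hardware):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 3), columns=["a", "b", "c"])

def loop(accessor):
    start = time.perf_counter()
    total = 0.0
    for i in range(len(df)):
        total += accessor(i)            # scalar access, one cell per iteration
    return time.perf_counter() - start

t_loc = loop(lambda i: df.loc[i, "a"])  # label-based, set-oriented accessor
t_at = loop(lambda i: df.at[i, "a"])    # label-based, scalar-only accessor
print(f"loc: {t_loc:.2f}s  at: {t_at:.2f}s")

# vectorized access beats both and is the idiomatic choice outside loops
print(df["a"].sum())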
https://medium.com/codex/dont-use-loc-iloc-with-loops-in-python-instead-use-this-f9243289dde7
👍6
👍🏻PRegEx for regular expressions
Regular expressions for searching and replacing text in a string or across one or more files are actively used by developers and testers. However, reading and writing them is quite difficult. The open-source PRegEx (Programmable Regular Expressions) package simplifies working with regular expressions in Python: it has a simple syntax reminiscent of the imperative style of programming. With PRegEx, there is no need to group patterns or escape metacharacters, as this is handled inside the package.
Thanks to the modular way patterns are built, PRegEx lets you break a complex pattern into several simpler ones and then combine them. And a high-level API on top of Python's built-in re module provides access to its main functions and more, eliminating the need to work with re.Match instances.
Usage examples: https://towardsdatascience.com/pregex-write-human-readable-regular-expressions-in-python-9c87d1b1335
👍4
👌Low Code/No Code for data with Superblocks
The idea of rapid application development without writing code is widely used to automate office business processes with BPMS (ELMA, Camunda, etc.). Similar solutions are now appearing in data analytics, letting you assemble a working application from ready-made blocks, Lego-style. For example, Superblocks is a Low Code platform for integrating data and processing pipelines with the capabilities of a BI system.
How to build and deploy a reporting analytics web application using Superblocks and MongoDB in a couple of minutes, read here: https://yaakovbressler.medium.com/next-generation-data-dashboards-reimagined-meet-superblocks-2d316cb3597c
👍1
🤔What is sequential testing and why is it useful
A common problem with online A/B tests is the peeking problem: making early ship decisions as soon as statistically significant results are observed leads to inflated rates of false positives. This is due to the tension between two aspects of online experimentation:
• Live Metrics Updates - Modern online experimentation platforms use real-time data feeds and can display results immediately. These results can then be updated to reflect the most recent information as data collection continues.
• Limitations of the basic statistical test - Hypothesis testing typically uses a predetermined rate of false positives, the so-called alpha level, of 0.05 (5%). When the p-value is less than 0.05, it is common to reject the null hypothesis and attribute the observed effect to the experiment being tested. There is a 5% chance that a statistically significant result is actually just random noise.
However, continuously monitoring results while waiting for significance amplifies that 5% false positive rate. This is where sequential hypothesis testing becomes useful in online testing: a statistical analysis in which the sample size is not fixed in advance. Instead, data are evaluated as they are collected, and sampling stops according to a predetermined stopping rule as soon as meaningful results are observed. Sequential testing thus reduces the time and cost of the experiment by allowing conclusions to be drawn at an early stage of the study.
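A minimal simulation sketch of why peeking inflates false positives (A/A tests with no true effect; the sample sizes and number of interim looks are made up for illustration):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_max, checks = 1000, 2000, 20   # A/A tests, max sample per arm, interim looks

fp_fixed, fp_peeking = 0, 0
for _ in range(n_sims):
    a = rng.normal(size=n_max)
    b = rng.normal(size=n_max)           # no true effect: any "signal" is noise
    # single test at the planned end of the experiment
    if stats.ttest_ind(a, b).pvalue < 0.05:
        fp_fixed += 1
    # stop at the first interim look that crosses p < 0.05
    for n in np.linspace(100, n_max, checks, dtype=int):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            fp_peeking += 1
            break

print("fixed-horizon FPR:", fp_fixed / n_sims)    # close to 0.05
print("peeking FPR:", fp_peeking / n_sims)        # noticeably higher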
In sequential testing, the p-value calculation is modified to counteract the higher risk of false positives associated with peeking. The point is to allow early decision making without inflating false positives, by adjusting the significance threshold so as to effectively raise the bar on what counts as a statistically significant early result.
Part of designing an experiment involves setting a target duration in advance. This is the number of days needed to detect the desired effect size, assuming there is an effect. There are usually several metrics of interest with different variances and effect sizes that require different sample sizes and durations, so it is better to choose a duration that provides sufficient statistical power for all key metrics.
When looking at an experiment before the target duration is reached, the confidence intervals are widened to reflect the higher uncertainty at that point in time. If the adjusted confidence interval crosses zero, there is not yet enough data to make a decision based on this metric, even if the traditional p-value looks statistically significant. The adjustment shrinks as the experiment progresses and disappears when the target duration is reached.
https://blog.statsig.com/sequential-testing-on-statsig-a3b45dd8ab72
2👍1
#test
If the training dataset contains many outliers, which of the following ML methods is most suitable?
Anonymous Quiz
9%
neural network
22%
random forest with deep trees
23%
support vector machine
45%
random forest with shallow trees
👍4
🤔Limitations of Sequential Testing
Sequential hypothesis testing is a statistical analysis where the sample size is not fixed in advance, and the data are evaluated as they are collected. The analysis is terminated according to a predefined stopping rule as soon as significant results are observed. This allows you to draw conclusions at an earlier stage of the study than with classic fixed-horizon A/B testing, reducing the financial and time costs of the experiment.
To better understand the progress of sequential testing over the course of an experiment, it helps to think about the thresholds that determine whether an effect is significant, commonly called efficacy boundaries. When the Z-score computed for the metric delta is above the upper boundary, the effect is statistically significantly positive; conversely, a Z-score below the lower boundary is a statistically significant negative result. At the beginning of the experiment the boundaries are high: intuitively, a much higher significance threshold must be crossed to justify an early decision while the sample size is still small. The boundaries are adjusted every day, and at the end of the predetermined duration they reach the standard Z-score for the chosen significance level, for example 1.96 for two-tailed tests with 95% confidence intervals.
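A rough sketch of how such boundaries shrink over time. This uses an O'Brien-Fleming-style shape with the final z-value as the constant purely for illustration; real platforms (including the Statsig implementation linked below) calibrate the constant so the overall false positive rate stays at alpha, and their exact rule may differ:
import numpy as np
from scipy import stats

alpha = 0.05
z_final = stats.norm.ppf(1 - alpha / 2)   # 1.96 at the planned end of the experiment

# illustrative boundary: the |Z| threshold shrinks toward 1.96 as the
# information fraction t (share of the target duration elapsed) approaches 1
for day, total_days in [(1, 14), (3, 14), (7, 14), (14, 14)]:
    t = day / total_days
    boundary = z_final / np.sqrt(t)
    print(f"day {day:2d}: |Z| must exceed {boundary:.2f}")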
While "peeping" into A/B testing is frowned upon, early monitoring of tests is critical to getting the most out of your experimentation program. If an experiment results in a measurable regression, don't wait until the end to take action. With sequential testing, statistical noise can be distinguished from strong effects that are significant at an early stage. Sequential testing also comes in handy when there are opportunity costs to run the experiment throughout its duration. For example, there is a significant engineering or business cost to failing an improvement from a subset of users, or when the completion of an experiment paves the way for further testing.
However, before making an early decision, it's worth remembering that even if one metric has crossed the efficacy boundary, other metrics that appear neutral so far may turn out to be statistically significant by the end of the experiment. The efficacy boundary is useful for detecting statistical results early, but it does not distinguish between the absence of a true effect and insufficient power before the target duration is reached.
The weekly seasonality of the experiments should also be taken into account. In particular, even when all the metrics of interest look great early on, it is recommended to wait at least 7 full days before making a decision. This is because many metrics are affected by weekly seasonality, with product end users behaving differently depending on the day of the week.
Finally, if a good estimate of the effect size is important, it is better to run the experiment to the end: the adjusted confidence intervals of sequential testing are wider, so the range of likely values is larger when making an early decision, which means lower precision. Also, a larger measured effect is more likely to be statistically significant early on, even if the true effect is actually smaller. Regularly making early decisions based on positive statistical results can therefore lead to a systematic overestimation of the impact of experiments, which also reduces accuracy.
https://blog.statsig.com/sequential-testing-on-statsig-a3b45dd8ab72
👍1
🤔UDF in SQL: pros and cons
Data scientists and data analysts often write SQL queries that take a very long time to write. User Defined Functions (UDFs) can speed this up. A UDF is basically similar to a built-in SQL function, except that it is defined by the user: technically, a UDF is a function that takes a set of typed parameters, applies some transformation logic, and returns a typed value. UDFs have a wide range of uses, including factoring out repetitive code and centralizing business logic, which helps you write SQL queries more efficiently. UDFs have notable benefits, but also drawbacks. The main advantages of UDFs:
• they allow you to replace parts of complex or repetitive SQL code with one-line calls that make the code more readable. For example, a complex CASE statement spanning many lines can be reduced to a single line with a UDF.
• they encourage centralizing calculations in one place so they can be reused in new queries.
Thus, UDF speeds up code development, but it also has some disadvantages:
• Too many obscure UDFs can make code less readable, especially if they have opaque names like "func1": it's not clear that the function does a division, and the reader has to look up its definition, which takes more time than just reading the logic inline.
• You need to keep track of all the functions that get created, which is difficult when there are many of them. It is recommended to maintain a dictionary of the created UDFs and share it with the team. You can also write more generic functions to avoid rework.
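As a rough illustration of the upside, here is a minimal sketch using SQLite's ability to register a Python function as a UDF (the table, rates and function name are made up for illustration; the CREATE FUNCTION syntax of your warehouse will differ):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, currency TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10.0, "EUR"), (25.0, "USD"), (7.5, "EUR")])

# register a Python function as a SQL UDF: the conversion logic lives in one place
def to_usd(amount, currency):
    rates = {"USD": 1.0, "EUR": 1.05}   # hypothetical fixed rates
    return round(amount * rates[currency], 2)

conn.create_function("TO_USD", 2, to_usd)

# the repetitive CASE/arithmetic logic collapses into a single readable call
for row in conn.execute("SELECT currency, TO_USD(amount, currency) FROM orders"):
    print(row)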
https://towardsdatascience.com/save-time-writing-sql-with-udfs-24b002bf0192
👍1
#test
What is the difference between a population and a sample in statistics? The example "the top 100 search results for IT job advertisements in India on October 1, 2022" is:
Anonymous Quiz
14%
the population
73%
the sample
13%
the dataset
0%
the data point
👍1