🕸Analysis of the US Data Science job market in 2021: a Selenium web-scraping project on open vacancies, with visualized results and conclusions. The review also covers the popularity of programming languages and ML frameworks among US employers.
https://pub.towardsai.net/current-data-science-job-market-trend-analysis-future-4184f03a04ca
🤦🏼♀️Machine unlearning is a new challenge in ML
Sometimes ML algorithms have to forget what they have learned, because artificial intelligence can erode privacy. Regulators around the world have the power to compel companies to delete improperly obtained information, and EU and California citizens can require a company to delete their data. Recently, regulators in the US and Europe have said that owners of AI systems must sometimes go further and delete systems that were trained on sensitive data. In 2020, the UK data regulator warned companies that some ML models may be subject to GDPR rights because they contain personal data, and in early 2021 the FTC forced the facial-recognition startup Paravision to delete a collection of improperly obtained face photos along with the ML algorithms trained on them.
Thus, we come to a new area of DS called machine unlearning, which seeks ways to induce selective amnesia in AI: to remove all traces of a particular person or data point from an ML system without affecting its performance. Some studies have shown that under certain conditions it is possible to make ML algorithms forget something, but these methods are not yet ready for production use. Specifically, in 2019, scientists from the Universities of Toronto and Wisconsin-Madison proposed splitting the raw training data into multiple parts, each of which is processed separately before the results are combined into the final ML model. If you later need to forget one data point, you only need to reprocess the corresponding part of the original dataset. Testing has shown that the approach works with online shopping data and a collection of over a million photographs. However, the unlearning system can fail if deletion requests arrive in a particular sequence, and researchers are now looking for ways to solve this problem. Still, machine unlearning techniques are more a demonstration of technical acumen than a major shift in data protection. After all, even if machines learn to forget, users will have to remember who they are sharing their data with.
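The sharded approach described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the researchers' actual implementation: scikit-learn's LogisticRegression stands in for the per-shard models, and predictions are aggregated by majority vote:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Split the training data into shards and train one model per shard
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n_shards = 3
shards = np.array_split(np.arange(len(X)), n_shards)
models = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in shards]

def predict(X_new):
    # Aggregate the shard models' predictions by majority vote
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# To "forget" one sample, retrain only the shard that contained it
forget = 5
shard_id = next(k for k, idx in enumerate(shards) if forget in idx)
kept = shards[shard_id][shards[shard_id] != forget]
models[shard_id] = LogisticRegression(max_iter=1000).fit(X[kept], y[kept])
```

Retraining a single shard of 100 samples is far cheaper than retraining on all 300, which is the core of the efficiency argument.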
https://www.wired.com/story/machines-can-learn-can-they-unlearn/
Luxury EDA with Lux
Useful DS tools that will come in handy in your daily work. For example, Lux, a Python library that simplifies and accelerates data exploration by automating visualization and data analysis. For a dataframe in a Jupyter Notebook, Lux recommends a set of visualizations that highlight interesting trends and patterns in the dataset. The visualizations are displayed in an interactive widget that lets users quickly browse large collections of charts and understand the data. Deeply integrated with Pandas, Lux supports the library's geographic and temporal data types, as well as SQL queries against Postgres.
Lux consists of several layers, each with its own responsibilities:
• the user interface layer;
• the validation and parsing layer for user input;
• the intent processing, data execution, and, finally, analytics layers.
https://github.com/lux-org/lux
https://lux-api.readthedocs.io/en/latest/source/getting_started/overview.html
☂️Reverse ETL: what it is and how to use it
Reverse ETL is the process of copying data from a data warehouse into operational systems, including SaaS tools for marketing, sales, and support. This allows any team, from salespeople to engineers, to access the data they need in the systems they already use. There are 3 main use cases for reverse ETL:
• Operational analytics - surfacing insights to business teams in their usual workflows and tools so they can make data-driven decisions
• Data automation - automating ad hoc data requests from other teams, for example, when finance requests product usage data for billing
• Personalization - tailoring customer interactions across different applications
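The overall pattern can be sketched without any particular vendor. In this hypothetical illustration, sqlite stands in for the warehouse and a plain Python list stands in for the SaaS tool's API; everything except the pattern itself is a placeholder:

```python
import sqlite3

# A tiny in-memory "warehouse" with modeled product usage data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (account TEXT, events INTEGER)")
conn.executemany("INSERT INTO usage VALUES (?, ?)",
                 [("acme", 420), ("globex", 37)])

synced = []  # stands in for an operational tool (CRM, marketing platform)

def push_to_crm(record):
    # A real pipeline would call the SaaS tool's REST API here,
    # e.g. requests.post(crm_url, json=record)
    synced.append(record)

# The "reverse" step: modeled warehouse data flows out to the business tool
query = "SELECT account, events FROM usage ORDER BY account"
for account, events in conn.execute(query):
    push_to_crm({"account": account, "monthly_events": events})
```

The commercial tools below differ mainly in which warehouses and destination apps they connect and in how syncs are scheduled and monitored.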
The most popular reverse ETL tools today are:
• Hightouch is a data platform that lets you sync data from warehouses to CRM, marketing, and customer support tools https://hightouch.io/docs/
• Census is an operational analytics platform that syncs the data warehouse with different applications https://www.getcensus.com/
• Octolis is a cloud service that lets marketing and sales teams easily deploy use cases by activating their data in operational tools such as CRM or marketing automation software https://octolis.com/
• Grouparoo is an open-source reverse ETL tool that runs easily on a laptop or in the cloud, letting you develop locally, commit changes, and deploy https://www.grouparoo.com/docs/config
• Polytomic is an ETL solution that syncs the customer data you need into Marketo, Salesforce, HubSpot, and other business systems in real time, set up in a couple of minutes https://www.polytomic.com/
• RudderStack is a customer data platform for developers whose reverse ETL tools make it easy to deploy pipelines that collect customer data from every application, website, and SaaS platform and activate it in the data warehouse and operational systems https://rudderstack.com/
• Workato is a tool for automating business processes across cloud and on-premises applications https://www.workato.com/
• Omnata is a data integration tool for modern architectures https://omnata.com/
• Smart ETL Tool from Rivery is a platform for automating ETL processes with any cloud DBMS, including Redshift, Oracle, BigQuery, Azure, and Snowflake https://rivery.io/
How to improve data quality?
You can have perfect results at every stage of product promotion, but without quality data they will not be reliable or bring any useful outcome. What matters most about data is consistency, especially for product analytics: data quality depends heavily on it.
At Matemarketing, Vlad Kharitonov and Oleg Khomyuk will explain how to achieve consistency in all cases, including at scale. Their talk covers strict contract-based event categorization, versioning, cross-platform cases, and using legacy systems when scaling.
Matemarketing is the biggest conference on marketing and product analytics, monetization, and data-driven solutions in Russia and the CIS.
- - - -
✅ Matemarketing-21 will take place on November 18-19 in Moscow and will be available online.
↪️ The full program and all details are available on our website.
- - - -
And now we want to share Jordi Roura's (InfoTrust Barcelona) talk from Matemarketing. You will learn the theory of ensuring data quality and see examples of applying it in specific cases.
✍🏻5 Scikit-learn tips from a Data Scientist
1. Fill in missing values with IterativeImputer, which iteratively estimates and fills them in, improving the dataset with each iteration. The feature is still experimental, so first import enable_iterative_imputer from sklearn.experimental, then import IterativeImputer from sklearn.impute.
2. Generate random dummy data as a placeholder where real data should eventually go. Dummy data is used for testing, so it should behave predictably. Use make_classification() for a classification task or make_regression() for a regression task; you can set the number of samples and features to control the data's behavior during debugging and testing.
3. Save ML models for reuse without retraining. To serialize your algorithms and save them, the pickle and joblib Python libraries come in handy.
4. Plot a confusion matrix using the plot_confusion_matrix function (in newer scikit-learn versions it is replaced by ConfusionMatrixDisplay.from_estimator), which displays the true positive, false positive, false negative, and true negative counts.
5. Visualize decision trees using the tree.plot_tree function, which renders through matplotlib without installing extra dependencies. You can also save the tree as a PNG image.
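Tips 1, 2, and 3 can be sketched together; the arrays and file name below are arbitrary, and note that enable_iterative_imputer must be imported before IterativeImputer:

```python
import numpy as np
import joblib
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required first)
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Tip 1: iteratively fill in missing values
X_gaps = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_filled = IterativeImputer(random_state=0).fit_transform(X_gaps)

# Tip 2: generate dummy data with a controlled shape
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Tip 3: persist a trained model and reload it without retraining
model = LogisticRegression().fit(X, y)
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")
```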
https://www.educative.io/blog/scikit-learn-tricks-tips
📝Math for the Data Scientist: 3 Distance Measures, Part 1
• Euclidean Distance - the length of the straight line connecting two points. The most common measure, but it is not scale-invariant: computed distances can be skewed by the units of the features, so you should normalize the data before using it. Its usefulness also decreases as dimensionality grows, but it works great for low-dimensional data; kNN and HDBSCAN, for example, show good results with it. Finally, Euclidean distance is intuitive and easy to implement.
• Cosine Similarity - the cosine of the angle between two vectors. This method helps to eliminate the disadvantages of high-dimensional Euclidean distance. Two vectors with the same orientation have a cosine similarity of 1, and vectors that are diametrically opposed to each other have a similarity of -1. The magnitude of the vectors is irrelevant as this is a measure of orientation. Therefore, this measure is not very suitable for recommendation systems, because cosine similarity does not account for the difference in the rating scale between different users. Nevertheless, cosine similarity is useful when there is multidimensional data and the magnitude of the vectors does not matter, for example, for text analysis.
• Hamming distance - the number of positions at which two vectors differ. It is typically used to compare two binary strings of the same length, measuring their similarity by counting differing characters; it is hard to apply when the vectors have different lengths. A classic use is detecting or correcting errors in data transmission over computer networks, where the number of corrupted bits in a binary word serves as an estimate of the error. You can also use Hamming distance to measure the distance between categorical variables.
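All three measures are available in scipy.spatial.distance; here is a quick sketch with made-up vectors (note that SciPy's hamming returns the fraction of differing positions, so multiply by the length to get the count):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 1.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 0.0])

euclidean = distance.euclidean(a, b)        # straight-line length, sqrt(2)
cos_sim = 1 - distance.cosine(a, b)         # orientation only, here 1/sqrt(3)
hamming = distance.hamming(a, b) * len(a)   # 2 differing positions
```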
🚀FLAN by Google AI: generalizable Language Models with Instruction Fine-Tuning
For an ML model to generate meaningful text, it must hold a large amount of knowledge about the world and have the ability to abstract. While language models acquire this knowledge automatically as they scale, it remains an open question how to best surface that knowledge and apply it to specific real-world problems.
One recently popular technique for solving problems with language models is zero-shot or few-shot prompting: a problem is phrased as text resembling what the model may have seen during training, and the model answers by continuing the text. While this performs well on some tasks, it requires careful prompt engineering to make the tasks look like the training data, and it is not always intuitive in practice. For example, the creators of GPT-3 found that such prompting does not perform well on natural language inference (NLI) tasks.
Instead, FLAN fine-tunes the model on a wide variety of instructions that use a simple, intuitive description of the task, such as "Classify this movie review as positive or negative" or "Translate this sentence into Danish". Creating an instruction dataset from scratch would be resource-intensive, so templates are used to convert existing datasets into the instruction format. Experiments by Google AI researchers, comparing FLAN and GPT-3 on 25 tasks, have shown the success of this approach.
Notably, at small scale the FLAN method actually degrades performance; only at larger scale does the model become able to generalize from the instructions in the training data to unseen tasks. This is because small models do not have enough parameters to handle a large number of tasks.
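The template idea can be illustrated with plain string formatting; the NLI example and template wordings below are invented for illustration, not FLAN's actual templates:

```python
# One labeled example from an existing (hypothetical) NLI dataset
example = {"premise": "The dog is sleeping.",
           "hypothesis": "The dog is awake.",
           "label": "contradiction"}

# Several natural-language templates phrase the same task as instructions
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    'Read this: {premise}\nCan we conclude that "{hypothesis}"?',
]

# Each template turns the example into one instruction-tuning record
instruction_examples = [
    {"input": t.format(**example), "target": example["label"]}
    for t in templates
]
```

Applying such templates across many existing datasets yields a large, varied instruction-tuning mixture without new annotation work.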
https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html
🚀News from DeepMind AI: Enformer Architecture for Genetic Research
The Enformer architecture, powered by Transformers, advances genetic research to accurately predict how DNA sequence affects gene expression. In early October 2021, Nature Methods published an article by DeepMind and Calico researchers about the new Enformer neural network architecture, which greatly improves the accuracy of predicting gene expression from a DNA sequence. The developers have made this model and its initial predictions of common genetic variants publicly available.
Enformer builds on the transformers common in natural language processing, using self-attention mechanisms to cover much more of the DNA context. By efficiently processing sequences to account for interactions at distances more than 5 times longer than previous methods (up to 200,000 base pairs), the new architecture can model the influence of important regulatory elements in the DNA sequence on gene expression.
AI can be used to explore new possibilities for finding patterns in the genome and to put forward mechanistic hypotheses about sequence changes. Like a spell checker, Enformer partially understands the vocabulary of DNA sequences and can highlight the edits that could alter gene expression.
The main application of the new model is predicting which changes to DNA letters, also called genetic variants, will affect gene expression. Compared to previous models, Enformer is much more accurate at predicting the effect of variants on gene expression, both for natural genetic variants and for synthetic variants that alter important regulatory sequences. This property is useful for interpreting the growing number of disease-associated variants found in genome-wide association studies. Variants associated with complex genetic diseases are predominantly located in the non-coding region of the genome and likely cause disease by altering gene expression. But because of intrinsic correlations between variants, many of these disease-associated variants are merely spuriously correlated rather than causal. Computational tools help distinguish true associations from false positives.
https://deepmind.com/blog/article/enformer
https://www.nature.com/articles/s41592-021-01252-x
https://github.com/deepmind/deepmind-research/tree/master/enformer
✍🏻Math for the Data Scientist: another 3 Distance Measures, Part 2
• Manhattan Distance, also called taxicab or city-block distance, measures the distance between real-valued vectors as if movement were restricted to a uniform grid at right angles; no diagonal moves are counted. While Manhattan distance is acceptable for high-dimensional data, it is less intuitive than Euclidean distance and tends to give larger values, since it is not the shortest possible path. However, if the dataset has discrete and/or binary attributes, Manhattan distance works well because it follows paths that are actually realizable within those values.
• Chebyshev distance is the greatest difference between two vectors along any coordinate dimension: simply the maximum distance along a single axis. It is often called chessboard distance, since the minimum number of moves a king needs to go from one square to another equals the Chebyshev distance. It suits very specific use cases, which makes it hard to apply as a universal distance measure, unlike Euclidean distance or cosine similarity. It is therefore recommended only in certain situations: for example, determining the minimum number of moves in games that allow unrestricted 8-directional movement, or in warehouse logistics, such as estimating the time an overhead crane needs to move an object.
• Minkowski distance is a more general measure defined on a normed vector space (n-dimensional real space), where distances behave like vector lengths: the zero vector has length zero and all others have positive length, a vector can be multiplied by a scalar coefficient, and the shortest distance between two points is a straight line. The parameter p controls which metric you get: p = 1 gives Manhattan distance, p = 2 Euclidean, and p = ∞ Chebyshev. To work with Minkowski distance, you therefore need to understand the purpose, advantages, and disadvantages of the Manhattan, Euclidean, and Chebyshev measures. Searching for the right value of p can be computationally expensive, but the flexibility it gives the distance metric can be a huge advantage when p is chosen well.
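A quick sketch with made-up 2-D points, showing via scipy.spatial.distance that Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean measures:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 4.0])
b = np.array([4.0, 0.0])

manhattan = distance.cityblock(a, b)          # |1-4| + |4-0| = 7
chebyshev = distance.chebyshev(a, b)          # max(3, 4) = 4
minkowski_p1 = distance.minkowski(a, b, p=1)  # equals the Manhattan distance
minkowski_p2 = distance.minkowski(a, b, p=2)  # Euclidean: sqrt(9 + 16) = 5
```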
📝Analyzing Time Series Data: 5 Tips for a Data Scientist
One of the most common mistakes beginners make when analyzing time series data is assuming that the points are regularly spaced and contain no gaps. In real datasets this assumption usually does not hold, and it leads to incorrect results: data points are often missing, and the available ones are spaced unevenly or inconsistently. Therefore, before analyzing time series data, carry out a preliminary preparation stage:
• Understand the time range and granularity of the time series' data points using dataset visualization;
• Compare the actual number of ticks in each time series with the number expected given the interval between points and the total span of the series. This ratio is sometimes called the duty cycle: the actual number of points divided by the expected number, where the expected number follows from the difference between the maximum and minimum timestamps divided by the point spacing. If this value is much less than 1, a lot of data is missing.
• Filter out series with a low duty cycle by setting a threshold, for example 40%, or whatever suits the specific task.
• Standardize the spacing between time series points by upsampling to a finer resolution.
• Fill the upsampled gaps using an appropriate interpolation method, such as last known value or linear/quadratic interpolation. In Apache Spark, you can use the applyInPandas method on a grouped PySpark DataFrame for this; it is backed by a pandas UDF, which performs much better than plain UDFs thanks to more efficient data transfer via Apache Arrow and vectorized computation in Pandas.
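The checks above can be sketched in plain Pandas, a single-machine analogue of the Spark approach (the timestamps and the 40% threshold are illustrative):

```python
import pandas as pd

# Hypothetical daily series with missing days (timestamps are illustrative).
ts = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
)

# Duty cycle: actual ticks vs. ticks expected for the chosen spacing.
spacing = pd.Timedelta("1D")
expected_ticks = (ts.index.max() - ts.index.min()) / spacing + 1  # 5 expected days
duty_cycle = len(ts) / expected_ticks                             # 3 / 5 = 0.6

if duty_cycle >= 0.4:  # keep only sufficiently dense series
    # Upsample to a regular daily grid and fill the gaps linearly.
    regular = ts.resample("1D").mean().interpolate(method="linear")
```

In Spark, the same per-series logic would go inside the function passed to `applyInPandas`, one group per time series.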
https://towardsdatascience.com/a-common-mistake-to-avoid-when-working-with-time-series-data-eedf60a8b4c1
👻Google AI's SimVLM: Weakly Supervised Pre-training of a Visual Language Model
Visual language modeling involves understanding language grounded in visual inputs, which is useful for building products and tools. For example, an image captioning model generates natural language descriptions based on understanding the content of an image. Over the past few years, significant progress has been made in visual language modeling thanks to Vision-Language Pre-training (VLP).
This approach aims to learn a single feature space from both visual and language inputs. For this, VLP often uses an object detector such as Faster R-CNN, trained on labeled object datasets, to extract regions of interest, relying on task-specific approaches while jointly learning representations of images and texts. Such approaches require annotated datasets, or the time to label them, and are therefore less scalable.
To solve this problem, Google AI researchers propose a minimalist and efficient VLP called SimVLM (Simple Visual Language Model). SimVLM is trained end to end with a single objective, similar to language modeling, on a huge number of weakly aligned image-text pairs, i.e. the text paired with an image is not necessarily an accurate description of that image.
The simplicity of SimVLM enables efficient training on such a scalable dataset, helping the model achieve state-of-the-art performance on six vision-language benchmarks. In addition, SimVLM learns a unified multimodal representation that enables robust cross-modal transfer with no fine-tuning, or with fine-tuning on text data only, including visual question answering, image captioning, and multimodal translation.
Unlike BERT and other VLP methods that apply multiple pre-training procedures, SimVLM adopts a sequence-to-sequence framework and is trained with a single prefix language modeling (PrefixLM) objective, which receives the leading part of a sequence (the prefix) as input and predicts its continuation. For example, the sequence "a dog is chasing a yellow ball" is randomly truncated to the prefix "a dog is chasing", and the model predicts the continuation. The concept of a prefix applies similarly to images: an image is divided into a series of patches, a subset of which is fed sequentially into the model as input. For multimodal inputs (images and their captions), the prefix in SimVLM is a concatenation of the sequence of image patches and the prefix text sequence, which is received by the encoder; the decoder then predicts the continuation of the text sequence.
This design gives SimVLM maximal flexibility and versatility in adapting to different task settings. The transformer architecture, proven in BERT and ViT, lets the model accept raw images directly as input, and SimVLM additionally applies a convolution stage from the first three ResNet blocks to extract contextualized patches, which works better than the naive linear projection of the original ViT model.
The model is pretrained on large-scale image and text datasets. ALIGN, containing about 1.8 billion noisy image-text pairs, served as the training dataset; for text data, the Colossal Clean Crawled Corpus (C4) of 800 GB of web documents was used. Testing showed the model to be successful even without supervised fine-tuning: SimVLM achieved captioning quality close to the results of fully supervised methods.
https://ai.googleblog.com/2021/10/simvlm-simple-visual-language-model-pre.html
✈️Math for the Data Scientist: Measuring Distance, Part 3
• The Jaccard index, or Intersection over Union, measures the similarity and diversity of sample sets: the size of their intersection divided by the size of their union. In practice, this is the number of objects shared between the sets divided by the total number of distinct objects. For example, if two sets share 1 entity out of 5 in their union, the Jaccard index is 1/5 = 0.2. The main disadvantage of this measure is its dependence on dataset size: as samples grow, the union can increase significantly while the intersection stays similar, distorting the index. The Jaccard index is often used in applications with binary data. For example, when a DL model predicts image segments, the Jaccard index can measure how closely the prediction matches the ground truth. The same measure applies to text similarity analysis, to measure word overlap between documents, and to comparing sets of patterns.
• The Sørensen-Dice index is very similar to the Jaccard index: it also measures the similarity and diversity of sample sets. While they are calculated in similar ways, the Sørensen-Dice index is a little more intuitive, because it can be read as the percentage of overlap between two sets, a value between 0 and 1. Like the Jaccard index, the Sørensen-Dice index can overstate the importance of sets with few or no true positives, since it weights each element in inverse proportion to the size of its set. This measure is often used in image segmentation problems and in text similarity analysis.
• Haversine distance is the distance between two points on a sphere, given their longitude and latitude. It is similar to Euclidean distance in that it computes the shortest line between two points, only on a sphere, and that is also its main drawback: ideal spheres do not exist in reality. For example, the unevenness of the planet's surface can distort calculations. Instead, you can use Vincenty distance, which works with an ellipsoid rather than a sphere. Unsurprisingly, Haversine distance is often used in navigation, for example, to calculate the distance between two countries along a flight path. There is little point in applying it at short distances, where the curvature has almost no effect.
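A minimal, self-contained sketch of the three measures (the coordinates in the last comment are approximate city centers, used purely for illustration):

```python
import math

def jaccard(a: set, b: set) -> float:
    """Intersection over union of two sets."""
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Sørensen-Dice: twice the overlap over the total size of both sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

# jaccard({1, 2, 3}, {2, 3, 4})  -> 0.5   (2 shared out of 4 in the union)
# dice({1, 2, 3}, {2, 3, 4})     -> 0.666…
# haversine_km(48.86, 2.35, 51.51, -0.13) -> roughly 344 km (Paris to London)
```

The same pair of sets scores higher under Dice than under Jaccard, which illustrates why the two indices are not interchangeable even though they measure the same overlap.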
📚Introduction to feature engineering for time series forecasting: an extract from Dr. Francesca Lazzeri's book "Machine Learning for Time Series Forecasting with Python", published by Wiley in December 2020
Developing ML models is often time-consuming and requires weighing many factors: algorithm iteration, hyperparameter tuning, and feature engineering. These considerations multiply for time series data, where DS specialists also need to account for trends, seasonality, holidays, and external economic variables. Every ML algorithm expects its input data in a particular format, so time series datasets require pre-cleaning and feature definition before modeling.
The peculiarity of time series analysis is that data points are bound to time. For example, you define the output of the ML model by choosing the variable you want to predict (say, sales next Monday), and then use historical data and the feature set to create the input variables. This serves two goals:
• creating the correct input dataset for the ML algorithm: building input features from historical data and framing the dataset as a supervised learning problem;
• improving the performance of ML models: establishing a valid relationship between the input features and the target variable to be predicted.
The main categories of features useful for time series analysis are as follows:
• Date-time features based on the timestamp of each observation, such as the hour, month, and day of the week. This also includes weekends and holidays, seasons, etc.
• Lag features and window features: values at previous time steps, considered useful on the assumption that what happened in the past may influence, or contain some internal information about, the future. For example, it might be useful to build features from the sales that occurred at 4:00 pm on previous days if you want to predict similar sales at the same time the next day.
• Rolling (sliding) window statistics, which compute statistics over values in a given window of the data: a range that includes the sample itself plus a specified number of values before and after it.
• Expanding window statistics, which include all previous data.
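These four feature categories can be sketched in a few lines of Pandas (the sales values and dates are made up for illustration):

```python
import pandas as pd

# Illustrative daily sales series (values are made up).
sales = pd.DataFrame(
    {"sales": [10.0, 12.0, 9.0, 15.0, 11.0]},
    index=pd.date_range("2021-01-04", periods=5, freq="D"),  # starts on a Monday
)

feats = sales.copy()
feats["dayofweek"] = feats.index.dayofweek                 # date-time feature
feats["lag_1"] = feats["sales"].shift(1)                   # lag feature
feats["roll_mean_3"] = feats["sales"].rolling(3).mean()    # rolling window statistic
feats["expand_mean"] = feats["sales"].expanding().mean()   # expanding window statistic
```

Note that `shift` and `rolling` leave NaN values at the start of the series, which must be dropped or imputed before training.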
Illustrations and Python examples are available here:
https://medium.com/data-science-at-microsoft/introduction-to-feature-engineering-for-time-series-forecasting-620aa55fcab0
🙌🏻A simple combo for the data analyst: 3 frameworks that join Python and spreadsheets
In practice, a data analyst works with datasets not only in Jupyter Notebook or Google Colab; sometimes you have to open Excel and Google Sheets spreadsheet files. Hence the need to combine Python scripts with built-in spreadsheet tools. The following frameworks come in handy:
• XLWings is a Python package that comes preinstalled with Anaconda and is most often used to automate Excel processes. It is similar to Openpyxl, but more robust and user-friendly. For example, you can write your own UDFs in Python to scrape web pages, run machine learning, or solve NLP tasks on spreadsheet data. https://www.xlwings.org/tutorials/
• Mito is a spreadsheet interface for Python: a spreadsheet inside Jupyter that generates code. Mito supports basic operations such as merge, join, pivot, filtering, sorting, visualization, adding columns, and spreadsheet formulas. https://docs.trymito.io/
• Openpyxl is a Python package for reading from and writing to Excel files. For example, you can connect to a local Excel file, access a specific cell or group of cells, and fetch the data into a DataFrame; after processing, you can write the data back to the Excel file. In practice, this package is used most often in the financial sector, since processing large datasets in Excel itself is too slow. https://foss.heptapod.net/openpyxl/openpyxl
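A minimal Openpyxl round trip, assuming the package is installed (the file name and sheet contents are illustrative):

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Write a tiny sheet to a temporary file (name and contents are illustrative).
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb = Workbook()
ws = wb.active
ws.append(["product", "qty"])  # header row
ws.append(["apples", 3])       # data row
wb.save(path)

# Read the file back and access a specific cell by its address.
qty = load_workbook(path).active["B2"].value  # -> 3
```

For bulk work, `pandas.read_excel(path)` (which uses openpyxl under the hood for .xlsx) pulls the whole sheet into a DataFrame in one call.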