Big Data Science – Telegram
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
✍🏻Math for the Data Scientist: another 3 Distance Measures, Part 2
Manhattan distance, also called taxicab or city block distance, calculates the distance between real-valued vectors. Intuitively, it is the distance between two points on a uniform grid when movement is allowed only at right angles: no diagonal moves are used when computing the distance. While Manhattan distance seems acceptable for multidimensional data, it is a less intuitive measure than Euclidean distance. It also tends to give higher values than Euclidean distance, since it does not follow the shortest possible path. However, if the dataset has discrete and/or binary attributes, Manhattan distance works well because it follows realistic paths within the possible values.
Chebyshev distance is defined as the greatest difference between two vectors along any coordinate dimension, i.e. simply the maximum distance along a single axis. It is also often called chessboard distance, since the minimum number of moves a king needs to go from one square to another equals the Chebyshev distance. This distance is usually applied in very specific use cases, which makes it hard to use as a universal distance measure, unlike Euclidean distance or cosine similarity. Chebyshev distance is therefore recommended only in particular settings: for example, to determine the minimum number of moves in games that allow unrestricted 8-directional movement. It is also often used in warehouse logistics, for example, to estimate the time an overhead crane needs to move an object.
Minkowski distance is a more general measure used in a normed vector space (n-dimensional real space), where distances can be represented as vectors with a length: there is a zero vector with zero length while all other vectors have positive length, a vector can be multiplied by a scalar, and the shortest distance between two points is a straight line. The parameter p controls which metric you get: p = 1 yields Manhattan distance, p = 2 Euclidean distance, and p = ∞ Chebyshev distance. Therefore, to work with Minkowski distance you need to understand the purpose, advantages and disadvantages of the Manhattan, Euclidean and Chebyshev measures. Finding the right value of p can be computationally expensive, but the flexibility it gives in choosing the distance metric can be a huge advantage if it is selected correctly.
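Below is a minimal sketch of how these measures relate, using SciPy's distance functions; the vectors are made up purely for illustration:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

print(distance.cityblock(a, b))       # Manhattan distance (p = 1)
print(distance.euclidean(a, b))       # Euclidean distance (p = 2)
print(distance.chebyshev(a, b))       # Chebyshev distance (p = infinity)
print(distance.minkowski(a, b, p=3))  # Minkowski distance with an arbitrary p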
📝Analyzing Time Series Data: 5 Tips for a Data Scientist
One of the most common mistakes beginners make when analyzing time series data is assuming that the data has regularly spaced points and no gaps. In practice this assumption usually does not hold and leads to incorrect results: in real datasets, data points are often missing, and the available ones are spaced unevenly or inconsistently. Therefore, before analyzing time series data, a preparation stage should be carried out:
Understand the time range and granularity of the time series by visualizing the data points;
Compare the actual number of ticks in each time series with the number expected given the interval between points and the total length of the series. The expected count is the difference between the maximum and minimum timestamp divided by the point spacing, and the ratio of actual to expected ticks is sometimes called the duty cycle. If this value is much less than 1, a lot of data is missing.
Filter out series with a low duty cycle by setting a threshold, for example 40%, or whatever is appropriate for the specific task.
Standardize the spacing between time series ticks by upsampling to a finer resolution.
Fill the gaps introduced by upsampling with an appropriate interpolation method, such as last known value or linear/quadratic interpolation. In Apache Spark, you can use the applyInPandas method on a grouped PySpark dataframe for this; under the hood it runs a pandas UDF, whose performance is much higher than plain UDFs thanks to efficient data transfer via Apache Arrow and vectorized Pandas computation.
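A hedged sketch of the last two steps in PySpark, assuming df is a Spark DataFrame; the column names (series_id, ts, value) and the 1-minute target resolution are illustrative assumptions, not part of the original article:

import pandas as pd

def resample_and_fill(pdf: pd.DataFrame) -> pd.DataFrame:
    # upsample one series to a regular 1-minute grid and fill the introduced gaps
    out = (pdf.set_index("ts")[["value"]].sort_index()
              .resample("1min").mean()
              .interpolate(method="linear")
              .reset_index())
    out.insert(0, "series_id", pdf["series_id"].iloc[0])
    return out

regular = df.groupBy("series_id").applyInPandas(
    resample_and_fill, schema="series_id string, ts timestamp, value double")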
https://towardsdatascience.com/a-common-mistake-to-avoid-when-working-with-time-series-data-eedf60a8b4c1
👻Google AI's SimVLM: Weakly Supervised Pre-Training of a Visual Language Model
Visual language modeling is about grounding language understanding in visual inputs, which is useful for building products and tools. For example, an image captioning model generates natural-language descriptions based on understanding the content of an image. Over the past few years, significant progress has been made in visual language modeling thanks to Vision-Language Pre-training (VLP).
This approach aims to learn a single feature space jointly from visual and language inputs. For this purpose, VLP often uses an object detector such as Faster R-CNN, trained on labeled object-detection datasets, to extract regions of interest, relying on task-specific approaches to jointly learn representations of images and texts. Such approaches require annotated datasets, or the time to label them, and are therefore less scalable.
To solve this problem, Google AI researchers propose a minimalistic and efficient VLP model called SimVLM (Simple Visual Language Model). SimVLM is trained end-to-end with a single objective, similar to language modeling, on a huge number of weakly aligned image-text pairs, i.e. the text paired with an image is not necessarily an accurate description of that image.
The simplicity of SimVLM enables efficient training on such a scalable dataset, helping the model achieve state-of-the-art performance across six vision-language benchmarks. In addition, SimVLM learns a unified multimodal representation that enables robust cross-modal transfer with no fine-tuning, or with fine-tuning on text-only data, including visual question answering, image captioning and multimodal translation.
Unlike BERT and other VLP methods with their pre-training procedures, SimVLM uses a sequence-to-sequence architecture and is trained with a single Prefix Language Modeling (PrefixLM) objective: the model receives the leading part of a sequence (the prefix) as input and predicts its continuation. For example, the sequence "a dog chasing a yellow ball" may be randomly truncated to the prefix "a dog chasing", and the model predicts its continuation. The concept of a prefix applies similarly to images: an image is split into a series of patches, a subset of which is fed sequentially into the model as input. For multimodal inputs (images and their captions), the prefix in SimVLM is the concatenation of the sequence of image patches and the prefix text sequence, both consumed by the encoder; the decoder then predicts the continuation of the text sequence.
This idea gives SimVLM flexibility and versatility in adapting to different task setups. The Transformer architecture, already proven in BERT and ViT, allows the model to accept raw images directly as input. SimVLM also applies a convolution stage built from the first three ResNet blocks to extract contextualized patches, which works better than the naive linear projection of the original ViT model.
The model is pretrained on large-scale image-text datasets. ALIGN, containing about 1.8 billion noisy image-text pairs, was used as the training dataset; for text data, the Colossal Clean Crawled Corpus (C4) of 800GB of web documents was used. Testing has shown the model to be successful even without supervised fine-tuning: SimVLM achieved captioning quality close to that of fully supervised methods.
https://ai.googleblog.com/2021/10/simvlm-simple-visual-language-model-pre.html
✈️Math for the Data Scientist: Measuring Distance, Part 3
• The Jaccard index, or Intersection over Union, is a measure of the similarity and diversity of sample sets: the size of the intersection divided by the size of the union. In practice, this is the number of objects shared by the sets divided by the total number of distinct objects across them. For example, if two sets share 1 entity and contain 5 distinct entities in total, the Jaccard index is 1/5 = 0.2. The main disadvantage of this measure is its sensitivity to dataset size: large samples can greatly increase the union while the intersection stays about the same, distorting the index. The Jaccard index is often used in applications with binary data. For example, when a DL model predicts image segments, the Jaccard index can be used to measure how closely the prediction matches the ground truth. The same measure applies to text similarity analysis, to measure how much word choice overlaps between documents, and to compare sets of patterns.
• The Sørensen-Dice index is very similar to the Jaccard index: it also measures the similarity and diversity of sample sets. While the two are calculated in a similar way, the Sørensen-Dice index is a little more intuitive because it can be read as the percentage of overlap between two sets, a value between 0 and 1. Like the Jaccard index, the Sørensen-Dice index overstates the importance of sets with few or no ground-truth positives, weighting each element inversely to the size of its sample. This measure is often used in image segmentation problems and in text similarity analysis.
• Haversine distance is the distance between two points on a sphere, given their longitude and latitude. It is similar to Euclidean distance in that it computes the shortest line between two points, except the points lie on a sphere. That is also the main drawback of this measure: perfect spheres do not exist in reality, so, for example, the unevenness of the planet's shape can distort calculations. Instead, you can use the Vincenty distance, which works with an ellipsoid rather than a sphere. Unsurprisingly, Haversine distance is often used in navigation, for example, to compute the distance between two countries for a flight between them. At short distances it brings little benefit, since the curvature has almost no effect there.
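A small illustration of these measures with made-up sets and coordinates (plain Python, no extra libraries):

a, b = {"cat", "dog", "bird"}, {"cat", "fish", "bird"}
inter, union = len(a & b), len(a | b)
jaccard = inter / union                    # 2 / 4 = 0.5
dice = 2 * inter / (len(a) + len(b))       # 4 / 6 ≈ 0.67

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):   # r = Earth radius in km
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(h))

print(haversine_km(52.52, 13.40, 48.86, 2.35))        # Berlin -> Paris, roughly 880 km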
📚Introduction to feature engineering for time series forecasting - an extract from the book "Machine Learning for Time Series Forecasting with Python" by Dr. Francesca Lazzeri, published by Wiley in December 2020

Developing ML models is often time-consuming and requires many factors to be considered: algorithm iteration, hyperparameter tuning, and feature engineering. With time series data these choices multiply further, since DS specialists also need to take trends, seasonality, holidays and external economic variables into account. Each ML algorithm expects its input data in a specific format, so time series datasets require cleaning and feature definition before modeling.
The peculiarity of time series analysis is that data points are tied to time. For example, you build the output of an ML model by defining the variable you want to predict (say, sales next Monday) and then use historical data and the feature set to create the input variables. This serves the following goals:
• creating the correct set of inputs for the ML algorithm: building input features from historical data and framing the dataset as a supervised learning problem;
• improving the performance of ML models: establishing a valid relationship between the input features and the target variable to be predicted.
The main categories of features useful for time series analysis are as follows:
Date-time features based on the timestamp of each observation, such as the hour, month, and day of the week; this also includes weekends, holidays, seasons, etc.
Lag features and window features - values at previous time steps, considered useful on the assumption that what happened in the past may influence or carry some internal information about the future. For example, it might be useful to create features from sales that occurred on previous days at 4:00 pm if you want to predict similar sales at the same time the next day.
Sliding (rolling) window statistics, which compute statistics over values in a given window of data: the sample itself plus a specified number of values before and after it.
Expanding window statistics, which include all previous data.
Illustrations and examples of Python are available here:
https://medium.com/data-science-at-microsoft/introduction-to-feature-engineering-for-time-series-forecasting-620aa55fcab0
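As a hedged complement to the article's examples, here is a minimal pandas sketch of the four feature categories above; the sales figures and dates are invented:

import pandas as pd

df = pd.DataFrame({"sales": [120, 135, 128, 150, 160]},
                  index=pd.date_range("2021-11-01", periods=5, freq="D"))

df["dayofweek"] = df.index.dayofweek                         # date-time feature
df["lag_1"] = df["sales"].shift(1)                           # lag feature: value one step back
df["rolling_mean_3"] = df["sales"].rolling(window=3).mean()  # sliding window statistic
df["expanding_mean"] = df["sales"].expanding().mean()        # expanding window statistic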
🙌🏻Simple combo for data analyst: 3 frameworks joining Python + spreadsheets
In practice, a data analyst works with datasets not only in Jupyter Notebook or Google Colab: sometimes you have to open spreadsheet files in Excel or Google Sheets. Hence the need to combine Python scripts with built-in spreadsheet tools. The following frameworks come in handy for this:
• XLWings is a Python package that comes preinstalled with Anaconda and is most often used to automate Excel workflows. It is similar to Openpyxl, but more robust and user-friendly. For example, you can write your own UDFs in Python to scrape web pages, run machine learning, or solve NLP tasks on data in a spreadsheet. https://www.xlwings.org/tutorials/
• Mito is a spreadsheet interface for Python, a spreadsheet within Jupyter that generates code. Mito supports basic operations such as merge, join, pivot, filtering, sorting, visualization, adding columns, spreadsheet formulas, etc. https://docs.trymito.io/
• Openpyxl is a Python package for reading from and writing to Excel files. For example, you can connect to a local Excel file and access a specific cell or group of cells, fetching the data into a DataFrame; after processing, you can write the data back to the Excel file. In practice, this package is widely used in the financial sector, where processing large datasets directly in Excel is too slow. https://foss.heptapod.net/openpyxl/openpyxl
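A hedged sketch of the Openpyxl workflow described above; the file name, sheet name and column names are hypothetical:

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook("report.xlsx")
ws = wb["Sheet1"]
print(ws["B2"].value)                  # access a single cell

rows = list(ws.values)                 # first row as header, the rest as data
df = pd.DataFrame(rows[1:], columns=rows[0])

df["total"] = df["price"] * df["qty"]  # some processing step
df.to_excel("report_processed.xlsx", index=False)  # write the result back to Excel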
🌏🍁TOP-15 most interesting DS conferences around the world in November 2021
1. 01-02 Nov – 15th International Conference on Big Data Analytics and Big Data Science https://waset.org/big-data-analytics-and-big-data-science-conference-in-november-2021-in-san-francisco
2. 2 Nov - OSA Con, Open Source Analytics Conference, a free virtual event https://altinity.com/osa-con-2021/
3. 2-3 Nov - Chief Data & Analytics Officer, Europe – annual meeting https://cdao-eu.coriniumintelligence.com/
4. 3-4 Nov – 3rd International Conference on Big Data Analytics and Data Science https://crgconferences.com/datascience/
5. 3-4 Nov - Ai4 2021 Enterprise Summit, Exploring AI Across Industry https://ai4.io/enterprise-ai/
6. 4-6 Nov - AAAI 2021 Fall Symposium on Science Guided Machine Learning https://sites.google.com/vt.edu/sgai-aaai-21
7. 8-11 Nov - NVIDIA GTC, the Conference for AI Innovators, Technologists, and Creators https://www.nvidia.com/gtc/
8. 8-19 Nov - PRODUCT LEADERSHIP FESTIVAL 2021 - Product, Design & Data https://www.productleadership.com/events/product-leadership-festival
9. 9 Nov - RE.WORK Women in AI https://www.re-work.co/events/women-in-ai-virtual-2021
10. 11-12 Nov – DACH AI, Data Analytics and Insights Summit https://berryprofessionals.com/ai-data-analytics-and-insights-summit-dach/
11. 15-18 Nov - Toronto Machine Learning Summit (TMLS) 2021 https://www.torontomachinelearning.com/
12. 15-17 Nov - Marketing Analytics and Data Science https://informaconnect.com/marketing-analytics-data-science
13. 16-18 Nov – Open Data Science Conference https://odsc.com/
14. 18 Nov - SAS Global Learning Conference https://www.sas.com/content/sascom/sas/events/global-learning-conference.html
15. 21-25 Nov – Data Science Conference Europe 2021 https://datasciconference.com/
👆🏻Not only CoPilot: Anomaly Detection in Software Development with ControlFlag by IntelLabs
Intel has offered its answer to GitHub Copilot (built on OpenAI's Codex). Meet ControlFlag, a self-supervised pattern-detection system that learns typical patterns in the control structures of high-level programming languages like C/C++ by mining them from GitHub repositories and similar sources. These patterns are then used to detect anomalies in user code. ControlFlag can catch typos, missed NULL checks, and more.
First introduced at the end of 2020, ControlFlag was open-sourced by Intel in 2021. Built on unsupervised learning and trained on more than a billion lines of code, the system finds errors with high accuracy and can adapt to an individual developer's style in order to distinguish anomalies from coding idiosyncrasies.
https://github.com/IntelLabs/control-flag
🚀Simple interpolation in Scipy instead of complex optimization
SciPy is a set of mathematical algorithms and helper functions built on top of the NumPy Python library. It adds many high-level commands and classes for manipulating and visualizing data, letting a DS specialist stay within a regular Python development environment instead of specialized math systems like MATLAB, IDL, Octave, R-Lab, or SciLab.
A typical example is interpolating experimental data in complex scientific or business research. Having obtained an interpolation function from SciPy, you can use it in further calculations. This is useful when additional data collection and experimentation is expensive or time-consuming, as in semiconductor development, chemical process optimization, production planning, etc.
Interpolation also helps when running simulations on datasets whose points are collected at large intervals. For example, you can create an interpolation function using linear, quadratic, or cubic splines and evaluate it on a dense mesh to approximate the results of an experiment or simulation.
This approach does not guarantee the best results in every situation, but it works for most real-life cases. Although interpolation assumes smoothness (continuity) of the underlying function, this assumption holds for most real-world functions, which are not too jumpy and are smooth enough for interpolation methods.
Scipy interpolation routines work in both 2D and 1D cases. For example, you can get a smooth interpolated 2D surface from sparse data using Scipy interpolation, creating a 4900-point matrix from 400 actual data points.
In SciPy, interpolation is handled by the scipy.interpolate package, which contains spline functions and classes, one-dimensional and multidimensional interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for the FITPACK and DFITPACK functions.
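A minimal sketch of 1D interpolation with scipy.interpolate on made-up "experimental" points; the function and grid sizes are illustrative:

import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0, 10, 11)                # sparse measurements
y = np.sin(x) + 0.1 * np.random.randn(11)

f = interp1d(x, y, kind="cubic")          # also "linear" or "quadratic"
x_dense = np.linspace(0, 10, 200)         # dense mesh for the simulation
y_dense = f(x_dense)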
https://towardsdatascience.com/optimizing-complex-simulations-use-scipy-interpolation-dc782c27dcd2
https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
✈️Optuna: How to Automate Hyperparameter Tuning
Tuning the hyperparameters of an ML model takes a lot of time and effort. To simplify this task, you can use special frameworks, one of which is Optuna. Launched in 2019, this platform has the following advantages:
• compatibility with PyTorch, Chainer, TensorFlow, Keras, MXNet, Scikit-Learn, XGBoost, LightGBM and other ML frameworks;
• search spaces defined in plain, Pythonic code familiar to any DS specialist, using conditional expressions and loops;
• the ability to handle continuous hyperparameter values by tuning alpha and lambda regularization to any floating point values within a given range;
• Bayesian sampling algorithms plus the ability to prune clearly unpromising regions of the given hyperparameter space, which speeds up optimization;
• parallelization of the search for hyperparameters over several threads or processes without changing the code;
• works faster than analogs (RandomSearch, GridSearch, hyperopt, scikit-optimize);
• detailed documentation.
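A toy sketch of the Optuna workflow; the parameter names and the objective are purely illustrative stand-ins for a real training run:

import optuna

def objective(trial):
    alpha = trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True)  # continuous, log scale
    beta = trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True)
    # here you would train a model with these values and return a validation metric
    return (alpha - 1.0) ** 2 + (beta - 0.5) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100, n_jobs=4)  # trials can run in parallel
print(study.best_params)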
https://optuna.org/
https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5
🦋Self-Supervised Reversible Reinforcement Learning: A New Approach from Google AI
Reinforcement learning (RL) is great at solving problems from scratch, but it is not easy to teach an agent to understand the reversibility of its actions. For example, robots should avoid actions that could damage them. Estimating the reversibility of an action requires practical knowledge and an understanding of the physics of the environment in which the RL agent operates. Therefore, Google AI researchers present at NeurIPS 2021 a new way to approximate the reversibility of RL agents' actions. The approach adds a separate reversibility-estimation component to reinforcement learning, trained in a self-supervised way from unlabeled data collected by agents. The model can be trained online (together with the RL agent) or offline (from a dataset of interactions) to guide RL policies towards reversible behavior. This can significantly improve the performance of RL agents across many tasks.
The reversibility component added to the RL procedure is learned from interactions and is a model that can be trained separately from the agent itself. It is trained in a self-supervised way and does not require labels indicating whether actions are reversible: the model learns which types of actions tend to be reversible from the context of the training data. It relies on event precedence, the probability that one event is observed before another, as a proxy for true reversibility, which can be learned from a dataset of interactions even without any reward for the RL agent.
This method lets RL agents predict the reversibility of an action by learning to model the temporal order of randomly sampled trajectory events, which results in better exploration and control. The method is self-supervised, i.e. it does not require prior knowledge about reversibility, making it suitable for a variety of environments.
https://ai.googleblog.com/2021/11/self-supervised-reversibility-aware.html
🧐AI predicts how old children are. How safe is it?
Today, most content is age-restricted, with a minimum age indicated for its intended consumer. Everyone knows the 18+ rating on films, and on most social networks users aged 13 and over can create accounts on their own. Unsurprisingly, facial recognition technologies are being used to determine the age of content consumers or users. Going further, AI can predict what a person will look like in old age: remember the recent FaceApp boom.
Likewise, Yoti's age-estimation solutions have a margin of error of less than 3 years across the 6-to-60 age range, and for users under 25 the error is below 1.5 years. In the coming weeks the technology will launch in major UK supermarket chains, for example, to prevent the sale of alcohol to minors. Yoti trained its neural networks on hundreds of thousands of images of people's faces from official documents (passports and driver's licenses). Given the high risk of leaking such confidential data, other players in the face recognition market argue that age verification should, where possible, be done without analyzing the face itself or other biometric data.
https://www.wired.com/story/ai-predicts-how-old-children-are/
😜ML for child protection
In the United States, one in seven children has been abused or neglected in the past year. US Child Protection Agencies receive several million reports of alleged neglect or abuse each year. Therefore, some agencies are implementing ML to help professionals review cases and determine which ones should be investigated next. But these ML models are useless if users don't understand and trust their results.
So researchers from MIT and other institutions have developed a visual analytics tool that uses bar charts to show how specific factors in a case affect the predicted risk that a child will be removed from the home over the next two years. This risk assessment is based on over 100 demographic and historical factors, including parental age and criminal record. After testing the tool, the developers concluded that the visualization and interpretability of forecasts need to be improved in order to avoid dangerous biases in high-stakes decisions.
https://news.mit.edu/2021/machine-learning-high-stakes-1028
📝👀Fastparquet: Reading Parquet Files with Python
Apache Parquet is a binary column-oriented storage format originally created for the Hadoop ecosystem. Thanks to its concise and efficient column-wise representation of data, it is very popular in the Big Data world. However, reading data in the Parquet format is not a trivial task. PySpark can handle it, of course, but not every Data Scientist works in Apache Spark. This is where fastparquet comes in: a Python implementation of the Parquet format used by Dask, Pandas, and others, delivering high performance with a small distribution size and a small codebase. Fastparquet depends on a set of Python libraries (numpy, pandas, cramjam, fsspec), so they should be installed beforehand.
After installation via the pip package manager (pip install fastparquet) or from GitHub (pip install git+https://github.com/dask/fastparquet), the contents of a Parquet file can easily be loaded into a dataframe in your usual DS IDE such as Jupyter Notebook:
from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

Or write a dataframe to a Parquet file, specifying row-group boundaries, the compression codec and the file scheme:
from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
compression='GZIP', file_scheme='hive')

https://github.com/dask/fastparquet
https://www.anaconda.com/blog/whats-new-with-fastparquet
https://blog.datasyndrome.com/using-the-python-ml-stack-inside-pyspark-de1223942c32
https://fastparquet.readthedocs.io/en/latest/
🌏Digital Twin of Earth from NVIDIA
To help prevent climate disasters, NVIDIA plans to build Earth-2, the world's most powerful AI supercomputer dedicated to predicting climate change. The system will create a digital twin of the Earth in Omniverse and will be a counterpart to Cambridge-1, the UK's most powerful AI supercomputer, built for healthcare research. By combining three technologies (GPU-accelerated computing, deep learning on neural networks, and AI supercomputers) with a wealth of observed and model data, scientists and engineers aim to achieve accurate simulations of the physical, biological and chemical processes on Earth. This should help produce early warnings for adapting and strengthening urban infrastructure so that people and countries can act quickly to prevent climate disasters.
https://blogs.nvidia.com/blog/2021/11/12/earth-2-supercomputer/
🕸🐅Over 50 New Graph Algorithms in TigerGraph: Fall 2021 Release
TigerGraph is a popular fast and scalable graph database with massively parallel processing and ACID transaction support. Thanks to efficient data compression and an MPP architecture, it can analyze huge amounts of information in real time, and its built-in query language GSQL is very similar to standard SQL, familiar to every analyst.
The October 2021 release includes 50+ new algorithms, for example graph embedding algorithms node2vec and FastRP, similarity algorithms (Nearest Neighbor Approximation, Euclidean Similarity, Overlap Similarity and Pearson Similarity), structural similarity algorithms for predicting topological relationships, and random-walk algorithms. In the first half of 2022 the developers promise to add neural networks and other ML methods for building analytical pipelines on graphs. Although TigerGraph is positioned as a powerful enterprise solution, the source code of the system is open and freely available on GitHub.
https://www.tigergraph.com/blogs/about-tigergraph/graph-data-science-library/
https://github.com/tigergraph
🙌🏻Principal Component Analysis: 7 Methods for Dimension Reduction in Scikit-Learn
One of the main problems of machine learning on large datasets is the huge size of the feature vectors, so dimensionality reduction methods that reduce the number of variables are very relevant. One such method is Principal Component Analysis (PCA), whose essence is to reduce the dimensionality of a dataset while retaining as much "variability", i.e. statistical information, as possible.
PCA is a statistical method that converts high-dimensional data into low-dimensional data by choosing the most important features, those that capture as much information about the dataset as possible. Features are ranked by the variance they explain in the output: the direction that explains the most variance is the first principal component, the one responsible for the second-largest variance is the second principal component, and so on. Importantly, the principal components are uncorrelated with one another. Besides speeding up ML algorithms, PCA lets you visualize data by projecting it into a lower dimension for display in 2D or 3D space.
The popular Python library Scikit-learn includes the sklearn.decomposition.PCA module, implemented as a transformer object that learns n components in its fit() method; it can then be used to project new data onto these components. Using PCA in Scikit-Learn takes two steps:
1. Initialize the PCA class, passing the required number of components to the constructor;
2. Call fit and then transform, passing in the feature set. The transform method returns the specified number of principal components.
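A minimal sketch of these two steps on the classic iris dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)             # step 1: choose the number of components
X_2d = pca.fit_transform(X)           # step 2: fit, then transform (combined here)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance explained by each component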
Scikit-learn supports several variations of the PCA method:
• Kernel Principal Component Analysis (KPCA) is a nonlinear dimensionality reduction method using kernels. Kernel PCA was developed to help classify data whose decision boundaries are described by a non-linear function; the idea is to map into a higher-dimensional space in which the decision boundary becomes linear. The sklearn.decomposition module offers different kernels: linear, polynomial (poly), Gaussian radial basis function (rbf), sigmoid, etc. The default is linear, which is suitable if the data is linearly separable.
• Sparse PCA - a sparse variant of PCA whose goal is to extract the set of sparse components that best reconstructs the data. Components extracted by ordinary PCA usually have extremely dense expressions, i.e. nonzero coefficients as linear combinations of the original variables, which makes results hard to interpret. In practice, real principal components can often be more naturally represented as sparse vectors; in face recognition, for example, they may correspond to parts of faces.
• Incremental Principal Component Analysis (IPCA) - an incremental variant of PCA for when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation of the input data using an amount of memory that does not depend on the number of input samples; it still depends on the number of input features, but adjusting the batch size lets you control memory usage.
• Fast Independent Component Analysis (FastICA) - used to estimate sources from noisy measurements and reconstruct them, since classic PCA does not work for non-Gaussian processes.
• Linear Discriminant Analysis (LDA) - like classical PCA, a linear transformation method, but while PCA is unsupervised, i.e. ignores class labels, LDA is a supervised method used to separate two or more classes or groups. LDA performs supervised dimensionality reduction by projecting the input data onto a linear subspace along the directions that maximize separation between classes; the output dimensionality is necessarily less than the number of classes. In scikit-learn, LDA is implemented as LinearDiscriminantAnalysis, where the n_components parameter specifies the number of components to return.
• Non-negative Matrix Factorization (NMF) - an alternative decomposition approach that assumes the data and components are non-negative. NMF is an unsupervised linear dimensionality reduction technique: the original data (feature matrix) is split into several matrices (i.e. factorized) representing hidden relationships between observations and their characteristics. NMF can be used instead of PCA when the data matrix contains no negative values. Unlike PCA and other methods, NMF does not report an explained variance, so the best way to find the optimal n_components is to try a range of values.
• Truncated Singular Value Decomposition (TSVD) - similar to PCA, performing linear dimensionality reduction via a truncated singular value decomposition. Unlike PCA, this estimator does not center the data before computing the decomposition, which lets it work efficiently with sparse matrices.
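For orientation, here is a hedged sketch showing how a few of these variants are instantiated in scikit-learn; the parameter values are illustrative, not recommendations:

from sklearn.decomposition import KernelPCA, IncrementalPCA, TruncatedSVD, NMF, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

kpca = KernelPCA(n_components=2, kernel="rbf")          # non-linear decision boundaries
ipca = IncrementalPCA(n_components=2, batch_size=200)   # datasets that do not fit in memory
ica = FastICA(n_components=2)                           # non-Gaussian source separation
lda = LinearDiscriminantAnalysis(n_components=2)        # supervised: fit(X, y) needs labels, n_components < number of classes
nmf = NMF(n_components=2, init="nndsvd")                # non-negative data only
tsvd = TruncatedSVD(n_components=2)                     # works directly on sparse matrices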
https://medium.com/@deepak.engg.phd/dimensionality-reduction-with-scikit-learn-ee5d2b69225b