Forwarded from Big Data Science [RU]
Reminder!
Today, August 12, at 18:00 we're holding the first meetup in the Citymobil Data Meet-up series!
We'll talk about logistics, urban data and smart-city technologies, and discuss the role of geodata and the problems that come with it.
Join us and let's figure it out together))
Speakers:
- Artem Soloukhin (Citymobil)
- Andrey Kritilin (CIAN)
- Fyodor Lavrentyev (Yandex Go)
Don't forget to prepare questions for the speakers: after the talks there will be a discussion and you'll have a chance to ask them 🙂
Link: https://tulu.la/chat/city-mobil-00002d/meetup-0002fv
How to tune hyperparameters to reliably improve ML model accuracy: a detailed guide
An ML model and its preprocessing are specific to each project: the best hyperparameters depend on the data. In logistic regression, for example, different combinations of the solver, C and penalty hyperparameters give different results; a support vector machine likewise has tunable parameters such as gamma and C. These hyperparameters are documented on the site of Sklearn, the free Python library. Often, however, a developer has to build their own search instead of relying on ready-made recommendations in order to find the combination of hyperparameters that yields the highest accuracy. The article covers testing grid-search combinations with and without Sklearn, validating the results with cross-validation, and conclusions about CPU utilization efficiency. https://towardsdatascience.com/evaluating-all-possible-combinations-of-hyperparameter
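For a rough idea of what such a search looks like in code, here is a minimal sketch with scikit-learn's GridSearchCV and logistic regression; the parameter grid and dataset are illustrative, not taken from the article.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid: every combination of solver, penalty and C is evaluated
param_grid = {
    "solver": ["liblinear"],      # liblinear supports both l1 and l2 penalties
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,        # 5-fold cross-validation to check each combination
    n_jobs=-1,   # use all CPU cores, since the article also discusses CPU utilization
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```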
✍🏻SoundStream: An End-to-End Neural Audio Codec by Google AI
SoundStream is the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs.
The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network. Evaluate SoundStream and learn more about it here
https://ai.googleblog.com/2021/08/soundstream-end-to-end-neural-audio.html
✈️New MIT algorithm for flying drones around obstacles at high speed
Aerospace engineers at MIT have devised an algorithm that helps drones find the fastest route around obstacles without crashing. The new algorithm combines simulations of a drone flying through a virtual obstacle course with data from experiments of a real drone flying through the same course in a physical space.
The researchers found that a drone trained with their algorithm flew through a simple obstacle course up to 20 percent faster than a drone trained on conventional planning algorithms. Interestingly, the new algorithm didn’t always keep a drone ahead of its competitor throughout the course. In some cases, it chose to slow a drone down to handle a tricky curve, or save its energy in order to speed up and ultimately overtake its rival.
https://news.mit.edu/2021/drones-speed-route-system-0810
🏸FastMoE: A Fast Mixture-of-Expert Training System
Mixture-of-Experts (MoE) shows strong potential for scaling language models to trillions of parameters. However, training trillion-scale MoE requires algorithm and systems co-design for a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements depends heavily on Google's hardware (TPU) and software (Mesh TensorFlow) stack and is not open or available to the public, especially to the GPU and PyTorch communities.
FastMoE is a distributed open-source MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a naive implementation of MoE models directly in PyTorch, FastMoE highly optimizes training speed with sophisticated high-performance acceleration techniques. The system supports placing different experts on multiple GPUs across multiple nodes, so the number of experts can grow linearly with the number of GPUs.
https://github.com/laekov/fastmoe
https://arxiv.org/abs/2103.13262
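FastMoE's own API is not reproduced here; as a rough illustration of the mixture-of-experts idea it implements, below is a minimal, framework-agnostic gating sketch in plain PyTorch. The expert count, hidden sizes and top-1 routing are illustrative assumptions, not FastMoE internals.
```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy top-1 mixture-of-experts layer: a gate routes each token to one expert."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)  # routing probabilities per token
        top1 = scores.argmax(dim=-1)           # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # weight each token's output by its gate probability
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```
In FastMoE the same routing idea is distributed, with experts placed on different GPUs and nodes.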
🔥Not only GPT-3: what is GPT-J-6B
OpenAI's powerful GPT-3 NLP model is not an open-source project, so other groups offer alternative solutions. The most interesting of them right now is GPT-J from EleutherAI, with 6 billion parameters. The developers promise that GPT-J delivers more flexible and faster inference than TensorFlow + TPU counterparts on various downstream tasks.
https://6b.eleuther.ai/
https://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb
https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
https://minimaxir.com/2021/06/gpt-j-6b/
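A minimal sketch of trying the model locally, assuming the checkpoint is published on the Hugging Face hub under the id EleutherAI/gpt-j-6B; this uses the generic transformers API, needs a machine with a lot of RAM, and is not EleutherAI's own JAX inference code.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # assumed hub id; check the GitHub README for the current name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Data science is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```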
🌸News from MIT: A New AI-Powered Probabilistic Programming Language
It can impartially assess the "fairness" of AI algorithms more accurately and faster than existing alternatives. The Sum-Product Probabilistic Language (SPPL) is a probabilistic programming system, part of a new area at the intersection of programming languages and AI that simplifies building AI solutions with probabilistic models and inference over observed data.
SPPL offers improved flexibility and robustness through the expressiveness of the language, its precise and simple semantics, and the speed and reliability of its exact symbolic inference engine. It avoids pitfalls by restricting itself to a carefully designed class of AI models, including decision-tree classifiers. SPPL works by compiling probabilistic programs into a specialized data structure called a sum-product expression. This approach cannot analyze neural networks, but it runs faster than comparable solutions. SPPL is a Python-based open-source project.
https://news.mit.edu/2021/exact-symbolic-artificial-intelligence-faster-better-assessment-ai-fairness-0809
https://github.com/probcomp/sppl
👻What is AIOps and how it differs from MLOps
MLOps is an interdisciplinary approach to managing machine learning methods as standalone products with their own life cycle, with a focus on developing, scaling, and applying ML algorithms on an ongoing basis.
MLOps aims to bridge the gap between creating ML models and maintaining them, while AIOps focuses on automating incident management and intelligent root cause analysis.
AIOps solutions use all tracking and reporting data and logs to detect events and apply machine learning and deep learning to notify IT operations of any issues or disruptions.
The goal of AIOps is to improve the efficiency of IT operations by automating the diagnosis of incidents and using machine learning to pinpoint root causes. These tools give technical teams high-quality, easy-to-understand data by filtering out the noise produced by monitoring technologies and reducing false positives, which lets the teams focus on decision making. AIOps goes beyond preventing downtime to include cost containment, security, and AI-powered policy compliance to improve IT operations.
MLOps helps teams choose which tools, methodologies, and documentation will help their ML models go into production, and AIOps helps teams automate their technology lifecycles.
The greatest effect is provided by the combined use of MLOps and AIOps.
https://ai.plainenglish.io/whats-the-difference-between-aiops-and-mlops-15316cfa803d
👆🏻BYOL - Bootstrap Your Own Latent
BYOL is a new approach to self-supervised learning of image representations with two neural networks that interact and learn from each other. The online network learns from the representation produced by the target network for the same image under different augmentations. The backbone of BYOL is an existing architecture such as ResNet-50. The input x is augmented into two views t and t', which are passed through the online and target networks separately.
The online network differs from the target network in that it adds an MLP head with two fully connected layers, with ReLU and batch norm in between. The online network's representation learns from the representation generated by the target network: the online network is updated with a regression loss whose targets are set by the target network, while the parameters of the target network are updated as an exponential moving average of the online network, which lets the model absorb more information and avoid representation collapse.
BYOL's performance is on par with state-of-the-art supervised learning architectures. There is a slight performance drop when only random cropping is used for augmentation, but BYOL outperforms SimCLR under the linear-classifier evaluation protocol by iteratively learning from previous versions of its own output without using negative pairs. However, the BYOL approach has not yet been applied to text, video, or audio tasks.
https://www.youtube.com/watch?v=YPfUiOMYOEE
https://ai.plainenglish.io/byol-bootstrap-your-own-latent-dacee62a3dc8
https://arxiv.org/abs/2006.07733
https://arxiv.org/abs/2010.10241
https://github.com/lucidrains/byol-pytorch
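To make the "exponential moving average of the online network" step concrete, here is a minimal sketch of how such a target-network update is typically written in PyTorch. The momentum value and the toy encoders are illustrative; this is not the authors' code or the lucidrains implementation linked above.
```python
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
target = copy.deepcopy(online)           # target starts as a copy of the online network
for p in target.parameters():
    p.requires_grad = False              # the target is never updated by gradients

@torch.no_grad()
def ema_update(online_net, target_net, tau=0.996):
    """Move each target parameter a small step toward the online parameter."""
    for p_o, p_t in zip(online_net.parameters(), target_net.parameters()):
        p_t.data = tau * p_t.data + (1.0 - tau) * p_o.data

# call after every optimizer step on the online network:
ema_update(online, target)
```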
💐TOP-15 most interesting DS conferences around the world in September 2021
- 6-7.09 (offline) and 13-15.09 (online) - AI & Big Data Expo Global, the leading Artificial Intelligence & Big Data Conference & Exhibition, at the Business Design Centre, London https://www.ai-expo.net/global/
- 9-10.09 – R Conference, New York, Online https://rstats.ai/nyr/
- 13-17.09 – Data Science Salon Miami Machine Learning & AI Meetup Week. Miami, FL, USA https://www.datascience.salon/miami-ml-meetup-week
- 14-16.09 - Insurance AI and Innovative Tech USA 2021 – Online Conference by Reuters https://reutersevents.com/events/analyticsusa/
- 15-16.09 - DATA festival #online https://datafestival.de/
- 15-16.09 - Open Data Science Conference, Online https://odsc.com/apac
- 20.09 – 1st Citizen Data Science Summit, Boston https://www.citizen-data-science.org/
- 20-21.09 - International Conference on Advances in Big Data and Data Sciences, Toronto, Canada https://waset.org/advances-in-big-data-and-data-sciences-conference-in-september-2021-in-toronto
- 21.09 – Data Champions Online, Canada https://dco-canada.coriniumintelligence.com/
- 22-23.09 - Big Data LDN, UK largest data & analytics event, Olympia London, UK https://bigdataldn.com/
- 22-23.09 - RE.WORK Deep Learning Summit https://www.re-work.co/events/deep-learning-summit-research and https://www.re-work.co/events/deep-learning-summit-applications
- 28-29.09 – Chief Data & Analytics Officer, Financial Services, Online https://cdao-fs-eu.coriniumintelligence.com/
- 28-30.09 – DataOps Summit Online https://www.dataopssummit-sf.com/about/
- 30.09 - Web Data Extraction Summit 2021 by Zyte https://www.extractsummit.io/
🏸What is AIOps
While we were getting used to MLOps, a new Ops phenomenon has appeared in IT, one the need for which actually arose long ago. Meet AIOps: using AI to simplify IT operations management and to accelerate and automate problem resolution in today's complex IT environments. AIOps leverages the power of big data, analytics and machine learning for the following purposes:
• Collecting and aggregating huge and ever-growing volumes of operational data generated by many IT infrastructure components, applications and performance monitoring tools;
• Filtering useful signals from noise to reveal really important events and patterns related to the performance and availability of systems;
• Identifying root causes and responding quickly to problems, sometimes automatically, without human intervention.
By replacing many separate tools for manual IT operations with a single intelligent and automated platform, AIOps lets you respond quickly and even proactively to slowdowns and system failures with much less effort. AIOps spans diverse, dynamic and complex IT landscapes without sacrificing application performance or availability. With more companies today moving from traditional IT infrastructure to a dynamic mix of on-premises clusters, private clouds and public clouds, AIOps is relevant for many enterprises.
https://medium.com/geekculture/aiops-6e463cbe617a
👻What is anomaly detection and how does it work
Anomaly detection is the mathematical search for deviations in labeled or unlabeled numerical data, based on how much a particular value differs from the others or from the standard deviation of the sample. There are many different methods for detecting anomalies, called outlier detection algorithms, each with different detection criteria and therefore suited to different scenarios. The most common methods are:
• General density-based methods: K-Nearest Neighbors (KNN), Local Outlier Factor (LOF), Isolation Forests, and other algorithms that can be applied in regression or classification scenarios. Each of these derives the expected behavior by following the regions of highest data-point density; points that fall a statistically significant amount outside these dense zones are flagged as anomalies. Most of these methods are based on the distance between points, so it is important to normalize the units and scale of the dataset to get accurate results. In KNN, for example, data points are weighted by 1/k, where k is the distance to the nearest neighbor, so points that lie close to their neighbors carry more weight and define what counts as "normal" more than distant points do; the algorithm marks points with a low 1/k value as outliers. This is suitable for normalized, unlabeled data when there is no desire or ability to use algorithms with heavier computations.
• One-class support vector machine is a supervised learning algorithm that builds a robust prediction model and is often used for classification. There is a training set of examples, each labeled as belonging to one of two categories. The system builds criteria for sorting new examples into each category, mapping the examples to points in space so that the two categories are separated as much as possible, and it flags a point as an outlier if it falls outside both categories. In the absence of labeled data you can use unsupervised learning, which looks for clustering among the examples to define categories. This is suitable for working with two categories of data, when you need to find which data points lie outside each of them.
• K-means clustering combines the KNN idea of proximity between nearby points with SVM's focus on classification into categories. Each data point is assigned a category based on its features; every category has a center point that serves as a prototype for the other points in the cluster. Each point is compared against these prototypes, and its distance to the nearest prototype measures how different it is from that cluster: points close to a prototype form the cluster, while points that fit none of the established categories are marked as anomalies. This is suitable for scenarios where unlabeled data of many different types needs to be organized around learned prototypes.
There are other, more sophisticated algorithms for unsupervised anomaly detection and multidimensional datasets. For example, Gaussian mixture models are an alternative to K-means that use a Gaussian distribution instead of the plain standard deviation, and Bayesian methods use Bayesian probability to detect anomalies. Autoencoders can also be used: neural networks that learn coded rules for the expected output given the input, so anything far from these reconstructed values is treated as an anomaly; they are well suited to high-dimensional detection tasks. (A small scikit-learn sketch follows below.)
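As a concrete starting point, here is a minimal scikit-learn sketch of the density-based approach from the first bullet, using IsolationForest and LocalOutlierFactor on synthetic data; the contamination rate and the data are illustrative.
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # dense "normal" points
               rng.uniform(-6, 6, size=(10, 2))])   # a few injected outliers

X = StandardScaler().fit_transform(X)                # normalize units and scale first

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

print("IsolationForest outliers:", (iso.predict(X) == -1).sum())
print("LOF outliers:", (lof.fit_predict(X) == -1).sum())
```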
💥TOP 5 useful Python tools for data engineers and web developers
• Requests is an easy-to-use HTTP library for Python that lets you make requests and interact with APIs https://docs.python-requests.org/en/master/
• Advanced Python Scheduler (APScheduler) is a library for deferred execution of Python code, either once or with periodic repetition (see the sketch after this list). If jobs are persisted in a database, their state survives scheduler restarts. APScheduler can also serve as a cross-platform, application-specific replacement for platform-specific schedulers such as the cron daemon or the Windows task scheduler. However, APScheduler is not a daemon or a service and does not ship with command-line tools; it is intended to run inside existing applications and provides ready-made building blocks for creating a scheduler service or running one in a separate process. https://apscheduler.readthedocs.io/en/stable/userguide.html
• Watchdog - a module for monitoring filesystem events through a Python API and shell utilities https://pypi.org/project/watchdog/
• Twilio - a library for automating the sending of text messages and phone calls. It is very convenient for automatically monitoring events on third-party sites, for example, promptly tracking discounts on the products you want or the appearance of new items https://pypi.org/project/twilio/
• Random User Agent - a library for adding random user agents to requests, which is useful when scraping data from the web or sending a large number of requests https://pypi.org/project/random-user-agent/
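A minimal APScheduler sketch, as promised in the second bullet above: running a small job inside an existing application on a fixed interval. The job itself is a placeholder.
```python
import time
from apscheduler.schedulers.background import BackgroundScheduler

def check_prices():
    # placeholder for real work, e.g. polling an API with requests
    print("checking prices...")

scheduler = BackgroundScheduler()
scheduler.add_job(check_prices, "interval", minutes=5)  # repeat every 5 minutes
scheduler.start()

try:
    while True:          # the scheduler runs inside the existing process, not as a daemon
        time.sleep(1)
except (KeyboardInterrupt, SystemExit):
    scheduler.shutdown()
```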
✈️Real-Time ML Predictions with Google's Vertex AI
One of the biggest challenges in serving ML models is providing near real-time predictions. Some business scenarios are especially sensitive to latency, for example, recommendation systems for online-store users or estimating delivery times for food-tech companies. On August 25, 2021, Google announced direct interaction with Vertex AI, its unified ML platform, through private endpoints. Vertex AI lets you quickly connect a trained and tested ML model to a working application, upload it to a prepared server in Google Cloud, or export it to the desired format.
Vertex Predictions is a serverless way of serving ML models that can be deployed in the cloud and queried for predictions via a REST API. For online predictions, you deploy the model to an endpoint, which binds it to physical compute resources and lets it serve predictions in near real time. With VPC peering you can configure a private connection to reach an endpoint, so user data does not pass through the public internet, which reduces the latency of online predictions and improves security.
https://cloud.google.com/blog/products/ai-machine-learning/creating-a-private-endpoint-on-vertex-ai
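For orientation, a minimal online-prediction sketch with the Vertex AI Python SDK against an already deployed endpoint. The project, region, endpoint id and instance payload are placeholders, and private endpoints additionally require the VPC peering setup described in the post.
```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# Assumes a model has already been deployed to this endpoint
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

prediction = endpoint.predict(instances=[{"feature_1": 0.5, "feature_2": 1.2}])
print(prediction.predictions)
```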
🏂🏸Adversarial attacks to refine molecular energy predictions
Researchers at MIT have found a new way to quantify the uncertainty of molecular energy predictions made with neural networks. Neural networks are often used to predict properties of new materials orders of magnitude faster than traditional methods such as quantum-mechanical simulation. But the results can be unreliable: since ML models interpolate, they can fail when applied to data outside the training distribution. This is especially true for predicting the potential energy surface (PES), the energy map of a molecule across all its configurations. To address this, the scientists proposed probing the safe zones of a neural network using adversarial attacks. The expensive simulation is performed only for a small fraction of the molecular configurations, and the data is fed to a neural network that learns to predict the same properties for the remaining molecules. Such methods have already been applied successfully to new materials, including catalysts for producing hydrogen from water, cheaper polymer electrolytes for electric vehicles, magnets, and more. However, the accuracy of neural networks depends on the quality of the training data, and incorrect predictions can have disastrous consequences.
One way to estimate a model's uncertainty is to run the same data through several versions of it. The researchers trained several neural networks to predict the potential surface from the same data. If the networks are confident in a prediction, the difference between their outputs is minimal and the surfaces converge; otherwise, the predictions of the different models vary widely, producing a range of outputs, any of which might be the correct surface.
The scatter of the forecasts represents the uncertainty at a particular point. An ML model should report not only its best prediction but also the uncertainty of each one. However, each simulation can take tens to thousands of CPU hours, and to get meaningful results you need to run multiple models at a sufficient number of points.
Therefore, the new approach selects only the data points with low prediction confidence. These molecules are then modified slightly to increase the uncertainty, additional data is computed for them through simulation and added to the original training pool. The neural networks are trained again and a new set of uncertainties is calculated. This process is repeated until the uncertainty associated with the various points on the surface is well defined and cannot be reduced further.
The proposed approach was tested on zeolites: porous crystals valued for their shape selectivity and used in catalysis, gas separation and ion exchange. Simulating large zeolite structures is very expensive, and the researchers show how their method can provide significant savings in computer simulations, while the adversarial approach to retraining the neural networks improves performance without significant computational cost.
https://news.mit.edu/2021/using-adversarial-attacks-refine-molecular-energy-predictions-0901
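The "run the same data through several versions of the model" idea is ensemble disagreement; here is a minimal sketch of that step with toy regressors and data, not the MIT force-field models.
```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.05, 200)       # toy "energy surface"

# Train several networks on the same data from different random initializations
ensemble = [
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=seed).fit(X, y)
    for seed in range(5)
]

X_query = np.linspace(-5, 5, 50).reshape(-1, 1)        # includes points outside the training range
preds = np.stack([m.predict(X_query) for m in ensemble])

uncertainty = preds.std(axis=0)          # disagreement between ensemble members
most_uncertain = X_query[uncertainty.argmax()]
print("most uncertain point:", most_uncertain)  # a candidate to label with a new simulation
```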
🕸Web scraping automation: 3 popular tools
Do you want to track prices in an online store or automate ordering food from a restaurant? Try the following tools:
• Selenium is a well-known test automation framework that can be used to simulate user behavior and perform actions on websites such as filling out forms, clicking buttons, etc. https://selenium-python.readthedocs.io/
• Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree that can be used to extract data from web pages and is very good for simple projects (a short example follows after this list). https://pypi.org/project/beautifulsoup4/
• Scrapy is a fast, high-level web crawling and scraping framework used to extract structured data for mining, monitoring, and automated testing. It is great for complex projects and much faster than the tools above. https://docs.scrapy.org/en/latest/
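A minimal Beautiful Soup sketch of the price-tracking idea from the intro; the URL and the CSS class are placeholders, and a real page will need its own selectors.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"          # placeholder URL
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
price_tag = soup.find("span", class_="price")    # placeholder selector for the price element
if price_tag is not None:
    print("current price:", price_tag.get_text(strip=True))
```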
😎Need to develop an app for real-time emotion recognition on video?
Use the Face Recognition API! It is an open-source project for face recognition and manipulation from Python or the command line. The model was built with a deep-learning face recognition algorithm and reaches 99.38% accuracy on the Labeled Faces in the Wild benchmark.
With the Face Recognition API, application development consists of 5 steps (a short detection sketch follows below):
• receiving video in real time
• applying Python functions from the ready-to-use API to detect faces and emotions in the video stream;
• classification of emotions into categories;
• developing a recommendation system;
• building the application and deploying to Heroku, Dash or a web server.
https://github.com/ageitgey/face_recognition
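A minimal sketch of the detection step with the face_recognition package and OpenCV for the webcam stream. Note that emotion classification is a separate model on top of this; the library itself only locates and encodes faces.
```python
import cv2                     # OpenCV for grabbing webcam frames
import face_recognition

video = cv2.VideoCapture(0)    # default webcam
while True:
    ok, frame = video.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)    # the library expects RGB images
    for top, right, bottom, left in face_recognition.face_locations(rgb):
        cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):           # press q to quit
        break
video.release()
cv2.destroyAllWindows()
```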
🚀Data Science in the city: we are continuing Citymobil's meetup series on Data Science in geoservices, logistics, Smart City applications and more. Join us for the 2nd online meetup on September 23 at 18:00 MSK. Expect interesting talks from DS practitioners at Citymobil, Optimate AI and Yandex.Routing:
🚕Maksim Shalankin (Data Scientist in Citymobil's geo service) will talk about the life cycle of an ML model that predicts travel time under heavy load
🚚Sergey Sviridov (CTO at Optimate AI) will explain what is wrong with classical heuristics and combinatorial optimization methods for building optimal routes, and how they can be replaced with dynamic programming
🚛Daniil Tararukhin (Head of the analytics group at Yandex.Routing) will share how traffic jams affect the search for an optimal route and how this problem can be tackled with simulation.
After the talks, the speakers will answer questions from the audience.
The event host is Alexey Chernobrovov🛸
Free registration: https://citymobil.timepad.ru/event/1773649/
🗣4 best practices for using the Google Cloud Translation API more efficiently
This web service, which dynamically translates between languages using Google's ML models, supports over 100 languages and is widely used in practice. If you know a few useful life hacks, you can reduce costs, increase performance, and improve the security of this translation API on your websites.
1. Caching translated content not only reduces the number of calls to the Google Cloud Translation API, it also reduces the load and compute usage on your internal web servers and databases. This optimizes application performance and reduces cost. Caching can be configured at different levels of the application architecture: at the proxy level (NGINX or HAProxy), in memory in the application itself on the web servers, in an external caching service, or via a CDN. (A minimal caching sketch follows after this list.)
2. Secure access based on the principle of least privilege. When accessing the Google Cloud Translation API, it is recommended to use a Google Cloud service account rather than API keys. A service account is a special type of identity that represents a non-human user and can be authorized to access data in Google APIs. Service accounts have no passwords and cannot be used to log in through a browser, which minimizes this threat vector. Following the principle of least privilege, you can grant a least-privileged role with only the permissions needed to access the Translation API.
3. Customizing translations. If your content includes domain- and context-specific terms, Google Cloud Translation API Advanced supports custom terminology through a glossary, and you can create and use your own translation models with Google AutoML Translation. Mitigate the risk of errors and inaccuracies by alerting users that the content has been automatically translated by Google.
4. Budget control. The costs of the Google Cloud Translation API depend mainly on the number of characters sent to the API. For example, at $10 per million characters, a website with 20 million characters that needs to be translated into 10 languages would cost $10 × 20 × 10 = $2,000. Setting up billing alerts in your environment will help you keep track of the budget.
https://cloud.google.com/blog/products/ai-machine-learning/four-best-practices-for-translating-your-website
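A minimal sketch of point 1 (caching) with the Cloud Translation client library and a simple in-process cache; a production setup would cache at the proxy or CDN level instead, and authentication is assumed to come from a service account as in point 2.
```python
from functools import lru_cache
from google.cloud import translate_v2 as translate

client = translate.Client()          # credentials come from the service-account environment

@lru_cache(maxsize=10_000)           # identical (text, language) pairs hit the API only once
def translate_cached(text: str, target_language: str) -> str:
    result = client.translate(text, target_language=target_language)
    return result["translatedText"]

print(translate_cached("Hello, world", "de"))
print(translate_cached("Hello, world", "de"))   # served from the cache, no second API call
```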
🍏3 useful Python libraries for Data Scientists
• JMESPath - a library that helps you query JSON. It's useful when working with a large multi-level JSON document or dictionary: JMESPath exposes the object through JavaScript-style path access, which makes code easier to develop and test. It's also safe: if any part of the path doesn't exist, the JMESPath search function returns None. (Short examples of all three libraries follow after this list.) https://github.com/jmespath/jmespath.py
• Inflection is a library ported from Ruby that helps with tricky string-processing logic. It converts English words between singular and plural and converts strings from CamelCase to underscore notation. It's useful when variable or data-point names generated in another language or system need to be converted to pythonic style in line with the PEP conventions. https://github.com/jpvanhal/inflection
• more-itertools - a library with a set of handy functions for everyday development tasks, for example, quickly and elegantly splitting one dictionary into several lists by a common repeating key, or iterating over several lists at once. It extends the built-in itertools module with many ready-made routines such as chunked, windowed and partition. https://github.com/more-itertools/more-itertools
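A few one-liners showing the three libraries in action; the sample document and strings are made up.
```python
import jmespath
import inflection
from more_itertools import chunked

doc = {"users": [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 27}]}

# JMESPath: query nested JSON; missing paths return None instead of raising
print(jmespath.search("users[*].name", doc))          # ['Ann', 'Bob']
print(jmespath.search("users[0].address.city", doc))   # None

# Inflection: convert naming styles and word forms
print(inflection.underscore("DeviceType"))              # 'device_type'
print(inflection.pluralize("octopus"))                  # 'octopi'

# more-itertools: split an iterable into fixed-size chunks
print(list(chunked(range(7), 3)))                       # [[0, 1, 2], [3, 4, 5], [6]]
```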