Big Data Science – Telegram
The Big Data Science channel brings together interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — a channel about Data Science jobs and careers
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🤦🏼‍♀️Too many datasets? Label their metadata and divide it into 3 categories: Technical (logical and physical), Operational (lineage and data-profiling stats) and Team metadata from data scientists and analysts.
To understand each dataset better, ask the following questions:
1. What does the data represent logically? What is the meaning of the attributes? Is it the source of truth, or derived from another dataset?
2. What is the schema of the data? Who manages it? How was it transformed?
3. When was it last updated? Is the data tiered? Where are the previous versions? Can I trust this data? How reliable is the data quality?
4. Who and/or which team is the owner? Who are the common users?
5. What query engines are used to access the data? Are the datasets versioned?
6. Where is the data located? Where is it replicated, and what is the format?
7. How is the data physically represented, and can it be accessed?
8. Are there similar datasets with overlapping or identical content, both overall and in individual columns?
When you have answers for all your datasets, you can build a metadata catalog service, a critical building block of Data Lake/Data Mesh/Data Lakehouse platforms. Such a service typically collects metadata post hoc, after datasets have been created or updated by various pipelines, without interfering with dataset owners or users.
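As an illustration, here is a minimal Python sketch of how catalog records might group the three metadata categories above (all class and field names are hypothetical, not an actual catalog API):

from dataclasses import dataclass, field

@dataclass
class TechnicalMetadata:      # logical and physical
    schema: dict              # attribute names and types
    location: str             # where the data physically lives
    data_format: str          # e.g. "parquet"

@dataclass
class OperationalMetadata:    # lineage and data-profiling stats
    source_datasets: list     # lineage: what this dataset is derived from
    last_updated: str
    row_count: int

@dataclass
class TeamMetadata:           # knowledge from data scientists and analysts
    owner: str
    common_users: list = field(default_factory=list)
    notes: str = ""

@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    team: TeamMetadata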
https://medium.com/wrong-ml/why-is-understanding-datasets-hard-in-the-real-world-6eec47cafaa1
👻Soft-bodied robots with deep neural networks from MIT researchers
Traditional robots, hard and metallic, are not suitable for all tasks, so scientists are trying to create flexible, soft robots that can safely interact with people and easily fit into confined spaces. But such a robot needs to know the location of all parts of its body, which can take on almost any configuration.
Because their range of motion is limited by a fixed set of joints and limbs, rigid robots are easy to control with algorithms that map and plan their movement. The problem with soft robots is that the space of their deformations and movements is practically infinite. Of course, you could track the robot's position with a video camera and simply feed this information to the control program, but that creates a dependence on an external device. Therefore, to determine the optimal number of sensors and their most efficient placement on the robot itself, MIT researchers developed a new neural network architecture. The deep learning algorithm optimizes sensor placement using data about how different parts of the robot's body deform as it moves and performs applied tasks, such as grabbing objects.
In test simulations, this ML algorithm produced better sensor placements than the predictions of expert roboticists.
https://news.mit.edu/2021/sensor-soft-robots-placement-0322
🏂Apache Spark for Data Scientists: a short overview of the ML packages
Data Scientists like Apache Spark not only for its ability to process really large datasets very quickly, but also for its built-in machine learning algorithms (classification, regression, clustering, filtering), its tools for preparing data for modeling (cleaning, feature extraction, transformation, etc.), and its algebraic and statistical functions. All of this lives in two packages: MLlib (spark.mllib) and ML (spark.ml).
Strictly speaking, "Spark ML" is not an official library name: it is often used informally for the DataFrame-based MLlib API in the spark.ml package, as opposed to spark.mllib, which works with a lower-level data structure, the RDD (Resilient Distributed Dataset, a fault-tolerant distributed table-like collection). The official Apache Spark documentation emphasizes that both APIs are supported and neither is deprecated. In practice, most modern Spark application developers, data analysts and Data Scientists work with the spark.ml package because of its flexible and convenient DataFrame API.
👀YOLO was the first neural network to recognize objects in real time on mobile devices. Because its architecture avoids looping over candidate regions, it delivers high speed and accuracy in a single pass.
The first version of YOLO was introduced in 2016, and as of May 2021 the 5th version has already been released. At the moment, the YOLO family provides some of the best real-time object detection results.
YOLO works faster than R-CNN because it splits the image into a fixed grid of cells instead of proposing regions and computing a solution for each one. The flip side is that YOLO is weaker at recognizing objects with complex shapes or groups of small objects, because the grid limits the number of bounding-box candidates.
Nevertheless, in December 2020, Scaled YOLOv4 showed the best result (55.8% AP) on the Microsoft COCO dataset among its peers, outperforming in accuracy Google's EfficientDet D7x and DetectoRS with SpineNet-190 (self-training on additional data), Amazon's Cascade-RCNN with ResNest200, Microsoft's RepPoints v2 and Facebook's RetinaNet with SpineNet-190. These results were achieved at speed/accuracy trade-offs ranging from 15 FPS to 1774 FPS.
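For a quick try, recent YOLO versions can be loaded through PyTorch Hub; a minimal sketch, assuming the ultralytics/yolov5 hub entry point and an internet connection (the image URL is a placeholder):

import torch

# download a small pretrained YOLOv5 model from the ultralytics hub
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# single-pass inference on one image; a list of paths or URLs also works
results = model("https://ultralytics.com/images/zidane.jpg")

results.print()          # detected classes, confidences and timings
boxes = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class]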
Since May 18, 2021, Google has been integrating all of its cloud ML services into a single interface and API as part of the publicly accessible Vertex AI, a managed MLOps platform for deploying and serving AI models. Vertex AI merges AutoML and AI Platform into a single API, client library and user interface. Users can independently manage data and prototypes, and deploy and interpret their ML models using the Vertex tools: Vizier, Feature Store, Experiments, Continuous Monitoring and Pipelines.
Vertex AI integrates with many open-source frameworks (TensorFlow, PyTorch and scikit-learn) and supports other ML frameworks through custom training and prediction containers. You can also connect BigQuery, use standard SQL queries in existing business intelligence and spreadsheet tools, and export datasets from BigQuery to Vertex AI. Vertex Data Labeling helps you label your data accurately. Learn more and try Google's new DS platform here: https://cloud.google.com/vertex-ai
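For orientation, a minimal sketch using the google-cloud-aiplatform Python client (the project ID and region are placeholders; treat the exact calls as an assumption rather than official guidance):

from google.cloud import aiplatform

# point the client at your project and region (placeholder values)
aiplatform.init(project="my-project", location="us-central1")

# list the models already registered in Vertex AI
for model in aiplatform.Model.list():
    print(model.display_name, model.resource_name)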
🙌🏻Is data cleansing too hard and too slow? Try PClean, a new AI system from MIT researchers written in a domain-specific probabilistic programming language for automatic data cleansing. It removes typos, duplicates, missing values, spelling errors and inconsistencies, making it easier to prepare a dataset for analysis and ML modeling. Notably, PClean does not just mechanically cleanse data: it takes the data's semantics into account using generalized common-sense models for its judgments, which can be customized for the specific underlying data and error types.
The idea of probabilistic data cleansing based on declarative, generalized knowledge of the research context is not new: it was published in a 2003 article by researchers at the University of California, Berkeley. PClean develops this idea in line with the "explainable AI" trend, using realistic models of human knowledge to interpret data. Corrections in PClean are based on Bayesian reasoning: each alternative explanation for ambiguous data is assigned a weight based on prior knowledge. An additional advantage of PClean is its ability to clean really large amounts of data relatively quickly. For example, in a recent 2021 study on a table with 2.2 million rows of medical data, PClean found over 8,000 errors in just 7.5 hours. Finally, thanks to Bayesian probability, PClean gives calibrated estimates of its uncertainty, which can be manually corrected to further train the AI system.
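PClean itself is a domain-specific probabilistic language, but the underlying Bayesian idea fits in a few lines of plain Python (the prior, the toy error model and the candidate list below are illustrative assumptions, not PClean's actual model):

# score candidate corrections for a dirty value by prior * likelihood
prior = {"Boston": 0.7, "Austin": 0.3}  # prior knowledge: how common each city is

def typo_likelihood(observed: str, candidate: str) -> float:
    # toy error model: each mismatched character halves the probability
    mismatches = sum(a != b for a, b in zip(observed, candidate))
    mismatches += abs(len(observed) - len(candidate))
    return 0.5 ** mismatches

observed = "Bostn"
scores = {c: prior[c] * typo_likelihood(observed, c) for c in prior}
total = sum(scores.values())
posterior = {c: round(s / total, 3) for c, s in scores.items()}
print(posterior)  # calibrated weights for each explanation of the dirty value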
https://news.mit.edu/2021/system-cleans-messy-data-tables-automatically-0511
🎂Reinforcement learning (RL) is great for tasks with a well-defined reward function, as evidenced by the successes of AlphaZero in Go, OpenAI Five in Dota and AlphaStar in StarCraft. But in practice it is not always possible to define the reward function clearly. For example, in a simple room-cleaning scenario, an old business card found under the bed or a used concert ticket may be classified as trash, but if they are valuable to the owner they should not be thrown away. And even if you set clear criteria for evaluating an object, converting them into rewards is not easy: if you reward the agent every time it collects garbage, it may dump the garbage back out in order to collect it again and receive more reinforcement.
Such behavior of an AI system can be prevented by shaping the reward function from feedback on the agent's behavior. But this approach requires a lot of resources: in particular, training the Deep RL Cheetah model from OpenAI Gym and MuJoCo requires about 700+ human comparisons.
Therefore, UC Berkeley researchers David Lindner and Rohin Shah proposed an RL algorithm that forms a reward policy from implicit information, without human supervision or an explicitly assigned reward function. They named it RLSP (Reward Learning by Simulating the Past), because the reward is learned by simulating the past: from the observed state, the agent draws inferences about human preferences without explicit feedback. The main difficulty in scaling RLSP is reasoning about previous experience when the amount of data is large. The authors propose sampling the most probable past trajectories instead of enumerating them all, alternating the prediction of past actions with the prediction of the past states from which those actions were taken.
The RLSP algorithm uses gradient ascent to continuously update a linear reward function so that it explains the observed state. Scaling this idea is possible through a feature representation of each state: a linear reward function is modeled over these features, and the RLSP gradient is approximated by sampling the more likely past trajectories. The gradient pushes the reward function so that backward trajectories (what should have been done in the past) and forward trajectories (what the agent would do under the current reward) are consistent with each other; see the sketch after this paragraph. Once the trajectories are consistent, the gradient becomes zero, and the reward function most likely to have produced the observed state has been found. The essence of the RLSP algorithm is to perform gradient ascent with this gradient. The algorithm was tested in the MuJoCo simulator, an environment for evaluating RL algorithms on the problem of teaching simulated robots to move along an optimal trajectory or in the best possible way. The results showed that RLSP-derived reward policies perform as well as policies trained directly on the true reward function.
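A heavily simplified numpy sketch of the core update, assuming a linear reward w·φ(s) and feature expectations estimated from sampled backward and forward trajectories (all shapes, the sampling and the learning rate are illustrative):

import numpy as np

def rlsp_step(w, phi_backward, phi_forward, lr=0.1):
    # phi_backward: features of sampled past trajectories consistent with
    #               the observed state (what "should have been done")
    # phi_forward:  features of trajectories the agent would produce under
    #               the current reward w
    # The gradient pushes the two feature expectations to agree; when they
    # match, the gradient is zero and the reward explains the observed state.
    grad = phi_backward.mean(axis=0) - phi_forward.mean(axis=0)
    return w + lr * grad

# illustrative usage with random 8-dimensional trajectory features
rng = np.random.default_rng(0)
w = np.zeros(8)
w = rlsp_step(w, rng.normal(size=(32, 8)), rng.normal(size=(32, 8)))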
https://bair.berkeley.edu/blog/2021/05/03/rlsp/
Easy Python: how to create your own class in a couple of lines with the dataclasses module
Data analysts and Data Scientists like Python: an easy-to-learn, easy-to-use object-oriented language that offers many built-in functions and lets you create your own. Following the DRY (Don't Repeat Yourself) principle, the current development standard is to package code into classes that provide APIs to their methods. In practice, you have to create your own Python classes quite often, and then extend them. You can do this faster with the standard dataclasses module. Import it first, then decorate your custom class with the dataclass decorator and list the necessary attributes inside the class, annotating each with its type. For example, for a custom class Person with an integer ID and a string name, it looks like this:
from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
The dataclasses module provides a decorator and functions that automatically add generated special methods such as __init__() and __repr__() to user-defined classes. However, you need to use the dataclasses module wisely and be aware of the following features:
• by default, the module does not check the type specified in a variable's annotation;
• the dataclass() decorator adds methods to the custom class that may be redundant and duplicate behavior already defined in the class;
• when the dataclass() decorator is applied, all base classes are traversed in reverse MRO order (starting from object), and for each dataclass found, its fields are added to an ordered mapping of fields. All generated methods use this combined, computed, ordered mapping. Because fields keep their insertion order, fields re-declared in derived classes override those of base classes while staying in their original position; see the example below.
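A short example of the first and last points (the class names are illustrative):

from dataclasses import dataclass, fields

@dataclass
class Base:
    id: int
    name: str

@dataclass
class Derived(Base):
    id: str  # re-declared: overrides the base field but keeps first position

print([f.name for f in fields(Derived)])  # ['id', 'name'] - insertion order kept
p = Base(id="oops", name=42)              # runs fine: annotations are not checked at runtime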
🗣A new deep learning engine from NVIDIA Research creates 3D object models from standard 2D images, based on GAN neural networks and the NVIDIA Omniverse platform.
Developed by the NVIDIA AI Research Lab in Toronto, GANverse3D transforms flat images into realistic 3D models. The rendered results can be used in virtual environments, allowing game developers and designers to easily add new objects to their layouts without 3D modeling expertise or large rendering budgets. For example, a single photo of a car can be turned into a 3D model that drives around a virtual scene with realistic headlights, taillights and turn signals. The training dataset is created with GAN networks that synthesize images of the same object from different viewing angles.
Previous inverse-graphics models relied on 3D shapes as training data. The new approach needs no 3D assets: it turns the GAN model into an efficient data generator for creating 3D objects from 2D images. Trained on real images rather than the typical synthetic data, this AI model generalizes better to real-world applications, saving time and budget when modeling complex virtual objects. In particular, with the trained GANverse3D app, real photos of cars, buildings, or even people and animals can be transformed into 3D shapes to be customized and animated in Omniverse.
To visualize the same object from different viewpoints, the network is set up as follows: the first 4 layers are trainable and the remaining 12 are frozen. Conversely, if you freeze the first 4 layers and vary the remaining 12, the network generates different images from the same viewpoint. By manually assigning standard viewpoints (height and distance from the camera), the researchers were able to quickly assemble a multi-angle dataset from individual 2D images.
These multi-view images feed an inverse-graphics rendering framework that produces 3D mesh models from 2D images. After training on multi-view 2D images, GANverse3D needs only a single 2D image to form a 3D mesh model. This model can be used with a 3D neural renderer, allowing developers to customize objects and change backgrounds. Imported as an extension to the NVIDIA Omniverse platform and running on NVIDIA RTX GPUs, GANverse3D comes in handy for recreating any 2D image in 3D.
In tests, the latest NVIDIA GAN model, trained on 55,000 car images, outperformed an inverse-graphics neural network trained on the popular Pascal3D dataset.
https://blogs.nvidia.com/blog/2021/04/16/gan-research-knight-rider-ai-omniverse/
🏂Looking for a robot simulator? Try MuJoCo (Multi-Joint dynamics with Contact). Initially used at the Movement Control Laboratory, University of Washington, it has since been adopted by a wide community of researchers and developers. MuJoCo is a physics engine that facilitates research and development in robotics, biomechanics, graphics and animation, and other AI/ML/DS areas. It offers a unique combination of speed, accuracy and power for model-based optimization with contacts. MuJoCo makes it possible to scale up computationally intensive techniques and apply them to complex dynamical systems with contact-rich behaviors. It also supports testing and validation of control schemes before deployment on physical robots, interactive scientific visualization, virtual environments, animation and gaming. The core module (MuJoCo 2.0, a dynamic library with a C/C++ API that includes an XML parser, model compiler, simulator and interactive OpenGL visualizer, compatible with 64-bit Windows, Linux and macOS) is distributed under a trial license: 30 days free, after which you can buy an Individual or Institutional license. The other modules can be downloaded for free: https://www.roboti.us/index.html
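For a quick start from Python, a minimal sketch with the community mujoco-py bindings (assuming MuJoCo is installed and licensed, and that model.xml is your own MJCF model file):

import mujoco_py

# load an MJCF model and create a simulation
model = mujoco_py.load_model_from_path("model.xml")  # path is a placeholder
sim = mujoco_py.MjSim(model)
viewer = mujoco_py.MjViewer(sim)  # interactive OpenGL visualizer

for _ in range(1000):
    sim.data.ctrl[:] = 0.0  # actuator controls for this timestep
    sim.step()              # advance the physics by one step
    viewer.render()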
🚗How to get rid of lidar sensors and improve self-driving cars: research by ML specialists from MIT
Many modern self-driving cars are driven with the help of a giant rotating cylinder on the roof: a lidar sensor that emits pulses of infrared light and measures the time they take to bounce off objects, building a 3D map of points around the vehicle. However, this 3D data is huge and computationally expensive to process. For example, a typical 64-channel sensor delivers over 2 million points per second, and because of the extra spatial dimension, modern 3D models require 14 times more computation at inference than their 2D counterparts. So, for efficient navigation, engineers have had to convert the data to 2D, losing some information along the way.
A team of DS specialists from MIT is building a new ML driving system that navigates autonomously using only raw 3D point-cloud data and low-resolution GPS maps like those on smartphones. They even had to develop new deep learning components to use the GPU more efficiently and drive cars in real time. During testing, the system reduced how often control had to be handed back to the human driver and could even withstand severe sensor failures. This hybrid, evidence-based approach, which fuses several control predictions into an optimal motion plan, performed better than traditional 3D lidar processing. And by weighting control predictions by model uncertainty, the system can adapt to unexpected events. The main components of the system are a driving platform that needs no high-definition 3D maps, an ML system, and a deep 3D learning solution that optimizes the neural architecture and inference library. Next, the team plans to extend the project to adverse weather conditions and dynamic interaction with other vehicles.
https://www.csail.mit.edu/news/more-efficient-lidar-sensing-self-driving-cars
😎What is your salary? An anonymous survey and open datasets on AI/ML and Big Data salaries
You can share your own data and download CSV/JSON datasets with answers from colleagues all over the world: https://salaries.ai-jobs.net/download/
🥁AI in retail: 7 examples
eBay - pricing and stock management, and optimizing the appearance of product cards to increase appeal and sales.
Sephora uses a color-matching recommendation system for color cosmetics (lipstick, eyeshadow and powder): the camera scans the client's skin color, analyzes the data, generates a unique color number and selects the product from the line that suits this client best.
Tesco - inventory management: forecasting and replenishment using weather and regional characteristics, as well as data from CCTV cameras pointed at store shelves. Routing ML algorithms also help customers shop faster in Tesco Online.
OTTO - predicting future purchases based on the analysis of 3 billion historical transactions and 200 additional variables (weather, website searches, etc.). The forecast of which product will be sold within a month reaches 90% accuracy, which helps optimize warehouse stocks and increase product turnover.
Simbe Robotics builds robots that use computer vision to detect planogram violations, out-of-stock items and incorrect price tags. The robot not only recognizes products, but also recommends how to restock the shelves.
Vekia has developed a supply chain management solution for Leroy Merlin, Etam, Okaidi and Jacadi: it monitors goods in each store with daily assortment assessments, calculates the optimal stock level for each location several times a day, and can automatically generate an order for the required items.
Diwo - identifying the factors behind declining sales of individual products. The ML system also offers a set of recommended strategies to improve the situation, suggesting the ideal time to launch promotions and other attributes of advertising campaigns.
🙌🏻Mathematics for the Data Scientist, Part 1: Benford's Law
Benford's law (the Newcomb-Benford first-digit law) describes the probability that a given first significant digit appears in distributions of quantities taken from real life. The law holds for many distributions and also lets you predict the frequency of the second and third digits in a dataset.
The law was first discovered by the American astronomer Simon Newcomb in 1881, who noticed the uneven wear of the pages in books of logarithm tables. In 1938, the physicist Frank Benford reached similar conclusions by analyzing tables of river characteristics, chemical compounds and house numbers in a city directory. The analysis showed that 1 is the first significant digit with a probability of about 1/3, not 1/9 as one might expect at first glance.
Benford's law applies to sets of numbers that can grow exponentially, i.e. where the growth rate of a value is proportional to its current value: stock balances in warehouses, stock prices, population sizes, lengths of rivers, areas of countries. A set of numbers satisfies Benford's law if the first digit d (d ∈ {1, …, 9}) occurs with probability P(d) = log10(1 + 1/d). Using this distribution, you can predict which digit occurs most frequently in a dataset. The law usually does not hold for distributions with fixed minimum or maximum values, or for those that span only one or two orders of magnitude; it also does not apply to texts. The sample must be large enough for statistical methods to apply. In practice, the first-digit law is used to detect fraud in tax forms, election results, economic performance figures and accounting data.
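Checking a dataset against the law takes only a few lines of Python (the sample values are placeholders):

import math
from collections import Counter

def first_digit(x):
    # normalize a positive number into [1, 10) and take the integer part
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

values = [31, 120, 1.9, 4400, 87, 13, 260, 1000]  # placeholder data
counts = Counter(first_digit(v) for v in values)
n = len(values)
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford probability P(d)
    print(d, f"expected={expected:.3f}", f"observed={counts[d] / n:.3f}")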
💃🏼🕺🏼Smart clothes are the new fashion of the coming years: a digital ML fabric from MIT researchers with memory, temperature sensors and a trained neural network
MIT has created the first digital fiber that, once sewn into a shirt, can sense, store, analyze and measure physical activity. Digital fibers expand fabrics' ability to detect hidden patterns in the human body, which can be used to monitor physical performance, report medical indicators, detect diseases early, and even keep memories: for example, storing the wedding music in the dress you wore that day.
Until now, electronic fibers have been analog, carrying a continuous electrical signal rather than a digital one. This is the first fiber able to store and process data digitally, adding a new content dimension to textiles and making it possible to program fabrics.
The new fiber was created by placing hundreds of square silicon digital microchips into a preform that was then used to draw a polymer fiber. By precisely controlling the polymer flow, the researchers produced a fiber with a continuous electrical connection between the chips over tens of meters.
The fiber itself is thin and flexible: it can be passed through a needle, sewn into fabric and washed at least 10 times without breaking, and the wearer cannot feel it at all. Thanks to a digital addressing method, the functionality of one element can be switched on without affecting the rest. Digital fiber can also store a large amount of information in memory: the researchers were able to write, store and read data on the fiber, including a 767-kilobyte full-color short video file and a 0.48-megabyte music file. The files can be stored for two months without power.
The fiber also holds a neural network of 1,650 connections in the fabric's memory, which can be trained on data in real time directly on a person, analyzing body-temperature information in relation to physical activity. Thanks to this, clothing will in the future be able to detect and warn people in real time about changes in health indicators (respiratory and heart rate) and feed muscle-activation data to athletes during training. For now, the smart fabric is controlled by a small external device; the next step is to develop a new chip as a microcontroller connected within the fiber itself.
https://news.mit.edu/2021/programmable-fiber-0603
👀How to create your own deepfake without lengthy neural network training? It's easy!
Try 4 free services:
- https://reface.app/
- https://avatarify.ai/
- https://www.wombo.ai/
- https://www.myheritage.com/deep-nostalgia?lang=RU
You only need to upload a photo or take a selfie on your mobile phone to get a believable video of another person's face. For example, MyHeritage, a genealogy website, lets you "reanimate" deceased people by generating mini-videos of them looking around and smiling. However, remember that generating fakes to compromise someone can be considered libel and is punishable by law!👆🏻
🥁An ML leader from Sber at the IT World Awards 2021
The ML Space cloud platform from SberCloud (Sber) was recognized as the world's best Data Science and AI product by the Globee Awards, organizers of the IT World Awards 2021. ML Space is a powerful MLOps tool that supports every process in the ML model development cycle, including testing and deployment. The platform bundles all the necessary frameworks and libraries to speed up, optimize and simplify the creation of ML products.
https://globeeawards.com/it-world-awards/winners/
🤣Communicating with photo and video bots: a study of how people mirror the emotions of virtual interlocutors and trust their appearance and emotions. The scientists' conclusions may surprise you: mirroring is not an indicator of a pleasant conversation at all, but of difficulty in understanding one's interlocutor.
https://techxplore.com/news/2021-06-features-virtual-agents-affect-humans.html
💥Mathematics for the Data Scientist, Part 2: Zipf's Law
This empirical pattern of word-frequency distribution in natural language is often used in quantitative linguistics and NLP problems. Zipf's law says: if all the words in a large text are ordered by decreasing frequency of use, then the frequency of the n-th word in the list is inversely proportional to its ordinal number n (its rank). For example, the second most common word occurs about half as often as the first, the third about a third as often, and so on.
The pattern was first noticed by the French stenographer Jean-Baptiste Estoup in 1908, and in 1913 the German physicist Felix Auerbach applied it to describe the distribution of city sizes. The American linguist George Zipf actively popularized the pattern in 1949, proposing to use it to describe the distribution of economic power and social status: the richest person has twice as much money as the next richest, and so on. An explanation of Zipf's law based on the correlation properties of additive Markov chains (with a step memory function) was given in 2005. Mathematically, Zipf's law is described by the Pareto distribution (behind the well-known 80/20 principle).
The law's applicability beyond linguistics is explained by the American bioinformatician Wentian Li, who showed that a random sequence of characters also obeys Zipf's law. Li argues that Zipf's law is a statistical phenomenon unrelated to the semantics of a text: the probability of a random occurrence of a word of length n in a chain of random characters decreases as n grows, in the same proportion as the rank of that word in the frequency list (an ordinal scale). Hence the product of a word's rank and its frequency is approximately constant; see the check below.
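The rank-frequency check is easy to reproduce in Python on any large text (the corpus file is a placeholder):

from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder file
freq = Counter(text.split())

# for a Zipfian text, rank * frequency stays roughly constant
for rank, (word, count) in enumerate(freq.most_common(10), start=1):
    print(rank, word, count, "rank*freq =", rank * count)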