🤓How to assess the potential effectiveness of medical drugs: DeepBAR, a new method from MIT researchers, calculates the binding affinity between drug candidates and their targets. It builds on GAN-style generative models to analyze molecular structures much like images
https://news.mit.edu/2021/drug-discovery-binding-affinity-0315
MIT News
Faster drug discovery through machine learning
MIT researchers have developed DeepBAR, a machine learning technique that quickly calculates drug molecules’ binding affinity with target proteins. The advance could accelerate drug discovery and protein engineering.
🥺Looking for interesting reading? Check out the top 21 books on Data Science, Engineering, and Statistics – must-reads for 2021: https://towardsdatascience.com/21-data-science-books-you-should-read-in-2021-db625e97feb6
And a shortlist of 5: https://medium.com/curious/5-books-every-data-scientist-should-read-in-2021-206609d8593b
Medium
21 Data Science Books You Should Read in 2021
An Updated Collection of the Best Data Science Books to Read Right Now
😂If you do not want to study grammar and history, use ML to pass the exams! GPT-3 has done just that in U.S. History, Research Methods, Creative Writing, and Law. In 3-20 minutes, the neural network was able to mimic human writing in grammar, syntax, and word frequency, and received the same kind of feedback as the human writers
https://www.zdnet.com/article/ai-can-write-a-passing-college-paper-in-20-minutes/
ZDNet
AI can write a passing college paper in 20 minutes
Natural language processing is on the cusp of changing our relationship with machines forever.
👍🏻Deep Learning helps you stay healthy!
For years, physicians have relied on visual inspection to identify suspicious pigmented lesions (SPLs), which can be an indication of skin cancer. Early-stage identification of SPLs can improve melanoma prognosis and significantly reduce treatment cost. But it is not easy to quickly find and prioritize SPLs given the high volume of pigmented lesions a patient may have. Researchers from MIT have devised a new AI pipeline that uses deep convolutional neural networks (DCNNs) and applies them to analyzing SPLs in the wide-field photography common in smartphones.
A wide-field image, acquired with a smartphone camera, shows large skin sections of a patient. An automated system detects, extracts, and analyzes all pigmented skin lesions observable in the wide-field image. A pre-trained DCNN determines the suspiciousness of individual pigmented lesions and marks them: yellow for further inspection, red for referral to a dermatologist. The extracted features are then used to further assess the pigmented lesions and to display the results as a heatmap.
DCNNs are deep learning models used to classify images, and then to cluster them for tasks such as photo search.
https://news.mit.edu/2021/artificial-intelligence-tool-can-help-detect-melanoma-0402
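A schematic sketch of the triage step described above (generic illustration, not MIT's actual model): each extracted lesion crop is scored by a pretrained classifier, and the score is mapped to a triage color. The model callable and thresholds here are hypothetical.
```python
import numpy as np

def triage(crops, model, yellow=0.5, red=0.8):
    """Assign triage colors to lesion crops by model suspiciousness score.

    crops: iterable of (H, W, 3) image patches extracted from a wide-field photo
    model: callable returning a suspiciousness probability for one crop
    """
    labels = []
    for crop in crops:
        p = float(model(crop))
        if p >= red:
            labels.append(("red", p))      # refer to a dermatologist
        elif p >= yellow:
            labels.append(("yellow", p))   # flag for further inspection
        else:
            labels.append(("ok", p))
    return labels

# Toy usage with a dummy scoring model.
crops = [np.zeros((64, 64, 3)) for _ in range(3)]
print(triage(crops, model=lambda c: np.random.rand()))
```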
MIT News
An artificial intelligence tool that can help detect melanoma
An artificial intelligence system can efficiently detect melanoma, a type of skin cancer. MIT researchers used deep convolutional neural networks (DCNNs) to quickly analyze wide-field photos of patients’ bodies.
🏂How AI helps cut manufacturing inspection effort by 25%: Fujitsu's experience
The Japanese tech giant has developed an AI system that highlights abnormalities in the appearance of products to help detect manufacturing issues earlier, before materials are wasted.
An ML model trained on simulated images of products with abnormalities can detect a variety of issues, for example frayed threads on multicolor carpets or defective wiring patterns on electronic parts with different wiring shapes. The model achieves high quality: an AUROC score above 98% for defect detection. The technology was tested at the Fujitsu plant in Nagano, which manufactures electronic equipment; the results showed a 25% reduction in the man-hours needed to inspect printed circuit boards.
https://artificialintelligence-news.com/2021/03/29/fujitsu-develops-ai-product-abnormalities-manufacturing/
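For reference, an AUROC score like the one quoted above can be computed from anomaly scores and ground-truth defect labels; a minimal sketch with scikit-learn on toy data (not Fujitsu's system):
```python
from sklearn.metrics import roc_auc_score

# Toy ground truth (1 = defective) and model anomaly scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.85, 0.9, 0.2, 0.7]

# AUROC near 1.0 means defective items consistently score above normal ones.
print(roc_auc_score(y_true, scores))  # -> 1.0 for this toy example
```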
🤠In January 2021 OpenAI presented a new ML model, a neural network called DALL·E, that creates images from text captions for a wide range of concepts expressible in natural language. It has 12 billion parameters and is based on GPT-3. DALL·E was trained to generate images from text descriptions using a dataset of text–image pairs. It can create anthropomorphized versions of animals and objects, combine unrelated concepts in plausible ways, render text, and apply transformations to existing images.
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. A token is any symbol from a discrete vocabulary, e.g. each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
The images are preprocessed to 256x256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE pretrained with a continuous relaxation, which obviates the need for an explicit codebook, EMA loss, or dead-code revival. This trick also scales to large vocabulary sizes, and it allows DALL·E both to generate an image from scratch and to regenerate any rectangular region of an existing image that extends to the bottom-right corner, consistent with the text prompt.
The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP.
https://openai.com/blog/dall-e/
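To make the token budget concrete, here is a minimal sketch of how a caption and an image could be packed into the single 1280-token stream described above (hypothetical token IDs and packing scheme, not OpenAI's actual code):
```python
# Minimal sketch of DALL·E's combined token stream (hypothetical packing,
# not OpenAI's implementation).
TEXT_VOCAB, TEXT_MAX = 16384, 256       # BPE-encoded caption tokens
IMAGE_VOCAB, IMAGE_TOKENS = 8192, 1024  # 32x32 grid of discrete VAE codes

def build_stream(text_tokens: list[int], image_tokens: list[int]) -> list[int]:
    """Concatenate caption and image codes into one autoregressive stream."""
    assert len(text_tokens) <= TEXT_MAX
    assert len(image_tokens) == IMAGE_TOKENS      # 32 * 32 latent grid
    # Pad the caption to its maximum length, then append the image codes.
    padded = text_tokens + [0] * (TEXT_MAX - len(text_tokens))
    # Shift image IDs into a separate range so the two vocabularies don't collide.
    return padded + [TEXT_VOCAB + t for t in image_tokens]

stream = build_stream([17, 512, 33], [5] * 1024)
assert len(stream) == 1280  # 256 text tokens + 1024 image tokens
```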
Openai
DALL·E: Creating images from text
We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.
Forwarded from Big Data Science [RU]
A Yandex job offer over one weekend!
On April 24-25, Weekend Offer for analysts will take place: an online Yandex event where you can go through interviews and receive a job offer over a single weekend!
To get into Weekend Offer, you need to solve 2-5 problems on the Yandex.Contest platform. On April 24 there will be two one-hour coding sections, and on April 25, one-hour finals with the teams, where your potential manager will tell you about the service and your role, and may give you one more problem. If there is mutual interest, you will receive an offer that same evening.
Details and registration: https://clck.ru/UAzfb
🙈Pandas is a great Python data analysis library that every Data Scientist uses. However, it does not support multiprocessing and is rather slow on large datasets. For fast Big Data processing, consider Vaex or Dask instead (see the sketch after the list).
👍🏻Dask https://dask.org/ is a Pandas-compatible data analysis library with parallel computing and scalable performance. Besides Pandas, it also integrates with the NumPy and Scikit-learn libraries, providing easy switching between them through familiar Python APIs and data structures.
👍🏻Vaex https://vaex.io/docs/index.html is a high-performance Python library for lazy evaluation on Pandas-like dataframes, as well as Big Data visualization and aggregation. It can calculate basic statistics at about a billion rows per second, but unlike Dask, it is not as fully integrated with other libraries.
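A minimal sketch of switching from Pandas to Dask; the file name and column names are hypothetical, while read_csv, groupby, and compute are standard dask.dataframe calls:
```python
import dask.dataframe as dd

# Read a large CSV lazily as a partitioned, Pandas-like dataframe
# ("events.csv" and its columns are hypothetical).
df = dd.read_csv("events.csv")

# Operations build a task graph instead of executing immediately...
mean_by_user = df.groupby("user_id")["amount"].mean()

# ...and compute() runs the graph in parallel, returning a Pandas object.
result = mean_by_user.compute()
print(result.head())
```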
www.dask.org
Dask | Scale the Python tools you love
Dask is a flexible open-source Python library for parallel computing maintained by OSS contributors across dozens of companies including Anaconda, Coiled, SaturnCloud, and nvidia.
🤜🏻How to handle missing values in datasets for ML: 3 easy-to-use pandas functions (a usage sketch follows the list)
• fillna() - fills null (NA/NaN) values and returns an object in which the missing values have been filled: Series.fillna(value=None, method=None, axis=None, inplace=False, **kwargs)
• dropna() - removes rows or columns containing missing/NaN values. The parameter axis=0 drops rows that contain missing values; axis=1 drops columns: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
• interpolate() - fills missing/NaN values using various interpolation techniques: DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
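A minimal usage sketch of all three on a toy dataframe (hypothetical data):
```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values.
df = pd.DataFrame({"temp": [20.1, np.nan, 22.4, np.nan, 25.0],
                   "city": ["A", "A", None, "B", "B"]})

filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})  # fill with defaults
dropped = df.dropna(axis=0, how="any")               # drop rows with any missing value
smoothed = df["temp"].interpolate(method="linear")   # estimate gaps from neighbors

print(filled, dropped, smoothed, sep="\n\n")
```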
🙌🏻6 types of RNN to model sequential data
Why modeling sequential data is not so easy, and how to solve the task with recurrent neural networks: plenty of math and visual illustrations, plus a detailed example of an RNN implementation in Keras/TensorFlow and Python.
https://neptune.ai/blog/recurrent-neural-network-guide
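For orientation, here is a minimal Keras sketch of a sequence classifier (a generic example, not the code from the linked guide; the input shape is hypothetical):
```python
import tensorflow as tf

# A tiny RNN: 30-step sequences of 8 features -> binary label.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(30, 8)),   # hidden state size 32
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```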
neptune.ai
Recurrent Neural Network Guide: a Deep Dive in RNN
Sequence modeling is a task of modeling sequential data. Modeling sequence data is when you create a mathematical notion to understand and study sequential data, and use those understandings to generate, predict or classify the same for a specific application. …
🥁How to understand and develop AI by searching for and highlighting representative scenarios: the Bayes-TrEx tool from MIT researchers
Accuracy alone is not enough to deploy ML results with high assurance in every field: focusing only on raw accuracy can lead to dangerous oversights. A model can make mistakes with very high confidence when it encounters something previously unseen, such as a self-driving car seeing a new type of traffic sign. To improve human-AI interaction, a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory has created a new tool called Bayes-TrEx. It lets developers and users increase transparency into their AI models, specifically by finding concrete examples that lead to a particular behavior. The method makes use of Bayesian posterior inference, a widely used mathematical framework for reasoning about model uncertainty.
In experiments, the researchers applied Bayes-TrEx to several image-based datasets and found new insights that were previously overlooked by standard evaluations focusing solely on prediction accuracy. The approach can be used in medical diagnosis, autonomous driving systems, robotics, and so on. Bayes-TrEx could help address such novel situations ahead of time, enabling developers to correct undesirable outcomes before potential tragedies occur or resources are wasted.
https://news.mit.edu/2021/more-transparency-understanding-machine-behaviors-bayes-trex-0322
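A heavily simplified sketch of the underlying idea (not the authors' implementation): sample inputs from a prior over the data space and weight them by how likely they are to trigger the behavior of interest, which yields a posterior over examples that cause that behavior. All function names here are hypothetical.
```python
import numpy as np

def posterior_examples(sample_prior, behavior_prob, n=10000, n_keep=10):
    """Importance-sample inputs that likely trigger a target model behavior.

    sample_prior()   -> one input x drawn from a prior over the data space
    behavior_prob(x) -> probability that the model exhibits the target
                        behavior on x (e.g., predicts class A with >90% confidence)
    """
    xs = [sample_prior() for _ in range(n)]
    weights = np.array([behavior_prob(x) for x in xs])  # likelihood of the behavior
    weights = weights / weights.sum()
    # Resample according to the posterior weights.
    idx = np.random.choice(len(xs), size=n_keep, p=weights)
    return [xs[i] for i in idx]

# Toy usage: find 2-D points a hypothetical classifier scores as "high confidence".
examples = posterior_examples(
    sample_prior=lambda: np.random.randn(2),
    behavior_prob=lambda x: float(1 / (1 + np.exp(-5 * (x[0] + x[1])))),
)
print(examples)
```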
MIT News
More transparency and understanding into machine behaviors
A tool called Bayes-TrEx allows artificial intelligence developers and users to gain transparency into their AI model. The work was led by researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
🎯How to remove duplicates in a dataset with Apache Spark?
Use the following framework API methods (a PySpark sketch follows the list):
• distinct() - the simplest and most frequently used way to remove identical duplicate rows from a dataframe.
• dropDuplicates() - unlike distinct(), which takes no arguments at all, dropDuplicates() lets you specify a subset of columns for deduplication. dropDuplicates(colNames: Seq[String]) is therefore more suitable when only some of the columns of the original dataset need to be considered.
• reduceByKey() - returns a new RDD, a distributed dataset of key-value pairs (K, V), in which all values for one key are combined into a tuple: the key and the result of the reduce function applied to all values associated with that key. This method of removing duplicates is limited by the size of a Scala tuple, which can hold at most 22 elements. That is why you should not use reduceByKey() if the Spark RDD keys or values have more than 22 columns.
• collect_set() - a function from the Spark SQL API. It collects and returns a set of unique items. It is not deterministic, since the order of the results depends on the order of the rows, which may change after shuffling, and it is not "real" deduplication. Basically, collect_set() rolls up records by executing groupBy() and collecting the unique values of a column for each group.
• write your own window function to get around the size limitation of Scala tuples: for example, split the RDD by columns, sort them, and filter out the values you want.
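A minimal PySpark sketch of the first, second, and fourth approaches on a toy dataframe (hypothetical data; distinct, dropDuplicates, groupBy, and collect_set are standard Spark APIs):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

# Hypothetical dataframe with duplicate rows.
df = spark.createDataFrame(
    [("alice", "NY"), ("alice", "NY"), ("bob", "LA"), ("alice", "SF")],
    ["name", "city"],
)

df.distinct().show()                     # drop fully identical rows
df.dropDuplicates(["name"]).show()       # keep one row per name
df.groupBy("city").agg(F.collect_set("name")).show()  # unique names per city
```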
🕸What is the Graph Transformer and how does this NN model work
Transformer-based neural networks in NLP tasks overcome the bottlenecks of Recurrent Neural Networks (RNNs) caused by sequential processing: by mapping the words in a sentence and combining the received information, they generate abstract feature representations of it.
For learning on graphs, graph neural networks (GNNs) with several parameterized layers have emerged as the most powerful tool in deep learning. Each GNN layer takes a graph with node (and edge) features and builds abstract feature representations of nodes (and edges) based on the available explicit connectivity structure (the graph structure). The generated features are then passed to downstream classification layers, and the target property is predicted. A generalization of transformer neural networks to graphs can learn on graphs and datasets with arbitrary structure, rather than only sequential data as NLP transformers do.
https://www.topbots.com/graph-transformer/
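For intuition, a minimal NumPy sketch of one message-passing GNN layer (a generic illustration, not the Graph Transformer architecture from the article): each node's new features are a transformation of its neighbors' features aggregated along the graph structure.
```python
import numpy as np

def gnn_layer(H, A, W):
    """One message-passing layer.

    H: (n_nodes, d) node features, A: (n_nodes, n_nodes) adjacency matrix,
    W: (d, d_out) learned weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees
    msg = (A_hat / deg) @ H                 # average neighbor features
    return np.maximum(msg @ W, 0)           # linear transform + ReLU

# Toy graph: 3 nodes, 2 features, one layer with random weights.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 2)
print(gnn_layer(H, A, np.random.randn(2, 4)))
```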
TOPBOTS
Graph Transformer: A Generalization of Transformers to Graphs
In this article, I'll present Graph Transformer, a transformer neural network that can operate on arbitrary graphs.
🎯GPT-3 powers the next generation of apps through an open API for advanced AI features: from semantic search, summarization, and sentiment analysis to content generation and translation.
Since the first commercial release of the OpenAI API for GPT-3, more than 300 applications have come to use it across varying categories and industries, from productivity and education to creativity and games.
Meet 3 success stories at https://openai.com/blog/gpt-3-apps/ and try the product for your DS needs: https://beta.openai.com/
Openai
GPT-3 powers the next generation of apps
Over 300 applications are delivering GPT-3–powered search, conversation, text completion, and other advanced AI features through our API.
🤦🏼♀️Too many datasets? Label their metadata and divide it into 3 categories: Technical (logical and physical), Operational (lineage and data-profiling stats), and Team metadata from data scientists and analysts.
To understand each dataset better, ask the following questions:
1. What does the data represent logically? What is the meaning of the attributes? Is it the source of truth, or derived from another dataset?
2. What is the schema of the data? Who manages it? How was it transformed?
3. When was it last updated? Is the data tiered? Where are the previous versions? Can I trust this data? How reliable is the data quality?
4. Who and/or which team is the owner? Who are the common users?
5. What query engines are used to access the data? Are the datasets versioned?
6. Where is the data located? Where is it replicated, and what is the format?
7. How is the data physically represented, and can it be accessed?
8. Are there similar datasets with partially or fully identical content, both overall and for individual columns?
When you have answers for all datasets, you can build a metadata catalog service, a critical building block of Data Lake/Data Mesh/Data Lakehouse platforms. Such a service typically collects metadata post hoc, after the datasets have been created or updated by various pipelines, without interfering with dataset owners or users. A sketch of a catalog record follows below.
https://medium.com/wrong-ml/why-is-understanding-datasets-hard-in-the-real-world-6eec47cafaa1
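A minimal sketch of what one catalog record covering the three categories above might look like (a hypothetical schema for illustration, not a real catalog's API):
```python
from dataclasses import dataclass, field

# Hypothetical catalog record spanning Technical, Operational, and Team metadata.
@dataclass
class DatasetMetadata:
    # Technical: logical meaning and physical layout
    name: str
    schema: dict[str, str]            # column -> type
    location: str                     # physical path or table URI
    fmt: str = "parquet"
    # Operational: lineage and profiling stats
    upstream: list[str] = field(default_factory=list)  # source datasets
    last_updated: str = ""            # ISO timestamp
    row_count: int = 0
    # Team: ownership and usage
    owner_team: str = ""
    common_users: list[str] = field(default_factory=list)

record = DatasetMetadata(
    name="orders",
    schema={"order_id": "bigint", "amount": "double"},
    location="s3://lake/orders/",
    upstream=["raw_orders"],
    owner_team="payments",
)
print(record)
```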
Medium
Why is understanding datasets hard in the real-world?
There is no dearth of data within the enterprise, but consuming the data to solve business problems is a major challenge today. To extract…
👻Soft-bodied robots designed with deep neural networks by MIT researchers
Traditional robots, hard and metal, are not suitable for all tasks, so scientists are trying to create flexible, soft robots that can safely interact with people and easily enter confined spaces. But such robots need to know the location of every part of their bodies, which can bend into almost any configuration.
Because their range of motion is constrained by a fixed set of joints and limbs, rigid robots are readily controllable using algorithms that map and plan their movement. The problem with soft robots is that the space of their deformations and movements is practically infinite. One could, of course, determine the robot's position with a video camera and simply feed this information to the control program, but that creates a dependence on an external device. To determine the optimal number of sensors and their most efficient placement on the robot itself, MIT researchers developed a new neural network architecture. The deep learning algorithm optimizes sensor placement using data about the deformation of different parts of the robot's body as it moves while performing applied tasks, such as grasping objects.
In test simulations, the algorithm's sensor placements performed better than placements suggested by expert roboticists.
https://news.mit.edu/2021/sensor-soft-robots-placement-0322
MIT News
Researchers’ algorithm designs soft robots that sense
MIT researchers developed a deep learning neural network to aid the design of soft-bodied robots. The algorithm optimizes the arrangement of sensors on the robot, enabling it to complete tasks as efficiently as possible.
🏂Apache Spark for Data Scientists: a short overview of the ML packages
Data Scientists like Apache Spark not only for its ability to process really large datasets very quickly, but also for its popular machine learning algorithms (classification, regression, clustering, filtering) and tools for preparing data for modeling (cleaning, feature extraction, transformation, etc.), as well as its algebraic and statistical functions. All this is shipped in two packages: MLlib (spark.mllib) and ML (spark.ml).
However, "Spark ML" is not an official library name; it is commonly used to refer to the DataFrame-based MLlib API in the spark.ml package, as opposed to spark.mllib, which works with a lower-level data structure: the RDD (Resilient Distributed Dataset, a fault-tolerant distributed collection). The official Apache Spark documentation emphasizes that both APIs are supported and neither is deprecated. In practice, most modern Spark developers, data analysts, and Data Scientists work with the spark.ml package because of its flexible and convenient DataFrame API; a typical pipeline is sketched below.
👀YOLO was the first neural network to recognize objects in real time on mobile devices. Because its architecture avoids looping over candidate regions and processes the whole image in a single pass, it provides high speed and accuracy of recognition.
The first version of YOLO was introduced in 2016, and today, in May 2021, the 5th version has already been released. At the moment, the YOLO family provides some of the best results in real-time object detection.
YOLO works faster than R-CNN because it splits the image into a fixed grid of cells instead of proposing regions and computing a solution for each of them. The flip side is that YOLO is not as good at recognizing objects with complex shapes or groups of small objects, because each cell can propose only a limited number of candidate boxes (a grid-decoding sketch follows below).
Nevertheless, in December 2020, Scaled-YOLOv4 showed the best results (55.8% AP) on the Microsoft COCO dataset among its peers, overtaking in accuracy Google's EfficientDet D7x and DetectoRS with SpineNet-190 (self-training on additional data), Amazon's Cascade-RCNN with ResNeSt200, Microsoft's RepPoints v2, and Facebook's RetinaNet with SpineNet-190. These results were achieved across speed/accuracy operating points ranging from 15 FPS to 1774 FPS.
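A toy sketch of the grid idea (a generic YOLOv1-style illustration, not the actual YOLO code): the network outputs a fixed S x S grid in which each cell predicts B boxes plus class scores, so decoding is a single pass over a constant-size tensor.
```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1-style layout)

def decode(pred, conf_thresh=0.5):
    """Decode an (S, S, B*5 + C) prediction tensor into detections."""
    detections = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            cls = int(np.argmax(cell[B * 5:]))           # most likely class
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf > conf_thresh:                   # keep confident boxes
                    # x, y are offsets within cell (i, j); convert to image scale.
                    detections.append(((j + x) / S, (i + y) / S, w, h, cls, conf))
    return detections

print(decode(np.random.rand(S, S, B * 5 + C)))
```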
Since May 18, 2021, Google has been integrating all of its ML cloud services into a single interface and API as part of the publicly accessible Vertex AI, a managed MLOps platform for deploying and serving AI models. Vertex AI merges AutoML and the AI Platform into a single API, client library, and user interface. Users can independently manage data and prototypes, and deploy and interpret their ML models, using Vertex tools: Vizier, Feature Store, Experiments, Continuous Monitoring, and Pipelines.
Vertex AI integrates with many open-source frameworks (TensorFlow, PyTorch, and scikit-learn) and also supports other ML frameworks via custom training and prediction containers. You can also connect BigQuery, use standard SQL queries in existing business intelligence and spreadsheet tools, and export datasets from BigQuery to Vertex AI. Vertex Data Labeling helps you label your data accurately. Learn more and try the new DS platform here: https://cloud.google.com/vertex-ai
Google Cloud
Vertex AI Platform
Enterprise ready, fully-managed, unified AI development platform. Access and utilize Vertex AI Studio, Agent Builder, and 200+ foundation models.
🙌🏻Data cleansing is too hard and takes too long? Try PClean, a new AI system from MIT researchers written in a domain-specific probabilistic programming language for automatic data cleansing. It removes typos, duplicates, missing values, spelling errors, and inconsistencies, making it easier to prepare a dataset for analysis and ML modeling. Notably, PClean does not just mechanically cleanse data: it takes the data's semantics into account using generalized common-sense models for judgment, which can be customized for the specific underlying data and error types.
The idea of probabilistic data cleansing based on declarative, generalized knowledge of the research context is not new; it was published in a 2003 article by researchers at the University of California, Berkeley. PClean develops this idea in line with the trend toward "explainable AI", using realistic models of human knowledge to interpret data. Corrections in PClean are based on Bayesian reasoning, whereby each alternative explanation for ambiguous data is assigned a weight derived from prior knowledge. An additional advantage of PClean is its ability to clean really large amounts of data relatively quickly: for example, in a recent 2021 study on a table with 2.2 million rows of medical data, PClean found over 8,000 errors in just 7.5 hours. Finally, thanks to Bayesian probability, PClean gives calibrated estimates of its uncertainty, which can be manually corrected to further train the AI system.
https://news.mit.edu/2021/system-cleans-messy-data-tables-automatically-0511
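A heavily simplified illustration of the Bayesian idea behind PClean (not PClean's actual language or API): for an ambiguous cell value, weight each candidate correction by its prior plausibility times the likelihood of observing the typo, then normalize to get calibrated posterior probabilities. All names and numbers here are hypothetical.
```python
def best_correction(observed, candidates, prior, typo_likelihood):
    """Pick the correction with the highest posterior probability.

    prior[c]              -> how common value c is in the column
    typo_likelihood(o, c) -> probability of observing o if the truth is c
    """
    scores = {c: prior[c] * typo_likelihood(observed, c) for c in candidates}
    z = sum(scores.values())
    posterior = {c: s / z for c, s in scores.items()}  # calibrated uncertainty
    return max(posterior, key=posterior.get), posterior

# Toy usage: is "Bostn" more likely "Boston" or "Baton Rouge"?
guess, post = best_correction(
    "Bostn",
    ["Boston", "Baton Rouge"],
    prior={"Boston": 0.7, "Baton Rouge": 0.3},
    typo_likelihood=lambda o, c: 0.9 if abs(len(o) - len(c)) <= 1 else 0.01,
)
print(guess, post)
```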
MIT News
New system cleans messy data tables automatically
A new machine learning system from MIT uses probabilistic programming to clean dirty datasets, filling in blank cells accurately and quickly. Because it’s Bayesian, the artificial intelligence system can also tell you how confident it is in its answers.
🎂Reinforcement learning (RL) is great for tasks with a well-defined reward function, as evidenced by the successes of AlphaZero for Go, OpenAI Five for Dota, and AlphaStar for StarCraft. But in practice, it is not always possible to clearly define the reward function. For example, in a simple room-cleaning case, an old business card found under the bed or a used concert ticket may be evaluated as trash, but if they are valuable to the host, they should not be thrown away. And even if you set clear criteria for evaluating the analyzed objects, converting them into rewards is not easy: if you reward the agent every time it collects garbage, it may learn to throw the garbage back out in order to collect it again and be rewarded once more.
Such behavior of the AI system can be prevented by shaping the reward function from feedback on the agent's behavior. But this approach requires a lot of resources: in particular, training the Deep RL Cheetah model from OpenAI Gym and MuJoCo requires about 700+ human comparisons.
Therefore, researchers at the University of California, Berkeley, David Lindner and Rohin Shah, proposed an RL algorithm that forms a reward policy from implicit information, without human supervision or an explicitly assigned function. They named it RLSP (Reward Learning by Simulating the Past), because the reward is learned by simulating the past, based on inferences the agent can draw about human preferences without explicit feedback. The main difficulty with scaling RLSP is how to reason about previous experience when the amount of data is large. The authors propose to sample the most probable past trajectories instead of enumerating them all, alternating between predicting past actions and predicting the past states from which those actions were taken.
The RLSP algorithm uses gradient ascent to continuously update a linear reward function so that it explains the observed state. Scaling this idea is possible through a feature representation of each state and modeling a linear reward function over these features, followed by an approximation of the RLSP gradient obtained by sampling the more likely past trajectories. The gradient pushes the reward function so that backward trajectories (what should have been done in the past) and forward trajectories (what the agent would do using the current reward) are consistent with each other. Once the trajectories are consistent, the gradient becomes zero, and the reward function that most likely produced the observed state is known. The essence of the RLSP algorithm is to perform gradient ascent using this gradient. The algorithm was tested in the MuJoCo simulator, an environment for testing RL algorithms on the problem of training simulated robots to move along an optimal trajectory or in the best possible way. The results showed that RLSP-learned reward policies perform about as well as policies trained directly on the true reward function.
https://bair.berkeley.edu/blog/2021/05/03/rlsp/
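An abstract sketch of the loop described above (schematic Python, not the authors' implementation; the sampler callables are hypothetical): alternate between sampling a plausible past trajectory ending in the observed state and a forward trajectory under the current reward, then take a gradient ascent step on the linear reward weights until the two agree.
```python
import numpy as np

def rlsp_sketch(observed_state, features, sample_past, sample_forward,
                dim, lr=0.1, steps=100):
    """Schematic RLSP: learn linear reward weights w from one observed state.

    features(traj)    -> feature vector accumulated along a trajectory
    sample_past(s, w) -> a likely past trajectory ending in state s under w
    sample_forward(w) -> a trajectory the agent would produce under reward w
    """
    w = np.zeros(dim)
    for _ in range(steps):
        phi_past = features(sample_past(observed_state, w))
        phi_fwd = features(sample_forward(w))
        grad = phi_past - phi_fwd   # zero once past and forward trajectories agree
        w += lr * grad              # gradient ascent on the reward weights
    return w

# Toy usage with 2-D features and trivial samplers (purely illustrative).
w = rlsp_sketch(
    observed_state=None,
    features=lambda traj: np.asarray(traj, dtype=float),
    sample_past=lambda s, w: [1.0, 0.0],
    sample_forward=lambda w: 1.0 / (1.0 + np.exp(-w)),  # soft response to w
    dim=2,
)
print(w)
```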
The Berkeley Artificial Intelligence Research Blog
Learning What To Do by Simulating the Past
The BAIR Blog