Big Data Science
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🌷Not only LightGBM and XGBoost: meet a new probabilistic prediction algorithm - Natural Gradient Boosting (NGBoost). Released in 2019, NGBoost uses the natural gradient to address the technical challenges that make generic probabilistic prediction hard with existing gradient boosting methods. The algorithm consists of three abstract modular components: a base learner, a parametric probability distribution, and a scoring rule. All three components are treated as hyperparameters chosen in advance, before training. NGBoost makes it easy to do probabilistic regression with flexible tree-based models. Probabilistic classification, by contrast, has been possible for quite some time, since most classifiers are actually probabilistic in that they return probabilities over each class; logistic regression, for instance, returns class probabilities as output. In that respect NGBoost does not add much new, but experiments on several regression datasets showed that it provides competitive predictive performance on both uncertainty estimates and traditional metrics. On the other hand, its computing time is noticeably longer than that of the other two algorithms, and it lacks some useful options, e.g. early stopping, showing intermediate results, setting a random state seed, and flexibility in choosing the base learner (it works only with decision trees and Ridge regression), and so on. Still, this modular ML algorithm for probabilistic prediction is quite competitive against other popular boosting methods. See more (a minimal usage sketch follows the links):
http://www.51anomaly.org/pdf/NGBOOST.pdf
https://medium.com/@ODSC/using-the-ngboost-algorithm-8d337b753c58
https://towardsdatascience.com/ngboost-explained-comparison-to-lightgbm-and-xgboost-fda510903e53
https://www.groundai.com/project/ngboost-natural-gradient-boosting-for-probabilistic-prediction/1
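Below is a minimal probabilistic-regression sketch, assuming the open-source `ngboost` Python package; the dataset, split, and hyperparameters are illustrative, not taken from the paper.
```python
# Minimal NGBoost sketch: point predictions plus full predictive distributions.
from ngboost import NGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ngb = NGBRegressor(n_estimators=500, learning_rate=0.01)  # illustrative settings
ngb.fit(X_train, y_train)

y_pred = ngb.predict(X_test)    # point estimates, as in any regressor
y_dist = ngb.pred_dist(X_test)  # per-sample predictive (Normal) distributions
print(y_dist.params["loc"][:3], y_dist.params["scale"][:3])
```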
Deep into NGBoost and probabilistic regression: what probabilistic supervised learning is and how to deal with prediction intervals, plus the correct interpretation of this ML algorithm
https://towardsdatascience.com/interpreting-the-probabilistic-predictions-from-ngboost-868d6f3770b2
About tensor holography, which creates real-time 3D holograms for virtual reality, 3D printing, and medical visualization and can run on your smartphone. Meet the new AI method from MIT researchers https://news.mit.edu/2021/3d-holograms-vr-0310
🤓Deep fakes are not that simple: an interview with Belgian VFX specialist Chris Ume, creator of the viral fake Tom Cruise videos. Why an ML algorithm alone is not enough to get a high-quality result, and why the video effects still have to be carefully tuned by hand
https://www.theverge.com/2021/3/5/22314980/tom-cruise-deepfake-tiktok-videos-ai-impersonator-chris-ume-miles-fisher
😜Not only Deep Learning: a new approach to building AI systems that work like the human brain - the sparse coding principle, which applies a series of local functions in synaptic learning rules and reduces the amount of data the NN model has to adjust. The startup Nara Logics, founded by an MIT alumnus, is trying to increase the effectiveness of AI by mimicking the brain's structure and function at the circuit level.
https://news.mit.edu/2021/nara-logics-ai-0312
💥Meet CLIP (Contrastive Language-Image Pre-training) - a new neural net from OpenAI: it can be instructed in natural language to perform a great variety of classification benchmarks without directly optimizing for each benchmark's performance, similar to the "zero-shot" capabilities of GPT-2 and GPT-3. CLIP is based on zero-shot transfer, natural language supervision, and multimodal learning to recognize a wide variety of visual concepts in images and associate them with their names. Read more about where you can use this unique ML model (a minimal usage sketch is below) https://openai.com/blog/clip/
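A minimal zero-shot classification sketch, assuming the open-source `clip` package OpenAI published alongside the blog post; the image file and candidate labels are hypothetical.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)        # hypothetical image
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # candidate labels

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)  # probability of each label for the image
print(probs)
```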
👀 Why modern AI for Computer Vision should have multimodal neurons and how Faceted Feature Visualization raises the accuracy of predictions and classifications. A new paper from OpenAI researchers https://distill.pub/2021/multimodal-neurons/
🤓How to assess the potential effectiveness of medical drugs: DeepBAR, a new method from MIT researchers to calculate the binding affinities between drug candidates and their targets. It is based on GAN models that analyze molecular structures as images
https://news.mit.edu/2021/drug-discovery-binding-affinity-0315
😂If you do not want to study grammar and history, use ML to pass the exams! GPT-3 has done it with U.S. History, Research Methods, Creative Writing, and Law. In 3-20 minutes the neural network was able to mimic human writing in grammar, syntax, and word frequency and received the same feedback as the human writers
https://www.zdnet.com/article/ai-can-write-a-passing-college-paper-in-20-minutes/
👍🏻Deep Learning helps you to be healthy!
For years, physicians have relied on visual inspection to identify suspicious pigmented lesions (SPLs), which can be an indication of skin cancer. Early-stage identification of SPLs can improve melanoma prognosis and significantly reduce treatment cost. But it is not easy to quickly find and prioritize SPLs because of the high volume of pigmented lesions. Researchers from MIT have devised a new AI pipeline that applies deep convolutional neural networks (DCNNs) to SPL analysis through the wide-field photography common in smartphones.
A wide-field image, acquired with a smartphone camera, shows large skin sections of a patient. An automated system detects, extracts, and analyzes all pigmented skin lesions observable in the wide-field image. A pre-trained DCNN model determines the suspiciousness of individual pigmented lesions and marks them: yellow for further inspection, red for referral to a dermatologist. The extracted features are used to further assess pigmented lesions and to display the results as a heatmap.
DCNNs are deep learning algorithms used to classify images and then cluster them, for example for photo search.
https://news.mit.edu/2021/artificial-intelligence-tool-can-help-detect-melanoma-0402
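For illustration only - this is not the MIT pipeline - here is a generic sketch of the kind of DCNN scoring step described above, using a pretrained Keras backbone; the untrained binary "suspicious" head, the threshold, and the input patch are hypothetical stand-ins.
```python
import numpy as np
import tensorflow as tf

# Pretrained convolutional backbone plus a hypothetical, untrained binary head.
base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3), weights="imagenet"
)
model = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])

patch = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a cropped lesion patch
score = float(model.predict(patch, verbose=0)[0, 0])       # "suspiciousness" score in [0, 1]
print("refer to dermatologist" if score > 0.5 else "mark for further inspection")
```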
🏂How AI helps to increase production quality by 25%: Fujitsu's experience
The Japanese tech giant has developed an AI system that highlights abnormalities in the appearance of products to help detect manufacturing issues earlier, before materials are wasted.
An ML model trained on simulated images of products with abnormalities is able to detect different kinds of issues, for example frayed threads on multicolor carpets or defective wiring patterns on electronics parts with different wiring shapes. The model achieves high quality: an AUROC score above 98% for defect detection. The technology was tested at the Fujitsu plant in Nagano, which manufactures electronic equipment. The results showed a 25% reduction in the man-hours needed to inspect printed circuit boards.
https://artificialintelligence-news.com/2021/03/29/fujitsu-develops-ai-product-abnormalities-manufacturing/
🤠In January 2021 OpenAI presented a new ML model, a neural network called DALL·E that creates images from text captions for concepts expressed in natural language. It has 12 billion parameters and is based on GPT-3. DALL·E was trained to generate images from text descriptions, using a dataset of text-image pairs. It can create anthropomorphized versions of animals and objects, combine unrelated concepts in plausible ways, render text, and apply transformations to existing images.
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. A token is any symbol from a discrete vocabulary, e.g. each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
The images are preprocessed to 256x256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE pretrained with a continuous relaxation, which obviates the need for an explicit codebook, EMA loss, or dead code revival. This trick also scales to large vocabulary sizes and allows DALL·E to generate an image from scratch, or to regenerate any rectangular region of an existing image that extends to the bottom-right corner, consistent with the text prompt.
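For clarity, a tiny sketch of the token budget just described (the numbers come from the post itself; nothing else is assumed):
```python
# DALL·E's single input stream: up to 256 BPE text tokens plus a 32x32 grid of image codes.
TEXT_TOKENS = 256                       # max BPE-encoded caption tokens (vocab size 16384)
IMAGE_GRID = 32                         # the discrete VAE compresses a 256x256 image to 32x32 codes
IMAGE_TOKENS = IMAGE_GRID * IMAGE_GRID  # = 1024 image tokens (vocab size 8192)
print(TEXT_TOKENS + IMAGE_TOKENS)       # = 1280 tokens, the full stream length mentioned above
```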
The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP.
https://openai.com/blog/dall-e/
Forwarded from Big Data Science [RU]
An offer from Yandex over a single weekend!
On April 24-25, Yandex is holding Weekend Offer for analysts - an online event where you can go through the interviews and get an offer within one weekend!
To take part in Weekend Offer, you need to solve 2-5 problems on the Yandex.Contest platform. On April 24 there will be two one-hour coding sections, and on April 25 one-hour finals with the teams, where your potential manager will tell you about the service and your role, and may give you one more problem. If there is mutual interest, you will receive an offer that same evening.
Details and registration: https://clck.ru/UAzfb
🙈Pandas is a great Python library for data analysis that every Data Scientist uses. However, this tool does not support multiprocessing and is rather slow on large datasets. Therefore, for fast Big Data processing, you should consider Vaex and Dask (a small Dask sketch follows the list).
👍🏻Dask https://dask.org/ is a Pandas-based data analysis library with parallel computing and scalable performance. In addition to Pandas, it also integrates with the Numpy and Scikit-learn libraries, providing easy switching between them through Python APIs and data structures.
👍🏻Vaex https://vaex.io/docs/index.html is a high-performance Python library for lazy evaluations with dataframes similar to Pandas, as well as Big Data visualization and aggregation. It allows you to calculate basic statistics at a billion rows per second, but unlike Dask, it is not fully integrated with other libraries.
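A minimal Dask sketch, assuming a set of CSV files that would not fit in memory; the file pattern and column names are hypothetical.
```python
import dask.dataframe as dd

# Lazily reads many CSV files in parallel (the pattern and columns are made up).
df = dd.read_csv("events-*.csv")
result = df.groupby("user_id")["amount"].sum()  # built exactly like a pandas expression
print(result.compute())                         # .compute() triggers the parallel computation
```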
🤜🏻How to eliminate missing values in datasets for ML: 3 easy-to-use Python functions (a combined usage sketch follows the list)
fillna() - a pandas function used to fill null (NA/NaN) values. It returns an object in which the null/missing values are filled. Series.fillna(value=None, method=None, axis=None, inplace=False, **kwargs)
dropna() - a function to remove or drop null values from the data in different ways. It analyzes the data and drops the rows/columns that contain missing/NaN values: axis=0 drops rows with missing values, axis=1 drops columns. DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
interpolate() - a function that fills missing/NaN values using different interpolation techniques. DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
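A combined sketch of the three calls on a toy DataFrame (the values are made up):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [np.nan, 2.0, 3.0, 4.0]})

filled = df.fillna(value=0)                     # replace every NaN with a constant
dropped = df.dropna(axis=0, how="any")          # drop rows that contain any NaN
interpolated = df.interpolate(method="linear")  # fill NaN by linear interpolation along each column
print(filled, dropped, interpolated, sep="\n\n")
```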
🙌🏻6 types of RNN to model sequential data
Why modeling sequential data is not so easy and how to solve this task with recurrent neural networks: a lot of math and visual illustrations, plus a detailed example of an RNN implementation in Keras/TensorFlow and Python.
https://neptune.ai/blog/recurrent-neural-network-guide
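Not the article's code - just a minimal Keras sketch of a recurrent model for sequence classification, with made-up shapes and random data:
```python
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 20, 8).astype("float32")  # 100 sequences, 20 timesteps, 8 features
y = np.random.randint(0, 2, size=(100,))          # binary labels

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),   # recurrent layer over the timesteps
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sequence-level prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```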
🥁How to understand and develop AI by searching for and highlighting representative scenarios: the Bayes-TrEx tool from MIT researchers
The accuracy of ML results is not enough to use them with high assurance in every field. Focusing only on raw accuracy can lead to dangerous oversights: a model can make mistakes with very high confidence when it encounters something previously unseen, such as a self-driving car seeing a new type of traffic sign. To enable better human-AI interaction, a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory has created a new tool called Bayes-TrEx. It allows developers and users to increase the transparency of their AI model, specifically by finding concrete examples that lead to a particular behavior. The method makes use of Bayesian posterior inference, a widely used mathematical framework for reasoning about model uncertainty.
In experiments, the researchers applied Bayes-TrEx to several image-based datasets and found new insights that were previously overlooked by standard evaluations focusing solely on prediction accuracy. It can be used in medical diagnosis, autonomous driving systems, robotics and so on. Bayes-TrEx can help address such novel situations ahead of time and enable developers to correct undesirable outcomes before potential tragedies occur or resources are wasted.
https://news.mit.edu/2021/more-transparency-understanding-machine-behaviors-bayes-trex-0322
🎯How to remove duplicates in a dataset with Apache Spark?
Use the following framework API methods (a PySpark sketch follows the list):
distinct() - the simplest and most frequently used way to remove fully identical duplicate rows from the dataframe
dropDuplicates() - unlike distinct(), which takes no arguments at all, you can pass a subset of columns to dropDuplicates() to remove duplicate records. Therefore, dropDuplicates(Seq<String> colNames) is more suitable when only some of the columns from the original dataset need to be considered.
reduceByKey() - returns a new RDD - a distributed dataset of key-value pairs (K, V), in which all values for one key are combined into a tuple: the key and the result of the reduce function over all values associated with this key. This method of removing duplicates is limited by the size of a Scala tuple, which can contain between 2 and 22 elements. That is why you should not use reduceByKey() if the Spark RDD keys or values have more than 22 columns.
collect_set() - a function from the Spark SQL API. It collects and returns a set of unique items. It is not deterministic, since the order of the results depends on the order of the rows, which might change after shuffling, and it is not "real" deduplication. Basically, collect_set() is about rolling up records by executing groupBy() and collecting the unique values of a column for each group.
Also, you can write your own window function to get around the size limitation of Scala tuples: for example, split the RDD by columns, sort them, and filter the values you want.
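A minimal PySpark sketch of the first, second, and fourth approaches above; the sample data and column names are made up.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedup-example").getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", "NY"), (1, "alice", "NY"), (2, "bob", "LA"), (2, "bob", "SF")],
    ["id", "name", "city"],
)

df.distinct().show()                       # drops only fully identical rows (keeps both "bob" rows)
df.dropDuplicates(["id", "name"]).show()   # keeps one row per (id, name) subset
df.groupBy("id").agg(F.collect_set("city").alias("cities")).show()  # unique cities rolled up per id
```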