Big Data Science
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🌷Not only LightGBM and XGBoost: meet a new probabilistic prediction algorithm - Natural Gradient Boosting (NGBoost). Released in 2019, NGBoost uses the natural gradient to address the technical challenges that make generic probabilistic prediction hard with existing gradient boosting methods. The algorithm consists of three abstract modular components: a base learner, a parametric probability distribution, and a scoring rule. All three components are treated as hyperparameters chosen in advance, before training. NGBoost makes it easy to do probabilistic regression with flexible tree-based models. Probabilistic classification, by contrast, has been possible for quite some time, since most classifiers are actually probabilistic in that they return probabilities over each class; logistic regression, for instance, returns class probabilities as output. In that respect NGBoost does not add much new, but experiments on several regression datasets showed that it provides competitive predictive performance on both uncertainty estimates and traditional metrics. On the other hand, its computing time is noticeably longer than that of the other two algorithms, and it lacks some useful options, e.g. early stopping, showing intermediate results, setting a random state seed, and flexibility in choosing the base learner (it works only with decision trees and Ridge regression), and so on. Still, this modular ML algorithm for probabilistic prediction is quite competitive against other popular boosting methods. See more (a minimal usage sketch follows the links):
http://www.51anomaly.org/pdf/NGBOOST.pdf
https://medium.com/@ODSC/using-the-ngboost-algorithm-8d337b753c58
https://towardsdatascience.com/ngboost-explained-comparison-to-lightgbm-and-xgboost-fda510903e53
https://www.groundai.com/project/ngboost-natural-gradient-boosting-for-probabilistic-prediction/1
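Below is a minimal probabilistic-regression sketch, assuming the open-source `ngboost` Python package; the dataset, split, and hyperparameters are illustrative, not taken from the paper.
```python
# Minimal NGBoost sketch: point predictions plus full predictive distributions.
from ngboost import NGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ngb = NGBRegressor(n_estimators=500, learning_rate=0.01)  # illustrative settings
ngb.fit(X_train, y_train)

y_pred = ngb.predict(X_test)    # point estimates, as in any regressor
y_dist = ngb.pred_dist(X_test)  # per-sample predictive (Normal) distributions
print(y_dist.params["loc"][:3], y_dist.params["scale"][:3])
```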
Deep into NGBoost and probabilistic regression: what probabilistic supervised learning is and how to deal with prediction intervals, plus the correct interpretation of this ML algorithm
https://towardsdatascience.com/interpreting-the-probabilistic-predictions-from-ngboost-868d6f3770b2
About tensor holography, which creates real-time 3D holograms for virtual reality, 3D printing, and medical visualization and can run on your smartphone. Meet the new AI method from MIT researchers https://news.mit.edu/2021/3d-holograms-vr-0310
🤓Deep fakes are not that simple: an interview with Belgian VFX specialist Chris Ume, creator of the viral fake Tom Cruise videos. Why an ML algorithm alone is not enough to get a high-quality result, and why the video effects still have to be carefully tuned by hand
https://www.theverge.com/2021/3/5/22314980/tom-cruise-deepfake-tiktok-videos-ai-impersonator-chris-ume-miles-fisher
😜Not only Deep Learning: a new approach to building AI systems that work like the human brain - the sparse coding principle, which applies a series of local functions in synaptic learning rules and reduces the amount of data the NN model has to adjust. The startup Nara Logics, founded by an MIT alumnus, is trying to increase the effectiveness of AI by mimicking the brain's structure and function at the circuit level.
https://news.mit.edu/2021/nara-logics-ai-0312
💥Meet CLIP (Contrastive Language-Image Pre-training) - a new neural net from OpenAI: it can be instructed in natural language to perform a great variety of classification benchmarks without directly optimizing for each benchmark's performance, similar to the "zero-shot" capabilities of GPT-2 and GPT-3. CLIP is based on zero-shot transfer, natural language supervision, and multimodal learning to recognize a wide variety of visual concepts in images and associate them with their names. Read more about where you can use this unique ML model (a minimal usage sketch is below) https://openai.com/blog/clip/
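A minimal zero-shot classification sketch, assuming the open-source `clip` package OpenAI published alongside the blog post; the image file and candidate labels are hypothetical.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)        # hypothetical image
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # candidate labels

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)  # probability of each label for the image
print(probs)
```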
👀 Why modern AI for Computer Vision should have multimodal neurons and how Faceted Feature Visualization raises the accuracy of predictions and classifications. A new paper from OpenAI researchers https://distill.pub/2021/multimodal-neurons/
🤓How to assess the potential effectiveness of medical drugs: DeepBAR, a new method from MIT researchers to calculate the binding affinities between drug candidates and their targets. It is based on GAN models that analyze molecular structures as images
https://news.mit.edu/2021/drug-discovery-binding-affinity-0315
😂If you do not want to study grammar and history, use ML to pass the exams! GPT-3 has done it with U.S. History, Research Methods, Creative Writing, and Law. In 3-20 minutes the neural network was able to mimic human writing in grammar, syntax, and word frequency and received the same feedback as the human writers
https://www.zdnet.com/article/ai-can-write-a-passing-college-paper-in-20-minutes/
👍🏻Deep Learning helps you to be healthy!
For years, physicians have relied on visual inspection to identify suspicious pigmented lesions (SPLs), which can be an indication of skin cancer. Early-stage identification of SPLs can improve melanoma prognosis and significantly reduce treatment cost. But it is not easy to quickly find and prioritize SPLs because of the high volume of pigmented lesions. Researchers from MIT have devised a new AI pipeline that applies deep convolutional neural networks (DCNNs) to SPL analysis through the wide-field photography common in smartphones.
A wide-field image, acquired with a smartphone camera, shows large skin sections of a patient. An automated system detects, extracts, and analyzes all pigmented skin lesions observable in the wide-field image. A pre-trained DCNN model determines the suspiciousness of individual pigmented lesions and marks them: yellow for further inspection, red for referral to a dermatologist. The extracted features are used to further assess pigmented lesions and to display the results as a heatmap.
DCNNs are deep learning algorithms used to classify images and then cluster them, for example for photo search.
https://news.mit.edu/2021/artificial-intelligence-tool-can-help-detect-melanoma-0402
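For illustration only - this is not the MIT pipeline - here is a generic sketch of the kind of DCNN scoring step described above, using a pretrained Keras backbone; the untrained binary "suspicious" head, the threshold, and the input patch are hypothetical stand-ins.
```python
import numpy as np
import tensorflow as tf

# Pretrained convolutional backbone plus a hypothetical, untrained binary head.
base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3), weights="imagenet"
)
model = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])

patch = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a cropped lesion patch
score = float(model.predict(patch, verbose=0)[0, 0])       # "suspiciousness" score in [0, 1]
print("refer to dermatologist" if score > 0.5 else "mark for further inspection")
```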
🏂How AI helps to increase production quality by 25%: Fujitsu's experience
The Japanese tech giant has developed an AI system that highlights abnormalities in the appearance of products to help detect manufacturing issues earlier, before materials are wasted.
An ML model trained on simulated images of products with abnormalities is able to detect different kinds of issues, for example frayed threads on multicolor carpets or defective wiring patterns on electronics parts with different wiring shapes. The model achieves high quality: an AUROC score above 98% for defect detection. The technology was tested at the Fujitsu plant in Nagano, which manufactures electronic equipment. The results showed a 25% reduction in the man-hours needed to inspect printed circuit boards.
https://artificialintelligence-news.com/2021/03/29/fujitsu-develops-ai-product-abnormalities-manufacturing/
🤠In January 2021 OpenAI presented a new ML model, a neural network called DALL·E that creates images from text captions for concepts expressed in natural language. It has 12 billion parameters and is based on GPT-3. DALL·E was trained to generate images from text descriptions, using a dataset of text-image pairs. It can create anthropomorphized versions of animals and objects, combine unrelated concepts in plausible ways, render text, and apply transformations to existing images.
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. A token is any symbol from a discrete vocabulary, e.g. each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
The images are preprocessed to 256x256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE pretrained with a continuous relaxation, which obviates the need for an explicit codebook, EMA loss, or dead code revival. This trick also scales to large vocabulary sizes and allows DALL·E to generate an image from scratch, or to regenerate any rectangular region of an existing image that extends to the bottom-right corner, consistent with the text prompt.
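For clarity, a tiny sketch of the token budget just described (the numbers come from the post itself; nothing else is assumed):
```python
# DALL·E's single input stream: up to 256 BPE text tokens plus a 32x32 grid of image codes.
TEXT_TOKENS = 256                       # max BPE-encoded caption tokens (vocab size 16384)
IMAGE_GRID = 32                         # the discrete VAE compresses a 256x256 image to 32x32 codes
IMAGE_TOKENS = IMAGE_GRID * IMAGE_GRID  # = 1024 image tokens (vocab size 8192)
print(TEXT_TOKENS + IMAGE_TOKENS)       # = 1280 tokens, the full stream length mentioned above
```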
The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP.
https://openai.com/blog/dall-e/
Forwarded from Big Data Science [RU]
An offer from Yandex over a single weekend!
On April 24-25, Yandex is holding Weekend Offer for analysts - an online event where you can go through the interviews and get an offer within one weekend!
To take part in Weekend Offer, you need to solve 2-5 problems on the Yandex.Contest platform. On April 24 there will be two one-hour coding sections, and on April 25 one-hour finals with the teams, where your potential manager will tell you about the service and your role, and may give you one more problem. If there is mutual interest, you will receive an offer that same evening.
Details and registration: https://clck.ru/UAzfb
🙈Pandas is a great Python library for data analysis that every Data Scientist uses. However, this tool does not support multiprocessing and is rather slow on large datasets. Therefore, for fast Big Data processing, you should consider Vaex and Dask (a small Dask sketch follows the list).
👍🏻Dask https://dask.org/ is a Pandas-based data analysis library with parallel computing and scalable performance. In addition to Pandas, it also integrates with the Numpy and Scikit-learn libraries, providing easy switching between them through Python APIs and data structures.
👍🏻Vaex https://vaex.io/docs/index.html is a high-performance Python library for lazy evaluations with dataframes similar to Pandas, as well as Big Data visualization and aggregation. It allows you to calculate basic statistics at a billion rows per second, but unlike Dask, it is not fully integrated with other libraries.
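A minimal Dask sketch, assuming a set of CSV files that would not fit in memory; the file pattern and column names are hypothetical.
```python
import dask.dataframe as dd

# Lazily reads many CSV files in parallel (the pattern and columns are made up).
df = dd.read_csv("events-*.csv")
result = df.groupby("user_id")["amount"].sum()  # built exactly like a pandas expression
print(result.compute())                         # .compute() triggers the parallel computation
```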
🤜🏻How to eliminate missing values in datasets for ML: 3 easy-to-use Python functions (a combined usage sketch follows the list)
fillna() - a pandas function used to fill null (NA/NaN) values. It returns an object in which the null/missing values are filled. Series.fillna(value=None, method=None, axis=None, inplace=False, **kwargs)
dropna() - a function to remove or drop null values from the data in different ways. It analyzes the data and drops the rows/columns that contain missing/NaN values: axis=0 drops rows with missing values, axis=1 drops columns. DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
interpolate() - a function that fills missing/NaN values using different interpolation techniques. DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
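A combined sketch of the three calls on a toy DataFrame (the values are made up):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [np.nan, 2.0, 3.0, 4.0]})

filled = df.fillna(value=0)                     # replace every NaN with a constant
dropped = df.dropna(axis=0, how="any")          # drop rows that contain any NaN
interpolated = df.interpolate(method="linear")  # fill NaN by linear interpolation along each column
print(filled, dropped, interpolated, sep="\n\n")
```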
🙌🏻6 types of RNN to model sequential data
Why modeling sequential data is not so easy and how to solve this task with recurrent neural networks: a lot of math and visual illustrations, plus a detailed example of an RNN implementation in Keras/TensorFlow and Python.
https://neptune.ai/blog/recurrent-neural-network-guide
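Not the article's code - just a minimal Keras sketch of a recurrent model for sequence classification, with made-up shapes and random data:
```python
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 20, 8).astype("float32")  # 100 sequences, 20 timesteps, 8 features
y = np.random.randint(0, 2, size=(100,))          # binary labels

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),   # recurrent layer over the timesteps
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sequence-level prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```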
🥁How to understand and develop AI by searching for and highlighting representative scenarios: the Bayes-TrEx tool from MIT researchers
The accuracy of ML results is not enough to use them with high assurance in every field. Focusing only on raw accuracy can lead to dangerous oversights: a model can make mistakes with very high confidence when it encounters something previously unseen, such as a self-driving car seeing a new type of traffic sign. To enable better human-AI interaction, a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory has created a new tool called Bayes-TrEx. It allows developers and users to increase the transparency of their AI model, specifically by finding concrete examples that lead to a particular behavior. The method makes use of Bayesian posterior inference, a widely used mathematical framework for reasoning about model uncertainty.
In experiments, the researchers applied Bayes-TrEx to several image-based datasets and found new insights that were previously overlooked by standard evaluations focusing solely on prediction accuracy. It can be used in medical diagnosis, autonomous driving systems, robotics and so on. Bayes-TrEx can help address such novel situations ahead of time and enable developers to correct undesirable outcomes before potential tragedies occur or resources are wasted.
https://news.mit.edu/2021/more-transparency-understanding-machine-behaviors-bayes-trex-0322
🎯How to remove duplicates in a dataset with Apache Spark?
Use the following framework API methods (a PySpark sketch follows the list):
distinct() - the simplest and most frequently used way to remove fully identical duplicate rows from the dataframe
dropDuplicates() - unlike distinct(), which takes no arguments at all, you can pass a subset of columns to dropDuplicates() to remove duplicate records. Therefore, dropDuplicates(Seq<String> colNames) is more suitable when only some of the columns from the original dataset need to be considered.
reduceByKey() - returns a new RDD - a distributed dataset of key-value pairs (K, V), in which all values for one key are combined into a tuple: the key and the result of the reduce function over all values associated with this key. This method of removing duplicates is limited by the size of a Scala tuple, which can contain between 2 and 22 elements. That is why you should not use reduceByKey() if the Spark RDD keys or values have more than 22 columns.
collect_set() - a function from the Spark SQL API. It collects and returns a set of unique items. It is not deterministic, since the order of the results depends on the order of the rows, which might change after shuffling, and it is not "real" deduplication. Basically, collect_set() is about rolling up records by executing groupBy() and collecting the unique values of a column for each group.
Also, you can write your own window function to get around the size limitation of Scala tuples: for example, split the RDD by columns, sort them, and filter the values you want.
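A minimal PySpark sketch of the first, second, and fourth approaches above; the sample data and column names are made up.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedup-example").getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", "NY"), (1, "alice", "NY"), (2, "bob", "LA"), (2, "bob", "SF")],
    ["id", "name", "city"],
)

df.distinct().show()                       # drops only fully identical rows (keeps both "bob" rows)
df.dropDuplicates(["id", "name"]).show()   # keeps one row per (id, name) subset
df.groupBy("id").agg(F.collect_set("city").alias("cities")).show()  # unique cities rolled up per id
```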