Big Data Science – Telegram
Big Data Science
3.74K subscribers
65 photos
9 videos
12 files
637 links
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
💥5 YOUTUBE channels for a data engineer from popular DS bloggers
• Ken Jee https://www.youtube.com/c/KenJee1/videos - 183 thousand subscribers and about 200 videos about Data Science, big data engineering, ML and sports analytics;
• Karolina Sowinska https://www.youtube.com/c/KarolinaSowinska/videos - 30+ thousand subscribers and almost 60 great videos about Airflow, AI, ETL and the career of a data engineer;
• Shashank Mishra https://www.youtube.com/c/LearningBridge/video - 40+ thousand subscribers and more than 150 videos about the everyday life of data engineers, DS course reviews, interview recommendations and the personal experience of the author, who has worked at Amazon, McKinsey & Company, PayTm and other large corporations, as well as startups;
• Seattle Data Guy https://www.youtube.com/c/SeattleDataGuy/videos - almost 20 thousand subscribers and more than 100 videos about the soft and hard skills of a data engineer, life hacks for the daily tasks of collecting and aggregating data with Python and other tools, SQL best practices, an introduction to R and much more;
• Andreas Kretz https://www.youtube.com/c/andreaskayy/videos - about 27 thousand subscribers and more than 500 videos about vanilla and proprietary Hadoop, Spark, Kafka, AWS services and other cloud platforms, ETL basics, installation details and practical use of different Big Data technologies, and the specifics of the data engineer profession.
🏸Zingg + TigerGraph combo for deduplication and big data graph analytics
Graph databases, which model relationships natively, are well suited to record disambiguation and entity resolution. TigerGraph, for example, is a powerful graph analytics system, and pairing it with the open-source ML tool Zingg (https://github.com/zinggAI/zingg) lets you find duplicate and ambiguous records even faster.
Imagine that the same person is recorded differently in different systems. That makes it very difficult to analyze the person's behavior, for example to generate a personal marketing offer or enroll them in a loyalty program. Zingg has built-in blocking mechanisms, so pairwise similarity is computed only for selected candidate records rather than for every possible pair. This reduces computation time and helps scale to large datasets. You don't have to link or group records manually: the internal entity resolution framework takes care of that. With Zingg and TigerGraph together, you get simple, scalable entity resolution plus further graph analysis.
https://towardsdatascience.com/entity-resolution-with-tigergraph-add-zingg-to-the-mix-95009471ca02
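The blocking idea from this post can be illustrated in plain Python. This is only a minimal sketch and does not use Zingg's API: the sample records, the blocking key (first letter of the surname plus postcode) and the 0.8 threshold are illustrative assumptions, and difflib's string ratio stands in for Zingg's learned similarity model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records coming from different systems.
RECORDS = [
    {"id": 1, "name": "Jonathan Smith", "zip": "10001"},
    {"id": 2, "name": "Jonathon Smith", "zip": "10001"},
    {"id": 3, "name": "Maria Garcia", "zip": "94105"},
    {"id": 4, "name": "Mariya Garcia", "zip": "94105"},
    {"id": 5, "name": "John Smith", "zip": "60601"},
]


def blocking_key(record):
    """Cheap key: records with different keys are never compared."""
    surname = record["name"].split()[-1].lower()
    return (surname[0], record["zip"])


def candidate_pairs(records):
    """Yield pairs only inside each block instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def similarity(a, b):
    """Stand-in similarity: character-level ratio of the two names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def find_duplicates(records, threshold=0.8):
    """Return id pairs whose names are similar enough within a block."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(records)
        if similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    print(find_duplicates(RECORDS))
```

With five records there are ten possible pairs, but this blocking key leaves only two candidate pairs to score. The trade-off is that a poorly chosen key can separate true duplicates into different blocks.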
LaMDA: Safe, Grounded, and High-Quality Dialog Model from Google AI
LaMDA is created by fine-tuning a family of dialogue-specific Transformer-based neural language models with up to 137B parameters and teaching the models to use external knowledge sources. LaMDA has three key goals:
• Quality, measured in terms of Sensibleness, Specificity, and Interestingness, all rated by human evaluators. Sensibleness indicates whether a response makes sense in the context of the dialogue, for example, that it is not absurd and does not contradict earlier answers. Specificity indicates whether the response is specific to the preceding dialogue context rather than generic. Interestingness measures the emotional reaction of the interlocutor to the model's answers.
• Safety, so that the model's responses do not contain offensive or dangerous statements.
• Groundedness. Modern language models often generate statements that seem plausible but in fact contradict facts established in external sources. Groundedness is defined as the percentage of responses with claims about the outside world that can be supported by reputable external sources, as a share of all responses that make such claims. A related metric, Informativeness, is defined as the percentage of responses with information about the outside world that can be supported by known sources, as a share of all responses.
LaMDA models undergo two-stage training: pre-training and fine-tuning. The first stage was performed on a data set of 1.56T words from publicly available dialogue data and public web documents. After tokenizing the data set into 2.81T tokens, the model was trained to predict each next token in a sentence, given the previous ones. The pre-trained LaMDA model has also been widely used for NLP research at Google, including program synthesis, zero-shot learning, and more.
In the fine-tuning phase, LaMDA is trained on a mix of generative tasks, which produce natural language responses to given contexts, and classification tasks, which determine whether a response is safe and of high quality. The result is a single multitasking model: the LaMDA generator is trained to predict the next token in the dialogue data set, and the classifiers are trained to predict the safety and quality scores of a response in context using annotated data.
The test results showed that fine-tuned LaMDA significantly outperforms the pre-trained model in every dimension and at every model size. Quality metrics improve as the number of model parameters increases, with or without fine-tuning. Safety does not improve from scaling the model alone, but it does improve with fine-tuning. Groundedness improves as the model grows, because larger models have more capacity to memorize uncommon knowledge, and fine-tuning lets the model access external sources and effectively shift part of the burden of remembering knowledge to them. Fine-tuning narrows the gap to human-level quality, although the model still remains below human level in safety and Groundedness.
https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html
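The two related metrics differ mainly in their denominator. A minimal sketch, assuming (as in the linked Google AI post) that Groundedness is computed over responses that make claims about the outside world, while Informativeness is computed over all responses. The field names makes_claim and supported and the four sample annotations are hypothetical.

```python
# Hypothetical human annotations for four model responses.
SAMPLE = [
    {"makes_claim": True, "supported": True},
    {"makes_claim": True, "supported": True},
    {"makes_claim": True, "supported": False},
    {"makes_claim": False, "supported": False},
]


def groundedness(responses):
    """Share of supported responses among those that make external claims."""
    with_claims = [r for r in responses if r["makes_claim"]]
    if not with_claims:
        return 0.0
    return sum(r["supported"] for r in with_claims) / len(with_claims)


def informativeness(responses):
    """Share of all responses that carry supported external information."""
    if not responses:
        return 0.0
    informative = sum(r["makes_claim"] and r["supported"] for r in responses)
    return informative / len(responses)


if __name__ == "__main__":
    print(groundedness(SAMPLE), informativeness(SAMPLE))
```

On this sample, two of the three claim-bearing responses are supported, while only two of all four responses are informative, so the two metrics diverge.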
Forwarded from Big Data Science [RU]
Comparison of LaMDA metrics with human ratings
👀Upscaling video games with NVIDIA's DLDSR
DLDSR (Deep Learning Dynamic Super Resolution) is a video game image enhancement technology that uses a multilayer neural network and therefore needs fewer input pixels. DLDSR at 2.25X is comparable in quality to 4X resolution with the previous-generation DSR technology. At the same time, DLDSR performs much better thanks to the tensor cores of RTX video cards, which speed up neural networks several times over. You can try DLDSR on your gaming computer by updating the video card driver and choosing the desired settings.
https://www.rockpapershotgun.com/nvidias-deep-learning-dynamic-super-resolution-tech-is-out-now-heres-how-to-enable-it
🌦TOP-10 Data Science conferences in February 2022:
1. 02 Feb - Virtual conference DataOps Unleashed https://dataopsunleashed.com/
2. 03 Feb - Beyond Big Data: AI/Machine Learning Summit 2022, Pittsburgh, USA https://www.pghtech.org/events/BeyondBigData2022
3. 10 Feb - Online-summit AICamp ML Data Engineering https://www.aicamp.ai/event/eventdetails/W2022021009
4. 12-13 Feb - IAET International Conference on Machine Learning, Smart & Nanomaterials, Design Engineering, Information Technology & Signal Processing. Budapest, Hungary https://institute-aet.com/mns-22/
5. 16 Feb - DSS Hybrid Miami: AI & ML in the Enterprise. Miami, FL, USA & Virtual https://www.datascience.salon/miami/
6. 17-18 Feb - RE.WORK San Francisco, CA, USA and Online
Reinforcement Learning Summit: https://www.re-work.co/events/reinforcement-learning-summit-2022
Deep Learning Summit: https://www.re-work.co/events/deep-learning-summit-2022
Enterprise AI Summit: https://www.re-work.co/events/enterprise-ai-summit-2022
7. 18-20 Feb - International Conference on Compute and Data Analysis (ICCDA 2022). Sanya, China http://iccda.org/
8. 21-25 Feb - WSDM'22, The 15th ACM International WSDM Conference. Online. http://www.wsdm-conference.org/2022/
9. 22-23 Feb - AI & ML Developers Conference. Virtual. https://cnvrg.io/mlcon
10. 26-27 Feb - 9th International Conference on Data Mining and Database (DMDB 2022). Vancouver, Canada https://ccseit2022.org/dmdb/
🚗Yandex Courier Robots in Seoul
As early as last year, Yandex's autonomous courier robots were delivering orders in Russia and restaurant food in the US city of Ann Arbor, Michigan, as well as on other US student campuses. In January 2022, Yandex signed an agreement of intent with KT Corporation, a large South Korean telecommunications company, for delivery by autonomous robots in Seoul. So, as early as this year, South Korea is set to become the first country in East Asia where Yandex rovers operate. The company is also preparing to launch this technology in Dubai.
http://www.koreaherald.com/view.php?ud=20220118000709
🎂Terality - super fast serverless engine instead of slow Pandas
Terality is a serverless data processing engine that runs on giant clusters and works with datasets of any size. Thanks to the serverless paradigm, you don't have to worry about scaling cluster resources or other infrastructure: there are practically no limits on memory, and therefore on dataset size. All you need is a good internet connection to handle hundreds of GB, even on a simple office laptop with 4 GB of RAM. Terality lets you run Pandas code 10 times faster, and its syntax mirrors that of Pandas, so switching takes one line of code: import terality as te. When you call Pandas functions, the Python package sends HTTPS requests to the Terality engine, which processes the data and the command and returns the result. Note that Terality is not just a Python package but freemium software with a free 1 TB plan, and the quota counts the data processed by every API call, not just data reads.
https://docs.terality.com/
https://towardsdatascience.com/good-bye-pandas-meet-terality-its-evil-twin-with-identical-syntax-455b42f33a6d
🥁🥁Undouble - Python library for detecting duplicate images using hash functions
Finding identical or similar photos manually is a long and tedious task. It cannot be solved simply by comparing file size and file name, because photos come from different sources (mobile devices, social networking applications, etc.), which leads to differences in these attributes as well as in resolution, scaling, compression, and brightness. Image hash functions are ideal for detecting identical and similar photos because they are robust to minor changes. This idea is the basis of Undouble, a Python library that works through multi-stage image preprocessing (grayscale, normalization and scaling), image hash calculation, and image grouping. A threshold of 0 groups only images with an identical image hash. The results can be easily examined with the plotting function, and the images can be moved with the move function. When moving images, the image with the highest resolution in the group is copied, and all other images are moved to the "undouble" subdirectory.
To try this open source library (https://github.com/erdogant/undouble), first install it: pip install undouble, then import the package into your project: from undouble import Undouble. After you set the hash method and hash size, undouble can detect duplicates. It performs the following steps: recursively reading all images with the specified extensions from the directory, computing the hashes, and grouping similar images.
See an example with explanations here:
https://towardsdatascience.com/detection-of-duplicate-images-using-image-hash-functions-4d9c53f04a75
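The hash-and-group idea can be sketched without any image library. This is not Undouble's implementation: it assumes the preprocessing is already done, represents each "image" as a tiny hand-made grid of grayscale values, and uses a simple average hash as a stand-in for the library's hash methods. The file names and pixel values are made up.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


def group_images(images, threshold=0):
    """Greedy grouping: an image joins the first group whose
    representative hash is within `threshold` differing bits."""
    groups = []  # list of (representative_hash, [names])
    for name, pixels in images.items():
        image_hash = average_hash(pixels)
        for representative, names in groups:
            if hamming(representative, image_hash) <= threshold:
                names.append(name)
                break
        else:
            groups.append((image_hash, [name]))
    return [names for _, names in groups]


# Hypothetical 2x2 grayscale "images" (values 0-255).
IMAGES = {
    "photo.jpg": [[10, 200], [220, 30]],
    "photo_brighter.jpg": [[30, 220], [240, 50]],  # same scene, +20 brightness
    "other.jpg": [[200, 10], [30, 220]],
}

if __name__ == "__main__":
    print(group_images(IMAGES, threshold=0))
```

Because the hash compares each pixel with the image's own mean, a uniform brightness shift leaves the hash unchanged, so the brighter copy lands in the same group even at threshold 0.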
😜How to create an exe file from a py script
Although you can run a Python script in a terminal or text editor, sometimes you need to hide all the code of a py file by wrapping it in an executable (.exe) file, for example, to schedule a job that runs the executable at a specific time. This can be done in 2 ways:
• via the GUI of the auto-py-to-exe package (https://pypi.org/project/auto-py-to-exe/), which must first be installed with pip install auto-py-to-exe, then launched, after which you follow 4 steps in sequence;
• in the terminal with the PyInstaller library, which also has to be installed first: pip install pyinstaller. Then go to the directory with the desired py file and build an executable from it with the command pyinstaller --onefile name_of_script.py
In fact, the first method is a GUI wrapper around the second. Those who are comfortable with the CLI can use PyInstaller directly, without any additional wrappers.
https://towardsdatascience.com/how-to-easily-convert-a-python-noscript-to-an-executable-file-exe-4966e253c7e9
😜Neural networks for selfies on Google Pixel 6: accurate alpha matting in portrait mode
Image matting is the process of extracting a precise alpha matte that separates the foreground and background objects of an image. It is needed not only by professional designers producing advertising photos but has also become popular entertainment for smartphone users. Want to send friends a selfie with the Eiffel Tower in the background while sitting in a room with grandma's carpet? Easy with Google Pixel 6: a convolutional neural network built from a sequence of encoder-decoder blocks progressively estimates a high-quality alpha matte and preserves all the details, including fine hairs.
The input RGB image is combined with a coarse alpha matte (generated by a low-resolution people segmenter) and passed to the network. The new Portrait Matting model uses a MobileNetV3 backbone and a shallow decoder with few layers to first predict a refined low-resolution alpha matte. Then a shallow encoder-decoder and a series of residual blocks process the high-resolution image together with the refined alpha matte from the previous step. The shallow encoder-decoder relies more on lower-level features than the MobileNetV3 backbone does, focusing on high-resolution structural features to predict the final transparency value of each pixel. This way the model can refine the initial foreground alpha matte and accurately extract very fine details. The architecture runs efficiently on Pixel 6 using TensorFlow Lite. The ML model is also trained on a variety of datasets that cover a wide range of skin tones and hairstyles.
https://ai.googleblog.com/2022/01/accurate-alpha-matting-for-portrait.html
🍏Clustimage - Python library for image clustering
Unsupervised clustering in image recognition is a multi-step process. It includes preprocessing, feature extraction, similarity clustering, and estimation of the optimal number of clusters using a quality measure. All of these steps are implemented in the Clustimage Python package, which takes only paths or raw pixel values as input.
The goal of clustimage is to detect natural groups or clusters of images using the silhouette and dbindex evaluation methods and their derivatives, in combination with clustering methods (agglomerative, kmeans, dbscan and hdbscan). Clustimage helps you determine the most robust clustering by efficiently searching the parameter space and evaluating the resulting clusters. In addition to image clustering, the model can also be used to find the images most similar to a new, unseen sample.
To try this open source library (https://github.com/erdogant/clustimage), you first need to install it: pip install clustimage, then import the package into your project: from clustimage import Clustimage.
See an example with explanations here: https://towardsdatascience.com/a-step-by-step-guide-for-clustering-images-4b45f9906128
👍🏻Don't like documenting code? Give it to the AI!
AI Doc Writer is a VS Code extension that documents code using AI. Simply select the necessary lines of code in the editor and press Cmd/Ctrl + . (period). AI Doc Writer will create a short description of each function and its parameters. The tool, published by Mintlify on the Visual Studio Marketplace (https://marketplace.visualstudio.com/items?itemName=mintlify.document), supports Python, JavaScript, TypeScript, PHP and Java, as well as JSX and TSX files. Of course, this is not software documentation in the sense a customer would understand it, but clear comments do make the code more maintainable and readable. More examples:
https://betterprogramming.pub/too-lazy-to-write-documentation-let-the-ai-write-it-for-you-8574f7cd11b2
🔥Hide a spicy photo from strangers? Easy with nudity detection API
The nudity detector from DeepAI (https://deepai.org/machine-learning-model/nsfw-detector) evaluates an image and estimates the likelihood that it shows areas of the human body that are usually covered by clothing. The check does not return a fixed label: it returns a confidence score. The user can set their own threshold in their app for what counts as nudity, and the detection algorithm returns the percentage chance that the image contains nudity. This ML system is applicable not only to photos but also to videos: neural networks analyze the video stream and give probabilistic feedback on the “adultness” of the content.
See an example of using the API here: https://medium.com/mlearning-ai/i-tried-a-python-nude-detector-with-my-photo-446dba1bbfc8
😎3 Face Recognition ML Services APIs: Choose What You Need
• IBM Watson Visual Recognition API identifies scenes, objects, and faces in images uploaded to the service. It can process large volumes of unstructured data and is suitable as a decision support system. However, it is expensive to maintain and does not process structured data directly. Its facial recognition does not support general biometric recognition, and the maximum image size is 10 MB, with a minimum recommended density of 32x32 ppi. It suits image classification with built-in classifiers and also lets you create your own classifiers and train ML models. https://www.ibm.com/watson
• Kairos Face Recognition API lets developers of ML applications add face recognition to their applications by writing just a few lines of code. It shows high accuracy in real-life scenarios and performs well in low light and with partially hidden faces. It takes an ethical approach to identifying individuals that accounts for diversity. It is an extensible tool: users can apply additional intelligence to work with real-world video and photos. It suits large volumes of images and ensures confidentiality through secure storage of collected data and regular audits. However, it supports only BMP, JPG, and PNG files (GIF is not supported) and is slightly slower than the AWS API. https://www.kairos.com/docs/getting-started-with-kairos-face-recognition
• Microsoft Computer Vision API in Azure gives developers access to advanced image processing algorithms. Once an image is uploaded or its URL is specified, the algorithms analyze its visual content in various ways, depending on the user's choice. An added benefit of this fast API is its visual guides, tutorials, and examples. The SLA guarantees at least 99.9% availability. Thanks to tight integration with other Microsoft Azure cloud services, the APIs can be packaged into a complete solution. However, if the transactions-per-second limit is exceeded, requests are throttled down to the agreed limit. The pricing model is demand-driven, so the service can become expensive if the number of requests spikes. The API is great for classifying images with objects, creatures, scenery, and activities, including their identification, categorization, and tagging. It supports face, mood, age and scene recognition, as well as optical character recognition to detect text in images. It also provides intelligent photo management and restricts the display of moderated content. https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
👍🏻Sentiment analysis in social networks in Python with VADER without developing an ML model
Not every classification problem needs machine learning models: sometimes even simple approaches give excellent results. For example, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis model. The project source code is available on GitHub under the MIT license: https://github.com/cjhutto/vaderSentiment
VADER efficiently handles informal vocabulary, abbreviations, capital letters, repeated punctuation marks, emoticons (😢, 😃, 😭) and other devices commonly used in social networks to express sentiment, which makes it an excellent text sentiment tool. Its advantage is that it evaluates the mood of any text without prior training of ML models. The result generated by VADER is a dictionary with 4 keys: neg, neu, pos and compound.
• neg, neu and pos are the negative, neutral and positive proportions of the text respectively. Their sum should equal 1, or be close to it because of floating-point arithmetic.
• compound is the sum of the valence scores of each word found in the lexicon, normalized to a single number. Unlike the previous three, it expresses the overall degree of sentiment rather than a proportion. Its value ranges from -1 (the strongest negative mood) to +1 (the strongest positive mood). The compound score alone may be enough to determine the main tone of a text: compound ≥ 0.05 means positive mood, compound ≤ -0.05 means negative mood, and values between -0.05 and 0.05 mean neutral mood.
Try it in Google Colab: https://colab.research.google.com/drive/1_Y7LhR6t0Czsk3UOS3BC7quKDFnULlZG?usp=sharing
Example: https://towardsdatascience.com/social-media-sentiment-analysis-in-python-with-vader-no-training-required-4bc6a21e87b8
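The thresholds above translate into a few lines of Python. A minimal sketch: the function name label_sentiment and the sample score dictionaries are assumptions made for illustration, not real VADER output.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a label via its compound score."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical score dictionaries in the shape described above.
SAMPLES = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.65},
    {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.8},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

if __name__ == "__main__":
    for scores in SAMPLES:
        print(scores["compound"], label_sentiment(scores))
```

With the vaderSentiment package installed, SentimentIntensityAnalyzer().polarity_scores(text) returns a dictionary of this shape that can be passed to the function directly.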
#test
To design a real estate rental package around the most in-demand rental period (in days), which statistic of rental days should we take from the rental demand dataset?
Anonymous Quiz
• median: 38%
• mode: 52%
• max: 8%
• min: 0%
• average: 2%
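The difference between these statistics is easy to check on a small sample. A minimal sketch with made-up rental durations (the numbers are illustrative and not taken from any real dataset): the mode is the single most frequently requested duration, while the median and the mean need not match any popular duration at all.

```python
from statistics import mean, median, mode

# Hypothetical rental durations in days, one value per booking.
rental_days = [1, 2, 2, 3, 3, 7, 7, 7, 7, 30]

most_demanded = mode(rental_days)   # the most frequent duration
middle_value = median(rental_days)  # the middle of the sorted sample
average_value = mean(rental_days)   # pulled upward by the 30-day outlier

if __name__ == "__main__":
    print(most_demanded, middle_value, average_value)
```

Here the weekly rental is booked most often, so the mode picks it, whereas the median falls between two durations and the mean is skewed by the single long booking.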
🦋Yandex DataLens: Lightweight BI from Yandex.Cloud
Yandex DataLens is a free data visualization and analysis service. Main features:
• many data sources: ClickHouse, PostgreSQL, Greenplum, MySQL, CSV files, Google spreadsheets, Metrica and AppMetrica in direct access mode;
• diagrams, tables and data access UI elements for building dashboards;
• support for geodata and integration with maps;
• easy creation of the necessary dashboards without deep knowledge in DS;
• all documentation in Russian and a lot of understandable demos.
Service: https://cloud.yandex.ru/services/datalens
Documentation: https://cloud.yandex.ru/docs/datalens/quickstart
💥Fusion plasma control with DL
Solving the global energy crisis requires sources of clean, limitless energy. One candidate is nuclear fusion, the process that powers the stars. On Earth, it can be reproduced by smashing and fusing hydrogen nuclei under extreme conditions in a tokamak, a vacuum vessel surrounded by magnetic coils that holds a plasma hotter than the core of the Sun. Keeping the device in its operating regime is very difficult: the control system must coordinate the many magnetic coils and adjust the voltage on them thousands of times per second so that the plasma never touches the walls of the vessel, which would lead to heat loss and possibly damage. Deep reinforcement learning has been successfully applied to this problem to create controllers that keep the plasma stable and accurately shape it into various configurations.
Existing plasma control systems are complex and require a separate controller for each of the magnetic coils. Each controller uses algorithms to estimate plasma properties in real time and adjust the magnet voltages accordingly. The architecture from DeepMind and the Swiss Plasma Center uses a single neural network to control all coils simultaneously, automatically learning which voltages best achieve a given plasma configuration directly from sensor readings.
https://deepmind.com/blog/article/Accelerating-fusion-science-through-learned-plasma-control
🤦🏼‍♀️Biopass - REST API of a SaaS product for face recognition
Biopass is a biometric data processing and artificial intelligence platform for building ID products. The BioPass ID RESTful API lets developers enroll, manage, and verify people, match and extract biometric data, compress and decompress fingerprint images, detect faces, detect fake faces, anonymize faces, and perform quality checks.
BioPass ID is an online cloud service that provides powerful multi-biometric and artificial intelligence technology for developing any Internet-enabled service, software or platform. As a SaaS product that offers Biometrics as a Service, BioPass ID supports any programming language, sensor model, camera or platform, enabling fast and easy implementation and system integration.
Images are common parameters in BioPass ID requests. To send them in API requests, you need to encode them as base64 strings. If a value is not a valid base64 string, the call returns a bad request response with the message "Invalid JSON format".
https://www.biopassid.com/
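The base64 requirement can be met with the Python standard library. A minimal sketch: the JSON field name "image" and the helper build_image_payload are hypothetical and are not taken from the BioPass ID documentation, and the byte string is a placeholder rather than a real picture. No endpoint is called here.

```python
import base64
import json


def build_image_payload(image_bytes, field="image"):
    """Encode raw image bytes as a base64 string inside a JSON body.

    The field name is an assumption for illustration only.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({field: encoded})


# Placeholder bytes standing in for the contents of an image file.
FAKE_IMAGE = b"\x89PNG\r\n\x1a\n placeholder image bytes"

if __name__ == "__main__":
    print(build_image_payload(FAKE_IMAGE))
```

In practice the bytes would come from reading the image file in binary mode, and the resulting JSON string would be sent as the request body.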