Big Data Science – Telegram
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🚀Data Science in the City: we continue Citymobil's series of meetups on Data Science in geo-services, logistics, Smart City applications, and more. Join the 2nd online meetup on September 23 at 18:00 MSK. Expect talks from DS practitioners at Citymobil, Optimate AI, and Yandex.Routing:
🚕Maxim Shalankin (Data Scientist in Citymobil's geo-service) will talk about the life cycle of an ML model that predicts travel time under heavy load
🚚Sergey Sviridov (CTO at Optimate AI) will explain what is wrong with classical heuristics and combinatorial optimization methods for building optimal routes, and how they can be replaced with dynamic programming
🚛Daniil Tararukhin (Head of the Analytics Group at Yandex.Routing) will share how traffic jams affect the search for an optimal route and the simulation modeling of this problem.
After the talks, the speakers will answer questions from the audience.
Host: Alexey Chernobrovov🛸
Free registration: https://citymobil.timepad.ru/event/1773649/
🗣4 best practices for using the Google Cloud Translation API more efficiently
The Google Cloud Translation API is a web service that dynamically translates between languages using Google ML models. It supports over 100 languages and is actively used in practice. A few useful practices can reduce costs, increase performance, and improve security when you use this API on websites.
1. Cache translated content. Caching not only reduces the number of calls to the Google Cloud Translation API, but also reduces the load and compute usage on internal web servers and databases. This improves application performance and lowers delivery costs. You can configure caching at different levels of the application architecture: at the proxy level (NGINX or HAProxy), in memory on the web servers, in an external in-memory caching service, or through a CDN.
2. Secure access based on the principle of least privilege. When accessing the Google Cloud Translation API, it is recommended that you use a Google Cloud service account rather than API keys. A service account is a special type of account that represents a non-human user and can be authorized to access data in Google APIs. Service accounts are not assigned passwords and cannot be used to log in through a browser, which removes this threat vector. Following the principle of least privilege, grant the service account a role with only the permissions it needs to access the translation API.
3. Customize translations. If your content includes domain-specific or context-specific terms, Google Cloud Translation API Advanced supports custom terminology through a glossary. You can also create and use your own translation models with Google AutoML Translation. Help users understand the risk of errors and inaccuracies by telling them that the content has been automatically translated by Google.
4. Control the budget. The costs associated with the Google Cloud Translation API mainly depend on the number of characters sent to the API. For example, at $10 per million characters, if a web page contains 20 million characters and needs to be translated into 10 languages, the cost would be $10 * 20 * 10 = $2,000. Setting up budget alerts will help you keep track of your spending.
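The caching and budget points can be sketched together in a few lines of Python. This is only an illustration: call_translation_api is a hypothetical stand-in that tags the text instead of calling Google, and the $10 rate is the illustrative figure from the example, not a quoted price.

```python
from functools import lru_cache

PRICE_PER_MILLION_CHARS = 10.0  # USD, illustrative rate (assumption)

# Mutable counter of characters actually sent to the stand-in API.
usage = {"billed_chars": 0}


def estimate_cost(chars_per_page: int, languages: int,
                  price_per_million: float = PRICE_PER_MILLION_CHARS) -> float:
    """Estimated cost of translating one page into several languages."""
    total_chars = chars_per_page * languages
    return total_chars / 1_000_000 * price_per_million


def call_translation_api(text: str, target: str) -> str:
    """Hypothetical stand-in for a real API call; it only tags the text."""
    usage["billed_chars"] += len(text)
    return f"[{target}] {text}"


@lru_cache(maxsize=10_000)
def translate_cached(text: str, target: str) -> str:
    """Repeated (text, target) pairs are served from memory, not the API."""
    return call_translation_api(text, target)


print(estimate_cost(20_000_000, 10))  # 2000.0
translate_cached("Hello", "de")
translate_cached("Hello", "de")       # cache hit, nothing is billed again
print(usage["billed_chars"])          # 5
```

An in-process cache like this only helps a single web server; a shared cache or a CDN, as listed in point 1, would be the next step for several servers.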
https://cloud.google.com/blog/products/ai-machine-learning/four-best-practices-for-translating-your-website
🍏3 useful Python libraries for Data Scientist
JMESPath - a library for querying JSON. It is useful when working with a large multi-level JSON document or dictionary. JMESPath lets you reach nested values with a JavaScript-style path expression, which makes code easier to develop and test. It is also safe: if any part of the path does not exist, the JMESPath search function returns None. https://github.com/jmespath/jmespath.py
Inflection - a port of the Ruby on Rails inflector that helps with string processing logic. It converts English words between singular and plural forms and converts strings from CamelCase to underscore style. This is useful when variable or data point names generated in another language or on another system need to be converted to Pythonic style in accordance with PEP 8. https://github.com/jpvanhal/inflection
more-itertools - a library of additional functions for working with iterables that come in handy in various development tasks. For example, it lets you write short, readable code to split a list of dictionaries into groups based on a common repeating key, or to loop through several lists at once. It also includes helpers for chunking, windowing, and flattening iterables. https://github.com/more-itertools/more-itertools
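To show the kind of task each library takes off your hands, here is a standard-library-only sketch. The functions search_path, to_snake_case, and chunked below are hypothetical approximations written for this illustration; they are not the libraries' implementations. The real entry points are jmespath.search, inflection.underscore, and more_itertools.chunked, which cover far more cases.

```python
import re
from typing import Any, List, Optional, Sequence


def search_path(path: str, data: Any) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_snake_case(name: str) -> str:
    """Convert a simple CamelCase name to snake_case."""
    # Insert "_" before an uppercase letter that follows a lowercase letter or digit.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.lower()


def chunked(items: Sequence[Any], size: int) -> List[Sequence[Any]]:
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


doc = {"user": {"address": {"city": "Porto"}}}
print(search_path("user.address.city", doc))  # Porto
print(search_path("user.phone.mobile", doc))  # None
print(to_snake_case("DeviceType"))            # device_type
print(chunked([1, 2, 3, 4, 5], 2))            # [[1, 2], [3, 4], [5]]
```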
👀How to evaluate the quality of a multi-object ML model of computer vision?
Tracking multiple objects in a real-world environment is challenging, and so is choosing metrics for the ML model, which must capture both how accurately objects are located and how consistently the trajectory of each moving object is followed. Suppose that for each frame in the video stream the tracking system outputs n hypotheses, and there are m ground-truth objects in the frame. Then the evaluation process is as follows:
• Find the best match between hypotheses and ground-truth objects based on their coordinates, using a matching algorithm.
• For each matched pair, find the error in the position of the object.
• Count the errors of each type: misses (the tracker produced no hypothesis for an object), false positives (the tracker produced a hypothesis, but no object was there) and mismatch errors (the hypothesis matched to a ground-truth object changed compared with previous frames).
The performance of the ML model can then be expressed in two metrics:
MOTP (Multi-Object Tracking Precision) shows how precisely object positions are estimated. It is the total position error over all matched ground truth-hypothesis pairs across all frames, divided by the total number of matches. This metric says nothing about recognizing object configurations or keeping trajectories consistent. With the error normalized to the range from 0 to 1, a value close to 0 means positions are estimated well, and a value close to 1 means they are estimated poorly.
MOTA (Multi-Object Tracking Accuracy) shows how many errors the tracking system made (misses, false positives, mismatch errors). It is computed as 1 minus the ratio of the total number of errors to the total number of ground-truth objects over all frames, so it ranges from -inf to 1. If the MOTA is close to 1, the accuracy of the system is good. If the MOTA is near zero or below zero, the accuracy of the system is poor.
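Both metrics reduce to simple sums once the per-frame matching is done. The sketch below assumes the matching step has already produced per-frame counts and position errors; the FrameStats container and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameStats:
    ground_truth: int          # number of true objects in the frame (m)
    misses: int                # objects without a hypothesis
    false_positives: int       # hypotheses without an object
    mismatches: int            # identity switches in this frame
    match_errors: List[float]  # position error of each matched pair


def mota(frames: List[FrameStats]) -> float:
    """1 - (all errors) / (all ground-truth objects), summed over frames."""
    errors = sum(f.misses + f.false_positives + f.mismatches for f in frames)
    total_gt = sum(f.ground_truth for f in frames)
    return 1.0 - errors / total_gt


def motp(frames: List[FrameStats]) -> float:
    """Total position error divided by the total number of matches."""
    total_error = sum(sum(f.match_errors) for f in frames)
    total_matches = sum(len(f.match_errors) for f in frames)
    return total_error / total_matches


frames = [
    FrameStats(4, misses=1, false_positives=0, mismatches=0,
               match_errors=[0.1, 0.2, 0.3]),
    FrameStats(4, misses=0, false_positives=1, mismatches=1,
               match_errors=[0.2, 0.2, 0.2, 0.2]),
]
print(mota(frames))            # 0.625 (3 errors over 8 ground-truth objects)
print(round(motp(frames), 3))  # 0.2 (total error 1.4 over 7 matches)
```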
https://pub.towardsai.net/multi-object-tracking-metrics-1e602f364c0c
😜Need sentiment analytics in YouTube comments?
Over 2 billion users watch YouTube videos at least once a month, and popular YouTube bloggers have billions of views. But you can't please every subscriber, and public opinion is constantly changing. You can build your own user sentiment analysis model with Youtube-Comment-Scraper, a Python library that collects comments on YouTube videos using browser automation (it only works on Windows for now). This open-source project will help you create a dashboard that analyzes how subscribers feel about the videos of popular YouTubers. The work comes down to the following steps:
• collecting the comments that users left on the video;
• using a pretrained ML model to make a prediction for each comment;
• visualizing the model predictions on a dashboard, for example with Dash in Python or Shiny in R.
Add interactivity by letting users filter the sentiment analysis results by release time, video author, and genre.
https://pypi.org/project/youtube-comment-scraper-python/
Register for the free international online conference DataArt IT NonStop 2021!

IT NonStop will be held on November 18-20, 2021.
This year, we will be focusing on Cloud, Data, and Machine Learning & Artificial Intelligence. Market leaders will take the stage and share their own knowledge, case studies and best solutions. The main working language of the conference is English, however there will be a special Junior track on November 20 that will be delivered mostly in Russian. November 20 will be also dedicated to workshops.

More than 30 speakers from Microsoft, AWS, Ocado, Codete, Ciklum, Eleks, SoftServe, Toloka, Yandex, DataArt, and other market leaders will take the stage at IT NonStop 2021. We can't list all of them in one post, so here are a few selected workshops:
— "Creating Real-Time Data Streaming powered by SQL on Kubernetes", Albert Lewandowski, Big Data DevOps Engineer, GetInData.
— "Create your own cognitive portrait in 60 minutes", Dmitry Soshnikov, Cloud Developer Advocate, Microsoft.
— "Training unbiased and accurate AI models", Robert Yenokyan, AI Lead, Pinsight.
The whole list of speakers and topics is available on our webpage and it's constantly growing.
You can still sign up for our conference. Registration is open and it's free for everyone!

Briefly about the IT NonStop Conference:
When: November 18-20
Venue: online and free of charge
Registration: https://it-nonstop.net/register-to-the-conference/?utm_source=bdscience&utm_medium=referral
🍁Top 10 most interesting DS conferences around the world in October 2021
1. 5-8 Oct - NLP Summit, Applied Natural Language Processing, online. https://www.nlpsummit.org/nlp-2021/
2. 6-7 Oct - TransformX AI Conference, with 100+ speakers including Andrew Ng, Fei-Fei Li, free and open to the public. Online. https://www.aicamp.ai/event/eventdetails/W2021100608
3. 6-9 Oct - The 8th IEEE International Conference on Data Science and Advanced Analytics, Porto, Portugal https://dsaa2021.dcc.fc.up.pt/
4. 12-14 Oct - Google Cloud Next '21, a global digital experience. Online. https://cloud.withgoogle.com/next
5. 12-14 Oct - Chief Data & Analytics Officers (CDAO). Online. https://cdao-fall.coriniumintelligence.com/virtual-home
6. 13-14 Oct - Big Data and AI Toronto. Online. https://www.bigdata-toronto.com/register
7. 15-17 Oct - DataScienceGO, UCLA Campus, Los Angeles, USA https://www.datasciencego.com/united-states
8. 19 Oct - Graph + AI Summit Fall 2021 - open conference for accelerating analytics and AI with Graph. New York, NY, USA and virtual https://info.tigergraph.com/graphai-fall
9. 20-21 Oct - RE.WORK Conversational AI for Enterprise Summit, Online. https://www.re-work.co/summits/conversational-ai-enterprise-summit
10. 21 Oct - DSS Mini Salon: The Future of Retail with Behavioral Data, Online. https://www.datascience.salon/snowplow-analytics-mini-virtual-salon
🍂
🗣3 Things You Didn't Know About Python Memory Usage
Since Python is the main programming language for Data Science tasks, every DS specialist will benefit from the following features of this tool:
Obtaining the address of an object in memory - for this, Python provides the id() function, which returns the identity of an object as an integer. In CPython this integer is the object's memory address.
Garbage collection - Python uses reference counting to decide when an object should be removed from memory. It tracks the number of references to each object and frees the object as soon as that count drops to zero. Objects that reference each other in a cycle never reach zero, so a separate cyclic garbage collector finds and removes them. For complex cases, you can inspect and change this behavior manually with the gc module.
Interning or integer caching - to save time and memory, CPython preallocates all integers in the range [-5, 256]. When a new integer variable in this range is created, Python simply references the cached integer and does not create a new object. Therefore, no matter how many variables are created, if they refer to the same integer in the range [-5, 256], they all point to the same memory address of the cached integer. Likewise, Python has interning mechanisms for short strings.
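All three features can be observed directly from the interpreter. The snippet below is a small demonstration that assumes CPython; the small-integer cache and the exact reference counts are implementation details, and the Node class is a made-up helper.

```python
import gc
import sys
import weakref

# 1. Identity: two names bound to one object share the same id().
data = [1, 2, 3]
alias = data
same_object = id(data) == id(alias)
print(same_object)            # True
print(sys.getrefcount(data))  # includes the temporary reference held by the call

# 2. Integer caching: equal small integers are the same object on CPython.
small_a, small_b = int("256"), int("256")
print(small_a is small_b)     # True on CPython


# 3. Reference cycles are reclaimed by the cyclic garbage collector.
class Node:
    def __init__(self) -> None:
        self.partner = None


def cycle_is_collected() -> bool:
    left, right = Node(), Node()
    left.partner, right.partner = right, left  # two objects referencing each other
    probe = weakref.ref(left)                  # does not keep `left` alive
    del left, right                            # reference counts stay above zero
    gc.collect()                               # the cycle detector frees both nodes
    return probe() is None


print(cycle_is_collected())   # True
```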
https://medium.com/techtofreedom/memory-management-in-python-3-popular-interview-questions-bce4bc69b69a
🕸Analysis of the American Data Science market 2021: a web scraping project on Selenium on open vacancies with visual results and conclusions. Also in the review, you will learn about the popularity of programming languages and ML-frameworks among US employers.
https://pub.towardsai.net/current-data-science-job-market-trend-analysis-future-4184f03a04ca
🤦🏼‍♀️Machine unlearning is a new challenge in ML
Sometimes ML algorithms have to forget what they have learned, for example when an AI system undermines privacy. Regulators around the world have the right to compel companies to remove improperly obtained information, and EU and California citizens may require a company to delete their data. Recently, regulators in the US and Europe have said that owners of AI systems must sometimes remove systems trained on sensitive data. In 2020, the UK data regulator warned companies that some ML programs may be subject to GDPR rights because they contain personal data. In early 2021, the FTC forced facial recognition startup Paravision to delete a collection of improperly obtained photographs of faces and the ML algorithms trained on them.
Thus, we come to a new area of DS called machine unlearning, which seeks ways to induce selective amnesia in AI: removing all traces of a particular person or data point from an ML system without affecting its performance. Some studies have shown that under certain conditions it is possible to make ML algorithms forget something, but the methods are not yet ready for use in production. Specifically, in 2019, scientists from the Universities of Toronto and Wisconsin-Madison proposed splitting the raw training data into multiple parts, each of which is processed separately before the results are combined into the final ML model. If you later need to forget one data point, you only need to reprocess part of the original dataset. Testing has shown that the approach works with online shopping data and a collection of over a million photographs. However, the unlearning system breaks down if deletion requests arrive in a particular sequence, and researchers are now looking for a way to solve this problem. For now, machine unlearning techniques are more of a demonstration of technical acumen than a major shift in data protection. After all, even if machines learn to forget, users will have to remember who they are sharing their data with.
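The sharding idea can be illustrated with a toy sketch. This is not the researchers' implementation: the "model" trained on each shard is just the mean of its values, and every name here (ShardedEnsemble, train_shard, and so on) is hypothetical. What it shows is that forgetting one point only triggers retraining of the shard that held it.

```python
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (point_id, value), a made-up toy record


def make_shards(points: List[Point], n_shards: int) -> Dict[int, List[Point]]:
    """Assign each point to a shard by its id."""
    shards: Dict[int, List[Point]] = {i: [] for i in range(n_shards)}
    for point in points:
        shards[point[0] % n_shards].append(point)
    return shards


def train_shard(shard: List[Point]) -> float:
    """Toy 'model': the mean of the shard's values (0.0 for an empty shard)."""
    values = [value for _, value in shard]
    return mean(values) if values else 0.0


class ShardedEnsemble:
    def __init__(self, points: List[Point], n_shards: int) -> None:
        self.n_shards = n_shards
        self.shards = make_shards(points, n_shards)
        self.models = {i: train_shard(s) for i, s in self.shards.items()}
        self.retrained_shards: List[int] = []

    def predict(self) -> float:
        """Combine the per-shard models by averaging them."""
        return mean(self.models.values())

    def forget(self, point_id: int) -> None:
        """Drop one point and retrain only the shard that contained it."""
        shard_id = point_id % self.n_shards
        self.shards[shard_id] = [p for p in self.shards[shard_id]
                                 if p[0] != point_id]
        self.models[shard_id] = train_shard(self.shards[shard_id])
        self.retrained_shards.append(shard_id)


ensemble = ShardedEnsemble([(i, float(i)) for i in range(6)], n_shards=3)
print(ensemble.predict())         # 2.5
ensemble.forget(5)
print(ensemble.retrained_shards)  # [2]
print(ensemble.predict())         # 2.0
```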
https://www.wired.com/story/machines-can-learn-can-they-unlearn/
Luxury EDA with Lux
Some DS tools come in handy in daily work. One of them is Lux, a Python library that simplifies and speeds up data exploration by automating the visualization and analysis of data. For a dataframe in a Jupyter Notebook, Lux recommends a set of visualizations that highlight interesting trends and patterns in the dataset. The visualizations are displayed in an interactive widget that lets users quickly browse large collections of charts and understand the data. Lux is tightly integrated with pandas, supports geographic and temporal data types, and can run SQL queries against Postgres.
Lux consists of several modules, each with its own responsibility:
• user interface level;
• level of verification and analysis of user input;
• intent processing level;
• data execution level;
• analytics level.
https://github.com/lux-org/lux
https://lux-api.readthedocs.io/en/latest/source/getting_started/overview.html
☂️Reverse ETL: what it is and how to use it
Reverse ETL is the process of copying data from a data warehouse to operational systems, including SaaS tools for marketing, sales, and support. This allows any team, from salespeople to engineers, to access the data they need in the systems they already use. There are 3 main use cases for reverse ETL:
• Operational analytics - delivering insights to business teams inside their usual workflows and tools so they can make data-driven decisions
• Data automation - automating ad hoc requests for data from other teams, for example, when the finance team requests product usage data for billing
• Personalization - tailoring interaction with customers across different applications
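A minimal sketch of the idea, under loud assumptions: an in-memory SQLite database stands in for the data warehouse, and FakeCrm is a hypothetical stand-in for a SaaS tool's API client. Real reverse ETL tools add scheduling, change detection, and retries on top of this read-then-push loop.

```python
import sqlite3
from typing import Dict, List


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory 'warehouse' with made-up usage data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product_usage (customer_id TEXT, events INTEGER)")
    conn.executemany(
        "INSERT INTO product_usage VALUES (?, ?)",
        [("acme", 10), ("acme", 5), ("globex", 7)],
    )
    return conn


def load_usage(conn: sqlite3.Connection) -> List[Dict]:
    """Read the aggregated, analysis-ready data from the warehouse."""
    rows = conn.execute(
        "SELECT customer_id, SUM(events) FROM product_usage "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return [{"customer_id": cid, "events": total} for cid, total in rows]


class FakeCrm:
    """Hypothetical stand-in for an operational tool such as a CRM."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict] = {}

    def upsert(self, record: Dict) -> None:
        self.records[record["customer_id"]] = record


def sync(conn: sqlite3.Connection, crm: FakeCrm) -> int:
    """Copy warehouse rows into the operational tool; return the row count."""
    records = load_usage(conn)
    for record in records:
        crm.upsert(record)
    return len(records)


crm = FakeCrm()
print(sync(build_demo_warehouse(), crm))  # 2
print(crm.records["acme"]["events"])      # 15
```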

The most popular reverse ETL tools today are:
Hightouch is a data platform that allows you to synchronize data from repositories with CRM, marketing and customer support tools https://hightouch.io/docs/
Census is an operational analytics platform that synchronizes the data warehouse with different applications https://www.getcensus.com/
Octolis is a cloud service that allows marketing and sales teams to easily deploy use cases by activating their data in their operational tools such as CRM or marketing automation software https://octolis.com/
Grouparoo is an open source reverse ETL that runs easily on a laptop or in the cloud, allowing you to develop locally, commit changes, and deploy https://www.grouparoo.com/docs/config
Polytomic is an ETL solution that lets you set up, in a couple of minutes, real-time syncing of all the customer data you need into Marketo, Salesforce, HubSpot and other business systems https://www.polytomic.com/
RudderStack is a customer data platform for developers where reverse ETL tools make it easy to deploy pipelines that collect customer data from each application, website, and SaaS platform to activate on DWH and application systems https://rudderstack.com/
Workato - a tool for automating business processes in cloud and local applications https://www.workato.com/
Omnata - data integration tool for modern architectures https://omnata.com/
Smart ETL Tool from Rivery - a platform for automating ETL processes using any cloud-based DBMS, including Redshift, Oracle, BigQuery, Azure and Snowflake https://rivery.io/
Forwarded from Big Data Science [RU]
Reverse ETL
How to raise the quality of data?

You can get excellent results at every stage of product promotion, but without quality data those results will not be reliable and will not translate into real impact. What matters most about data is consistency, especially in product analytics. Data quality depends on it heavily.

At Matemarketing, Vlad Kharitonov and Oleg Khomyuk will explain how to achieve consistency in all cases, including when scaling. Their talk covers strict contract-based categorization, versioning, cross-platform cases, and using legacy for scaling.

Matemarketing is the biggest conference on Marketing and Product Analytics, Monetization and Data-Driven Solutions in Russia and CIS.

- - - -
Matemarketing-21 will take place on November 18-19 in Moscow and will be available online.
↪️ The full program and all details are available on our website.
- - - -

And now we want to share a talk from Matemarketing by Jordi Roura (InfoTrust Barcelona). You will learn the theory behind providing quality data and see examples of how it is applied in specific cases.
✍🏻5 Scikit-learn tips from Data Scientist
1. Fill the gaps with IterativeImputer, which models each feature with missing values as a function of the other features and refines its estimates over several rounds. The estimator is still experimental, so import enable_iterative_imputer from sklearn.experimental before importing IterativeImputer from sklearn.impute.
2. Generate random dummy data to hold the place where real or useful data should be. The dummy data is needed for testing, so it must be reliable. To do this, you can use make_classification() in a classification task or make_regression() in a regression task. You can also set the number of samples and features to control the behavior of the data in debugging and testing.
3. Save ML models for reuse without retraining. To serialize your fitted estimators and save them, the pickle and joblib Python libraries come in handy.
4. Plot a confusion matrix, which displays the true positive, false positive, false negative and true negative counts. The plot_confusion_matrix function mentioned in older tutorials was removed in scikit-learn 1.2; use ConfusionMatrixDisplay.from_estimator or ConfusionMatrixDisplay.from_predictions instead.
5. Visualize decision trees with sklearn.tree.plot_tree, which draws with matplotlib and creates simple visualizations without manually installing extra dependencies. You can also save the tree as a PNG file.
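Tips 1-4 can be combined in one short script. This is a sketch that assumes scikit-learn and NumPy are installed; the dataset sizes, the 10% missing rate, and the choice of LogisticRegression are arbitrary. It computes the confusion matrix as an array rather than plotting it, so it also runs without a display.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Tip 2: generate synthetic classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Tip 1: knock out roughly 10% of the values, then impute them.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan
X_filled = IterativeImputer(random_state=0).fit_transform(X_missing)

# Tip 3: train once, serialize, and restore without retraining.
model = LogisticRegression(max_iter=1000).fit(X_filled, y)
restored = pickle.loads(pickle.dumps(model))

# Tip 4: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, restored.predict(X_filled))
print(cm)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.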
https://www.educative.io/blog/scikit-learn-tricks-tips
📝Math for Data Scientist: 3 Distance Measures, Part 1
• Euclidean Distance - measures the length of the line segment that connects two points. It is the most common measure, but it is not scale-invariant: the calculated distances may be skewed depending on the units of the features. Therefore, you need to normalize the data before using this measure. As the dimension of the data increases, the usefulness of the Euclidean distance decreases, but it works well for low-dimensional data. For example, the kNN and HDBSCAN methods show good results with this measure. Finally, Euclidean distance is intuitive to use and easy to implement.
• Cosine Similarity - the cosine of the angle between two vectors. This measure helps to avoid the disadvantages of Euclidean distance in high dimensions. Two vectors with the same orientation have a cosine similarity of 1, and vectors that are diametrically opposed to each other have a similarity of -1. The magnitude of the vectors is irrelevant because this is a measure of orientation. For that reason, the measure is not very suitable for recommendation systems: cosine similarity does not account for the difference in rating scale between users. Nevertheless, cosine similarity is useful when the data is high-dimensional and the magnitude of the vectors does not matter, for example, in text analysis.
• Hamming distance - the number of positions at which two vectors differ. It is typically used to compare two binary strings of the same length, measuring how similar they are by counting the characters that differ. For example, it is used for detecting or correcting errors in data transmission over computer networks, where the number of corrupted bits in a binary word serves as an estimate of the error. You can also use Hamming distance to measure the distance between categorical variables. Hamming distance is difficult to apply when two vectors have different lengths.
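All three measures fit in a few lines of plain Python. The functions below are a minimal sketch for small inputs; for real workloads a vectorized library would be the usual choice.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Length of the straight line between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))                    # 5.0
print(round(cosine_similarity((1, 2), (2, 4)), 6))  # 1.0 (same orientation)
print(cosine_similarity((1, 0), (-1, 0)))           # -1.0 (opposite orientation)
print(hamming("10110", "10011"))                    # 2
```

The second line shows why magnitude does not matter for cosine similarity: (2, 4) is twice as long as (1, 2), yet the similarity is still 1.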
🚀FLAN by Google AI: generalizable Language Models with Instruction Fine-Tuning
For an ML model to generate meaningful text, it must have a large amount of knowledge about the world and the ability to abstract. Language models trained to do this acquire such knowledge automatically as they scale, but it is less clear how best to unlock that knowledge and apply it to specific real-world tasks.
One recent popular technique for using language models to solve problems is zero-shot or few-shot prompting. This method formulates a task in the form of text that the language model could have seen during training, so that the model generates the answer by completing the text. While this works well for some tasks, it requires careful prompt design to make the task look like the training data, and it is not an intuitive way to interact with the model in practice. For example, the creators of GPT-3 found that such prompting methods do not lead to good performance on natural language inference (NLI) tasks.
Instead, FLAN fine-tunes the model on a wide variety of instructions that use a simple and intuitive description of the task, such as “Classify this movie review as positive or negative” or “Translate this sentence into Danish”. Creating an instruction dataset from scratch to fine-tune the model would be resource-intensive, so templates are used to convert existing datasets into the instruction format. Experiments by Google AI researchers, which compared FLAN with GPT-3 on 25 tasks, showed that this approach works.
Notably, at small scale the FLAN method actually degrades performance, and only at larger scale does the model become able to generalize from the instructions in the training data to unseen tasks. This is because small models do not have enough parameters to perform a large number of tasks.
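The template step can be sketched as follows. This is only an illustration of the idea: the dataset rows, the second template, and the to_instruction_examples helper are made up, and the actual FLAN templates and data pipeline are not reproduced here.

```python
from typing import Dict, List

# Instruction templates for one task; {text} is filled from the dataset row.
TEMPLATES = [
    "Classify this movie review as positive or negative: {text}",
    "Is the following movie review positive or negative?\n{text}",
]


def to_instruction_examples(dataset: List[Dict[str, str]],
                            templates: List[str]) -> List[Dict[str, str]]:
    """Turn (text, label) rows into (instruction input, target) pairs."""
    examples = []
    for row in dataset:
        for template in templates:
            examples.append({
                "input": template.format(text=row["text"]),
                "target": row["label"],
            })
    return examples


reviews = [
    {"text": "A moving, beautifully shot film.", "label": "positive"},
    {"text": "Two hours I will never get back.", "label": "negative"},
]
examples = to_instruction_examples(reviews, TEMPLATES)
print(len(examples))         # 4
print(examples[0]["input"])
print(examples[0]["target"])  # positive
```

Using several phrasings per task is what lets an existing labeled dataset yield varied instruction data without writing each example by hand.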
https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html
Forwarded from Big Data Science [RU]
Comparison of FLAN with GPT-3
🚀News from DeepMind AI: Enformer Architecture for Genetic Research
The Enformer architecture, powered by Transformers, advances genetic research to accurately predict how DNA sequence affects gene expression. In early October 2021, Nature Methods published an article by DeepMind and Calico researchers about the new Enformer neural network architecture, which greatly improves the accuracy of predicting gene expression from a DNA sequence. The developers have made this model and its initial predictions of common genetic variants publicly available.
Enformer builds on the transformers common in natural language processing and uses self-attention mechanisms to cover much more of the DNA context. By efficiently processing sequences to account for interactions at distances of up to 200,000 base pairs, more than 5 times longer than previous methods, the new architecture can model the influence of important regulatory elements on the expression of genes found in the DNA sequence.
AI can be used to explore new ways of finding patterns in the genome and to put forward mechanistic hypotheses about sequence changes. Like a spell checker, Enformer partially understands the vocabulary of the DNA sequence and can highlight changes that could alter gene expression.
The main application of this new model is to predict which changes in DNA letters, also called genetic variants, will affect gene expression. Compared to previous models, Enformer is much more accurate in predicting the effect of variants on gene expression, both for natural genetic variants and for synthetic variants that alter important regulatory sequences. This property is useful for interpreting the growing number of disease-related variants derived from genome-wide association studies. Variants associated with complex genetic diseases are predominantly located in the non-coding region of the genome, likely causing disease by altering gene expression. But because of the intrinsic correlation between variants, many of these disease-related variants are only spuriously correlated and not causal. Computational tools help distinguish true associations from false positives.
https://deepmind.com/blog/article/enformer
https://www.nature.com/articles/s41592-021-01252-x
https://github.com/deepmind/deepmind-research/tree/master/enformer
✍🏻Math for the Data Scientist: another 3 Distance Measures, Part 2
Manhattan Distance, also called taxicab or city block distance, calculates the distance between real-valued vectors. It is the distance between two points on a uniform grid when movement is only allowed at right angles, so no diagonal movement is used when calculating the distance. While Manhattan distance seems to be acceptable for high-dimensional data, it is a less intuitive measure than Euclidean distance. It also tends to give a higher value than Euclidean distance, since it is not the shortest possible path. However, if the dataset has discrete and/or binary attributes, Manhattan distance works well because it takes into account the paths that are actually possible within those values.
Chebyshev distance is defined as the greatest difference between two vectors along any coordinate dimension; it is simply the maximum distance along one axis. This measure is also often called chessboard distance, since the minimum number of moves a king needs to get from one square to another equals the Chebyshev distance. It is usually used in very specific cases, which makes it hard to use as a universal measure of distance, unlike Euclidean distance or cosine similarity. Therefore, Chebyshev distance is only recommended in certain cases, for example, to determine the minimum number of moves in games that allow unrestricted 8-directional movement. Chebyshev distance is also often used in warehouse logistics, for example, to determine the time required for an overhead crane to move an object.
Minkowski distance is a more general measure used in a normed vector space (n-dimensional real space), where distances can be represented as a vector with a length. In such a space the zero vector has zero length and all other vectors have positive length, a vector can be multiplied by a number (a scalar coefficient), and the shortest distance between two points is a straight line. The parameter p controls which metric you get: p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and p = ∞ gives Chebyshev distance. Therefore, to work with Minkowski distance, you need to understand the purpose, advantages and disadvantages of the Manhattan, Euclidean and Chebyshev measures. Finding the right value of p can be computationally expensive, but the parameter makes the metric flexible, which can be a huge advantage when p is chosen well.
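Because Minkowski distance generalizes the other measures, one function is enough to show all of them. The sketch below is a plain-Python illustration that treats p = ∞ as a special case.

```python
import math
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski distance of order p; p = math.inf gives Chebyshev distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)               # largest difference along any axis
    return sum(d ** p for d in diffs) ** (1 / p)


start, end = (1, 2), (4, 6)             # coordinate differences are 3 and 4
print(minkowski(start, end, 1))         # 7.0 (Manhattan)
print(minkowski(start, end, 2))         # 5.0 (Euclidean)
print(minkowski(start, end, math.inf))  # 4 (Chebyshev)
```

The three results also show the ordering described in the text: Manhattan gives the largest value, Euclidean is the straight-line distance, and Chebyshev keeps only the largest single-axis difference.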