👻What is anomaly detection and how does it work?
Anomaly detection is the mathematical search for deviations in labeled or unlabeled numerical data, based on how much a particular value differs from the others or from the standard deviation of a given sample. There are many methods for detecting anomalies, known as outlier detection algorithms, each with different detection criteria and therefore suited to different scenarios. The most common methods are:
• General density-based methods: K-Nearest Neighbors (KNN), Local Outlier Factor (LOF), Isolation Forests, and other algorithms that can be applied in regression or classification scenarios. Each of these derives the expected behavior from the regions of highest data-point density; points that fall a statistically significant distance outside these dense zones are flagged as anomalies. Most of these methods rely on distances between points, so it is important to normalize the units and scale of the dataset to get accurate results. For example, a KNN-based detector weights neighbors by 1/d, where d is the distance to the nearest neighbor: points surrounded by close neighbors carry a lot of weight and define what counts as "normal", while points whose nearest neighbors are far away get a low 1/d value and are marked as outliers. This is suitable for normalized, unlabeled data when there is no need or ability to use algorithms with more complex calculations.
• One-class support vector machine (SVM). A standard SVM is a supervised learning algorithm, often used for classification, that builds a robust prediction model: given a training set of examples, each labeled as belonging to one of two categories, it maps the examples to points in space so that the two categories are separated as widely as possible and new examples can be sorted into one of them. A point that falls outside either category is flagged as an outlier. In the absence of labeled data, an unsupervised variant can be used, which looks for clusters among the examples to define the categories. This is suitable for working with two categories of data, when you need to find which data points lie outside each of them.
• K-means clustering combines KNN-style reasoning about the proximity of each data point to nearby points with SVM-style assignment into distinct categories. Each data point is assigned to a category based on its features; each category has a center point (centroid) that serves as the prototype for all the data points in that cluster. Every point is compared to these prototypes, and its distance to the nearest centroid acts as a measure of how different it is from the prototype. Data points with small distances lie close to a prototype and form a cluster, while points that do not fit any of the established categories can be marked as anomalies. This is suitable for scenarios with unlabeled data of many different kinds that needs to be organized around learned prototypes.
There are other, more sophisticated algorithms for unsupervised anomaly detection and multidimensional datasets. For example, Gaussian mixture models are an alternative to K-Means that describe clusters with Gaussian distributions rather than a single standard deviation, and Bayesian networks use Bayesian probability to detect anomalies. Autoencoders can also be used: these neural networks learn an encoding of the expected output for a given input, so anything that falls outside the learned patterns is considered an anomaly, which makes them well suited to high-dimensional detection tasks. A minimal density-based example is sketched below.
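As a quick illustration of the density-based family, here is a minimal sketch (assuming scikit-learn is available; the toy data and the 3% contamination rate are made up for the example) that flags outliers with Isolation Forest and Local Outlier Factor:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense "normal" cloud
outliers = rng.uniform(low=-6, high=6, size=(10, 2))      # sparse anomalies
X = np.vstack([normal, outliers])

# Isolation Forest: points isolated by short random-tree paths score as anomalies
iso = IsolationForest(contamination=0.03, random_state=42).fit(X)
iso_labels = iso.predict(X)            # +1 = inlier, -1 = outlier

# Local Outlier Factor: compares the local density of a point to its neighbors'
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
lof_labels = lof.fit_predict(X)        # +1 = inlier, -1 = outlier

print("Isolation Forest flagged:", np.sum(iso_labels == -1))
print("LOF flagged:", np.sum(lof_labels == -1))
```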
💥TOP 5 useful Python tools for data engineers and web developers
• Requests is an easy-to-use HTTP library for Python that allows you to make HTTP requests and interact with web APIs https://docs.python-requests.org/en/master/
• Advanced Python Scheduler (APScheduler) - a library for deferred execution of Python code, either once or with periodic repetition. If tasks are persisted in a database, their state survives scheduler restarts. APScheduler can also be used as a cross-platform, application-specific replacement for platform-specific schedulers such as the cron daemon or the Windows task scheduler. However, APScheduler is not a daemon or a service and does not ship with command-line tools; it is intended to run inside existing applications. The library does provide ready-made building blocks for creating a scheduler service or running it as a separate process. https://apscheduler.readthedocs.io/en/stable/userguide.html
• Watchdog - a module for tracking filesystem events through the Python API and shell utilities https://pypi.org/project/watchdog/
• Twilio - a library for automating the sending of text messages and phone calls. It is very convenient for automatically monitoring events on third-party sites, for example, getting prompt notifications about discounts on products you need or the appearance of new products https://pypi.org/project/twilio/
• Random User Agent - a library for adding random user agents to requests, which is useful when scraping web data or sending a large number of requests https://pypi.org/project/random-user-agent/
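To give a feel for the first two libraries, here is a minimal sketch (the URL and the 30-minute interval are arbitrary placeholders) that uses Requests to poll an endpoint and APScheduler to repeat the job on a schedule:

```python
import requests
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

@scheduler.scheduled_job("interval", minutes=30)
def poll_api():
    # Plain HTTP GET with Requests; raise_for_status() surfaces HTTP errors
    response = requests.get("https://httpbin.org/get", timeout=10)
    response.raise_for_status()
    print("Fetched", len(response.content), "bytes")

if __name__ == "__main__":
    scheduler.start()  # blocks and runs poll_api every 30 minutes
```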
✈️Real-Time ML Predictions with Google's Vertex AI
One of the biggest challenges in serving ML-models is providing near real-time predictions. Some business scenarios are especially sensitive to time latency. For example, recommendation systems for online store users, estimating the delivery time of products for food tech companies, etc. On August 25, 2021, Google announced the possibility of direct interaction with Vertex AI - its unified ML platform through private endpoints. Vertex AI allows you to quickly connect a trained and tested ML model to a working application, upload it to a specially prepared server in the Google Cloud, or export it to the desired format.
Vertex Predictions is a serverless way of serving ML models so that they can be hosted in the cloud and queried for predictions via a REST API. For online predictions, the model must be deployed to an endpoint, which binds it to physical compute resources and allows predictions to be served in near real time. With VPC Peering, you can configure a private connection to reach the endpoint, so user data does not travel over the public Internet, which lowers the latency of online predictions and improves security.
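For orientation only, here is a rough sketch of what calling a deployed Vertex AI endpoint from Python can look like with the google-cloud-aiplatform SDK; the project ID, region, endpoint resource name, and instance payload are placeholders, and the exact call may differ for private endpoints:

```python
from google.cloud import aiplatform

# Placeholders: replace with your own project, region, and endpoint ID
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

# The instance schema depends on the deployed model
prediction = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": 2.0}])
print(prediction.predictions)
```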
https://cloud.google.com/blog/products/ai-machine-learning/creating-a-private-endpoint-on-vertex-ai
🏂🏸Adversarial attacks to refine molecular energy predictions
Researchers at MIT have developed a new way to quantify the uncertainty of molecular energies predicted with neural networks. Neural networks are often used to predict molecular properties and new materials orders of magnitude faster than traditional methods such as quantum-mechanical simulation. The results can be unreliable, however: because ML-models interpolate, they may fail when applied to data outside the training distribution. This is especially true for predicting the potential energy surface (PES), the energy map of a molecule across all of its configurations. To address this, the researchers proposed delineating the safe zones of a neural network using adversarial attacks. The expensive simulation is performed only for a small fraction of the molecules, and the data is fed into the neural network, which learns to predict the same properties for the rest. Such methods have already been applied successfully to new materials, including catalysts for producing hydrogen from water, cheaper polymer electrolytes for electric vehicles, magnets, and more. However, the accuracy of neural networks depends on the quality of the training data, and incorrect predictions can have disastrous consequences.
One way to find out the uncertainty of a model is to run the same data through several versions of it. To do this, the researchers had several neural networks predicting a potential surface based on the same data. If the network is confident in the prediction, the difference between the outputs of different networks is minimal and the surfaces converge more. Otherwise, the predictions of the various models vary greatly, producing a series of outputs, any of which may be the correct surface.
Forecast scatter represents the uncertainty at a particular point. The ML-model should indicate not only the best forecast, but also the uncertainty of each of them. However, each simulation can take tens to thousands of CPU hours. And to get meaningful results, you need to run multiple models at a sufficient number of points.
Therefore, the new approach selects only the data points with low prediction confidence. These molecules are then modified slightly to increase the uncertainty further. Additional data for these molecules is computed through simulation and added to the original training pool. The neural networks are trained again, and a new set of uncertainties is calculated. The process is repeated until the uncertainty at various points on the surface is well defined and cannot be reduced further.
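The ensemble-disagreement idea can be sketched in a few lines. This is not the MIT group's code, just an illustration with hypothetical pre-trained models, where the spread of predictions across the ensemble serves as the uncertainty and the most uncertain points are chosen for extra simulation:

```python
import numpy as np

def ensemble_uncertainty(models, X):
    """Return mean prediction and per-point spread for an ensemble of fitted models."""
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_points)
    return preds.mean(axis=0), preds.std(axis=0)

# Hypothetical usage: `models` are several networks trained on the same data,
# `X_candidates` are candidate molecular configurations.
# mean_e, sigma_e = ensemble_uncertainty(models, X_candidates)
# worst = np.argsort(sigma_e)[-10:]   # points with the largest disagreement
# ...run the expensive simulation only for X_candidates[worst], then retrain.
```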
The proposed approach has been tested on zeolites - porous crystals whose shape selectivity makes them useful in catalysis, gas separation, and ion exchange. Modeling large zeolite structures is very expensive, and the researchers show that their method can provide significant savings in computer simulations: the adversarial approach to retraining the neural networks improves accuracy without significant computational cost.
https://news.mit.edu/2021/using-adversarial-attacks-refine-molecular-energy-predictions-0901
🕸Web scraping automation: 3 popular tools
Do you want to track prices in an online store or automate ordering food from a restaurant? Try the following tools:
• Selenium is a well-known test automation framework that can be used to simulate user behavior and perform actions on websites such as filling out forms, clicking buttons, etc. https://selenium-python.readthedocs.io/
• Beautiful Soup is a Python-package for parsing HTML and XML documents. Creates a parse tree that can be used to extract data when parsing web pages. Very good for simple projects. https://pypi.org/project/beautifulsoup4/
• Scrapy is a fast, high-level web crawling and scraping framework used to extract structured data for data mining, monitoring, and automated testing. It is great for complex projects and is much faster than the aforementioned counterparts. https://docs.scrapy.org/en/latest/
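As a minimal Beautiful Soup example (the URL is a stand-in for the page you actually want to track, and the CSS selector for the price element is hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Point this at the real product page you want to monitor
response = requests.get("https://example.com/", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))          # page title

# Hypothetical selector: adjust to the actual markup of the page you scrape
price_tag = soup.select_one("span.price")
if price_tag is not None:
    print("Current price:", price_tag.get_text(strip=True))
```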
😎Need to develop an app for real-time emotion recognition on video?
Use the Face Recognition API! It is an open-source project for face recognition and manipulation from Python or the command line. The model is built on dlib's deep-learning face recognition and reaches 99.38% accuracy on the Labeled Faces in the Wild benchmark.
With Face Recognition API, application development consists of 5 steps:
• receiving video in real time
• applying Python-functions from a ready-to-use API for detecting faces and emotions on objects in a video stream;
• classification of emotions into categories;
• developing a recommendation system;
• building the application and deploying to Heroku, Dash or a web server.
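As a taste of the API, here is a minimal face-detection sketch with the face_recognition package. Note that emotion classification is not part of this library, so in a real pipeline the detected face crops would go into a separate emotion model; the image path below is a placeholder:

```python
import face_recognition

# Placeholder path: any frame captured from the video stream
image = face_recognition.load_image_file("frame.jpg")

# Locations of all faces in the frame as (top, right, bottom, left) boxes
face_locations = face_recognition.face_locations(image)
print(f"Found {len(face_locations)} face(s)")

# 128-dimensional encodings that can be compared against known faces
encodings = face_recognition.face_encodings(image, known_face_locations=face_locations)
```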
https://github.com/ageitgey/face_recognition
🚀Data Science in the city: we continue Citymobil's meetup series on Data Science in geoservices, logistics, Smart City applications, and more. Join the 2nd online meetup on September 23 at 18:00 MSK. Expect interesting talks from DS practitioners at Citymobil, Optimate AI, and Yandex.Routing:
🚕Maksim Shalankin (Data Scientist at Citymobil's geo service) will talk about the life cycle of an ML model for travel-time prediction under heavy load
🚚Sergey Sviridov (CTO at Optimate AI) will explain what is wrong with classic heuristics and combinatorial-optimization methods for building optimal routes, and how they can be replaced with dynamic programming
🚛Daniil Tararukhin (Head of Analytics at Yandex.Routing) will share how traffic jams affect the search for an optimal route and the simulation modeling of this problem.
After the talks, the speakers will answer questions from the audience.
The event host is Alexey Chernobrovov🛸
Free registration: https://citymobil.timepad.ru/event/1773649/
🗣4 best practices for using the Google Cloud Translation API efficiently
A web service that dynamically translates between languages using Google ML models supports over 100 languages and is actively used in practice. And if you know useful life hacks, you can reduce costs, increase productivity, and improve the security of this translation API on websites.
1. Cache translated content. Caching not only reduces the number of calls to the Google Cloud Translation API, it also reduces the load on internal web servers and databases, optimizing application performance and lowering costs. Caching can be configured at different levels of the application architecture: at the proxy level (NGINX or HAProxy), in memory inside the application on the web servers, in an external in-memory caching service, or through a CDN.
2. Secure access using the principle of least privilege. When accessing the Google Cloud Translation API, it is recommended to use a Google Cloud service account rather than API keys. A service account is a special account type that represents a non-human user and can be authorized to access data in Google APIs. Service accounts have no passwords and cannot be used to log in through a browser, which minimizes that threat vector. Following the principle of least privilege, grant the least-privileged role whose permissions are sufficient to access the Translation API.
3. Customize translations. If your content includes domain- and context-specific terms, Google Cloud Translation API Advanced supports custom terminology through a glossary, and you can train your own translation models with Google AutoML Translation. Help users understand the potential risk of errors and inaccuracies by alerting them that the content has been automatically translated by Google.
4. Control the budget. The cost of the Google Cloud Translation API depends mainly on the number of characters sent to the API. For example, at $10 per million characters, a web page with 20 million characters costs $10 × 20 = $200 to translate into one language, or $2,000 to translate into 10 languages. Setting up alerts in your environment will help you keep track of the budget.
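To illustrate the first recommendation, here is a rough sketch of in-process caching around the Cloud Translation client. It assumes the google-cloud-translate package's v2 client; authentication via a service account is configured outside the snippet, and in production you would more likely cache in Redis or a CDN than in process memory:

```python
from functools import lru_cache
from google.cloud import translate_v2 as translate

client = translate.Client()  # credentials come from the service-account environment

@lru_cache(maxsize=10_000)
def translate_cached(text: str, target_language: str) -> str:
    """Call the Translation API only on a cache miss for this (text, language) pair."""
    result = client.translate(text, target_language=target_language)
    return result["translatedText"]

# Repeated calls with the same arguments are served from the cache,
# so the API is billed only once per unique string and language.
print(translate_cached("Hello, world", "de"))
print(translate_cached("Hello, world", "de"))  # cache hit, no API call
```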
https://cloud.google.com/blog/products/ai-machine-learning/four-best-practices-for-translating-your-website
🍏3 useful Python libraries for Data Scientist
• JMESPath - a library for querying JSON. It is useful when working with a large multi-level JSON document or dictionary: JMESPath lets you reach into nested objects with a JavaScript-style path syntax, making code easier to develop and test. It is also safe - if any part of the path does not exist, the JMESPath lookup function returns None. https://github.com/jmespath/jmespath.py
• Inflection is a Ruby-derived library that helps you handle complex string-processing logic. It converts English words between singular and plural and converts strings from CamelCase to underscore. It is useful when variable or data-point names generated in another language or on another system need to be converted to pythonic style in accordance with the PEP standards. https://github.com/jpvanhal/inflection
• more-itertools - a library with a set of useful functions for everyday development tasks. For example, it lets you write quick, elegant code to split one dictionary into multiple lists based on a common repeating key, or to iterate over several lists at once. The library provides many ready-made, well-tested iteration utilities so you do not have to reimplement them by hand. https://github.com/more-itertools/more-itertools
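A few one-liners showing what each library looks like in practice (the sample data is made up):

```python
import jmespath
import inflection
from more_itertools import chunked

# JMESPath: JavaScript-style path queries over nested JSON/dicts
doc = {"users": [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 27}]}
print(jmespath.search("users[*].name", doc))         # ['Ann', 'Bob']
print(jmespath.search("users[0].missing.key", doc))  # None instead of an exception

# Inflection: string transformations for naming conventions
print(inflection.underscore("DataPointName"))        # 'data_point_name'
print(inflection.pluralize("analysis"))              # 'analyses'

# more-itertools: split an iterable into fixed-size chunks
print(list(chunked(range(7), 3)))                     # [[0, 1, 2], [3, 4, 5], [6]]
```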
👀How to evaluate the quality of a multi-object ML model of computer vision?
Tracking multiple objects in a real-world environment is challenging, not least because of the metrics used to evaluate the quality of the ML-model, whose purpose is to assess tracking accuracy and verify the trajectory of a moving object. Suppose that for each frame in the video stream the tracking system outputs n hypotheses, and there are m ground-truth objects in the frame. Then the evaluation process is as follows:
• Find the best matching between hypotheses and ground-truth objects based on their coordinates, using one of various matching algorithms.
• For each matched pair, find the error in the position of the object.
• Calculate the sum of several kinds of errors: misses (the tracker produced no hypothesis for an object), false positives (the tracker produced a hypothesis but no object was present), and mismatch errors (the hypothesis assigned to a ground-truth object changes from one frame to the next).
So the performance of the ML-model can be expressed in two metrics:
• MOTP (Multi-Object Tracking Precision) shows how accurately object positions are estimated. It is the total localization error over the matched ground truth-hypothesis pairs across all frames, averaged over the total number of matches made. This metric does not account for recognizing object configurations or evaluating object trajectories. It ranges from 0 to 1: because it is an error measure, a value close to 0 means good accuracy, while a value of 1 means poor accuracy.
• MOTA (Multi-Object Tracking Accuracy) shows how many errors the tracking system made (misses, false positives, mismatch errors). The metric ranges from –inf to 1. If the MOTA is 1, then the accuracy of the system is good. If the MOTA is near zero or less than zero, then the accuracy of the system is poor.
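As a back-of-the-envelope illustration of MOTA (the error counts below are made up, not from any real tracker):

```python
def mota(misses, false_positives, mismatches, total_ground_truth):
    """MOTA = 1 - (misses + false positives + mismatch errors) / total ground-truth objects."""
    return 1.0 - (misses + false_positives + mismatches) / total_ground_truth

# Hypothetical run: 1000 ground-truth objects across all frames
print(mota(misses=40, false_positives=25, mismatches=5, total_ground_truth=1000))  # 0.93
```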
https://pub.towardsai.net/multi-object-tracking-metrics-1e602f364c0c
😜Need sentiment analytics in YouTube comments?
Over 2 billion users watch YouTube videos at least once a month, and popular YouTube bloggers have billions of views. But you can't please every subscriber, and public opinion is constantly changing. Build your own sentiment analysis model with Youtube-Comment-Scraper, a Python library for collecting comments on YouTube videos using browser automation (it currently works on Windows only). This open-source project will help you create a dashboard that analyzes how subscribers react to the videos of popular youtubers. The work boils down to the following steps:
• collecting the necessary comments to the video from YouTube users;
• using a pretrained ML model to make predictions for each comment;
• visualization of model forecasts on a dashboard, incl. using Dash in Python or Shiny in R.
Add interactivity with filters to sentiment analysis results by release time, video author, and genre.
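The prediction step might look like the sketch below; fetch_comments is a hypothetical stub standing in for whatever the scraper returns, and the sentiment model is the default Hugging Face pipeline rather than anything YouTube-specific:

```python
from collections import Counter
from transformers import pipeline  # assumes the transformers package is installed

def fetch_comments(video_url):
    """Hypothetical stub: in a real project this would call the YouTube comment scraper."""
    return ["Loved this video!", "Way too long and boring.", "Great editing as always."]

sentiment = pipeline("sentiment-analysis")            # default pretrained English model
comments = fetch_comments("https://www.youtube.com/watch?v=...")
labels = [result["label"] for result in sentiment(comments)]

print(Counter(labels))  # e.g. Counter({'POSITIVE': 2, 'NEGATIVE': 1})
```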
https://pypi.org/project/youtube-comment-scraper-python/
Register for the free international online conference DataArt IT NonStop 2021!
IT NonStop will be held on November 18-20, 2021.
This year, we will be focusing on Cloud, Data, and Machine Learning & Artificial Intelligence. Market leaders will take the stage and share their own knowledge, case studies and best solutions. The main working language of the conference is English, however there will be a special Junior track on November 20 that will be delivered mostly in Russian. November 20 will be also dedicated to workshops.
More than 30 speakers from Microsoft, AWS, Ocado, Codete, Ciklum, Eleks, SoftServe, Toloka, Yandex, DataArt, and other market leaders will take the stage at IT NonStop 2021. We can't list all of them in one post, so here are a select few workshops:
— "Creating Real-Time Data Streaming powered by SQL on Kubernetes", Albert Lewandowski, Big Data DevOps Engineer, GetInData.
— "Create your own cognitive portrait in 60 minutes", Dmitry Soshnikov, Cloud Developer Advocate, Microsoft.
— "Training unbiased and accurate AI models", Robert Yenokyan, AI Lead, Pinsight.
The whole list of speakers and topics is available on our webpage and it's constantly growing.
You can still sign up for our conference. Registration is open and it's free for everyone!
Briefly about the IT NonStop Conference:
When: November 18-20
Venue: online and free of charge
Registration: https://it-nonstop.net/register-to-the-conference/?utm_source=bdscience&utm_medium=referral
🍁TOP-10 most interesting DS-conferences all over the world in October 2021
1. 5-8 Oct - NLP Summit, Applied Natural Language Processing, online. https://www.nlpsummit.org/nlp-2021/
2. 6-7 Oct - TransformX AI Conference, with 100+ speakers including Andrew Ng, Fei-fei Li, free and open to the public. Online. https://www.aicamp.ai/event/eventdetails/W2021100608
3. 6-9 Oct - The 8th IEEE International Conference on Data Science and Advanced Analytics, Porto, Portugal https://dsaa2021.dcc.fc.up.pt/
4. 12-14 Oct - Google Cloud Next '21, a global digital experience. Online. https://cloud.withgoogle.com/next
5. 12-14 Oct - Chief Data & Analytics Officers (CDAO). Online. https://cdao-fall.coriniumintelligence.com/virtual-home
6. 13-14 Oct - Big Data and AI Toronto. Online. https://www.bigdata-toronto.com/register
7. 15 – 17 Oct - DataScienceGO, UCLA Campus, Los Angeles, USA https://www.datasciencego.com/united-states
8. 19 Oct - Graph + AI Summit Fall 2021 - open conference for accelerating analytics and AI with Graph. New York, NY, USA and virtual https://info.tigergraph.com/graphai-fall
9. 20 – 21 Oct - RE.WORK Conversational AI for Enterprise Summit, Online. https://www.re-work.co/summits/conversational-ai-enterprise-summit
10. 21 Oct - DSS Mini Salon: The Future of Retail with Behavioral Data, Online. https://www.datascience.salon/snowplow-analytics-mini-virtual-salon
🍂
🗣3 Things You Didn't Know About Python Memory Usage
Since Python is the main programming language for Data Science tasks, every DS specialist will benefit from the following features of this tool:
• Obtaining the address of an object in memory - Python's id() function returns an object's identity as an integer; in CPython this is the object's memory address.
• Garbage collection - Python uses reference counting to decide when an object should be removed from memory: it tracks the number of references to each object and frees the object once nothing references it any more. For complex cases (such as reference cycles), you can adjust garbage-collection behavior manually with Python's gc module.
• Interning, or integer caching - to save time and memory, Python preloads all integers in the range [-5, 256]. When a new integer variable in this range is declared, Python simply references the cached integer instead of creating a new object. Therefore, no matter how many variables are created, if they all refer to, say, the integer 256, they all point to the same memory address of the cached object. Python has a similar interning mechanism for short strings.
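A quick interactive check of all three points (the exact id() values will differ on your machine, and the 257 result is typical for CPython but not guaranteed by the language):

```python
import gc
import sys

x = [1, 2, 3]
print(id(x))                 # the object's identity; its memory address in CPython
print(sys.getrefcount(x))    # reference count (temporarily +1 for the function argument)

del x
print(gc.collect())          # force a collection pass; returns the number of unreachable objects found

# Small-int interning: values in [-5, 256] are preloaded and shared
a, b = int("256"), int("256")
print(a is b)                # True - both names point at the cached 256 object
c, d = int("257"), int("257")
print(c is d)                # usually False in CPython - 257 is outside the cache
```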
https://medium.com/techtofreedom/memory-management-in-python-3-popular-interview-questions-bce4bc69b69a
🕸Analysis of the US Data Science job market in 2021: a Selenium-based web scraping project on open vacancies, with visualized results and conclusions. The review also covers the popularity of programming languages and ML-frameworks among US employers.
https://pub.towardsai.net/current-data-science-job-market-trend-analysis-future-4184f03a04ca
🤦🏼♀️Machine unlearning is a new challenge in ML
Sometimes ML algorithms have to forget what they have learned. For example, artificial intelligence can undermine privacy, and regulators around the world have the power to compel companies to delete improperly obtained information. EU and California citizens can require a company to delete their data. Recently, regulators in the US and Europe have said that owners of artificial intelligence systems must sometimes go further and delete systems trained on sensitive data. And in 2020, the UK data regulator warned companies that some ML programs may be subject to GDPR rights because they contain personal data. In early 2021, the FTC forced facial recognition startup Paravision to delete a collection of improperly captured photographs of faces and the ML algorithms trained on them.
Thus we come to a new area of DS called machine unlearning, which seeks ways to induce selective amnesia in AI: removing all traces of a particular person or data point from an ML system without affecting its performance. Some studies have shown that under certain conditions ML algorithms can be made to forget, but the techniques are not yet ready for production use. Specifically, in 2019, scientists from the Universities of Toronto and Wisconsin-Madison proposed splitting the raw training data into multiple parts, each of which is processed separately before the results are combined into the final ML model. If a data point later needs to be forgotten, only a fraction of the original dataset has to be reprocessed. Testing has shown that the approach works on online shopping data and a collection of over a million photographs. However, the unlearning system can fail if deletion requests arrive in a specific sequence, and researchers are now looking at how to solve this problem. For now, machine unlearning techniques are more a demonstration of technical skill than a major shift in data protection. After all, even if machines learn to forget, users will still have to remember who they share their data with.
https://www.wired.com/story/machines-can-learn-can-they-unlearn/
Luxury EDA with Lux
Useful DS-tools that will come in handy in your daily work. For example, Lux - a Python library that simplifies and accelerates data exploration by automating visualization and data analysis. For a dataframe in a Jupyter Notebook, Lux recommends a set of visualizations that highlight interesting trends and patterns in the dataset. The visualizations are displayed in an interactive widget that lets users quickly browse large collections of charts and understand the data. Deeply integrated with Pandas, Lux supports the library's various geographic and temporal data types, as well as SQL queries against Postgres.
Lux consists of several modules, each with its own responsibility:
• the user interface layer;
• the layer that validates and parses user input;
• the intent processing, data execution, and, finally, analytics layers.
https://github.com/lux-org/lux
https://lux-api.readthedocs.io/en/latest/source/getting_started/overview.html
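Getting started is a couple of lines in a notebook. This sketch assumes a Jupyter environment with the lux-api package installed and uses a hypothetical CSV file and column names:

```python
import pandas as pd
import lux  # importing lux is enough to attach its widget to pandas dataframes

df = pd.read_csv("sales.csv")   # hypothetical dataset
df                               # in Jupyter, this now renders the Lux recommendation widget

# Steer the recommendations toward specific columns with an "intent"
df.intent = ["revenue", "region"]  # hypothetical column names
df
```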
☂️Reverse ETL: what it is and how to use it
Reverse ETL is the process of copying data from a data warehouse into operational systems, including SaaS tools for marketing, sales, and support. This allows any team of specialists, from salespeople to engineers, to access the data they need in the systems they already use. There are 3 main use cases for reverse ETL:
• Operational analytics - delivering insights to business teams inside their normal workflows and tools so they can make data-driven decisions
• Data automation - automating ad hoc requests for data from other teams, for example, when finance requests product usage data for billing
• Personalization - tailoring customer interactions across different applications
The most popular reverse ETL tools today are:
• Hightouch is a data platform that allows you to synchronize data from repositories with CRM, marketing and customer support tools https://hightouch.io/docs/
• Census is an operational analytics platform that synchronizes the data warehouse with different applications https://www.getcensus.com/
• Octolis is a cloud service that allows marketing and sales teams to easily deploy use cases by activating their data in their operational tools such as CRM or marketing automation software https://octolis.com/
• Grouparoo is an open source reverse ETL that runs easily on a laptop or in the cloud, allowing you to develop locally, commit changes, and deploy https://www.grouparoo.com/docs/config
• Polytomic is an ETL solution that allows you to create in real time all the necessary customer data in Marketo, Salesforce, HubSpot and other business systems in a couple of minutes https://www.polytomic.com/
• RudderStack is a customer data platform for developers where reverse ETL tools make it easy to deploy pipelines that collect customer data from each application, website, and SaaS platform to activate on DWH and application systems https://rudderstack.com/
• Workato - a tool for automating business processes in cloud and local applications https://www.workato.com/
• Omnata - data integration tool for modern architectures https://omnata.com/
• Smart ETL Tool from Rivery - a platform for automating ETL processes using any cloud-based DBMS, including Redshift, Oracle, BigQuery, Azure and Snowflake https://rivery.io/
How to raise the quality of data?
You can have perfect outcomes at every stage of product promotion, but without quality data they will not be reliable and will not produce useful results. Consistency is what matters most about data, especially for product analytics: data quality depends on it heavily.
At Matemarketing, Vlad Kharitonov and Oleg Khomyuk will elaborate on how to achieve consistency in all cases, including at scale. Their talk covers strict contract-based event categorization, versioning, cross-platform cases, and using legacy events when scaling.
Matemarketing is the biggest conference on Marketing and Product Analytics, Monetization and Data-Driven Solutions in Russia and CIS.
- - - -
✅ Matemarketing-21 will take place on November 18-19 in Moscow and will be available online.
↪️ The full program and all details are available on our website.
- - - -
And now we want to share Jordi Roura's (InfoTrust Barcelona) talk from Matemarketing. You will learn the theory of ensuring data quality and see examples of how this knowledge is applied in specific cases.
✍🏻5 Scikit-learn tips from a Data Scientist
1. Fill gaps with IterativeImputer, which iteratively estimates and fills in missing values, improving the dataset with each iteration. Since the feature is still experimental, you must first import enable_iterative_imputer from the sklearn.experimental package (see the sketch after this list).
2. Generate random dummy data to stand in for real data during testing; because the dummy data is used in tests, it should be reproducible. Use make_classification() for a classification task or make_regression() for a regression task, and set the number of samples and features to control the data's behavior during debugging and testing.
3. Save ML-models for reuse without retraining. To serialize your algorithms and save them, the pickle and joblib Python libraries come in handy.
4. Plot a confusion matrix with the plot_confusion_matrix function, which displays true positives, false positives, false negatives, and true negatives (in newer scikit-learn versions this is replaced by ConfusionMatrixDisplay).
5. Visualize decision trees with the sklearn.tree.plot_tree function, which draws the tree via matplotlib without requiring extra dependencies such as Graphviz. You can also save the tree as a PNG image.
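A condensed sketch of tips 1-3 and 5; the dataset is synthetic, which conveniently doubles as an example of tip 2:

```python
import joblib
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates the experimental API)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Tip 2: synthetic dummy data with a controlled shape
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Tip 1: iteratively fill in missing values (this toy data has none, but the call is the same)
X_filled = IterativeImputer(random_state=0).fit_transform(X)

# Train a small tree, then Tip 3: persist it for reuse without retraining
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_filled, y)
joblib.dump(clf, "tree_model.joblib")

# Tip 5: visualize the tree and save it as a PNG
plot_tree(clf, filled=True)
plt.savefig("tree.png")
```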
https://www.educative.io/blog/scikit-learn-tricks-tips