👻4 simple tips for effective data engineering
To keep data engineering projects with hundreds of artifacts, including dependency files, jobs, unit tests, shell scripts, and Jupyter notebooks, from descending into chaos, follow these guidelines:
• manage dependencies, for example through a dependency manager like Poetry
• remember unit tests - introducing unit tests into the project will save you from trouble and improve the quality of your code (see the sketch after this list)
• divide and conquer - keep all data transformations in a separate module
• document - both to remind yourself of the code and the business problem it solves, and to share knowledge with colleagues
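As an illustration of the last tips, here is a minimal sketch of a transformation kept in its own module plus a pytest unit test for it; the module, file, and column names are invented for the example:

# transformations/orders.py - all data transformations live in a separate module
import pandas as pd

def add_total_price(df: pd.DataFrame) -> pd.DataFrame:
    # business rule: order total = quantity * unit price
    df = df.copy()
    df["total_price"] = df["quantity"] * df["unit_price"]
    return df

# tests/test_orders.py - a unit test that documents the expected behavior
import pandas as pd
from transformations.orders import add_total_price

def test_add_total_price():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [10.0, 5.0]})
    result = add_total_price(df)
    assert result["total_price"].tolist() == [20.0, 15.0]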
https://blog.devgenius.io/keeping-your-data-pipelines-organized-fa387247d59e
👣AutoML and more with PyCaret
PyCaret is an open-source, low-code AutoML library in Python that automates most steps of the MLOps workflow. PyCaret has features for analyzing, deploying, and combining models that many other ML frameworks do not have. It lets you go from preparing data to deploying an ML model in minutes, in the development environment of your choice.
In fact, PyCaret is a Python wrapper around several ML libraries and frameworks: scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and others. Its simplicity means it can be used not only by experienced data scientists but also by less experienced users who need to solve complex analytical tasks in a simple way. The library is free to download and use under the MIT license. The package contains several modules whose functions are grouped by the main use cases: from simple classification to NLP and anomaly detection.
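A minimal sketch of a typical PyCaret classification workflow; the sample dataset and target column follow PyCaret's own documentation examples:

from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, predict_model, save_model

data = get_data("juice")                   # sample dataset bundled with PyCaret
exp = setup(data=data, target="Purchase")  # preprocessing: imputation, encoding, train/test split
best = compare_models()                    # train and cross-validate many models, pick the best
predictions = predict_model(best)          # score the hold-out set
save_model(best, "best_pipeline")          # persist the full preprocessing + model pipeline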
https://pycaret.org/
https://github.com/pycaret/pycaret
🐻❄️On the eve of the New Year, speeding up DS: meet Polars
Polars is a fast data preparation library for ML modeling in Python and Rust. It is up to 15 times faster than Pandas because it parallelizes dataframe processing and runs queries in memory. Written in Rust, Polars uses all of the machine's CPU cores and is optimized specifically for data processing workloads while exposing a Python API. The rich API lets you not only work with huge amounts of data at the preprocessing stage but also build working pipelines. Benchmarks show that Polars is ahead not only of Pandas but also of other tools, including computing engines popular in Big Data such as Apache Spark and Dask.
Installing and trying Polars is very easy with the pip package manager:
pip install polars
import polars as pl
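A quick illustrative sketch of the eager API; the file name and columns are made up for the example:

import polars as pl

df = pl.read_csv("sales.csv")              # hypothetical input file
result = (
    df.filter(pl.col("amount") > 100)      # keep only large orders
      .select(["region", "amount"])        # project just the columns we need
)
print(result.head())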
https://www.pola.rs/
https://betterprogramming.pub/this-library-is-15-times-faster-than-pandas-7e49c0a17adc
Forwarded from Alexey Chernobrovov
🔝TOP-25 International Data Science events 2022:
1. WAICF - World Artificial Intelligence Cannes Festival https://worldaicannes.com/ February 10-12, Cannes, France
2. Deep and Reinforcement Learning Summit https://www.re-work.co/events/deep-learning-summit-2022 February 17-18, San Francisco, USA
3. Reinforce https://reinforceconf.com/ March 8-10, Budapest, Hungary
4. MLconf https://mlconf.com/event/mlconf-nyc/ March 31, New York City, USA
5. Open Data Science Conference EAST https://odsc.com/boston/ April 19-21, Boston, USA
6. ICLR - International Conference on Learning Representations https://iclr.cc/ April 25–29, online
7. SDM - SIAM International Conference on Data Mining https://www.siam.org/conferences/cm/conference/sdm22 April 28–30, Westin Alexandria Old Town, Virginia, USA
8. World Summit AI Americas https://americas.worldsummit.ai/ May 4-5, Montreal, Canada
9. The Data Science Conference https://www.thedatascienceconference.com/ May 12-13, Chicago, USA
10. World Data Summit https://worlddatasummit.com/ May 18-22, Amsterdam, The Netherlands
11. Machine Learning Prague https://mlprague.com/ May 27-29, Prague, Czech Republic
12. The AI Summit London https://london.theaisummit.com/ June 15-16, London, UK
13. Machine Learning Week https://www.predictiveanalyticsworld.com/machinelearningweek/ June 19-24, Las Vegas, USA
14. Enterprise AI Summit https://www.re-work.co/events/enterprise-ai-summit-berlin-2022 June 29–30, Berlin, Germany
15. DELTA - International Conference on Deep Learning Theory and Applications https://delta.scitevents.org/ July 12-14, Lisbon, Portugal
16. ICML - International Conference on Machine Learning https://icml.cc/ July 17-23, online
17. KDD - Knowledge Discovery and Data Mining https://kdd.org/kdd2022/ August 14-18, Washington, DC, USA
18. Open Data Science Conference APAC https://odsc.com/apac/ September 7-8, online
19. RecSys – ACM Conference on Recommender Systems https://recsys.acm.org/recsys22/ September 18-23, Seattle, USA
20. INTERSPEECH https://interspeech2022.org/ September 18-22, Incheon, Korea
21. BIG DATA CONFERENCE EUROPE https://bigdataconference.eu/ November 21-24, Vilnius, Lithuania
22. EMNLP - Conference on Empirical Methods in Natural Language Processing https://2021.emnlp.org/ November, TBA
23. Data Science Conference https://datasciconference.com/ November, Belgrade, Serbia
24. Data Science Summit http://dssconf.pl/ December, Warsaw, Poland
25. NeurIPS https://nips.cc/ December, TBA
🚀Speed up scikit-learn: a new extension of the good old Python library for DS
The popular scikit-learn Python library is familiar to every Data Scientist. It has many advantages, but unlike the powerful ML frameworks PyTorch and TensorFlow, scikit-learn does not offer fast, hardware-accelerated model training out of the box. Sklearnex (Intel Extension for Scikit-learn), a scikit-learn extension from Intel Corporation, addresses this issue. Sklearnex is a free AI software module that provides 10x to 100x acceleration for a variety of applications. It stays fully compatible with the scikit-learn API, speeding up code by replacing standard algorithms with optimized versions. The extension supports Python 3.6 and newer, and you can install it using the usual pip or conda package managers:
pip install scikit-learn-intelex
conda install scikit-learn-intelex -c conda-forge
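After installation, the usual pattern is to patch scikit-learn before importing any estimators; a minimal sketch:

from sklearnex import patch_sklearn
patch_sklearn()  # swap supported scikit-learn algorithms for optimized implementations

# import scikit-learn estimators only after patching
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)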
https://intel.github.io/scikit-learn-intelex/
https://medium.com/@vamsik23/boost-sklearn-using-intels-sklearnex-cf2669f425bd
🏂How to choose a validation measure for ML models: Yandex approach
Every practical machine learning project faces the problem of measuring results. Different measures can lead to different assessments and, therefore, to different chosen algorithms, so it is very important to find a suitable quality measure. Researchers from Yandex compare various approaches to validating typical ML tasks, from classification to clustering, in order to formulate a universal method for choosing the most suitable quality measure. The key ideas and main results are presented in recent papers published at ICML 2021 and NeurIPS 2021, and a short summary is available on the Yandex website: https://research.yandex.com/news/how-to-validate-validation-measures.
http://proceedings.mlr.press/v139/gosgens21a/gosgens21a.pdf
https://papers.nips.cc/paper/2021/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf
😎How to read tables from PDF: tabula-py
Sometimes the raw data for analysis is stored in PDF documents. To automatically extract data from this format straight into a dataframe, try tabula-py. It is a simple Python wrapper for tabula-java that can read PDF tables and convert them to pandas dataframes as well as CSV/TSV/JSON files.
First install it with the pip package manager: pip install tabula-py
Then import it in your Python script:
import tabula as tb
And you can use it like this:
file = 'DataFile.pdf'
tables = tb.read_pdf(file, pages='12')  # read_pdf returns a list of dataframes, one per table found
df = tables[0]
Examples: https://medium.com/codestorm/how-to-read-and-scrape-data-from-pdf-file-using-python-2f2a2fe73ae7
Documentation: https://tabula-py.readthedocs.io/en/latest/
💥Top 5 Data Engineering Trends in 2022: Astronomer Research
Astronomer, the company that commercializes and promotes Apache Airflow, the popular batch data orchestration tool, conducted a series of interviews with data engineering experts to identify the most pressing trends in the field:
• Data lineage, Data provenance and Data Quality
• Decentralization of data across different contexts and teams, but within a single consistent infrastructure with centralization of resources
• Consolidation of data tools, including orchestration of processing pipelines
• Data Mesh, which eliminates silos between data teams by connecting the platforms they use
• Mutual integration of DataOps, MLOps, and AIOps for more efficient and faster use of consistent data and for seamless tooling around it.
https://www.astronomer.io/blog/top-data-management-trends-2022
🗣SQL queries against CSV files with csvkit
csvkit is a command-line toolkit for converting and working with CSV files. It allows you to perform the following operations:
• Convert Excel and JSON files to CSV
• Display only column names
• Slice data
• Change the order of columns
• Find rows with matching cells
• Convert CSV to JSON
• Generate summary statistics
• Query CSV files with SQL
• Import data into databases and extract data from them
• Parse CSV data
• Work with column delimiters
The pip package manager will help you install csvkit: pip install csvkit
The syntax for querying a CSV file with SQL on the command line looks like this:
csvsql --query "SQL query here, using the source file name (without .csv) as the table name" source_filename > target_filename
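For example, assuming a file playground.csv with a manufacturer column (the same hypothetical data as in the Python snippet below), the query could be run directly in the shell:

csvsql --query "select distinct manufacturer from playground" playground.csv > manufacturers.csv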
To use this in your Python script you should:
1) first import CSVSQL from the csvkit utilities
from csvkit.utilities.csvsql import CSVSQL
2) then define the arguments as a list of values, for example:
args = ['--query', 'select distinct manufacturer from playground', 'playground.csv']
3) then call CSVSQL with the arguments
result = CSVSQL(args)
4) finally, print the results
print(result.main())
https://csvkit.readthedocs.io/en/latest/index.html
https://medium.com/data-engineering-ramstkp/sql-queries-on-csv-using-python-24a472fe53b1
🚀Accelerating Big Data Analytics: Expedia Group Case Study with Apache Druid and DataSketches
When analyzing big data, you often run into queries that do not scale because they require enormous computational resources and time to produce exact results: counting distinct items, computing quantiles, finding the most frequent items, joining tables in SQL queries, matrix calculations, and graph analysis. If approximate results are acceptable for such calculations, there are special streaming algorithms, or sketches, that run several orders of magnitude faster with acceptable error. Sketches helped Yahoo reduce processing time from days or hours to minutes or seconds. One such tool is the open-source Apache DataSketches library.
It is used by the large travel company Expedia Group to speed up time series analysis in Apache Druid, where table joins are limited and require a single dataset to fit in memory. DataSketches supports set operations, including union, intersection, and difference, with little loss of precision, which is useful when searching for and booking tickets. With DataSketches, each dataset can be queried independently in Druid to get the sketch object for each dataset for a preliminary and then a final calculation. Since Druid did not initially support merging DataSketches objects, Expedia Group engineers had to write their own Java code. Moreover, a DataSketches object takes up very little memory even for a large set. As a result, Apache Druid, a column-oriented DBMS for ingesting huge amounts of event data and serving low-latency queries, became even faster.
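As a rough illustration of the idea (the Expedia case itself runs in Java inside Druid), here is a sketch using Theta sketches from the datasketches Python package; the user-ID data is synthetic:

from datasketches import update_theta_sketch, theta_union, theta_intersection

# approximate distinct-count sketches, one per dataset
searches = update_theta_sketch()
bookings = update_theta_sketch()
for user_id in range(100_000):
    searches.update(user_id)
    if user_id % 10 == 0:          # only some searchers end up booking
        bookings.update(user_id)

# set operations on sketches instead of joining the raw tables
union = theta_union()
union.update(searches)
union.update(bookings)
inter = theta_intersection()
inter.update(searches)
inter.update(bookings)

print("distinct searchers ~", searches.get_estimate())
print("searched or booked ~", union.get_result().get_estimate())
print("searched and booked ~", inter.get_result().get_estimate())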
https://datasketches.apache.org/
https://medium.com/expedia-group-tech/fast-approximate-counting-using-druid-and-datasketch-f5f163131acd
🌏5 Essential Components of Gartner's Digital Government Technology Platform
The Digital Government Technology Platform (DGTP) makes digital transformation a reality, but requires dedicated leadership. According to a Gartner study, by 2023, more than 80% of government digital implementations that are not based on a technology platform will fail.
DGTP is a set of end-to-end, integrated, horizontal capabilities that coordinate government services across multiple domains by integrating five platforms:
• Citizen Experience platform provides interfaces and technologies, implements policies and procedures for citizen-business interaction, and measures the experience of its users;
• Ecosystem platform - a set of digital interfaces that implement policies and procedures for governments and ecosystem partners to share data and services;
• Internet of Things (IoT) platform provides interfaces, data management and context, and implements policies and procedures for collecting and processing data from IoT sensors;
• Information System platform - corporate information systems are at the heart of government IT efforts today; this platform provides the technologies, policies and procedures for integrating these back-office systems into the DGTP;
• Intelligence platform provides advanced analytics, geospatial and location analytics, robotic process automation (RPA) and AI capabilities to process data collected or stored in any area of the platform.
The key reusable components in DGTP are applications and services that can provide a seamless mix of data, services, and capabilities that work together within DGTP and are accessible across networks and devices. DGTP is not a turnkey solution, but it gives government agencies the ability to innovate, reduce costs, and deliver new capabilities quickly and flexibly.
https://www.gartner.com/en/articles/government-cios-here-s-an-essential-piece-of-the-digital-transformation-puzzle
Forwarded from Big Data Science [RU]
Components of Gartner's digital government technology platform
🍏Bayesian statistics with PyMC3: brief overview
Frequentist statistics relies on long-run event rates (many data points) to estimate the variable of interest. The Bayesian method can also work without a lot of events, even with a single data point. Frequentist analysis gives a point estimate, while Bayesian analysis gives a distribution that can be interpreted as the confidence that the mean of the distribution is a good estimate for the variable, with the uncertainty expressed as the standard deviation.
The Bayesian approach is useful in ML problems where both the estimates and their reliability matter, for example, "today it could rain with a 60% chance." The main formula underlying the Bayesian approach is Bayes' theorem, which lets you calculate the posterior probability P(A|B) of event A given event B; the full formula is written out after the list of terms.
• P(B|A) is the likelihood: given that event A happened, how likely is event B?
• P(A) is the probability of event A, the prior (initial) assumption about the variable of interest.
• P(B) is the probability of event B (the evidence), which is usually the hard part to compute when estimating the posterior probability.
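Putting these terms together, Bayes' theorem reads:
P(A|B) = P(B|A) × P(A) / P(B)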
You can quickly calculate the Bayesian probability using the PyMC3 Python library https://docs.pymc.io/en/v3/. It allows you to write models using an intuitive syntax to describe the data generation process. PyMC3 allows you to tune an ML model with gradient-based MCMC algorithms like NUTS, with ADVI for fast approximate inference, including a mini-batch ADVI for scaling to large datasets, or with Gaussian processes to build Bayesian non-parametric models. PyMC3 includes a complete set of predefined statistical distributions that can be used as the building blocks of a Bayesian model.
This probabilistic programming package for Python allows users to fit Bayesian models using various numerical methods, most notably Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). In addition to model specification and fitting functions, PyMC3 includes functions for summarizing output and diagnosing the model.
PyMC3 aims to make Bayesian modeling as simple and painless as possible by allowing users to focus on their scientific problem rather than the methods used to solve it. The package uses Theano as a computational backend to quickly evaluate an expression, compute the gradient automatically, and perform computations on the GPU.
PyMC3 also has built-in support for Gaussian process modeling, letting you generalize models and build graphs. There is model validation and convergence detection, custom step methods, and non-standard probability distributions. Bayesian models built with PyMC3 can be embedded in larger programs, and the results can be analyzed with any Python tools.
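A minimal sketch of a Bayesian linear regression in PyMC3 on synthetic data; the variable names and priors are chosen purely for illustration:

import numpy as np
import pymc3 as pm

# synthetic data: y = 2x + 1 + noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + 1 + rng.normal(scale=0.5, size=200)

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sigma=10)   # prior for the intercept
    beta = pm.Normal("beta", mu=0, sigma=10)     # prior for the slope
    sigma = pm.HalfNormal("sigma", sigma=1)      # prior for the noise scale
    mu = alpha + beta * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(1000, tune=1000, return_inferencedata=True)  # NUTS by default

print(pm.summary(trace))  # posterior means, standard deviations, and convergence diagnostics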
https://medium.com/@akashkadel94/bayesian-statistics-overview-and-your-first-bayesian-linear-regression-model-ba566676c5a7
💥5 YOUTUBE channels for a data engineer from popular DS bloggers
• Ken Jee https://www.youtube.com/c/KenJee1/videos - 183 thousand subscribers and about 200 videos about Data Science, big data engineering, ML and sports analytics
• Karolina Sowinska https://www.youtube.com/c/KarolinaSowinska/videos - 30+ thousand subscribers and almost 60 great videos about Airflow, AI, ETL and the career of a data engineer
• Shashank Mishra https://www.youtube.com/c/LearningBridge/video - 40+ thousand subscribers and more than 150 videos about the everyday life of data engineers, DS course reviews, interview tips, and the personal experience of an author who has worked at Amazon, McKinsey & Company, Paytm and other large corporations, as well as startups
• Seattle Data Guy https://www.youtube.com/c/SeattleDataGuy/videos - almost 20 thousand subscribers and more than 100 videos about the soft and hard skills of a data engineer, life hacks for everyday data collection and aggregation tasks in Python and beyond, SQL best practices, an introduction to R and much more
• Andreas Kretz https://www.youtube.com/c/andreaskayy/videos - about 27 thousand subscribers and more than 500 videos about vanilla and proprietary Hadoop, Spark, Kafka, AWS services and other cloud platforms, ETL basics, installation details and practical use of different Big Data technologies, and the specifics of the data engineer profession
🏸Zingg + TigerGraph combo for deduplication and big data graph analytics
Graph databases with built-in relationship patterns are great for record disambiguation and entity resolution. For example, TigerGraph is a powerful graph analytics system. And if you supplement it with the open-source ML tool Zingg (https://github.com/zinggAI/zingg), you can find duplicate and ambiguous records even faster.
Imagine that the same person is recorded differently in different systems. It then becomes very difficult to analyze their behavior, for example, to generate a personalized marketing offer or include them in a loyalty program. Zingg has built-in blocking mechanisms that compute pairwise similarity only for selected candidate records, which reduces computation time and helps scale to large datasets. You don't have to worry about manually linking or grouping records: the internal entity resolution framework takes care of that. So with Zingg and TigerGraph you can combine simple, scalable entity resolution with downstream graph analysis.
https://towardsdatascience.com/entity-resolution-with-tigergraph-add-zingg-to-the-mix-95009471ca02
LaMDA: Safe, Grounded, and High-Quality Dialog Model from Google AI
LaMDA is created by fine-tuning a family of dialogue-specific Transformer-based neural language models with up to 137B parameters and teaching the models to use external knowledge sources. LaMDA has three key objectives:
• Quality, measured in terms of Sensibleness, Specificity, and Interestingness, rated by human evaluators. Sensibleness indicates that a response makes sense in the context of the dialogue, for example, that the ML model gives no absurd answers and does not contradict its earlier answers. Specificity indicates whether the response is specific to the context of the preceding dialogue. Interestingness measures the emotional reaction of the interlocutor to the model's answers.
• Safety, so that the model's responses do not contain offensive or dangerous statements.
• Groundedness - modern language models often generate statements that seem plausible but in fact contradict facts in external sources. Groundedness is defined as the percentage of responses containing claims about the external world that can be verified by authoritative external sources. A related metric, Informativeness, is defined as the percentage of responses with information about the external world that can be confirmed by known sources.
LaMDA models undergo two-stage training: pre-training and fine-tuning. Pre-training was performed on a dataset of 1.56T words from publicly available dialogue data and public web documents, which after tokenization amounted to 2.81T tokens; the model was trained to predict each next token in a sentence given the previous ones. The pretrained LaMDA model has also been widely used in NLP research at Google, including program synthesis, zero-shot learning, and more.
In the fine-tuning phase, LaMDA is trained on a mix of generative tasks, which produce natural language responses in given contexts, and classification tasks, which judge the safety and quality of the model's responses. This results in a single multitask model: the LaMDA generator is trained to predict the next token on the dialogue dataset, while the classifiers are trained to predict safety and quality scores for a response in context using annotated data.
Test results showed that the fine-tuned LaMDA significantly outperforms the pre-trained model in every dimension and at every scale. Quality metrics improve as the number of model parameters increases, with or without fine-tuning. Safety is not improved by scaling alone but is improved by fine-tuning. Groundedness improves as the model grows, thanks to its ability to memorize uncommon knowledge, and fine-tuning lets the model access external sources, effectively offloading part of the burden of remembering knowledge to them. Fine-tuning narrows the gap to human-level quality, although the model's performance remains below human level on safety and groundedness.
https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html
👀Upscaling video games with NVIDIA's DLDSR
DLDSR (Deep Learning Dynamic Super Resolution) is an image enhancement technology for video games that uses a multilayer neural network and needs fewer rendered pixels. DLDSR at 2.25X is comparable in quality to 4X resolution with the previous-generation DSR technology, while performing much better thanks to the tensor cores of RTX graphics cards, which accelerate the neural network several times over. You can try DLDSR on your gaming PC by updating the graphics driver and choosing the desired settings.
https://www.rockpapershotgun.com/nvidias-deep-learning-dynamic-super-resolution-tech-is-out-now-heres-how-to-enable-it
🌦TOP-10 Data Science conferences in February 2022:
1. 02 Feb - Virtual conference DataOps Unleashed https://dataopsunleashed.com/
2. 03 Feb - Beyond Big Data: AI/Machine Learning Summit 2022, Pittsburgh, USA https://www.pghtech.org/events/BeyondBigData2022
3. 10 Feb - Online-summit AICamp ML Data Engineering https://www.aicamp.ai/event/eventdetails/W2022021009
4. 12-13 Feb - IAET International Conference on Machine Learning, Smart & Nanomaterials, Design Engineering, Information Technology & Signal Processing. Budapest, Hungary https://institute-aet.com/mns-22/
5. 16 Feb - DSS Hybrid Miami: AI & ML in the Enterprise. Miami, FL, USA & Virtual https://www.datascience.salon/miami/
6. 17-18 Feb - RE.WORK summits. San Francisco, CA, USA and Online
Reinforcement Learning Summit: https://www.re-work.co/events/reinforcement-learning-summit-2022
Deep Learning Summit: https://www.re-work.co/events/deep-learning-summit-2022
Enterprise AI Summit: https://www.re-work.co/events/enterprise-ai-summit-2022
7. 18-20 Feb - International Conference on Compute and Data Analysis (ICCDA 2022). Sanya, China http://iccda.org/
8. 21-25 Feb - WSDM'22, The 15th ACM International WSDM Conference. Online. http://www.wsdm-conference.org/2022/
9. 22-23 Feb - AI & ML Developers Conference. Virtual. https://cnvrg.io/mlcon
10. 26-27 Feb - 9th International Conference on Data Mining and Database (DMDB 2022). Vancouver, Canada https://ccseit2022.org/dmdb/
🚗Yandex Courier Robots in Seoul
Last year, Yandex's autonomous courier robots began delivering orders in Russia, as well as restaurant food in Ann Arbor, Michigan, and on other US university campuses. In January 2022, Yandex signed a memorandum of understanding with KT Corporation, a large South Korean telecommunications company, on delivery by autonomous robots in Seoul. So this year South Korea will become the first country in East Asia where Yandex rovers operate. The company is also preparing to launch this technology in Dubai.
http://www.koreaherald.com/view.php?ud=20220118000709