🙌🏻XLS-R is a new set of ML models from Facebook AI
The PyTorch team at Facebook has published XLS-R, a set of large-scale models for self-supervised cross-lingual speech representation learning based on wav2vec 2.0. The models were trained on more than 400,000 hours of unlabeled speech in 128 languages. With fine-tuning, they deliver strong speech recognition results and perform well on translation, language understanding, and language identification tasks. The training data was drawn from a variety of sources, including audiobooks and recordings of parliamentary proceedings. The largest XLS-R model contains more than 2 billion parameters, and all the models are multilingual. Moreover, testing has shown that training on several languages at once improves the models' performance. You can download XLS-R from GitHub right now.
https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr
Visual Genome: one of the most densely annotated datasets
👁Scientists at Stanford University have assembled one of the most densely annotated image datasets available, with over 100,000 images. In total, the dataset contains almost 5.5 million object descriptions, attributes, and relationships. You don't even have to download the dataset: you can fetch just the data you need by calling the RESTful API endpoints with GET requests. Although the latest updates to the dataset date back to 2017, it remains an excellent data set for training models on typical ML problems, from object recognition to reasoning over its scene graphs with graph algorithms.
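If the REST route is enough for your task, a minimal sketch with the requests library might look like this; the endpoint path and response fields follow the API home page linked below and should be treated as assumptions to verify against the docs.
import requests

BASE = "https://visualgenome.org/api/v0"

# Fetch the region descriptions for a single image (id=1 here is arbitrary).
resp = requests.get(f"{BASE}/images/1/regions", timeout=30)
resp.raise_for_status()
regions = resp.json()

for region in regions[:5]:
    # Each region is expected to carry a short phrase and bounding-box coordinates.
    print(region.get("phrase"), region.get("x"), region.get("y"))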
https://visualgenome.org/api/v0/api_home.html
How to read Parquet files
The Apache Parquet format is widely used in Big Data thanks to its columnar storage and efficient compression. It lets you read just the columns you need instead of full rows, which saves time. However, not every application can read binary Parquet files, so it is often necessary to convert them to CSV or TXT before opening them in, for example, MS Excel or Power BI, which are common tools in a DS specialist's work. The Python pandas library helps here with its built-in read_parquet() function for loading a Parquet file into a dataframe. The dataframe can then be saved to a CSV file with the to_csv() method and opened in almost any spreadsheet editor.
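As an illustration, here is a minimal sketch of that workflow; the file names are placeholders, and pyarrow (or fastparquet) must be installed as the Parquet engine.
import pandas as pd

df = pd.read_parquet("events.parquet")   # reads the columnar file into a DataFrame
df.to_csv("events.csv", index=False)     # now openable in Excel, Power BI, etc.
print(df.head())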
https://medium.com/@i.m.mak/workaround-for-reading-parquet-files-in-power-bi-e2d060abcb80
💦What is a CDC system and why you need it
To reduce the amount of data read from a corporate warehouse or data lake while still staying on top of the latest changes, the CDC approach is used: Change Data Capture (also called Capture Changed Data). There are ready-made CDC tools (Oracle GoldenGate, Qlik Replicate, HVR) that are best suited for pulling data from frequently updated relational DBMSs. Data engineers also build their own solutions:
• CDC based on timestamps that mark creation, update, and expiration in the source tables. Any process that inserts, updates, or deletes a row must also update the corresponding timestamp column, and hard deletes are not allowed. The drawbacks are that you have to redesign the table structure to add the timestamp column, and the source table becomes tightly coupled to the ETL process code (a minimal sketch of this pattern follows below).
• CDC based on a minus query, where a link is created between the source and the target sink and a MINUS SQL query is executed to compute the change log. This method is closer to an anti-pattern: it only works if the source and destination databases are of the same type, and it increases the amount of data moved.
Both hand-written CDC methods impose significant overhead on the source database. Dedicated CDC tools reduce the load on the network and the source by analyzing database logs to compute changes. However, the main disadvantage of off-the-shelf CDC solutions is their high cost. In addition, the source DBMS administrator must grant the CDC tool privileged access to the database log, which raises security concerns.
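As promised above, here is a minimal sketch of the timestamp-based pattern; the table, column names, and SQLite source are illustrative placeholders, not tied to any particular system.
import sqlite3
import pandas as pd

conn = sqlite3.connect("source.db")             # placeholder source database
last_watermark = "2021-12-01 00:00:00"          # normally persisted by the ETL job

# Pull only the rows changed since the last successful run.
changes = pd.read_sql(
    "SELECT * FROM orders WHERE updated_at > ?",
    conn,
    params=(last_watermark,),
)

# Downstream, 'changes' is merged into the target table and the watermark is
# advanced to max(changes.updated_at) before the next incremental run.
print(len(changes), "changed rows")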
https://towardsdatascience.com/change-data-capture-cdc-for-data-ingestion-ca81ff5934d2
💥Python and SQL combo with FugueSQL: a single SQL interface for Pandas, Spark and Dask DataFrames
FugueSQL is an open-source Python library that lets you combine Python code with SQL, switching between them in a Jupyter Notebook or Python script. FugueSQL supports distributed computing and provides a unified API to run the same SQL code on Pandas, Dask, and Apache Spark.
Unlike PandaSQL, which relies on a single SQLite backend and therefore incurs a lot of overhead transferring data between Pandas and the database, FugueSQL supports multiple local backends: pandas, DuckDB, and SQLite.
When using the pandas backend, Fugue translates SQL directly into pandas operations, so no data transfer takes place. DuckDB has excellent pandas support, so its data transfer overhead is negligible. Both pandas and DuckDB are therefore the preferred FugueSQL backends for local data processing. Fugue also supports Spark, Dask, and cuDF (via BlazingSQL) as backends.
FugueSQL code is parsed with ANTLR and mapped to equivalent functions in the Fugue API. FugueSQL has many features built in, and the code can be extended with Python. Out of the box it supports the most common operations: filling nulls, dropping nulls, renaming columns, changing the schema, and more. Fugue also adds several enhancements to standard SQL to handle end-to-end data workflows gracefully, for example creating intermediate tables by assigning a query result to a variable.
The %%fsql cell magic uses the NativeExecutionEngine (pandas) by default. On Dask, FugueSQL is slightly slower than the native engine but more complete in terms of implemented SQL keywords. FugueSQL also runs on Spark, mapping %%fsql operations to Spark and Spark SQL operations. This makes it quick to develop distributed applications: create a local prototype with the NativeExecutionEngine, test it, and deploy it to a Spark cluster just by changing the execution engine.
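For illustration, a minimal sketch of running the same query locally and on Spark, assuming the fsql entry point shown in the Fugue tutorials (exact imports and engine names may differ between Fugue versions):
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"city": ["NY", "NY", "LA"], "amount": [10, 20, 5]})

query = """
SELECT city, SUM(amount) AS total
  FROM df
 GROUP BY city
 PRINT
"""

fsql(query, df=df).run()            # local NativeExecutionEngine (pandas)
# fsql(query, df=df).run("spark")   # the same query on a Spark cluster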
https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27
https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html
👻What is UMAP and why is it useful for a Data Scientist?
UMAP (Uniform Manifold Approximation and Projection) is a general-purpose manifold learning and dimensionality reduction algorithm. It is designed to be compatible with scikit-learn, uses the same API, and can be added to sklearn pipelines. As a stochastic algorithm, UMAP uses randomization to speed up the approximation and optimization steps, so different UMAP runs may produce different results. Although UMAP is relatively stable and the difference between runs should ideally be small, it does exist. To reproduce results exactly, UMAP lets the user set a fixed random state.
Since version 0.4, UMAP has also supported multithreading to improve performance, and the optimization stage allows race conditions between threads at certain steps. The randomness of multithreaded UMAP output therefore depends not only on the input random seed but also on race conditions during optimization, which cannot be controlled. As a result, multithreaded UMAP results cannot be exactly reproduced.
UMAP can be used as an efficient preprocessing step to improve the performance of density-based clustering. But UMAP, like t-SNE, does not fully preserve density and can create false discontinuities in clusters. Compared to t-SNE, UMAP preserves more of the global structure, producing more meaningful clusters. And thanks to support for arbitrary embedding dimensions, UMAP is not limited to 2D or 3D output and can reduce data to higher-dimensional spaces as well.
Because of its heavy use of nearest-neighbor search, UMAP can consume excessive memory on some datasets. Setting low_memory to True switches to a slower but less memory-intensive approach to computing nearest neighbors. It is also important to know that, when run without a fixed random seed, UMAP uses Numba's parallel implementation to spread work across CPU cores. By default it uses as many cores as are available; you can limit the number of threads Numba uses with the NUMBA_NUM_THREADS environment variable. Also, due to the nature of Numba, UMAP does not support 32-bit Windows.
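A minimal usage sketch of the options discussed above, on synthetic placeholder data:
import numpy as np
import umap

X = np.random.rand(1000, 50)        # placeholder data

reducer = umap.UMAP(
    n_components=2,
    random_state=42,                # reproducible, but disables multi-threaded optimization
    low_memory=True,                # slower, less memory-hungry nearest-neighbor search
)
embedding = reducer.fit_transform(X)
print(embedding.shape)              # (1000, 2)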
Despite some disadvantages, UMAP can be used in the following cases:
• exploratory data analysis (EDA);
• interactive visualization of the analysis results;
• processing of sparse matrices;
• detection of malicious programs based on behavioral data;
• preprocessing vectors of phrases for clustering;
• preprocessing of image embeddings (Inception) for clustering.
https://github.com/lmcinnes/umap
https://umap-learn.readthedocs.io/en/latest/index.html
🚀Speed up DS with big data: Pandas API right in Apache Spark
The popular computing framework Apache Spark lets you write programs in Python, which is familiar to every DS specialist. PySpark now includes a pandas API that can be imported with a single line: import pyspark.pandas as ps.
This provides the following benefits:
• lowers the barrier to entry for Spark;
• unifies the codebase for small and big data, local machines and distributed clusters;
• speeds up Pandas code.
By the way, Pandas on Spark is even faster than Dask, another popular Python engine!
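A minimal sketch of what this looks like in practice (the CSV path is a placeholder):
import pyspark.pandas as ps

psdf = ps.read_csv("sales.csv")                    # distributed read
summary = psdf.groupby("region")["amount"].sum()   # pandas-style API, Spark execution
print(summary.head())

pdf = summary.to_pandas()   # collect to a local pandas object once it is small enough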
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
👑Introducing useful DS tools: meet Streamlit
Streamlit is an open-source Python library that makes it easy to create and publish beautiful custom web applications for ML and DS, letting you build and deploy powerful apps in just a couple of minutes. Comparing Streamlit to Dash is a bit like comparing Python to C#: Streamlit makes it easy to build data web apps in pure Python, often in just a few lines of code. For example, one-line commands display interactive Plotly, Bokeh, and Altair visuals, Pandas DataFrames, and more. Streamlit is backed by a huge open-source developer community, and you can add your own components to the library using JavaScript. Cloud hosting of Streamlit apps is open to everyone: you can create and host up to three applications for free.
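A minimal sketch of a Streamlit app on synthetic data (save it as app.py and launch it with streamlit run app.py):
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo dashboard")

df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])

st.line_chart(df)            # one-line interactive chart
st.dataframe(df.head(10))    # interactive table view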
https://streamlit.io/
☀️Meet Gallia: a new library for data transformation
This schema-aware Scala library comes in handy for practical data transformation, including ETL processes, feature engineering, HTTP responses, and more. Highly scalable, it is designed to bridge the gap between Pandas and Spark SQL. Gallia is useful both for those who appreciate Scala's powerful type system and for those who struggle to read overly fancy SQL queries. Essentially, Gallia implements a one-stop-shop paradigm: most or all of your data transformation needs live in a single application. The library supports all kinds of data manipulation, from aggregations to pivoting, including processing individual and nested objects, not just collections. For scaling, Gallia integrates neatly with the Spark RDD API.
https://cros-anthony.medium.com/gallia-a-library-for-data-transformation-3fafaaa2d8b9
https://github.com/galliaproject/gallia-core/blob/master/README.md
👻4 simple tips for effective data engineering
To keep data engineering projects with hundreds of artifacts (dependency files, jobs, unit tests, shell scripts, and Jupyter notebooks) from descending into chaos, follow these guidelines:
• manage dependencies, for example with a dependency manager like Poetry
• remember unit tests - introducing them into the project will save you from trouble and improve the quality of your code
• divide and conquer - keep all data transformations in a separate module
• document your code and the business problem it solves, both to remind yourself later and to share knowledge with colleagues
https://blog.devgenius.io/keeping-your-data-pipelines-organized-fa387247d59e
👣AutoML and more with PyCaret
PyCaret is an open-source AutoML library for Python with a low-code approach to automating most MLOps tasks. PyCaret offers features for analyzing, deploying, and combining models that many other ML frameworks lack. It lets you go from data preparation to a deployed ML model in minutes, in the development environment of your choice.
In fact, PyCaret is a Python wrapper around several libraries and ML frameworks: scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and others. Its simplicity means PyCaret can be used not only by experienced DS specialists but also by ordinary users, letting them solve complex analytical tasks with simple code. The library is free to download and use under the MIT license. The package contains several modules whose functions are grouped by the main use cases, from basic classification to NLP and anomaly detection.
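A minimal sketch of the low-code workflow on a binary-classification task; the dataset and target column are placeholders, and parameter names follow the PyCaret 2.x API (older versions may also need silent=True in setup to skip the interactive dtype confirmation):
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model

data = pd.read_csv("churn.csv")                     # placeholder dataset

setup(data=data, target="churned", session_id=123)  # preprocessing + train/test split
best = compare_models()                             # train and rank candidate models
predictions = predict_model(best)                   # score the hold-out set
print(predictions.head())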
https://pycaret.org/
https://github.com/pycaret/pycaret
🐻❄️On the eve of the New Year, speeding up DS: meet Polars
Polars is a fast DataFrame library for Python and Rust, well suited to preparing data for ML modeling. It is up to 15 times faster than Pandas, parallelizing DataFrame processing and in-memory queries. Written in Rust, Polars uses all of the machine's cores, is optimized for data processing workloads, and offers a Python API. The rich API lets you not only work with huge amounts of data at the preparation stage but also build working pipelines. Benchmark comparisons show Polars ahead not only of Pandas but also of other tools, including computing engines popular in Big Data such as Apache Spark and Dask.
Installing and trying Polars is very easy with the pip package manager:
pip install polars
import polars as pl
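For a quick feel of the API, a minimal sketch of typical operations; the CSV path is a placeholder, and method names follow the Polars API of that time (groupby was later renamed group_by):
import polars as pl

df = pl.read_csv("sales.csv")

result = (
    df.filter(pl.col("amount") > 0)
      .groupby("region")
      .agg([pl.col("amount").sum().alias("total")])
)
print(result)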
https://www.pola.rs/
https://betterprogramming.pub/this-library-is-15-times-faster-than-pandas-7e49c0a17adc
Forwarded from Алексей Чернобровов
🔝TOP-25 International Data Science events 2022:
1. WAICF - World Artificial Intelligence Cannes Festival https://worldaicannes.com/ February 10-12, Cannes, France
2. Deep and Reinforcement Learning Summit https://www.re-work.co/events/deep-learning-summit-2022 February 17-18, San Francisco, USA
3. Reinforce https://reinforceconf.com/ March 8-10, Budapest, Hungary
4. MLconf https://mlconf.com/event/mlconf-nyc/ March 31, New York City, USA
5. Open Data Science Conference EAST https://odsc.com/boston/ April 19-21, Boston, USA
6. ICLR - International Conference on Learning Representations https://iclr.cc/ April 25–29, online
7. SDM - SIAM International Conference on Data Mining https://www.siam.org/conferences/cm/conference/sdm22 April 28–30, Westin Alexandria Old Town, Virginia, USA
8. World Summit AI Americas https://americas.worldsummit.ai/ May 4-5, Montreal, Canada
9. The Data Science Conference https://www.thedatascienceconference.com/ May 12-13, Chicago, USA
10. World Data Summit https://worlddatasummit.com/ May 18-22, Amsterdam, The Netherlands
11. Machine Learning Prague https://mlprague.com/ May 27-29, Prague, Czech Republic
12. The AI Summit London https://london.theaisummit.com/ June 15-16, London, UK
13. Machine Learning Week https://www.predictiveanalyticsworld.com/machinelearningweek/ June 19-24, Las Vegas, USA
14. Enterprise AI Summit https://www.re-work.co/events/enterprise-ai-summit-berlin-2022 June 29–30, Berlin, Germany
15. DELTA - International Conference on Deep Learning Theory and Applications https://delta.scitevents.org/ July 12-14, Lisbon, Portugal
16. ICML - International Conference on Machine Learning https://icml.cc/ July 17-23, online
17. KDD - Knowledge Discovery and Data Mining https://kdd.org/kdd2022/ August 14-18, Washington, DC, USA
18. Open Data Science Conference APAC https://odsc.com/apac/ September 7-8, online
19. RecSys – ACM Conference on Recommender Systems https://recsys.acm.org/recsys22/ September 18-23, Seattle, USA
20. INTERSPEECH https://interspeech2022.org/ September 18-22, Incheon, Korea
21. BIG DATA CONFERENCE EUROPE https://bigdataconference.eu/ November 21-24, Vilnius, Lithuania
22. EMNLP - Conference on Empirical Methods in Natural Language Processing https://2021.emnlp.org/ November, TBA
23. Data Science Conference https://datasciconference.com/ November, Belgrade, Serbia
24. Data Science Summit http://dssconf.pl/ December, Warsaw, Poland
25. NeurIPS https://nips.cc/ December, TBA
🚀Speed up scikit-learn: a new extension of the good old Python library for DS
The popular scikit-learn Python library is familiar to every Data Scientist. It has many advantages, but unlike the powerful ML frameworks PyTorch and TensorFlow, scikit-learn does not offer fast model training on GPUs. Sklearnex (Intel® Extension for Scikit-learn) addresses this issue. Sklearnex is a free AI software module that delivers 10x to 100x acceleration for a variety of applications. It is fully compatible with the scikit-learn API and speeds up code by replacing standard algorithms with optimized versions. The extension supports Python 3.6 and newer, and you can install it with the usual pip or conda package managers:
pip install scikit-learn-intelex
conda install scikit-learn-intelex -c conda-forge
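A minimal sketch of how the patching works, on synthetic data; patch_sklearn() must be called before the scikit-learn imports it should accelerate:
from sklearnex import patch_sklearn
patch_sklearn()   # swap in the optimized implementations

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = SVC().fit(X, y)    # runs on the accelerated implementation where available
print(model.score(X, y))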
https://intel.github.io/scikit-learn-intelex/
https://medium.com/@vamsik23/boost-sklearn-using-intels-sklearnex-cf2669f425bd
🏂How to choose a validation measure for ML models: Yandex approach
Every practical machine learning problem raises the question of how to measure results. Different measures can lead to different assessments and, therefore, to different chosen algorithms, so it is very important to find a suitable quality measure. Researchers from Yandex compare various approaches to typical ML problems, from classification to clustering, in order to formulate a universal method for choosing the most appropriate quality measure. The key messages and main results are presented in papers published at ICML 2021 and NeurIPS 2021, and a short summary is available directly on the Yandex website: https://research.yandex.com/news/how-to-validate-validation-measures.
http://proceedings.mlr.press/v139/gosgens21a/gosgens21a.pdf
https://papers.nips.cc/paper/2021/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf
😎How to read tables from PDF: tabula-py
Sometimes the raw data for analysis is stored in PDF documents. To automatically extract data from this format straight into a dataframe, try tabula-py. It is a simple Python wrapper for tabula-java that can read PDF tables and convert them to pandas dataframes as well as CSV/TSV/JSON files.
Just install it first with the pip package manager: pip install tabula-py
Then import it into your Python script:
import tabula as tb
And you can use it like this:
file = 'DataFile.pdf'
tables = tb.read_pdf(file, pages='12')   # returns a list of DataFrames, one per detected table
df = tables[0]
Examples: https://medium.com/codestorm/how-to-read-and-scrape-data-from-pdf-file-using-python-2f2a2fe73ae7
Documentation: https://tabula-py.readthedocs.io/en/latest/
💥Top 5 Data Engineering Trends in 2022: Astronomer Research
Astronomer, the company that commercializes and promotes Apache Airflow, the popular tool for automating batch data workflows, conducted a series of interviews with data engineering experts to identify the most pressing trends in the field:
• Data lineage, data provenance, and data quality
• Decentralization of data across different contexts and teams, but within a single consistent infrastructure with centralized resources
• Consolidation of data tools, including orchestration of processing pipelines
• Data Mesh, eliminating silos between processing teams by connecting the platforms they use
• Mutual integration of DataOps, MLOps, and AIOps for more efficient and faster use of consistent data and tools for seamless work with them
https://www.astronomer.io/blog/top-data-management-trends-2022
🗣SQL queries against CSV files with csvkit
csvkit is a command-line toolkit for converting and working with CSV files, which you can also drive from plain Python. It lets you:
• convert Excel and JSON files to CSV
• display only column names
• slice data
• change the order of columns
• find rows with matching cells
• convert CSV to JSON
• generate summary statistics
• query CSV files with SQL
• import data into databases and extract it from them
• parse CSV data
• work with column delimiters
The pip package manager will help you install csvkit: pip install csvkit
And the syntax for querying a CSV file with SQL on the command line looks like this (the table name in the query is the source file name without the .csv extension):
csvsql --query "SELECT ... FROM source_filename" source_filename.csv > target_filename.csv
To use this in your Python script you should:
1) first import CSVSQL from the csvkit utilities
from csvkit.utilities.csvsql import CSVSQL
2) then define the arguments as a list of values, for example:
args = ['--query', 'select distinct manufacturer from playground', 'playground.csv']
3) then call CSVSQL with the arguments:
result = CSVSQL(args)
4) finally, run the utility to print the results:
result.main()
https://csvkit.readthedocs.io/en/latest/index.html
https://medium.com/data-engineering-ramstkp/sql-queries-on-csv-using-python-24a472fe53b1
🚀Accelerating Big Data Analytics: Expedia Group Case Study with Apache Druid and DataSketches
When analyzing big data, you often run into queries that do not scale because they require enormous computational resources and time to produce exact results: counting distinct items, quantiles, most frequent items, table joins in SQL queries, matrix computations, and graph analysis. If approximate results are acceptable, there are special streaming algorithms, or sketches, that run several orders of magnitude faster with acceptable error. Sketches helped Yahoo reduce processing times from days or hours to minutes or seconds. One such tool is the open-source library Apache DataSketches.
It is used by the large travel company Expedia Group to speed up time series analysis in Apache Druid, where table joins are limited because a joined dataset has to fit in memory. DataSketches supports set operations, including union, intersection, and difference, with little loss of precision, which is useful when analyzing ticket searches and bookings. With DataSketches, each dataset can be queried independently in Druid to obtain a sketch object per dataset for preliminary and then final computation. Since Druid did not initially support merging DataSketches objects, Expedia Group engineers had to write their own Java code. A DataSketches object also takes up very little memory, no matter how large the underlying set. As a result, Apache Druid, a column-oriented DBMS for ingesting huge volumes of event data and serving low-latency queries, became even faster.
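To make the idea concrete, a minimal sketch of approximate distinct counting and a set intersection with the Apache DataSketches Python bindings (pip install datasketches); the event streams are synthetic and the exact class names should be checked against your version of the library:
from datasketches import update_theta_sketch, theta_intersection

searches = update_theta_sketch()
bookings = update_theta_sketch()

for user_id in range(0, 100_000):          # placeholder stream of search events
    searches.update(user_id)
for user_id in range(50_000, 120_000):     # placeholder stream of booking events
    bookings.update(user_id)

print("approx. distinct searchers:", searches.get_estimate())

inter = theta_intersection()
inter.update(searches)
inter.update(bookings)
print("approx. users who both searched and booked:", inter.get_result().get_estimate())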
https://datasketches.apache.org/
https://medium.com/expedia-group-tech/fast-approximate-counting-using-druid-and-datasketch-f5f163131acd
🌏5 Essential Components of Gartner's Digital Government Technology Platform
The Digital Government Technology Platform (DGTP) makes digital transformation a reality but requires dedicated leadership. According to a Gartner study, by 2023 more than 80% of government digital implementations that are not built on a technology platform will fail.
DGTP is a set of end-to-end, integrated, horizontal capabilities that coordinate government services across multiple domains by integrating five platforms:
• The Citizen Experience platform provides interfaces and technologies, implements policies and procedures for citizen and business interactions, and measures the experience of its users;
• The Ecosystem platform is a set of digital interfaces that implement policies and procedures for governments and ecosystem partners to share data and services;
• The Internet of Things (IoT) platform provides interfaces, data management and context, and implements policies and procedures for collecting and processing data from IoT sensors;
• The Information System platform covers the corporate information systems at the heart of government IT today, providing the technologies, policies, and procedures for integrating these back-office systems into the DGTP;
• The Intelligence platform provides advanced analytics, geospatial and location analytics, robotic process automation (RPA), and AI capabilities to process data collected or stored in any area of the platform.
The key reusable components in DGTP are applications and services that can provide a seamless mix of data, services, and capabilities that work together within DGTP and are accessible across networks and devices. DGTP is not a turnkey solution, but it gives government agencies the ability to innovate, reduce costs, and deliver new capabilities quickly and flexibly.
https://www.gartner.com/en/articles/government-cios-here-s-an-essential-piece-of-the-digital-transformation-puzzle