📝👀Fastparquet: Reading Parquet Files with Python
Apache Parquet is a binary column-oriented storage format originally created for the Hadoop ecosystem. Thanks to its compact and efficient column-wise representation of data, it is very popular in the Big Data world. However, reading Parquet data is not always straightforward. PySpark can handle it, of course, but not every Data Scientist works with Apache Spark. This is where fastparquet comes in: a Python implementation of the Parquet format used by Dask, Pandas, and others to deliver high performance with a small distribution size and codebase. Fastparquet depends on a set of Python libraries (numpy, pandas, cramjam, fsspec), so they should be installed beforehand.
After installation via the pip package manager (pip install fastparquet) or from GitHub (pip install git+https://github.com/dask/fastparquet), the contents of a Parquet file can easily be loaded into a dataframe in your usual DS IDE, such as a Jupyter Notebook:
from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
Or write a dataframe to a Parquet file, specifying the row-group boundaries, compression codec, and file scheme:
from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
compression='GZIP', file_scheme='hive')
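Since pandas can delegate Parquet I/O to fastparquet, the same files can also be read and written through the familiar pandas API. A minimal sketch (file names reuse those from the example above):
import pandas as pd

# read a Parquet file, using fastparquet as the engine
df = pd.read_parquet('outfile.parq', engine='fastparquet')

# write it back, again delegating to fastparquet
df.to_parquet('copy.parq', engine='fastparquet', compression='gzip')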
https://github.com/dask/fastparquet
https://www.anaconda.com/blog/whats-new-with-fastparquet
https://blog.datasyndrome.com/using-the-python-ml-stack-inside-pyspark-de1223942c32
https://fastparquet.readthedocs.io/en/latest/
🌏Digital Twin of Earth from NVIDIA
To help prevent climate disasters, NVIDIA plans to build the world's most powerful AI supercomputer dedicated to predicting climate change. The system, Earth-2, will host a digital twin of the Earth and will be a counterpart to Cambridge-1, the AI supercomputer NVIDIA built for healthcare research. By combining three technologies - GPU-accelerated computing, deep learning on neural networks, and AI supercomputers - with large volumes of observational and model data, scientists and engineers aim to achieve accurate simulations of physical, biological, and chemical processes on Earth. This should help produce early warnings for adapting and hardening urban infrastructure, so that people and countries can act quickly to prevent climate disasters.
https://blogs.nvidia.com/blog/2021/11/12/earth-2-supercomputer/
🕸🐅Over 50 New Graph Algorithms in TigerGraph: Fall 2021 Release
TigerGraph is a popular, fast, and scalable graph database with massively parallel processing (MPP) and ACID transaction support. Thanks to efficient data compression and its MPP architecture, it can analyze huge amounts of information in real time, and its built-in query language GSQL is close enough to standard SQL to feel familiar to any analyst.
The October 2021 release includes more than 50 new algorithms, for example the graph embedding algorithms node2vec and FastRP, similarity algorithms (approximate nearest neighbors, Euclidean similarity, overlap similarity, and Pearson similarity), topological link prediction algorithms, and random walk algorithms. In the first half of 2022, the developers promise to add neural networks and other ML methods for building analytical pipelines on graphs. Although TigerGraph is positioned as a powerful enterprise solution, the source code is open and freely available on GitHub.
https://www.tigergraph.com/blogs/about-tigergraph/graph-data-science-library/
https://github.com/tigergraph
🙌🏻Principal Component Analysis: 7 Methods for Dimensionality Reduction in Scikit-Learn
One of the main problems of machine learning on large datasets is the huge size of the feature vectors involved in the computations. This makes dimensionality reduction methods, which reduce the number of variables, highly relevant. One such method is Principal Component Analysis (PCA), whose essence is to reduce the dimensionality of a dataset while retaining as much "variability", i.e. statistical information, as possible.
PCA is a statistical method for converting high-dimensional data into low-dimensional data by choosing the most important features, those that carry as much information about the dataset as possible. Features are selected based on the variance they explain in the output: the direction that explains the most variance is the first principal component, the direction responsible for the second-largest variance is the second principal component, and so on. Importantly, the principal components are uncorrelated with each other. Besides speeding up ML algorithms, PCA makes it possible to visualize data by projecting it into a lower dimension for display in 2D or 3D space.
The popular Python library Scikit-learn includes the sklearn.decomposition.PCA class, implemented as a transformer object that learns the components in its fit() method and can then be applied to new data to project it onto those components. Using PCA in Scikit-learn takes two steps:
1. Initialize the PCA class by passing the required number of components to the constructor;
2. Call the fit method and then transform, passing them the feature set; transform returns the specified number of principal components (see the sketch below).
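A minimal sketch of these two steps on a toy dataset (the dataset and the number of components are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 samples, 4 features

pca = PCA(n_components=2)              # step 1: choose the number of components
X_reduced = pca.fit_transform(X)       # step 2: fit and project in one call

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component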
Scikit-learn supports several variations of the PCA method:
• Kernel Principal Component Analysis (KPCA) is a nonlinear dimensionality reduction method that uses a kernel. Kernel PCA was developed to help classify data whose decision boundaries are described by a non-linear function. The idea is to move into a higher-dimensional space in which the decision boundary becomes linear. The sklearn.decomposition module offers different kernels: linear, polynomial (poly), Gaussian radial basis function (rbf), sigmoid, and others. The default is linear, which is suitable if the data is linearly separable.
• Sparse PCA - a sparse version of PCA whose purpose is to extract the set of sparse components that best reconstructs the data. Components extracted by ordinary PCA typically have dense expansions, i.e. nonzero coefficients on all of the original variables in the linear combinations, which makes the results hard to interpret. In practice, the real principal components can often be more naturally represented as sparse vectors; in face recognition, for example, they may correspond to parts of faces.
• Incremental Principal Component Analysis (IPCA) - an incremental variant of PCA for when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation of the input data using an amount of memory that is independent of the number of input samples. It still depends on the number of input features, but changing the batch size lets you control memory usage.
• Fast Independent Component Analysis (FastICA) - fast independent component analysis is used to estimate and reconstruct sources from noisy measurements, since classical PCA does not work well with non-Gaussian processes.
• Linear Discriminant Analysis (LDA) - linear discriminant analysis, like classical PCA, is a linear transformation method. But PCA is unsupervised, i.e. it ignores class labels, whereas LDA is a supervised method used to separate two or more classes or groups. LDA is suitable for supervised dimensionality reduction: it projects the input data into a linear subspace spanned by the directions that maximize the separation between classes, and the output dimensionality is necessarily less than the number of classes. In scikit-learn, LDA is implemented as LinearDiscriminantAnalysis, where the n_components parameter specifies the number of components to return.
• Non-negative Matrix Factorization (NMF) - non-negative matrix factorization, an alternative decomposition approach that assumes the data and the components are non-negative. NMF is an unsupervised linear dimensionality reduction technique: the original data (the feature matrix) is factorized into several matrices representing the hidden relationships between observations and their characteristics. NMF can be plugged in instead of PCA when the data matrix contains no negative values. NMF does not report an explained variance the way PCA and other methods do, so the best way to find a good n_components is to try a range of values.
• Truncated Singular Value Decomposition (TSVD) - truncated singular value decomposition is similar to PCA: it performs linear dimensionality reduction by means of a truncated SVD. Unlike PCA, this estimator does not center the data before computing the decomposition and can therefore work efficiently with sparse matrices.
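All of these variants share the same transformer interface. A brief hedged sketch with arbitrary parameter choices, assuming the feature matrix X from the earlier example:
from sklearn.decomposition import KernelPCA, IncrementalPCA, TruncatedSVD

kpca = KernelPCA(n_components=2, kernel='rbf')         # non-linear PCA with an RBF kernel
X_kpca = kpca.fit_transform(X)

ipca = IncrementalPCA(n_components=2, batch_size=50)   # processes the data in batches
X_ipca = ipca.fit_transform(X)

tsvd = TruncatedSVD(n_components=2)                    # also works on sparse matrices
X_tsvd = tsvd.fit_transform(X)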
https://medium.com/@deepak.engg.phd/dimensionality-reduction-with-scikit-learn-ee5d2b69225b
🐱Visualizing Temporal Changes of Categorical Data: PyCatFlow vs RankFlow
Sometimes a Data Scientist needs to visualize ranked lists over time, such as changes in the search results for queries on Google or YouTube. For this you can use RankFlow - a useful tool with a minimalistic UI and a rather cumbersome data preparation process. RankFlow compares ranked lists over time: it requires the input tabular data to be organized so that each column represents one ranked list. Each ranked list can be supplemented with weights, adding another level of information to the data; for YouTube search results these could be views, upvotes, or the upvote ratio. Each column of the data table is rendered as a stack of nodes ordered by rank within that list, and identical nodes are connected between columns. This highlights continuity and change in the data and enables pattern analysis.
Building a RankFlow visualization from other kinds of data requires reshaping the dataset. If each column simply contains a list of items that is not ordered by any relevance metric, the ordering of the RankFlow chart becomes a design decision: items can be sorted alphabetically, by frequency in the dataset, or based on additional data.
In practice, adapting data to the required RankFlow structure is quite tedious. To speed up the pre- and post-processing of charts, you can write your own Python script that processes the XML data in the SVG file generated by RankFlow. An alternative is PyCatFlow, a visualization tool similar to RankFlow that works well for temporal data without explicit ranking information but with optional additional categorical data. PyCatFlow is an open-source Python package that can be downloaded freely from GitHub.
https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2
https://github.com/bumatic/PyCatFlow
💥Apache Spark on Google Colab? Installing PySpark on DS Cloud
Apache Spark is one of the most in-demand computational frameworks in the Big Data field. With its clustered architecture and in-memory MapReduce-style jobs, it quickly processes huge amounts of data. Spark has an API for Python, the most popular DS language - PySpark - and automatically parallelizes code written on the local machine across all nodes in the cluster. But what if you don't have a cluster and still need to process a lot of data?
Cloud solutions such as Google Colab can help: Spark can be installed right in a Colab notebook. First, install Java, because the framework is written in Scala and runs on the JVM:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Then download the Spark distribution itself, bundled with Hadoop, from the Apache Software Foundation servers:
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
Unpack the downloaded .tgz archive:
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
The next step is to install the findspark library to find Spark on the system and import it.
!pip install -q findspark
Then set the environment variables in Colab so that PySpark can find Java and Spark:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
To find Spark on the system, import findspark and use the findspark.init() method. The findspark.find() method will help you find out where Spark is installed:
import findspark
findspark.init()
findspark.find()
Finally, import SparkSession from pyspark.sql and create an entry point to the framework, optionally specifying the application name via the appName() builder method:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark Application Name") \
.getOrCreate()
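As a quick sanity check (not part of the original post), you can create a tiny DataFrame right away:
# create and show a small DataFrame to verify that the session works
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
df.show()
print(spark.version)   # should report the 3.1.2 distribution downloaded above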
https://medium.com/geekculture/how-to-get-your-spark-installation-right-every-time-on-colab-218d57b6091d
🥁Data engineering for Data Science: SynapseML by Microsoft
Microsoft has adapted Apache Spark for the tasks of data engineers and DS specialists by releasing SynapseML, a framework for building scalable ML pipelines. This open-source library was previously called MMLSpark. Built on top of SparkML, SynapseML brings to the Spark ecosystem deep learning and data analysis tools and seamless ML pipeline integration with Open Neural Network Exchange (ONNX), LightGBM, Cognitive Services, Vowpal Wabbit, and OpenCV. This lets you build powerful, highly scalable predictive and analytical models for a variety of data sources.
Notably, SynapseML can work with unlabeled datasets thanks to API methods backed by ready-made AI services for quickly solving typical ML tasks. SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+. The framework lets you write code in any Spark-compatible language: Python, Scala, R, Java, .NET, and C#. Over HTTP, users can embed any web service into their SparkML models, and the clustered nature of Spark lets ML projects scale.
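As a rough sketch of how SynapseML plugs into a regular SparkML workflow (assumptions: a Spark session with the SynapseML package already attached, and a Spark DataFrame train_df with a 'features' vector column and a 'label' column; the import path follows the renamed synapse.ml namespace):
from synapse.ml.lightgbm import LightGBMClassifier   # LightGBM exposed as a SparkML estimator

# train_df is assumed to already contain assembled features and labels
model = LightGBMClassifier(featuresCol="features", labelCol="label").fit(train_df)
predictions = model.transform(train_df)
predictions.select("label", "prediction").show(5)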
https://microsoft.github.io/SynapseML/
https://github.com/microsoft/SynapseML
☃️🎄❄️Top 10 most interesting DS conferences around the world in December 2021
1. 2 Dec – TechCrunch and iMerit ML DataOps Summit https://techcrunch.com/events/imerit-ml-dataops-summit/
2. 6-7 Dec – Scientific Data Analysis at Scale (SciDAS) Cloud Computing Workshop. Chapel Hill, NC, USA & Virtual https://renci.github.io/sbdh-scidas-workshop/
3. 6-10 Dec – The Analytics Engineering Conference https://coalesce.getdbt.com/
4. 7-10 Dec - IEEE ICDM 2021: 21st IEEE Int. Conference on Data Mining, Auckland, New Zealand https://icdm2021.auckland.ac.nz/
5. 7 Dec - Chief Data & Analytics Officer, Nordics - Think Tanks, by Corinium. Join The Nordic Region's Most Innovative Data & Analytics Leaders. Online https://cdao-nordics.coriniumintelligence.com/
6. 7-8 Dec - MENA Conversational AI Summit 2021, Virtual https://menaconversationalai.com/
7. 8 Dec - Data Points Summit | Manufacturing, Retail & CPG by Grid Dynamics. Online https://datapoints.griddynamics.com/
8. 9 Dec – Augment - Cloud Data Warehousing Summit, free virtual event https://hevodata.com/events/summit/how-to-migrate-setup-and-scale-a-cloud-data-warehouse
9. 14 Dec - Data Reliability Engineering Conference 2021, One full day of reliability standards and innovation, Free Registration, online https://drecon.org/
10. 15-18 Dec - IEEE Int. Conf. on Big Data (IEEE BigData 2021). Orlando, FL, USA https://bigdataieee.org/BigData2021/index.html
🙌🏻XLS-R is a new set of ML models from Facebook AI
The PyTorch team at Facebook has published XLS-R, a set of large-scale models for self-supervised cross-lingual speech representation learning based on wav2vec 2.0. The models were trained on more than 400,000 hours of unlabeled speech in 128 languages. After fine-tuning, the models show strong speech recognition quality and perform well on translation, speech understanding, and language identification tasks. The training data was drawn from a variety of sources, including audiobooks and court records. The XLS-R neural networks contain more than 2 billion parameters and are multilingual; moreover, testing showed that training on several languages at once improves the models' performance. You can download XLS-R from GitHub right now.
https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr
👁Visual Genome: one of the most densely annotated datasets
Scientists at Stanford University have assembled a richly annotated dataset of over 100,000 images. In total, the dataset contains almost 5.5 million object descriptions, attributes, and relationships. You don't even have to download the dataset: the data you need can be fetched from its RESTful API endpoints with simple GET requests. Although the latest updates to the dataset date back to 2017, it remains an excellent dataset for training models on typical ML tasks, from object recognition to working with scene graphs.
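A minimal sketch of pulling data over the API with the requests library (the endpoint path here is illustrative; see the API home page below for the actual routes):
import requests

# hypothetical example: fetch region descriptions for one image by its id
resp = requests.get("https://visualgenome.org/api/v0/images/1/regions")
resp.raise_for_status()
regions = resp.json()     # region descriptions returned as JSON
print(regions[:2])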
https://visualgenome.org/api/v0/api_home.html
How to read Parquet files
The Apache Parquet format is widely used in Big Data thanks to its column-wise storage and efficient compression: it lets you quickly read just the columns you need instead of scanning entire rows, which saves time. However, not every application can read binary Parquet files, so they often have to be converted to CSV or TXT before being opened in, for example, MS Excel or Power BI, which DS specialists use all the time. Here the Python pandas library helps with its built-in read_parquet() function for reading a Parquet file into a dataframe; the dataframe can then be saved to a CSV file with the to_csv() method and opened in almost any spreadsheet editor.
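A minimal sketch of this workaround (the file names are placeholders):
import pandas as pd

df = pd.read_parquet('data.parq')      # read the Parquet file into a DataFrame
df.to_csv('data.csv', index=False)     # save as CSV for Excel, Power BI, etc.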
https://medium.com/@i.m.mak/workaround-for-reading-parquet-files-in-power-bi-e2d060abcb80
💦What are CDC systems and why you need them
To reduce the amount of data read from a corporate warehouse or data lake while always staying on top of the latest changes, the CDC approach is used: Change Data Capture (also called Capture Changed Data). There are ready-made CDC tools - Oracle GoldenGate, Qlik Replicate, and HVR - which are best suited for pulling data from frequently updated relational DBMSs. Data engineers also build their own solutions (a sketch of the timestamp-based approach follows below):
• CDC based on timestamp columns that mark creation, update, and expiration in the source tables. Any process that inserts, updates, or deletes a row must also update the corresponding timestamp column, and hard deletes are not allowed. The drawbacks are that the database schema has to be redesigned to add the timestamp columns, and the source table becomes tightly coupled to the ETL process code.
• CDC based on a minus query, where a link is created between the source and the target sink and a MINUS SQL query is executed to compute the change log. This method is more of an anti-pattern: it only works if the source and destination databases are of the same type, and it also increases the amount of data moved.
Both handwritten CDC methods impose significant overhead on the source database. Dedicated CDC tools reduce the load on the network and the source by analyzing transaction logs to compute the changes. However, the main drawback of off-the-shelf CDC solutions is their high cost; in addition, the source DBMS administrator must grant the CDC tool privileged access to the database log, which security teams are often reluctant to allow.
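As an illustration of the timestamp-based approach, a hedged Python sketch (the table name orders, the column updated_at, and the connection string are all hypothetical placeholders):
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host/dbname")   # placeholder DSN

# watermark: the latest change timestamp already ingested on the previous run
last_watermark = pd.Timestamp("2021-11-01 00:00:00")

# pull only the rows created or updated since the last run
query = text("SELECT * FROM orders WHERE updated_at > :watermark")
changes = pd.read_sql(query, engine, params={"watermark": last_watermark})

# advance the watermark for the next incremental load
if not changes.empty:
    last_watermark = changes["updated_at"].max()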
https://towardsdatascience.com/change-data-capture-cdc-for-data-ingestion-ca81ff5934d2
💥Python and SQL combo with FugueSQL: a single SQL interface for Pandas, Spark and Dask dataframes
FugueSQL is an open-source Python library that lets you combine Python code with SQL, switching between them in a Jupyter Notebook or Python script. FugueSQL supports distributed computing and provides a unified API to run the same SQL code on Pandas, Dask, and Apache Spark.
Unlike pandasql, which relies on a single SQLite backend and therefore incurs a lot of overhead transferring data between Pandas and the database, FugueSQL supports multiple local backends: pandas, DuckDB, and SQLite.
With the pandas backend, Fugue translates SQL directly into pandas operations, so no data transfer is needed. DuckDB has excellent pandas support, so its data transfer overhead is negligible. Pandas and DuckDB are therefore the preferred FugueSQL backends for local data processing. Fugue also supports Spark, Dask, and cuDF (via blazingSQL) as backends.
FugueSQL code is parsed with ANTLR and mapped to equivalent functions in the Fugue API. FugueSQL has many features built in, and the code is extensible with Python. Out of the box it supports the most common operations: filling nulls, dropping nulls, renaming columns, changing the schema, and more. Fugue also adds some enhancements to standard SQL to handle end-to-end data workflows gracefully, for example assigning intermediate tables to variables.
On Pandas, the %%fsql cell magic uses the NativeExecutionEngine by default. On Dask, FugueSQL is slightly slower than the native engine but more complete in terms of implemented SQL keywords. FugueSQL also runs on Spark, mapping %%fsql operations to Spark and Spark SQL operations, which makes it quick to develop distributed applications: you create a local prototype with the NativeExecutionEngine, test it, and deploy it to a Spark cluster simply by changing the execution engine.
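A hedged sketch of running the same query on the local engine and then on Spark, based on the FugueSQL tutorials (exact API details such as the fsql entry point may differ between versions):
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 2, 3]})

query = """
SELECT col1, SUM(col2) AS total
FROM df
GROUP BY col1
PRINT
"""

fsql(query).run()            # local execution with the NativeExecutionEngine
# fsql(query).run(spark)     # the same query on a SparkSession, if one is available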
https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27
https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html
👻What is UMAP and why is it useful for Data Scientists?
UMAP (Uniform Manifold Approximation and Projection) is a general-purpose manifold learning and dimensionality reduction algorithm. It is designed to be compatible with scikit-learn, uses the same API, and can be added to sklearn pipelines. As a stochastic algorithm, UMAP uses randomization to speed up its approximation and optimization steps, which means that different UMAP runs may produce different results. Although UMAP is relatively stable and the differences between runs are usually small, they do exist; to guarantee exact reproducibility, UMAP lets the user fix the random state.
Since version 0.4, UMAP also supports multithreading to improve performance, and its optimization stage deliberately allows race conditions between threads. In the multithreaded case, the randomness of UMAP's output depends not only on the input random seed but also on race conditions during optimization, which cannot be controlled, so multithreaded UMAP results cannot be reproduced exactly.
UMAP can be used as an efficient preprocessing step to improve the performance of density-based clustering. But UMAP, like t-SNE, does not fully preserve density and can create false discontinuities in clusters. Compared to t-SNE, UMAP preserves more of the global structure, producing more meaningful clusters, and thanks to support for arbitrary embedding dimensions it is not limited to 2D or 3D output.
Because it relies heavily on nearest-neighbor search, UMAP can consume excessive memory on some datasets; setting low_memory to True switches to a slower but less memory-intensive approach to computing nearest neighbors. It is also important to know that, when run without a fixed random seed, UMAP uses a parallel Numba implementation that consumes as many CPU cores as are available; the number of threads Numba uses can be limited with the NUMBA_NUM_THREADS environment variable. Also, due to the nature of Numba, UMAP does not support 32-bit Windows.
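A minimal sketch of the scikit-learn-style API with the options mentioned above (the parameter values and dataset are arbitrary):
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

reducer = umap.UMAP(
    n_neighbors=15,      # size of the local neighborhood
    n_components=2,      # output dimensionality
    random_state=42,     # fixed seed: reproducible, but limits multithreading
    low_memory=True,     # slower but less memory-hungry nearest-neighbor search
)
embedding = reducer.fit_transform(X)
print(embedding.shape)   # (1797, 2)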
Despite some disadvantages, UMAP can be used in the following cases:
• exploratory data analysis (EDA);
• interactive visualization of the analysis results;
• processing of sparse matrices;
• detection of malicious programs based on behavioral data;
• preprocessing vectors of phrases for clustering;
• preprocessing of image embeddings (Inception) for clustering.
https://github.com/lmcinnes/umap
https://umap-learn.readthedocs.io/en/latest/index.html
🚀Speed up DS with big data: Pandas API right in Apache Spark
The popular computing framework Apache Spark lets you write programs in Python, the language familiar to every DS specialist. Starting with Spark 3.2, PySpark includes a pandas API that can be imported with just one line: import pyspark.pandas as ps.
This provides the following benefits:
• lowers the barrier to entry for Spark;
• unifies the codebase for small and big data, local machines and distributed clusters;
• speeds up Pandas code.
By the way, pandas on Spark is even faster than another popular Python engine, Dask!
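A minimal sketch of the pandas-style API running on Spark (assuming a Spark 3.2+ installation; the data is a toy example):
import pyspark.pandas as ps

psdf = ps.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
print(psdf.groupby("group").sum())   # pandas-style syntax, executed by Spark

# convert to a regular PySpark DataFrame when the native Spark API is needed
sdf = psdf.to_spark()
sdf.show()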
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
👑Introducing useful DS tools: meet Streamlit
Streamlit is an open-source Python library that makes it easy to create and publish beautiful custom web applications for ML and DS: you can build and deploy a working app in a couple of minutes. Comparing Streamlit to Dash is a bit like comparing Python to C#. Streamlit lets you build data web apps in pure Python, often in just a few lines of code; for example, there are one-line commands for displaying interactive Plotly, Bokeh, and Altair visualizations, Pandas DataFrames, and more. Streamlit is backed by a large open-source developer community, and you can add your own components to the library using JavaScript. Cloud hosting of Streamlit apps is open to everyone: you can create and host up to three applications for free.
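A minimal sketch of a Streamlit app (save it as app.py and launch it with streamlit run app.py; the data is a toy example):
import pandas as pd
import streamlit as st

st.title("Tiny demo app")

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})

n = st.slider("How many rows to show", 1, 10, 5)   # interactive widget
st.dataframe(df.head(n))                           # render a DataFrame
st.line_chart(df.set_index("x"))                   # one-line chart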
https://streamlit.io/
☀️Meet Gallia: a new library for data transformation
This schema-aware Scala library comes in handy for practical data transformation, including ETL processes, feature engineering, HTTP responses, and more. Highly scalable, it is designed to bridge the gap between Pandas and Spark SQL. Gallia is useful both for those who appreciate Scala's powerful type system and for those who find overly fancy SQL queries hard to read. Essentially, Gallia implements a one-stop-shop paradigm: most or all of your data transformation needs are covered in a single application. The library supports all kinds of data manipulation, from aggregations to pivot tables, including processing individual and nested objects rather than just collections. For scaling, Gallia integrates cleanly with the Spark RDD API.
https://cros-anthony.medium.com/gallia-a-library-for-data-transformation-3fafaaa2d8b9
https://github.com/galliaproject/gallia-core/blob/master/README.md
👻4 simple tips for effective data engineering
To prevent data engineering projects with hundreds of artifacts - dependency files, jobs, unit tests, shell files, and Jupyter notebooks - from descending into chaos, follow these guidelines:
• manage dependencies, for example through a dependency manager like Poetry
• remember about unit tests - introducing unit tests into the project will save you from trouble and improve the quality of your code
• divide and conquer - store all data transformations in a separate module
• document your code and the business problem it solves, both to remember it yourself and to share knowledge with colleagues
https://blog.devgenius.io/keeping-your-data-pipelines-organized-fa387247d59e
👣AutoML and more with PyCaret
PyCaret is an open-source, low-code AutoML library in Python that automates most MLOps tasks. PyCaret has dedicated functions for analyzing, deploying, and ensembling models that many other ML frameworks lack, and it lets you go from data preparation to a deployed ML model in minutes, in whatever development environment you prefer.
Under the hood, PyCaret is a Python wrapper around several libraries and ML frameworks: scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and others. PyCaret's simplicity means it can be used not only by experienced DS specialists but also by less technical users, who can solve fairly complex analytical tasks with simple code. The library is free to download and use under the MIT license. The package contains several modules whose functions are grouped by the main use cases, from plain classification to NLP and anomaly detection.
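A minimal sketch of the classification module, following the pattern from PyCaret's own tutorials (the bundled 'diabetes' sample dataset and its 'Class variable' target are used purely for illustration):
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, predict_model

data = get_data("diabetes")                                   # sample dataset shipped with PyCaret
s = setup(data=data, target="Class variable", session_id=42)  # preprocessing in one call
best = compare_models()                                       # train and rank many models
predictions = predict_model(best, data=data)                  # score data with the best model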
https://pycaret.org/
https://github.com/pycaret/pycaret
🐻❄️On the eve of the New Year, speeding up DS: meet Polars
Polars is a fast data preparation library for ML modeling, available for Python and Rust. In benchmarks it runs up to 15 times faster than Pandas, parallelizing dataframe processing and in-memory queries. Written in Rust, Polars uses all of the machine's cores and is optimized specifically for data processing workloads, with first-class Python support. Its rich API lets you not only work with huge amounts of data at the preparation stage but also build working pipelines. Benchmark comparisons show Polars ahead not only of Pandas but also of other tools, including computing engines popular in Big Data such as Apache Spark and Dask.
Installing and trying Polars is very easy with the pip package manager:
pip install polars
import polars as pl
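A few lines to get a feel for the expression-based API (a toy DataFrame for illustration):
df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# expression-based API: filter rows and aggregate a column
filtered = df.filter(pl.col("value") > 1)
total = filtered.select(pl.col("value").sum())
print(filtered)
print(total)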
https://www.pola.rs/
https://betterprogramming.pub/this-library-is-15-times-faster-than-pandas-7e49c0a17adc