💥3 advantages and a couple of limitations of PySpark over Pandas
Although the Python library Pandas is very popular among junior Data Scientists, it is not designed for really big data: datasets of 10-12 GB are already a stretch. There are tools that parallelize Pandas, such as Dask, Swifter and Ray, but they only speed the library up without addressing the root cause of the problem, because Pandas always loads the entire dataframe into memory.
Therefore, dealing with large amounts of data requires something more scalable, such as PySpark - the Python API for Apache Spark, a distributed computing framework. Being distributed by nature, Spark also uses lazy (deferred) evaluation, performing data operations not when they are declared but when an action is called. This greatly reduces memory pressure. Pandas sticks to eager execution, running each task as soon as it is issued, while PySpark follows lazy execution, deferring work until an action is triggered (see the sketch below). PySpark is great for developing scalable applications and provides fault tolerance.
However, because of the network overhead of distributed computing, PySpark has higher latency than local Pandas, which affects application throughput. In addition, PySpark is demanding on RAM, since its MapReduce-style tasks keep intermediate results in memory instead of writing them to disk, unlike classical Hadoop computations. Finally, PySpark offers noticeably fewer Data Science-specific algorithms.
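A minimal sketch of the difference in execution models (assuming a local Spark installation and a hypothetical data.csv file with an amount column): Pandas materializes the whole file immediately, while PySpark only builds an execution plan until an action such as count() is called.
import pandas as pd
from pyspark.sql import SparkSession

# Pandas: eager evaluation, the whole file is loaded into memory right here
pdf = pd.read_csv("data.csv")
small = pdf[pdf["amount"] > 100]  # executed immediately

# PySpark: lazy evaluation, read and filter only build a query plan, nothing runs yet
spark = SparkSession.builder.appName("demo").getOrCreate()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered = sdf.filter(sdf["amount"] > 100)  # still just a plan
print(filtered.count())  # the action triggers the actual distributed computation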
🌨🍂Cool DS-events in November 2022 all over the world
Nov 1 • StreamSets Roadshow: SF Bay Area • San Mateo, CA, USA https://streamsets.com/roadshow/san-francisco/
Nov 1-3 • ODSC West 2022 • San Francisco, CA, USA https://odsc.com/california
Nov 2 • Modern Data Stack Roadshow • Denver, CO, USA https://www.moderndatastackroadshow.com/
Nov 2 • AiX Summit 2022 • San Francisco, CA, USA + Virtual https://odsc.com/california/aix-west/
Nov 2-3 • The AI Summit Austin • Austin, TX, USA https://austin.appliedintelligence.live/welcome
Nov 3 • StreamSets Roadshow: New York • New York, NY, USA https://streamsets.com/roadshow/new-york/
Nov 8 • The Data Science Symposium 2022, Presented by the UC Center for Business Analytics • Cincinnati, OH, USA https://web.cvent.com/event/36b0c2f9-5cfa-4f4c-875d-66c7f1ee2ed5/summary
Nov 8 • iMerit ML DataOps Summit 2022 • Virtual https://techcrunch.com/events/imerit-dataops-summit-2022/
Nov 9-10 • RE•WORK MLOps Summit • London, UK • https://london-ml-ops.re-work.co/
Nov 9-10 • Deep Learning Summit • Toronto, ON, Canada https://toronto-dl.re-work.co/
Nov 10 • Unleashing Hybrid and Multi-Cloud Data Science at Scale • Virtual https://www.dominodatalab.com/resources/unleashing-hybrid-multi-cloud-mlops
Nov 14-16 • Marketing Analytics & Data Science (MADS) Conference • San Antonio, TX, USA • https://informaconnect.com/marketing-analytics-data-science/
Nov 15-16 • Snowflake Build ‘22 • Virtual https://www.snowflake.com/build/
Nov 16-17 • 3rd Edition of Future Data Centres and Cloud Infrastructures • Dubai, UAE https://www.futuredatacentre.com/
Nov 16-17 • Smart Data Summit Plus 2022 • Dubai, UAE https://www.bigdata-me.com/
Nov 17-18 • Data Science Summit 2022 • Warsaw, Poland + Virtual https://dssconf.pl/
Nov 22 • trade/off: The Decision Intelligence Summit • Berlin, Germany https://www.tradeoff.ai/
Nov 22-23 • Nordic Data Science and Machine Learning Summit • Stockholm, Sweden + Virtual https://ndsmlsummit.com/
🚀Python 3.11.0: Major New Features for the Developer
On October 24, 2022, Python 3.11.0 was released. Among its new features and bug fixes:
• fixed multiplication of a list by an integer (list *= int), which could cause an integer overflow when the newly allocated length is close to the maximum size;
• changed the forkserver start method: on Linux the multiprocessing module again uses file-system-backed unix domain sockets for the forkserver process instead of the Linux abstract socket namespace. Abstract sockets have no permissions and could let any user in the same network namespace (often the whole system) inject code into the forkserver process; this was a significant privilege-escalation vulnerability and is now fixed;
• fixed an issue where several frame objects could be backed by the same interpreter frame, leading to memory corruption and hard interpreter crashes;
• fixed possible data corruption or crashes when accessing the f_back attribute of newly created generator or coroutine frames;
• fixed an interpreter crash when calling PyEval_GetFrame() while the topmost Python frame is in a partially initialized state;
• fixed command-line parsing: -X int_max_str_digits with no value (invalid) is now rejected even when the PYTHONINTMAXSTRDIGITS environment variable is set to a valid value;
• fixed undefined behavior in _testcapimodule.c;
• updated the bundled pip and setuptools to versions 22.3 and 65.5.0, respectively;
• the asyncio.Task.cancel("message") method, previously declared deprecated, works again and is no longer deprecated;
• semaphores now work faster, which matters for highly concurrent programs;
• fixed flags that use the CONFORM boundary, allowing flags to be combined with unknown values;
• on Windows, when the Python test suite is run with the -jN option, the temporary stdout file now uses ANSI encoding instead of UTF-8;
• fixed a bug where multiprocessing launched from a virtual environment spawned unnecessary child processes on Windows;
• fixed the py.exe launcher's handling of the -V:<company>/ option when defaults are set in environment variables or a configuration file;
• the macOS 13 SDK includes support for the mkfifoat and mknodat system calls; previously, using the dir_fd option with os.mkfifo() or os.mknod() could segfault if CPython was built with the macOS 13 SDK but run on an earlier version of macOS.
https://www.python.org/downloads/release/python-3110/
https://docs.python.org/release/3.11.0/whatsnew/changelog.html#python-3-11-0-final
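As a quick illustration of the restored cancellation message on asyncio.Task.cancel(), here is a minimal sketch (the task and the message text are made up for the example):
import asyncio

async def worker():
    try:
        await asyncio.sleep(3600)  # simulate long-running work
    except asyncio.CancelledError as exc:
        print("cancelled with message:", exc.args)  # the message arrives via the exception
        raise  # re-raise so the task is properly marked as cancelled

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0.1)      # give the worker a chance to start
    task.cancel("shutting down")  # the optional cancellation message
    try:
        await task
    except asyncio.CancelledError:
        print("task cancelled")

asyncio.run(main())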
🌞Easy prediction with Lazy Predict
Lazy Predict is a Python library for comparing the performance of different machine learning models on a dataset. In essence it is a wrapper that lets you fit dozens of ML models to a dataset and compare their performance in just a couple of lines of code. Lazy Predict handles both regression and classification and reports the most important metrics for each model: R-Squared and RMSE for regression, Accuracy and F1 Score for classification, plus training time.
The library is open source and highly compatible with sklearn, numpy, and the other Python libraries used in Data Science.
https://lazypredict.readthedocs.io/en/stable/readme.html
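A minimal sketch of a classification comparison (the toy sklearn dataset and split parameters are just illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

# load a toy dataset and split it
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# fit a few dozen sklearn classifiers and collect their metrics in one table
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)  # DataFrame of models ranked by accuracy, with F1 score and training time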
🐍Python instead of Cypher in Neo4j with Py2neo
Data in the Neo4j graph database is manipulated with Cypher, an SQL-like query language. It is graph-optimized: it identifies and exploits data relationships, traversing them in every direction to uncover previously unseen connections and clusters. However, Python gives you far more flexibility in working with data, so many Data Scientists prefer it for all kinds of programs, including automating the creation of nodes and relationships.
Use the Py2neo library, a Python package for working with Neo4j. It can be installed with the pip package manager via the familiar pip install py2neo command. After that, open your favorite Python editor and start working with the graph in Neo4j.
Py2neo includes a set of tools for working with Neo4j from Python applications and from the command line. The library supports both Bolt and HTTP and provides a high-level API, an OGM, administration tools, an interactive console, a Cypher lexer for Pygments, and many other features handy for graph analysis. Starting with version 2021.1, Py2neo has full support for routing in a Neo4j cluster, which can be enabled with a neo4j://... URI or by passing routing=True to the Graph class constructor.
https://py2neo.org/2021.1/
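A minimal sketch of creating a couple of nodes and a relationship from Python (the URI and credentials are placeholders for your own Neo4j instance):
from py2neo import Graph, Node, Relationship

# connect to a running Neo4j instance; adjust the URI and credentials to your setup
graph = Graph("neo4j://localhost:7687", auth=("neo4j", "password"))

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
knows = Relationship(alice, "KNOWS", bob)
graph.create(alice | bob | knows)  # push the whole subgraph in one call

# you can still run Cypher when convenient and get the result as a list of dicts
print(graph.run("MATCH (p:Person)-[:KNOWS]->(q:Person) RETURN p.name, q.name").data())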
👀TOP 5 Open-Source Data Lineage Tools
Data lineage allows a company to comply with regulatory requirements, better understand and trust its data, and save time on manually analyzing the impact of data changes. There are many lineage tools on the market today, both open source and proprietary. Among the open-source ones, the most notable are:
• Tokern lets users capture column-level lineage from databases and data warehouses such as Google BigQuery, AWS Redshift, and Snowflake. Tokern also integrates well with most open-source data catalogs and ETL frameworks, and it can build lineage from query history or ETL scripts, giving a bird's-eye view for BI and ETL tools. Tokern uses PostgreSQL as its local storage and NetworkX for fast analysis of lineage graphs, so the collected column-level lineage can be manipulated, visualized, and analyzed with familiar libraries; users can also interact with lineage data through the Tokern SDK or API. Finally, Tokern detects PII (personally identifiable information) and PHI (personal health information) with PIICatcher, which combines regular expressions with the Spacy and Stanford NER NLP libraries.
• Egeria - an open-source metadata standard that enables seamless integration of data tools around a robust, unified representation of metadata. Egeria lets you build better solutions for data lineage, data quality checks, PII identification, and more, in addition to cataloging and metadata retrieval. Egeria builds on the OpenLineage standard for collecting and storing lineage data, which gives users a more complete view of their data by providing horizontal and vertical lineage and tracing. To capture lineage information, Egeria listens to Kafka events emitted by the source systems.
• Pachyderm is a data versioning and lineage tool that lets developers build machine learning pipelines in any language and framework on top of cloud object storage. Like LakeFS or Git, it versions data, recording changes as commits and keeping a complete, immutable audit trail. Pachyderm uses a central repository backed by object storage in its custom Pachyderm File System to track data provenance and versions, and it assigns global identifiers to lineage events and data objects. Pachyderm can render the resulting immutable lineage graph as a DAG in its user interface, which is especially useful when working with ML pipelines. Pachyderm integrates well with many databases, warehouses and data lakes, so many companies use it for MLOps, ETL of unstructured data, and NLP workflows.
• OpenLineage is a Linux Foundation project supported by major ETL platforms, data orchestration engines, metadata catalogs, data quality frameworks, and data lineage tools. OpenLineage uses JSONSchema for its API definition and supports multiple languages and platforms. The previously mentioned Egeria builds its core metadata layer on top of OpenLineage. WeWork's Marquez, the reference implementation of OpenLineage, provides a UI, a metadata repository, an API for collecting metadata, querying via GraphQL, and a REST API.
• TrueDat is a comprehensive data management solution that allows you to efficiently classify, search and evaluate data, as well as visualize the entire data lifecycle. The tool was created in 2017 by BlueTab, now part of IBM, and is still actively developed.
https://blog.devgenius.io/5-best-open-source-data-lineage-tools-in-2022-f8ef39a7d5f6
🚀How to scale up Pandas with the Pandarallel library
Every Data Scientist knows that the Pandas Python library is quite slow and does not cope well with large amounts of data. And yet every Data Scientist uses it. 🤷♀️ To make Pandas faster, you can add Pandarallel to your project, a simple and convenient tool for parallelizing Pandas operations across all available CPU cores.
Pandas uses only a single core, while Pandarallel lets you take advantage of the whole multi-core machine. Pandarallel also offers progress bars, both in notebooks and in the terminal, so you can get a rough idea of how much computation remains.
The library can be used on any computer running Linux or macOS; on Windows there is one quirk: because of how multiprocessing works there, the function passed to Pandarallel must be self-contained and must not depend on external resources.
https://nalepae.github.io/pandarallel/
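A minimal sketch of the drop-in replacement (the DataFrame and the row function are made up for the example):
import numpy as np
import pandas as pd
from pandarallel import pandarallel

# initialize once per session; progress_bar=True shows per-worker progress bars
pandarallel.initialize(progress_bar=True)

df = pd.DataFrame({"a": np.random.rand(1_000_000), "b": np.random.rand(1_000_000)})

def slow_feature(row):
    return np.sin(row.a) ** 2 + np.cos(row.b) ** 2

# drop-in replacement: df.apply(...) becomes df.parallel_apply(...)
result = df.parallel_apply(slow_feature, axis=1)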
#test
Why is NumPy faster than Pandas?
Anonymous quiz results:
• 27%: NumPy is not faster than Pandas, they are the same
• 55%: NumPy uses faster data structures and algorithms to deal with them
• 4%: NumPy is not faster than Pandas, Pandas is faster
• 14%: NumPy uses one-dimensional arrays, Pandas uses multi-dimensional arrays
🤷♀️Not all missing data is equal
Missing data is a problem that comes up constantly in Data Science and Machine Learning. There are many reasons why data may be missing, depending on the type of data and how it was collected. But not all missing data is the same; it can be broken down into the following categories:
• missing because collection was impossible or too expensive;
• structurally missing data that cannot be obtained because of the nature of the objects under study;
• missing due to random failures;
• missing for non-obvious or unknown reasons.
Depending on why data is missing from a dataset, the problem can be mitigated or even eliminated. For example, structurally missing data suggests restructuring the questions asked about the object of study so that they can actually be answered. Instead of collecting rare or expensive data, you can choose proxy metrics that still support the decision. Constantly recurring failures need to be fixed, and non-obvious gaps call for learning more about why the expected data never arrived.
https://medium.com/@nahmed3536/types-of-missing-data-e718e6ac2a55
🔥Data validation in a Python script with the pydantic library
This lightweight library lets you validate data and manage settings using Python type annotations, enforcing type hints at runtime. The library raises clear errors when the data is invalid. It is enough to declare what the data should look like in pure, canonical Python and let pydantic check it.
To be fair, pydantic is primarily a parsing library, not a validation library: it guarantees the types and constraints of the output model, not of the input data. Still, even though validation is not pydantic's main purpose, it can be used for it.
The main way to define objects in pydantic is via models: classes that inherit from the BaseModel base class. You can think of models as the data types of a strongly typed language. Untrusted data can be passed to a model, and after parsing and validation pydantic guarantees that the fields of the resulting model instance match the declared field types.
The library plays well with any IDE and is very fast: a large part of it is compiled with Cython. Pydantic validates complex data structures using recursive models and lets you add custom checks with validator decorators that parse and validate input data.
Notably, this open-source project is used by many companies and products, including FastAPI, Jupyter, Microsoft, AWS, Uber, etc. You can try it right now: install it with the pip package manager and then import its base class BaseModel into your Python script:
pip install pydantic
from pydantic import BaseModel
https://pydantic-docs.helpmanual.io/
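A minimal sketch of parsing untrusted input into a typed model (the field names and values are illustrative, in the spirit of the official docs):
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str = "John Doe"                # default value
    signup_ts: Optional[datetime] = None  # optional field
    friends: List[int] = []

# untrusted input: strings and bytes are parsed and coerced into the declared types
external_data = {"id": "123", "signup_ts": "2022-11-01 12:22", "friends": [1, "2", b"3"]}
user = User(**external_data)
print(user.id, user.friends)  # 123 [1, 2, 3]

try:
    User(signup_ts="broken")  # missing id, invalid timestamp
except ValidationError as e:
    print(e.json())           # structured list of validation errors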
👍🏻Need beautiful graphs in Jupyter Notebook? It’s easy with PivotTable.js!
PivotTable.js is a JavaScript implementation of open-source pivot tables with drag-and-drop functionality. The project is distributed under the MIT license, and its Jupyter integration is installed through the pip package manager:
pip install pivottablejs
The library lets you quickly and conveniently visualize the statistics of a dataset by simply dragging and dropping the fields you need.
https://pypi.org/project/pivottablejs/
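A minimal sketch of using it in a Jupyter Notebook (the CSV file name is a placeholder; any pandas DataFrame will do):
import pandas as pd
from pivottablejs import pivot_ui

df = pd.read_csv("mps.csv")  # placeholder dataset; any DataFrame works
# opens an interactive drag-and-drop pivot table (with charts) right in the notebook
pivot_ui(df)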
Yandex named the laureates of its annual scientific award
Scientists doing research in computer science will receive one million rubles each to develop their projects. In 2022, six young scientists became laureates:
• Maxim Velikanov — works on deep learning theory, studying infinitely wide neural networks and statistical physics;
• Petr Mokrov — studies Wasserstein gradient flows, nonlinear filtering and Bayesian logistic regression;
• Maxim Kodryan — works on deep learning, as well as optimization and generalization of neural network models;
• Ruslan Rakhimov — works on neural rendering, CV and deep learning;
• Sergey Samsonov — studies Monte Carlo algorithms with Markov chains, stochastic approximation and other topics;
• Taras Hahulin — works in the field of computer vision.
It is also nice that scientific supervisors are recognized separately. This year two of them received grants — Dmitry Vetrov, Head of the HSE Center for Deep Learning and Bayesian Methods, and Alexey Naumov, Associate Professor at the HSE Faculty of Computer Science and Head of the International Laboratory of Stochastic Algorithms and High-Dimensional Data Analysis.
More information about the award and the 2022 laureates is available on the Yandex ML Prize website.
On December 3, Sberbank is holding a One Day Offer for Data Scientists, Data Analysts and Data Engineers. Pass all the selection stages in a single day and get an offer from the largest bank in the country!
👨🎓We are looking for specialists in AI, ML, RecSys, CV and NLP.
Our team creates information products for data-driven decision-making built on analytics, machine learning and artificial intelligence.
👉 You will:
- solve classification, regression and uplift-modeling tasks;
- support the rollout of models to production;
- analyze and monitor model quality;
- calculate CLTV and unit economics;
- interact with the validation and finance departments on assessing model quality and financial results.
Data on more than 1 billion transactions daily, 75 PB of information, 100 TB of memory and over 7,200 CPU cores in sandboxes will be available for your work.
Become a part of the bank's AI community!
✍️ Send a request for participation
🌲TOP-5 DS-events in December 2022:
1. Nov 28-Dec 9 • NeurIPS 2022 https://nips.cc/
2. Dec 5-6 • 7th Global Conference on Data Science and Machine Learning, Dubai,UAE https://datascience.pulsusconference.com/
3. Dec 7 • Data Science Salon NYC | AI and ML in Finance & Technology • New York, NY, USA https://www.datascience.salon/newyork/
4. Dec 12-16 • The 20th Australasian Data Mining Conference 2022 (AUSDM’22) • Sydney, Australia + Virtual https://ausdm22.ausdm.org/
5. Dec 17-18 • 3rd International Conference on Data Science and Cloud Computing (DSCC 2022), Dubai, UAE https://cse2022.org/dscc/index
💥3 Useful Python Libraries for Data Science and More
Let’s introduce 3 libraries that can be useful in Data Science tasks:
• Fabric is a high-level Python library (2.7, 3.4+) for executing shell commands remotely over SSH to get useful Python objects. It is based on the Invoke API (execution of subprocess commands and command line functions) and Paramiko (an implementation of the SSH protocol). https://github.com/fabric/fabric
• TextDistance is a library for comparing the distance between two or more sequences using over 30 algorithms. Useful in NLP tasks to determine the distance and similarity between sequences. https://github.com/life4/textdistance
• Watchdog - Python API (3.6+) and shell utilities for monitoring file system events. Useful for monitoring directories specified as command line arguments, allows you to log generated events. https://github.com/gorakhargosh/watchdog
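For instance, a minimal TextDistance sketch (the strings and token lists are arbitrary examples):
import textdistance

# edit-based distance and its normalized similarity between two strings
print(textdistance.levenshtein("kitten", "sitting"))                        # 3
print(textdistance.levenshtein.normalized_similarity("kitten", "sitting"))  # ~0.57

# token-based similarity, handy for comparing short documents in NLP tasks
a = "machine learning with python".split()
b = "deep learning in python".split()
print(textdistance.jaccard(a, b))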
🤔PyPy vs CPython: Under the Hood of Python
Every Python developer knows about CPython, the most common implementation of the virtual machine that interprets Python code. As an alternative to CPython there is PyPy, built with the RPython language. Compared to CPython, PyPy is faster and currently implements Python 2.7.18, 3.9.15 and 3.8.15. PyPy supports most of the commonly used Python library modules. The x86 version of PyPy runs on several platforms, such as Linux (32/64-bit), macOS (64-bit), Windows (32-bit), OpenBSD and FreeBSD; non-x86 builds are available on Linux, and ARM64 on macOS.
However, PyPy does not speed up code in short-running processes that take less than a couple of seconds: the JIT compiler will not have enough time to "warm up". PyPy also gives no speed gain if all the execution time is spent in runtime libraries (i.e. in C functions) rather than in executing Python code itself. Therefore, PyPy works best for long-running programs in which most of the time is spent executing Python code.
In terms of memory consumption PyPy can also beat CPython: Python programs with a large RAM footprint (hundreds of MB or more) may end up taking less space under PyPy than under CPython.
https://www.pypy.org/features.html
🚀Improve the quality and runtime of your Python code with Refurb
Every Data Scientist knows that Python is an interpreted language. Interpreted code is always slower than code compiled to machine instructions, because every interpreted instruction takes longer to execute. The way you write your Python code therefore strongly affects how fast it runs: good code structure and idiomatic use of the language speed Python code up. The Refurb library helps improve the quality of Python code: it can suggest upgrades and modernizations for your code with a single command. The library is inspired by clippy, Rust's built-in linter.
Just install it via the pip package manager:
pip install refurb
Then run it on your files and use its hints to improve the quality and speed of your Python code.
https://github.com/dosisod/refurb
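As a rough illustration (the exact check names and wording are refurb's own and may differ between versions), refurb flags idioms that newer Python already covers more concisely:
# before.py: the kind of code refurb would flag
with open("config.txt") as f:
    contents = f.read()

# after applying the suggestion: the modern pathlib one-liner
from pathlib import Path
contents = Path("config.txt").read_text()

# running the linter on a file is a single command:
#   refurb before.py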
🤔SQLite as an embedded database in Python
SQLite is an embedded relational file-based database management system (RDBMS) that can be used in Python applications without installing additional software. It is enough to import the sqlite3 built-in Python library to use SQLite.
First, create a database connection: import sqlite3 and call the .connect() method with the name of the database file to create, e.g. new_database.db.
import sqlite3
conn = sqlite3.connect('new_database.db')
Before creating a table, you need to create a cursor - the object used to execute SQL queries over the connection. It is obtained by calling the .cursor() method on the connection you created.
c = conn.cursor()
You can then use the .execute() method to create a new table in the database. Inside the quotes goes the usual SQL syntax for creating a table in most DBMSs, for example a CREATE TABLE statement:
c.execute("""CREATE TABLE new_table (
name TEXT,
weight INTEGER)""")
After filling the table with data, you can execute standard SQL queries on it to select and change values.
https://towardsdatascience.com/yes-python-has-a-built-in-database-heres-how-to-use-it-b3c033f172d3
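For completeness, a minimal sketch of filling the table and reading it back (reusing the c cursor and conn connection from above; the rows are made up):
# insert a few rows; "?" placeholders guard against SQL injection
rows = [("apple", 150), ("banana", 120)]
c.executemany("INSERT INTO new_table (name, weight) VALUES (?, ?)", rows)
conn.commit()  # persist the changes to the .db file

# query the data back
for name, weight in c.execute("SELECT name, weight FROM new_table WHERE weight > ?", (100,)):
    print(name, weight)

conn.close()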
🤔TOP 4 risks of embeddings in ML
Embedding models are widely used in machine learning to translate raw input data into a low-dimensional vector that captures its semantic meaning and can be fed into downstream models. Pretrained embeddings are used as feature extractors to build features from various input types (text, image, audio, video, multimodal) or from high-cardinality categorical features.
The main risks of embeddings are:
• High entanglement - changing the output of the upstream embedding model affects the performance of the downstream models; every model that relies on the embedding output has to be retrained.
• Hidden feedback loops. Pretrained embedding features are often used as a black box, but knowing what raw data the embedding was trained on is very important for model quality and interpretability.
• High cost of real-time inference (storage and serving). This directly affects the return on investment in ML: it is important to account for quality degradation during embedding-service outages, the cost of training, and the cost of serving per request.
• High cost of debugging - debugging and monitoring embedding features and root-causing their failures is very expensive, so embedding features that do not contribute much to the model should be dropped.
https://medium.com/better-ml/embeddings-the-high-interest-credit-card-of-feature-engineering-414c00cb82e1
💥Not only Copilot: try Codex by OpenAI
OpenAI has released the Codex models, based on GPT-3, which can interpret and generate code. Their training data contains both natural language and billions of lines of public code from GitHub. These neural network models are most proficient in Python and know more than a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell. Codex is useful for tasks such as:
• generating code from comments;
• completing the next statement or function in context;
• finding a useful library or API call for an application;
• adding comments;
• refactoring code to make it more efficient.
https://beta.openai.com/docs/guides/code
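At the time of writing, Codex is exposed through the OpenAI completions API. A minimal sketch (the model name and prompt are illustrative, and you need your own API key):
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, use your own key

# ask a Codex model to turn a docstring into a Python function body
response = openai.Completion.create(
    model="code-davinci-002",
    prompt='"""Return the n-th Fibonacci number."""\ndef fibonacci(n):',
    max_tokens=128,
    temperature=0,
)
print(response["choices"][0]["text"])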