💥Instead of loops: 3 Python life hacks
Developers and data scientists know that plain loops in Python are slow. Instead, consider these alternatives:
• Map – applies a function to each value of an iterable (list, tuple, etc.).
• Filter – filters values out of an iterable (list, tuple, set, etc.). The filtering condition is defined in a function that is passed as an argument to filter.
• Reduce – works a bit differently from map and filter: it applies a function iteratively across all values of the iterable and returns a single value.
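A quick sketch of all three on a small list (map and filter return lazy iterators in Python 3, so they are wrapped in list; reduce lives in functools):
from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x ** 2, numbers))       # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)   # 15

print(squares, evens, total)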
Examples: https://medium.com/codex/3-most-efficient-yet-underutilized-functions-in-python-d865ffaca0bb
Medium
Don’t Run Loops in Python, Instead, Use These!
No need to run loops in Python anymore
👍4
🤔Python library for calendar operations
Python includes a built-in calendar module with operations related to dates and days of the week. By default, its functions and classes use the European convention, where Monday is the first day of the week and Sunday is the last.
To use it, first import the module in your code:
import calendar
You can then call its functions, for example, to print the month names as a list:
month_names = list(calendar.month_name[1:])
print(month_names)
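A few more things the module can do, as a quick sketch (the dates are arbitrary examples):
import calendar

print(calendar.weekday(2022, 9, 1))   # 3, i.e. Thursday (Monday is 0)
print(calendar.isleap(2024))          # True
print(calendar.month(2022, 9))        # plain-text calendar for September 2022

# switch the first day of the week to Sunday if you prefer the US convention
cal = calendar.Calendar(firstweekday=calendar.SUNDAY)
print(list(cal.iterweekdays()))       # [6, 0, 1, 2, 3, 4, 5]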
https://docs.python.org/3/library/calendar.html
👍2❤1
#test
The probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true is called
Anonymous Quiz
59%
power of a binary hypothesis test
38%
type II error
3%
confusion matrix
0%
random value
👍3
🗒Need to log Python application events? There is a special module!
The Python standard library module logging (https://docs.python.org/3/library/logging.html) defines functions and classes that implement a flexible event logging system for applications and libraries. The main advantage of using the standard logging API is that all events go through one system, so your application log can show your own messages alongside messages from third-party modules.
The module is built around the following classes:
• Loggers expose the interface that application code uses directly.
• Handlers send log records (created by loggers) to the appropriate destination.
• Filters provide finer-grained control over which log records to output.
• Formatters specify the layout of log records in the final output.
The log level indicates severity, i.e. how important an individual message is. Among the built-in levels (DEBUG, INFO, WARNING, ERROR, CRITICAL), DEBUG has the lowest priority and CRITICAL the highest. If we configure a logger with the DEBUG level, all messages will be recorded, since DEBUG is the lowest level; alternatively, you can configure handlers to keep only ERROR and CRITICAL events.
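A minimal sketch of such a setup, with an arbitrary logger name and file path, routing INFO and above to the console and only ERROR and above to a file:
import logging

logger = logging.getLogger("my_app")            # "my_app" is an arbitrary example name
logger.setLevel(logging.DEBUG)

formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")

console = logging.StreamHandler()
console.setLevel(logging.INFO)                  # console shows INFO and above
console.setFormatter(formatter)

file_handler = logging.FileHandler("app.log")   # example log file path
file_handler.setLevel(logging.ERROR)            # file keeps only ERROR and CRITICAL
file_handler.setFormatter(formatter)

logger.addHandler(console)
logger.addHandler(file_handler)

logger.debug("dropped by both handlers")
logger.info("shown on the console only")
logger.error("shown on the console and written to app.log")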
More code examples: https://medium.com/@DavidElvis/logging-for-ml-systems-1b055005c2c2
Medium
Logging for ML Systems
Logging is the process of tracking and recording key events that occur in our applications. We want to log events so we can use them to…
👍4
🍁TOP-15 DS events in September 2022 around the world:
1. Sep 7-8 • AI for Defense Summit • Washington, DC, USA https://ai.dsigroup.org/
2. Sep 7-9 • Southern Data Science Conference 2022 • Atlanta, GA, USA https://www.southerndatascience.com/
3. Sep 12-14 • TDWI Data Governance, Quality, and Compliance • Virtual https://tdwi.org/events/seminars/september/data-governance-quality-compliance/home.aspx
4. Sep 13-14 • Chief Data & Analytics Officers, Brazil • Brazil https://cdao-brazil.coriniumintelligence.com/
5. Sep 13-14 • Edge AI Summit • Santa Clara, CA, USA https://edgeaisummit.com/events/edge-ai-summit
6. Sep 13-15 • AI Hardware Summit • Santa Clara, CA, USA https://www.aihardwaresummit.com/events/aihardwaresummit
7. Sep 14-15 • Deep Learning Summit • London, UK https://www.re-work.co/events/deep-learning-summit-london-2022
8. Sep 14-15 • AI in Retail Summit • London, UK https://www.re-work.co/events/ai-in-retail-summit-london-2022
9. Sep 14-15 • Conversational AI Summit • London, UK https://www.re-work.co/events/conversational-ai-summit-london-2022
10. Sep 15-16 • The Modern Data Stack Conference • San Francisco, CA, USA https://www.moderndatastackconference.com/
11. Sep 21-22 • Big Data LDN • London, UK https://bigdataldn.com/
12. Sep 22 • EM Biotech Connect 2022 • Boston, MA, USA https://elementalmachines.com/em-biotech-connect-2022-0
13. Sep 22 • data.world Summit • Virtual https://data.world/events/summit/
14. Sep 26-30 • SIAM Conference on Mathematics of Data Science (MDS22) • San Diego, CA, USA https://www.siam.org/conferences/cm/conference/mds22
15. Sep 29 • Data2030 Summit 2022 • Stockholm, Sweden + Virtual https://data2030summit.com
SDSC22
Southern Data Science Conference
Southern Data Science is a special data science R&D conference that brings experts and researchers from the top data science companies and institutes to present their work and share their best practices in data science.
🔥5👍1
🖕🏻3 ways to use the walrus operator in Python and a couple of reasons not to
The walrus operator (:=), which assigns a value as part of an expression, provides the following advantages:
• Reducing the number of function calls, for example, result = [y := func(x), y**2, y**3] instead of result = [func(x), func(x)**2, func(x)**3]
• Reducing nested conditionals, for example when matching regular expressions, by removing nested if statements
• Simplifying while loops, for example when reading a file line by line or receiving data from a socket. Instead of a dummy infinite while loop with flow control delegated to a break statement, you can reassign the value with an assignment expression and test it directly in the while condition, which results in noticeably shorter code.
Of course, the operator has its limitations. For example, it is discouraged inside a with statement: with (cm := ContextManager()) binds the name to the context manager object itself rather than to the value returned by its __enter__() method, which is what with ... as ... gives you. That distinction can matter, for instance when debugging.
Also, parenthesize assignment expressions wherever precedence is not obvious, so the name is bound to exactly the expression you intend; otherwise the evaluation may not be what you expect.
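A small sketch of the three patterns above; read_chunk here is a stand-in for any data source that eventually returns an empty value:
import random

def read_chunk():
    # stand-in for reading from a file or socket; "" signals the end of the stream
    return random.choice(["data", "more data", ""])

# 1. avoid calling a function several times in a comprehension
values = [y := random.random(), y ** 2, y ** 3]

# 2. keep a condition and the value it tests together
if (n := len(values)) > 2:
    print(f"got {n} values")

# 3. loop until a falsy sentinel without a while True / break dance
while (chunk := read_chunk()):
    print("received:", chunk)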
Code examples: https://betterprogramming.pub/should-you-be-using-pythons-walrus-operator-yes-and-here-s-why-36297be16907
Medium
Should You Be Using Python’s Walrus Operator? (Yes. And Here’s Why)
Python’s controversial assignment expression — also known as walrus operator — can improve your code, and it’s time you start using it!
👍4
There's no getting by in IT without English! 💻
If you know English, you already know half of IT.
The Английский для IT (English for IT) channel will help you with that: thousands of English words and phrases that will come in handy when learning programming languages.
Join to dive into the computer world with a head start in knowledge! 😉
👍1
🚀Need a visual dashboard quickly? Try Panel!
Every data analyst knows that a dashboard should summarize the most important indicators for decision makers. On the one hand, a dashboard is a simple HTML page for quickly viewing charts and text. On the other hand, making it visual without overloading it is not so easy. Moreover, not every BI specialist has design taste and the skills to work with HTML components and their interaction with JavaScript. But data specialists actively use Python.
You can build such a dashboard using only Python with Panel, an open-source Python library. It lets you create custom interactive web applications and dashboards by connecting user-defined widgets to plots, images, tables or text. With it, you do not need to know how to build HTML components or wire them up with JavaScript, because everything is written in Python.
Panel supports visualizations from nearly all plotting libraries and works just as well in Jupyter notebooks as on a standalone secure web server. It uses the same code for both, supports both Python callbacks and generated HTML/JavaScript, lets you export applications, and can run rich interactive apps without tying your domain-specific code to any particular GUI or web framework.
Panel makes the following easy:
• using Python tools for data analysis and processing;
• developing in an IDE or notebook environment and then deploying the app;
• rapid prototyping of applications and dashboards, with several polished templates for final deployment;
• deep interactivity, with interactions and events linked between the client side and Python;
• streaming both large and small datasets to the front end;
• authenticating users in the application with built-in OAuth providers.
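A minimal sketch of a Panel app (assuming panel and matplotlib are installed; the widget name and figure contents are arbitrary examples):
import matplotlib
matplotlib.use("agg")                  # render off-screen for the server
import matplotlib.pyplot as plt
import numpy as np
import panel as pn

pn.extension()

freq = pn.widgets.FloatSlider(name="Frequency", start=0.5, end=5, value=1)

def sine_plot(f):
    fig, ax = plt.subplots(figsize=(5, 3))
    x = np.linspace(0, 2 * np.pi, 200)
    ax.plot(x, np.sin(f * x))
    ax.set_title(f"sin({f:.1f}x)")
    plt.close(fig)                     # avoid leaking open figures
    return pn.pane.Matplotlib(fig, tight=True)

# the plot re-renders whenever the slider value changes
dashboard = pn.Column("# Minimal dashboard", freq, pn.bind(sine_plot, freq))
dashboard.servable()                   # run with: panel serve app.py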
Usage example: https://medium.com/@jairotunior/advanced-interactive-dashboards-in-python-cc2927dcde07
Medium
Advanced Interactive Dashboards in Python
Connect differents APIs to create an advanced interactive dashboard for analysis.
👍4
🤔PandaSQL: a Python Combo for Data Scientists
SQL and Pandas are the most popular tools for managing, processing and analyzing tabular data. They can be used independently or together via PandaSQL, a Python package that brings SQL syntax into the Python environment: it lets you query Pandas dataframes using SQL.
PandaSQL is a great tool for those who know SQL but are not yet comfortable with the syntax Pandas requires. For example, in Pandas, filtering a dataframe or grouping by one column while aggregating several others can look confusing, unlike the equivalent SQL. However, this convenience comes at a heavy runtime cost: just to count rows, PandaSQL needs almost 100 times more execution time than Pandas.
Also, Python already reserves many keywords such as for, while, in, if, else, elif, import, as; SQL adds even more: create, like, where, having.
Thus, PandaSQL is worth trying, but this interesting tool is not suitable for production data analytics pipelines.
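A tiny sketch of how the two compare on the same aggregation (assuming pandasql is installed; the data is made up):
import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({"city": ["NY", "LA", "NY"], "sales": [100, 200, 150]})

# SQL over a dataframe: sqldf finds the table by its variable name
result = sqldf("SELECT city, SUM(sales) AS total FROM df GROUP BY city", locals())
print(result)

# the same aggregation in plain Pandas, typically much faster
print(df.groupby("city", as_index=False)["sales"].sum())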
Usage examples and runtime comparison: https://towardsdatascience.com/the-downsides-of-pandasql-that-no-one-talks-about-9b63c664bef4
PyPI
pandasql
sqldf for pandas
👍3
#test
Which statistical term refers to a systematic relationship between two random variables, in which a change in one variable is reflected by a change in the other?
Anonymous Quiz
29%
Covariance
0%
Delta
71%
Correlation
0%
Interpretation
👍5
Visual ETL with VDP
VDP (Visual Data Preparation) is an open source visual data ETL tool for optimizing the end-to-end visual data processing pipeline. It involves extracting unstructured visual data from pre-built data sources such as cloud/local storage or IoT devices, transforming it into parsable structured data using Vision AI models, and loading the processed data into repositories, applications, or other destinations.
VDP streamlines the end-to-end visual data processing pipeline by eliminating the need for developers to build their own connectors, model serving platforms, and ELT automation tools, making visual data integration easier and faster. VDP is released under the Apache 2.0 license and is available for local and cloud deployment on Kubernetes. It is built from a data management perspective to optimize the end-to-end flow of visual data, with a transformation component that can flexibly import Vision AI models from different sources, so building an ETL pipeline becomes like assembling ready-made blocks in a children's construction set. High performance is provided by a Go backend and the Triton Inference Server running on NVIDIA GPUs, supporting TensorRT, PyTorch, TensorFlow, ONNX and Python.
VDP is also in line with MLOps, allowing one-click import and deployment of ML/DL models from GitHub, Hugging Face, or cloud storage managed by version-control tools such as DVC or ArtiVC. Standardized output formats for CV tasks simplify data warehousing, while pre-built ETL connectors provide advanced data access through integration with Airbyte.
VDP supports different usage scenarios: synchronous for real-time inference and asynchronous for on-demand workloads. The scalable API-based microservice design is developer-friendly and integrates seamlessly with the modern data stack, while no-code/low-code interfaces lower the barrier to entry, giving data scientists and analysts independence from data engineering.
https://github.com/instill-ai/vdp
GitHub
GitHub - instill-ai/instill-core: 🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration…
🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications - instill-ai/instil...
👍2
🥒7 Reasons Not to Use Pickle to Save ML Models
A Data Scientist often writes code in notebooks like Jupyter Notebook, Google Colab, or specialized IDEs. To port this code to a production environment, it must be converted to a lightweight interchange format, compressed and serialized, that is independent of the development language. One of these formats is Pickle, a binary version of a Python object for serializing and deserializing its structure, converting a hierarchy of Python objects into a stream of bytes and vice versa. The Pickle format is quite popular due to its lightness. It does not require a data schema and is quite common, but it has a number of disadvantages:
• Unsafe. You should only unpickle files that you trust. An attacker can craft malicious data that executes arbitrary code during unpickling. You can mitigate this risk by signing the data with an HMAC to make sure it hasn't been tampered with (see the sketch after this list). The danger comes not from Pickles containing code, but from the fact that they create objects by calling the constructors named in the file; any callable can stand in for a class name, so malicious data simply names other Python callables as "constructors".
• Code mismatch. If the code changes between the time the ML model is packaged in the Pickle file and when it is used, the objects may not match the code. They will still have the structure created by the old code, but will try to work with the new version. For example, if an attribute was added after the Pickle was created, the objects in the Pickle file will not have that attribute. And if the new version of the code is supposed to process it, there will be problems.
• Implicit serialization. On the one hand, the Pickle format is convenient because it serializes any Python object structure. On the other hand, there is no way to state preferences about how particular data should be serialized. Pickle serializes everything in an object, even data that doesn't need to be serialized, and there is no way to skip a particular attribute. If an object contains an attribute that cannot be pickled, such as an open file handle, Pickle will not skip it; it insists on trying and then throws an exception.
• Lack of initialization. Pickle stores the entire structure of objects. When the pickle module recreates objects, it does not call the __init__ method, because the object was already initialized when it was first created, before the Pickle file was written. But __init__ can do important things, like opening file objects; in that case unpickled objects end up in a state inconsistent with what __init__ would produce. Or initialization may register information about the object being created; then unpickled objects will not show up in that log.
• Unreadable. Pickle is a stream of binary data, effectively instructions for an abstract execution machine. If you open a Pickle as a normal file, its contents cannot be read; to find out what is inside, you have to load it with the pickle module. This can make debugging difficult, since it is hard to find the data you need in a binary blob.
• Binding to Python. Being a Python library, Pickle is specific to this programming language. Although the format itself could be read from other languages, it is hard to find packages that provide such capabilities, and they will be limited to cross-language structures of plain lists and dicts. Pickle happily serializes objects containing callables and classes, but it does not store their code, only the name of the function or class; when unpickling, those names are used to look up existing code in the running process.
• Low speed. Finally, compared to other serialization methods, Pickle is much slower.
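To illustrate the mitigation mentioned in the first point, here is a rough sketch of HMAC-signing pickled data (the secret key and payload are placeholders; this only protects integrity as long as the key stays secret):
import hashlib
import hmac
import pickle

SECRET_KEY = b"change-me"   # placeholder: keep the real key out of the code

def dump_signed(obj) -> bytes:
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def load_signed(blob: bytes):
    signature, payload = blob[:32], blob[32:]   # sha256 digest is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("payload failed the HMAC check; refusing to unpickle")
    return pickle.loads(payload)

blob = dump_signed({"model": "demo", "accuracy": 0.93})
print(load_signed(blob))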
👍3
🌴🌳🌲Decision Trees: brief overview
In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on various conditions. They are one of the most widely used and practical methods of non-parametric supervised learning, applied to both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features, and the approach works for both continuous and categorical output variables.
Entropy is the measure of uncertainty or randomness in a data set, and it governs how a decision tree splits the data. Information gain measures the decrease in entropy after the data set is split. The Gini index is used to choose the right variable for splitting a node: it measures how often a randomly chosen element would be incorrectly classified. The root node is always the top node of a decision tree; it represents the entire population or data sample and can be further divided into different sets. Decision nodes are subnodes that can be split into further subnodes; they contain at least two branches. A leaf node carries the final result; these nodes, also known as terminal nodes, cannot be split any further.
Decision Tree Applications:
• to determine whether an applicant is likely to default on a loan.
• to determine the odds of an individual developing a specific disease.
• to find customer churn rates
• to predict whether a consumer is likely to purchase a specific product.
Advantages of Using Decision Trees
• Decision trees are simple to understand, interpret, and visualize
• They can effectively handle both numerical and categorical data
• They can determine the worst, best, and expected values for several scenarios
• Decision trees require little data preparation and data normalization
• They perform well, even if the actual model violates the assumptions
Disadvantages
• Overfitting is one of the practical difficulties for decision tree models. It happens when the learning algorithm continues developing hypotheses that reduce the training set error but at the cost of increasing test set error. But this issue can be resolved by pruning and setting constraints on the model parameters.
• Decision trees cannot be used well with continuous numerical variables.
• A small change in the data tends to cause a big difference in the tree structure, which causes instability.
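As a concrete sketch, here is a small scikit-learn decision tree on the Iris dataset, with max_depth used as the kind of constraint mentioned above to curb overfitting:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# criterion="gini" uses the Gini index; max_depth limits growth (pre-pruning)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=list(data.feature_names)))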
https://blog.devgenius.io/decision-tree-regression-in-machine-learning-3ea6c734eb51
Medium
Decision Tree Regression in Machine learning
In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on various…
👍3
👍ETL and data integration with Airbyte
The key component of any data pipeline is data extraction; once data is extracted, it needs to be loaded and transformed (ELT). Airbyte, an open-source data integration platform, aims to standardize and simplify that extract-and-load process. Airbyte works as an ELT tool, extracting raw data and loading it into destinations, and it also lets you run transformations, keeping them separate from the EL phases. It simplifies the process by providing connectors between data sources and destinations, and it is a plugin-based system where you can quickly build your own custom connector using the CDK.
If the 170+ ready-made connectors and 25+ destinations are not enough, you can develop your own. Airbyte has a built-in scheduler with configurable sync frequencies, and Airflow and dbt integrations are supported. It runs on Kubernetes, ships octavia-cli with a YAML template for deployment, and supports near-real-time CDC.
However, the framework is still in alpha and does not support IAM role-based authentication for AWS services. There is no built-in Prometheus support, although OpenTelemetry has recently been added. Airbyte does not provide user access control or replaying a specific job execution instance, and it can slow down with 2000+ simultaneous tasks.
https://airbyte.com/
Airbyte
Airbyte | Open-Source Data Integration Platform | ELT Tool
Explore Airbyte, your go-to data integration platform and ELT tool. Seamlessly integrate, transform, and load data with our powerful, user-friendly solution.
👍3
#test
What is true about arrays and dataframes in Python?
Anonymous Quiz
10%
Array is always only one-dimensional
1%
Dataframe is always only one-dimensional
72%
DataFrame can contain elements of different data types
17%
Array contains elements of different data types
👍4
💫Need to automate your WhatsApp messages? Use PyWhatKit!
PyWhatKit is a Python library with various useful features. It is easy to use and does not require additional settings. It allows you to automate the following actions:
• Sending a message to a WhatsApp group or contact
• Sending an image to a WhatsApp group or contact
• Converting an image to ASCII art
• Converting a string to handwriting
• Playing YouTube videos
• Sending emails with HTML content
For example, the following code will send the message "Hello!" to the number +78912345678 at 12:15:
import pywhatkit
pywhatkit.sendwhatmsg("+78912345678", "Hello!", 12, 15)
The sending time must be set at least 2-3 minutes ahead of the moment the script is launched, otherwise the module will raise an error. Also, before running the script, make sure you are logged into WhatsApp Web in Google Chrome (via the web interface or the desktop application) by scanning the QR code with your mobile phone.
https://pypi.org/project/pywhatkit/
PyPI
pywhatkit
PyWhatKit is a Simple and Powerful WhatsApp Automation Library with many useful Features
👍5
🤔How to extract tables from PDF? Try Camelot!
The open-source Camelot library helps extract tables from PDF files. Before installing it, you need to install the Tkinter and Ghostscript dependencies. Camelot itself can be installed through the pip or conda package managers:
pip install camelot-py
conda install -c conda-forge camelot-py
Next, as usual, import the module to use its methods:
import camelot
tables = camelot.read_pdf('foo.pdf', pages='1', flavor='lattice')
The flavor parameter defaults to lattice but can be switched to stream. Lattice is more deterministic and works great for tables with ruling lines between cells; it can automatically parse multiple tables on a page. Lattice converts the PDF page to an image using Ghostscript and then processes it to extract horizontal and vertical line segments by applying a set of morphological transformations with OpenCV.
To get the extracted tables out, use the export() method to write them all to CSV files, to_csv() for a single table, or the df attribute to work with a table as a Pandas dataframe:
tables.export('foo.csv', f='csv', compress=True)
tables[0].to_csv('foo.csv') # to a csv file
print(tables[0].df) # to a df
https://camelot-py.readthedocs.io/en/master/user/install.html
👍6
#test
The power of an experiment does NOT depend on which of the following factors?
Anonymous Quiz
28%
Significance level (alpha)
3%
Sample size
12%
Variability
26%
Effect size
31%
Standard deviation
👍4
🍁TOP-10 DS events in October 2022 around the world:
• Oct 4-5 • Modern DataFest • Virtual https://www.datacated.com/moderndatafest
• Oct 4-6 • NLP Summit 2022 • Virtual https://www.nlpsummit.org/
• Oct 5-6 • Deep Learning World • Berlin, Germany https://deeplearningworld.de/
• Oct 5-6 • Marketing Analytics Summit • Berlin, Germany https://marketinganalyticssummit.de/
• Oct 6-7 • Big Data & AI Toronto • Toronto, ON, Canada https://www.bigdata-toronto.com/
• Oct 10-13 • Chief Data & Analytics Officers, Fall • Boston, MA, USA https://cdao-fall.coriniumintelligence.com/
• Oct 13-14 • Linq Conference – use the promo code ChernobrovovxLinq for a 5% discount on the ticket price https://linqconf.com
• Oct 17-21 • DeepLearn 2022 Autumn: 7th International School on Deep Learning • Luleå, Sweden https://irdta.eu/deeplearn/2022au/
• Oct 25-26 • IMPACT 2022: The Data Observability Summit • London, UK, New York, NY, Bay Area, CA + Virtual https://impactdatasummit.com/2022
• Oct 27-28 • The Data Science Conference • Chicago, IL, USA https://www.thedatascienceconference.com/home
Datacated
Modern DataFest
👍3
You will find it out on October 4 at 19:00 at Mathshub's Open Day.
Aira Mongush (founder of Mathshub, teacher of AI programs at MIPT/HSE, ex-Mail.Ru, ex-aitarget.com) will tell you in detail how easy it is to create an ML project from scratch:
📍 In addition, the folks from Mathshub will talk about the school and their cool teachers – David Dale from AI Research, Igor Slinko (ex-Samsung AI Research) and other specialists with extensive practice in deep learning and teaching.
📍And they will also present the autumn program for creating ML projects.
Who will be interested:
— those who want to put ML into practice
— those who have a product idea with ML
— those who want to create a portfolio in ML
— engineers and product managers who create features with ML algorithms
Sign up at the link
👍2
🚀Pandas features: use at and iat in loops instead of iloc and loc
The Python library Pandas has the iloc and loc indexers to access dataframe values by integer position or by row/column label. But calling them inside loops is time-consuming. If you replace loc with at, or iloc with iat, the execution time of a for loop can be reduced by a factor of 60!
Such a difference in speed comes from the nature of these functions: at and iat access a scalar, that is, a single element of the dataframe, while loc and iloc can access whole rows, columns and subframes at once, i.e. they are designed for vectorized operations.
Because at/iat access a single scalar, they are faster than loc/iloc for element-by-element access, which carries extra overhead. Therefore, using loc/iloc inside Python loops is slow and they should be replaced with at/iat, which execute faster. Outside of loops, for vectorized operations, loc and iloc work fine.
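A rough sketch of the comparison (absolute timings will vary by machine; the 60x figure is the article's claim):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 3), columns=["a", "b", "c"])

start = time.perf_counter()
total = 0.0
for i in range(len(df)):
    total += df.loc[i, "a"]     # label-based indexer, slow inside a loop
print("loc:", time.perf_counter() - start)

start = time.perf_counter()
total = 0.0
for i in range(len(df)):
    total += df.at[i, "a"]      # scalar accessor, noticeably faster
print("at: ", time.perf_counter() - start)

# the vectorized version beats both and is the idiomatic choice
print("sum:", df["a"].sum())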
https://medium.com/codex/dont-use-loc-iloc-with-loops-in-python-instead-use-this-f9243289dde7
Medium
Don’t use loc/iloc with Loops In Python, Instead, Use This!
Run your loops at a 60X faster speed
👍6