#test
ACID requirements for transactions are fully implemented in
Anonymous Quiz
60%
relational databases
7%
NoSQL databases
16%
any databases
16%
OLTP-databases
🔥1
🙌🏼7 Platforms of Federated ML
Federated learning is also called collaborative learning because ML models are trained on multiple decentralized edge devices or servers holding local data samples, without exchanging the samples themselves. This approach differs both from traditional centralized ML, where all local datasets are uploaded to a single server, and from classical decentralized approaches, which assume the local data are identically distributed. Today, federated learning is actively used in the defense industry, telecommunications, pharmaceuticals and IoT platforms.
Federated machine learning ideas were first introduced by Google in 2017 to improve mobile keyboard text prediction using ML models trained on data from multiple devices. In federated ML, models are trained on multiple local datasets on local nodes without explicit data exchange, but with periodic exchange of parameters, such as deep neural network weights and biases, between local nodes to build a common global model. Unlike distributed learning, which was originally aimed at parallelizing computation, federated learning is aimed at learning from heterogeneous datasets, which are usually highly heterogeneous in size. Clients, i.e. the end devices where local models are trained, can be unreliable and more failure-prone than in distributed learning systems, where the nodes are data centers with powerful computing capabilities. Therefore, to coordinate distributed computation and synchronize its results, federated ML requires frequent parameter exchange between nodes (a small aggregation sketch follows the platform list below).
Due to its architectural features, federated ML has a number of disadvantages:
• heterogeneity between local datasets - each node's sample may be biased relative to the overall population, and sample sizes can vary significantly;
• temporal heterogeneity - the distribution of each local dataset changes over time;
• dataset schemas must be kept compatible across all nodes;
• hiding training datasets is fraught with the risk of introducing vulnerabilities into the global model;
• lack of access to global training data makes it difficult to identify unwanted biases in training inputs;
• there is a risk of losing updates to local ML models due to failures at individual nodes, which may affect the global model.
Today, federated ML is supported by the following platforms:
• FATE (Federated AI Technology Enabler) https://fate.fedai.org/
• Substra https://www.substra.ai/
• Python libraries PySyft and PyGrid https://github.com/OpenMined/PySyft, https://github.com/OpenMined/PyGrid, https://github.com/OpenMined/pygrid-admin
• Open FL https://github.com/intel/openfl
• TensorFlow Federated (TFF) https://www.tensorflow.org/federated
• IBM Federated Learning https://ibmfl.mybluemix.net/
• NVIDIA CLARA https://developer.nvidia.com/clara
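To make the core idea concrete, here is a minimal federated-averaging sketch in plain Python; it is not tied to any of the platforms above, and the client weights and dataset sizes are made-up toy values:
import numpy as np
# Toy local updates: each client trains on its own data and shares only parameters
client_weights = [
    np.array([0.9, 1.1, 0.4]),  # client 1
    np.array([1.2, 0.8, 0.5]),  # client 2
    np.array([1.0, 1.0, 0.6]),  # client 3
]
client_sizes = [100, 300, 50]  # number of local training samples per client
# Federated averaging: weight each client's parameters by its share of the data
total = sum(client_sizes)
global_weights = sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
print(global_weights)  # new global model parameters, sent back to the clients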
Fate
An Industrial Grade Federated Learning Framework
Supports federated learning architectures and secure computation of any machine learning algorithms
👍2
🙌🏼Computational complexity of ML algorithms
When the amount of data is small, almost any ML algorithm gives acceptable accuracy and is suitable for the task. But when the volume and dimensionality of the data grow, you need to choose a training algorithm that does not demand too many computing resources. It is better to pick a simpler or computationally cheaper algorithm over a resource-hungry one when the prediction accuracy is similar or only slightly worse.
The choice of algorithm depends on the following factors:
• time complexity - the order of time required to run the algorithm, a function of the input size and the number of features;
• space complexity - the order of memory required while the algorithm runs, a function of things such as the number of features, coefficients, or hidden layers of a neural network. Space complexity includes both the size of the input data and the auxiliary space used by the algorithm during execution.
For example, merge sort uses O(n) auxiliary space and O(n) total space, while quicksort uses O(1) auxiliary space (ignoring the recursion stack) and O(n) total space. Both run in O(n log n) time, quicksort in the average case - the short sketch below makes this concrete.
https://medium.com/datadailyread/computational-complexity-of-machine-learning-algorithms-16e7ffcafa7d
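A quick illustration of the time/space distinction mentioned above, using merge sort on a toy list:
def merge_sort(items):
    # O(n log n) time: the list is halved log n times and each level merges n items
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # O(n) auxiliary space: a new list is built during the merge
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]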
Medium
Computational Complexity of Machine Learning Algorithms
Pick the right algorithm for your data
👍2
#test
Overfitting of an ML model on a large volume of highly noisy input data can be avoided by selecting the most significant features using
Anonymous Quiz
14%
filtration
47%
L1 regularization
27%
L2 regularization
12%
normalization
👍3
👍🏻TOP 4 dbt tips for data analyst and data engineer
dbt (data build tool) is an open-source framework for executing, testing and documenting SQL queries; it lets you structure and describe data sources and models, search them, make nested calls, trigger rules, and generate documentation and tests. For example, you can use the dbt CLI or dbt Cloud in a data pipeline to extract, transform and load data into a warehouse on a schedule. The following tips make testing schemas, sources and models with dbt more effective:
• The schema.yml file lives only in the dbt models folder. It lets you define schema tests, for example a test that checks a column for nulls.
• dbt data tests have a strict rule: they must return zero rows to pass. Instead of checking directly for a value such as the count of a particular set of rows, write the data test so that it returns zero rows when the results match expectations. So when developing a data test, think about how to make it return 0 rows in the expected case while still validating the numbers; the != or <= operators are handy for this.
• To speed up testing, increase the number of threads in the project profile, i.e. in the profiles.yml file. With several threads configured, roughly 30 data and schema tests can run in about 4 seconds.
• Every test needs a meaningful name. Although dbt generates test names automatically, it is better to label them yourself. dbt does not give fine-grained control over running small test suites: when tests run, all schema and data tests execute together, so you need to be able to see at a glance which test passed or failed. Just as developers are encouraged to give functions and variables semantically meaningful names, tests should get meaningful names too; otherwise it is hard to tell which test passes or fails during a run. You can also run "dbt test --schema" or "dbt test --data" to quickly limit which kind of tests to execute.
https://corissa-haury.medium.com/4-quick-facts-about-dbt-testing-5c32b487b8cd
Medium
4 Quick Facts About dbt Testing
You may be using dbt CLI or dbt Cloud for your data pipeline work to Extract, Transform, and Load data into a warehouse by creating…
👍2
🔥PyMLPipe: A lightweight MLOps Python Package
PyMLPipe is a lightweight Python package for MLOps processes. It helps to automate:
• Monitoring of models and data schemas
• Versioning of ML models and data
• Model performance comparison
• API deployment in one click
This open-source library supports Scikit-Learn, XGBoost, LightGBM and PyTorch. It has a modular structure: a set of Python functions combined into an API plus a visual graphical interface. PyMLPipe is great for working with tabular data.
https://neelindresh.github.io/pymlpipe.documentation.io/
neelindresh.github.io
PyMLPipe
🔥2
🌸🌤TOP-10 DS-events in August 2022 all over the World:
• Aug 5, Bayesian Modelling Applications Workshop. Eindhoven, The Netherlands + Virtual. http://abnms.org/uai2022-apps-workshop/
• Aug 8-12, Data Matters. Virtual. https://datamatters.org/
• Aug 11, Subsurface Community Meetup: Why Apache Arrow is the industry-standard for columnar data processing and transport. Virtual. https://subsurfacemeetupaugust2022.splashthat.com/
• Aug 14-18, KDD 2022: ACM SIGKDD 2022. Washington, DC, USA. https://kdd.org/kdd2022/
• Aug 15, 1st ACM SIGKDD Workshop on Content Understanding and Generation for E-commerce. Washington, DC, USA. https://content-generation.github.io/workshop/
• Aug 15-17, TDWI Data Literacy Bootcamp. Virtual. https://tdwi.org/events/seminars/august/tdwi-data-literacy-bootcamp/home.aspx
• Aug 15-17, Disney Data & Analytics Conference. Orlando, FL, USA. https://disneydataconference.com/
• Aug 16, StateOfTheArt() - Free AI Conference with Top AI/ML Influencers. Virtual. https://www.eventbrite.com/e/stateoftheart-free-ai-conference-with-top-aiml-influencers-tickets-379160628647
• Aug 16-18, Ai4 2022, the industry's leading AI event. Aug 16, Las Vegas, NV, USA. https://ai4.io/usa/application-attendee/
• Aug 23, The data dividend: Mumbai. Mumbai, India. https://events.economist.com/custom-events/the-data-dividend-mumbai
• Aug 23-24, Ray Summit. San Francisco, CA, USA.
datamatters.org
Data Matters | Short Course Series
Data Matters™ is a week-long series of one and two-day courses aimed at students and professionals in business, research, and government.
🔥1
🗣Data analysts speak SQL. How can they understand each other?
Every analyst knows 5 rules of SQL query formatting to make them easy to read:
• Place Key Words (SELECT, FROM and WHERE) On New Lines
• List Column Names after SELECT On New Lines
• Indent Sub-Elements On New Line
• Add Subquery Parenthesis On Their Own Lines
• Place Case Statement Conditions On New Lines
However, not all analysts apply these rules in practice. Of course, specialized IDEs can take over formatting: for example, Visual Studio Code has built-in document formatting as well as external extensions such as SQLTools or SqlBeautifier. If you need to read a very large SQL query from colleagues that arrives as flat text, use an online formatter to convert it to a readable form (a small Python option is sketched after the list):
• https://codebeautify.org/sqlformatter
• https://www.freeformatter.com/sql-formatter.html
• https://sqlformat.org/
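If you would rather format queries from Python than paste them into a web service, the sqlparse library can do the same job; a minimal sketch (sqlparse is an assumption here, it is not mentioned in the post):
import sqlparse  # pip install sqlparse
raw = "select id, name, total from orders o join users u on u.id = o.user_id where total > 100"
# reindent puts clauses on their own lines, keyword_case uppercases SELECT/FROM/WHERE
print(sqlparse.format(raw, reindent=True, keyword_case='upper'))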
Medium
SQL 101: How To Format Queries The Right Way
Five Rules Of Thumb For Beginners Who Want To Write Clean SQL
👍2
👍🏻4 utilities for working with JSON files
Hadoop and Spark are the most popular frameworks for working with big data, i.e. large files. But you often need to process many small files, for example in JSON format, which in Hadoop HDFS are spread over many data blocks and partitions. The number of partitions determines the number of tasks, since one task can handle only one partition at a time; this creates a high load on the Application Master and reduces throughput for the entire cluster. In addition, most of the time is spent just opening and closing files rather than reading data.
Therefore, it is worth combining many small files into one large file, which Hadoop and Spark can process very quickly. For JSON files, such a merge into an array of records can be done with the following tools (a plain-Python alternative is sketched after the list):
• jq – is used to filter and process incoming JSON data, great for parsing and processing data streams https://stedolan.github.io/jq/
• jo - creates JSON data structures https://github.com/jpmens/jo
• json_pp - pretty-prints JSON objects in a more readable form and can convert between formats https://github.com/deftek/json_pp
• jshon - JSON parser with fast evaluation of large amounts of data http://kmkeen.com/jshon/
https://sidk17.medium.com/boss-we-have-a-large-number-of-small-files-now-how-to-process-these-files-ee27f67dc461
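For the simple case of merging many small JSON files into a single array of records, a few lines of plain Python are also enough (the directory and file names below are hypothetical):
import json
from pathlib import Path
records = []
for path in Path("small_files").glob("*.json"):  # hypothetical input directory
    with open(path) as f:
        records.append(json.load(f))  # one JSON object per small file
with open("combined.json", "w") as f:
    json.dump(records, f)  # one large array of records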
GitHub
GitHub - jpmens/jo: JSON output from a shell
JSON output from a shell. Contribute to jpmens/jo development by creating an account on GitHub.
👍1
👀Looking for data to train ML-models? Generate it Yourself: 3 Python Packages for Generating Synthetic Data
Synthetic data is an artificially generated, rather than collected, dataset used for training ML models or practicing analysis techniques. You can create it yourself using the following Python packages:
• Faker is a very simple and efficient Python package for creating fake data. It's great when you need to seed a database, create XML documents, prepare for load testing, or anonymize data retrieved from production services (a tiny example follows after the list). https://github.com/joke2k/faker
• SDV (Synthetic Data Vault) is a library for creating synthetic data based on a given dataset. The generated data can be a single table, a set of relational tables, or a time series, and it has the same properties and statistics as the original dataset. SDV generates synthetic data with deep-learning models and copes even when the original dataset contains multiple data types and gaps. https://sdv.dev/SDV/
• Gretel Synthetics is an open-source package based on recurrent neural networks for generating structured and unstructured data. It treats a dataset as text and trains a model on it; the model then generates synthetic data as text. Because Gretel relies on RNNs, it needs more computing power, so it is better to run it in Google Colab rather than on a personal computer. https://synthetics.docs.gretel.ai/en/stable/
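A minimal Faker sketch (the choice of fields is arbitrary):
from faker import Faker  # pip install Faker
fake = Faker()
for _ in range(3):
    # each call returns a new randomly generated value
    print(fake.name(), fake.email(), fake.address().replace("\n", ", "))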
GitHub
GitHub - joke2k/faker: Faker is a Python package that generates fake data for you.
Faker is a Python package that generates fake data for you. - joke2k/faker
👍1👏1
Alfa Data Science MeetUp#2 will take place on August 11 📟
Participation is free; register on the website to receive a link to the online stream.
Topics and speakers:
🖲 Growing the client base: LTV modeling and forecasting future revenue
- Sergey Korolev, Middle Data Scientist, Alfa-Bank
🖲 Uplift modeling in the pricing of credit products
- Maxim Komatovsky, Junior Data Scientist, Alfa-Bank
🖲 Perfect calculation code
- Maxim Statsenko, Team Lead/Senior DWH Developer at Yandex
🖲 Beating distribution shift in neural-network credit scoring
- Alexey Firstov, Senior Data Scientist, Alfa-Bank
The meetup will be held in an interactive format; questions for the speakers are welcome, and the authors of the best questions will receive prizes from Alfa Digital.
👍3
#test
The Student's t-test is applied to data that follow
Anonymous Quiz
64%
Gaussian distribution
3%
Bayesian distribution
5%
Pareto distribution
28%
any distribution
👍3
😱3 types of data anomalies
Data analysts and machine learning practitioners often have to detect anomalies in data - observations that do not fit the expected pattern. There are 3 types of anomalies:
• point anomaly, when a single data point (observation) lies far from the rest of the dataset and represents an extreme or random deviation unrelated to the overall pattern in the data. A point anomaly is also called a global outlier because it differs significantly from the rest of the dataset (a tiny z-score sketch for this case follows after the list).
• contextual anomaly, when a particular instance is anomalous only in a specific context. For time series data, such as some quantity recorded over time, the context is temporal. Data points that differ sharply from other data in the same context are contextual outliers. For example, suppose the number of cars passing a checkpoint on a regional border in March has averaged 1 thousand over the past 20 years, while in June, when the vacation season begins, it reaches 8 thousand. A value of 9 thousand at the beginning of March would be an anomaly, but the same value in summer would not. Likewise, it is normal for retail to see a surge of shoppers during the holiday season, but a sharp increase in sales outside of holidays or sales campaigns would be a contextual outlier.
• collective anomaly, when a group of correlated, interrelated, or sequential instances differs significantly from the rest of the data, so the group as a whole is judged anomalous even if its individual points are not. For time series data this may look like typical peaks and troughs occurring outside the usual seasonal period, or like a set of time series that are simultaneously in an outlier state. For example, sales drop at the same time across a large number of companies, although the trend before that was upward.
https://medium.com/datadailyread/types-of-data-anomalies-2f6fb1747eb1
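As a small illustration of flagging point anomalies, a z-score rule marks observations far from the mean; the data below are toy values and the threshold of 2 is an arbitrary choice for such a tiny sample:
import numpy as np
data = np.array([1000, 980, 1020, 995, 1010, 9000])  # toy daily counts with one outlier
z_scores = (data - data.mean()) / data.std()  # standardize each observation
print(data[np.abs(z_scores) > 2])  # prints [9000]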
Medium
Types of Data Anomalies
Do you know what type of anomalies are you dealing with?
👍5
👍🏻10 Best Practices for Naming Tables and Fields in a Database
If every developer and analyst followed these simple rules, reverse engineering would become a hobby, not a laborious job. To make it easier for you and your colleagues to work with the database, try these simple rules:
1. Separate words with underscores if a column or table name consists of 2 or more words. This reads better than camelCase and avoids case-sensitivity issues across platforms. For example, word_count.
2. Write full, semantically meaningful names for tables and columns without encoding data types in them. Saving a couple of characters only creates confusion. Abbreviations are acceptable only when they are understood by everyone.
3. Write attribute names in lowercase to distinguish them from uppercase SQL keywords. It will also improve your typing speed.
4. Do not use numbers in the names of tables and columns.
5. Name the tables clearly, but briefly.
6. Name tables and columns in the singular. For example, author instead of authors.
7. Name the linking tables in alphabetical order. For example author_book
8. When an index is set, add its table and column name. For example, CREATE INDEX person_ix_first_name_last_name ON person (first_name, last_name);
9. For Boolean column type add prefix name with is_ or has_ . For example is_admin or has_membership.
10. For columns of Date-Time type, add suffix _at or _time to the name. For example, order_at or order_time.
https://dev.to/mohammadfaisal/how-to-design-a-clean-database-1e83
DEV Community
How to Design a Clean Database
To read more articles like this, visit my blog 18 best practices to keep names simple and...
👍4
🔥Instead of Jupyter Notebook: Benefits of Deepnote
Jupyter notebooks have been actively used by data analysts and ML engineers for many years. However, despite its popularity, this research tool has significant drawbacks:
• Difficult code versioning. Jupyter notebooks are stored as large JSON files, so merging two notebooks is next to impossible, and so is the usual developer workflow around a version-control tool like Git.
• Lack of full IDE support, code highlighting and tooltips. A data scientist is usually not a professional software developer, so tools that keep code quality up and help raise it are very important.
• Difficult test-driven development. The popular TDD methodology is almost impossible to apply in Jupyter notebooks, so they are hard to use in data pipelines.
• Non-linear workflow caused by jumping from one cell to another, which can make an experiment irreproducible. The interactive way of coding and navigating between cells is both one of Jupyter Notebook's best features and its biggest weakness.
• Jupyter is not well suited to long-running asynchronous data-processing tasks.
😡
Many of these shortcomings are addressed by an alternative to Jupyter Notebook called Deepnote. Deepnote, like Jupyter, is an interactive notebook for solving DS problems, but it beats its competitor on a number of points 💥:
• Real-time collaboration - as in Google Docs, you can share a link to your notebook with colleagues, giving everyone the desired level of access (view, execute, comment, edit or full control). In addition, cells in Deepnote let collaborators leave comments, eliminating the need to switch between a messaging app and the code when giving feedback. With access to the developer's code, managers and other team members can easily follow progress throughout the development life cycle.
• Easy environment management - Deepnote takes over installing modules and setting up the Python environment, including versions. In addition to Python, Deepnote also supports SQL queries.
• Embedding - Deepnote can embed blocks of code in blogs and other resources (it can create a GitHub project specifically for this purpose). Deepnote cells can be embedded as code only, output only, or both code and output.
• Data visualisation - in Jupyter notebooks, EDA almost always requires explicit coding. Deepnote provides a visualisation tool in the notebook itself: a chart block lets you build plots, much like Python plotting libraries, but without writing code.
• Save time and money - once Deepnote takes charge of code management and execution, teams don't need to push their code pipelines to tools like GitHub, BitBucket, etc., thus reducing operating costs.
Try it for free: https://deepnote.com/
Deepnote
Deepnote: Analytics and data science notebook for teams.
Explore data with Python & SQL, work together with your team, and share insights that lead to action — all in one place with Deepnote.
👍4
💥Instead of loops: 3 Python life hacks
Developers and data scientists know that loops in Python are slow. Instead, you can often use these built-in alternatives (a short sketch follows the list):
• Map - apply a function to each value of an iterable object (list, tuple, etc.).
• Filter - filter values out of an iterable object (list, tuple, set, etc.). The filtering condition is defined in a function that is passed as an argument to filter.
• Reduce - this function is a bit different from map and filter: it is applied iteratively to all values of the iterable and returns a single value. In Python 3 it lives in the functools module.
Examples: https://medium.com/codex/3-most-efficient-yet-underutilized-functions-in-python-d865ffaca0bb
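A minimal sketch of all three on a toy list of numbers:
from functools import reduce  # reduce lives in functools in Python 3
numbers = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, numbers))  # map: [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # filter: [2, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)  # reduce folds to a single value: 15
print(squares, evens, total)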
Medium
Don’t Run Loops in Python, Instead, Use These!
No need to run loops in Python anymore
👍4
🤔Python-library to calendar operations
Python includes a built-in calendar module that provides operations related to dates and days of the week. Its functions and classes use the European convention, where Monday is the first day of the week and Sunday is the last.
To use the module, you must first import it into your code:
import calendar
You can then call a function, for example print the names of the months in a list:
month_names = list(calendar.month_name[1:])
print(month_names)
https://docs.python.org/3/library/calendar.html
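A couple more calls from the same module, as a quick illustration of the weekday convention mentioned above (the dates are arbitrary):
import calendar
# 0 = Monday ... 6 = Sunday under the default (European) convention
print(calendar.weekday(2022, 8, 11))  # 3, i.e. Thursday
print(calendar.isleap(2024))  # True
print(calendar.month(2022, 8))  # plain-text calendar for August 2022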
👍2❤1
#test
The probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true is called
Anonymous Quiz
59%
power of a binary hypothesis test
38%
type II error
3%
confusion matrix
0%
random value
👍3
🗒Need to log Python application events? There is a special module!
The Python logging library (https://docs.python.org/3/library/logging.html) defines functions and classes that implement a flexible event logging system for applications and libraries. The main advantage of using the logging API provided by the standard library is that all Python modules can participate in logging, so your application log can include your own messages alongside messages from third-party modules.
The module consists of the following components:
• Loggers expose the interface that application code uses.
• Handlers send log records (created by loggers) to the appropriate destination.
• Filters provide finer-grained control over which log records to output.
• Formatters specify the layout of log records in the final output.
The level of a log record indicates its severity, i.e. how important an individual message is. DEBUG has the lowest priority and CRITICAL the highest. If we configure a logger at the DEBUG level, every message will be logged, since DEBUG is the lowest level; alternatively, you can configure logging to record only ERROR and CRITICAL events.
Code example: https://medium.com/@DavidElvis/logging-for-ml-systems-1b055005c2c2
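A minimal sketch of the module in use (the logger name and format string are arbitrary choices):
import logging
# Configure the root logger: minimum level and message format, output goes to the console
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(name)s %(levelname)s: %(message)s")
logger = logging.getLogger("ml_app")  # hypothetical application logger name
logger.debug("loading training data")  # lowest severity
logger.info("model training started")
logger.warning("feature X has 5% missing values")
logger.error("failed to save checkpoint")
logger.critical("out of memory, aborting")  # highest severity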
Medium
Logging for ML Systems
Logging is the process of tracking and recording key events that occur in our applications. We want to log events so we can use them to…
👍4