📝How to improve medical datasets: 4 small tips
Getting the right data in the right amount - before ordering datasets of medical images, project managers need to coordinate with machine learning, data science, and clinical research teams. This helps avoid acquiring "bad" data and spares annotation teams from filtering through thousands of irrelevant or low-quality images and videos during annotation, which is costly and time-consuming.
Empowering annotator teams with AI-based tools - annotating medical images for machine learning models requires precision, efficiency, high quality, and security. With AI-based image annotation tools, medical annotators and specialists can save hours of work and produce more accurately labeled medical images.
Ensuring ease of data transfer - clinical data should be delivered in a format that is easy to parse, annotate, and port, and that, after annotation, can be transferred to an ML model quickly and efficiently.
Overcoming the complexities of storage and transmission - medical image data often amounts to hundreds or thousands of terabytes that cannot simply be mailed. Project managers need to ensure end-to-end security and efficiency when purchasing or retrieving, cleaning, storing, and transferring medical data.
🤼♂️Hive vs Impala: very worthy opponents
Hive and Impala are technologies used to analyze big data. In this post, we will look at the advantages and disadvantages of both and compare them.
Hive is a data analysis tool based on the HiveQL query language. It lets users access data in the Hadoop Distributed File System (HDFS) using SQL-like queries. However, because Hive executes queries through the MapReduce framework, it is often slower than many other data analysis tools.
Impala is an interactive data analysis tool designed for the Hadoop environment. It works with SQL queries and processes data with low latency, so users receive query results quickly.
What are the advantages and disadvantages of Hive and Impala?
Advantages of Hive:
• Scalability: Hive scales easily and can handle huge amounts of data;
• Support for custom functions: Hive lets users write their own functions and aggregates in Java, extending Hive's functionality and enabling customized data processing solutions.
Disadvantages of Hive:
• Restrictions on streaming data: Hive is not suitable for streaming workloads because it relies on MapReduce, a batch processing framework. Hive processes data only after it has been written to files on HDFS.
Advantages of Impala:
• Fast query processing: Impala delivers high query performance thanks to its massively parallel processing (MPP) architecture and in-memory execution, so analysts and developers get query results with minimal delay.
• Compatibility with the Hadoop ecosystem: Impala uses the same metadata (the Hive metastore) and an SQL syntax close to HiveQL, so existing Hive tables can be queried without migration.
Disadvantages of Impala:
• Limited scalability: Impala does not handle data volumes as large as Hive does and may hit scalability limits on very big datasets.
• High resource requirements: Impala consumes more resources than Hive because it keeps distributed data in memory, which may require more powerful servers to maintain performance.
The final choice between Hive and Impala depends on the specific situation and requirements. If you work with very large amounts of data and need a simple, accessible SQL-like environment, Hive may be the better choice. If you need fast, interactive processing and support for complex queries, Impala is likely preferable.
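Because both engines expose a SQL interface over the same Hive metastore, trying a query on both from Python is mostly a matter of which client you connect with. Below is a minimal sketch assuming the pyhive and impyla client packages, default service ports, and a placeholder table named events - adjust hosts, ports, and the query for your cluster.

from pyhive import hive
from impala.dbapi import connect as impala_connect

QUERY = "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"

# Hive: the same SQL executed through HiveServer2 (batch-oriented engine).
hive_cur = hive.Connection(host="hive-host.example.com", port=10000).cursor()
hive_cur.execute(QUERY)
print("Hive:", hive_cur.fetchall())

# Impala: the same SQL executed by the MPP daemons (low-latency engine).
impala_cur = impala_connect(host="impala-host.example.com", port=21050).cursor()
impala_cur.execute(QUERY)
print("Impala:", impala_cur.fetchall())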
⛑⛑⛑Medical data: what to consider when working with healthcare information
The main problem with health data is its sensitivity. It contains confidential information protected by the Health Insurance Portability and Accountability Act (HIPAA) and may not be used without explicit consent. In the medical field, such sensitive details are referred to as protected health information (PHI). Here are a few factors to consider when working with medical datasets:
Protected Health Information (PHI) appears in many kinds of medical documents: emails, clinical notes, test results, or CT scans. While diagnoses or prescriptions are not considered sensitive information on their own, they fall under HIPAA when combined with so-called identifiers: names, dates, contact details, social security or account numbers, photographs, or other elements that can be used to locate, identify, or contact a particular patient.
Anonymization and de-identification of medical data. Personal identifiers, and even parts of them (such as initials), must be removed before medical data can be used for research or business purposes. There are two ways to do this: anonymization and de-identification. Anonymization permanently eliminates all sensitive data. De-identification only replaces or encrypts personal identifiers and keeps them in separate datasets, so the identifiers can later be re-associated with the health information.
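To make the difference concrete, here is a minimal Python sketch (a toy example, not a production de-identification tool; the field names are invented): anonymization drops identifiers for good, while de-identification replaces them with pseudonyms kept in a separate lookup that can be reversed later.

import hashlib

record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "J18.9", "age": 54}
IDENTIFIERS = ("name", "ssn")  # fields treated as PHI identifiers in this toy example

def anonymize(rec):
    # Anonymization: identifiers are removed permanently and cannot be restored.
    return {k: v for k, v in rec.items() if k not in IDENTIFIERS}

def deidentify(rec, lookup):
    # De-identification: identifiers become pseudonyms; the mapping is stored
    # separately (and access-controlled) so it can be reversed when permitted.
    out = dict(rec)
    for field in IDENTIFIERS:
        pseudonym = hashlib.sha256(rec[field].encode()).hexdigest()[:12]
        lookup[pseudonym] = rec[field]
        out[field] = pseudonym
    return out

lookup = {}
print(anonymize(record))            # no way back to the patient
print(deidentify(record, lookup))   # reversible via the separate lookup table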
Medical data markup. Any unstructured data (texts, images, or audio files) used to train machine learning models requires markup, or annotation: the process of adding descriptive elements (labels or tags) to pieces of data so that a machine can understand what is in the image or text. When working with healthcare data, the markup should be done by healthcare professionals, whose hourly rates are much higher than those of annotators without domain knowledge. This creates another barrier to producing high-quality medical datasets.
In summary, preparing medical data for machine learning typically requires more money and time than the average for other industries due to strict regulation and the involvement of highly paid subject matter experts. Consequently, we are seeing a situation where public medical datasets are relatively rare and are attracting serious attention from researchers, data scientists, and companies working on AI solutions in the field of medicine.
📖Top Useful Data Visualization Books
Effective Data Storytelling: How to Drive Change - The book was written by American business intelligence consultant Brent Dykes, and is also suitable for readers without a technical background. It's not so much about visualizing data as it is about how to tell a story through data. In the book, Dykes describes his own data storytelling framework - how to use three main elements (the data itself, narrative and visual effects) to isolate patterns, develop concept solutions and justify them to the audience.
Information Dashboard Design is a practical guide that outlines the best practices and most common mistakes in creating dashboards. A separate part of the book is devoted to an introduction to design theory and data visualization.
The Accidental Analyst is an intuitive step-by-step guide for solving complex data visualization problems. The book describes the seven principles of analysis, which determine the procedure for working with data.
Beautiful Visualization: Looking at Data Through the Eyes of Experts - this book walks through the data visualization process using examples from real projects. It features commentary from 24 industry experts, from designers and scientists to artists and statisticians, who talk about their data visualization methods, approaches, and philosophies.
The Big Book of Dashboards - This book is a guide to creating dashboards. In addition, the book has a whole section devoted to psychological factors. For example, how to respond if a customer asks you to improve your dashboard by adding a couple of useless charts.
💥YTsaurus: Yandex's big data storage and processing system goes open source
YTsaurus is an open-source distributed platform for storing and processing big data. The system is built around MapReduce, a distributed file system, and a NoSQL key-value database.
YTsaurus is built on top of Cypress, a fault-tolerant tree-based storage that provides features such as:
a tree namespace whose nodes are directories, tables (structured or semi-structured data), and files (unstructured data);
support for columnar and row mechanisms for storing tabular data;
expressive data schematization with support for hierarchical types and features of data sorting;
background replication and repair of erasure-coded data that require no manual intervention;
transactions that can affect many objects and last indefinitely.
In general, YTsaurus is a fairly powerful computing platform that involves running arbitrary user code. Currently, YTsaurus dynamic tables store petabytes of data, and a large number of interactive services are built on top of them.
The GitHub repository contains the YTsaurus server code, deployment infrastructure using Kubernetes (k8s), the system's web interface, and client SDKs for common programming languages: C++, Java, Go, and Python. All of this is released under the Apache 2.0 license, so anyone can deploy it on their own servers and modify it to suit their needs.
🤔What is Data Mesh: the essence of the concept
Data Mesh is a decentralized, flexible approach to how distributed teams work with and share data. It emerged as a response to the dominant concepts for working with data in data-driven organizations - the Data Warehouse and the Data Lake. Both are built around centralization: all data flows into a central repository, from which different teams take what they need. However, this model has to be supported by a team of data engineers with a specialized skill set, and as the number of sources and the variety of data grow, it becomes harder and harder to ensure data quality for the business, and transformation pipelines become ever more complex.
Data Mesh proposes to solve these and other problems based on four main principles:
1. Domain-oriented ownership - domain teams own data, not a centralized Data team. A domain is a part of an organization that performs a specific business function, for example, it can be product domains (mammography, fluorography, CT scan of the chest) or a domain for working with scribes.
2. Data as a product - data is treated not as a static dataset but as a dynamic product with its own users, quality metrics, and development backlog, overseen by a dedicated product owner.
3. Self-serve data platform. The main function of the data platform in Data Mesh is to eliminate unnecessary cognitive load. This allows developers in domain teams (data product developers and data product consumers) who are not data scientists to conveniently create Data products, build, deploy, test, update, access and use them for their own purposes.
4. Federated computational governance - instead of centralized data management, a special federated body is created, consisting of representatives of domain teams, data platforms and experts (for example, lawyers and doctors), which sets global policies in the field of working with data and discusses the development of the data platform.
🤓What is synthetic data and why is it used?
Synthetic data is artificial data that mimics real-world observations and is used to train machine learning models when obtaining real data is impossible or too costly. Synthesized data can be used for almost any project that requires computer simulation to predict or analyze real events. There are many reasons why a business might consider using synthetic data. Here are some of them:
1. Efficiency of financial and time costs. If a suitable dataset is not available, generating synthetic data can be much cheaper than collecting real world event data. The same applies to the time factor: synthesis can take a matter of days, while collecting and processing real data sometimes takes weeks, months or even years.
2. Studying rare or dangerous data. In some cases, the data is rare or collecting it is dangerous. An example of rare data is a set of unusual fraud cases. An example of dangerous real-world data is traffic accidents, which self-driving cars must learn to respond to; these can be replaced with synthetic accidents.
3. Eliminate privacy issues. When it is necessary to process or transfer sensitive data to third parties, confidentiality issues should be taken into account. Unlike anonymization, synthetic data generation removes any trace of real data identity, creating new valid datasets without sacrificing privacy.
4. Ease of labeling and control. From a technical point of view, fully synthetic data simplifies markup. For example, if an image of a park is generated, it is easy to label trees, people, and animals automatically, so there is no need to hire people to label these objects manually. In addition, fully synthesized data is easy to control and modify.
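As a small illustration of the last point (a sketch assuming scikit-learn is available; the fraud-detection framing is only an example): a fully synthetic, already-labeled tabular dataset can be produced in one call, including a deliberately rare class.

from sklearn.datasets import make_classification

# Generate 10,000 synthetic samples with 20 features and a rare positive class (2%),
# e.g. to prototype a fraud-detection model before real data is available.
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.98, 0.02],   # class balance: 98% "normal", 2% "fraud"
    random_state=42,
)
print(X.shape, y.mean())    # the labels y come for free with the generated data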
🌎TOP-10 DS-events all over the world in April:
Apr 1 - IT Days - Warsaw, Poland - https://warszawskiedniinformatyki.pl/en/
Apr 3-5 - Data Governance, Quality, and Compliance - Online - https://tdwi.org/events/seminars/april/data-governance-quality-compliance/home.aspx
Apr 4-5 - Healthcare NLP Summit - Online - https://www.nlpsummit.org/
Apr 12-13 - Insurance AI & Innovative Tech USA - Chicago, USA - https://events.reutersevents.com/insurance/insuranceai-usa
Apr 17-18 - ICDSADA 2023: 17th International Conference on Data Science and Data Analytics - Boston, USA - https://waset.org/data-science-and-data-analytics-conference-in-april-2023-in-boston
Apr 25 - Data Science Day 2023 - Vienna, Austria - https://wan-ifra.org/events/data-science-day-2023/
Apr 25-26 - Chief Data & Analytics Officers, Spring - San Francisco, USA - https://cdao-spring.coriniumintelligence.com/
Apr 25-27 - International Conference on Data Science, E-learning and Information Systems 2023 - Dubai, UAE - https://iares.net/Conference/DATA2022
Apr 26-27 - Computer Vision Summit - San Jose, USA - https://computervisionsummit.com/location/cvsanjose
Apr 26-28 - PyData Seattle 2023 - Seattle, USA - https://pydata.org/seattle2023/
😎Searching for data and learning SQL at the same time is easy!!!
Census GPT is a tool that allows users to search for data about cities, neighborhoods, and other geographic areas.
Using artificial intelligence, Census GPT has organized and indexed huge amounts of data into a single searchable database. It currently covers the United States: users can request data on population, crime rates, education, income, age, and more, and Census GPT can present the results on clear, concise US maps.
On the Census GPT site, users can also refine existing maps. Results are returned together with the SQL query that produced them, so you can learn SQL and check yourself against real examples at the same time.
📝A selection of sources with medical datasets
The international healthcare system generates a wealth of medical data every day that (at least in theory) can be used for machine learning.
Here are some sources with medical datasets:
1. The Cancer Imaging Archive (TCIA), funded by the US National Cancer Institute (NCI), is a publicly accessible repository of radiological and histopathological images.
2. The National Covid-19 Chest Imaging Database (NCCID), part of the NHS AI Lab, contains radiographs, MRIs, and chest CT scans of hospital patients across the UK. It is one of the largest archives of its kind, with 27 hospitals and foundations contributing.
3. The Medicare Provider Catalog collects official data from the Centers for Medicare and Medicaid Services (CMS). It covers many topics, from the quality of care in hospitals, rehabilitation centers, hospices, and other healthcare facilities to the cost of a visit and information about doctors and clinicians. You can view the data in a browser, download specific datasets in CSV format, or connect your own applications to the website via the API.
4. The Older Adults Health Data Collection on Data.gov consists of 96 datasets managed by the US federal government. Its main purpose is to collect information about the health of people over 60 in the context of the Covid-19 pandemic and beyond. Organizations maintaining the collection include the US Department of Health and Human Services, the Department of Veterans Affairs, the Centers for Disease Control and Prevention (CDC), and others. Datasets can be downloaded in various formats: HTML, CSV, XSL, JSON, XML, and RDF.
5. The Cancer Genome Atlas (TCGA) is a major genomics database covering 33 disease types, including 10 rare ones.
6. Surveillance, Epidemiology, and End Results (SEER) is an authoritative source of cancer statistics in the United States, aimed at reducing the cancer burden in the population. Its database is maintained by the Surveillance Research Program (SRP), part of the Division of Cancer Control and Population Sciences (DCCPS) at the National Cancer Institute.
😳A selection of Python libraries for generating random test data
Many people love Python for how convenient it makes data processing. But sometimes you need to write and test an algorithm for a topic where little or no data is publicly available. For such cases, there are libraries that can generate fake data of the desired types.
Faker is a library for generating many kinds of random information, with an intuitive API (see the sketch after this list). There are also implementations for languages such as PHP, Perl, and Ruby.
Mimesis is a Python library that helps generate data for various purposes. It is written using only the Python standard library, so it has no third-party dependencies.
Radar is a library for generating random dates and times.
Fake2db is a library that generates fake data directly in a database (engines for several DBMSs are available).
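A minimal sketch of what working with Faker looks like (the record fields below are chosen just for illustration):

from faker import Faker

fake = Faker()   # default locale is en_US; e.g. Faker("ru_RU") generates Russian-looking data
Faker.seed(0)    # make the generated values reproducible

# Generate a few fake "user" records, e.g. to test a pipeline or fill a dev database.
users = [
    {"name": fake.name(), "email": fake.email(),
     "address": fake.address(), "signup": fake.date_between("-2y", "today")}
    for _ in range(3)
]
print(users)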
😎🥹Libraries for convenient data processing
PyGWalker - simplifies the data analysis and visualization workflow in Jupyter Notebook by turning a pandas dataframe into a Tableau-style user interface for visual exploration.
SciencePlots - a collection of matplotlib styles for producing publication- and presentation-quality plots.
CleverCSV - a library that fixes various parsing errors when reading messy CSV files with pandas.
fastparquet - can substantially speed up pandas I/O. It is a high-performance Python implementation of the Parquet format designed to work seamlessly with pandas dataframes, offering fast reads and writes, efficient compression, and support for a wide range of data types.
Feather - a fast binary columnar file format (built on Apache Arrow) for reading and writing dataframes. It is convenient for exchanging data between languages such as Python and R and reads large amounts of data quickly.
Dask - makes it easy to organize parallel computing. Large collections are represented as partitioned arrays and dataframes that you work with through familiar NumPy/pandas APIs (see the sketch after this list).
Ibis - provides a bridge between a local Python environment and remote data stores (for example, Hadoop or SQL engines).
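A minimal sketch of the Dask pattern described above (file paths and column names are placeholders):

import dask.dataframe as dd

# Read many CSV files as one lazy, partitioned dataframe; nothing is loaded yet.
df = dd.read_csv("data/transactions-*.csv")

# Familiar pandas-style operations only build a task graph...
avg_by_user = df.groupby("user_id")["amount"].mean()

# ...and compute() executes the graph in parallel across the partitions.
print(avg_by_user.compute().head())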
🧐What is Data observability: Basic Principles
Data observability is a new layer in the modern data stack that gives data teams visibility into data health and quality. Its goal is to reduce the chance of business decisions being made on incorrect data.
Observability is ensured by the following principles:
Freshness indicates how up to date the data is (see the sketch after this list).
Distribution tells you whether the data falls within the expected range.
Volume covers the completeness of data structures and the state of data sources.
Schema tracks who changes data structures and when.
Lineage maps upstream data sources to downstream data sinks, helping you determine where errors or failures occurred.
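As a tiny illustration of the freshness principle (a sketch with made-up table and column names, not a full observability tool; in practice the timestamps would come from a warehouse query rather than an in-memory dataframe):

import pandas as pd

MAX_LAG = pd.Timedelta(hours=6)  # how stale the table is allowed to get

# Toy stand-in for a warehouse table with an update timestamp column.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2024-05-01 08:00", "2024-05-01 09:30", "2024-05-01 10:15"]),
})

lag = pd.Timestamp.now() - orders["updated_at"].max()
if lag > MAX_LAG:
    print(f"ALERT: 'orders' has not been updated for {lag}")  # freshness violation
else:
    print(f"'orders' is fresh, last updated {lag} ago")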
More about data observability in the source: https://habr.com/ru/companies/otus/articles/559320/
😩Uncertainty in data: common bugs
There is a lot of talk about “data preparation” and “data cleaning” these days, but what separates high quality data from low quality data?
Most machine learning systems today use supervised learning. This means the training data consists of (input, output) pairs, and we want the system to take an input and map it to the correct output. For example, the input might be an audio clip and the output a transcription of the speech. To create such datasets, they must be labeled correctly. If there is uncertainty in the labeling, more data may be needed to reach high model accuracy.
Data collection and annotation may not be correct for the following reasons:
1. Simple annotation errors. The simplest type of error is mislabeling: an annotator, tired after hours of markup, accidentally assigns a sample to the wrong class. Although this is a trivial mistake, it is quite common and can have a large negative impact on the performance of an AI system.
2. Inconsistencies in annotation guidelines. Annotating data items often involves subtle judgment calls. Imagine reading social media posts and annotating whether they are product reviews. The task seems simple, but once you start annotating you realize that "product" is a rather vague concept. Should digital media, such as podcasts or movies, be considered products? One specialist may say yes, another no, and the accuracy of the AI system suffers as a result.
3. Unbalanced data or missing classes. The way data is collected strongly affects the composition of datasets, which in turn can affect model accuracy on specific classes or subsets. In most real-world datasets, the number of examples per category (the class balance) varies greatly, which can reduce accuracy and amplify bias. For example, Google's facial recognition AI was notorious for failing to recognize the faces of people of color, largely because the dataset did not contain sufficiently varied examples (among many other problems). A quick class-balance check is sketched below.
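A quick way to spot the class-imbalance problem before training (a sketch with invented labels; pandas and scikit-learn are assumed to be available):

import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy label column: "fraud" is a rare class, as in many real datasets.
labels = pd.Series(["ok"] * 970 + ["fraud"] * 30)

# 1. Inspect the class balance before training anything.
print(labels.value_counts(normalize=True))

# 2. One common mitigation: compute balanced class weights for the model.
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))  # can be passed to many sklearn models via the class_weight parameter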
😳Dataset published for training an improved alternative to LLaMa
A group of researchers from various organizations and universities (Together, ontocord.ai, ds3lab.inf.ethz.ch, crfm.stanford.edu, hazyresearch.stanford.edu, mila.quebec) is working on an open-source alternative to the LLaMa model and has already published a dataset comparable to the one used to train it. The non-free but well-balanced LLaMa has served as the basis for projects such as Alpaca, Vicuna, and Koala.
RedPajama, a text dataset containing more than 1.2 trillion tokens, is now publicly available. According to the developers, the next step will be training the model itself, which will require serious computing power.
😎TOP-10 DS-events all over the world in May:
May 1-5 - ICLR - International Conference on Learning Representations - Kigali, Rwanda - https://iclr.cc/
May 8-9 - Gartner Data & Analytics Summit - Mumbai, India - https://www.gartner.com/en/conferences/apac/data-analytics-india
May 9-11 - Open Data Science Conference EAST - Boston, USA - https://odsc.com/boston/
May 10-11 - Big Data & AI World - Frankfurt, Germany - https://www.bigdataworldfrankfurt.de/
May 11-12 - Data Innovation Summit 2023 - Stockholm, Sweden - https://datainnovationsummit.com/
May 17-19 - World Data Summit - Amsterdam, The Netherlands - https://worlddatasummit.com/
May 19-22 - The 15th International Conference on Digital Image Processing - Nanjing, China - http://www.icdip.org/
May 23-25 - Software Quality Days 2023 - Munich, Germany - https://www.software-quality-days.com/en/
May 25-26 - The Data Science Conference - Chicago, USA - https://www.thedatascienceconference.com//
May 26-29 - 2023 The 6th International Conference on Artificial Intelligence and Big Data - Chengdu, China - http://icaibd.org/index.html
DS books for beginners
1. Data Science. John Kelleher, Brendan Tierney - the book covers the main aspects of the field, from setting up data collection and analysis to the ethical questions raised by privacy concerns. The authors walk the reader through neural networks and machine learning, illustrate them with case studies of business problems and their solutions, and discuss the technical prerequisites involved.
2. Practical Statistics for Data Scientists. Peter Bruce, Andrew Bruce - a hands-on textbook for data scientists who have programming skills and some familiarity with mathematical statistics. It presents the key statistical concepts of data science in an accessible way and explains which of them matter for data analysis and why.
3. Learning Spark. Holden Karau, Matei Zaharia, Patrick Wendell, Andy Konwinski - the authors are developers of Spark itself. They show how analysis tasks can be expressed in a few lines of code and explain the underlying model through examples.
4. Data Science from Scratch. Joel Grus - the book covers the Python language, elements of linear algebra, mathematical statistics, and methods for collecting, normalizing, and processing data. It also provides a foundation for machine learning, describing mathematical models and how to implement them from scratch.
5. Fundamentals of Data Science and Big Data. Davy Cielen, Arno Meysman, Mohamed Ali - readers are introduced to the theoretical foundations, the machine learning workflow, working with large datasets, NoSQL, and text analysis. The examples are data science scripts in Python.
🤓🧐Data consistency and its types
The concept of data consistency is complex and ambiguous, and its definition may vary depending on the context. In his article, which was translated by the VK Cloud team, the author discusses the concept of "consistency" in the context of distributed databases and offers his own definition of this term. In this article, the author identifies 3 types of data consistency:
1. Consistency in Brewer's (CAP) theorem - according to this theorem, a distributed system can guarantee only two of the following three properties:
Consistency: the system provides an up-to-date version of the data when it is read
Availability: every request to a node that is functioning properly results in a correct response
Partition Tolerance: The system continues to function even if there are network traffic failures between some nodes
2. Consistency in ACID transactions - here, consistency means that a transaction cannot leave the database in an invalid state, because the following properties must hold (see the sketch after this list):
Atomicity: any operation will be performed completely or will not be performed at all
Consistency: after the completion of the transaction, the database is in a correct state
Isolation: when one transaction is executed, all other parallel transactions should not have any effect on it
Durability: even in the event of a failure (of any kind), a committed transaction is preserved
3. Data Consistency Models - This definition of the term also applies to databases and is related to the concept of consistency models. There are two main elements in the database consistency model:
Linearizability - how replicating a single piece of data across multiple nodes affects the way the database processes it
Serializability - how concurrently executing transactions that touch several pieces of data affect the way the database processes them
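To make the ACID properties concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and amounts are invented): the money transfer either applies both updates or neither, so a failure in the middle cannot leave the database in an invalid state.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("crash before the matching credit is written")
        # conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomicity + consistency: the partial debit was rolled back, balances are unchanged.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 50)]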
More details can be found in the source: https://habr.com/ru/companies/vk/articles/723734/
😳😱Sber has published a dataset for emotion recognition in Russian
Dusha is a huge open dataset for recognizing emotions in spoken Russian.
The dataset consists of over 300,000 audio recordings with transcripts and emotion tags, about 350 hours of audio in total. The team chose four main emotions that typically appear in a dialogue with a voice assistant: joy, sadness, anger, and neutral. Since each recording was labeled by several annotators, who also periodically performed control tasks, the resulting markup contains about 1.5 million records (a hypothetical exploration sketch follows below).
Read more about the Dusha dataset in the publication: https://habr.com/ru/companies/sberdevices/articles/715468/
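As a purely hypothetical illustration of how one might explore such a dataset after downloading it: the manifest file name and column names below are assumptions for the sketch, not Dusha's actual layout.
```python
import pandas as pd

# Hypothetical manifest with one row per annotated recording
# (assumed columns: path, transcript, emotion, duration_sec)
manifest = pd.read_csv("dusha_manifest.csv")

# How balanced are the four emotion classes (joy, sadness, anger, neutral)?
print(manifest["emotion"].value_counts())

# Rough total audio duration in hours
print(manifest["duration_sec"].sum() / 3600)
```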
Dataset on GitHub: salute-developers/golos (dusha folder, master branch)
⚔️🤖Pandas vs Datatable: how they compare when working with big data
Pandas and Datatable are two popular libraries for working with data in the Python programming language. However, they differ in ways that matter when choosing one or the other for a specific task.
Pandas is one of the most common and popular data manipulation libraries in Python. It provides a wide and flexible toolkit for working with a variety of data structures, including tables, time series, multidimensional arrays, and more. Pandas also offers many data manipulation operations such as filtering, sorting, grouping, aggregation, and more.
Pandas Benefits:
1. Powerful tools for working with a variety of data structures, including tables, time series, multidimensional arrays, and more.
2. Broad community support, which means bugs are fixed and the library is updated frequently.
3. A rich set of functions for working with data, such as filtering, sorting, grouping, aggregation and more.
4. Fairly extensive documentation
Disadvantages of pandas:
1. Relatively poor performance when working with very large amounts of data.
2. Inconvenient to work with when tables have a very large number of columns.
Datatable is a library designed to improve the performance and efficiency of working with data in Python. It handles data faster than Pandas, especially on large volumes. Datatable provides a syntax very similar to Pandas, which makes it easier to switch from one library to the other.
Advantages of Datatable:
1. High performance when working with large amounts of data.
2. Syntax very similar to Pandas, which makes it easier to switch from one library to another.
Disadvantages of Datatable:
1. More limited functionality than Pandas
2. Limited cross-platform support: some functions in Datatable may behave differently on different platforms, which can cause problems during development and testing.
3. Small community: Datatable is not as widely used as Pandas, so there are relatively few users who can help with questions and issues involving the library.
Thus, the choice between Pandas and Datatable depends on the specific task and its performance requirements. If you need to work with very large amounts of data and need maximum performance, Datatable may be the better choice. If you need a rich set of data structures and operations, Pandas is the better choice. A minimal syntax comparison follows below.
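Here is a minimal sketch of the same operation in both libraries; the column names and values are purely illustrative, and it assumes both packages are installed.
```python
import pandas as pd
import datatable as dt

# Pandas: filter rows, then compute a grouped mean
pdf = pd.DataFrame({"city": ["a", "a", "b"], "price": [10, 20, 30]})
result_pd = pdf[pdf["price"] > 10].groupby("city")["price"].mean()

# datatable: the same operation in its DT[i, j, by] form
DT = dt.Frame({"city": ["a", "a", "b"], "price": [10, 20, 30]})
result_dt = DT[dt.f.price > 10, dt.mean(dt.f.price), dt.by("city")]

print(result_pd)
print(result_dt)
```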
😎🔎Several useful geodata Python libraries
gmaps is a library for working with Google maps. Designed for visualization and interaction with Google geodata.
Leafmap is an open source Python package for creating interactive maps and geospatial analysis. It allows you to analyze and visualize geodata in a few lines of code in the Jupyter environment (Google Colab, Jupyter Notebook and JupyterLab)
ipyleaflet is an interactive widget library that allows you to visualize maps
Folium is a Python library for easy geodata visualization. It provides a Python interface to leaflet.js, one of the best JavaScript libraries for creating interactive maps. This library also allows you to work with GeoJSON and TopoJSON files, create background cartograms with different palettes, customize tooltips and interactive inset maps.
geopandas is a library for working with spatial data in Python. The main object of the GeoPandas module is the GeoDataFrame, which behaves just like a Pandas DataFrame but also includes a geometry column describing each feature (see the short sketch below).
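A minimal sketch of building a GeoDataFrame from plain coordinates; it assumes geopandas and shapely are installed, and the cities and coordinates are illustrative.
```python
import geopandas as gpd
import pandas as pd

df = pd.DataFrame({
    "city": ["Moscow", "Saint Petersburg"],
    "lon": [37.6173, 30.3351],
    "lat": [55.7558, 59.9343],
})

# points_from_xy builds the geometry column from the lon/lat pairs
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",  # WGS 84 latitude/longitude
)

print(gdf.head())    # regular DataFrame columns plus "geometry"
print(gdf.geometry)  # shapely Point objects
```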