Big Data Science
The Big Data Science channel brings together interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🔎📝Datasets for Natural Language Processing
Sentiment analysis - a collection of datasets, each containing the data needed for sentiment analysis of text. For example, the IMDb set is a binary sentiment dataset of 50,000 movie reviews from the Internet Movie Database, each labeled as positive or negative (a minimal loading sketch in Python follows this list).
WikiQA is a collection of question and sentence pairs collected and annotated for research on open-domain question answering. Because it was built with a more natural collection process, it includes questions that have no correct answer sentence, allowing researchers to work on answer triggering, a critical component of any QA system.
Amazon Reviews dataset - several million Amazon customer reviews together with their star ratings, often used to train fastText for consumer sentiment analysis. The idea is that despite the huge volume of data, this is a realistic business task, and the model trains in minutes, which is what sets Amazon Reviews apart from its peers.
Yelp dataset - a collection of businesses, reviews, and user data that can be used for pet projects and academic research. You can also use Yelp to teach students how to work with databases, to learn NLP, or as a sample of production data. The dataset is available as JSON files and is a "classic" in natural language processing.
Text classification - the task of assigning an appropriate category to a sentence or document; the categories depend on the chosen dataset and its topics. For example, TREC is a question classification dataset consisting of fact-based, open-ended questions divided into broad semantic categories. It comes in six-class (TREC-6) and fifty-class (TREC-50) versions, both with 5,452 training and 500 test examples.
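A minimal sketch of loading the IMDb sentiment set mentioned above, assuming the Hugging Face `datasets` library is installed (pip install datasets); the "imdb" identifier refers to the public Hugging Face copy of the dataset:
```python
from datasets import load_dataset

# Load the binary IMDb sentiment dataset (train / test / unsupervised splits)
imdb = load_dataset("imdb")

sample = imdb["train"][0]
print(sample["text"][:200])        # first 200 characters of the review
print("label:", sample["label"])   # 0 = negative, 1 = positive
```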
💥📖TOP DS-events all over the world in July:
Jul 7-9
- 2023 IEEE 6th International Conference on Big Data and Artificial Intelligence (BDAI) - Zhejiang, China - http://www.bdai.net/
Jul 11-13 - International Conference on Data Science, Technology and Applications (DATA) - Rome, Italy - https://data.scitevents.org/
Jul 12-16 - ICDM 2023: 23rd Industrial Conference on Data Mining - New York, NY, USA - https://www.data-mining-forum.de/icdm2023.php
Jul 14-16 - 6th International Conference on Sustainable Sciences and Technology - Istanbul, Turkey - https://icsusat.net/home
Jul 15-19 - MLDM 2023: 19th Int. Conf. on Machine Learning and Data Mining - New York, NY, USA - http://www.mldm.de/mldm2023.php
Jul 21-23 - 2023 7th International Conference on Artificial Intelligence and Virtual Reality (AIVR2023) - Kumamoto, Japan - http://aivr.org/
Jul 23-29 - ICML - International Conference on Machine Learning - Honolulu, Hawaii, USA - https://icml.cc/
Jul 27-29 - 7th International Conference on Deep Learning Technologies - Dalian, China - http://www.icdlt.org/
Jul 31-Aug 1 - Gartner Data & Analytics Summit - Sydney, Australia - https://www.gartner.com/en/conferences/apac/data-analytics-australia
😎💥YouTube-ASL repository has been made available to the public
This repository provides information about YouTube-ASL, a large-scale open dataset of videos of American Sign Language with English captions.
The dataset includes 11,093 American Sign Language (ASL) videos with a total length of 984 hours of footage, along with 610,193 English captions.
The repository contains a link to a text file with the IDs of the videos.
Repository: https://github.com/google-research/google-research/tree/master/youtube_asl
📊⚡️Open source data generators
Benerator - a software solution for generating test data for testing and for training machine learning models
DataFactory - a project that makes it easy to generate test data for populating a database as well as for testing AI models
MockNeat - provides a simple API that lets developers programmatically generate data in JSON, XML, CSV, and SQL formats.
Spawner - a data generator for various databases and AI models. It supports many field types, including fields configured manually by the user
📝💡A few tips for preparing datasets for video analytics
1. Do not set overly strict criteria for including data in the set: provide a variety of frames covering different situations. The more diverse the dataset, the better the model will generalize the essence of the detected object.
2. Plan testing under realistic conditions: prepare a list of sites and contacts where you can arrange to shoot footage.
3. Annotate the data: when preparing data for video analytics, it is useful to annotate the events in the video. This helps to recognize and classify objects more accurately.
4. Time synchronization: make sure all cameras are time-synchronized. This helps to reconstruct the sequence of events and to link actions captured by different cameras.
5. Divide videos into segments: if you are working with large video files, divide them into smaller segments. This simplifies data processing and analysis and improves system performance (a splitting sketch in Python follows this list).
6. Video metadata: create metadata for the videos, including timestamps, location information, and other details of the recording. This helps with organizing and searching video files and events during subsequent analysis.
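A hedged sketch of the segment splitting from tip 5, using OpenCV; the file name and the 60-second segment length are illustrative assumptions, not values from the post:
```python
import cv2

def split_video(path: str, segment_seconds: int = 60) -> None:
    """Cut a long recording into fixed-length segments (segment_000.mp4, ...)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    frames_per_segment = int(fps * segment_seconds)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")

    segment, frame_idx, writer = 0, 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frames_per_segment == 0:   # start a new output file
            if writer is not None:
                writer.release()
            writer = cv2.VideoWriter(f"segment_{segment:03d}.mp4", fourcc, fps, size)
            segment += 1
        writer.write(frame)
        frame_idx += 1

    if writer is not None:
        writer.release()
    cap.release()

split_video("camera_01.mp4")  # hypothetical input file
```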
😎📊Visualization no longer requires coding
Flourish Studio is a tool for creating interactive data visualizations without coding
With this tool you can create dynamic charts, graphs, maps, and other visual elements.
Flourish Studio provides an extensive selection of ready-made templates and animations, as well as an easy-to-use visual editor, so it is easy to pick up and pleasant to work with.
Cost: #free (no paid plans).
📝💡What is CDC: advantages and disadvantages
CDC (Change Data Capture)
is a technology for tracking and capturing data changes occurring in a data source, which allows you to efficiently replicate or synchronize data between different systems without the need to completely transfer the entire database. The main goal of CDC is to identify and retrieve only data changes that have occurred since the last capture. This makes the data replication process faster, more efficient, and more scalable.
CDC Benefits:
1. Efficiency of data replication:
CDC allows only changed data to be sent, which significantly reduces the amount of data required for synchronization between the data source and the target system. This reduces network load and speeds up the replication process.
2. Scalability: CDC technology is highly scalable and can handle large amounts of data and high load.
3. Improved reliability: CDC improves the reliability of the replication system by minimizing the chance of errors during data transmission and delivery.
Disadvantages of CDC:
1. Additional complexity:
CDC implementation requires additional configuration and infrastructure, which can increase the complexity of the system and expose it to additional risks of failure.
2. Dependency on the data source: CDC depends on the ability of the data source to capture and provide changed data. If the source does not support CDC, this may be a barrier to its implementation.
3. Data schema conflicts: When synchronizing between systems with different data schemas, conflicts can occur that require additional processing and resolution.
Thus, CDC is a powerful tool for efficient data management and information replication between different systems. However, its successful implementation requires careful planning, tuning and testing to minimize potential risks and ensure reliable system operation.
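A minimal sketch of the idea, assuming a simple query-based CDC (production tools such as Debezium usually read the database transaction log instead); the table and column names are illustrative:
```python
import sqlite3

def capture_changes(conn: sqlite3.Connection, last_watermark: str):
    """Return rows changed since last_watermark and the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Each run ships only the delta instead of re-copying the whole table:
# changed, watermark = capture_changes(source_conn, watermark)
# apply_to_target(changed)   # hypothetical replication step
```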
💥🌎TOP DS-events all over the world in August
Aug 3-4
- ICCDS 2023 - Amsterdam, Netherlands - https://waset.org/cheminformatics-and-data-science-conference-in-august-2023-in-amsterdam
Aug 4-6 - 4th International Conference on Natural Language Processing and Artificial Intelligence - Urumqi, China - http://www.nlpai.org/
Aug 7-9 - Ai4 2023 - Las Vegas, USA - https://ai4.io/usa/
Aug 8-9 - Technology in Government Summit 2023 - Canberra, Australia - https://www.terrapinn.com/conference/technology-in-government/index.stm
Aug 8-9 - CDAO Chicago - Chicago, USA - https://da-metro-chicago.coriniumintelligence.com/
Aug 10-11 - ICSADS 2023 - New York, USA - https://waset.org/sports-analytics-and-data-science-conference-in-august-2023-in-new-york
Aug 17-19 - 7th International Conference on Cloud and Big Data Computing - Manchester, UK - http://www.iccbdc.org/
Aug 19-20 - 4th International Conference on Data Science and Cloud Computing - Chennai, India - https://cse2023.org/dscc/index
Aug 20-24 - INTERSPEECH - Dublin, Ireland - https://www.interspeech2023.org/
Aug 22-25 - International Conference On Methods and Models In Automation and Robotics 2023 - Międzyzdroje, Poland - http://mmar.edu.pl/
📝🔎Problems and solutions of text data markup
Text data markup
is an important task in machine learning and natural language processing. However, it comes with a number of problems that can make the process difficult and time-consuming. Some of these problems and possible solutions are listed below:
1. Subjectivity and ambiguity: Text markup can be subjective and ambiguous, since different people may interpret the same content differently. This can lead to inconsistencies between annotations.
Solution: To reduce subjectivity, provide annotators with clear labeling instructions and rules. Discussing and reviewing results among annotators can also help identify and resolve ambiguities.
2. High cost and time consuming: Labeling text data can be costly and time consuming, especially when working with large datasets.
Solution: Using automatic labeling and machine learning methods for the initial pass can significantly reduce the amount of human work. It is also worth considering crowdsourcing platforms to attract more annotators and speed up the process.
3. Lack of standards and formats: There is no single standard for markup of textual data, and different projects may use different markup formats.
Solution: Define standards and formats for labeling data in your project. Follow common standards such as XML, JSON, or IOB (Inside-Outside-Beginning) to ensure compatibility and easy interoperability with other tools and libraries (a short IOB example follows this post).
4. Lack of trained annotators: Labeling textual data may require expert knowledge or experience in a particular subject area, and it is not always possible to find annotators with the necessary competence.
Solution: Provide annotators with training materials and access to resources that help them better understand the context and specifics of the task. You can also consider training annotators within the team to improve labeling quality.
5. Heterogeneity and imbalance in data: In some cases, labeling can be heterogeneous or unbalanced, which can affect the quality of training models.
Solution: Make an effort to balance the data and eliminate heterogeneity. This may include collecting additional data for smaller classes or applying data augmentation techniques.
6. Annotator overfitting: Annotators can adapt to the training dataset, which leads to overfitting and poor-quality labeling of new data.
Solution: Regularly monitor labeling quality and provide feedback to annotators. Use cross-validation to check the stability and consistency of the annotations.

Thus, successful markup of textual data requires attention to detail, careful planning, and constant quality control. A combination of automatic and manual labeling methods can greatly improve the process and provide high quality data for model training.
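A minimal illustration of the IOB (Inside-Outside-Beginning) scheme mentioned in point 3; the sentence and entity types are made-up examples:
```python
# Each token is tagged as Beginning, Inside, or Outside of an entity span.
tokens = ["Apple", "opened", "an", "office", "in", "San",   "Francisco", "."]
tags   = ["B-ORG", "O",      "O",  "O",      "O",  "B-LOC", "I-LOC",     "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```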
💥ConvertCSV is a universal tool for working with CSV
ConvertCSV is an excellent solution for processing and converting CSV and TSV files into various formats, including JSON, PDF, SQL, XML, HTML, and others.
It is important to note that all data processing takes place locally on your computer, which guarantees the security of user data. The service also provides support for Excel, as well as command-line tools and desktop applications.
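Not ConvertCSV itself, just a hedged sketch of the same kind of CSV-to-JSON conversion done locally with pandas; the file names are illustrative:
```python
import pandas as pd

# Read a CSV file and write it back out as a JSON array of records
df = pd.read_csv("input.csv")
df.to_json("output.json", orient="records", indent=2)
```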
📝🔎Data Observability: advantages and disadvantages
Data Observability
is the concept and practice of providing transparency, control and understanding of data in information systems and analytical processes. It aims to ensure that data is accessible, accurate, up-to-date, and understandable to everyone who interacts with it, from analysts and engineers to business users.
Benefits of Data Observability:
1. Quickly identify and fix problems:
Data Observability helps you quickly find and fix errors and problems in your data. This is especially important in cases where even small failures can lead to serious errors in analytical conclusions.
2. Improve the quality of analytics: Through data control, analysts can be confident in the accuracy and reliability of their work results. This contributes to making more informed decisions.
3. Improve Collaboration: Data Observability creates a common language and understanding of data across teams ranging from engineers to business users. This contributes to better cooperation and a more efficient exchange of information.
4. Risk Mitigation: By ensuring data reliability, Data Observability helps to mitigate the risks associated with bad decisions based on inaccurate or incorrect data.
Disadvantages of Data Observability:
1. Complexity of implementation:
Implementing a Data Observability system can be complex and require time and effort. It may also call for changes to the data architecture and for new tools to be added.
2. Costs: Implementing and maintaining a data observability system can be costly. This includes both the financial costs of tools and the costs of training and supporting staff.
3. Difficulty of scaling: As the volume of data and system complexity grows, it can be difficult to scale the data observability system.
4. Difficulty in training staff: Staff will need to learn new tools and practices, which may require time and training.

In general, Data Observability plays an important role in ensuring the reliability and quality of data, but its implementation requires careful planning and balancing between benefits and costs.
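A hedged, minimal example of the kind of checks a Data Observability setup automates (volume, freshness, completeness), here on a table loaded into pandas; the column name is an illustrative assumption:
```python
import pandas as pd

def basic_observability_checks(df: pd.DataFrame, timestamp_col: str = "updated_at") -> dict:
    """Compute simple health metrics for a dataset."""
    return {
        "row_count": len(df),                                       # volume
        "latest_update": pd.to_datetime(df[timestamp_col]).max(),   # freshness
        "null_rate": df.isna().mean().to_dict(),                    # completeness per column
    }

# Usage: alert when latest_update falls behind an SLA or a null rate jumps.
```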
⚔️⚡️Comparison of Spark Dataframe and Pandas Dataframe: advantages and disadvantages
Dataframes
are structured data objects that allow you to analyze and manipulate large amounts of information. The two most popular dataframe tools are Spark Dataframe and Pandas Dataframe.
Pandas is a data analysis library in the Python programming language. Pandas Dataframe provides an easy and intuitive way to analyze and manipulate tabular data.
Benefits of Pandas Dataframe:
1. Ease of use:
Pandas offers an intuitive and easy to use interface for data analysis. It allows you to quickly load, filter, transform and aggregate data.
2. Rich integration with the Python ecosystem: Pandas integrates well with other Python libraries such as NumPy, Matplotlib and Scikit-Learn, making it a handy tool for data analysis and model building.
3. Time series support: Pandas provides excellent tools for working with time series, including functions for resampling, time windows, data alignment and aggregation.
Disadvantages of Pandas Dataframe:
1. Limited scalability:
Pandas runs on a single thread and may experience performance limitations when working with large amounts of data.
2. Memory: Pandas requires the entire dataset to be loaded into memory, which can be a problem when working with very large tables.
3. Not suitable for distributed computing: Pandas is not designed for distributed computing on server clusters and does not provide automatic scaling.

Apache Spark is a distributed computing platform designed to efficiently process large amounts of data. Spark Dataframe is a data abstraction that provides a similar interface to Pandas Dataframe, but with some critical differences.
Benefits of Spark Dataframe:
1. Scalability:
Spark Dataframe provides distributed computing, which allows you to efficiently process large amounts of data on server clusters.
2. In-memory computing: Spark Dataframe supports in-memory operations, which can significantly speed up queries and data manipulation.
3. Language Support: Spark Dataframe supports multiple programming languages including Scala, Java, Python, and R.
Disadvantages of Spark Dataframe:
1. Slightly slower performance for small amounts of data:
Due to the overhead of distributed computing, Spark Dataframe may show slightly slower performance when processing small amounts of data compared to Pandas.
2. Memory overhead: Due to its distributed nature, Spark Dataframe requires more RAM compared to Pandas Dataframe, which may require more powerful data processing servers.
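A side-by-side sketch of the same aggregation in both APIs, assuming pandas and pyspark are installed and that a file sales.csv with columns region and amount exists (illustrative names):
```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Pandas: single process, whole file loaded into memory
pdf = pd.read_csv("sales.csv")
print(pdf.groupby("region")["amount"].sum())

# Spark: lazy, distributed execution across local cores or a cluster
spark = SparkSession.builder.appName("compare").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
sdf.groupBy("region").agg(F.sum("amount").alias("total")).show()
```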
🔉💥Opened access to more than 1.5 TB of labeled audio datasets
At Wonder Technologies, developers have spent a significant amount of time building deep learning systems that perceive the world through audio signals. From applying sound-based deep learning to teaching computers to recognize emotions in audio, the company has used a wide range of data to create APIs that function effectively even in extreme audio environments. According to the developers, the site provides a list of datasets that proved very useful in the course of their research and were used to improve the performance of sound models in real-world scenarios.
🔎📝 DBT framework: advantages and disadvantages
DBT (Data Build Tool) is an open-source framework that facilitates preparing and processing data before analytical queries. DBT was developed with modern analytical practices in mind and is focused on working with data in data warehouses and analytical databases. The main goal of DBT is to provide practical tools for managing and transforming data when preparing an analytical environment.
DBT Benefits:
1. Modularity and manageability: DBT allows you to create data modules that can be easily reused and extended. This makes it easier to manage and maintain the analytical infrastructure.
2. Versioning and change management: DBT supports code and documentation versioning, which makes change management and collaboration more efficient.
3. Automation of the ETL process: DBT provides tools to automate the transformation step of the extract-transform-load (ETL) process. This saves time and effort on routine tasks.
4. Dependency Tracking: DBT automatically manages dependencies between different pieces of data, making it easy to update and maintain the consistency of the analytics environment.
5. Use of SQL: DBT uses SQL to describe data transformations, making it accessible to analysts and developers.

DBT Disadvantages:
1. Limitations of complex calculations:
DBT is more suitable for data transformation and preparation than for complex calculations or machine learning.
2. Complexity for Large Projects: Large projects with large amounts of data and complex dependencies between tables may require additional configuration and management.
3. Complexity of implementation: DBT implementation can take time and resources to set up and train employees.

Overall Conclusion: DBT is a powerful tool for managing and preparing data in an analytics environment. It provides many benefits, but may be less suitable for complex calculations and large projects. Before using DBT, it is recommended to carefully study its functionality and adapt it to the specific needs of the organization.
💥😎A dataset for video segmentation using motion expressions has recently appeared online
MeViS is a large-scale motion expression-driven video segmentation dataset that focuses on object segmentation in video content based on a sentence that describes the motion of objects. The dataset contains many motion expressions to indicate targets in complex environments.
📚⚡️Vertical scaling: advantages and disadvantages
Vertical data scaling
is a database scaling approach in which performance gains are achieved by adding resources (e.g., CPUs, memory) to existing system nodes. This approach focuses on improving system performance by increasing the resources within each node rather than adding new nodes.
Benefits of vertical data scaling:
1. Improved performance:
Adding resources to existing nodes allows you to process more data and requests on a single server. This can result in improved responsiveness and overall system performance.
2. Ease of Management: Compared to scaling out (adding new servers), scaling up is less difficult to manage because it doesn't require the same degree of configuration and synchronization across different nodes.
3. Infrastructure Cost Savings: Adding resources to existing servers can be more cost effective than buying and maintaining additional servers.

Disadvantages of vertical data scaling:
1. Resource limit:
Vertical scaling has limits determined by the maximum performance and resources of an individual node. Sooner or later, you can reach a point where further increase in resources will not lead to a significant improvement in performance.
2. Single-point-of-failure: If the node to which resources are being added goes down, this can lead to serious problems with the availability of the entire system. In horizontal scaling, the loss of one node does not have such a significant impact.
3. Limited scalability: As the load grows, more and more resources may need to be added, which can eventually become inefficient or costly.
4. Bottleneck resources: With vertical scaling, the resource that forms the biggest bottleneck can only be increased up to a certain limit. If this is the resource that limits performance, additional resources elsewhere may be spent inefficiently.
🌎TOP DS-events all over the world in September
Sep 12-13
- Chief Data & Analytics Officers, Brazil - São Paulo, Brazil - https://cdao-brazil.coriniumintelligence.com/
Sep 12-14 - The EDGE AI Summit - Santa Clara, USA - https://edgeaisummit.com/events/edge-ai-summit
Sep 13-14 - DSS Hybrid Miami: AI & ML in the Enterprise - Miami, FL, USA & Virtual - https://www.datascience.salon/miami/
Sep 13-14 - Deep Learning Summit - London, UK - https://london-dl.re-work.co/
Sep 15-17 - International Conference on Smart Cities and Smart Grid (CSCSG 2023) – Changsha, China - https://www.icscsg.org/
Sep 18-22 - RecSys – ACM Conference on Recommender Systems – Singapore, Singapore - https://recsys.acm.org/recsys23/
Sep 21-23 - 3rd World Tech Summit on Big Data, Data Science & Machine Learning – Austin, USA - https://datascience-machinelearning.averconferences.com/
🔎🤔📝A little about unstructured data: advantages and disadvantages
Unstructured data
is information that does not have a clear organization or format, which distinguishes it from structured data such as databases and tables. Such data can come in a variety of formats, including text documents, images, videos, and audio recordings. Examples of unstructured data are emails, social media messages, photographs, transcripts of conversations, and so on.
Benefits of unstructured data:
1. More information:
Unstructured data can contain valuable information that cannot be presented in a structured form. This may include nuance, context, and emotional aspects that may be missing from structured data.
2. Realistic representation: Unstructured data can reflect the real world and people's natural behavior. It lets you capture complex interactions and signals that would be lost in simplified structured data.
3. Innovation and research: Unstructured data provides a huge opportunity for innovation and research. The analysis of such data can lead to the discovery of new patterns, connections and insights.

Disadvantages of unstructured data:
1. Complexity of processing:
Due to the lack of a clear structure, the processing of unstructured data can be complex and require the use of specialized methods and tools.
2. Difficulties in analysis: Extracting meaning from unstructured data can be harder than from structured data; algorithms and models must be developed to interpret the information efficiently.
3. Privacy and Security Issues: Unstructured data may contain sensitive information and may be more difficult to manage in terms of security and privacy.

Thus, the need to work with unstructured data depends on the specific tasks and goals of the organization. In some cases, the analysis and use of unstructured data can lead to valuable insights and benefits, while in other situations it may be less useful or even redundant.
🔎📖📝A little about structured data: advantages and disadvantages
Structured data
is information organized in a specific form, where each element has well-defined properties and values. This data is usually presented in tables, databases, or other formats that provide an organized and easy-to-read presentation.
Benefits of structured data:
1. Easy Organization and Processing:
Structured data has a clear and organized structure, making it easy to organize and process. This allows you to quickly search, sort and analyze information.
2. Easy to store: Structured data is easy to store in databases, Excel spreadsheets, or other specialized data storage systems. This provides structured data with high availability and persistence.
3. High accuracy: Structured data is usually subject to quality control and validation, which helps to minimize errors and inaccuracies.

Disadvantages of structured data:
1. Limited types of information:
Structured data works well for storing and processing data with a clear structure, but may be inefficient for storing information that does not lend itself to rigid structuring, such as text, images, or audio.
2. Dependency on predefined structure: Working with structured data requires a well-defined schema or data structure. This limits their applicability in cases where the data structure can change dynamically.
3. Difficulty of Integration: Combining data from different sources with different structures can be a complex task that requires a lot of time and effort.
4. Inefficient for some types of tasks: For some types of data analysis and processing, especially those related to unstructured information, structured data may be inefficient or even inapplicable.
⚡️🔥😎Tools for image annotation in 2023
V7 Labs is a tool for creating accurate, high-quality datasets for machine learning and computer vision projects. Its wide range of annotation features allows it to be used in many different areas.
Labelbox is the most powerful vector labeling tool targeting simplicity, speed, and a variety of use cases. It can be set up in minutes, scale to any team size, and iterate quickly to create accurate training data.
Scale - With this annotation tool, users can add scales or rulers to help determine the size of objects in an image. This is especially useful when studying photographs of complex structures, such as microscopic organisms or geological formations.
SuperAnnotate is a powerful annotation application that allows users to quickly and accurately annotate photos and videos. It is intended for computer vision development teams, AI researchers, and data scientists who annotate computer vision models. In addition, SuperAnnotate has quality control tools such as automatic screening and consensus checking to ensure high-quality annotations.
Scalabel - Helps users improve accuracy with automated annotations. It focuses on scalability, adaptability and ease of use. Scalabel's support for collaboration and version control allows multiple users to work on the same project simultaneously.
👱‍♂️⚡️The DeepFakeFace dataset has become publicly available
DeepFakeFace(DFF)
is a dataset that serves as the basis for training and testing algorithms designed to detect deepfakes. This dataset is created using various advanced diffusion models.
The authors report that they analyzed the DFF dataset and proposed two evaluation methods to assess the effectiveness and adaptability of deepfake recognition tools.
The first method tests whether an algorithm trained on one type of fake image can recognize images generated by other methods.
The second method evaluates the algorithm's performance on non-ideal images, such as blurry, low-quality, or compressed images.
Given the varying results of these methods, the authors highlight the need for more advanced deepfake detectors.


🧐 HF: https://huggingface.co/datasets/OpenRL/DeepFakeFace

🖥 Github: https://github.com/OpenRL-Lab/DeepFakeFace

📕 Paper: https://arxiv.org/abs/2309.02218
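A hedged sketch of fetching the dataset files from the Hugging Face Hub with huggingface_hub; the repo id comes from the link above, everything else is illustrative:
```python
from huggingface_hub import snapshot_download

# Download the full dataset repository to the local cache and print its path
local_dir = snapshot_download(repo_id="OpenRL/DeepFakeFace", repo_type="dataset")
print("DeepFakeFace downloaded to:", local_dir)
```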