Big Data Science – Telegram
The Big Data Science channel gathers interesting facts and news about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
💡SQL vs NoSQL: advantages and disadvantages
SQL and NoSQL are the two main approaches to data storage and processing. Each has its own advantages and disadvantages, and the choice between them depends on the specific needs of the project. Let's look at the main differences between them.
SQL (Structured Query Language) databases are relational databases, as well as SQL-on-Hadoop engines from the Hadoop ecosystem, that store data in structured tables.
Benefits of SQL:
1. Data structure: Tables, relationships and schemas make data easy to understand and manage.
2. ACID compliance: SQL databases follow the ACID principles (atomicity, consistency, isolation, durability), which makes transactions reliable.
3. Universal query language: SQL provides a rich and versatile set of tools for complex queries and analytics; individual DBMSs deviate from it only slightly in the form of SQL dialects.
Disadvantages of SQL:
1. Horizontal scaling: Traditional SQL databases often face scaling limitations when dealing with large volumes of data.
2. Schema Complexity: Changing the data schema can be a complex and costly process.
3. Limited flexibility: the rigid types and table structures of SQL databases are a poor fit for some kinds of data.
NoSQL databases, on the other hand, do not use traditional tables but instead offer flexible data models.
Advantages of NoSQL:
1. Data structure flexibility: the data model in NoSQL databases can evolve easily without rebuilding a schema.
2. Horizontal scaling: Many NoSQL databases easily scale horizontally, providing high performance for large volumes of data.
3. Support for unstructured data: NoSQL databases are well suited for storing and processing unstructured data such as text, images and videos.
Disadvantages of NoSQL:
1. Lack of ACID support: Many NoSQL databases sacrifice ACID consistency for performance or flexibility.
2. Consistency Difficulty: When scaling and distributing data, NoSQL databases can face challenges maintaining data consistency.
Thus, depending on the project requirements and priorities for performance, scalability and flexibility, the choice between SQL and NoSQL databases may be different.
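To make the contrast concrete, here is a minimal sketch in Python: the relational side uses the built-in sqlite3 module, while the document side imitates a NoSQL store with a plain list of dictionaries. The table, field names and values are purely illustrative, and a real document database (for example MongoDB) would run the filter server-side.
import sqlite3

# Relational (SQL) side: a fixed schema and a declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "Alice", 34), (2, "Bob", 27)])
adults = conn.execute("SELECT name FROM users WHERE age > 30").fetchall()

# Document (NoSQL-style) side: schema-free records, extra fields are allowed.
users = [
    {"id": 1, "name": "Alice", "age": 34, "roles": ["admin"]},
    {"id": 2, "name": "Bob", "age": 27},
]
adult_names = [u["name"] for u in users if u.get("age", 0) > 30]
print(adults, adult_names)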
💡Dataset for detecting problems in code
SWE-bench is a dataset designed to provide a diverse set of codebase issues that can be verified with unit tests in the corresponding repositories. The full SWE-bench split includes 2,294 issue-commit pairs across 12 Python repositories.
Thus, the dataset poses a new task: resolving an issue given the full repository and the GitHub issue text.
To load the dataset using a Python script, you can use the following command:
from datasets import load_dataset
dataset = load_dataset("princeton-nlp/SWE-bench")
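A quick way to check what was downloaded is sketched below; the split and field names are defined by the dataset authors on the Hub and may change, so treat this as an assumption.
print(dataset)                  # available splits and row counts
example = dataset["test"][0]    # assumes a "test" split is published
print(example.keys())           # fields of a single issue-commit pair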
📊😎💡🤖Dataset catalog for object detection and segmentation
SAM + Optical Flow = FlowSAM
FlowSAM is a new tool for detecting and segmenting moving objects in video that significantly outperforms all previous models, both for single objects and for multiple objects.
The many datasets used to train the model are available for download at this link.
📉📊Selection of tools for working with Big Data
Drill layers on top of multiple data sources, allowing users to query a wide range of information in a variety of formats, from Hadoop sequence files and server logs to NoSQL databases and cloud-based object stores.
Druid (https://druid.apache.org/) is a real-time analytics database that provides low query latency, high concurrency, multi-user capabilities, and instant visibility into streaming data. According to its proponents, multiple end users can simultaneously query data stored in Druid without any performance impact (see the query sketch after this list).
HPCC Systems is a big data platform developed by LexisNexis and open sourced in 2011. In accordance with its full name - High-Performance Computing Cluster - the technology is essentially a cluster of computers created on the basis of standard hardware for processing, managing and delivering big data.
Iceberg is an open table format used to manage data in data lakes, achieved in part by tracking individual data files in tables rather than directories. Created by Netflix for its petabyte-scale tables, Iceberg is now an Apache project. According to its backers, Iceberg is typically "used in production, where a single table can contain tens of petabytes of data."
Kylin is a distributed data warehouse and analytics platform for big data. It provides an online analytical processing (OLAP) engine designed to work with very large data sets. Because Kylin is built on top of other Apache technologies, including Hadoop, Hive, Parquet and Spark, its proponents say it can easily scale to handle large volumes of data.
Samza is a distributed stream processing system created by LinkedIn and is currently an open source project managed by Apache. The system can run on top of Hadoop YARN or Kubernetes, and a standalone deployment option is also offered. According to the developers, Samza can process "several terabytes" of data state information with low latency and high throughput for fast analysis.
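As an illustration of how such tools are queried, here is a minimal sketch that sends a SQL statement to Druid's HTTP SQL endpoint (/druid/v2/sql) with the requests library. The host, port and the "events" datasource name are assumptions for a local test deployment.
import requests

# Druid exposes a SQL API at /druid/v2/sql (here via a router on localhost:8888).
response = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": "SELECT COUNT(*) AS cnt FROM events WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR"},
)
print(response.json())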
💡Selection of ETL services for Big Data
Renta Marketing ETL is a cloud solution that integrates 28 enterprise data sources with popular data warehouses like Snowflake and BigQuery. The service lets a team of engineers and analysts connect third-party tools and create data pipelines in a couple of minutes without writing code; for example, you can set up a Facebook Ads integration with BigQuery in four clicks. There is no need to involve developers to work with Renta Marketing ETL.
Fivetran is cloud-based software that allows users to quickly and easily create pipelines. The platform supports more than 90 sources. Fivetran provides a set of ready-made integrations, so even novice developers can get up to speed with the service.
Hevo Data - the service provides users with more than 150 ready-made integrations. You can set up integrations in three simple steps. The result is a pipeline that copies data to storage and requires no maintenance.
Matillion is a low-code application for creating pipelines. With Matillion, teams can create pipelines and automate data processing. The service has a simple interface, so even a user far from programming can create and modify pipelines. Matillion supports real-time processing, works with popular data sources, and makes it easy to identify and resolve data problems.
Supermetrics is an ETL solution designed for small businesses and marketers who primarily use Facebook Ads, Google Ads and Google Analytics. The tool has a built-in application on the Google Cloud Platform that allows you to export data directly to Google BigQuery.
🌎TOP DS-events all over the world in May
May 4 - SQL Saturday - Jacksonville, USA - https://sqlsaturday.com/2024-05-04-sqlsaturday1068/
May 7-9 - Real-Time Analytics Summit - San Jose, USA - https://www.rtasummit.com/
May 8 - Data Connect West - Portland, USA - https://www.dataconnectconf.com/dccwest/conference
May 8-9 - UNLEASH THE POWER OF YOUR DATA - Boston, USA - https://www.dbta.com/DataSummit/2024/default.aspx
May 8-9 - Data Innovation Summit - Dubai, UAE - https://mea.datainnovationsummit.com/
May 9 - Conversational AI Innovation Summit - San Francisco, USA - https://confx-conversationalai.com/
May 15-17 - World Data Summit - Amsterdam, The Netherlands - https://worlddatasummit.com/#up
May 16 - Spatial Data Science Conference 2024 - London, UK - https://spatial-data-science-conference.com/2024/london
May 18 - DSF MAYDAY - London, UK - https://datasciencefestival.com/event/mayday-2024/
May 21 - Deployment, Utilization & Optimization of Enterprise Generative AI - Silicon Valley, USA - https://ent-gen-ai-summit-west.com/events/enterprise-generative-ai-summit-west-coast
May 23-24 - The Data Science Conference - Chicago, USA - https://www.thedatascienceconference.com/
💡😎A great resource to add to the collection: datasets for LLMs
Many examples (including Llama-3 and Phi-3) show that LLM development largely comes down to creating quality datasets.
A developer from London has catalogued in this repository a huge number of datasets for pre-training or fine-tuning LLMs, in table format: reference, size, authors, date and personal notes.
There are also instructions on how to build your own quality dataset, and what the word "quality" means in the context of a dataset.
☁️💡Dataset for studying direct air capture
Researchers from Georgia Tech have published the largest dataset and a new SOTA model for studying direct air capture, a key process in combating climate change.
This dataset contains an in-domain test set and 4 out-of-domain test sets (ood-large, ood-linker, ood-topology and ood-linker & topology). All LMD databases are compressed into one .tar.gz file.
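To unpack the downloaded archive you can use Python's standard tarfile module; the file name below is a placeholder, substitute the actual .tar.gz you downloaded from the authors' page.
import tarfile

# Extract all LMD databases from the single .tar.gz archive into a local folder.
with tarfile.open("lmd_datasets.tar.gz", "r:gz") as archive:
    archive.extractall(path="lmd_data")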
💡📊Service for fast transfer of Big Data
Redpanda is a data streaming platform that is fully compatible with the Kafka API, claims up to 10x better performance, and requires neither ZooKeeper nor a JVM.
Redpanda is designed to fully utilize fast big data storage devices such as SSDs or NVMe devices, and to take advantage of multi-core processors and computers with large amounts of RAM. This maximizes performance when processing significant amounts of data and queries.
Documentation is available at this link
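Because Redpanda speaks the Kafka protocol, standard Kafka clients work unchanged. Below is a minimal produce-and-consume sketch with the kafka-python package; the broker address (localhost:9092) and the "events" topic are assumptions for a local test setup.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"sensor": 1, "value": 42}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)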
🔎Extract data with Quivr
Quivr is an open-source service that allows you to extract information from local files (PDF, CSV, Excel, Word, audio, video, etc.)
Quivr can work offline, so you can always access your data anytime, anywhere.
Quivr is also compatible with Ubuntu 22 or later
The open source code can be obtained from this link
⚖️Apache Superset: advantages and disadvantages
Apache Superset is an open source data visualization tool that provides a rich set of capabilities for analyzing data and creating interactive dashboards.
Benefits of Apache Superset:
1. Open Source: Apache Superset is developed and maintained by the community, which provides a high degree of flexibility and extensibility to suit different needs.
2. Powerful Data Visualization: Superset offers a wide selection of graphs, charts and visuals, allowing users to create colorful and informative dashboards for data analysis.
3. Interactive capabilities: Users can easily interact with dashboards, apply filters, change parameters and drill down/expand data to gain a deeper understanding of the information.
4. Integration with various data sources: Superset supports multiple data sources including databases, data warehouses, Apache Druid and many more, making it a versatile tool for working with data from various sources.
5. Scalability and Performance: Thanks to its architecture and the use of technologies such as Apache Druid, Superset is able to efficiently process large amounts of data and provide high performance when working with dashboards.
Disadvantages of Apache Superset:
1. Difficulty in Setup: Although Superset provides extensive capabilities, its setup and configuration can be complex, especially for beginners, requiring a certain level of technical expertise.
2. Insufficient Documentation: Some users have noted that Superset's documentation is not always detailed or up-to-date enough, which can make it difficult to learn and use the tool.
Overall, Apache Superset is a powerful open source data visualization tool with advantages such as flexibility, scalability, and strong visualization capabilities. However, before adopting it, you should also weigh the disadvantages, such as the complexity of setup and the gaps in its documentation.
😎Selection of vector databases
Vector databases are a special type of database designed to organize data based on similarity. To do this, they transform raw data—such as images, text, video, or audio—into mathematical representations known as multidimensional vectors. Each vector can have from tens to thousands of dimensions, depending on the complexity of the source data. At the moment there are such vector databases as:
Chroma is an open source vector database designed to provide developers and organizations of all sizes with the resources they need to build Large Language Model (LLM) based applications. It provides developers with a highly scalable and efficient solution for storing, searching, and retrieving multidimensional vectors.
One of the reasons for Chroma's popularity is its flexibility (a minimal usage sketch is shown after this list).
Pinecone is a cloud-based managed vector database. Its broad support for high-dimensional vectors makes Pinecone suitable for a variety of use cases, including similarity search, recommender systems, personalization, and semantic search. It also supports single-stage filtering, and its ability to analyze data in real time makes it a good choice for threat detection and cybersecurity monitoring.
Weaviate - A notable feature of this database is that it can store both vectors and objects. This makes it suitable for applications that combine multiple search methods, such as vector search and keyword search.
Milvus uses state-of-the-art indexing algorithms to speed up the search process, so similar vectors can be found quickly even when working with large amounts of data.
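As a small illustration of the vector-database workflow, here is a sketch using Chroma's Python client (pip install chromadb); the collection name and documents are made up, and the default embedding function is used.
import chromadb

client = chromadb.Client()  # in-memory client; a persistent client is also available
collection = client.create_collection(name="articles")

# Text is embedded automatically with the default embedding function.
collection.add(
    documents=["Vector databases index embeddings.", "SQL databases store tables."],
    ids=["doc1", "doc2"],
)

# Similarity search: the query text is embedded and the nearest documents are returned.
results = collection.query(query_texts=["How are embeddings stored?"], n_results=1)
print(results["documents"])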
💡😎Basics of working with Data-Mining: process, tools and techniques
Data mining is the process of processing data to identify patterns, correlations and anomalies in large datasets. It uses a variety of statistical analysis and machine learning techniques to extract meaningful information and insights from data. Companies can use these insights to make informed decisions, predict trends, and improve business strategies.
Common data mining techniques include (see the sketch after this list):
Decision trees - at the end of each branch there is a prediction or decision; in classification tasks, these leaves separate the data into categories.
Anomaly detection - anomalies can arise from fluctuations in measurements or indicate experimental error; in some cases they point to an important discovery or a new trend.
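A minimal sketch of both techniques with scikit-learn, using the built-in Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import IsolationForest

X, y = load_iris(return_X_y=True)

# Decision tree: each leaf ends in a class prediction.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:5]))

# Anomaly detection: IsolationForest flags unusual observations (-1 = anomaly).
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
print((iso.predict(X) == -1).sum(), "points flagged as anomalies")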
Software for working with data mining is divided into:
1. Visualization tools:
Grafana - suitable for analytics and real-time monitoring
Google Charts is a web-based solution for creating interactive charts
2. Data mining platforms:
KNIME is an analytical platform that lets you load data from various sources, transform it, and load it into various databases.
RapidMiner is a multi-user software platform that is an integrated environment for processing data in large information arrays, machine learning, text analytics and building predictive models, as well as for solving other Data Mining problems.
⚔️💡MySQL vs PostgreSQL in Data Mining: advantages and disadvantages
When it comes to choosing a database for data mining tasks, the two most popular solutions are MySQL and PostgreSQL. Both of these DBMSs have their strengths and weaknesses, and the choice between them depends on the specific requirements of the project. Let's look at the main advantages and disadvantages of each of them in the context of data mining.
MySQL Advantages:
1. High Performance: MySQL is known for its fast performance on simple read and write operations, making it suitable for high-load applications.
2. Widespread: MySQL has a large community and lots of documentation, making it easy to find solutions and get help.
3. Integration with web technologies: MySQL is often used in web development and integrates well with popular web stacks and frameworks.
Disadvantages of MySQL:
1. Limited analytics capabilities: MySQL is less efficient at running complex analytical queries, and advanced features important for data mining, such as window functions, appeared only in MySQL 8.0.
2. Less support for JSON and NoSQL: Although MySQL has JSON support, it is not as developed as PostgreSQL.
3. Limited transaction capabilities: MySQL is inferior to PostgreSQL in terms of complex transaction management and ACID compliance.
Benefits of PostgreSQL:
1. Powerful analytical capabilities: PostgreSQL supports complex analytical queries, window functions and CTE (Common Table Expressions), making it an excellent choice for data mining.
2. JSON and NoSQL support: PostgreSQL has advanced JSON support and can be used as a hybrid DBMS, making it easier to work with semi-structured data.
3. Extensibility and Compatibility: PostgreSQL is easily extensible with plugins and supports many SQL standards, making it very flexible.
4. Reliability and ACID compliance: PostgreSQL provides a high level of data reliability and full compliance with ACID transactions, which is important for mission-critical data mining applications.
Disadvantages of PostgreSQL:
1. Complexity of setup and administration: PostgreSQL requires more in-depth setup and administration, which can be more difficult for beginners.
2. Performance on simple tasks: In some cases, PostgreSQL may be inferior to MySQL in terms of speed for performing simple operations.
3. Resource Intensive: PostgreSQL may require more resources to achieve high performance, especially in complex scenarios.
Thus, the choice between MySQL and PostgreSQL for data mining tasks depends on the specific needs of the project. If you need a simple and fast database system for basic operations and integration with web applications, MySQL may be the best choice. If the project requires complex data analysis, flexibility and reliability, PostgreSQL will be a more suitable option.
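To show what the "analytical" features mentioned above look like in practice, here is a small sketch of a CTE combined with a window function. For portability it runs on Python's built-in sqlite3 (window functions require SQLite 3.25 or later), but the same SQL runs in PostgreSQL; the table and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 100), ('north', 150), ('south', 90), ('south', 200);
""")

query = """
    WITH regional AS (                      -- CTE: a named subquery
        SELECT region, amount FROM sales
    )
    SELECT region,
           amount,
           SUM(amount) OVER (PARTITION BY region)                 AS region_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM regional;
"""
for row in conn.execute(query):
    print(row)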
💡🔎An assistant for interacting with any kind of data
Verba is a fully customizable personal assistant for querying and interacting with your data, locally or deployed via the cloud.

It can also answer questions related to your documents and retrieve information from existing knowledge bases.

Verba perfectly combines state-of-the-art RAG technology with Weaviate's context-aware database.
💡 Large Feedback Dataset
The RLAIF-V-Dataset is a large multimodal feedback dataset. The dataset is built using open source models to provide high quality feedback.
RLAIF-V is a novel method of using open source MLLMs to provide high quality feedback on deconfounded model responses. By training on this data, models can reach a higher level of trustworthiness than existing open source models.
Load the dataset using a Python script as follows:
from datasets import load_dataset
dataset = load_dataset("HaoyeZhang/RLAIF-V-Dataset")
📊🔎A small selection of not very popular but useful libraries for data analysis
PySheets - provides a spreadsheet user interface for Python. Use Pandas, create charts, import Excel sheets, analyze data and create reports.
py2wasm - converts Python programs into WebAssembly and runs them about 3x faster.
databonsai - a Python library that uses LLMs for data cleaning tasks such as categorization, transformation, and extraction.
🔎💡Useful GitHub repositories for mastering data engineering, for beginners and beyond
Awesome Data Engineering - Contains a list of tools, frameworks and libraries for data engineering, making it a great starting point for those looking to dive into the field.
Data Engineering Zoomcamp is a comprehensive course that provides hands-on experience in data engineering.
The Data Engineering Cookbook is a collection of articles and tutorials covering various aspects of data engineering, including data ingestion, data processing, and data warehousing.
Awesome Open Source Data Engineering is a list of open source data engineering tools that will be a goldmine for anyone who wants to contribute or use them to create real data engineering projects. It contains a wealth of information about open source tools and frameworks, making it an excellent resource for those who want to explore alternative data engineering solutions.
Data Engineer Handbook is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles and books on all topics related to data engineering. Whether you're looking for a quick reference guide or in-depth knowledge, this reference book has something for every level of data engineer.
The Data Engineering Wiki is a community-created wiki that provides a comprehensive resource for learning data engineering. This repository covers a wide range of topics including data pipelines, data warehouses, and data modeling.
Data Engineering Practice - Offers a practical approach to learning data engineering. It features practical projects and exercises that will help you apply your knowledge and skills to real-life scenarios. By completing these projects, you will gain hands-on experience and build a portfolio that demonstrates your data engineering capabilities.
🌎TOP DS-events all over the world in June
Jun 2-4 - AI Con USA 2024 - Las Vegas, USA - https://aiconusa.techwell.com/
Jun 3-4 - Institute for Data Science and Artificial Intelligence Conference 2024 - Manchester, UK - https://oxfordroadcorridor.com/events/institute-for-data-science-and-artificial-intelligence-conference-2024/
Jun 5 - Digital Transformation Summit - Riyadh, Saudi Arabia - https://digitransformationsummit.com/ksa/
Jun 5-6 - AI & Big Data Expo North America 2024 - Santa Clara, USA - https://www.ai-expo.net/northamerica/
Jun 5-6 - Big Data and Analytics Summit - Ontario, Canada - https://www.bigdatasummitcanada.com/
Jun 12-13 - The AI Summit - London, United Kingdom - https://london.theaisummit.com/
Jun 17-19 - Data Science & Statistics - Amsterdam, Netherlands - https://datascience.thepeopleevents.com/
Jun 18 - The Martech Summit - Jakarta, Indonesia - https://themartechsummit.com/jakarta
Jun 20 - Data Architecture Melbourne - Melbourne, Australia - https://data-architecture.coriniumintelligence.com/
Jun 25 - Data Architecture Sydney - Sydney, Australia - https://data-architecture-syd.coriniumintelligence.com/
Jun 25-28 - MLCon Munich - Munich, Germany / Online - https://mlconference.ai/munich/
Jun 26-28 - International Conference on Distributed Computing and Artificial Intelligence (DCAI) - Salamanca, Spain - https://www.dcai-conference.net/
💡🔎📉Adversarial validation: advantages and disadvantages
Adversarial validation (AV) is a technique that checks how similar the training data is to the test data: a classifier is trained to distinguish training examples from test examples, and if it succeeds, the two distributions differ. This is especially useful in machine learning tasks where prediction quality suffers because the training and test samples come from different distributions. Let's look at the main advantages and disadvantages of this approach (a minimal sketch is given at the end of this post).
Advantages of adversarial validation:
1. Detection of data inconsistencies: AV helps identify whether the training and test data have very different distributions, which may signal problems with model generalization.
2. Improving model quality: by eliminating differences between training and test data, the quality of predictions on the test set can be significantly improved.
3. Optimization of data selection: with AV, training and validation sets can be chosen more carefully, which helps avoid overfitting and improves the overall quality of the model.
4. Identification of data leaks: AV helps identify cases where information from the test sample "leaks" into the training sample, which can lead to biased model results.
Disadvantages of adversarial validation:
1. Increased computational cost: performing AV requires training an additional model (usually a classifier), which increases the computational cost and time required for data analysis.
2. Difficulty of implementation: setting up and interpreting AV requires a certain level of machine learning knowledge and experience, which can be challenging for beginners.
3. Risk of overfitting: correcting the data too aggressively based on AV results can lead to models tuned to the test distribution and a deterioration of their generalization ability.
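A minimal sketch of the core idea with scikit-learn: label the origin of each row (training vs. test), fit a classifier, and look at the cross-validated ROC AUC; a value near 0.5 means the two samples are practically indistinguishable. Here X_train and X_test are assumed to be numeric feature matrices with identical columns.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_test):
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train, 1 = test
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()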
💡🔎Not very well known, but very useful ETL services
Astera Centerprise is an enterprise-grade, ready-to-use ETL solution that offers data integration and transformation capabilities for raw data of any complexity and size in a variety of formats: from complex hierarchical files and unstructured documents to industry formats such as EDI, and even legacy data such as COBOL.
Talend is an open source software platform that offers data integration and management solutions. Talend specializes in big data integration. This tool provides features such as cloud, big data, enterprise application integration, data quality, and master data management. It also provides a single repository for storing and reusing metadata.
Skyvia is a web service for cloud data integration and backup. It offers ETL tools to integrate cloud CRM with other data sources and allows users to control all their business data. Data can be viewed and manipulated using SQL. Skyvia provides easy data integration without programming skills.
Pentaho is a business intelligence tool that provides clients with a wide range of business intelligence solutions. It is capable of reporting, data analysis, data integration, data extraction, etc. Pentaho also offers a complete set of BI features that can improve business productivity and efficiency.
Hevo Data is an ETL platform that supports data integration, movement, and processing. It supports a wide range of data sources and offers real-time data replication. This tool facilitates the extraction, transformation and loading of data to the designated target destinations.