Big Data Science – Telegram
The Big Data Science channel gathers interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
💡 A Quick Selection of GitHub Repositories for Beginners and Beyond

SQL Roadmap for Data Science & Data Analytics - a step-by-step program for learning SQL, supplemented with links to learning materials. A great resource for mastering SQL

kh-sql-projects - a collection of source code for popular SQL projects, catering to developers of all levels, from beginners to advanced. The repository includes PostgreSQL-based projects for systems like library management, student records, hospitals, booking, and inventory. Perfect for hands-on SQL practice!

ds-cheatsheet - a repository packed with handy cheat sheets for learning and working in the Data Science field. An excellent resource for quick reference and study

GenAI Showcase - a repository showcasing the use of MongoDB in generative AI. It includes examples of integrating MongoDB with Retrieval-Augmented Generation (RAG) pipelines and various AI models; a minimal retrieval sketch follows below
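
To give a taste of the RAG pattern the GenAI Showcase covers, here is a minimal retrieval sketch using Atlas Vector Search. The connection string, index name, and collection are placeholders, and the query vector is a stand-in for the output of a real embedding model:

from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # placeholder connection string
collection = client["demo"]["docs"]

query_vector = [0.1] * 1536  # stand-in; in practice, produced by an embedding model

# $vectorSearch is the Atlas Vector Search aggregation stage.
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",      # placeholder index name
        "path": "embedding",          # field holding document embeddings
        "queryVector": query_vector,
        "numCandidates": 100,
        "limit": 5,
    }},
    {"$project": {"text": 1, "_id": 0}},
])
for doc in results:
    print(doc["text"])  # top-5 chunks to feed into the generator prompt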
💡😎 A Small Selection of Big, Fascinating, and Useful Datasets

Sky-T1-data-17k — a diverse dataset designed to train the Sky-T1-32B model, which powers the reasoning capabilities of MiniMax-Text-01. The model consistently outperforms GPT-4o and Gemini-2 on benchmarks involving long-context tasks

XMIDI Dataset — a large-scale music dataset with precise emotion and genre labels. It contains 108,023 MIDI files, making it the largest known dataset of its kind and a great fit for research in music and emotion recognition

AceMath-Data - a family of datasets used by NVIDIA to train its flagship model, AceMath-72B-Instruct, which significantly outperforms GPT-4o and Claude 3.5 Sonnet at solving mathematical problems
🤔💡 How Spotify Built a Scalable Annotation Platform: Insights and Results

Spotify recently shared their case study, How We Generated Millions of Content Annotations, detailing how they scaled their annotation process to support ML and GenAI model development. These improvements enabled the processing of millions of tracks and podcasts, accelerating model creation and updates.

Key Steps:
1️⃣ Scaling Human Expertise:
Core teams: annotators (primary reviewers), quality analysts (resolve complex cases), project managers (team training and liaison with engineers).
Automation: Introduced an LLM-based system to assist annotators, significantly reducing costs and effort.

2️⃣ New Annotation Tools:
Designed interfaces for complex tasks (e.g., annotating audio/video segments or texts).
Developed metrics to monitor progress: task completion, data volume, and annotator productivity.
Implemented a "consistency" metric to automatically flag contentious cases for expert review (a minimal sketch of such a check follows these steps).

3️⃣ Integration with ML Infrastructure:
Built a flexible architecture to accommodate various tools.
Added CLI and UI for rapid project deployment.
Integrated annotations directly into production ML pipelines.
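
For a sense of what a consistency check can look like in practice, here is a minimal sketch; this is not Spotify's actual metric, and the agreement threshold is an arbitrary assumption:

from collections import Counter

def flag_contentious(labels_by_item, min_agreement=0.8):
    """Flag items whose annotator agreement falls below a threshold."""
    contentious = []
    for item_id, labels in labels_by_item.items():
        top_count = Counter(labels).most_common(1)[0][1]
        agreement = top_count / len(labels)  # share of annotators picking the majority label
        if agreement < min_agreement:
            contentious.append(item_id)
    return contentious

# Example: three annotators label two tracks by genre.
print(flag_contentious({"track_1": ["rock", "rock", "rock"],
                        "track_2": ["rock", "pop", "jazz"]}))  # ['track_2']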

😎 Results:
Annotation volume increased 10x.
Annotator productivity improved 3x.
Reduced time-to-market for new models.

Spotify's scalable and efficient approach demonstrates how human expertise, automation, and robust infrastructure can transform annotation workflows for large-scale AI projects. 🚀
Which tool would you prefer for automating the processing and orchestration of Big Data tasks?
Anonymous Poll
Kubernetes – 24%
Apache Airflow – 59%
Apache NiFi – 9%
Apache Hive – 9%
😱 Data Errors That Led to Global Disasters

Demolishing the Wrong Houses – Demolition crews were sent to incorrect addresses because of inaccurate geoinformation data, including Google Maps errors. Homes were destroyed by mistake, causing tens of thousands of dollars in damages and legal battles for the companies involved.

Zoll Medical Defibrillators – Data quality issues during manufacturing caused Zoll Medical defibrillators to display error messages or completely fail during use. The company had to issue a Category 1 recall (the most severe, with a high risk of serious injury or death), costing $5.4 million in fines and damaging trust.

UK Passport Agency Failures – Errors in data migration during system updates caused severe passport issuance delays, leading to public outcry and a backlog of applications. Fixing the issue and hiring extra staff cost the agency £12.6 million.

Mars Climate Orbiter Disaster – The $327.6 million NASA probe burned up in Mars' atmosphere due to a unit conversion error—one engineering team used metric measurements, while another used the imperial system.

Knight Capital Stock Trading Error – A software bug caused Knight Capital to accidentally trade about 150 different stocks, racking up roughly $7 billion in unintended positions in under an hour. The firm lost $440 million and, pushed to the brink of bankruptcy, was eventually acquired.

AWS Outage at Amazon – A typo in a server management command accidentally deleted more servers than intended, causing a 4-hour outage. Companies relying on AWS suffered $150 million in losses due to downtime.

Spanish Submarine "Isaac Peral" (S-81) – A decimal point miscalculation led to the submarine being 75–100 tons too heavy to float. A complete redesign caused significant delays and cost over €2 billion.

Boeing 737 Max Crashes – In 2018 and 2019, two Boeing 737 Max crashes killed 346 people. The aircraft relied on data from a single angle-of-attack sensor, which could trigger an automated system that overrode pilot control. The disasters grounded the entire 737 Max fleet, costing Boeing $18 billion.

Lehman Brothers Collapse – Poor data quality and weak risk analysis led Lehman Brothers to take on more risk than it could handle. The hidden true value of its assets contributed to the largest bankruptcy in U.S. history, with some $639 billion in assets, triggering a global financial crisis.

💡Moral of the story: Data errors aren’t just small mistakes—they can cost billions, ruin companies, and even put lives at risk. Always verify, validate, and double-check!
🌎 Top DS events around the world in February

Feb 4-6 - AI Everything Global – Dubai, UAE - https://aieverythingglobal.com/home
Feb 5 - Open Day at DSTI – Paris, France - https://dsti.school/open-day-at-dsti-5-february-2025/
Feb 5-6 - The AI & Big Data Expo – London, UK - https://www.ai-expo.net/global/
Feb 6-7 - International Conference on Data Analytics and Business – New York, USA - https://sciencenet.co/event/index.php?id=2703381&source=aca
Feb 11 - AI Summit West - San Jose, USA - https://ai-summit-west.re-work.co/
Feb 12-13 - CDAO UK – London, UK - https://cdao-uk.coriniumintelligence.com/
Feb 13-14 - 6th National Big Data Health Science Conference – Columbia, USA - https://www.sc-bdhs-conference.org/
Feb 13-15 - WAICF - World AI Cannes Festival - Cannes, France - https://www.worldaicannes.com/
Feb 18 - adesso Data Day - Frankfurt / Main, Germany - https://www.adesso.de/de/news/veranstaltungen/adesso-data-day/programm.jsp
Feb 18-19 - Power BI Summit – Online - https://events.m365-summits.de/PowerBISummit2025-1819022025#/
Feb 18-20 - 4th IFC Workshop on Data Science in Central Banking – Rome, Italy - https://www.bis.org/ifc/events/250218_ifc.htm
Feb 19-20 - Data Science Day - Munich, Germany - https://wan-ifra.org/events/data-science-day-2025/
Feb 21 - ICBDIE 2025 – Suzhou, China - https://www.icbdie.org/submission
Feb 25 - Customerdata trends 2025 – Online - https://www.digitalkonferenz.net/
Feb 26-27 - ICET-25 - Chongqing, China - https://itar.in/conf/index.php?id=2703680
🚀 BigQuery Metastore: A Unified Metadata Service with Apache Iceberg Support

Google has announced a highly scalable metadata service for Lakehouse architecture. The new runtime metastore supports multiple analytics engines, including BigQuery, Apache Spark, Apache Hive, and Apache Flink.
BigQuery Metastore unifies metadata access, allowing different engines to query a single copy of data. It also supports Apache Iceberg, simplifying data management in lakehouse environments.

😎 Key Benefits:
Cross-compatibility – A single source of metadata for all analytics engines.
Open format support – Apache Iceberg, external BigQuery tables.
Built-in data governance – Access control, auditing, data masking.
Fully managed service – No configuration required, automatically scales.

🤔 Why is this important?
Traditional metastores are tied to specific engines, requiring manual table definitions and metadata synchronization. This leads to stale data, security issues, and high admin costs.

🤔 What does this change?
BigQuery Metastore standardizes metadata management, making lakehouse architecture more accessible, simplifying analytics, and reducing infrastructure maintenance costs.
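
As a hypothetical sketch of what this looks like from an engine's side, here is how a Spark session might attach to a shared Iceberg catalog. The catalog implementation class and property names below are assumptions, not the documented connector settings; check Google's documentation for the real ones:

from pyspark.sql import SparkSession

# Assumed class and property names -- placeholders, not confirmed connector settings.
spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog")  # assumed
    .config("spark.sql.catalog.lake.gcp_project", "my-project")          # placeholder
    .getOrCreate()
)

# The same Iceberg table is then visible to Spark and BigQuery alike.
spark.sql("SELECT * FROM lake.analytics.events LIMIT 10").show()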

🔎 Learn more here
🔥 WILDCHAT-50M: The Largest Open Dialogue Dataset for Language Models

Researchers have introduced WILDCHAT-50M, the largest open dataset of its kind, containing an extensive collection of real chat data. Designed to enhance language model training, particularly in dialogue processing and user interactions, the dataset consists of over 125 million chat transcripts spanning more than a million conversations. It serves as a valuable resource for researchers and developers working on advanced AI language models.

🔍 Key Features of WILDCHAT-50M:
Real-world conversational data – Unlike traditional datasets based on structured texts or curated dialogues, this dataset provides authentic user interactions.
Developed for RE-WILD SFT – Supports Supervised Fine-Tuning (SFT), enabling models to adapt to realistic conversation scenarios and improve long-term dialogue coherence (a loading sketch follows this list).
A massive open benchmark – One of the largest publicly available datasets in its category, allowing developers to test, experiment, and refine their NLP models.
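
For illustration, pulling a chat corpus into an SFT pipeline with the Hugging Face datasets library might look like this; the repository ID below is a placeholder, not the dataset's confirmed hub name:

from datasets import load_dataset

# Placeholder repo ID -- check the WILDCHAT-50M release for the actual location.
chats = load_dataset("example-org/wildchat-50m", split="train", streaming=True)

# Streaming avoids downloading the full corpus up front; inspect one conversation.
print(next(iter(chats)))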

Most language model training datasets rely on structured articles or scripted dialogues. In contrast, WILDCHAT-50M captures the nuances of real conversations, helping models generate more natural, context-aware responses.

🚀 Why does it matter?
By leveraging datasets like WILDCHAT-50M, language models can significantly improve their ability to generate human-like responses, understand spoken language dynamics, and advance the development of AI-powered virtual assistants, chatbots, and dialogue systems.

With access to real-world conversational data, AI is moving closer to truly natural and intelligent communication.
😎🛠 Another Roundup of Big Data Tools

NocoDB - An open-source platform that turns relational databases (MySQL, PostgreSQL, SQLite, MSSQL) into a no-code interface for managing tables, creating APIs, and visualizing data. A powerful self-hosted alternative to Airtable, offering full data control.

DrawDB - A visual database modeling tool that simplifies schema design, editing, and visualization. It supports automatic SQL code generation and integrates with MySQL, PostgreSQL, and SQLite. Ideal for developers and analysts who need a quick, user-friendly way to design databases.

Dolt - A relational database with Git-like version control. It lets you track row-level changes, create branches, merge them, and view the full history of modifications while working with standard SQL queries.

ScyllaDB - A high-performance NoSQL database fully compatible with Apache Cassandra but delivering lower latency and higher throughput. Optimized for modern multi-core processors, it is a great fit for high-load distributed systems

Metabase - An intuitive business intelligence platform for visualizing data, creating reports, and building dashboards without deep SQL knowledge. It supports MySQL, PostgreSQL, MongoDB, and more, making data analysis more accessible

Azimutt - A powerful ERD visualization tool for designing and analyzing complex databases. Features include interactive schema exploration, foreign key visualization, and problem detection, making it useful for both database development and auditing

sync - A real-time data synchronization tool for MongoDB and MySQL. It uses Change Streams (MongoDB) and binlog replication (MySQL) to ensure incremental updates, fault tolerance, and seamless recovery. Great for distributed databases and analytics workflows
🤔 Vector vs. Graph Databases: Which One to Choose?

When dealing with unstructured and interconnected data, selecting the right database system is crucial. Let’s compare vector and graph databases.

😎 Vector Databases

📌 Advantages:
Optimized for similarity search (e.g., NLP, computer vision).
High-speed approximate nearest neighbor (ANN) search.
Efficient when working with embedding models.

⚠️ Disadvantages:
Not suitable for complex relationships between objects.
Limited support for traditional relational queries.

😎 Graph Databases

📌 Advantages:
Excellent for handling highly connected data (social networks, routing).
Optimized for complex relationship queries.
Flexible data storage schema.

⚠️ Disadvantages:
Slower for large-scale linear searches.
Inefficient for high-dimensional vector processing.

🧐 Conclusion:
If you need embedding-based search → Go for vector databases (Faiss, Milvus).
If you need complex relationship queries → Use graph databases (Neo4j, ArangoDB).
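
To make the embedding-search option concrete, here is a minimal Faiss sketch; the vectors are random stand-ins for real embeddings:

import numpy as np
import faiss  # pip install faiss-cpu

dim = 128
embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(dim)  # exact L2 search; use IndexIVFFlat or HNSW for ANN at scale
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # the 5 nearest neighbors
print(ids[0], distances[0])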
💡 News of the Day: Harvard Launches a Federal Data Archive from data.gov

Harvard’s Library Innovation Lab has unveiled an archive of data.gov on the Source Cooperative platform. The 16TB collection contains over 311,000 datasets gathered in 2024–2025, providing a complete snapshot of publicly available federal data.

The archive will be updated daily, ensuring access to up-to-date information for researchers, journalists, analysts, and the public. It includes datasets across various domains, such as environment, healthcare, economy, transportation, and agriculture.

Additionally, Harvard has released open-source software on GitHub for building similar repositories and data archiving solutions, allowing other organizations and research centers to develop their own public data archives. The project is supported by the Filecoin Foundation and the Rockefeller Brothers Fund.
🔍 Key Big Data Trends in 2025

Experts at Xenoss have outlined the major trends shaping Big Data's future. Despite Jordan Tigani, one of BigQuery's founding engineers, predicting the possible "decline" of Big Data, analysts argue that the field is rapidly evolving.

🚀 Hyperscalable platforms are becoming essential for handling massive datasets. Advancements in NVMe SSDs, multi-threaded CPUs, and high-speed networks enable near-instant petabyte-scale analysis, unlocking new potential in AI & ML for predictive strategies based on historical and real-time data.

📊 Zero-party data is taking center stage, offering companies user-consented personalized data. When combined with AI & LLMs, it enhances forecasting and recommendations in media, retail, finance, and healthcare.

⚡️ Hybrid batch & stream processing is balancing speed and accuracy. Lambda architectures enable real-time event response while retaining deep historical data analysis capabilities.

🔧 ETL/ELT optimization is now a priority. Companies are shifting from traditional data processing pipelines to AI-powered ELT workflows that automate data filtering, quality checks, and anomaly detection.

🛠 Data orchestration is evolving, reducing data silos and simplifying management. Open-source tools like Apache Airflow and Dagster are making complex workflows more accessible and flexible.

🌎 Big Data → Big Ops: The focus is shifting from storing data to actively leveraging it in automated business operations—enhancing marketing, sales, and customer service.

🧩 Composable data stacks are gaining traction, allowing businesses to mix and match the best tools for different tasks. Apache Arrow, Substrait, and open table formats enhance flexibility while reducing vendor lock-in.

🔮 Quantum computing is beginning to revolutionize Big Data by tackling previously unsolvable problems. Industries like banking, healthcare, and logistics are already testing quantum-powered financial modeling, medical research, and route optimization.

💰 Balancing performance & cost is critical. Companies that fail to optimize their infrastructure face exponentially rising expenses. One AdTech firm, featured in the article, reduced its annual cloud budget from $2.5M to $144K by rearchitecting its data pipeline.
🚀🐝 Hive vs. Spark Distribution: Pros & Cons

Apache Hive and Apache Spark are both powerful Big Data tools, but they handle distributed processing differently.

🔹 Hive: SQL Interface for Hadoop

Pros:
Scales well for massive datasets (stored in HDFS)
SQL-like language (HiveQL) makes it user-friendly
Great for batch processing

Cons:
High query latency (relies on MapReduce/Tez)
Slower compared to Spark
Limited real-time stream processing capabilities

🔹 Spark: Fast Distributed Processing

Pros:
In-memory computing → high-speed performance
Supports real-time data processing (Structured Streaming)
Flexible: Works with HDFS, S3, Cassandra, JDBC, and more

Cons:
Requires more RAM
More complex to manage
Less efficient for batch processing of large archived datasets

💡 Conclusions:

Use Hive for complex SQL queries & batch processing.
Use Spark for real-time analytics & fast data processing.
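
As a tiny illustration of the Spark side, here is a PySpark batch job that caches its working set in memory; the file path and column names are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-demo").getOrCreate()

# Read from any supported source (HDFS, S3, local files, JDBC, ...).
df = spark.read.parquet("hdfs:///data/events")  # placeholder path

daily = df.groupBy("event_date").agg(F.count("*").alias("events")).cache()
daily.orderBy("event_date").show()  # first action materializes the cache
print(daily.count())                # second action reuses it without recomputing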
🗂 VAST Data is Changing the Game in Data Storage

According to experts, VAST Data is taking a major step toward creating a unified data storage platform by adding block storage support and built-in event processing.

With Unified Block Storage added, the platform now spans all key protocols (block, files, objects, tables, data streams), eliminating fragmented infrastructure. This makes for a powerful, cost-effective solution for AI- and analytics-driven companies.

VAST Event Broker replaces complex event-driven systems like Kafka, enabling built-in real-time data streaming. AI and analytics can now receive events instantly without additional software.

🚀 Key Features:
Accelerated AI analytics with real-time data delivery
Full compatibility with MySQL, PostgreSQL, Oracle, and cloud services
Scalable architecture with no performance trade-offs

🔎 Read more here
🌎 Top DS events around the world in March
Mar 1 - Open Data Day Flensburg - Flensburg, Germany - https://opendataday-flensburg.de/
Mar 3 – ICMBDC - Shanghai, China - https://asar.org.in/Conference/55676/ICMBDC/
Mar 3-6 - Mobile World Congress – Barcelona, Spain - https://www.mwcbarcelona.com/
Mar 3-5 - Gartner Data & Analytics Summit – Orlando, USA - https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 4 – ElasticON - Singapore, Singapore - https://www.elastic.co/events/elasticon/singapore
Mar 5-7 – PGConf - Bengaluru, India - https://pgconf.in/conferences/pgconfin2025
Mar 7 - Webinar 'From data to metadata: enhancing quality across borders' – Online - https://dataeuropaacademy.clickmeeting.com/webinar-from-data-to-metadata-enhancing-quality-across-borders-/register
Mar 11-12 - Data Spaces Symposium 2025 - Warsaw, Poland - https://www.data-spaces-symposium.eu/
Mar 12-13 - Big Data & AI World – London, UK - https://www.bigdataworld.com/
Mar 15 - Open Data Day - Timisoara, Romania - https://tm.opendataday.ro/
Mar 17-18 – ALT DATA – Singapore - https://www.battlefin.com/events/asia-2025
Mar 19 - EU Open Data Days 2025 – Luxembourg - https://data.europa.eu/en/news-events/events/eu-open-data-days-2025
Mar 20 - International Conference on Big Data and Smart Computing - Washington, DC, USA - https://ijieee.org.in/Conference/14165/ICBDSC/
Mar 21 - Open Source Day 2025 - Florence, Italy - https://osday.dev/
Mar 26 – ICAIMLBDE - Philadelphia, USA - https://isete.org/Conference/26577/ICAIMLBDE/
Mar 27 – MLConf - New York, USA - https://mlconf.com/
Mar 29 – ICBDS - Boston, USA - https://bigdataresearchforum.com/Conference/472/ICBDS/
Mar 31 – Apr 2 - Data Science Leadership Summit – Columbus, USA - https://academicdatascience.org/events/adsa-meetings/2025-data-science-leadership-summit/
🐼 Pandas is outdated: FireDucks offers a replacement without rewriting code

Pandas is the most popular library for data processing, but it has long suffered from low performance. Modern alternatives like Polars significantly outperform it, but switching to a new framework means learning a new API, which deters many developers.

🔥 FireDucks solves this problem by offering full compatibility with Pandas but with multi-threaded processing and compiler acceleration. All that is needed for the transition is to change one line:

import fireducks.pandas as pd
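
After that one change, existing Pandas code runs as-is; for example, this minimal sketch with made-up data works unchanged:

import fireducks.pandas as pd  # the only line that differs from a plain Pandas script

df = pd.DataFrame({"city": ["Paris", "Paris", "Rome"], "sales": [10, 20, 30]})
print(df.groupby("city")["sales"].sum())  # same Pandas API, multi-threaded under the hood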

FireDucks is faster than Pandas and Polars, as confirmed by benchmarks:

🔗 FireDucks GitHub repository: https://github.com/fireducks-dev/fireducks
🔗 Comparison with Polars and Pandas: https://github.com/fireducks-dev/fireducks/blob/main/notebooks/FireDucks_vs_Pandas_vs_Polars.ipynb
🔗 Detailed benchmarks: https://fireducks-dev.github.io/docs/benchmarks/
🎲 Conditional Probability: Updating Beliefs with New Data

As we receive new information, our perception of event probabilities changes. This is the core idea of conditional probability, widely used in machine learning, medicine, finance, and more.

💡 Simple Examples:

🔹 Drawing a King from a deck: 4/52. If we know the card is a face card, the probability increases to 4/12.
🔹 Rolling a 6 on a die: 1/6. If we know the number is even, the probability jumps to 1/3.

💡 Real-World Applications:
Medicine – Evaluating test accuracy (sensitivity, specificity, false positives).
Finance – Assessing market risks, default probability of borrowers.
Machine Learning – Spam filtering, medical diagnosis, credit scoring.

📌 Bayes' Theorem allows us to update probabilities as new data arrives. For instance, a positive test for a rare disease doesn’t necessarily mean a patient is sick—probability depends on disease prevalence and test accuracy.
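
To put the rare-disease example in numbers, here is a quick sketch; the prevalence and accuracy figures are made up for illustration:

# Assumed: 1% prevalence, 95% sensitivity, 95% specificity.
prevalence, sensitivity, specificity = 0.01, 0.95, 0.95

# Total probability of a positive test:
# P(pos) = P(pos | sick) * P(sick) + P(pos | healthy) * P(healthy)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(sick | pos) = P(pos | sick) * P(sick) / P(pos)
posterior = sensitivity * prevalence / p_positive
print(f"P(sick | positive) = {posterior:.1%}")  # about 16.1%, far from certainty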

🔎 Learn more: 👉 Conditional Probability
🔥 Everything to Markdown (E2M): Convert Anything to Markdown in Seconds!

Need to quickly and efficiently convert various file formats to Markdown? Check out Everything to Markdown (E2M) — a Python library that does it automatically!

📌 What Can E2M Do?
E2M supports conversion from multiple formats:
Text documents: DOC, DOCX, EPUB
Web pages: HTML, HTM, URL
Presentations & PDFs: PPT, PPTX, PDF
Audio files: MP3, M4A (speech recognition)

🤔 How Does It Work?
The conversion process is powered by two key modules:
🔹 Parser – Extracts text and images from files.
🔹 Converter – Transforms them into Markdown.
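
A rough usage sketch of this two-step flow follows; the import path, class names, and method signatures are assumptions based on the project's README, so verify them against the actual E2M docs:

# Assumed API -- verify names against the E2M repository before use.
from wisup_e2m import E2MParser, E2MConverter

parser = E2MParser.from_config("config.yaml")        # step 1: the Parser extracts text/images
parsed = parser.parse(file_name="report.pdf")        # placeholder input file

converter = E2MConverter.from_config("config.yaml")  # step 2: the Converter emits Markdown
markdown = converter.convert(text=parsed.text)
print(markdown)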

🎯 Why Use E2M?
Its main goal is to create structured text data for:
🚀 Retrieval-Augmented Generation (RAG)
🤖 Training & fine-tuning language models
📚 Effortless documentation creation

💡 Why Is It Useful?
E2M automates tedious work, enabling fast data structuring. Since Markdown is a universal format, it integrates seamlessly into any system.