Big Data Science
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🔍 Key Big Data Trends in 2025

Experts at Xenoss have outlined the major trends shaping Big Data's future. Despite former Google BigQuery engineer Jordan Tigani predicting the possible “decline” of Big Data, analysts argue that the field is evolving rapidly.

🚀 Hyperscalable platforms are becoming essential for handling massive datasets. Advancements in NVMe SSDs, multi-threaded CPUs, and high-speed networks enable near-instant petabyte-scale analysis, unlocking new potential in AI & ML for predictive strategies based on historical and real-time data.

📊 Zero-party data is taking center stage, offering companies user-consented personalized data. When combined with AI & LLMs, it enhances forecasting and recommendations in media, retail, finance, and healthcare.

⚡️ Hybrid batch & stream processing balances speed and accuracy. Lambda architectures enable real-time event response while retaining deep historical analysis capabilities.

🔧 ETL/ELT optimization is now a priority. Companies are shifting from traditional data processing pipelines to AI-powered ELT workflows that automate data filtering, quality checks, and anomaly detection.

🛠 Data orchestration is evolving, reducing data silos and simplifying management. Open-source tools like Apache Airflow and Dagster are making complex workflows more accessible and flexible.

🌎 Big Data → Big Ops: The focus is shifting from storing data to actively leveraging it in automated business operations—enhancing marketing, sales, and customer service.

🧩 Composable data stacks are gaining traction, allowing businesses to mix and match the best tools for different tasks. Apache Arrow, Substrait, and open table formats enhance flexibility while reducing vendor lock-in.

🔮 Quantum computing is beginning to revolutionize Big Data by tackling previously unsolvable problems. Industries like banking, healthcare, and logistics are already testing quantum-powered financial modeling, medical research, and route optimization.

💰 Balancing performance & cost is critical. Companies that fail to optimize their infrastructure face exponentially rising expenses. One AdTech firm, featured in the article, reduced its annual cloud budget from $2.5M to $144K by rearchitecting its data pipeline.
🚀🐝 Hive vs. Spark: Distributed Processing Pros & Cons

Apache Hive and Apache Spark are both powerful Big Data tools, but they handle distributed processing differently.

🔹 Hive: SQL Interface for Hadoop

Pros:
Scales well for massive datasets (stored in HDFS)
SQL-like language (HiveQL) makes it user-friendly
Great for batch processing

Cons:
High query latency (relies on MapReduce/Tez)
Slower compared to Spark
Limited real-time stream processing capabilities

🔹 Spark: Fast Distributed Processing

Pros:
In-memory computing → high-speed performance
Supports real-time data processing (Structured Streaming)
Flexible: Works with HDFS, S3, Cassandra, JDBC, and more

Cons:
Requires more RAM
More complex to manage
Less efficient for archived big data batch processing

💡 Conclusions:

Use Hive for complex SQL queries & batch processing.
Use Spark for real-time analytics & fast data processing.
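
For the Spark side, a minimal PySpark sketch (the table and column names are hypothetical): with Hive support enabled, Spark can query Hive-managed tables directly.

from pyspark.sql import SparkSession

# Spark can read Hive tables directly when Hive support is enabled
spark = (SparkSession.builder
         .appName("hive-vs-spark-demo")
         .enableHiveSupport()
         .getOrCreate())

# Batch: a HiveQL-style aggregation over a Hive table (hypothetical name)
daily = spark.sql("SELECT dt, COUNT(*) AS events FROM logs.events GROUP BY dt")
daily.show()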
🗂 VAST Data is Changing the Game in Data Storage

According to experts, VAST Data is taking a major step toward creating a unified data storage platform by adding block storage support and built-in event processing.

Unified Block Storage now integrates all key protocols (files, objects, tables, data streams), eliminating fragmented infrastructure. This provides a powerful, cost-effective solution for AI and analytics-driven companies.

VAST Event Broker replaces complex event-driven systems like Kafka, enabling built-in real-time data streaming. AI and analytics can now receive events instantly without additional software.

🚀 Key Features:
Accelerated AI analytics with real-time data delivery
Full compatibility with MySQL, PostgreSQL, Oracle, and cloud services
Scalable architecture with no performance trade-offs

🔎 Read more here
🌎 TOP DS-events all over the world in March
Mar 1 - Open Data Day Flensburg - Flensburg, Germany - https://opendataday-flensburg.de/
Mar 3 – ICMBDC - Shanghai, China - https://asar.org.in/Conference/55676/ICMBDC/
Mar 3-6 - Mobile World Congress – Barcelona, Spain - https://www.mwcbarcelona.com/
Mar 4 – ElasticON - Singapore, Singapore - https://www.elastic.co/events/elasticon/singapore
Mar 3-5 - Gartner Data & Analytics Summit – Orlando, USA - https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 5-7 – PGConf - Bengaluru, India - https://pgconf.in/conferences/pgconfin2025
Mar 7 - Webinar 'From data to metadata: enhancing quality across borders' – Online - https://dataeuropaacademy.clickmeeting.com/webinar-from-data-to-metadata-enhancing-quality-across-borders-/register
Mar 11-12 - Data Spaces Symposium 2025 - Warsaw, Poland - https://www.data-spaces-symposium.eu/
Mar 12-13 - Big Data & AI World – London, UK - https://www.bigdataworld.com/
Mar 15 - Open Data Day - Timisoara, Romania - https://tm.opendataday.ro/
Mar 17-18 – ALT DATA – Singapore - https://www.battlefin.com/events/asia-2025
Mar 19 - EU Open Data Days 2025 – Luxembourg - https://data.europa.eu/en/news-events/events/eu-open-data-days-2025
Mar 20 - International Conference on Big Data and Smart Computing - Washington, DC, USA - https://ijieee.org.in/Conference/14165/ICBDSC/
Mar 21 - Open Source Day 2025 - Florence, Italy - https://osday.dev/
Mar 26 – ICAIMLBDE - Philadelphia, USA - https://isete.org/Conference/26577/ICAIMLBDE/
Mar 27 – MLConf - New York, USA - https://mlconf.com/
Mar 29 – ICBDS - Boston, USA - https://bigdataresearchforum.com/Conference/472/ICBDS/
Mar 31 – Apr 2 - Data Science Leadership Summit – Columbus, USA - https://academicdatascience.org/events/adsa-meetings/2025-data-science-leadership-summit/
🐼 Pandas is outdated: FireDucks offers a replacement without rewriting code

Pandas is the most popular library for data processing, but it has long suffered from low performance. Modern alternatives like Polars significantly outperform it, yet switching to a new framework means learning a new API, which deters many developers.

🔥 FireDucks solves this problem by offering full Pandas compatibility along with multi-threaded processing and compiler acceleration. All it takes to switch is changing one line:

import fireducks.pandas as pd
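
The rest of your code stays plain Pandas. A minimal sketch (the file and column names are hypothetical):

df = pd.read_csv("sales.csv")
top = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(top.head())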

FireDucks is faster than Pandas and Polars, as confirmed by benchmarks:

🔗 FireDucks GitHub repository: https://github.com/fireducks-dev/fireducks
🔗 Comparison with Polars and Pandas: https://github.com/fireducks-dev/fireducks/blob/main/notebooks/FireDucks_vs_Pandas_vs_Polars.ipynb
🔗 Detailed benchmarks: https://fireducks-dev.github.io/docs/benchmarks/
🎲 Conditional Probability: Updating Beliefs with New Data

As we receive new information, our perception of event probabilities changes. This is the core idea of conditional probability, widely used in machine learning, medicine, finance, and more.

💡 Simple Examples:

🔹 Drawing a King from a deck: 4/52. If we know the card is a face card, the probability increases to 4/12.
🔹 Rolling a 6 on a die: 1/6. If we know the number is even, the probability jumps to 1/3.

💡 Real-World Applications:
Medicine – Evaluating test accuracy (sensitivity, specificity, false positives).
Finance – Assessing market risks, default probability of borrowers.
Machine Learning – Spam filtering, medical diagnosis, credit scoring.

📌 Bayes' Theorem allows us to update probabilities as new data arrives. For instance, a positive test for a rare disease doesn’t necessarily mean a patient is sick—probability depends on disease prevalence and test accuracy.
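
A quick worked example in Python (the prevalence and accuracy figures are illustrative assumptions, not real test data):

# P(disease | positive test) via Bayes' theorem
prevalence = 0.001          # 0.1% of people have the disease (assumed)
sensitivity = 0.99          # P(positive | disease)
false_positive_rate = 0.05  # P(positive | no disease)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"{p_disease_given_positive:.1%}")  # about 1.9%: still low despite a positive test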

🔎 Learn more: 👉 Conditional Probability
🔥 Everything to Markdown (E2M): Convert Anything to Markdown in Seconds!

Need to quickly and efficiently convert various file formats to Markdown? Check out Everything to Markdown (E2M) — a Python library that does it automatically!

📌 What Can E2M Do?
E2M supports conversion from multiple formats:
Text documents: DOC, DOCX, EPUB
Web pages: HTML, HTM, URL
Presentations & PDFs: PPT, PPTX, PDF
Audio files: MP3, M4A (speech recognition)

🤔 How Does It Work?
The conversion process is powered by two key modules:
🔹 Parser – Extracts text and images from files.
🔹 Converter – Transforms them into Markdown.
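
In outline, usage follows this parse-then-convert flow. A hypothetical sketch (the class and method names are assumptions; consult the E2M README for the actual API):

# Hypothetical sketch of E2M's parse -> convert flow (names are assumptions)
from wisup_e2m import E2MParser, E2MConverter

parser = E2MParser.from_config("config.yaml")    # set up the Parser module
parsed = parser.parse(file_name="report.pdf")    # extract text and images
converter = E2MConverter.from_config("config.yaml")
markdown = converter.convert(text=parsed.text)   # transform into Markdown
print(markdown)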

🎯 Why Use E2M?
Its main goal is to create structured text data for:
🚀 Retrieval-Augmented Generation (RAG)
🤖 Training & fine-tuning language models
📚 Effortless documentation creation

💡 Why Is It Useful?
E2M automates tedious work, enabling fast data structuring. Since Markdown is a universal format, it integrates seamlessly into any system.
📊 Apache Iceberg vs Delta Lake vs Hudi: Which Format is Best for AI/ML?

Choosing the right data storage format is crucial for machine learning (ML) and analytics. The wrong choice can lead to slow queries, poor scalability, and data integrity issues.

🔥 Why Does Format Matter?
Traditional data lakes struggle with:
🚧 No ACID transactions – risk of read/write conflicts
📉 No data versioning – hard to track changes
🐢 Slow queries – large datasets slow down analytics

💡 Apache Iceberg – Best for Analytics & Batch Processing

📌 When to Use?

Handling historical datasets
Need for query optimization & schema evolution
Batch processing is a priority

📌 Key Advantages
ACID transactions with snapshot isolation
Time travel – restore previous versions of data
Hidden partitioning – speeds up queries
Supports Spark, Flink, Trino, Presto

📌 Use Cases
🔸 BI & trend analysis
🔸 Data storage for ML model training
🔸 Audit logs & rollback scenarios
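
For instance, time travel can be expressed directly in Spark SQL (a minimal sketch; assumes a Spark session configured with an Iceberg catalog, and the table name is hypothetical):

# Query an Iceberg table as of an earlier snapshot
spark.sql("""
    SELECT * FROM demo.db.events
    TIMESTAMP AS OF '2025-01-01 00:00:00'
""").show()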

💡 Delta Lake – Best for AI/ML & Streaming Workloads

📌 When to Use?
Streaming data is critical for ML
Need true ACID transactions
Working primarily with Apache Spark

📌 Key Advantages
Deep Spark integration
Incremental updates (avoids full dataset rewrites)
Z-Ordering – clusters similar data for faster queries
Time travel – rollback & restore capabilities

📌 Use Cases
🔹 Real-time ML pipelines (fraud detection, predictive analytics)
🔹 ETL workflows
🔹 IoT data processing & logs
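
Two of these features in a minimal PySpark sketch (assumes Spark with the delta-spark package; the paths, column names, and `updates` DataFrame are hypothetical):

from delta.tables import DeltaTable

# Time travel: read an earlier version of the table
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/data/events")

# Incremental upsert instead of rewriting the whole dataset
tbl = DeltaTable.forPath(spark, "/data/events")
(tbl.alias("t")
    .merge(updates.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())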

💡 Apache Hudi – Best for Real-Time Updates

📌 When to Use?
Need fast real-time analytics
Data needs frequent updates
Working with Apache Flink, Spark, or Kafka

📌 Key Advantages
ACID transactions & version control
Merge-on-Read (MoR) – update without rewriting entire datasets
Optimized for real-time ML (fraud detection, recommendations)
Supports micro-batching & streaming

📌 Use Cases
🔸 Fraud detection (bank transactions, security monitoring)
🔸 Recommendation systems (e-commerce, streaming services)
🔸 AdTech (real-time bidding, personalized ads)
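
An upsert into a Merge-on-Read table looks roughly like this in PySpark (a sketch; the option values, path, and DataFrame `df` are illustrative):

# Upsert a DataFrame into a Hudi MoR table without rewriting it
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
df.write.format("hudi").options(**hudi_options).mode("append").save("/data/hudi/events")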

🤔 Which Format is Best for AI/ML?
Iceberg – Best for historical data and BI analytics
Delta Lake – Best for AI/ML, streaming, and Apache Spark
Hudi – Best for frequent updates & real-time ML (fraud detection, recommendations, AdTech)

🔗 Full breakdown here
🛠 Another Roundup of Tools for Data Management, Storage, and Analysis

🔹 DrawDB – A visual database schema design tool that simplifies database design and interaction. Its graphical interface lets developers create and visualize database structures without hand-writing complex SQL.

🔹 Hector RAG – A Retrieval-Augmented Generation (RAG) framework built on PostgreSQL. It enhances AI applications by combining retrieval and text generation, improving response accuracy and efficiency in search-enhanced LLMs.

🔹 ERD Lab – A free online tool for designing and visualizing Entity-Relationship Diagrams (ERD). Users can import SQL scripts or create new databases without writing code, making it an ideal solution for database design and documentation.

🔹 SuperMassive – A distributed, fault-tolerant in-memory key-value database designed for high-performance applications. It provides low-latency access and self-recovery, making it perfect for mission-critical workloads.

🔹 Smallpond – A lightweight data processing framework built on DuckDB and 3FS. It enables high-performance analytics on petabyte-scale datasets without requiring long-running services or complex infrastructure.

🔹 ingestr – A CLI tool for seamless data migration between databases like Postgres, BigQuery, Snowflake, Redshift, Databricks, DuckDB, and more. Supports full refresh & incremental updates with append, merge, or delete+insert strategies.

🚀 Whether you’re designing databases, optimizing AI pipelines, or managing large-scale data workflows, these tools will streamline your work and boost productivity!
💡 Master SQL Easily: A Hands-On Training Site (SQL Practice)

Looking to sharpen your SQL skills with real-world examples? This site is a great choice!

🔹 Format – Exercises are based on a hospital database, making them relevant to real-life SQL use cases.
🔹 Difficulty Levels – Start with basic SELECT queries and gradually advance to joins, subqueries, window functions, and query optimization.
🔹 Practical Benefits – Especially useful for healthcare analysts, data professionals, and developers working with medical systems.
🔹 Perfect for Preparation – Ideal for interview prep, certifications, or simply improving SQL proficiency.

🚀 This resource not only teaches SQL but also helps you understand how to work with data in a medical context effectively!
📚 Book Review: Apache Pulsar in Action

Author: David Kjerrumgaard

"Apache Pulsar in Action" is a practical guide to using Apache Pulsar, a powerful platform for real-time messaging and data streaming. While primarily targeting experienced Java developers, it also includes Python examples, making it useful for professionals from various technical backgrounds.

🔍 What’s Inside?
The author explores Apache Pulsar’s architecture and its key advantages over messaging systems like Kafka and RabbitMQ, highlighting:
🔹 Multi-protocol support (MQTT, AMQP, Kafka binary protocol).
🔹 High fault tolerance & scalability in cloud environments.
🔹 Pulsar Functions for developing microservice applications.
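
In the spirit of the book's examples, a minimal round trip with the official Python client (assumes a Pulsar broker running locally on the default port):

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Produce a message
producer = client.create_producer("my-topic")
producer.send("Hello, Pulsar!".encode("utf-8"))

# Consume and acknowledge it
consumer = client.subscribe("my-topic", subscription_name="demo-sub")
msg = consumer.receive()
print(msg.data().decode("utf-8"))
consumer.acknowledge(msg)

client.close()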

💡 Who Should Read It?

📌 Microservices developers – Learn how to integrate Pulsar into distributed systems.
📌 DevOps engineers – Get insights on deployment and monitoring.
📌 Data processing specialists – Discover streaming analytics techniques.

🤔 Pros & Cons

Pros:
Comprehensive development & architecture guide.
Hands-on approach with Java & Python code examples.
Accessible for developers at various experience levels.

Cons:
Lacks real-world case studies, making it harder to adapt Pulsar for specific business use cases.

🏆 Final Verdict
"Apache Pulsar in Action" is a valuable resource for those looking to master streaming data processing and scalable distributed systems. While it could benefit from more industry-specific case studies, it remains an excellent hands-on guide for understanding and implementing Apache Pulsar.
📕 Think Stats — The Best Free Guide to Statistics for Python Developers

Think Stats is a hands-on guide to statistics and probability, designed specifically for Python developers. Unlike traditional textbooks, it dives straight into coding, helping you master statistical methods using real-world data and practical exercises.

🔍 Why Think Stats Stands Out

Practical focus – Minimal complex math, maximum real-world applications.
Fully integrated with Python – The book is structured as Jupyter Notebooks, allowing you to run code and see results instantly.
Real dataset analysis – Includes demographic data, medical research, and social media analytics.
Data Science-oriented – The learning approach is tailored for analysts, developers, and data scientists.
Easy to read – Concepts are explained in a clear and accessible manner, making it beginner-friendly.

📚 What’s Inside?
🔹 Core statistics and probability concepts in a programming context.
🔹 Data cleaning, processing, and visualization techniques.
🔹 Deep dive into distributions (normal, binomial, Poisson, etc.).
🔹 Parameter estimation, confidence intervals, and hypothesis testing.
🔹 Bayesian analysis, increasingly popular in Data Science.
🔹 Introduction to regression, forecasting, and statistical modeling.
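
A taste of the book's style: estimating a mean with a confidence interval in a few lines (the data below is simulated, not from the book's datasets):

import numpy as np
from scipy import stats

# Simulated sample (a stand-in for a real dataset from the book)
rng = np.random.default_rng(42)
sample = rng.normal(loc=7.3, scale=1.2, size=200)  # e.g., birth weights in lbs

mean = sample.mean()
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=stats.sem(sample))
print(f"mean = {mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")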

🎯 Who Should Read It?

Python developers wanting to learn statistics through coding.
Data scientists & analysts looking for practical knowledge.
Students & self-learners who need real-world applications of statistics.
ML engineers who need a strong foundation in statistical methods.

🤔 Why You Should Read Think Stats
📌 No fluff, just practical statistics that you can apply immediately.
📌 Free and open-source (Creative Commons license) – Download, copy, and share freely.
📌 Jupyter Notebook integration for a hands-on learning experience.

💡Think Stats is a must-have resource for anyone who wants to learn and apply statistics effectively in Python. Whether you're a beginner or an experienced developer, this book will boost your data science skills!

💻 GitHub
🤔🗂 Google Research Develops Privacy-Preserving Synthetic Data Generation Method

Google Research has introduced a new method for generating synthetic data using differentially private LLM inference. This approach ensures data privacy while maintaining statistical utility, preventing leaks of sensitive information.

🔍 How Does It Work?
During text generation, Gaussian noise is added to LLM token distributions, preventing the reconstruction of original data. This ensures that individual examples in the training dataset do not significantly affect the output.

🧐 Privacy Parameters (ε & δ):
🔹 Lower ε = stronger privacy, but lower text quality.
🔹 Recommended range: ε = 1–5, balancing privacy & utility.

🚀 Key Privacy Mechanisms
Noise addition to model log probabilities before token selection.
Gradient clipping during training to limit the influence of individual data points.
Query grouping to minimize privacy risks from multiple model interactions.
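
A toy illustration of the first mechanism (a simplification, not Google's actual algorithm, which calibrates the noise to formal ε/δ guarantees):

import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, sigma=1.0):
    """Sample a token after adding Gaussian noise to the model's logits."""
    noisy = logits + rng.normal(0.0, sigma, size=logits.shape)
    probs = np.exp(noisy - noisy.max())   # softmax over the noisy scores
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])  # toy next-token scores
print(sample_token(logits))         # higher sigma: stronger privacy, noisier text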

📊 Testing Results
🔹 Synthetic data retains practical utility for training downstream models.
🔹 Formal privacy guarantees (ε < 5) without significant loss in data quality.

🛠 Where Can It Be Used?
💡 Training AI models on sensitive datasets (e.g., healthcare, finance).
💡 Algorithm testing without access to real data.
💡 Data sharing between organizations without privacy risks.

⚖️ Pros & Cons

Pros:
Privacy without losing functionality – secure data without major quality loss.
Ethical AI usage in sensitive domains.

Cons:
Trade-off between quality & privacy – stronger privacy can reduce text coherence.
Increased computational costs – additional privacy checks slow down generation.

🤖 Conclusion
Google Research’s approach sets new standards for handling confidential data while balancing security & usability. This could redefine AI ethics and data-sharing practices for personal & corporate data.
🌎 TOP DS-events all over the world in April
Apr 1-2 - Healthcare NLP Summit – Online - https://www.nlpsummit.org/healthcare-2025/
Apr 1-4 - KubeCon + CloudNativeCon – London, UK - https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/
Apr 3 - AI Integration & Autonomous Mobility - Munich, Germany - https://www.automantia.in/aiam-munich
Apr 9-10 - Data & AI – Warsaw, Poland - https://dataiwarsaw.tech/
Apr 9-11 - Google Cloud Next – Las Vegas, USA - https://cloud.withgoogle.com/next/25
Apr 9-11 - Data Management ThinkLab - Prague, Czech Republic - https://thinklinkers.com/events/data_management_conference_event_europe_2025
Apr 10-12 - Strata Data & AI Conference - New York City, USA - https://www.oreilly.com/conferences/strata-data-ai.html
Apr 23-25 - PyCon DE & PyData 2025 - Darmstadt, Germany - https://2025.pycon.de/
Apr 23-25 – GITEX ASIA x AI Everything Singapore – Singapore, Singapore - https://gitexasia.com/
Apr 24 - Elevate: Data Management roles in the AI Era - Antwerp, Belgium - https://datatrustassociates.com/roundtableapril/
Apr 28-30 - Machine Learning Prague 2025 - Prague, Czech Republic - https://www.mlprague.com/
Apr 29 – May 2 - Symposium on Data Science and Statistics – Salt Lake City, USA - https://ww2.amstat.org/meetings/sdss/2025/
Apr 29-30 - CDAO Germany - Munich, Germany - https://cdao-germany.coriniumintelligence.com/
🚀 HuggingFace Releases Datasets for Pre-Training LLMs on Code Generation

Following the success of OlympicCoder-32B, which beat Claude 3.7 Sonnet on LiveCodeBench and IOI 2024 problems, HuggingFace has released a rich set of datasets for pre-training and fine-tuning LLMs on programming tasks.

Stack-Edu (125 billion tokens) – educational code in 15 programming languages, filtered from The Stack v2
GitHub Issues (11 billion tokens) – data from discussions and bug reports on GitHub
CodeForces problems (10K tasks) – a unique set of CodeForces problems, 3K of which were not used in DeepMind training
CodeForces problems DeepSeek-R1 (8.69 GB) – filtered traces of CodeForces solutions
International Olympiad in Informatics: Problem statements dataset (2020 - 2024) – a unique set of Olympiad programming problems, split into subtasks so that each prompt corresponds to the solution of one subtask
International Olympiad in Informatics: Problem - DeepSeek-R1 CoT dataset (2020 - 2023) – 11K reasoning traces produced by DeepSeek-R1 while solving Olympiad programming problems
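
All of these load with the standard datasets library (a sketch; the hub ID and config name are assumptions; verify the exact names on the HuggingFace Hub):

from datasets import load_dataset

# Hub ID and config are assumptions; check huggingface.co for exact names
ds = load_dataset("HuggingFaceTB/stack-edu", "python", split="train", streaming=True)
print(next(iter(ds)))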

💡 What to use it for?
🔹 LLM pre-training for code generation
🔹 Developing AI assistants for programmers
🔹 Improving solutions in computer olympiads
🔹 Creating ML models for code analysis
📊 How to avoid data chaos? Ways to ensure consistency of metrics in a warehouse

If you work with analytics, you have probably encountered a situation where the same metric is calculated differently in different departments. This causes confusion, erodes trust in the data, and slows down decision-making. A new article examines the key causes of this problem and two effective solutions.

🤔 Why do metrics diverge?
The reason lies in the spontaneous growth of analytics:
🔹 One analyst writes a SQL query to calculate the metric.
🔹 Then other teams create their own versions based on this query, making minor changes.
🔹 Over time, discrepancies arise, and the analytics team spends more and more time sorting out inconsistencies.

To avoid this situation, it is worth implementing uniform metric management standards.

🛠 Two approaches to ensure consistency

Semantic Layer
This is an intermediate layer between data and analytics tools, where metrics are defined centrally. The definitions are stored in static files (e.g., YAML) and used to automatically generate SQL queries.

💡 Pros:
✔️ Flexibility: adapts to different queries without pre-creating tables.
✔️ Transparency: uniform definitions are available to all teams.
✔️ Relevance: data is updated in real time.

⚠️ Cons:
Requires investment in infrastructure and optimization.
May increase computational load (though caching mitigates this).

📌 Example of a tool: Cube.js is one of the few mature open-source solutions.
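
A stripped-down illustration of the idea in Python (a sketch of the concept, not Cube.js itself): metrics are defined once, and every SQL query is generated from that shared definition.

# Central metric definitions (in practice these would live in YAML files)
METRICS = {
    "revenue": {"sql": "SUM(amount)", "table": "orders"},
    "active_users": {"sql": "COUNT(DISTINCT user_id)", "table": "events"},
}

def build_query(metric: str, group_by: str) -> str:
    """Generate SQL from the single shared definition of a metric."""
    m = METRICS[metric]
    return (f"SELECT {group_by}, {m['sql']} AS {metric} "
            f"FROM {m['table']} GROUP BY {group_by}")

# Every team gets the same definition of 'revenue' by construction
print(build_query("revenue", "region"))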

Pre-Aggregated Tables
Here, tables with pre-calculated metrics and fixed dimensions are created in advance.

💡 Pros:
✔️ Simple implementation, convenient for small projects.
✔️ Saving computing resources.
✔️ Full control over calculations.

⚠️ Cons:
Difficult to maintain as the number of users increases.
Discrepancies are possible if metrics are defined in different tables.

🤔 Which method to choose?
The optimal approach is a hybrid:
🔹 Implement a semantic layer for scalability.
🔹 Use pre-aggregated tables for critical metrics where minimal computation cost is important.

🔎 More details here
📊 FinMind — world-class open financial data for analysis and learning

FinMind is not just a collection of quotes but an entire ecosystem of financial data, free and open source. The project is aimed at researchers, students, investors, and enthusiasts who need access to high-quality, up-to-date data without paying for expensive subscriptions such as Bloomberg Terminal or Quandl.

🔍 What you can find in FinMind:
📈 Historical and intraday stock quotes (tick data, candlesticks, volumes)
📊 Financial metrics: PER, PBR, EPS, ROE, etc.
💵 Dividends, company reports, revenue
📉 Options and futures data
🏦 Central bank interest rates, inflation
🛢 Commodity markets and bonds

🧠 Features:
Data is regularly updated automatically
Convenient and easy-to-learn Python API
Documentation and training examples in English and Chinese
The ability to quickly build a backtest or conduct market research
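
Getting started takes a few lines (a sketch based on the project's README; the method name below may differ, so check the FinMind docs):

from FinMind.data import DataLoader

api = DataLoader()
# Daily candlesticks for TSMC (ticker 2330)
df = api.taiwan_stock_daily(
    stock_id="2330",
    start_date="2024-01-01",
    end_date="2024-12-31",
)
print(df.head())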

💡FinMind is ideal for:
Training courses on time series analysis, econometrics, ML in finance
Prototyping trading strategies without risk or cost
University research and hackathons

🤖 GitHub