🌎 TOP DS events around the world in February
Feb 4-6 - AI Everything Global – Dubaï, UAE - https://aieverythingglobal.com/home
Feb 5 - Open Day at DSTI – Paris, France - https://dsti.school/open-day-at-dsti-5-february-2025/
Feb 5-6 - The AI & Big Data Expo – London, UK - https://www.ai-expo.net/global/
Feb 6-7 - International Conference on Data Analytics and Business – New York, USA - https://sciencenet.co/event/index.php?id=2703381&source=aca
Feb 11 - AI Summit West - San Jose, USA - https://ai-summit-west.re-work.co/
Feb 12-13 - CDAO UK – London, UK - https://cdao-uk.coriniumintelligence.com/
Feb 13-14 - 6th National Big Data Health Science Conference – Columbia, USA - https://www.sc-bdhs-conference.org/
Feb 13-15 - WAICF – World AI Cannes Festival - Cannes, France - https://www.worldaicannes.com/
Feb 18 - adesso Data Day - Frankfurt / Main, Germany - https://www.adesso.de/de/news/veranstaltungen/adesso-data-day/programm.jsp
Feb 18-19 - Power BI Summit – Online - https://events.m365-summits.de/PowerBISummit2025-1819022025#/
Feb 18-20 - 4th IFC Workshop on Data Science in Central Banking – Rome, Italy - https://www.bis.org/ifc/events/250218_ifc.htm
Feb 19-20 - Data Science Day - Munich, Germany - https://wan-ifra.org/events/data-science-day-2025/
Feb 21 - ICBDIE 2025 – Suzhou, China - https://www.icbdie.org/submission
Feb 25 - Customer Data Trends 2025 – Online - https://www.digitalkonferenz.net/
Feb 26-27 - ICET-25 - Chongqing, China - https://itar.in/conf/index.php?id=2703680
🚀 BigQuery Metastore: A Unified Metadata Service with Apache Iceberg Support
Google has announced a highly scalable metadata service for Lakehouse architecture. The new runtime metastore supports multiple analytics engines, including BigQuery, Apache Spark, Apache Hive, and Apache Flink.
BigQuery Metastore unifies metadata access, allowing different engines to query a single copy of data. It also supports Apache Iceberg, simplifying data management in lakehouse environments.
😎 Key Benefits:
✅ Cross-compatibility – A single source of metadata for all analytics engines.
✅ Open format support – Apache Iceberg, external BigQuery tables.
✅ Built-in data governance – Access control, auditing, data masking.
✅ Fully managed service – No configuration required, automatically scales.
🤔 Why is this important?
Traditional metastores are tied to specific engines, requiring manual table definitions and metadata synchronization. This leads to stale data, security issues, and high admin costs.
🤔 What does this change?
BigQuery Metastore standardizes metadata management, making lakehouse architecture more accessible, simplifying analytics, and reducing infrastructure maintenance costs.
🔎 Learn more here
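💻 For a feel of the interoperability story, here is a minimal PySpark sketch that registers an Iceberg catalog backed by the shared metastore. The catalog-impl class name and the project/location/warehouse options are assumptions for illustration — check the Google Cloud docs for the exact Spark settings.
# Sketch: wiring Spark to an Iceberg catalog so that Spark and BigQuery
# can share one copy of the table metadata. Option names below are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bq-metastore-demo")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl",
            "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog")  # assumed class name
    .config("spark.sql.catalog.lakehouse.gcp_project", "my-project")     # assumed option
    .config("spark.sql.catalog.lakehouse.gcp_location", "us-central1")   # assumed option
    .config("spark.sql.catalog.lakehouse.warehouse", "gs://my-bucket/warehouse")
    .getOrCreate()
)

# Any engine pointed at the same metastore sees the same Iceberg table.
spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.demo.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO lakehouse.demo.events VALUES (1, 'hello')")
spark.sql("SELECT * FROM lakehouse.demo.events").show()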
Google Cloud Blog
Introducing BigQuery metastore fully managed metadata service | Google Cloud Blog
BigQuery metastore is a fully managed, unified metadata service that provides processing engine interoperability while enabling consistent data governance.
🔥 WILDCHAT-50M: The Largest Open Dialogue Dataset for Language Models
Researchers have introduced WILDCHAT-50M—the largest open dataset of its kind, containing an extensive collection of real chat data. Designed to enhance language model training, particularly in dialogue processing and user interactions, this dataset consists of over 125 million chat transcripts spanning more than a million conversations. It serves as a valuable resource for researchers and developers working on advanced AI language models.
🔍 Key Features of WILDCHAT-50M:
✅ Real-world conversational data – Unlike traditional datasets based on structured texts or curated dialogues, this dataset provides authentic user interactions.
✅ Developed for RE-WILD SFT – Supports Supervised Fine-Tuning (SFT), enabling models to adapt to realistic conversation scenarios and improve long-term dialogue coherence.
✅ A massive open benchmark – One of the largest publicly available datasets in its category, allowing developers to test, experiment, and refine their NLP models.
Most language model training datasets rely on structured articles or scripted dialogues. In contrast, WILDCHAT-50M captures the nuances of real conversations, helping models generate more natural, context-aware responses.
🚀 Why does it matter?
By leveraging datasets like WILDCHAT-50M, language models can significantly improve their ability to generate human-like responses, understand spoken language dynamics, and advance the development of AI-powered virtual assistants, chatbots, and dialogue systems.
With access to real-world conversational data, AI is moving closer to truly natural and intelligent communication.
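💻 If you want to poke at the data yourself, here is a minimal sketch using the Hugging Face datasets library. The repository id is an assumption — pick the exact dataset name from the nyu-dice-lab collection on huggingface.co.
# Sketch: stream a few WildChat-50m records without downloading the full corpus.
from datasets import load_dataset

ds = load_dataset(
    "nyu-dice-lab/wildchat-50m-extended",  # assumed repo id, verify on the Hub
    split="train",
    streaming=True,                        # iterate lazily over the huge dataset
)

for i, example in enumerate(ds):
    print(example)                         # each record holds one chat transcript
    if i >= 2:
        break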
huggingface.co
WildChat-50m - a nyu-dice-lab Collection
All model responses associated with the WildChat-50m paper.
😎🛠 Another Roundup of Big Data Tools
NocoDB - An open-source platform that turns relational databases (MySQL, PostgreSQL, SQLite, MSSQL) into a no-code interface for managing tables, creating APIs, and visualizing data. A powerful self-hosted alternative to Airtable, offering full data control.
DrawDB - A visual database modeling tool that simplifies schema design, editing, and visualization. It supports automatic SQL code generation and integrates with MySQL, PostgreSQL, and SQLite. Ideal for developers and analysts who need a quick, user-friendly way to design databases.
Dolt - A relational database with Git-like version control. It lets you track row-level changes, create branches, merge them, and view the full history of modifications while working with standard SQL queries (see the sketch after this list).
ScyllaDB - A high-performance NoSQL database that is fully compatible with Apache Cassandra but delivers lower latency and higher throughput. Optimized for modern multi-core processors, making it a good fit for high-load distributed systems.
Metabase - An intuitive business intelligence platform for visualizing data, creating reports, and building dashboards without deep SQL knowledge. It supports MySQL, PostgreSQL, MongoDB, and more, making data analysis more accessible.
Azimutt - A powerful ERD visualization tool for designing and analyzing complex databases. Features include interactive schema exploration, foreign key visualization, and problem detection, making it useful for both database development and auditing.
sync - A real-time data synchronization tool for MongoDB and MySQL. It uses Change Streams (MongoDB) and binlog replication (MySQL) to ensure incremental updates, fault tolerance, and seamless recovery. Great for distributed databases and analytics workflows.
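💻 A minimal sketch of Dolt's branch-commit-merge flow through plain SQL. It assumes a local dolt sql-server on the default MySQL port with a database named demo; the dolt_* calls are Dolt's documented versioning stored procedures.
# Sketch: Git-like versioning in Dolt via any MySQL client.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="127.0.0.1", port=3306, user="root", database="demo")
conn.autocommit = True
cur = conn.cursor(buffered=True)   # buffered: consume result sets returned by CALLs

cur.execute("CREATE TABLE IF NOT EXISTS users (id INT PRIMARY KEY, name VARCHAR(64))")
cur.execute("CALL DOLT_ADD('.')")                                 # stage changes
cur.execute("CALL DOLT_COMMIT('-m', 'create users table')")       # commit like Git
cur.execute("CALL DOLT_CHECKOUT('-b', 'experiment')")             # create and switch branch
cur.execute("INSERT INTO users VALUES (1, 'alice')")
cur.execute("CALL DOLT_COMMIT('-a', '-m', 'add alice on experiment')")
cur.execute("CALL DOLT_CHECKOUT('main')")
cur.execute("CALL DOLT_MERGE('experiment')")                      # merge the branch back
cur.execute("SELECT commit_hash, message FROM dolt_log LIMIT 5")  # row-level history
for row in cur.fetchall():
    print(row)
conn.close()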
GitHub
GitHub - nocodb/nocodb: 🔥 🔥 🔥 Open Source Airtable Alternative
🔥 🔥 🔥 Open Source Airtable Alternative. Contribute to nocodb/nocodb development by creating an account on GitHub.
🤔 Vector vs. Graph Databases: Which One to Choose?
When dealing with unstructured and interconnected data, selecting the right database system is crucial. Let’s compare vector and graph databases.
😎 Vector Databases
📌 Advantages:
✅ Optimized for similarity search (e.g., NLP, computer vision).
✅ High-speed approximate nearest neighbor (ANN) search.
✅ Efficient when working with embedding models.
⚠️ Disadvantages:
❌ Not suitable for complex relationships between objects.
❌ Limited support for traditional relational queries.
😎 Graph Databases
📌 Advantages:
✅ Excellent for handling highly connected data (social networks, routing).
✅ Optimized for complex relationship queries.
✅ Flexible data storage schema.
⚠️ Disadvantages:
❌ Slower for large-scale linear searches.
❌ Inefficient for high-dimensional vector processing.
🧐 Conclusion:
✅ If you need embedding-based search → Go for vector databases (Faiss, Milvus).
✅ If you need complex relationship queries → Use graph databases (Neo4j, ArangoDB).
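💻 A minimal Faiss sketch of the embedding-search case — random vectors stand in for real embeddings from an NLP or vision model:
# Sketch: nearest-neighbour search over embeddings with Faiss.
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128
rng = np.random.default_rng(0)
corpus = rng.random((10_000, dim), dtype="float32")   # 10k "document" embeddings
query = rng.random((1, dim), dtype="float32")

index = faiss.IndexFlatL2(dim)        # exact search; swap in IndexIVFFlat for true ANN
index.add(corpus)

distances, ids = index.search(query, 5)   # the 5 closest vectors
print(ids[0], distances[0])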
💡 News of the Day: Harvard Launches a Federal Data Archive from data.gov
Harvard’s Library Innovation Lab has unveiled an archive of data.gov on the Source Cooperative platform. The 16TB collection contains over 311,000 datasets gathered in 2024–2025, providing a complete snapshot of publicly available federal data.
The archive will be updated daily, ensuring access to up-to-date information for researchers, journalists, analysts, and the public. It includes datasets across various domains, such as environment, healthcare, economy, transportation, and agriculture.
Additionally, Harvard has released open-source software on GitHub for building similar repositories and data archiving solutions. This allows other organizations and research centers to develop their own public data archives. The project is supported by the Filecoin Foundation and the Rockefeller Brothers Fund.
Which method would you prefer to speed up join operations in Spark?
Anonymous Poll
29% – Using broadcast join
21% – Using sort-merge join instead of hash join
25% – Pre-partitioning data (bucketing)
25% – Adding a partition key to tables
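💻 For the most-voted option, a minimal PySpark sketch — broadcasting the small lookup table lets Spark join without shuffling the large side (table contents are illustrative):
# Sketch: broadcast join -- the small table is shipped to every executor.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

orders = spark.range(0, 10_000_000).withColumnRenamed("id", "customer_id")   # large side
customers = spark.createDataFrame(
    [(i, f"customer_{i}") for i in range(100)], ["customer_id", "name"]      # small side
)

joined = orders.join(broadcast(customers), on="customer_id", how="inner")
joined.explain()   # the plan should show BroadcastHashJoin instead of SortMergeJoin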
🔍 Key Big Data Trends in 2025
Experts at Xenoss have outlined the major trends shaping Big Data's future. Although former Google BigQuery engineer Jordan Tigani has predicted a possible “decline” of Big Data, analysts argue that the field is rapidly evolving.
🚀 Hyperscalable platforms are becoming essential for handling massive datasets. Advancements in NVMe SSDs, multi-threaded CPUs, and high-speed networks enable near-instant petabyte-scale analysis, unlocking new potential in AI & ML for predictive strategies based on historical and real-time data.
📊 Zero-party data is taking center stage, offering companies user-consented personalized data. When combined with AI & LLMs, it enhances forecasting and recommendations in media, retail, finance, and healthcare.
⚡️ Hybrid batch & stream processing is balancing speed and accuracy. Lambda architectures enable real-time event response while retaining deep historical data analysis capabilities.
🔧 ETL/ELT optimization is now a priority. Companies are shifting from traditional data processing pipelines to AI-powered ELT workflows that automate data filtering, quality checks, and anomaly detection.
🛠 Data orchestration is evolving, reducing data silos and simplifying management. Open-source tools like Apache Airflow and Dagster are making complex workflows more accessible and flexible.
🌎 Big Data → Big Ops: The focus is shifting from storing data to actively leveraging it in automated business operations—enhancing marketing, sales, and customer service.
🧩 Composable data stacks are gaining traction, allowing businesses to mix and match the best tools for different tasks. Apache Arrow, Substrait, and open table formats enhance flexibility while reducing vendor lock-in.
🔮 Quantum computing is beginning to revolutionize Big Data by tackling previously unsolvable problems. Industries like banking, healthcare, and logistics are already testing quantum-powered financial modeling, medical research, and route optimization.
💰 Balancing performance & cost is critical. Companies that fail to optimize their infrastructure face exponentially rising expenses. One AdTech firm, featured in the article, reduced its annual cloud budget from $2.5M to $144K by rearchitecting its data pipeline.
Xenoss - AI and Data Software Development Company
Top Big Data Trends in 2025
Big Data Trends 2025: evolving into a more sophisticated ecosystem combining AI, real-time processing, and advanced analytics.
🚀🐝 Hive vs. Spark for Distributed Processing: Pros & Cons
Apache Hive and Apache Spark are both powerful Big Data tools, but they handle distributed processing differently.
🔹 Hive: SQL Interface for Hadoop
Pros:
✅ Scales well for massive datasets (stored in HDFS)
✅ SQL-like language (HiveQL) makes it user-friendly
✅ Great for batch processing
Cons:
❌ High query latency (relies on MapReduce/Tez)
❌ Slower compared to Spark
❌ Limited real-time stream processing capabilities
🔹 Spark: Fast Distributed Processing
Pros:
✅ In-memory computing → high-speed performance
✅ Supports real-time data processing (Structured Streaming)
✅ Flexible: works with HDFS, S3, Cassandra, JDBC, and more
Cons:
❌ Requires more RAM
❌ More complex to manage
❌ Less efficient for archived big data batch processing
💡 Conclusions:
✅ Use Hive for complex SQL queries & batch processing.
✅ Use Spark for real-time analytics & fast data processing.
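💻 A minimal PySpark sketch of both sides of the comparison — a batch HiveQL query run through Spark with Hive support enabled, and a Structured Streaming read that Hive cannot do natively. The table name and input path are illustrative, and a reachable Hive metastore is assumed:
# Sketch: batch query against a Hive table, plus a streaming aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (
    SparkSession.builder.appName("hive-vs-spark-demo")
    .enableHiveSupport()              # use the Hive metastore and HiveQL tables
    .getOrCreate()
)

# Batch: the kind of SQL Hive is good at, executed by Spark's engine.
spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY order_date
""").show()

# Streaming: Structured Streaming over incoming JSON files (illustrative path).
schema = StructType([
    StructField("order_date", StringType()),
    StructField("amount", DoubleType()),
])
stream = (
    spark.readStream.schema(schema).json("/data/incoming/")
    .groupBy("order_date").sum("amount")
    .writeStream.outputMode("complete").format("console").start()
)
stream.awaitTermination(timeout=60)   # run the micro-batch query for a minute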
🗂 VAST Data is Changing the Game in Data Storage
According to experts, VAST Data is taking a major step toward creating a unified data storage platform by adding block storage support and built-in event processing.
✅ Unified Block Storage now integrates all key protocols (files, objects, tables, data streams), eliminating fragmented infrastructure. This provides a powerful, cost-effective solution for AI and analytics-driven companies.
✅ VAST Event Broker is positioned as a replacement for complex event-driven systems like Kafka, providing built-in real-time data streaming. AI and analytics can now receive events instantly without additional software.
🚀 Key Features:
✅ Accelerated AI analytics with real-time data delivery
✅ Full compatibility with MySQL, PostgreSQL, Oracle, and cloud services
✅ Scalable architecture with no performance trade-offs
🔎 Read more here
Database Trends and Applications
VAST DataStore Becomes Universal, Multiprotocol Storage Platform with Block Storage and Event-Processing
VAST Data, the AI data platform company, is announcing two significant advancements for the VAST Data Platform, unveiling Block storage functionality for the VAST DataStore, as well as the new VAST Event Broker. These latest capabilities aim to better accommodate…
You have a DataFrame with missing values at random locations. Which data processing method do you consider most robust?
Anonymous Poll
36% – Fill with median for numeric and mode for categorical features
17% – Remove all rows with missing values
28% – Linear regression interpolation based on other features
19% – Fill with mean for numeric and "Unknown" for categorical
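💻 For the most-voted option, a minimal pandas sketch — medians for numeric columns, modes for categorical ones (the toy DataFrame is illustrative):
# Sketch: impute numeric columns with the median, categorical columns with the mode.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
    "city": ["Paris", None, "Berlin", "Paris", None],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)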
🌎 TOP DS events around the world in March
Mar 1 - Open Data Day Flensburg - Flensburg, Germany - https://opendataday-flensburg.de/
Mar 3 – ICMBDC - Shanghai, China - https://asar.org.in/Conference/55676/ICMBDC/
Mar 3-6 - Mobile World Congress – Barcelona, Spain - https://www.mwcbarcelona.com/
Mar 3-5 - Gartner Data & Analytics Summit – Orlando, USA - https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 4 – ElasticON – Singapore - https://www.elastic.co/events/elasticon/singapore
Mar 5-7 – PGConf - Bengaluru, India - https://pgconf.in/conferences/pgconfin2025
Mar 7 - Webinar 'From data to metadata: enhancing quality across borders' – Online - https://dataeuropaacademy.clickmeeting.com/webinar-from-data-to-metadata-enhancing-quality-across-borders-/register
Mar 11-12 - Data Spaces Symposium 2025 - Warsaw, Poland - https://www.data-spaces-symposium.eu/
Mar 12-13 - Big Data & AI World – London, UK - https://www.bigdataworld.com/
Mar 15 - Open Data Day - Timisoara, Romania - https://tm.opendataday.ro/
Mar 17-18 – ALT DATA – Singapore - https://www.battlefin.com/events/asia-2025
Mar 19 - EU Open Data Days 2025 – Luxembourg - https://data.europa.eu/en/news-events/events/eu-open-data-days-2025
Mar 20 - International Conference on Big Data and Smart Computing – Washington, DC, USA - https://ijieee.org.in/Conference/14165/ICBDSC/
Mar 21 - Open Source Day 2025 - Florence, Italy - https://osday.dev/
Mar 26 – ICAIMLBDE - Philadelphia, USA - https://isete.org/Conference/26577/ICAIMLBDE/
Mar 27 – MLConf - New York, USA - https://mlconf.com/
Mar 29 – ICBDS - Boston, USA - https://bigdataresearchforum.com/Conference/472/ICBDS/
Mar 31 – Apr 2 - Data Science Leadership Summit – Columbus, USA - https://academicdatascience.org/events/adsa-meetings/2025-data-science-leadership-summit/
opendataday-flensburg.de
Open Data Day Flensburg – March 1, 2025
Learn how open data is changing our society. Workshops, talks & networking at Open Data Day Flensburg 2025.
🐼 Pandas showing its age? FireDucks offers a drop-in replacement with no code rewrite
Pandas is the most popular library for data processing, but it has long suffered from low performance. Modern alternatives like Polars significantly outperform it, but switching frameworks means learning a new API, which deters many developers.
🔥 FireDucks addresses this by keeping full Pandas API compatibility while adding multi-threaded execution and compiler acceleration. Switching takes a single change to one import line:
import fireducks.pandas as pd
According to the project's benchmarks, FireDucks outperforms both Pandas and Polars:
🔗 FireDucks GitHub repository: https://github.com/fireducks-dev/fireducks
🔗 Comparison with Polars and Pandas: https://github.com/fireducks-dev/fireducks/blob/main/notebooks/FireDucks_vs_Pandas_vs_Polars.ipynb
🔗 Detailed benchmarks: https://fireducks-dev.github.io/docs/benchmarks/
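💻 A minimal sketch of the drop-in swap, assuming FireDucks mirrors the pandas API as its docs claim (the CSV path and column names are illustrative):
# Sketch: the only change from a pandas script is the import line.
import fireducks.pandas as pd   # was: import pandas as pd

df = pd.read_csv("events.csv")                      # same pandas-style calls as before
summary = (
    df.groupby("user_id")["duration"]
      .agg(["count", "mean"])
      .sort_values("count", ascending=False)
)
print(summary.head(10))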
🎲 Conditional Probability: Updating Beliefs with New Data
As we receive new information, our perception of event probabilities changes. This is the core idea of conditional probability, widely used in machine learning, medicine, finance, and more.
💡 Simple Examples:
🔹 Drawing a King from a deck: 4/52. If we know the card is a face card, the probability increases to 4/12.
🔹 Rolling a 6 on a die: 1/6. If we know the number is even, the probability jumps to 1/3.
💡 Real-World Applications:
✅ Medicine – Evaluating test accuracy (sensitivity, specificity, false positives).
✅ Finance – Assessing market risks, default probability of borrowers.
✅ Machine Learning – Spam filtering, medical diagnosis, credit scoring.
📌 Bayes' Theorem allows us to update probabilities as new data arrives. For instance, a positive test for a rare disease doesn’t necessarily mean a patient is sick—probability depends on disease prevalence and test accuracy.
🔎 Learn more: 👉 Conditional Probability
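💻 A small worked sketch of the rare-disease example, with assumed numbers (1% prevalence, 99% sensitivity, 95% specificity):
# Sketch: Bayes' theorem for a positive test on a rare disease. Numbers are assumed.
prevalence = 0.01      # P(disease)
sensitivity = 0.99     # P(positive | disease)
specificity = 0.95     # P(negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")   # ≈ 0.167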
Datacamp
Conditional Probability: A Close Look
Conditional probability is the likelihood of an event occurring given another has happened, found by dividing the joint probability by the event's probability.
🔥 Everything to Markdown (E2M): Convert Anything to Markdown in Seconds!
Need to quickly and efficiently convert various file formats to Markdown? Check out Everything to Markdown (E2M) — a Python library that does it automatically!
📌 What Can E2M Do?
E2M supports conversion from multiple formats:
✅ Text documents: DOC, DOCX, EPUB
✅ Web pages: HTML, HTM, URL
✅ Presentations & PDFs: PPT, PPTX, PDF
✅ Audio files: MP3, M4A (speech recognition)
🤔 How Does It Work?
The conversion process is powered by two key modules:
🔹 Parser – Extracts text and images from files.
🔹 Converter – Transforms them into Markdown.
🎯 Why Use E2M?
Its main goal is to create structured text data for:
🚀 Retrieval-Augmented Generation (RAG)
🤖 Training & fine-tuning language models
📚 Effortless documentation creation
💡 Why Is It Useful?
E2M automates tedious work, enabling fast data structuring. Since Markdown is a universal format, it integrates seamlessly into any system.
You trained a model and got AUC-ROC = 0.95. What would you do to verify the model's quality?
Anonymous Poll
33% – Check the stability of the metric on cross-validation
21% – Evaluate Precision-Recall for imbalanced classes
18% – Test on a holdout set not used for training
28% – Check for data leakage between training and testing
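💻 For the most-voted option, a minimal scikit-learn sketch — is the 0.95 stable across cross-validation folds? Synthetic data stands in for the real (possibly leaky) dataset:
# Sketch: check AUC-ROC stability across stratified folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print("mean ± std:  ", scores.mean().round(3), "±", scores.std().round(3))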
📊 Apache Iceberg vs Delta Lake vs Hudi: Which Format is Best for AI/ML?
Choosing the right data storage format is crucial for machine learning (ML) and analytics. The wrong choice can lead to slow queries, poor scalability, and data integrity issues.
🔥 Why Does Format Matter?
Traditional data lakes struggle with:
🚧 No ACID transactions – risk of read/write conflicts
📉 No data versioning – hard to track changes
🐢 Slow queries – large datasets slow down analytics
💡 Apache Iceberg – Best for Analytics & Batch Processing
📌 When to Use?
✅ Handling historical datasets
✅ Need for query optimization & schema evolution
✅ Batch processing is a priority
📌 Key Advantages
✅ ACID transactions with snapshot isolation
✅ Time travel – restore previous versions of data
✅ Hidden partitioning – speeds up queries
✅ Supports Spark, Flink, Trino, Presto
📌 Use Cases
🔸 BI & trend analysis
🔸 Data storage for ML model training
🔸 Audit logs & rollback scenarios
💡 Delta Lake – Best for AI/ML & Streaming Workloads
📌 When to Use?
✅ Streaming data is critical for ML
✅ Need true ACID transactions
✅ Working primarily with Apache Spark
📌 Key Advantages
✅ Deep Spark integration
✅ Incremental updates (avoids full dataset rewrites)
✅ Z-Ordering – clusters similar data for faster queries
✅ Time travel – rollback & restore capabilities
📌 Use Cases
🔹 Real-time ML pipelines (fraud detection, predictive analytics)
🔹 ETL workflows
🔹 IoT data processing & logs
💡 Apache Hudi – Best for Real-Time Updates
📌 When to Use?
✅ Need fast real-time analytics
✅ Data needs frequent updates
✅ Working with Apache Flink, Spark, or Kafka
📌 Key Advantages
✅ ACID transactions & version control
✅ Merge-on-Read (MoR) – update without rewriting entire datasets
✅ Optimized for real-time ML (fraud detection, recommendations)
✅ Supports micro-batching & streaming
📌 Use Cases
🔸 Fraud detection (bank transactions, security monitoring)
🔸 Recommendation systems (e-commerce, streaming services)
🔸 AdTech (real-time bidding, personalized ads)
🤔 Which Format is Best for AI/ML?
✅ Iceberg – Best for historical data and BI analytics
✅ Delta Lake – Best for AI/ML, streaming, and Apache Spark
✅ Hudi – Best for frequent updates & real-time ML (fraud detection, recommendations, AdTech)
🔗 Full breakdown here
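💻 A minimal PySpark sketch of the time-travel feature both Iceberg and Delta advertise, shown here with Delta Lake. It assumes a Spark session configured with the delta-spark package; the storage path is illustrative:
# Sketch: Delta Lake time travel -- read the table as it was at an earlier version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-time-travel-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/data/lake/transactions"
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)    # version 0
spark.range(100, 200).write.format("delta").mode("overwrite").save(path)  # version 1

current = spark.read.format("delta").load(path)
as_of_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)  # time travel
print(current.count(), as_of_v0.count())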
🛠 Another Roundup of Tools for Data Management, Storage, and Analysis
🔹 DrawDB – A visual database schema design tool that simplifies database design and interaction. Its graphical interface allows developers to create and visualize database structures without writing complex SQL queries.
🔹 Hector RAG – A Retrieval-Augmented Generation (RAG) framework built on PostgreSQL. It enhances AI applications by combining retrieval and text generation, improving response accuracy and efficiency in search-enhanced LLMs.
🔹 ERD Lab – A free online tool for designing and visualizing Entity-Relationship Diagrams (ERD). Users can import SQL scripts or create new databases without writing code, making it an ideal solution for database design and documentation.
🔹 SuperMassive – A distributed, fault-tolerant in-memory key-value database designed for high-performance applications. It provides low-latency access and self-recovery, making it perfect for mission-critical workloads.
🔹 Smallpond – A lightweight data processing framework built on DuckDB and 3FS. It enables high-performance analytics on petabyte-scale datasets without requiring long-running services or complex infrastructure.
🔹 ingestr – A CLI tool for seamless data migration between databases like Postgres, BigQuery, Snowflake, Redshift, Databricks, DuckDB, and more. Supports full refresh & incremental updates with append, merge, or delete+insert strategies.
🚀 Whether you’re designing databases, optimizing AI pipelines, or managing large-scale data workflows, these tools will streamline your work and boost productivity!
GitHub
GitHub - drawdb-io/drawdb: Free, simple, and intuitive online database diagram editor and SQL generator.
Free, simple, and intuitive online database diagram editor and SQL generator. - drawdb-io/drawdb
💡 Master SQL Easily: A Hands-On Training Site (SQL Practice)
Looking to sharpen your SQL skills with real-world examples? This site is a great choice!
🔹 Format – Exercises are based on a hospital database, making them relevant to real-life SQL use cases.
🔹 Difficulty Levels – Start with basic SELECT queries and gradually advance to joins, subqueries, window functions, and query optimization.
🔹 Practical Benefits – Especially useful for healthcare analysts, data professionals, and developers working with medical systems.
🔹 Perfect for Preparation – Ideal for interview prep, certifications, or simply improving SQL proficiency.
🚀 This resource not only teaches SQL but also helps you understand how to work with data in a medical context effectively!
📚 Book Review: Apache Pulsar in Action
Author: David Kjerrumgaard
"Apache Pulsar in Action" is a practical guide to using Apache Pulsar, a powerful platform for real-time messaging and data streaming. While primarily targeting experienced Java developers, it also includes Python examples, making it useful for professionals from various technical backgrounds.
🔍 What’s Inside?
The author explores Apache Pulsar’s architecture and its key advantages over messaging systems like Kafka and RabbitMQ, highlighting:
🔹 Multi-protocol support (MQTT, AMQP, Kafka binary protocol).
🔹 High fault tolerance & scalability in cloud environments.
🔹 Pulsar Functions for developing microservice applications.
💡 Who Should Read It?
📌 Microservices developers – Learn how to integrate Pulsar into distributed systems.
📌 DevOps engineers – Get insights on deployment and monitoring.
📌 Data processing specialists – Discover streaming analytics techniques.
🤔 Pros & Cons
✅ Comprehensive development & architecture guide.
✅ Hands-on approach with Java & Python code examples.
✅ Accessible for developers at various experience levels.
❌ Lacks real-world case studies, making it harder to adapt Pulsar for specific business use cases.
🏆 Final Verdict
"Apache Pulsar in Action" is a valuable resource for those looking to master streaming data processing and scalable distributed systems. While it could benefit from more industry-specific case studies, it remains an excellent hands-on guide for understanding and implementing Apache Pulsar.
📕 Think Stats — The Best Free Guide to Statistics for Python Developers
Think Stats is a hands-on guide to statistics and probability, designed specifically for Python developers. Unlike traditional textbooks, it dives straight into coding, helping you master statistical methods using real-world data and practical exercises.
🔍 Why Think Stats Stands Out
✅ Practical focus – Minimal complex math, maximum real-world applications.
✅ Fully integrated with Python – The book is structured as Jupyter Notebooks, allowing you to run code and see results instantly.
✅ Real dataset analysis – Includes demographic data, medical research, and social media analytics.
✅ Data Science-oriented – The learning approach is tailored for analysts, developers, and data scientists.
✅ Easy to read – Concepts are explained in a clear and accessible manner, making it beginner-friendly.
📚 What’s Inside?
🔹 Core statistics and probability concepts in a programming context.
🔹 Data cleaning, processing, and visualization techniques.
🔹 Deep dive into distributions (normal, binomial, Poisson, etc.).
🔹 Parameter estimation, confidence intervals, and hypothesis testing.
🔹 Bayesian analysis, increasingly popular in Data Science.
🔹 Introduction to regression, forecasting, and statistical modeling.
🎯 Who Should Read It?
✅ Python developers wanting to learn statistics through coding.
✅ Data scientists & analysts looking for practical knowledge.
✅ Students & self-learners who need real-world applications of statistics.
✅ ML engineers who need a strong foundation in statistical methods.
🤔 Why You Should Read Think Stats
📌 No fluff, just practical statistics that you can apply immediately.
📌 Free and open-source (Creative Commons license) – Download, copy, and share freely.
📌 Jupyter Notebook integration for a hands-on learning experience.
💡Think Stats is a must-have resource for anyone who wants to learn and apply statistics effectively in Python. Whether you're a beginner or an experienced developer, this book will boost your data science skills!
💻 GitHub
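💻 A small sketch in the spirit of the book's exercises — a confidence interval and a two-sample t-test with scipy, on synthetic data standing in for the book's survey datasets:
# Sketch: estimate a mean with a 95% CI and compare two groups with a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=7.2, scale=1.1, size=200)   # e.g. birth weights, group A
group_b = rng.normal(loc=7.0, scale=1.1, size=200)   # group B

mean_a = group_a.mean()
sem_a = stats.sem(group_a)
ci_low, ci_high = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean_a, scale=sem_a)
print(f"group A mean = {mean_a:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")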
GitHub
GitHub - AllenDowney/ThinkStats: Notebooks for the third edition of Think Stats
Notebooks for the third edition of Think Stats. Contribute to AllenDowney/ThinkStats development by creating an account on GitHub.