📊 Apache Iceberg vs Delta Lake vs Hudi: Which Format is Best for AI/ML?
Choosing the right data storage format is crucial for machine learning (ML) and analytics. The wrong choice can lead to slow queries, poor scalability, and data integrity issues.
🔥 Why Does Format Matter?
Traditional data lakes struggle with:
🚧 No ACID transactions – risk of read/write conflicts
📉 No data versioning – hard to track changes
🐢 Slow queries – large datasets slow down analytics
💡 Apache Iceberg – Best for Analytics & Batch Processing
📌 When to Use?
✅ Handling historical datasets
✅ Need for query optimization & schema evolution
✅ Batch processing is a priority
📌 Key Advantages
✅ ACID transactions with snapshot isolation
✅ Time travel – restore previous versions of data
✅ Hidden partitioning – speeds up queries
✅ Supports Spark, Flink, Trino, Presto
📌 Use Cases
🔸 BI & trend analysis
🔸 Data storage for ML model training
🔸 Audit logs & rollback scenarios
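To make the advantages above concrete, here is a minimal PySpark sketch of Iceberg's hidden partitioning and time travel. It is a sketch under assumptions: the iceberg-spark-runtime package is on the Spark classpath, a local Hadoop catalog is used, and the catalog, namespace, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register a local Hadoop-type Iceberg catalog named "demo" (placeholder names).
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: partition by days(event_ts) without exposing a partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

spark.sql("INSERT INTO demo.db.events VALUES (1, TIMESTAMP '2024-05-01 10:00:00', 'signup')")
spark.sql("INSERT INTO demo.db.events VALUES (2, TIMESTAMP '2024-05-02 09:30:00', 'purchase')")

# Time travel: read the table as of its first snapshot (only the first row is visible).
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.read.option("snapshot-id", first_snapshot).table("demo.db.events").show()
```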
💡 Delta Lake – Best for AI/ML & Streaming Workloads
📌 When to Use?
✅ Streaming data is critical for ML
✅ Need true ACID transactions
✅ Working primarily with Apache Spark
📌 Key Advantages
✅ Deep Spark integration
✅ Incremental updates (avoids full dataset rewrites)
✅ Z-Ordering – clusters similar data for faster queries
✅ Time travel – rollback & restore capabilities
📌 Use Cases
🔹 Real-time ML pipelines (fraud detection, predictive analytics)
🔹 ETL workflows
🔹 IoT data processing & logs
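In the same spirit, here is a rough PySpark sketch of the Delta Lake features above — MERGE-based incremental updates, Z-Ordering, and time travel. It assumes the delta-spark package is installed (OPTIMIZE ... ZORDER requires Delta Lake 2.0+); the path and column names are invented.

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/transactions"
spark.range(0, 1000).withColumnRenamed("id", "tx_id") \
    .write.format("delta").mode("overwrite").save(path)

# Incremental update: MERGE new records instead of rewriting the whole dataset.
updates = spark.range(990, 1010).withColumnRenamed("id", "tx_id")
table = DeltaTable.forPath(spark, path)
(table.alias("t")
 .merge(updates.alias("u"), "t.tx_id = u.tx_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Z-Ordering: cluster data files by a frequently filtered column.
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (tx_id)")

# Time travel: read the table as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())
```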
💡 Apache Hudi – Best for Real-Time Updates
📌 When to Use?
✅ Need fast real-time analytics
✅ Data needs frequent updates
✅ Working with Apache Flink, Spark, or Kafka
📌 Key Advantages
✅ ACID transactions & version control
✅ Merge-on-Read (MoR) – update without rewriting entire datasets
✅ Optimized for real-time ML (fraud detection, recommendations)
✅ Supports micro-batching & streaming
📌 Use Cases
🔸 Fraud detection (bank transactions, security monitoring)
🔸 Recommendation systems (e-commerce, streaming services)
🔸 AdTech (real-time bidding, personalized ads)
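And a minimal sketch of a Merge-on-Read upsert with Hudi's Spark datasource. It assumes the hudi-spark bundle is on the classpath; the table name, path, and columns are placeholders.

```python
from pyspark.sql import Row, SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "transactions",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # MoR: updates land in delta log files
    "hoodie.datasource.write.recordkey.field": "tx_id",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}
path = "/tmp/hudi/transactions"

batch1 = spark.createDataFrame([
    Row(tx_id=1, region="eu", amount=10.0, ts=1),
    Row(tx_id=2, region="eu", amount=20.0, ts=1),
])
batch1.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# Upsert: only the changed record is merged; the rest of the dataset is untouched.
batch2 = spark.createDataFrame([Row(tx_id=2, region="eu", amount=25.0, ts=2)])
batch2.write.format("hudi").options(**hudi_options).mode("append").save(path)

spark.read.format("hudi").load(path).select("tx_id", "amount").orderBy("tx_id").show()
```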
🤔 Which Format is Best for AI/ML?
✅ Iceberg – Best for historical data and BI analytics
✅ Delta Lake – Best for AI/ML, streaming, and Apache Spark
✅ Hudi – Best for frequent updates & real-time ML (fraud detection, recommendations, AdTech)
🔗 Full breakdown here
🛠 Another Roundup of Tools for Data Management, Storage, and Analysis
🔹 DrawDB – A free, intuitive online database diagram editor and SQL generator that simplifies database design. Its graphical interface lets developers create and visualize database structures without writing complex SQL by hand.
🔹 Hector RAG – A Retrieval-Augmented Generation (RAG) framework built on PostgreSQL. It enhances AI applications by combining retrieval and text generation, improving response accuracy and efficiency in search-enhanced LLMs.
🔹 ERD Lab – A free online tool for designing and visualizing Entity-Relationship Diagrams (ERD). Users can import SQL scripts or create new databases without writing code, making it an ideal solution for database design and documentation.
🔹 SuperMassive – A distributed, fault-tolerant in-memory key-value database designed for high-performance applications. It provides low-latency access and self-recovery, making it perfect for mission-critical workloads.
🔹 Smallpond – A lightweight data processing framework built on DuckDB and 3FS. It enables high-performance analytics on petabyte-scale datasets without requiring long-running services or complex infrastructure.
🔹 ingestr – A CLI tool for seamless data migration between databases like Postgres, BigQuery, Snowflake, Redshift, Databricks, DuckDB, and more. Supports full refresh & incremental updates with append, merge, or delete+insert strategies.
🚀 Whether you’re designing databases, optimizing AI pipelines, or managing large-scale data workflows, these tools will streamline your work and boost productivity!
🔗 GitHub: drawdb-io/drawdb — free, simple, and intuitive online database diagram editor and SQL generator
💡 Master SQL Easily: A Hands-On Training Site (SQL Practice)
Looking to sharpen your SQL skills with real-world examples? This site is a great choice!
🔹 Format – Exercises are based on a hospital database, making them relevant to real-life SQL use cases.
🔹 Difficulty Levels – Start with basic SELECT queries and gradually advance to joins, subqueries, window functions, and query optimization.
🔹 Practical Benefits – Especially useful for healthcare analysts, data professionals, and developers working with medical systems.
🔹 Perfect for Preparation – Ideal for interview prep, certifications, or simply improving SQL proficiency.
🚀 This resource not only teaches SQL but also helps you understand how to work with data in a medical context effectively!
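Not affiliated with the site, but to give a flavor of the exercises, here is a tiny self-contained example of the kind of window-function query you will practice there — run against a made-up hospital table in SQLite (window functions need SQLite ≥ 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE admissions (patient_id INTEGER, ward TEXT, admitted DATE, cost REAL);
    INSERT INTO admissions VALUES
        (1, 'cardiology', '2024-01-03', 1200.0),
        (2, 'cardiology', '2024-01-05',  800.0),
        (3, 'neurology',  '2024-01-04', 1500.0),
        (4, 'neurology',  '2024-01-09',  700.0);
""")

# Rank admissions by cost within each ward -- a typical window-function exercise.
query = """
    SELECT ward, patient_id, cost,
           RANK() OVER (PARTITION BY ward ORDER BY cost DESC) AS cost_rank
    FROM admissions
    ORDER BY ward, cost_rank;
"""
for row in conn.execute(query):
    print(row)
```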
📚 Book Review: Apache Pulsar in Action
Author: David Kjerrumgaard
"Apache Pulsar in Action" is a practical guide to using Apache Pulsar, a powerful platform for real-time messaging and data streaming. While primarily targeting experienced Java developers, it also includes Python examples, making it useful for professionals from various technical backgrounds.
🔍 What’s Inside?
The author explores Apache Pulsar’s architecture and its key advantages over messaging systems like Kafka and RabbitMQ, highlighting:
🔹 Multi-protocol support (MQTT, AMQP, Kafka binary protocol).
🔹 High fault tolerance & scalability in cloud environments.
🔹 Pulsar Functions for developing microservice applications.
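To give a feel for the book's Python examples, here is a minimal produce/consume sketch with the pulsar-client library; the broker URL, topic, and subscription name are placeholders for a local standalone Pulsar instance.

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Produce a message.
producer = client.create_producer("persistent://public/default/demo-topic")
producer.send("hello pulsar".encode("utf-8"))

# Consume it back on a subscription.
consumer = client.subscribe(
    "persistent://public/default/demo-topic",
    subscription_name="demo-sub",
)
msg = consumer.receive(timeout_millis=5000)
print(msg.data().decode("utf-8"))
consumer.acknowledge(msg)

client.close()
```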
💡 Who Should Read It?
📌 Microservices developers – Learn how to integrate Pulsar into distributed systems.
📌 DevOps engineers – Get insights on deployment and monitoring.
📌 Data processing specialists – Discover streaming analytics techniques.
🤔 Pros & Cons
✅ Comprehensive development & architecture guide.
✅ Hands-on approach with Java & Python code examples.
✅ Accessible for developers at various experience levels.
❌ Lacks real-world case studies, making it harder to adapt Pulsar for specific business use cases.
🏆 Final Verdict
"Apache Pulsar in Action" is a valuable resource for those looking to master streaming data processing and scalable distributed systems. While it could benefit from more industry-specific case studies, it remains an excellent hands-on guide for understanding and implementing Apache Pulsar.
📕 Think Stats — The Best Free Guide to Statistics for Python Developers
Think Stats is a hands-on guide to statistics and probability, designed specifically for Python developers. Unlike traditional textbooks, it dives straight into coding, helping you master statistical methods using real-world data and practical exercises.
🔍 Why Think Stats Stands Out
✅ Practical focus – Minimal complex math, maximum real-world applications.
✅ Fully integrated with Python – The book is structured as Jupyter Notebooks, allowing you to run code and see results instantly.
✅ Real dataset analysis – Includes demographic data, medical research, and social media analytics.
✅ Data Science-oriented – The learning approach is tailored for analysts, developers, and data scientists.
✅ Easy to read – Concepts are explained in a clear and accessible manner, making it beginner-friendly.
📚 What’s Inside?
🔹 Core statistics and probability concepts in a programming context.
🔹 Data cleaning, processing, and visualization techniques.
🔹 Deep dive into distributions (normal, binomial, Poisson, etc.).
🔹 Parameter estimation, confidence intervals, and hypothesis testing.
🔹 Bayesian analysis, increasingly popular in Data Science.
🔹 Introduction to regression, forecasting, and statistical modeling.
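A small taste of the book's hands-on style — estimating a mean with a confidence interval and running a two-sample t-test, here on synthetic data (the book itself works with real datasets inside Jupyter notebooks):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=70.0, scale=10.0, size=200)   # stand-ins for two real samples
group_b = rng.normal(loc=72.5, scale=10.0, size=200)

# 95% confidence interval for the mean of group_a (normal approximation).
mean = group_a.mean()
sem = stats.sem(group_a)
ci_low, ci_high = stats.norm.interval(0.95, loc=mean, scale=sem)
print(f"mean={mean:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")

# Two-sample t-test: is the difference between the groups statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```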
🎯 Who Should Read It?
✅ Python developers wanting to learn statistics through coding.
✅ Data scientists & analysts looking for practical knowledge.
✅ Students & self-learners who need real-world applications of statistics.
✅ ML engineers who need a strong foundation in statistical methods.
🤔 Why You Should Read Think Stats
📌 No fluff, just practical statistics that you can apply immediately.
📌 Free and open-source (Creative Commons license) – Download, copy, and share freely.
📌 Jupyter Notebook integration for a hands-on learning experience.
💡Think Stats is a must-have resource for anyone who wants to learn and apply statistics effectively in Python. Whether you're a beginner or an experienced developer, this book will boost your data science skills!
💻 GitHub: AllenDowney/ThinkStats — notebooks for the third edition of Think Stats
🤔🗂 Google Research Develops Privacy-Preserving Synthetic Data Generation Method
Google Research has introduced a new method for generating synthetic data using differentially private LLM inference. This approach ensures data privacy while maintaining statistical utility, preventing leaks of sensitive information.
🔍 How Does It Work?
During text generation, Gaussian noise is added to LLM token distributions, preventing the reconstruction of original data. This ensures that individual examples in the training dataset do not significantly affect the output.
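As a toy illustration of that mechanism (not Google's implementation), here is a numpy sketch of adding Gaussian noise to per-token logits before sampling; the vocabulary and noise scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sample(logits: np.ndarray, noise_scale: float) -> int:
    """Sample a token id from noised logits (larger noise_scale ~ stronger privacy)."""
    noisy = logits + rng.normal(0.0, noise_scale, size=logits.shape)
    probs = np.exp(noisy - noisy.max())    # softmax over the noised logits
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

vocab_logits = np.array([2.0, 1.5, 0.3, -1.0])  # pretend model output for a 4-token vocabulary
print("low noise :", [dp_sample(vocab_logits, 0.1) for _ in range(10)])
print("high noise:", [dp_sample(vocab_logits, 5.0) for _ in range(10)])  # choices become more uniform
```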
🧐 Privacy Parameters (ε & δ):
🔹 Lower ε = stronger privacy, but lower text quality.
🔹 Recommended range: ε = 1–5, balancing privacy & utility.
🚀 Key Privacy Mechanisms
✅ Noise addition to model log probabilities before token selection.
✅ Gradient clipping during training to limit the influence of individual data points.
✅ Query grouping to minimize privacy risks from multiple model interactions.
📊 Testing Results
🔹 Synthetic data retains practical utility for training downstream models.
🔹 Formal privacy guarantees (ε < 5) without significant loss in data quality.
🛠 Where Can It Be Used?
💡 Training AI models on sensitive datasets (e.g., healthcare, finance).
💡 Algorithm testing without access to real data.
💡 Data sharing between organizations without privacy risks.
⚖️ Pros & Cons
✅ Privacy without losing functionality – secure data without major quality loss.
✅ Ethical AI usage in sensitive domains.
❌ Trade-off between quality & privacy – stronger privacy can reduce text coherence.
❌ Increased computational costs – additional privacy checks slow down generation.
🤖 Conclusion
Google Research’s approach sets new standards for handling confidential data while balancing security & usability. This could redefine AI ethics and data-sharing practices for personal & corporate data.
🔗 research.google: Generating synthetic data with differentially private LLM inference
📊 Anonymous poll: Which data compression method do you prefer for storing large arrays of numerical data?
🔹 57% — Using a columnar storage format (Parquet)
🔹 21% — Using Snappy or LZ4 algorithms
🔹 11% — Using delta encoding and RLE compression
🔹 11% — Combination of ZSTD and dictionary encoding
🌎TOP DS-events all over the world in April
Apr 1-2 - Healthcare NLP Summit – Online - https://www.nlpsummit.org/healthcare-2025/
Apr 1-4 - KubeCon + CloudNativeCon – London, UK - https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/
Apr 3 - AI Integration & Autonomous Mobility - Munich, Germany - https://www.automantia.in/aiam-munich
Apr 9-10 - Data & AI – Warsaw, Poland - https://dataiwarsaw.tech/
Apr 9-11 - Google Cloud Next – Las Vegas, USA - https://cloud.withgoogle.com/next/25
Apr 9-11 - Data Management ThinkLab - Prague, Czech Republic - https://thinklinkers.com/events/data_management_conference_event_europe_2025
Apr 10-12 - Strata Data & AI Conference - New York City, USA - https://www.oreilly.com/conferences/strata-data-ai.html
Apr 23-25 - PyCon DE & PyData 2025 - Darmstadt, Germany - https://2025.pycon.de/
Apr 23-25 – GITEX ASIA x AI Everything Singapore – Singapore, Singapore - https://gitexasia.com/
Apr 24 - Elevate: Data Management roles in the AI Era - Antwerp, Belgium - https://datatrustassociates.com/roundtableapril/
Apr 28-30 - Machine Learning Prague 2025 - Prague, Czech Republic - https://www.mlprague.com/
Apr 29 – May 2 - Symposium on Data Science and Statistics – Salt Lake City – USA - https://ww2.amstat.org/meetings/sdss/2025/
Apr 29-30 - CDAO Germany - Munich, Germany - https://cdao-germany.coriniumintelligence.com/
🚀 Hugging Face Releases Datasets for Pre-Training LLMs on Code Generation
Following the success of OlympicCoder-32B, which beat Sonnet 3.7 on LiveCodeBench and IOI 2024, Hugging Face has released a rich collection of datasets for pre-training and fine-tuning LLMs on programming tasks.
✅ Stack-Edu (125 billion tokens) – educational code in 15 programming languages, filtered from The Stack v2
✅ GitHub Issues (11 billion tokens) – data from discussions and bug reports on GitHub
✅ CodeForces problems (10K tasks) – a unique set of CodeForces problems, 3K of which were not used in DeepMind's training
✅ CodeForces problems DeepSeek-R1 (8.69 GB) – filtered traces of DeepSeek-R1 solving CodeForces problems
✅ International Olympiad in Informatics: problem statements dataset (2020–2024) – a unique set of Olympiad programming tasks, split into subtasks so that each prompt corresponds to the solution of one subtask
✅ International Olympiad in Informatics: DeepSeek-R1 CoT dataset (2020–2023) – 11,000 reasoning traces produced by DeepSeek-R1 while solving Olympiad programming tasks
💡 What to use it for?
🔹 LLM pre-training for code generation
🔹 Developing AI assistants for programmers
🔹 Improving solutions in computer olympiads
🔹 Creating ML models for code analysis
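A hedged sketch of streaming one of these datasets with the datasets library: the repo id comes from the post's link (HuggingFaceTB/stack-edu), while the config name "python" and the record schema are assumptions — check the dataset card for the exact configs and fields.

```python
from datasets import load_dataset

# streaming=True avoids downloading 125B tokens up front; the "python" config name is a guess.
ds = load_dataset("HuggingFaceTB/stack-edu", "python", split="train", streaming=True)

for i, example in enumerate(ds):
    print(list(example.keys()))   # inspect the schema instead of assuming field names
    if i >= 2:
        break
```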
🔗 huggingface.co: HuggingFaceTB/stack-edu
📊 How to avoid data chaos? Ways to ensure consistency of metrics in a warehouse
If you work with analytics, you have probably encountered a situation where the same metric is calculated differently in different departments. This leads to confusion, reduces trust in the data, and slows down decision-making. The new article discusses the key reasons for this problem and two effective solutions.
🤔 Why do metrics diverge?
The reason lies in the uncontrolled, organic growth of analytics:
🔹 One analyst writes a SQL query to calculate the metric.
🔹 Then other teams create their own versions based on this query, making minor changes.
🔹 Over time, discrepancies arise, and the analytics team spends more and more time sorting out inconsistencies.
To avoid this, it is worth adopting uniform standards for defining and managing metrics.
🛠 Two approaches to ensure consistency
✅Semantic Layer
This is an intermediate layer between data and analytics tools, where metrics are defined centrally. They are stored in static files (e.g. YAML) and used to automatically generate SQL queries.
💡 Pros:
✔️ Flexibility: adapts to different queries without pre-creating tables.
✔️ Transparency: uniform definitions are available to all teams.
✔️ Freshness: queries run against live data, so results are always up to date.
⚠️ Cons:
❌ Requires investment in infrastructure and optimization.
❌ May increase the load on calculations (but this can be solved by caching).
📌 Example of a tool: Cube.js is one of the few mature open-source solutions.
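To illustrate the idea without tying it to any particular tool (this is not Cube.js), here is a minimal Python sketch of a semantic layer: metrics live in one central structure and are compiled into SQL, so every team gets the same definition. Table and column names are invented.

```python
from typing import Optional

# Single source of truth: one definition per metric.
METRICS = {
    "active_users": {
        "sql": "COUNT(DISTINCT user_id)",
        "table": "events",
        "filters": ["event_type = 'login'"],
    },
    "revenue": {
        "sql": "SUM(amount)",
        "table": "orders",
        "filters": ["status = 'paid'"],
    },
}

def compile_metric(name: str, group_by: Optional[str] = None) -> str:
    """Generate SQL for a metric from its single central definition."""
    m = METRICS[name]
    select = [f"{m['sql']} AS {name}"]
    group_clause = ""
    if group_by:
        select.insert(0, group_by)
        group_clause = f" GROUP BY {group_by}"
    where = " AND ".join(m["filters"])
    return f"SELECT {', '.join(select)} FROM {m['table']} WHERE {where}{group_clause}"

print(compile_metric("revenue", group_by="order_date"))
# SELECT order_date, SUM(amount) AS revenue FROM orders WHERE status = 'paid' GROUP BY order_date
```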
✅Pre-Aggregated Tables
Here, tables with pre-calculated metrics and fixed dimensions are created in advance.
💡 Pros:
✔️ Simple implementation, convenient for small projects.
✔️ Saves compute resources.
✔️ Full control over calculations.
⚠️ Cons:
❌ Difficult to maintain as the number of users increases.
❌ Discrepancies are possible if metrics are defined in different tables.
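For contrast, a tiny DuckDB sketch of the pre-aggregation approach: the metric is computed once into a summary table with fixed dimensions, and dashboards query that table instead of the raw data. Table and column names are again invented.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, DATE '2024-01-01', 'paid',     100.0),
        (2, DATE '2024-01-01', 'paid',      40.0),
        (3, DATE '2024-01-02', 'refunded',  60.0)
    ) AS t(order_id, order_date, status, amount)
""")

# Pre-aggregated metric table: one row per day, definition fixed at build time.
con.execute("""
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    WHERE status = 'paid'
    GROUP BY order_date
""")

print(con.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall())
```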
🤔 Which method to choose?
The optimal approach is a hybrid:
🔹 Implement a semantic layer for scalability.
🔹 Use pre-aggregated tables for critical metrics where minimal computation cost is important.
🔎More details here
🔗 Start Data Engineering: How to ensure consistent metrics in your warehouse
📊 FinMind — world-class open financial data for analysis and learning
FinMind is not just a collection of quotes but an entire ecosystem of free, open-source financial data. The project targets researchers, students, investors, and enthusiasts who need access to high-quality, up-to-date data without paying for expensive subscriptions such as Bloomberg Terminal or Quandl.
🔍 What you can find in FinMind:
📈 Historical and intraday stock quotes (tick data, candlesticks, volumes)
📊 Financial metrics: PER, PBR, EPS, ROE, etc.
💵 Dividends, company reports, revenue
📉 Options and futures data
🏦 Central bank interest rates, inflation
🛢 Commodity markets and bonds
🧠 Features:
✅Data is regularly updated automatically
✅Convenient and easy-to-learn Python API
✅Documentation and training examples in English and Chinese
✅The ability to quickly build a backtest or conduct market research
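A hedged sketch of pulling daily quotes through that Python API: the class and method names below (DataLoader, taiwan_stock_daily) follow the project's quickstart as best I recall — treat them as assumptions and check the official docs; the ticker and dates are arbitrary.

```python
from FinMind.data import DataLoader

api = DataLoader()  # registering a token is optional but raises the rate limit (see the docs)

# Daily OHLCV quotes for TSMC (2330) over an arbitrary window.
df = api.taiwan_stock_daily(
    stock_id="2330",
    start_date="2024-01-01",
    end_date="2024-03-31",
)
print(df.head())
```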
💡FinMind is ideal for:
✅Training courses on time series analysis, econometrics, ML in finance
✅Prototyping trading strategies without risk or cost
✅University research and hackathons
🤖 GitHub
🔗 GitHub: FinMind/FinMind — open data, more than 50 financial datasets (mainly Taiwan stocks), updated daily — https://finmind.github.io/
📊 Anonymous poll: You receive data from an external API with an unstable structure. What would you do in preprocessing so that the entire pipeline does not crash?
🔹 16% — Write a schema validator with default values and logging
🔹 16% — Wrap parsing in try/except and send alerts on anomalies
🔹 40% — Save the “raw” data separately so it can be re-parsed later
🔹 28% — Map the API to an internal schema via an adapter or ETL layer
📚Winning with Data Science — a guide for business leaders in the digital age
Authors: Howard Steven Friedman and Akshay Swaminathan
🔍 What is the book about?
The book Winning with Data Science is not about writing Python code or building neural networks. It is about how businesses can get real value from data science projects, even if you are not a techie.
The authors explain how to be a competent “customer” of analytical solutions: ask the right questions, understand the stages of a data science project, and take part in shaping requirements and evaluating results.
💡 Key ideas of the book:
✅Data Science ≠ magic. It is a tool for solving specific business problems, and not a reason to chase hype.
✅The role of business is key. The business customer must clearly understand the problem, constraints, and desired outcome.
✅It is important to ask the right questions. Not "which algorithm is better?", but "will this help achieve the goal on time and on budget?"
✅Technical skills are not required, but a basic understanding of data, model types, storage, data quality, and model capabilities is desirable.
✅Ethics matters just as much. Models should not reinforce biases or discriminate.
🧠 Who will benefit from this:
✅Managers implementing analytics
✅Product managers and startup founders
✅Marketers and CFOs
✅Anyone who wants to effectively interact with data science teams
🤔General conclusion:
This book is a bridge between business and analysts. Without an overload of terms, but with a deep understanding of the processes.
📖 Columbia Business School Publishing, 2024. 272 pp. ISBN-13: 978-0231206860.
🎓 How Students Use AI at University — A Real-World Study of 1,000,000 Sessions from Anthropic
AI is increasingly permeating education, but most discussions have so far been based on surveys and lab experiments. In a new report, the Anthropic team conducted one of the largest studies of real-world student AI use, analyzing over 1 million anonymized conversations with Claude.ai.
🤔 Who Uses AI the Most?
✅Computer Science students lead by a wide margin: they account for 36.8% of all conversations with Claude, even though they make up only 5.4% of US graduates.
✅STEM disciplines dominate overall — students in science, math, and engineering use AI far more than students in business, medicine, or the humanities.
✅ Business makes up almost 19% of all degrees, but only 8.9% of all AI sessions. The picture is similar in healthcare and humanities.
🤔 Why students turn to AI: the most popular queries
✅Creating and editing educational materials - 39.3%
✅Solving assignments, coding, explaining theory - 33.5%
✅Analyzing and visualizing data - 11%
✅Assistance in research and tool development - 6.5%
✅Translations, proofreading, creating diagrams - the remaining percentages.
🤔How exactly do students interact with AI?
✅Direct problem solving
✅Direct content creation
✅Collaborative problem solving
✅Collaborative content creation
💡Interestingly, all four modes occur roughly equally often (23–29% each).
⚠️ The issue of academic honesty and thinking
The study showed that Claude most often performs high-level cognitive functions according to Bloom's taxonomy in conversations with students:
✅Creation — 39.8%
✅Analysis — 30.2%
✅Application, understanding, and memorization — much less often
⚡️ All this turns the familiar "Bloom's pyramid" upside down — and raises concerns: are students starting to "delegate" critical thinking operations to AI too early?
💡This study is just the beginning of a long journey, but it already provides a lot of food for thought for teachers, students, and university administrators.
💻The full text of the article is here
🔗 Anthropic Education Report: How University Students Use Claude
🌍 Geospatial Reasoning from Google: AI that understands geodata — and solves real problems
What if AI could not just "see" satellite images, but also understand what was happening on them? Google has launched a new large-scale project Geospatial Reasoning, combining powerful foundation models and generative AI to accelerate the analysis of geospatial data. This is not about theory, but about real-life scenarios: from assessing damage after a hurricane to improving urban planning and climate adaptation.
🔎 What's under the hood of Geospatial Reasoning?
✅Population Dynamics Foundation Model (PDFM) — a model that simulates population behavior and interaction with the environment;
✅A trajectory-based mobility model — for tracking and analyzing movement patterns;
✅New foundation models for remote sensing — trained on a huge array of satellite and aerial images with annotations.
🧠 How does it work?
The Geospatial Reasoning project allows you to combine the capabilities of Google models with your own data and create agent-based workflows. For example, after a hurricane, the system can:
✅Compare before and after images,
✅Identify which buildings are damaged,
✅Calculate estimated economic damage,
✅Assess the social vulnerability of affected areas,
✅Prioritize where assistance is needed first
🚀 Who has already joined?
✅Airbus — plans to use models to quickly analyze trillions of pixels of satellite data;
✅Maxar — integrates foundation models into its "living map of the Earth";
✅Planet Labs — accelerates the extraction of geoinsights for businesses and government agencies;
✅WPP (Choreograph) — uses PDFM to enhance media analytics based on behavioral patterns
📌 Why is this important?
Geodata is one of the most complex and potentially valuable classes of information: it is vast, heterogeneous, and often requires deep expertise to analyze.
💻 Read more in this article
🔗 research.google: Geospatial Reasoning — Unlocking insights with generative AI and multiple foundation models
😎Top Dataset Search Engines and Data Warehouses
✅Google Dataset Search - opens access to free public datasets. You can select data on various topics and in various formats, including .pdf, .csv, .jpg, .txt and others. Using it is as easy as a regular Google search: just type the name or topic you are interested in into the search bar. As you type, the system will suggest datasets with the necessary keywords - you may accidentally stumble upon something new and interesting.
✅World Bank Open Data - open data from the World Bank is considered one of the most extensive and diverse sources of statistical information and public datasets. You can search for data by various categories. The World Bank website is unique in that it offers free resources and tools for public use, such as Data Bank, a convenient tool for analyzing and visualizing large data sets
✅Data.world - this platform allows you to access free data sets and work with them directly on the site. All you need to do is create a free account, after which you will have access to 3 free projects. If necessary, you can upgrade to paid plans with a larger storage volume. Using the search bar, you can find keywords, resources, organizations, or users. And for a more precise search, you can use the “Create advanced filter” button to find exactly what you need.
✅DataHub is a data publishing platform (SaaS) developed by Datopian, where you can browse one of the most diverse collections of public data sets organized by topic. The platform also has a blog with materials on topics related to Big Data Science.
✅Humanitarian Data Exchange - a platform for searching datasets. Here you can search for free datasets and filter the results by criteria such as location, format, organization, and license. The platform also allows you to share data by different categories.
✅UCI Machine Learning Repository - although the least extensive of the resources mentioned, it remains useful for anyone who wants to build a machine learning model. Despite the limited number of datasets, you can search for data by task type, attribute type, data format, and application area.
✅Academic Torrents - if you are doing research, writing an article, or a master's thesis, Academic Torrents will be a great help. The platform offers a variety of large datasets from scientific publications, some as large as 2 terabytes. Using Academic Torrents is very simple: you can search for datasets, articles, courses, and collections, and upload your own data for others to work with. The datasets are free, but you will need a torrent client installed on your device to download them.
📊 Anonymous poll: You need to build CDC (Change Data Capture) from PostgreSQL. What do you prefer?
🔹 46% — Debezium + Kafka — the popular standard with log-based capture
🔹 29% — Apache NiFi — drag-and-drop streams from change logs
🔹 11% — LogMiner + Kafka Connect — if you also work with Oracle and want unification
🔹 14% — A custom agent based on logical replication + a custom JSON format
😱China sets the pace in medical technology
🤖 Robots in white coats
Dozens of medical robots work in the largest clinics in China. They not only disinfect and deliver medicines, but also perform complex procedures:
📌 At Zhongshan Hospital at Fudan University, a robot phlebotomist takes blood more accurately than a person
🧠 AI is not an assistant, but a full-fledged doctor
China has opened the world's first AI hospital, where patients are consulted by "virtual doctors"
⚙️ The system sees patients 24/7, analyzes symptoms, makes preliminary diagnoses, and even prescribes treatment.
🔬 Hundreds of thousands of consultations per month — without fatigue or human error
🌏 Medtech for export
✅ Chinese robots and AI systems are already working in clinics in Asia, Africa and Latin America.
✅ Negotiations are underway with the European Union and the Middle East.
🔮 What does this mean?
The future of medicine has already arrived. And it is made in China😁
💡X-AnyLabeling — professional AI tool for automatic data labeling
X-AnyLabeling is an extended and improved version of the popular open-source project AnyLabeling, created for industrial use. Thanks to AI support and dozens of CV models, the tool makes labeling much faster and more accurate than manual annotation.
💡 What X-AnyLabeling can do:
🔹 Automatic and semi-automatic annotation of images and videos
🔹 Support for 20+ computer vision models: YOLO, SAM, DETR, etc.
🔹 Working with real-time video — including object tracking on streams
🔹 Intuitive interface + collaboration mode
🔹 Export/import to all key annotation formats: COCO, VOC, YOLO, LabelMe, etc.
🔧 For whom:
✅ ML engineers and CV project teams
✅ Data analysts
✅ Researchers working with medical images, drones, security, etc.
✅ Companies creating commercial datasets
🔗 GitHub: CVHub520/X-AnyLabeling — effortless data labeling with AI support from Segment Anything and other models
🌎TOP DS-events all over the world in May
May 7-8 - Data Science Next Conference Europe 2025 - Amsterdam, Netherlands - https://dscnextcon.com/
May 7-8 - Data Innovation Summit – Stockholm, Sweden - https://datainnovationsummit.com/
May 12-13 – DataFest – Edinburgh, UK - https://thedatalab.com/datafest/
May 13-15 - Qlik Connect – Orlando, USA - https://www.qlikconnect.com/
May 14 - Real-Time Analytics Summit – Online - https://rtasummit.startree.ai/
May 14 - Rise of AI Conference - Berlin, Germany - https://dev.events/conferences/rise-of-ai-conference-2025-j9m-mlae
May 14-15 - Data Summit 2025 – Boston, USA - https://www.dbta.com/DataSummit/2025/default.aspx
May 14-16 - JOTB2025 - Malaga, Spain - https://jonthebeach.com/
May 20-21 - M3 Konferenz - Minds Mastering Machines - IHK Karlsruhe, Germany - https://dev.events/conferences/minds-mastering-machines-mgc-axtp
May 22 - DATA SCIENCE RUHRGEBIET CONGRESS - ZESS - Bochum, Germany - https://data-science.ruhr/
May 29-30 - The Data Science Conference – Chicago, USA - https://www.thedatascienceconference.com/
👨💻Data Science professionals need cutting-edge tools to stay ahead.
Here are 5 services that can boost work efficiency:
1️⃣ DataRobot — A platform that helps automate the process of creating and optimizing machine learning models, allowing even beginners to quickly implement data analysis solutions.
2️⃣ Hugging Face — A repository and tools for working with NLP models and transformers. Essential for processing text data.
3️⃣ RapidMiner — A tool for automating data analysis processes with minimal time investment. It includes powerful features for modeling and visualization.
4️⃣ Kaggle Kernels — A platform for data science learning and competitions, offering seamless access to vast datasets and cloud-based computing resources. Ideal for collaboration, experimentation, and sharpening your skills in a hands-on environment.
5️⃣ Neptune.ai — An experiment-tracking platform for machine learning, enabling teams to log, compare, and visualize model performance effortlessly. Ideal for collaborative projects requiring reproducibility and insight into complex workflows.
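As a quick illustration of item 2️⃣, the transformers pipeline API gets you from zero to a working NLP model in a few lines; the first call downloads a default sentiment-analysis model.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default English sentiment model
print(classifier("These tools save me hours of routine work every week."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```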
💡 Why is this important?
These tools help analysts and researchers save time on routine tasks and achieve results more quickly. Investing in modern services is becoming an important factor in boosting productivity in Data Science.