💡A small selection of useful tools for working with Big Data
postgres-backup-local is a Docker tool for backing up PostgreSQL databases to the local file system with flexible retention management. With it, you can back up multiple databases from one server by listing their names in the POSTGRES_DB environment variable (separated by a comma or space).
The tool supports webhooks before and after a backup, automatically rotates and deletes old copies, and is available for multiple Linux architectures, including amd64, arm64, arm/v7, s390x, and ppc64le.
EfCore.SchemaCompare is a tool for comparing database schemas in Entity Framework Core (EF Core), allowing you to find and analyze differences between the actual database and the schema your EF Core model and migrations expect. It provides a convenient way to track changes in data structures, which helps prevent errors caused by schema mismatches during application development.
It is suitable for database versioning and is especially useful when developing and upgrading EF Core-based applications.
Greenmask is an open-source tool for PostgreSQL designed for masking, obfuscation, and logical backup of data. It lets you anonymize sensitive information in database dumps, which is useful for preparing data for non-production environments such as development and testing. Greenmask helps protect data, meet privacy requirements, and reduce the risk of leaks during development.
GitHub - prodrigestivill/docker-postgres-backup-local: Backup PostgreSQL to the local filesystem with periodic backups and rotation.
😎How Spotify accelerated data annotation for ML by 10x
Spotify shared how it accelerated data annotation for machine learning models by combining large language models (LLMs) with the work of human annotators. Automated initial labeling with LLMs significantly reduced processing time by allowing annotators to focus on complex or ambiguous cases. This combined solution tripled process throughput and reduced costs. The scalable approach is especially relevant for a rapidly growing platform and is used to monitor compliance with service rules and policies.
💡 Spotify's data annotation strategy is based on three core principles:
✅Scaling human expertise: annotators validate and refine results to improve data accuracy.
✅Annotation tools: creating efficient tools that simplify the work of annotators and allow models to be integrated more quickly into the process.
✅Fundamental infrastructure and integration: the platform is designed to handle large amounts of data in parallel and run dozens of projects simultaneously.
This approach has allowed Spotify to run multiple projects simultaneously, reduce costs, and maintain high accuracy.
More information about Spotify's solution can be found in their whitepaper.
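For illustration only, here is a minimal Python sketch of the human-in-the-loop pattern described above. The llm_label and human_review functions, the confidence threshold, and the label names are all hypothetical placeholders, not Spotify's actual implementation:

```python
from dataclasses import dataclass
import random

@dataclass
class Annotation:
    item_id: str
    label: str
    confidence: float
    source: str  # "llm" or "human"

def llm_label(text: str) -> tuple[str, float]:
    # Hypothetical stand-in for an LLM call that returns (label, confidence).
    label = "policy_violation" if "spam" in text.lower() else "ok"
    return label, random.uniform(0.5, 1.0)

def human_review(item_id: str, text: str, suggested: str) -> str:
    # Placeholder for a real annotation-tool queue; here we just accept the suggestion.
    return suggested

def annotate(item_id: str, text: str, threshold: float = 0.9) -> Annotation:
    # LLM pre-labels everything; only low-confidence items reach human annotators.
    label, conf = llm_label(text)
    if conf >= threshold:
        return Annotation(item_id, label, conf, source="llm")
    return Annotation(item_id, human_review(item_id, text, label), 1.0, source="human")

print(annotate("item-1", "Buy cheap followers, total spam"))
```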
Spotify Engineering: How We Generated Millions of Content Annotations
😂A Radical Solution from AI
Every day, thousands of programmers can breathe a sigh of relief when AI handles routine tasks for them, such as writing queries or formatting data😁
🖥ChatGPT was asked to write SQL queries for a store database. The answer was priceless.
😎Sometimes AI's take on solving a problem differs slightly from a human's
What happens to the data after standardization is applied?
Anonymous Poll
35% - They get a minimum value of 0 and a maximum value of 1
49% - The mean becomes 0 and the standard deviation becomes 1
10% - All data are rounded to integer values
6% - The data is sorted in ascending order
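For reference, a minimal NumPy sketch of standardization (z-scoring): subtract the mean and divide by the standard deviation, so the result has mean 0 and standard deviation 1 (the sample values here are made up):

```python
import numpy as np

x = np.array([10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0])

z = (x - x.mean()) / x.std()  # standardized (z-scored) values

print(z.mean(), z.std())      # approximately 0.0 and 1.0 (up to float error)
```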
😎The Power of Data: Analyzing Quarterly Revenue Growth for Business Success
💡I recently came across an article in which the author talks about analyzing quarterly revenue growth. He argues that focusing only on annual data can hide trends and slow down decision making. Quarterly analysis gives a better picture of the current performance of the business and surfaces potential problems, such as a revenue dip in a particular period. This granularity helps you identify causes (such as seasonal fluctuations or marketing shortcomings) and act on them sooner than annual-only analysis allows. Quarterly data creates a foundation for optimizing growth strategies, moving from reactive to more effective data-driven management.
The author also highlights key metrics for analyzing quarterly revenue growth:
✅Customer Acquisition Cost (CAC): It is important to understand the cost of acquiring new customers to optimize marketing and sales efforts, which helps increase ROI and revenue growth.
✅Customer Lifetime Value (CLTV): This metric shows the total revenue a customer brings in over their entire relationship with the company, helping to identify high-yield segments for targeting and retention.
✅Sales Conversion: Analyzing conversion at each stage of the funnel helps identify bottlenecks and improve overall sales efficiency, which contributes to revenue growth.
🖥Link to the article
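As a simple illustration of the quarter-over-quarter view the author advocates, here is a small pandas sketch; the data and column names are made up ("QE" is the quarter-end alias in pandas 2.2+, use "Q" on older versions):

```python
import pandas as pd

# Toy daily revenue data; in practice this would come from your warehouse.
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
revenue = pd.Series(range(len(dates)), index=dates, name="revenue", dtype="float")

quarterly = revenue.resample("QE").sum()      # total revenue per quarter
qoq_growth = quarterly.pct_change() * 100     # quarter-over-quarter growth, %

print(pd.DataFrame({"revenue": quarterly, "qoq_growth_%": qoq_growth.round(1)}))
```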
Medium: The Power of Data: Analyzing Quarterly Revenue Growth for Business Success - Beyond the Numbers: Drive Business Growth with Quarterly Revenue Analysis
🧐Lex Fridman interviews Anthropic CEO Dario Amodei
😎Highlights:
✅Dario expressed optimism about the imminent emergence of AI capable of reaching human level. He noted that development and training costs will keep rising in the coming years, and by 2027, clusters worth around $100 billion will likely be built - significantly larger than today's biggest supercomputers, which cost around $1 billion.
✅Amodei believes that models will continue to scale, despite the lack of a theoretical explanation for this process - there is, in his words, some "magic" in it.
✅AI models are currently improving at an astonishing rate, especially in areas such as programming, physics, and mathematics. On the SWE-bench benchmark, their success rate at the beginning of the year was only 2-3%, and it now reaches about 50%. His main concern under these conditions is a possible monopoly on AI, with control ending up in the hands of a small number of large companies, which could threaten…
🖥You can watch the interview here
Why might the t-SNE visualization result differ on each run?
Anonymous Poll
66% - It uses a stochastic approach for optimization
16% - The algorithm is sensitive to the size of the input data
16% - The algorithm depends on the test data sampling
2% - The result display is based on linear transformations
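Related to the question above: t-SNE optimizes a non-convex objective from a random initialization, so fixing random_state makes runs reproducible. A short scikit-learn sketch with synthetic data and illustrative parameter values:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # toy high-dimensional data

emb1 = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(X)
emb2 = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(X)

print(np.allclose(emb1, emb2))  # same seed should give the same embedding
```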
🔎 Optimizing search in MongoDB
MongoDB is a non-relational database that differs in structure from SQL databases such as PostgreSQL or MySQL. Instead of tables with columns and rows, MongoDB uses collections of documents.
Text search in MongoDB relies on special query operators for working with text data. It lets you search for text phrases across collections and return documents containing the specified words. This is often combined with more complex operations where data is grouped by common attributes such as price, author, or age.
In the article, the author also shares his experience with MongoDB, including the challenges of writing optimal search queries, explained in a beginner-friendly way.
The article also mentions Mongoose, a popular ODM (Object Document Mapper) that simplifies the interaction between MongoDB and Node.js/JavaScript applications. It provides features for data modeling, schema design, model validation, and data management.
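A minimal PyMongo sketch of the text-search operators mentioned above, assuming a local MongoDB instance and a hypothetical books collection in a shop database:

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
books = client["shop"]["books"]

# A text index is required before $text queries can be used.
books.create_index([("title", TEXT), ("description", TEXT)])

cursor = books.find(
    {"$text": {"$search": "data engineering"}},
    {"score": {"$meta": "textScore"}, "title": 1},
).sort([("score", {"$meta": "textScore"})])

for doc in cursor:
    print(doc["title"], doc["score"])
```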
MongoDB: The World’s Leading Modern Database
😎💡AlphaQubit from Google: a new standard for accuracy in quantum computing.
Google DeepMind and Google Quantum AI have unveiled AlphaQubit, a decoder that dramatically improves error-correction accuracy in quantum computing. Built on a neural network trained on synthetic and real data from the Sycamore processor, AlphaQubit uses the Transformer architecture to analyze errors.
Tests showed that AlphaQubit makes 6% fewer errors than tensor-network methods and 30% fewer than correlated matching. However, despite the high level of accuracy, real-world speed and scalability issues remain.
✅Link to blog
Google: AlphaQubit tackles one of quantum computing’s biggest challenges - an AI-based decoder that identifies quantum computing errors with state-of-the-art accuracy.
🤔CUPED: advantages and disadvantages
CUPED (Controlled-experiment Using Pre-Experiment Data) is a data preprocessing technique used to improve the accuracy of A/B test evaluation. CUPED reduces the variance of metrics by using data collected before the experiment, allowing statistically significant differences to be detected more quickly.
Benefits of CUPED:
✅Reduces variance of metrics: Improves test sensitivity by accounting for prior data.
✅Resource savings: Reduces the sample size required to achieve statistical significance.
✅Faster interpretation of results: Reducing noise allows real effects to be found more quickly.
✅Accounting for seasonality: Using data before the experiment helps account for trends and external factors.
Disadvantages of CUPED:
✅Implementation complexity: Requires knowledge of statistics and proper choice of covariates.
✅Dependence on data quality: Pre-experimental data must be reliable and representative.
✅Necessity of covariates: A significant correlation between the metric and the covariate is required; otherwise the benefit will be negligible.
✅Risk of overestimation: If not properly adjusted, may lead to overestimation of the effect.
Thus, CUPED is particularly useful when it is important to maximize the efficiency of experiments but requires careful data preparation and analysis.
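A minimal NumPy sketch of the CUPED adjustment, using the pre-experiment value of the same metric as the covariate (the simulated numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated metric: pre-experiment values and correlated in-experiment values.
y_pre = rng.normal(100, 20, size=10_000)
y = y_pre * 0.7 + rng.normal(0, 10, size=10_000)  # experiment-period metric

theta = np.cov(y, y_pre)[0, 1] / np.var(y_pre)    # adjustment coefficient
y_cuped = y - theta * (y_pre - y_pre.mean())      # variance-reduced metric

print(f"var(y)       = {y.var():.1f}")
print(f"var(y_cuped) = {y_cuped.var():.1f}")      # noticeably smaller
print("mean preserved:", np.isclose(y.mean(), y_cuped.mean()))
```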
🤖Deus in Machina: Jesus-AI has been installed in a Swiss church
St. Peter's Chapel in Lucerne has launched an AI Jesus project that communicates in 100 languages. The AI is installed in the confessional where visitors can ask questions and receive answers in real time.
Trained on theological texts, Jesus-AI engaged more than 1,000 people in two months, two-thirds of whom described the experience as “spiritual.” However, the experiment has drawn criticism for the superficiality of the answers and the inability to have meaningful conversations with the machine.
🖥Read more here
💡 SmolTalk: a synthetic English-language dataset for LLM training
SmolTalk is a synthetic dataset from Hugging Face designed for supervised fine-tuning (instruction tuning) of LLMs. It consists of 2 million rows and was used to develop the SmolLM2-Instruct models.
🔥The dataset combines both newly created and existing datasets
😎New datasets:
✅Smol-Magpie-Ultra (400k rows)
✅Smol-constraints (36k rows)
✅Smol-rewrite (50k rows)
✅Smol-summarize (101k rows)
⚡️Existing datasets:
✅OpenHermes2.5 (100k rows)
✅MetaMathQA (50k rows)
✅NuminaMath-CoT (1,120k rows)
✅Self-Oss-Starcoder2-Instruct (1,120k rows)
✅SystemChats2.0 (30k rows)
✅LongAlign (samples under 16k tokens)
✅Everyday-conversations (50k rows)
✅APIGen-Function-Calling (80k rows)
✅Explore-Instruct-Rewriting (30k rows)
📚Training results:
SmolTalk delivered significant improvements in model performance, especially on math, programming, and system-prompt-following tasks. Training on SmolTalk gave better results on the IFEval, BBH, GSM8K, and MATH benchmarks, including when training Mistral-7B.
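If you want to inspect the data, a typical way to load it with the datasets library looks like this; the dataset id comes from the Hugging Face page below, but the config name "all" and the "messages" field are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Dataset id taken from the Hugging Face page; config name is an assumption.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

print(ds)                      # number of rows and column names
print(ds[0]["messages"][:2])   # chat-style messages (field name may differ)
```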
Hugging Face: HuggingFaceTB/smoltalk · Datasets at Hugging Face
🌎TOP DS-events all over the world in December
Dec 2-5 - TIES 2024 - Adelaide, Australia - https://www.isi-next.org/conferences/ties2024/
Dec 3 - Generation AI - Paris, France - https://dev.events/conferences/generation-ai-c4odjomu
Dec 5 - The International AI Summit 2024 - Brussels, Belgium - https://global-aiconference.com/
Dec 2-6 - Data Science Week 2024 - Fort Wayne, USA - https://sites.google.com/view/data-science-week-2024
Dec 2-6 - AWS re:Invent - Las Vegas, USA - https://reinvent.awsevents.com/
Dec 9-10 - ICMSCS 2024 - London, United Kingdom - https://waset.org/mathematics-statistics-and-computational-sciences-conference-in-december-2024-in-london
Dec 10 - Global Big Data Conference - Online - https://www.globalbigdataconference.com/
Dec 10 - Prompt Engineering Bulgaria 2024 - Sofia, Bulgaria - https://www.eventbrite.nl/e/prompt-engineering-bulgaria-2024-tickets-796563251127?aff=oddtdtcreator
Dec 11 - AI Heroes - Torino, Italy - https://dev.events/conferences/ai-heroes-xxrqdxu9
Dec 11-12 - The AI Summit New York - New York, USA - https://newyork.theaisummit.com/
Dec 12-13 - AI: 2057 - Dubai, UAE - https://www.globalaishow.com/
Dec 15-18 - IEEE International Conference on Big Data 2024 - Washington, D.C., USA - https://www3.cs.stonybrook.edu/~ieeebigdata2024/
Dec 19 - Normandie.ai 2024 - Rouen, France - https://dev.events/conferences/normandie-ai-2024-e15asbe6
Your clustering algorithm finds too many overlapping clusters. What would you use to solve the problem?
Anonymous Poll
21% - Applying a GMM (Gaussian Mixture Model) with adjustment of the covariance parameter
45% - Increasing the minimum-points-per-cluster hyperparameter for DBSCAN
33% - Using hierarchical clustering with adjustment of the dendrogram cutoff distance
0% - Applying the K-means++ algorithm for accurate selection of the initial centroids
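Relating to the first poll option, a minimal scikit-learn sketch of a Gaussian Mixture Model where the covariance_type parameter controls the shape of the cluster covariances; the data here is synthetic and the parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=2.5, random_state=0)

# "full" lets each component have its own covariance shape,
# which often handles overlapping clusters better than spherical assumptions.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

print(np.bincount(labels))       # cluster sizes
print(gmm.predict_proba(X[:3]))  # soft assignments for the first points
```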
😎🔥A selection of tools for Big Data processing
Timeplus Proton is a ClickHouse-based SQL engine designed to process, route, and analyze streaming data from sources such as Apache Kafka and Redpanda, with the ability to transfer aggregated data to other systems.
qsv is a command-line utility designed for quickly indexing, processing, analyzing, filtering, sorting, and merging CSV files. It offers convenient and understandable commands for performing these operations.
WrenAI is an open-source tool that prepares an existing database for working with RAG (Retrieval-Augmented Generation). It allows you to transform text queries into SQL, explore data from the database without writing SQL code, and perform other tasks.
pgroll is an open-source CLI utility for managing schema migrations in PostgreSQL. It provides safe and reversible changes, supporting multiple schema versions at the same time. pgroll handles complex migrations and ensures that client applications keep working while the database schema is being updated.
Valkey is a high-performance open-source in-memory key-value data store (a fork of Redis) that supports caching and message queues and can be used as a primary database (see the sketch after this list). It runs as a standalone background service or as part of a cluster, providing replication and high availability.
DataEase is an open-source BI tool for creating interactive visualizations and analyzing business metrics. It simplifies access to analytics with an intuitive drag-and-drop interface, making working with data convenient and understandable.
SurrealDB is a modern multi-model database that combines SQL, NoSQL, and graph databases. It supports relational, document, graph, temporal, and key-value data models, providing a unified solution for managing data without the need for different platforms.
LibSQL is a fork of SQLite, extended with features such as HTTP and gRPC query processing, and transparent replication support. It allows you to create distributed databases with writes on the primary server and reads from replicas. LibSQL provides secure data transfer via TLS and provides a Docker image for easy deployment.
Redash is an open-source data analytics tool designed to simplify connecting, querying, and visualizing data from a variety of sources. It allows you to create SQL and NoSQL queries, visualize results in the form of graphs and charts, and share dashboards with teams.
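Since Valkey is protocol-compatible with Redis, the standard redis-py client can talk to it. A minimal sketch assuming a Valkey (or Redis) server on localhost:6379; the key names are made up:

```python
import redis  # pip install redis; works with Valkey thanks to protocol compatibility

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Caching pattern: store a value with a TTL and read it back.
r.set("session:42", "user-data", ex=3600)
print(r.get("session:42"))

# Simple message-queue pattern using a list.
r.lpush("jobs", "resize-image-17")
print(r.rpop("jobs"))
```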
GitHub - timeplus-io/proton: ⚡ Fastest SQL ETL pipeline in a single C++ binary, built for stream processing, observability, analytics and AI/ML
🧐Data labeling in 2024: emerging trends and future demands
Caught an interesting article about data labeling. Here are a few key points:
🤔 Current trends:
✅ Increasing complexity of datasets
✅ The move to real-time labeling
✅ Large-scale development of automated tools to complement manual labor
🤔Market forecasts:
✅Expected to grow to $8.22 billion by 2028 at a CAGR of 26.6%
✅The requirements for quality and speed of markup are increasing and will grow exponentially
🤔Technological trends:
✅Adaptive AI
✅Metaverse
✅Industry cloud platforms
✅ Improvements in wireless technologies
Thus, the author indicates that the data labeling industry will grow rapidly due to the increasing demand for accurate and reliable data for AI and machine learning. Automation, adaptive AI, and new technological solutions will improve both the quality and the speed of data labeling.
Medium: Data Labeling in 2024: Emerging Trends and Future Demands for Impactful Results
😎Google unveiled Willow - a quantum chip with exponential scaling
Google has released Willow, the world's first quantum chip capable of exponentially reducing errors as the number of qubits grows. This is made possible by an efficient implementation of logical qubits operating below the quantum error-correction threshold (quantum error correction protects information by distributing it across many physical qubits).
Willow features:
✅Record number of qubits: 105, far exceeding previous quantum computers.
✅Calculation speed: a septillion times faster than classical chips. Willow solves in 300 seconds problems that would take a classical supercomputer 10 septillion years to complete.
✅ Error minimization: as the number of qubits increases, errors decrease exponentially, solving a major problem in quantum computing over the past 30 years.
While tasks like cracking bitcoin will require 300-400 million qubits, Willow is already setting a new bar in quantum technology.
🔎 Learn more here
🥲TOP fails with different DBMSs: pain and tears
✅PostgreSQL and the vacuum of surprise
Everyone loves PostgreSQL until they encounter autovacuum. If you forget to configure it correctly, the database starts to slow down so much that it's easier to migrate the data to Excel (see the sketch after this list).
✅Cassandra: master of sharding and chaos
Oh, this magical world of distributed data! As long as everything is running smoothly, Cassandra is cool. But when one node fails, clusters become a mystery with a surprise: what part of the data survived? And cross-DC replication in large networks is a lottery.
✅Firebase Realtime Database
Sounds cool: data synchronized in real time! But when you have tens of thousands of active users, everything becomes hell, because every little query costs a ton of money. And unmonitored updates affect all clients at once.
✅Redis as the main database
Easy, fast, everything in memory. Sounds cool until you realize that nobody configured persistence. Oops, the server crashed - and the data is gone.
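On the PostgreSQL point above: a small psycopg2 sketch (the connection string is a placeholder) that checks pg_stat_user_tables for dead tuples and the last autovacuum run, which is a quick way to see whether autovacuum is keeping up:

```python
import psycopg2

# Placeholder DSN; point it at your own database.
conn = psycopg2.connect("dbname=app user=app_user host=localhost")

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for relname, dead, last_av in cur.fetchall():
        print(f"{relname}: {dead} dead tuples, last autovacuum: {last_av}")

conn.close()
```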
😎🔥A small collection of useful datasets:
Synthia-v1.5-I – a dataset that includes over 20,000 technical questions and answers. It uses system prompts in the Orca style to generate diverse responses, making it a valuable resource for training and testing LLMs on complex technical data.
HelpSteer2 – an English-language dataset designed for training reward models that improve the utility, accuracy, and coherence of responses generated by other LLMs.
LAION-DISCO-12M – includes 12 million links to publicly available YouTube tracks with metadata. The dataset was created to support machine learning research: developing audio-processing models, analyzing musical data, and training recommender systems and applications.
Universe – a large-scale collection containing astronomical data of various types: images, spectra, and light curves. It is intended for research in astronomy and astrophysics.
😎📊Data Trends That Will Transform Business in 2025
The article The Most Powerful Data Trends That Will Transform Business In 2025 highlights key trends shaping the future of data usage.
🤔Here are some of them:
✅ Confidential Computing: Blockchain and homomorphic encryption will enable data analysis without exposing its content. This is a crucial step for secure collaborative analytics between companies.
✅ Growth of Data Marketplaces: Businesses will start monetizing their datasets, creating new revenue streams. Specialized platforms for trading data will emerge.
✅ Expansion of Edge Computing: Processing data at the network edge will reduce latency and enhance security. Technologies like tinyML will transform industries where real-time data processing is critical.
✅ Behavioral Data as a New Asset: Emotional and behavioral data analysis will underpin personalized solutions and decision-making.
Your project requires processing high-throughput streaming data (over 100,000 events per second) with guaranteed data delivery without loss. Which architecture would you prefer?
Anonymous Poll
62% - Apache Kafka with exactly-once semantics and Spark Structured Streaming
22% - Using Amazon S3 for data storage and subsequent analysis with Athena
5% - A combination of HDFS and Apache Storm with manual error handling
11% - A NoSQL database (e.g., Cassandra) with periodic data aggregation
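Relating to the first poll option, a minimal PySpark Structured Streaming sketch that reads from Kafka with checkpointing; the broker address, topic, and paths are placeholders, the spark-sql-kafka connector package must be on the classpath, and end-to-end exactly-once also depends on the sink:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Read the stream from Kafka.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

parsed = events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Checkpointing stores offsets so the query can recover without data loss.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "/data/events")                    # placeholder output path
    .option("checkpointLocation", "/checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```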