💻High-performance distributed database
YugabyteDB is a high-performance distributed database that supports all PostgreSQL features.
YugabyteDB is well suited for cloud-based OLTP applications (i.e. real-time and business-critical) that require absolute data correctness and require scalability or high fault tolerance.
🖥 GitHub
🟡 Documentation
Creating a local YugabyteDB cluster with Docker:
YugabyteDB is a high-performance distributed database that supports all PostgreSQL features.
YugabyteDB is well suited for cloud-based OLTP applications (i.e. real-time and business-critical) that require absolute data correctness and require scalability or high fault tolerance.
🖥 GitHub
🟡 Documentation
Creating a local YugabyteDB cluster with Docker:
docker run -d --name yugabyte -p7000:7000 -p9000:9000 -p15433:15433 -p5433:5433 -p9042:9042 \
yugabytedb/yugabyte:2.21.1.0-b271 bin/yugabyted start \
--background=false
GitHub
GitHub - yugabyte/yugabyte-db: YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
YugabyteDB - the cloud native distributed SQL database for mission-critical applications. - yugabyte/yugabyte-db
⚡️💡💻 MySQL 9.0.0 has been released
Oracle recently released MySQL DBMS 9.0.0. The developers of the project have prepared and made publicly available MySQL Community Server 9.0.0 builds for major Linux, FreeBSD, macOS and Windows distributions.
In 2023, the company announced a change in the MySQL DBMS release formation model. Developers began releasing two types of MySQL branches: Innovation (new features, frequent updates, three months of support) and LTS (with extended support time and unchanged behavior).
As the developers note, the MySQL 9.0 project is assigned to the Innovation branch, which will also include the next major releases of MySQL 9.1 and 9.2.
Distributions based on Innovation branches are recommended for those users who want to get access to new functionality earlier. They are published every 3 months and are supported only until the next major release is published (for example, after the 9.1 branch is released, support for the 9.0 branch will be discontinued).
Oracle recently released MySQL DBMS 9.0.0. The developers of the project have prepared and made publicly available MySQL Community Server 9.0.0 builds for major Linux, FreeBSD, macOS and Windows distributions.
In 2023, the company announced a change in the MySQL DBMS release formation model. Developers began releasing two types of MySQL branches: Innovation (new features, frequent updates, three months of support) and LTS (with extended support time and unchanged behavior).
As the developers note, the MySQL 9.0 project is assigned to the Innovation branch, which will also include the next major releases of MySQL 9.1 and 9.2.
Distributions based on Innovation branches are recommended for those users who want to get access to new functionality earlier. They are published every 3 months and are supported only until the next major release is published (for example, after the 9.1 branch is released, support for the 9.0 branch will be discontinued).
Oracle
Introducing MySQL Innovation and Long-Term Support (LTS) versions
Introducing MySQL Innovation and Long-Term Support (LTS) versions.
👍1
⚡️Tool to significantly enhance the database
WrenAI is an open-source tool that makes your existing database RAG-ready.
It allows you to convert text to SQL, explore data from the database without writing SQL, and do many other things
🖥 GitHub
🟡 Documentation
WrenAI is an open-source tool that makes your existing database RAG-ready.
It allows you to convert text to SQL, explore data from the database without writing SQL, and do many other things
🖥 GitHub
🟡 Documentation
GitHub
GitHub - Canner/WrenAI: ⚡️ GenBI (Generative BI) queries any database in natural language, generates accurate SQL (Text-to-SQL)…
⚡️ GenBI (Generative BI) queries any database in natural language, generates accurate SQL (Text-to-SQL), charts (Text-to-Chart), and AI-powered business intelligence in seconds. - Canner/WrenAI
💡Another small selection of AI tools for Big Data analytics
KNIME Analytics Platform is a free, open-source platform that allows users to stay at the forefront of data science and has 300+ connectors to various data sources. and integrates with all popular machine learning libraries.
Polymer - artificial intelligence for transforming data into an optimized, flexible and powerful database. All a user needs to do is upload their spreadsheet to the platform to instantly transform it into an optimized database that can then be mined for insights.
IBM Cognos Analytics is a componentized online business intelligence (BI) service that provides access to a wide range of functions for creating business reports, data analysis, event monitoring and metrics to develop effective business decisions.
Akkio is a business intelligence and forecasting tool that allows users to analyze their data and predict potential outcomes. The AI tool allows users to upload their dataset and select the variable they want to predict, which helps Akkio build a neural network around that variable. Like many other tools, Akkio requires no prior programming experience.
Monkeylearn - uses AI data analytics capabilities to help users visualize and reorganize their data. It can also be used to set up text classifiers and text extractors, which help automatically sort data according to topic or intent, and extract product characteristics or user data.
KNIME Analytics Platform is a free, open-source platform that allows users to stay at the forefront of data science and has 300+ connectors to various data sources. and integrates with all popular machine learning libraries.
Polymer - artificial intelligence for transforming data into an optimized, flexible and powerful database. All a user needs to do is upload their spreadsheet to the platform to instantly transform it into an optimized database that can then be mined for insights.
IBM Cognos Analytics is a componentized online business intelligence (BI) service that provides access to a wide range of functions for creating business reports, data analysis, event monitoring and metrics to develop effective business decisions.
Akkio is a business intelligence and forecasting tool that allows users to analyze their data and predict potential outcomes. The AI tool allows users to upload their dataset and select the variable they want to predict, which helps Akkio build a neural network around that variable. Like many other tools, Akkio requires no prior programming experience.
Monkeylearn - uses AI data analytics capabilities to help users visualize and reorganize their data. It can also be used to set up text classifiers and text extractors, which help automatically sort data according to topic or intent, and extract product characteristics or user data.
KNIME
KNIME Analytics Platform | KNIME
KNIME Analytics Platform is free and open source, which ensures users remain on the bleeding edge of data science, 300+ connectors to data sources, and integrations to all popular machine learning libraries.
👍2
🔎Lakehouse architecture: advantages and disadvantages
Lakehouse architecture is designed to provide more flexible and efficient data processing, including data storage, processing and analytics. It is a hybrid approach that combines elements of a traditional Data Warehouse and a Data Lake.
Lakehouse advantages:
1. Data unification: Lakehouse architecture allows you to store structured and unstructured data in one place. This simplifies data access and analysis, eliminating the need for separate systems for each type of data.
2. Cost-effective: By using low-cost data storage solutions such as cloud storage objects, Lakehouse architecture can be more cost-effective compared to traditional data warehouses.
3. Flexibility and Scalability: Lakehouse supports scalability, making it easy to increase data storage and processing power as needed. This is especially important for companies working with large volumes of data and requiring high performance.
4. Compatibility with modern analytical tools: Many modern analytical tools and platforms, such as Apache Spark, Delta Lake and others, integrate with the Lakehouse architecture, providing high performance and reliability of data analysis.
Disadvantages of Lakehouse
1. Implementation Difficulty: Implementing Lakehouse architecture can require significant effort and expense in planning, designing, and configuring the system. This may include training staff and adapting existing processes and tools.
2. Data Quality Management: Merging data from different sources can lead to data quality issues, especially if there are no rigorous data cleaning and validation processes in place.
3. Security and Privacy: Consolidating large amounts of data in one place increases the risks associated with data security and privacy. Additional measures are required to protect data from unauthorized access and leaks.
4. Potential Data Access Latency: In some cases, the Lakehouse architecture may experience latency in data access, especially when processing large volumes of unstructured data.
Thus, Lakehouse architecture offers many benefits such as data unification, cost efficiency and flexibility, making it attractive to many organizations. However, its implementation is associated with certain challenges, including complexity of integration, data quality management and security issues.
Lakehouse architecture is designed to provide more flexible and efficient data processing, including data storage, processing and analytics. It is a hybrid approach that combines elements of a traditional Data Warehouse and a Data Lake.
Lakehouse advantages:
1. Data unification: Lakehouse architecture allows you to store structured and unstructured data in one place. This simplifies data access and analysis, eliminating the need for separate systems for each type of data.
2. Cost-effective: By using low-cost data storage solutions such as cloud storage objects, Lakehouse architecture can be more cost-effective compared to traditional data warehouses.
3. Flexibility and Scalability: Lakehouse supports scalability, making it easy to increase data storage and processing power as needed. This is especially important for companies working with large volumes of data and requiring high performance.
4. Compatibility with modern analytical tools: Many modern analytical tools and platforms, such as Apache Spark, Delta Lake and others, integrate with the Lakehouse architecture, providing high performance and reliability of data analysis.
Disadvantages of Lakehouse
1. Implementation Difficulty: Implementing Lakehouse architecture can require significant effort and expense in planning, designing, and configuring the system. This may include training staff and adapting existing processes and tools.
2. Data Quality Management: Merging data from different sources can lead to data quality issues, especially if there are no rigorous data cleaning and validation processes in place.
3. Security and Privacy: Consolidating large amounts of data in one place increases the risks associated with data security and privacy. Additional measures are required to protect data from unauthorized access and leaks.
4. Potential Data Access Latency: In some cases, the Lakehouse architecture may experience latency in data access, especially when processing large volumes of unstructured data.
Thus, Lakehouse architecture offers many benefits such as data unification, cost efficiency and flexibility, making it attractive to many organizations. However, its implementation is associated with certain challenges, including complexity of integration, data quality management and security issues.
⚡️🔎Fully Synthetic Dataset
A huge dataset consisting entirely of synthetic data has appeared on Hugging Face.
The LLM (in this case GPT-4o + VLLM) generates answers by representing itself each time with some character: for example, a chemical scientist or a musician.
Synthetic data can sometimes help a lot (especially when the task is abstract and there is no structured information), but they are still treated with caution. They are not realistic enough, they are not diverse enough, and they potentially harbor hallucinations. It is still unclear whether we will ever be free to use “synthetics”, but it is actively being worked on.
A huge dataset consisting entirely of synthetic data has appeared on Hugging Face.
The LLM (in this case GPT-4o + VLLM) generates answers by representing itself each time with some character: for example, a chemical scientist or a musician.
Synthetic data can sometimes help a lot (especially when the task is abstract and there is no structured information), but they are still treated with caution. They are not realistic enough, they are not diverse enough, and they potentially harbor hallucinations. It is still unclear whether we will ever be free to use “synthetics”, but it is actively being worked on.
huggingface.co
proj-persona/PersonaHub · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
💡 Large video dataset with long duration and structured annotations
Tencent's MiraData is an off-the-shelf dataset with a total video duration of 16 thousand hours, designed to train text-to-video generation models. It includes long videos (average 72.1 seconds) with high motion intensity and detailed structured annotations (average 318 words per video).
To evaluate the quality of the dataset, a MiraBench benchmark system of 17 metrics assessing temporal consistency, motion in the frame, video quality, and other parameters was even specially created. According to their results, MiroData outperforms other known datasets available in open sources, which mostly consist of short videos with floating quality and short denoscriptions.
Tencent's MiraData is an off-the-shelf dataset with a total video duration of 16 thousand hours, designed to train text-to-video generation models. It includes long videos (average 72.1 seconds) with high motion intensity and detailed structured annotations (average 318 words per video).
To evaluate the quality of the dataset, a MiraBench benchmark system of 17 metrics assessing temporal consistency, motion in the frame, video quality, and other parameters was even specially created. According to their results, MiroData outperforms other known datasets available in open sources, which mostly consist of short videos with floating quality and short denoscriptions.
GitHub
GitHub - mira-space/MiraData: Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured…
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions" - mira-space/MiraData
😎Graph database implemented on the Apache Apache TinkerPop3 framework
HugeGraph is an open-source graph database implemented on the Apache TinkerPop3 framework and fully compatible with the Gremlin query language.
HugeGraph supports the import of over 10 billion vertices and edges and can process queries very quickly (at the ms level).
Typical HugeGraph application scenarios include exploring relationships between objects, association analysis, path finding, feature extraction, data clustering, community detection, and graph construction.
Quick start with Docker:
HugeGraph is an open-source graph database implemented on the Apache TinkerPop3 framework and fully compatible with the Gremlin query language.
HugeGraph supports the import of over 10 billion vertices and edges and can process queries very quickly (at the ms level).
Typical HugeGraph application scenarios include exploring relationships between objects, association analysis, path finding, feature extraction, data clustering, community detection, and graph construction.
Quick start with Docker:
docker run -itd --name=graph -p 8080:8080 hugegraph/hugegraph
# docker exec -it graph bash
GitHub
GitHub - apache/incubator-hugegraph: A graph database that supports more than 100+ billion data, high performance and scalability…
A graph database that supports more than 100+ billion data, high performance and scalability (Include OLTP Engine & REST-API & Backends) - apache/incubator-hugegraph
⚡️The largest collection of datasets of ~ 1 million pairs of problems and solutions for mathematical competitions
NuminaMath - datasets consisting of 1 million pairs of problems and solutions for various mathematical problems.
🔎Chain of Reasoning (CoT): 860 thousand pairs of problems and solutions created using CoT.
🛠 Tool-Integrated Reasoning (TIR): 73K synthetic solutions derived from GPT-4 with code execution feedback to break complex problems into simpler subproblems that can be solved using Python.
According to the researchers, models trained on NuminaMath achieve best-in-class performance among open-weight models and approach or beat their own models in math competition scores.
NuminaMath - datasets consisting of 1 million pairs of problems and solutions for various mathematical problems.
🔎Chain of Reasoning (CoT): 860 thousand pairs of problems and solutions created using CoT.
🛠 Tool-Integrated Reasoning (TIR): 73K synthetic solutions derived from GPT-4 with code execution feedback to break complex problems into simpler subproblems that can be solved using Python.
According to the researchers, models trained on NuminaMath achieve best-in-class performance among open-weight models and approach or beat their own models in math competition scores.
huggingface.co
NuminaMath - a AI-MO Collection
Datasets and models for training SOTA math LLMs. See our GitHub for training & inference code: https://github.com/project-numina/aimo-progress-prize
👍1
😎💡Benchmark for comprehensive assessment of LLM logical thinking
ZebraLogic is a benchmark based on logic puzzles and is a set of 1000 program-generated tasks of varying difficulty - with a grid from 2x2 to 6x6.
Each puzzle consists of N houses (numbered from left to right) and M features for each house. The task is to determine the unique distribution of feature values across the houses based on the provided clues.
Language models are given one example of the puzzle solution with a detailed explanation of the reasoning process and the answer in JSON format. Models must then solve a new problem, providing both the reasoning progress and the final solution in a given format.
Evaluation Metrics:
1. Puzzle-level accuracy (percentage of completely correctly solved puzzles).
2. Cell-level accuracy (percentage of correctly completed cells in the solution matrix).
🟡 Project Page
🟡 Dataset
Local launch of ZebraLogic as part of the ZeroEval framefork:
ZebraLogic is a benchmark based on logic puzzles and is a set of 1000 program-generated tasks of varying difficulty - with a grid from 2x2 to 6x6.
Each puzzle consists of N houses (numbered from left to right) and M features for each house. The task is to determine the unique distribution of feature values across the houses based on the provided clues.
Language models are given one example of the puzzle solution with a detailed explanation of the reasoning process and the answer in JSON format. Models must then solve a new problem, providing both the reasoning progress and the final solution in a given format.
Evaluation Metrics:
1. Puzzle-level accuracy (percentage of completely correctly solved puzzles).
2. Cell-level accuracy (percentage of correctly completed cells in the solution matrix).
🟡 Project Page
🟡 Dataset
Local launch of ZebraLogic as part of the ZeroEval framefork:
# Install via conda
conda create -n zeroeval python=3.10
conda activate zeroeval
# pip install vllm -U # pip install -e vllm
pip install vllm==0.5.1
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/
# Run Meta-Llama-3-8B-Instruct via local, with greedy decoding on `zebra-grid`
bash zero_eval_local.sh -d zebra-grid -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct -s 4
GitHub
GitHub - WildEval/ZeroEval: A simple unified framework for evaluating LLMs
A simple unified framework for evaluating LLMs. Contribute to WildEval/ZeroEval development by creating an account on GitHub.
👍3
💡Datasets used to build various ML bases
Iphone dataset - a set of datasets on the basis of which more than 40 thousand dynamic and more than 100 thousand static Gaussians, 20 SE(3) bases were built using Shape of Motion
The training time on 1xGPU A100 using the Adam optimizer with a resolution of 960x720 was just over 2 hours at a rendering speed of 40 frames per second.
According to the results of tests during the training process, Shape of Motion showed good results in the quality and consistency of scene construction.
However, the method still requires optimization for each specific scene and cannot handle significant changes in camera angle. There is also a critical dependence on precise camera parameters and user input to create a moving object mask.
Iphone dataset - a set of datasets on the basis of which more than 40 thousand dynamic and more than 100 thousand static Gaussians, 20 SE(3) bases were built using Shape of Motion
The training time on 1xGPU A100 using the Adam optimizer with a resolution of 960x720 was just over 2 hours at a rendering speed of 40 frames per second.
According to the results of tests during the training process, Shape of Motion showed good results in the quality and consistency of scene construction.
However, the method still requires optimization for each specific scene and cannot handle significant changes in camera angle. There is also a critical dependence on precise camera parameters and user input to create a moving object mask.
GitHub
GitHub - vye16/shape-of-motion
Contribute to vye16/shape-of-motion development by creating an account on GitHub.
🌎TOP DS-events all over the world in August
Aug 2-4 - MLMI 2024 - Osaka, Japan - https://mlmi.net/
Aug 3-9 - International Joint Conference on Artificial Intelligence (IJCAI) - Jeju, South Korea - https://ijcai24.org/
Aug 5-6 - ICASAM 2024 - Vancouver, Canada - https://waset.org/applied-statistics-analysis-and-modeling-conference-in-august-2024-in-vancouver
Aug 7-8 - CDAO Chicago - Chicago, United States - https://da-metro-chicago.coriniumintelligence.com/
Aug 12-14 - AI4 2024 - Las Vegas, United States - https://ai4.io/vegas/
Aug 16-17 - Machine Learning for Healthcare 2024 - Toronto, Canada - https://www.mlforhc.org/
Aug 19-20 - Artificial Intelligence and Machine Learning - Toronto, Canada - https://www.scitechseries.com/artificial-intelligence-machine
Aug 19-22 - The Bioprocessing Summit - Boston, USA - https://www.bioprocessingsummit.com/
Aug 25-29 - ACM KDD 2024 - Barcelona, Spain - https://kdd2024.kdd.org/
Aug 27 - Azure AI Summer Jam -
Aug 27-29 - ITCN Asia 25th - Karachi, Pakistan - https://itcnasia.com/karachi/
Aug 31 - DATA SATURDAY #52 - Oslo, Norway - https://datasaturdays.com/Event/20240831-datasaturday0052
Aug 2-4 - MLMI 2024 - Osaka, Japan - https://mlmi.net/
Aug 3-9 - International Joint Conference on Artificial Intelligence (IJCAI) - Jeju, South Korea - https://ijcai24.org/
Aug 5-6 - ICASAM 2024 - Vancouver, Canada - https://waset.org/applied-statistics-analysis-and-modeling-conference-in-august-2024-in-vancouver
Aug 7-8 - CDAO Chicago - Chicago, United States - https://da-metro-chicago.coriniumintelligence.com/
Aug 12-14 - AI4 2024 - Las Vegas, United States - https://ai4.io/vegas/
Aug 16-17 - Machine Learning for Healthcare 2024 - Toronto, Canada - https://www.mlforhc.org/
Aug 19-20 - Artificial Intelligence and Machine Learning - Toronto, Canada - https://www.scitechseries.com/artificial-intelligence-machine
Aug 19-22 - The Bioprocessing Summit - Boston, USA - https://www.bioprocessingsummit.com/
Aug 25-29 - ACM KDD 2024 - Barcelona, Spain - https://kdd2024.kdd.org/
Aug 27 - Azure AI Summer Jam -
Aug 27-29 - ITCN Asia 25th - Karachi, Pakistan - https://itcnasia.com/karachi/
Aug 31 - DATA SATURDAY #52 - Oslo, Norway - https://datasaturdays.com/Event/20240831-datasaturday0052
waset.org
International Conference on Applied Statistics, Analysis and Modeling ICASAM in August 2024 in Vancouver
Applied Statistics, Analysis and Modeling scheduled on August 05-06, 2024 in August 2024 in Vancouver is for the researchers, scientists, scholars, engineers, academic, scientific and university practitioners to present research activities that might want…
❤1
💡😎A startup that revolutionized the way we process data
CRAM is a new memory technology that can reduce energy consumption when processing AI data by 1000 times.
Researchers from the University of Minnesota have developed a new technology, Computational Random-Access Memory (CRAM), that can reduce energy consumption when processing data. Unlike traditional solutions, where data moves between memory and the processor, CRAM allows data to be processed directly in memory cells.
This is achieved through the use of a high-density and reconfigurable spintronic structure embedded in memory cells. Thus, the data does not leave the memory, which minimizes response delays and energy consumption associated with the transfer of information.
With CRAM, data never leaves memory, but is instead processed entirely within the computer’s memory array. This allows a system running an AI computing application to reduce power consumption by “about 1,000 times compared to a state-of-the-art solution,” according to the research team.
CRAM is a new memory technology that can reduce energy consumption when processing AI data by 1000 times.
Researchers from the University of Minnesota have developed a new technology, Computational Random-Access Memory (CRAM), that can reduce energy consumption when processing data. Unlike traditional solutions, where data moves between memory and the processor, CRAM allows data to be processed directly in memory cells.
This is achieved through the use of a high-density and reconfigurable spintronic structure embedded in memory cells. Thus, the data does not leave the memory, which minimizes response delays and energy consumption associated with the transfer of information.
With CRAM, data never leaves memory, but is instead processed entirely within the computer’s memory array. This allows a system running an AI computing application to reduce power consumption by “about 1,000 times compared to a state-of-the-art solution,” according to the research team.
Tom's Hardware
New memory tech unveiled that reduces AI processing energy requirements by 1,000 times or more
New CRAM technology gives RAM chips the power to process data, not just store it.
💡😎Interesting Caldera Dataset
The Caldera dataset is an open source scene dataset containing much of the geometry found in the game Call of Duty®: Warzone™.
This includes geometry that can be visualized, as well as some alternate, usually unseen representations used in other calculations. For example, the developers have included volumes here to aid in lighting calculations or simple shapes for collision detection. Excluded are many single-point entities, such as character spawn locations or complex noscript-based models. As the developers note, they decided not to include textures and materials in this release. That would have added complexity and size to an already heavy scene. They focused on the many connections between spatial elements that can be found in this set, rather than an accurate visual representation.
The Caldera dataset is an open source scene dataset containing much of the geometry found in the game Call of Duty®: Warzone™.
This includes geometry that can be visualized, as well as some alternate, usually unseen representations used in other calculations. For example, the developers have included volumes here to aid in lighting calculations or simple shapes for collision detection. Excluded are many single-point entities, such as character spawn locations or complex noscript-based models. As the developers note, they decided not to include textures and materials in this release. That would have added complexity and size to an already heavy scene. They focused on the many connections between spatial elements that can be found in this set, rather than an accurate visual representation.
GitHub
GitHub - Activision/caldera: Caldera data set from Call of Duty®: Warzone™
Caldera data set from Call of Duty®: Warzone™. Contribute to Activision/caldera development by creating an account on GitHub.
👍1
💡😎The book "PostgreSQL 16 from the inside" is now freely available
The Postgres Professional DBMS developer has released a new book "PostgreSQL 16 from the inside". The electronic version of the textbook is freely available. The author of the book is Egor Rogov, Director of Educational Program Development at Postgres Professional.
The first edition of this textbook, based on version 14 of PostgreSQL, was released in March 2022 and updated to version 15. Due to great reader interest, the company translated the book into English. It later became the most popular thematic publication of 2023 according to Postgres Weekly and was included in the list of professional literature on the official website of the PostgreSQL community.
The current edition of the book "PostgreSQL 16 from the Inside" takes into account readers' comments, corrects typos, and reflects changes that occurred in the PostgreSQL 16 version. Postgres Professional has also updated the localized documentation for PostgreSQL 16.
The Postgres Professional DBMS developer has released a new book "PostgreSQL 16 from the inside". The electronic version of the textbook is freely available. The author of the book is Egor Rogov, Director of Educational Program Development at Postgres Professional.
The first edition of this textbook, based on version 14 of PostgreSQL, was released in March 2022 and updated to version 15. Due to great reader interest, the company translated the book into English. It later became the most popular thematic publication of 2023 according to Postgres Weekly and was included in the list of professional literature on the official website of the PostgreSQL community.
The current edition of the book "PostgreSQL 16 from the Inside" takes into account readers' comments, corrects typos, and reflects changes that occurred in the PostgreSQL 16 version. Postgres Professional has also updated the localized documentation for PostgreSQL 16.
⚡️📊OpenAI now provides normal structured JSON with data
I would like to remind you that the JSON mode has been working for about a year, but the outputs of the models corresponded to the declared format in less than half of the cases.
However, there is great news for developers who need good data markup. The updated version gpt-4o-2024-08-06 no longer has this problem: 100% of tests have no errors in the format.
The code and tutorial on using the feature are here.
I would like to remind you that the JSON mode has been working for about a year, but the outputs of the models corresponded to the declared format in less than half of the cases.
However, there is great news for developers who need good data markup. The updated version gpt-4o-2024-08-06 no longer has this problem: 100% of tests have no errors in the format.
The code and tutorial on using the feature are here.
Openai
Introducing Structured Outputs in the API
We are introducing Structured Outputs in the API—model outputs now reliably adhere to developer-supplied JSON Schemas.
❤1👍1
⚠️Attention! Spark = Pandas + Big Data support
Be careful when applying your Pandas knowledge to Spark!!!
Of course, Pandas and Spark operate on the same data type — tables. However, the way they interact with them is significantly different.
For example, the main difference is that Pandas runs in a single process on a single machine and loads all the data into memory, while Spark is designed to work with large distributed data sets and can process terabytes and petabytes of data without loading it entirely into the memory of a single node
However, unfortunately, many programmers often transfer their knowledge from Pandas to Spark, assuming similar architectures, which leads to performance bottlenecks.
You can learn more about solving this problem from this article
Be careful when applying your Pandas knowledge to Spark!!!
Of course, Pandas and Spark operate on the same data type — tables. However, the way they interact with them is significantly different.
For example, the main difference is that Pandas runs in a single process on a single machine and loads all the data into memory, while Spark is designed to work with large distributed data sets and can process terabytes and petabytes of data without loading it entirely into the memory of a single node
However, unfortunately, many programmers often transfer their knowledge from Pandas to Spark, assuming similar architectures, which leads to performance bottlenecks.
You can learn more about solving this problem from this article
Dailydoseofds
Spark != Pandas + Big Data Support
Extend your learnings from Pandas to Spark with caution.
👍2
Which of the following is faster for analyzing more than 1 million structured data?
Anonymous Poll
15%
Apache Hive
47%
Apache Spark
13%
ClickHouse
19%
PostgreSQL
6%
SAS
⚡️A Scalable Dataset for Tuning Instructions in Software Mathematical Reasoning
The Mathematical Reasoning pipeline emphasizes separating numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependence on specific numerical values.
As the authors note in their paper, experiments in fine-tuning open-source language and code models such as Llama2 and CodeLlama demonstrate the practical benefits of the InfinityMATH dataset.
In addition, these models have shown high reliability on the GSM8K+ and MATH+ benchmarks, which are improved versions of the benchmarks with minor changes to the numerical values.
📊Dataset
📖Research paper
The Mathematical Reasoning pipeline emphasizes separating numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependence on specific numerical values.
As the authors note in their paper, experiments in fine-tuning open-source language and code models such as Llama2 and CodeLlama demonstrate the practical benefits of the InfinityMATH dataset.
In addition, these models have shown high reliability on the GSM8K+ and MATH+ benchmarks, which are improved versions of the benchmarks with minor changes to the numerical values.
📊Dataset
📖Research paper
huggingface.co
Paper page - InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic
Mathematical Reasoning
Mathematical Reasoning
Join the discussion on this paper page
👍1
What is the best way to generate synthetic data?
Anonymous Poll
26%
Apache Hive
20%
PostgreSQL
32%
MySQL
22%
None of above
👍2
🧐What is the difference between DICOM and NIfTI medical image formats
Before we look at the differences between DICOM and NIfTI, let's take a closer look at what each of these formats is individually
🤔What is the DICOM standard?
The DICOM standard — Digital Imaging and Communications in Medicine (DICOM) — is used to exchange images and information, it has been popular for more than a decade. Today, almost every device used in radiology (including CT, MRI, ultrasound and radiography) is equipped with support for the DICOM standard. According to the information from the standard developer (), DICOM allows you to transfer medical images in an environment of devices from different manufacturers and simplify the development and expansion of image archiving and communication systems.
🤔What is the NIfTI standard?
The Neuroimaging Informatics Technology Initiative (NIfTI) was created to work with users and manufacturers of medical devices to address some of the problems and shortcomings of other imaging standards. NIfTI was specifically designed to address these issues in the field of neuroimaging, with a focus on functional magnetic resonance imaging (fMRI). According to the NIfTI definition, the primary mission of NIfTI is to provide coordinated, targeted services, education, and research to accelerate the development and usability of neuroimaging informatics tools. NIfTI consists of two standards, NIfTI-1 and NIfTI-2, the latter being a 64-bit enhancement of the former. It does not replace NIfTI-1, but is used in parallel and supported by a wide range of medical neuroimaging devices and operating systems.
❓What is the difference between DICOM and NIfTI?
1. NIfTI files have less metadata: An NIfTI file does not require as many tags to be filled in as a DICOM image file. There is much less metadata to inspect and analyze, but this is in some ways a disadvantage because DICOM provides users with different layers of image and patient data.
2. DICOM files are often bulkier: DICOM data transfer is governed by strict formatting rules that ensure that the receiving device supports SOP classes and transfer syntaxes, such as the file format and encryption used to transfer the data. When transferring DICOM files, one device talks to another. If one device cannot process the information that the other is trying to send, it will inform the requesting device so that the sender can roll back to a different object (e.g. a previous version) or send the information to a different receiving end. Therefore, NIfTI files are usually easier and faster to process, transfer, read, and write than DICOM image files.
3. DICOM works with 2D layers, while NIfTI can display 3D details: NIfTI files store images and other data in a 3D format. It is specifically designed to overcome the spatial orientation issues of other medical imaging file formats. DICOM image files and associated data are made up of 2D layers. This allows for viewing different sections of an image, which is especially useful when analyzing the human body and different organs. However, with NIfTI, neurosurgeons can quickly identify objects in images in 3D, such as the right and left hemispheres of the brain. This is invaluable when analyzing images of the human brain, which is extremely difficult to evaluate and annotate.
4. DICOM files can store more information: As mentioned above, DICOM files allow medical professionals to store more information in different layers. Structured reports can be created and even images can be frozen so that other clinicians and data scientists can clearly see what the opinion/recommendation is based on.
Before we look at the differences between DICOM and NIfTI, let's take a closer look at what each of these formats is individually
🤔What is the DICOM standard?
The DICOM standard — Digital Imaging and Communications in Medicine (DICOM) — is used to exchange images and information, it has been popular for more than a decade. Today, almost every device used in radiology (including CT, MRI, ultrasound and radiography) is equipped with support for the DICOM standard. According to the information from the standard developer (), DICOM allows you to transfer medical images in an environment of devices from different manufacturers and simplify the development and expansion of image archiving and communication systems.
🤔What is the NIfTI standard?
The Neuroimaging Informatics Technology Initiative (NIfTI) was created to work with users and manufacturers of medical devices to address some of the problems and shortcomings of other imaging standards. NIfTI was specifically designed to address these issues in the field of neuroimaging, with a focus on functional magnetic resonance imaging (fMRI). According to the NIfTI definition, the primary mission of NIfTI is to provide coordinated, targeted services, education, and research to accelerate the development and usability of neuroimaging informatics tools. NIfTI consists of two standards, NIfTI-1 and NIfTI-2, the latter being a 64-bit enhancement of the former. It does not replace NIfTI-1, but is used in parallel and supported by a wide range of medical neuroimaging devices and operating systems.
❓What is the difference between DICOM and NIfTI?
1. NIfTI files have less metadata: An NIfTI file does not require as many tags to be filled in as a DICOM image file. There is much less metadata to inspect and analyze, but this is in some ways a disadvantage because DICOM provides users with different layers of image and patient data.
2. DICOM files are often bulkier: DICOM data transfer is governed by strict formatting rules that ensure that the receiving device supports SOP classes and transfer syntaxes, such as the file format and encryption used to transfer the data. When transferring DICOM files, one device talks to another. If one device cannot process the information that the other is trying to send, it will inform the requesting device so that the sender can roll back to a different object (e.g. a previous version) or send the information to a different receiving end. Therefore, NIfTI files are usually easier and faster to process, transfer, read, and write than DICOM image files.
3. DICOM works with 2D layers, while NIfTI can display 3D details: NIfTI files store images and other data in a 3D format. It is specifically designed to overcome the spatial orientation issues of other medical imaging file formats. DICOM image files and associated data are made up of 2D layers. This allows for viewing different sections of an image, which is especially useful when analyzing the human body and different organs. However, with NIfTI, neurosurgeons can quickly identify objects in images in 3D, such as the right and left hemispheres of the brain. This is invaluable when analyzing images of the human brain, which is extremely difficult to evaluate and annotate.
4. DICOM files can store more information: As mentioned above, DICOM files allow medical professionals to store more information in different layers. Structured reports can be created and even images can be frozen so that other clinicians and data scientists can clearly see what the opinion/recommendation is based on.
NEMA
Digital Imaging and Communications in Medicine (DICOM)
❤1