💡🤖😎10 AI Terms and Aspects That Everyone Needs to Understand and Be Aware of Today
🧐Today, we’ll look at 10 aspects that most broadly cover the field of AI in its various manifestations:
✅ Reasoning/Planning: Modern AI systems can solve problems by using patterns they’ve learned from historical data to understand the information, which is similar to the process of reasoning. The most advanced systems can go further, tackling more complex problems by creating plans and determining a sequence of actions to achieve a goal.
✅ Learning/Inference: There are two stages to creating and using an AI system: learning and inference. Learning can be compared to the process of educating an AI, where it’s given a set of data and it learns to perform tasks or make predictions based on that data.
Inference is the process by which an AI uses learned patterns and parameters to, for example, predict the price of a new home that will soon go on sale.
✅ Small Language Models (SLMs): Compact versions of Large Language Models (LLMs). Both of these types use machine learning techniques to recognize patterns and relationships, allowing them to generate realistic and natural language responses. However, unlike LLMs, which are huge and require a lot of computing power and memory, SLMs like Phi-3 are trained on smaller, curated datasets and have fewer parameters.
✅ Grounded: Generative AI systems can create stories, poems, jokes, and answer research questions. However, they sometimes have difficulty separating fact from fiction or use outdated data, leading to erroneous answers called “hallucinations.” Developers aim to make AI interactions with the real world more accurate through a process called grounding, where the model is connected to current data and specific examples to improve accuracy and produce more relevant results.
✅ Retrieval Augmented Generation (RAG): When developers give AI access to external data sources to make it more accurate and relevant, a technique called Retrieval Augmented Generation (RAG) is used. This approach saves time and resources by adding new knowledge without having to retrain the AI.
✅ Orchestration: AI programs perform many tasks when processing user requests, and an orchestration layer manages their actions in the right order to get the best response. The orchestration layer can also follow the RAG pattern, searching the web for fresh information and adding context.
✅ Memory: Modern AI models technically do not have memory. However, they may have orchestration instructions that help them “remember” information by performing specific steps with each interaction, for example by temporarily storing earlier questions and answers from a chat and including them as context in the next request.
✅ Transformers and Diffusion Models: Humans have been training AI systems to understand and generate language for decades, but one of the breakthroughs that has accelerated progress is the Transformer model. Transformers use an attention mechanism to weigh how the elements of a sequence relate to one another, which lets them capture context and nuance efficiently.
Diffusion models are typically used to generate images. They start from random noise and refine it through many small denoising steps until the desired output emerges.
✅ Frontier Models: Frontier models are large-scale systems that push the boundaries of AI and can perform a wide range of tasks with new and advanced capabilities. They are becoming key tools for a variety of industries, including healthcare, finance, scientific research, and education.
✅ GPU: A graphics processing unit is a powerful computing unit. Initially created to improve the graphics in video games, GPUs have now become the real “muscles” of the computing world. Since AI essentially has to solve a huge number of computational problems in order to understand language and recognize images or sounds, GPUs are indispensable for AI both at the training stage and when running finished models.
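To make the Learning/Inference item above more concrete, here is a minimal sketch of both stages for the home-price example. The training data, the single "area" feature, and the function names `learn` and `infer` are all hypothetical and chosen only for illustration; real systems use far larger datasets and models.

```python
# Hypothetical training data: (area in square meters, price in thousands).
training_data = [(50, 150), (70, 210), (90, 270), (120, 360)]


def learn(samples):
    """Learning stage: fit price = slope * area + intercept by least squares."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    variance = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(params, area):
    """Inference stage: apply the learned parameters to an unseen home."""
    slope, intercept = params
    return slope * area + intercept


params = learn(training_data)          # done once, on historical data
predicted_price = infer(params, 100)   # done for every new home
print(f"Predicted price: {predicted_price:.1f}")
```

Learning runs once and is the expensive part; inference reuses the stored parameters for every new prediction.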
💡Creating recommendations for applications with minimal complexity using vector databases
Data not only trains AI systems; it is also the final output that you continue to work with. That's why it's so important to use "good" data. No matter how powerful the model is, bad input data produces bad output.
This article walks through an example of using the Weaviate database in a Streamlit app to simplify working with vector databases. The authors believe this approach lets you build a powerful search and recommendation system while keeping technical complexity and cost in check.
📚For reference:
✅Weaviate is an open-source vector database that stores data objects together with vector embeddings from machine learning models and scales to billions of data objects.
✅Streamlit is a Python framework with a set of tools for quickly turning a machine learning model or data script into a web application.
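The core idea behind recommendations in a vector database can be sketched in a few lines. This is not the Weaviate API: the `catalog` embeddings, the `recommend` function, and the three-dimensional vectors are hypothetical stand-ins. A real system would get embeddings from a model and use an approximate nearest-neighbor index instead of a full scan.

```python
import math

# Hypothetical item embeddings; in practice these come from an ML model.
catalog = {
    "sci-fi novel": [0.9, 0.1, 0.0],
    "space documentary": [0.8, 0.2, 0.1],
    "cookbook": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_vector, items, top_k=2):
    """Return the names of the top_k items most similar to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query_vector, items[name]),
        reverse=True,
    )
    return ranked[:top_k]


# A user whose interests point along the first dimension.
recommendations = recommend([1.0, 0.0, 0.0], catalog)
print(recommendations)
```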
Which of the following would you classify as anomalies (outliers) in the data?
Anonymous Poll
13%: All values within the standard deviation
24%: Values with a large number of NULLs
11%: Duplicate values
52%: All values outside the standard deviation
📊Quick Tips for Handling Large Datasets in Pandas
Pandas is a great tool for working with small datasets, typically up to two or three gigabytes in size.
For datasets larger than this, using Pandas alone is not recommended. Pandas loads the entire dataset into memory before processing it, so problems arise as soon as the dataset exceeds the available RAM. Memory issues can occur even with smaller datasets, because preprocessing and transformation steps create additional copies of the DataFrame.
⚠️Here are some tips for efficient data processing in Pandas:
✅ Use efficient data types: Use more memory-efficient data types (e.g. int32 instead of int64, float32 instead of float64) to reduce memory usage.
✅ Load less data: Use the usecols parameter of pd.read_csv() to load only the columns you need, reducing memory consumption.
✅ Chunking: Use the chunksize parameter of pd.read_csv() to read the dataset in smaller chunks, processing each chunk iteratively.
✅ Optimize Pandas dtypes: Use the astype method to convert columns to more memory-efficient types after loading the data, if appropriate.
✅ Parallelize Pandas with Dask: Use Dask, a parallel computing library, to scale Pandas workflows to larger-than-memory datasets by leveraging parallel processing.
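Several of the tips above can be combined in a single read. In the sketch below, the in-memory CSV, the column names, and the `total_revenue` helper are hypothetical and stand in for a large file on disk. Only the `usecols`, `dtype`, and `chunksize` parameters of `pd.read_csv()` and the `astype` method come from the tips themselves.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
csv_text = "user_id,price,qty,comment\n" + "\n".join(
    f"{i},{i * 1.5},{i % 7},note{i}" for i in range(1000)
)


def total_revenue(source, chunksize=250):
    """Sum price * qty while holding only one chunk in memory at a time."""
    total = 0.0
    reader = pd.read_csv(
        source,
        usecols=["price", "qty"],                    # load only needed columns
        dtype={"price": "float32", "qty": "int32"},  # smaller numeric types
        chunksize=chunksize,                         # returns an iterator of DataFrames
    )
    for chunk in reader:
        total += float((chunk["price"] * chunk["qty"]).sum())
    return total


revenue = total_revenue(io.StringIO(csv_text))

# Downcasting after loading: qty only holds values 0..6, so int8 is enough.
qty_default = pd.read_csv(io.StringIO(csv_text), usecols=["qty"])["qty"]
qty_small = qty_default.astype("int8")
print(qty_default.memory_usage(deep=True), qty_small.memory_usage(deep=True))
```

Chunking works well for aggregations like this one, where each chunk can be reduced to a small partial result. Operations that need the whole dataset at once, such as a global sort, are better served by a tool like Dask.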
🖥Learn more here
GeeksforGeeks
Handling Large Datasets in Pandas - GeeksforGeeks
🧐💡A Brief Introduction to MapReduce: Advantages and Disadvantages
MapReduce is a programming model and associated framework for processing large data sets in parallel on distributed computing systems. It has two main phases: Map, which transforms input records into intermediate key-value pairs, and Reduce, which aggregates the values that share a key.
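The classic word-count example shows the flow of the model on a single machine. This is only an illustrative sketch: the function names are hypothetical, and a real framework would run the map and reduce tasks on many nodes and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data beats ideas"])
print(counts)
```

Each call to `map_phase` depends on one document and each call to `reduce_phase` on one key, which is what lets a framework run them in parallel.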
Advantages of MapReduce:
✅Scalability: MapReduce easily scales to thousands of machines, allowing it to process huge amounts of data
✅Parallelism: MapReduce automatically distributes tasks across available nodes, executing them in parallel, reducing computational time
✅Fault tolerance: Built-in fault tolerance allows tasks to be restarted in the event of node failure, ensuring completion without data loss
Disadvantages of MapReduce:
✅High I/O Cost: One of the key disadvantages is that data is written to and read from disk between the Map and Reduce stages, significantly reducing performance in tasks where fast data transfer is important
✅Lack of interactivity: MapReduce is designed for batch processing, making it inefficient for interactive queries or real-time analysis
✅Shuffle phase requirement: The shuffle phase is often resource- and time-intensive, making it a bottleneck in MapReduce performance
✅Low performance for complex tasks: For complex algorithms that require many steps of communication between nodes (e.g. iterative tasks), MapReduce performance degrades
You can also learn more about MapReduce from here
Medium
Everything you need to know about MapReduce
All the key insights from the paper MapReduce: Simplified Data Processing on Large Clusters from Google
😎💡🔥A selection of lesser-known but very useful Python libraries for working with data
Bottleneck is a library that speeds up NumPy methods by up to 25 times, especially on arrays containing NaN values. It provides optimized implementations of aggregate functions such as minimum, maximum, and median, which makes work with large data sets considerably faster than with the standard NumPy equivalents.
Nbcommands is a tool that simplifies code search in Jupyter notebooks, eliminating the need for users to search manually. It allows you to find and manage code by keywords, functions, or other elements, which significantly speeds up working with large projects in Jupyter and helps users navigate their notes and code blocks more efficiently.
SciencePlots is a style library for matplotlib that allows you to create professional graphs for presentations, research papers, and other scientific publications. It offers a set of predefined styles that meet the requirements for data visualization in scientific papers, making graphs more readable and aesthetically pleasing. SciencePlots makes it easy to create high-quality graphs that meet the standards of academic publications and presentations.
Aquarel is a library that adds additional styles to visualizations in matplotlib. It allows you to improve the appearance of graphs, making them more attractive and professional. Aquarel simplifies the creation of custom styles, helping users create graphs with more interesting designs without having to manually configure all the visualization parameters.
Modelstore is a library for managing and tracking machine learning models. It helps organize, save, and version models, as well as track their lifecycle. With Modelstore, users can save models to various storage backends (S3, GCP, Azure, and others) and manage their updates and restoration. This makes it easier to deploy and monitor models in production environments.
CleverCSV is a library that improves the process of parsing CSV files and helps avoid errors when reading them with Pandas. It automatically detects the correct delimiters and format of CSV files, which is especially useful when working with files that have non-standard or heterogeneous structures. CleverCSV simplifies working with data by eliminating errors associated with incorrect recognition of delimiters and other file format parameters.
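To illustrate the delimiter-detection problem that CleverCSV addresses, here is a sketch that uses only Python's built-in `csv.Sniffer`. It is not CleverCSV's API. The sample text is hypothetical, and the built-in sniffer is the simpler baseline that tends to struggle on messier files.

```python
import csv
import io

# Hypothetical file content that uses semicolons instead of commas.
raw_text = "name;score;city\nAnn;3;Oslo\nBob;4;Rome\nCid;2;Lima\n"

# Ask the sniffer to guess the dialect from a sample of the file,
# restricting the candidates to a few common delimiters.
dialect = csv.Sniffer().sniff(raw_text, delimiters=";,\t|")

# Parse the file with the detected dialect instead of assuming commas.
rows = list(csv.reader(io.StringIO(raw_text), dialect))
print(dialect.delimiter, rows[0])
```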
GitHub
GitHub - pydata/bottleneck: Fast NumPy array functions written in C
🌎TOP DS-events all over the world in October
Oct 1-2 - AI and Big Data Expo Europe - Amsterdam, Netherlands - https://www.ai-expo.net/europe/
Oct 7-10 - Coalesce - Las Vegas, USA - https://coalesce.getdbt.com/
Oct 9-10 - World Summit AI - Amsterdam, Netherlands - https://worldsummit.ai/
Oct 9-10 - Big Data & AI World - Singapore, Singapore - https://www.bigdataworldasia.com/
Oct 10-11 - COLLIDE 2024: The South's largest data & AI conference - Atlanta, USA - https://datasciconnect.com/events/collide/
Oct 14-17 - Data, AI & Analytics Conference Europe 2024 - London, UK - https://irmuk.co.uk/data-ai-conference-europe-2024/
Oct 16-17 - Spatial Data Science Conference 2024 - New York, USA - https://spatial-data-science-conference.com/2024/newyork
Oct 19 - Oktoberfest - London, UK - https://datasciencefestival.com/event/oktoberfest-2024/
Oct 19 - INFORMS Workshop on Data Science 2024 - Seattle, Washington, USA - https://sites.google.com/view/data-science-2024
Oct 20-25 - TDWI Transform - Orlando, USA - https://tdwi.org/events/conferences/orlando/information/sell-your-boss.aspx
Oct 21-25 - SIAM Conference on Mathematics of Data Science (MDS24) - Atlanta, USA - https://www.siam.org/conferences-events/siam-conferences/mds24/
Oct 23-24 - NDSML Summit 2024 + AI2R Expo - Stockholm, Sweden - https://ndsmlsummit.com/
Oct 28-29 - Cyber Security Summit - São Paulo, Brazil - https://www.cybersecuritysummit.com.br/index.php
Oct 29-31 - ODSC West - California, United States - https://odsc.com/
💡😎3 lesser-known but very useful visualization libraries
Supertree is a Python library designed for interactive and convenient visualization of decision trees in Jupyter Notebooks, Jupyter Lab, Google Colab and other notebooks that support HTML rendering. With this tool, you can not only visualize decision trees, but also interact with them directly in the notebook.
Mycelium is a library for creating graphical visualizations of machine learning models or any other directed acyclic graphs. It also provides the ability to use the Talaria graph viewer to visualize and optimize models.
TensorHue is a Python library designed to visualize tensors directly in the console, which makes them easier to analyze and debug.
GitHub
GitHub - mljar/supertree: Visualize decision trees in Python
😎⚡️A powerful dataset generated using Claude Opus.
Synthia-v1.5-I is a dataset of over 20,000 technical questions and answers designed for training large language models (LLMs). It includes Orca-style system prompts to encourage the generation of diverse answers. The dataset can be used to train models to answer technical questions more accurately and comprehensively, improving their performance on a variety of technical and engineering problems.
✅To load the dataset using Python:
from datasets import load_dataset
ds = load_dataset("migtissera/Synthia-v1.5-I")
In which of the following cases is data normalization applied?
Anonymous Poll
37%: Normalizing the data to a normal distribution
20%: Reducing data dimensionality
39%: For numerical features, especially in algorithms that are sensitive to the scale of the data
4%: To be able to perform linear interpolation of numerical features
⚡️HTTP SQLite StarbaseDB
StarbaseDB is a powerful and scalable open source database that is based on SQLite and runs over the HTTP protocol. This database is built to run in a cloud environment (e.g. on Cloudflare), allowing it to scale efficiently down to zero based on load. Key benefits of StarbaseDB include:
✅Ease of use: Provides the ability to work through HTTP requests, making it easy to integrate with various systems and services.
✅Scalability: Automatically adjusts to load, scaling both up and down (including down to zero).
✅Support for SQLite: Uses the time-tested, lightweight SQLite database for data storage.
✅Open Source: The code is open, allowing developers to customize and improve the system to suit their needs.
It is suitable for developers who are looking for a simple and reliable way to run databases with minimal configuration and high availability on cloud platforms such as Cloudflare.
GitHub
GitHub - outerbase/starbasedb: HTTP SQLite scale-to-zero database on the edge built on Cloudflare Durable Objects.
💡 News of the day: MongoDB creates AI partner ecosystem
MongoDB is actively adapting to the challenges of artificial intelligence development by introducing an improved version of its database (8.0) and launching the MongoDB AI Application Program (MAAP). The program is intended to build a global partner ecosystem for standardizing AI solutions. Key partners include major cloud and consulting players such as Microsoft Azure, Google Cloud Platform, Amazon Web Services, and Accenture, as well as AI companies Anthropic and Fireworks AI.
Updates to MongoDB 8.0 promise notable performance improvements:
✅ A 32% increase in throughput.
✅ A 56% speedup of batch writes.
✅ A 20% increase in parallel write speed.
This gives MongoDB the ability to better handle the high loads often encountered with big data and AI. Solutions have already been deployed for large companies, including one of France's leading automakers and a global home appliance manufacturer.
In this way, MongoDB, by building MAAP and improving its technology, aims to become a key player in the AI industry, supporting developers and companies in their quest for innovation.
🔎Read more here
IT Europa
MongoDB builds AI partner ecosystem to reverse failures in the field
😎 Optimizing Analytics with Oracle
Oracle has published a blog post explaining how to connect to an Oracle Big Data Service (BDS) cluster using Hive and Spark connections from Oracle Analytics Cloud (OAC).
Oracle Big Data Service clusters contain a Hadoop Distributed File System (HDFS) and a Hive database that load and transform data from different sources and in different formats (structured, semi-structured, and unstructured).
Learn how to connect Oracle Analytics Cloud to Oracle Big Data Service using Hive and Spark to improve data analytics. Combining powerful tools can help you efficiently process and visualize large amounts of data.
Oracle
Connect Oracle Analytics Cloud to Oracle Big Data Service with Hive and Spark for Enhanced Data Insights
😎Top Python libraries for optimizing work with data
✅Pony ORM is a convenient and powerful object-relational mapper (ORM) for relational databases that lets you write queries using Python syntax. It automatically converts Python code into SQL queries, which makes interaction with databases more intuitive and concise. Pony ORM supports major DBMSs such as PostgreSQL, MySQL, SQLite and others, providing flexibility when creating queries and working with data models.
✅Pypika is a library for creating SQL queries programmatically in Python, which allows you to avoid errors in hand-writing SQL code and protects against SQL injections. It is especially useful for building dynamic and parameterized queries, making it an ideal tool for database applications. Pypika allows you to build queries with a high degree of detail and complexity, while maintaining the readability and security of your code.
✅EdgeDB is a modern database and client library for Python that simplifies managing data schemas and writing queries. It offers a more intuitive and convenient approach compared to traditional SQL databases, providing advanced capabilities for working with data. Key features of EdgeDB include automatic schema generation, working with relational data without the need to write complex SQL queries, as well as support for type safety and a more expressive syntax for manipulating data.
✅Tortoise ORM is a modern asynchronous ORM (Object-Relational Mapping) designed for working with databases in asynchronous Python applications. It supports various relational databases such as PostgreSQL, MySQL, SQLite, and is written with an emphasis on simplicity and ease of use. Tortoise ORM allows you to build complex SQL queries using Python code, automatically synchronizing data models with the database. Support for asynchrony makes it especially useful in high-load or web applications where it is important to efficiently manage resources and database queries.
✅Polars is a high-performance data processing and analysis library in Python and Rust, focused on working with large volumes of data. Thanks to multithreading and an optimized architecture, Polars provides significantly higher execution speeds compared to traditional tools such as Pandas. The library supports a wide range of operations on tabular data (dataframes), offering an intuitive interface for filtering, aggregating and transforming data. It is ideal for tasks that require high performance, especially when working with large data sets.
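Several of the libraries above build parameterized SQL from Python code. As a minimal sketch of what parameterization means, the example below uses only the standard-library sqlite3 module rather than any of the listed libraries; the `users` table, its rows, and the `find_user` helper are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with a hypothetical "users" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user(connection, name):
    """Look up users by name, passing the value as a bound parameter."""
    # The "?" placeholder keeps the user-supplied value out of the SQL text.
    cursor = connection.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    )
    return cursor.fetchall()


normal_rows = find_user(conn, "alice")
# An injection attempt is compared as a plain string, so it matches no row.
injected_rows = find_user(conn, "alice' OR '1'='1")
```

Query builders and ORMs aim to generate this kind of statement for you, so the query text and the values travel separately.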
Which of these actions could disrupt the distribution of values in the data when preparing it for model training?
Anonymous Poll
27%
Scaling data using standardization
34%
Applying a logarithmic transformation to positive numbers
17%
Shuffling rows in a sample
22%
Removing standard deviation outliers
🔥A small selection of data annotation tools with all the details
CVAT (Computer Vision Annotation Tool) is one of the most popular and sought-after image annotation tools used to create datasets in the field of computer vision.
Advantages of CVAT:
✅Customization: CVAT, as an open-source solution, gives users complete freedom to customize the platform to their needs. This makes the tool flexible and adaptable, allowing it to be integrated into various workflows. The CVAT documentation provides detailed instructions on customization, making the setup process more accessible even for beginners.
✅Detailed documentation: CVAT documentation includes detailed descriptions of functionality, use cases, tips, and images. Regular documentation updates ensure that users are always aware of the latest changes and improvements.
Disadvantages of CVAT:
✅High resource requirements: One of the main disadvantages of CVAT is its high server resource requirements, which can be a problem for some teams.
Supervisely is a multi-functional platform for working with computer vision projects, offering solutions for the entire lifecycle of AI projects, from data labeling to model training and deployment.
Advantages:
✅A rich ecosystem of applications: Supervisely Apps already offers many ready-made widgets that allow you to extend the functionality of any part of the platform. Each of them is open source and available on GitHub, which makes it possible not only to modify existing applications but also to create new ones.
Disadvantages:
✅High cost: Despite its extensive capabilities, Supervisely may be a less cost-effective choice than other tools.
Label Studio is a powerful and flexible open-source tool for data annotation in various machine learning tasks, including computer vision, text, and audio processing. It is used to label data for subsequent training of models.
Advantages:
✅Flexibility: Users can create labels themselves using code, which opens up new possibilities for customization.
✅Extensibility: The modular structure allows for easy addition of new features and integration of additional label types.
Disadvantages:
✅High resource requirements: Label Studio may require a significant amount of resources to use fully, which makes it less convenient for users with limited computing resources.
✅Limitations in bounding box labeling: CVAT, for example, offers a faster and more convenient tool for bounding box labeling, whereas Label Studio is better suited to labeling audio data.
CVAT.ai
Leading Image & Video Data Annotation Platform | CVAT
Annotate smarter with CVAT, the industry-leading data annotation platform for machine learning. Used and trusted by teams at any scale, for data of any scale.
💡🔥Working with geographic data efficiently
GeoPy is a Python library that allows you to work with geographic data and provides tools for performing tasks such as geocoding (converting addresses to coordinates), reverse geocoding (converting coordinates to addresses), and calculating distances between geographic points.
😎Main features of working with geodata via GeoPy:
✅Geocoding: Converts addresses or places into geographic coordinates (latitude and longitude). This is useful when you need to, for example, visualize data on a map.
✅Reverse geocoding: Converts coordinates into a human-readable address. This can be useful for creating more understandable data or interfaces.
✅Distance calculation: Computes the distance between two geographic points, for example to measure how far apart two addresses are.
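To make the distance calculation mentioned above concrete, here is a minimal sketch of the haversine (great-circle) formula using only the standard library. This is not GeoPy's own code: GeoPy exposes distance calculations through its `geopy.distance` module (for example `geodesic` and `great_circle`), and the spherical model below is a simpler approximation. The helper name `haversine_km`, the radius constant, and the sample coordinates are assumptions for illustration.

```python
import math

# Mean Earth radius in km; treating the Earth as a sphere is an approximation.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-center coordinates as (latitude, longitude).
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
distance = haversine_km(*paris, *london)
```

With these coordinates the result is in the region of 340 km. An ellipsoidal (geodesic) model gives slightly different values, so in practice a library routine is usually preferable.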
🖥You can learn more about geographic data analysis from this article
Medium
Handling Location Features Effectively with GeoPy
In most machine learning tasks, cleaning and standardizing data before modeling is crucial, especially when working with location features…
😎NVIDIA has published a new dataset for fine-tuning models
HelpSteer2 is an English-language dataset developed by NVIDIA and hosted on the Hugging Face platform. It includes 21,362 rows and is designed to train reward models that help improve the utility, factual accuracy, and coherence of answers generated by large language models (LLMs).
Each row in the dataset contains a query, a response, and five human-annotated response attributes:
✅ Helpfulness (usefulness)
✅ Correctness
✅ Coherence
✅ Complexity
✅ Verbosity
The dataset can be used to fine-tune LLMs to generate more relevant and better responses to user queries.
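As a rough sketch of how rows like these might feed reward-model training, the example below groups rated responses by query and pairs the better-rated response with the worse one. The field names, the sample rows, and the `preference_pairs` helper are assumptions for illustration and are not taken from the dataset itself; ranking by the helpfulness score alone is also just one possible choice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RatedResponse:
    """One annotated row: a query, a response, and five attribute scores."""
    prompt: str
    response: str
    helpfulness: int
    correctness: int
    coherence: int
    complexity: int
    verbosity: int


# Made-up rows for illustration only; not taken from HelpSteer2.
rows = [
    RatedResponse("What is overfitting?",
                  "When a model memorizes training data and generalizes poorly.",
                  4, 4, 4, 2, 1),
    RatedResponse("What is overfitting?", "It is a thing in ML.", 1, 2, 3, 1, 0),
]


def preference_pairs(rated):
    """Return (prompt, chosen, rejected) tuples ranked by helpfulness."""
    by_prompt = defaultdict(list)
    for row in rated:
        by_prompt[row.prompt].append(row)
    pairs = []
    for prompt, candidates in by_prompt.items():
        ranked = sorted(candidates, key=lambda r: r.helpfulness, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Skip prompts with a single response or tied scores.
        if best.helpfulness > worst.helpfulness:
            pairs.append((prompt, best.response, worst.response))
    return pairs


pairs = preference_pairs(rows)
```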
huggingface.co
nvidia/HelpSteer2 · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Which of the following would be considered a sign of multicollinearity?
Anonymous Poll
26%
High value of variance of variables
32%
Weak correlation between independent variables
29%
High value of VIF coefficient
13%
Difference in mean values of categorical variables
🌎TOP DS-events all over the world in November
Nov 4-8 - PASS Data Community Summit 2024 - Seattle, USA - https://passdatacommunitysummit.com/
Nov 6 - Enterprise AI & Big Data - London, UK - https://whitehallmedia.co.uk/bdanov2024/
Nov 6-8 - PyData NYC, New York, USA - https://pydata.org/nyc2024
Nov 7 - Data Science Day 2024 - https://events.altair.com/data-science-day-2024/
Nov 7 - Data & Analytics Congres 2024 - Liemes, Utrecht - https://datainsightsnetwork.nl/events/dac-2024/
Nov 14 - IMPACT: The Data Observability Summit - Online - https://impactdatasummit.com/
Nov 18-19 - Machine Learning Week Europe - Munich, Germany - https://machinelearningweek.eu/
Nov 18-22 - LEADING GLOBAL AI EVENT - Belgrade, Serbia - https://datasciconference.com/
Nov 18-22 - QCon - San Francisco, USA - https://qconsf.com/
Nov 20 - Tech & AI LIVE 2024 - New York, USA - https://live.technologymagazine.com/tech-ai-newyork-2024/
Nov 20-23 - FMLDS - Sydney, Australia - https://www.fmlds.org/
Nov 20-21 - Data & Analytics Insight Summit - San Diego, USA - https://gdsgroup.com/events/physical-summit/data-analytics-na-nov-24/
Nov 21 - Data Science Summit - Warsaw, Poland - https://dssconf.pl/
Nov 28-29 - AI ML, Data Science & Robotics Conferences 2024 - Porto, Portugal - https://aiml.events/events/ai-ml-data-science-robotics-conferences-2024
PASS Data Community Summit
PASS Data Community Summit is the year's largest gathering of data platform professionals.
💡A small selection of useful things for working with Big Data
postgres-backup-local is a Docker tool for creating backups of PostgreSQL databases, storing them in the local file system with the ability to flexibly manage copies. With its help, you can back up multiple databases from one server by specifying their names through the POSTGRES_DB environment variable (separated by a comma or space).
The tool supports webhooks before and after backup, automatically manages the rotation and deletion of old copies, and is also available for Linux architectures, including amd64, arm64, arm/v7, s390x, and ppc64le.
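To illustrate the "separated by a comma or space" convention for POSTGRES_DB, here is a small sketch of how such a value could be split into individual database names. It is not the tool's own implementation (the image is configured through environment variables, not Python); the `parse_database_names` helper and the sample database names are hypothetical.

```python
import re


def parse_database_names(value):
    """Split a POSTGRES_DB-style value on commas and/or whitespace."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


names_comma = parse_database_names("sales,analytics")
names_space = parse_database_names("sales analytics  logs")
```

Under this convention, each resulting name would be backed up as a separate database from the same server.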
EfCore.SchemaCompare is a tool for comparing database schemas in Entity Framework Core (EF Core), allowing you to find and analyze differences between the current database and migrations. It provides a convenient way to track changes in data structures, which helps prevent errors caused by schema mismatches during application development.
It is suitable for database versioning and is especially useful when developing and upgrading EF Core-based applications.
Greenmask is an open-source tool for PostgreSQL designed for masking, obfuscation, and logical backup of data. It allows you to anonymize sensitive information in database dumps, making it useful for preparing data for use in non-production environments such as development and testing. Greenmask helps protect data by meeting privacy requirements and reducing the risk of leaks during development.
GitHub
GitHub - prodrigestivill/docker-postgres-backup-local: Backup PostgresSQL to local filesystem with periodic backups and rotate…
Backup PostgresSQL to local filesystem with periodic backups and rotate backups. - prodrigestivill/docker-postgres-backup-local