Big Data Science

😎 Optimizing Analytics with Oracle

Oracle published a blog post explaining how to connect Oracle Analytics Cloud (OAC) to an Oracle Big Data Service (BDS) cluster using Hive and Spark connections.

Oracle Big Data Service clusters contain a Hadoop Distributed File System (HDFS) and a Hive database that load and transform data from different sources and in different formats (structured, semi-structured, and unstructured).

Learn how to connect Oracle Analytics Cloud to Oracle Big Data Service using Hive and Spark to improve data analytics. Combining powerful tools can help you efficiently process and visualize large amounts of data.

Oracle

Connect Oracle Analytics Cloud to Oracle Big Data Service with Hive and Spark for Enhanced Data Insights

❤1

893 views15:59

Big Data Science

😎Top Python libraries for optimizing work with data

✅Pony ORM is a convenient and powerful library for working with object-relational databases, which allows you to write SQL queries using Python syntax. It automatically converts Python code into SQL queries, which simplifies interaction with databases, making it more intuitive and concise. Pony ORM supports major DBMSs such as PostgreSQL, MySQL, SQLite and others, providing flexibility and convenience when creating queries and working with data models.

✅Pypika is a library for creating SQL queries programmatically in Python, which allows you to avoid errors in hand-writing SQL code and protects against SQL injections. It is especially useful for building dynamic and parameterized queries, making it an ideal tool for database applications. Pypika allows you to build queries with a high degree of detail and complexity, while maintaining the readability and security of your code.

✅EdgeDB is a modern database and client library for Python that simplifies managing data schemas and writing queries. It offers a more intuitive and convenient approach compared to traditional SQL databases, providing advanced capabilities for working with data. Key features of EdgeDB include automatic schema generation, working with relational data without the need to write complex SQL queries, as well as support for type safety and a more expressive syntax for manipulating data.

✅Tortoise ORM is a modern asynchronous ORM (Object-Relational Mapping) designed for working with databases in asynchronous Python applications. It supports various relational databases such as PostgreSQL, MySQL, SQLite, and is written with an emphasis on simplicity and ease of use. Tortoise ORM allows you to build complex SQL queries using Python code, automatically synchronizing data models with the database. Support for asynchrony makes it especially useful in high-load or web applications where it is important to efficiently manage resources and database queries.

✅Polars is a high-performance data processing and analysis library in Python and Rust, focused on working with large volumes of data. Thanks to multithreading and an optimized architecture, Polars provides significantly higher execution speeds compared to traditional tools such as Pandas. The library supports a wide range of operations on tabular data (dataframes), offering an intuitive interface for filtering, aggregating and transforming data. It is ideal for tasks that require high performance, especially when working with large data sets.
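
The protection against SQL injection mentioned for Pypika above rests on parameterized queries. Here is a minimal sketch of that idea using Python's built-in sqlite3 module instead of Pypika itself; the `users` table and the `find_user` helper are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with a hypothetical "users" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user(name):
    # The "?" placeholder binds the value separately from the SQL text,
    # so the input is treated as data and is never parsed as SQL.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


rows_ok = find_user("alice")
# A classic injection attempt is matched literally and finds nothing.
rows_injection = find_user("alice' OR '1'='1")
print(rows_ok, rows_injection)
```

Query builders and ORMs generate this kind of statement for you, which is why concatenating user input into SQL strings by hand is rarely necessary.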

1K views15:59

Big Data Science

Which of these actions could disrupt the distribution of values in the data when preparing it for model training?

Anonymous Poll

27%

Scaling data using standardization

34%

Applying a logarithmic transformation to positive numbers

17%

Shuffling rows in a sample

22%

Removing standard deviation outliers
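
One way to explore the poll options is to compare the skewness of a sample before and after each action. The sketch below uses only the standard library; the log-normal sample and the 3-sigma outlier rule are assumptions chosen for illustration, not part of the poll.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


random.seed(0)
# Hypothetical right-skewed positive data (log-normal sample).
data = [random.lognormvariate(0, 1) for _ in range(5000)]

mean, std = statistics.fmean(data), statistics.pstdev(data)
standardized = [(v - mean) / std for v in data]          # linear rescaling
logged = [math.log(v) for v in data]                     # non-linear transform
shuffled = random.sample(data, len(data))                # same values, new order
trimmed = [v for v in data if abs(v - mean) <= 3 * std]  # assumed 3-sigma rule

skew_raw = skewness(data)
skew_std = skewness(standardized)
skew_log = skewness(logged)
skew_shuffled = skewness(shuffled)
skew_trimmed = skewness(trimmed)
print(skew_raw, skew_std, skew_log, skew_shuffled, skew_trimmed)
```

Standardization and shuffling leave the skewness unchanged, while the logarithm and the outlier filter both change the shape of the sample.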

41 voters818 views15:59

Big Data Science

🔥A small selection of data annotation tools with all the details

CVAT (Computer Vision Annotation Tool) is one of the most popular and sought-after image annotation tools used to create datasets in the field of computer vision.

Advantages of CVAT:
✅Customization: CVAT, as an open-source solution, gives users complete freedom to customize the platform to their needs. This makes the tool flexible and adaptable, allowing it to be integrated into various workflows. The CVAT documentation provides detailed instructions on customization, making the setup process more accessible even for beginners.

✅Detailed documentation: CVAT documentation includes detailed descriptions of functionality, use cases, tips, and images. Regular documentation updates ensure that users are always aware of the latest changes and improvements.

Disadvantages of CVAT:
✅High resource requirements: One of the main disadvantages of CVAT is its high server resource requirements, which can be a problem for some teams.

Supervisely is a multi-functional platform for working with computer vision projects, offering solutions for the entire lifecycle of AI projects, from data labeling to model training and deployment.

Advantages:
✅A rich ecosystem of applications: Supervisely Apps already offers many ready-made widgets that allow you to extend the functionality of any part of the platform. Each of them is open source and available on GitHub, which makes it possible not only to modify existing applications but also to create new ones.

Disadvantages:
✅High cost: Despite its extensive capabilities, Supervisely may be a less cost-effective choice than other tools.

Label Studio is a powerful and flexible open-source tool for data annotation in various machine learning tasks, including computer vision, text, and audio processing. It is used to label data for subsequent training of models.

Advantages:
✅Flexibility: Users can create labels themselves using code, which opens up new possibilities for customization.
✅Extensibility: The modular structure allows for easy addition of new features and integration of additional label types.

Disadvantages:
✅High resource requirements: Label Studio may require a significant amount of resources to use fully, which makes it less convenient for users with limited hardware.
✅Limitations in bounding-box labeling: CVAT offers a faster and more convenient tool for labeling bounding boxes, while Label Studio is better suited to labeling audio data.

CVAT.ai

Leading Data Annotation Platform for Images, Videos and 3D | CVAT

Annotate smarter with CVAT, the industry-leading visual data annotation platform for machine learning. Used and trusted by teams at any scale, for data of any scale.

829 views15:59

Big Data Science

💡🔥Working with geographic data efficiently

GeoPy is a Python library that allows you to work with geographic data and provides tools for performing tasks such as geocoding (converting addresses to coordinates), reverse geocoding (converting coordinates to addresses), and calculating distances between geographic points.

😎Main features of working with geodata via GeoPy:

✅Geocoding: Converts addresses or places into geographic coordinates (latitude and longitude). This is useful when you need to, for example, visualize data on a map.

✅Reverse geocoding: Converts coordinates into a human-readable address. This can be useful for creating more understandable data or interfaces.

✅Distance calculation: Computes the distance between two geographic points given their coordinates.
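
The distance calculation that GeoPy offers can be approximated with the haversine formula. The sketch below is plain Python and not GeoPy's own code (GeoPy exposes this through `geopy.distance.great_circle` and the more precise `geopy.distance.geodesic`). The city coordinates are approximate values used for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


paris = (48.8566, 2.3522)     # approximate coordinates
london = (51.5074, -0.1278)   # approximate coordinates
distance = haversine_km(*paris, *london)
print(f"Paris to London: {distance:.1f} km")
```

Geocoding and reverse geocoding, unlike distance calculation, call an external service, so they need network access and are subject to that service's usage policy.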

🖥You can learn more about geographic data analysis from this article

Medium

Handling Location Features Effectively with GeoPy

In most machine learning tasks, cleaning and standardizing data before modeling is crucial, especially when working with location features…

889 views15:06

Big Data Science

😎NVIDIA has published a new dataset for fine-tuning models

HelpSteer2 is an English-language dataset developed by NVIDIA and hosted on the Hugging Face platform. It includes 21,362 rows and is designed to train reward models that help improve the helpfulness, factual accuracy, and coherence of answers generated by large language models (LLMs).

Each row in the dataset contains a prompt, a response, and five human-annotated response attributes:
✅Helpfulness
✅Correctness
✅Coherence
✅Complexity
✅Verbosity

The dataset can be used to fine-tune LLMs to generate more relevant and better responses to user queries.

huggingface.co

nvidia/HelpSteer2 · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

👍2

1.05K views15:59

Big Data Science

Which of the following would be considered a sign of multicollinearity?

Anonymous Poll

26%

High value of variance of variables

32%

Weak correlation between independent variables

29%

High value of VIF coefficient

13%

Difference in mean values of categorical variables
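
For reference, the variance inflation factor (VIF) of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With exactly two predictors, R² equals the squared Pearson correlation, which allows a standard-library sketch. The synthetic data and helper names below are assumptions for illustration.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    # With only two predictors, R^2 of regressing one on the other is r^2.
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(1000)]
independent = [random.gauss(0, 1) for _ in range(1000)]   # unrelated to x1
collinear = [2 * a + random.gauss(0, 0.1) for a in x1]    # almost a multiple of x1

vif_low = vif_two_predictors(x1, independent)
vif_high = vif_two_predictors(x1, collinear)
print(vif_low, vif_high)
```

A VIF close to 1 means the predictor shares little information with the others. Common rules of thumb treat values above 5 or 10 as a warning sign.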

38 voters850 views15:59

Big Data Science

🌎TOP DS-events all over the world in November
Nov 4-8 - PASS Data Community Summit 2024 - Seattle, USA - https://passdatacommunitysummit.com/
Nov 6 - Enterprise AI & Big Data - London, UK - https://whitehallmedia.co.uk/bdanov2024/
Nov 6-8 - PyData NYC - New York, USA - https://pydata.org/nyc2024
Nov 7 - Data Science Day 2024 - https://events.altair.com/data-science-day-2024/
Nov 7 - Data & Analytics Congres 2024 - Liemes, Utrecht - https://datainsightsnetwork.nl/events/dac-2024/
Nov 14 - IMPACT: The Data Observability Summit - Online - https://impactdatasummit.com/
Nov 18-19 - Machine Learning Week Europe - Munich, Germany - https://machinelearningweek.eu/
Nov 18-22 - LEADING GLOBAL AI EVENT - Belgrade, Serbia - https://datasciconference.com/
Nov 18-22 - QCon - San Francisco, USA - https://qconsf.com/
Nov 20 - Tech & AI LIVE 2024 - New York, USA - https://live.technologymagazine.com/tech-ai-newyork-2024/
Nov 20-23 - FMLDS - Sydney, Australia - https://www.fmlds.org/
Nov 20-21 - Data & Analytics Insight Summit - San Diego, USA - https://gdsgroup.com/events/physical-summit/data-analytics-na-nov-24/
Nov 21 - Data Science Summit - Warsaw, Poland - https://dssconf.pl/
Nov 28-29 - AI ML, Data Science & Robotics Conferences 2024 - Porto, Portugal - https://aiml.events/events/ai-ml-data-science-robotics-conferences-2024

PASS Data Community Summit

PASS Data Community Summit is the year's largest gathering of data platform professionals.

911 views15:59

Big Data Science

💡A small selection of useful things for working with Big Data

postgres-backup-local is a Docker tool for creating backups of PostgreSQL databases, storing them in the local file system with the ability to flexibly manage copies. With its help, you can back up multiple databases from one server by specifying their names through the POSTGRES_DB environment variable (separated by a comma or space).
The tool supports webhooks before and after backup, automatically manages the rotation and deletion of old copies, and is also available for Linux architectures, including amd64, arm64, arm/v7, s390x, and ppc64le.

EfCore.SchemaCompare is a tool for comparing database schemas in Entity Framework Core (EF Core), allowing you to find and analyze differences between the current database and migrations. It provides a convenient way to track changes in data structures, which helps prevent errors caused by schema mismatches during application development.
Suitable for database versioning, especially useful when developing and upgrading EF Core-based applications.

Greenmask is an open-source tool for PostgreSQL designed for masking, obfuscation, and logical backup of data. It allows you to anonymize sensitive information in database dumps, making it useful for preparing data for use in non-production environments such as development and testing. Greenmask helps protect data by meeting privacy requirements and reducing the risk of leaks during development.

GitHub

GitHub - prodrigestivill/docker-postgres-backup-local: Backup PostgresSQL to local filesystem with periodic backups and rotate…

Backup PostgresSQL to local filesystem with periodic backups and rotate backups. - prodrigestivill/docker-postgres-backup-local

854 views15:59

Big Data Science

😎How Spotify accelerated data annotation for ML by 10x

Spotify shared how it accelerated data annotation for machine learning models by combining large language models (LLMs) with the work of human annotators. Automated initial labeling by an LLM significantly reduced processing time by letting annotators focus on complex or ambiguous cases. This combined approach tripled process throughput and reduced costs. The scalable solution is especially relevant for a rapidly growing platform and is used to monitor compliance with service rules and policies.

💡 Spotify's data annotation strategy is based on three core principles:

✅Scaling human expertise: annotators validate and refine results to improve data accuracy.

✅Annotation tools: creating efficient tools that simplify the work of annotators and allow models to be integrated more quickly into the process.

✅Fundamental infrastructure and integration: the platform is designed to handle large amounts of data in parallel and run dozens of projects simultaneously.

This approach has allowed Spotify to run multiple projects simultaneously, reduce costs, and maintain high accuracy.
More information about Spotify's solution can be found in their engineering blog post.

Spotify Engineering

How We Generated Millions of Content Annotations

941 views15:59

Big Data Science

😂A Radical Solution from AI

Every day, thousands of programmers breathe a sigh of relief as AI handles routine tasks for them, such as writing queries or formatting data😁

🖥ChatGPT was asked to write SQL queries for a store database. The answer was hilarious

😎Sometimes AI's views on solving a particular problem are slightly different from human ones

👍1😁1

998 views15:59

Big Data Science

What happens to the data after standardization is applied?

Anonymous Poll

35%

They get a minimum value of 0 and a maximum value of 1

49%

The mean becomes 0 and the standard deviation becomes 1

10%

All data are rounded to integer values

The data is sorted in ascending order
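
For comparison, here is a small standard-library sketch that applies standardization (z-scores) and min-max scaling to the same values. The sample values are hypothetical.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature values

# Standardization: subtract the mean, divide by the standard deviation.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standardized = [(v - mean) / std for v in values]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

print(statistics.fmean(standardized), statistics.pstdev(standardized))
print(min(min_max_scaled), max(min_max_scaled))
```

The two rescalings are easy to confuse: standardization fixes the mean and spread but leaves the range unbounded, while min-max scaling fixes the range.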

51 voters817 views15:59

Big Data Science

😎The Power of Data: Analyzing Quarterly Revenue Growth for Business Success

💡I recently came across an article in which the author talks about analyzing quarterly revenue growth. He argues that focusing only on annual data can hide trends and slow down decision making. Quarterly analysis allows you to better understand the current performance of the business and identify potential problems, such as a decrease in revenue in a certain period. This granularity helps you identify causes (such as seasonal fluctuations or marketing shortcomings) faster and take action faster than when analyzing only annual data. Quarterly data creates a foundation for optimizing growth strategies, moving from reactive to more effective data-driven management.

The author also highlights key metrics for analyzing quarterly revenue growth:

✅Customer Acquisition Cost (CAC): It is important to understand the cost of acquiring new customers to optimize marketing and sales efforts, which helps increase ROI and revenue growth.
✅Customer Lifetime Value (CLTV): This metric shows the total revenue a customer brings in over their entire relationship with the company, helping to identify high-yield segments for targeting and retention.
✅Sales Conversion: Analyzing conversion at each stage of the funnel helps identify bottlenecks and improve overall sales efficiency, which contributes to revenue growth.
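
As a rough illustration of why quarterly figures reveal more than an annual total, the sketch below computes quarter-over-quarter growth and a simple CAC. All figures and function names are hypothetical, and the CAC formula (spend divided by new customers) is the common textbook definition rather than something taken from the article.

```python
def qoq_growth(revenues):
    """Quarter-over-quarter growth rates, one per consecutive pair of quarters."""
    return [(cur - prev) / prev for prev, cur in zip(revenues, revenues[1:])]


def customer_acquisition_cost(marketing_and_sales_spend, new_customers):
    """CAC: total acquisition spend divided by the number of new customers."""
    return marketing_and_sales_spend / new_customers


# Hypothetical quarterly revenue figures.
quarterly_revenue = {"Q1": 100_000, "Q2": 120_000, "Q3": 90_000, "Q4": 135_000}

growth = qoq_growth(list(quarterly_revenue.values()))
annual_total = sum(quarterly_revenue.values())
cac = customer_acquisition_cost(30_000, 150)

for quarter, rate in zip(list(quarterly_revenue)[1:], growth):
    print(f"{quarter}: {rate:+.0%}")
print(f"Annual total: {annual_total}, CAC: {cac:.2f}")
```

The annual total alone would hide the drop in Q3, which is exactly the kind of signal the quarterly view is meant to surface.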

🖥Link to the article

Medium

The Power of Data: Analyzing Quarterly Revenue Growth for Business Success

Beyond the Numbers: Drive Business Growth with Quarterly Revenue Analysis

👍1

855 views15:59

Big Data Science

🧐Lex Fridman interviews Anthropic CEO Dario Amodei

😎Highlights:

✅Dario expressed optimism about the imminent emergence of AI capable of reaching human levels. He noted that development and training costs will increase in the coming years, and by 2027, clusters will likely be built worth around $100 billion - significantly larger than the current largest supercomputers, which cost around $1 billion.

✅Amodei believes that models will continue to scale, despite the lack of a theoretical explanation for this process - there is, according to him, some "magic" in it.

✅AI models are currently improving at an astonishing rate, especially in areas such as programming, physics, and mathematics. On the SWE-bench test, their success at the beginning of the year was only 2-3%, and now reaches about 50%. The main concern in these conditions is the possible monopoly on AI, when control over it ends up in a small number of large companies, which could threaten

🖥You can watch the interview here

893 views15:59

Big Data Science

Why can the visualization produced by the t-SNE method differ from run to run?

Anonymous Poll

66%

It uses a stochastic approach for optimization

16%

The algorithm is sensitive to the size of the input data

16%

Algorithm is dependent on the test data sampling

Results display is based on linear transformations

44 voters749 views15:59

Big Data Science

🔎 Optimizing search in MongoDB

MongoDB is a non-relational database that differs from SQL databases such as PostgreSQL or MySQL in its structure. Instead of tables with columns and rows, MongoDB uses collections.

Text search in MongoDB relies on special query operators for working with text data. They let you search collections for words or phrases and return the documents that contain them. Text search is often combined with more complex operations in which data is grouped by common attributes such as price, author, or age.
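
A text query in MongoDB is expressed as a filter document built around the `$text` operator. The sketch below only builds such documents as Python dictionaries and does not connect to a database. The `price` and `title` field names and the helper function are hypothetical, and the commented PyMongo calls assume a collection with a text index.

```python
def build_text_search_filter(phrase, max_price=None):
    """Return a MongoDB filter document for a $text search.

    `price` is a hypothetical field name used only for illustration.
    """
    query = {"$text": {"$search": phrase}}
    if max_price is not None:
        query["price"] = {"$lte": max_price}
    return query


# Projection that exposes the relevance score computed by the text search.
score_projection = {"score": {"$meta": "textScore"}}

filter_doc = build_text_search_filter("data engineering", max_price=50)
print(filter_doc)

# With PyMongo, these documents would be used roughly like this:
#   collection.create_index([("title", "text")])
#   cursor = collection.find(filter_doc, score_projection)
#   cursor = cursor.sort([("score", {"$meta": "textScore"})])
```

The `$text` operator requires a text index on the collection, which is why the index creation call comes first in the commented usage.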

In this article, the author also shares his experience with MongoDB, including the challenges in creating optimal search queries to make them easier to understand for beginners.

The article also mentions Mongoose, a popular ODM (object data modeling) library that simplifies working with MongoDB from Node.js/JavaScript applications. It provides functions for data modeling, schema definition, model validation, and data management.

MongoDB

MongoDB: The World’s Leading Modern Database

Get your ideas to market faster with a flexible, AI-ready database. MongoDB makes working with data easy.

👍1

842 views15:59

Big Data Science

😎💡AlphaQubit from Google: a new standard for accuracy in quantum computing.

Google DeepMind and Google Quantum AI have unveiled AlphaQubit, a decoder that dramatically improves error correction accuracy in quantum computing. Based on a neural network trained on synthetic and real data from the Sycamore processor, AlphaQubit uses the Transformer architecture to identify errors.

Tests have shown that AlphaQubit makes 6% fewer errors than tensor network methods and 30% fewer errors than correlated matching. However, despite the high level of accuracy, real-world speed and scalability issues remain.

✅Link to blog

Google

AlphaQubit tackles one of quantum computing’s biggest challenges

AlphaQubit is an AI-based decoder that identifies quantum computing errors with state-of-the-art accuracy.

👍1

818 views15:59

Big Data Science

🤔CUPED: advantages and disadvantages

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance-reduction technique used to improve the sensitivity of A/B test evaluation. CUPED reduces the variance of metrics by using data collected before the experiment, allowing statistically significant differences to be identified more quickly.

Benefits of CUPED:

✅Reduces variance of metrics: Improves test sensitivity by accounting for prior data.
✅Resource savings: Reduces the sample size required to achieve statistical significance.
✅Faster interpretation of results: Reducing noise allows real effects to be found more quickly.
✅Accounting for seasonality: Using data before the experiment helps account for trends and external factors.

Disadvantages of CUPED:

✅Implementation complexity: Requires knowledge of statistics and proper choice of covariates.
✅Dependence on data quality: Pre-experimental data must be reliable and representative.
✅Necessity of covariates: A significant correlation between metric and predictor is required, otherwise the effect will be minimized.
✅Risk of overestimation: If not properly adjusted, may lead to overestimation of the effect.

Thus, CUPED is particularly useful when it is important to maximize the efficiency of experiments but requires careful data preparation and analysis.
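
The core CUPED adjustment can be sketched in a few lines: each metric value Y is replaced by Y - theta * (X - mean(X)), where X is the pre-experiment covariate and theta = cov(X, Y) / var(X). The example below uses synthetic data for a single group. In a real A/B test, theta is usually estimated on the pooled data of all groups, and the function name and data here are assumptions for illustration.

```python
import random
import statistics


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values and the estimated theta."""
    mean_x = statistics.fmean(covariate)
    mean_y = statistics.fmean(metric)
    # Sample covariance between the covariate and the metric.
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov_xy / statistics.variance(covariate)
    adjusted = [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
    return adjusted, theta


random.seed(42)
# Synthetic users: the in-experiment metric correlates with the pre-period metric.
pre = [random.gauss(100, 20) for _ in range(2000)]
post = [0.8 * x + random.gauss(5, 10) for x in pre]

adjusted, theta = cuped_adjust(post, pre)
var_raw = statistics.variance(post)
var_adj = statistics.variance(adjusted)
print(f"theta={theta:.3f}, variance before={var_raw:.1f}, after={var_adj:.1f}")
```

The adjustment keeps the mean of the metric unchanged and removes the part of the variance explained by the covariate, so the gain depends directly on how strongly the two are correlated.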

👍1

784 views15:59

Big Data Science

🤖Deus in Machina: Jesus-AI has been installed in a Swiss church

St. Peter's Chapel in Lucerne has launched an AI Jesus project that communicates in 100 languages. The AI is installed in the confessional where visitors can ask questions and receive answers in real time.

Trained on theological texts, Jesus-AI engaged more than 1,000 people in two months, two-thirds of whom described the experience as “spiritual.” However, the experiment has drawn criticism for the superficiality of the answers and the inability to have meaningful conversations with the machine.

🖥Read more here

👍1

833 views15:59

Big Data Science

💡 SmolTalk: a synthetic English-language dataset for LLM training

SmolTalk is a synthetic dataset from Hugging Face designed for supervised fine-tuning (SFT) of LLMs. It consists of 2 million rows and was used to develop SmolLM2-Instruct models.

🔥SmolTalk combines new and existing datasets

😎New datasets:

✅Smol-Magpie-Ultra (400k rows)
✅Smol-constraints (36k rows)
✅Smol-rewrite (50k rows)
✅Smol-summarize (101k rows)

⚡️Existing datasets:

✅OpenHermes2.5 (100k rows)
✅MetaMathQA (50k rows)
✅NuminaMath-CoT (1,120k rows)
✅Self-Oss-Starcoder2-Instruct (1,120k rows)
✅SystemChats2.0 (30k rows)
✅LongAlign (samples with fewer than 16k tokens)
✅Everyday-conversations (50k rows)
✅APIGen-Function-Calling (80k rows)
✅Explore-Instruct-Rewriting (30k rows)

📚Training results:
SmolTalk showed significant improvements in model performance, especially on math, programming, and system-prompt-following tasks. Training on SmolTalk gave better results on the IFEval, BBH, GSM8K, and MATH benchmarks, including when training Mistral-7B.

huggingface.co

HuggingFaceTB/smoltalk · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

773 views15:59

Big Data Science

🌎TOP DS-events all over the world in December

Dec 2-5 - TIES 2024 - Adelaide, Australia - https://www.isi-next.org/conferences/ties2024/
Dec 3 - Generation AI - Paris, France - https://dev.events/conferences/generation-ai-c4odjomu
Dec 5 - The International AI Summit 2024 - Brussels, Belgium - https://global-aiconference.com/
Dec 2-6 - Data Science Week 2024 - Fort Wayne, USA - https://sites.google.com/view/data-science-week-2024
Dec 2-6 - AWS re:Invent - Las Vegas, USA - https://reinvent.awsevents.com/
Dec 9-10 - ICMSCS 2024: 18 - London, United Kingdom - https://waset.org/mathematics-statistics-and-computational-sciences-conference-in-december-2024-in-london
Dec 10 - Global Big Data Conference - Online - https://www.globalbigdataconference.com/
Dec 10 - Prompt Engineering Bulgaria 2024 - Sofia, Bulgaria - https://www.eventbrite.nl/e/prompt-engineering-bulgaria-2024-tickets-796563251127?aff=oddtdtcreator
Dec 11 - AI Heroes - Torino, Italy - https://dev.events/conferences/ai-heroes-xxrqdxu9
Dec 11-12 - The AI Summit New York - New York, USA - https://newyork.theaisummit.com/
Dec 12-13 - AI: 2057 - Dubai, UAE - https://www.globalaishow.com/
Dec 15-18 - IEEE International Conference on Big Data 2024 - Washington, D.C., USA - https://www3.cs.stonybrook.edu/~ieeebigdata2024/
Dec 19 - Normandie.ai 2024 - Rouen, France - https://dev.events/conferences/normandie-ai-2024-e15asbe6

dev.events

Generation AI

843 views15:59
