It’s not reasonable to think in terms of ‘death or glory’ for either EDWs or Hadoop. Use the best tool for the job.
Source: https://www.dezyre.com/article/is-hadoop-going-to-replace-data-warehouse/256
#briefly Traditional ETL vs ELT on Hadoop
https://telegra.ph/ETL-vs-Hadoop-ELT-12-13
What's inside:
— Description of typical ETL/ELT processes.
— Advantages and disadvantages of both.
Telegraph
Traditional ETL vs ELT on Hadoop
Source: Bitwise Blog The data warehouse area has been dominated by RDBMSes and traditional ETL tools. But traditional ETL tools are limited by problems related to scalability and cost overruns. ETL: extract-transform-load The ETL process typically extracts…
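The difference boils down to where the transform step runs — on a dedicated ETL server before loading, or inside the target system after loading. A toy sketch of the two flows (all names are illustrative):
```python
# Toy illustration of ETL vs ELT; the "warehouse" and "lake" are plain lists here.

def extract():
    # Pretend these rows came from a source system.
    return [{"user_id": 1, "amount": "10.50"}, {"user_id": 2, "amount": "3.00"}]

def transform(rows):
    # Cleaning/typing step: parse string amounts into numbers.
    return [{**r, "amount": float(r["amount"])} for r in rows]

# ETL: transform on the way in; only clean data reaches the warehouse.
warehouse = transform(extract())

# ELT: load the raw data first, transform later inside the target --
# on Hadoop this step would be a Hive/Spark job over the raw files.
data_lake = extract()
curated = transform(data_lake)

print(warehouse == curated)  # True: same result, different place and time
```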
#briefly Aiohttp from its author
https://telegra.ph/Aiohttp-ot-avtora-PyCon-Russia-2018-04-13
Before data can land in a DWH, it has to be requested from the sources. I use aiohttp heavily for this.
The link leads to my notes on a talk by Andrew Svetlov, the developer of aiohttp, in which he shares advice on using the library correctly.
Telegraph
"Aiohttp from its author" (PyCon Russia 2018)
7:16 asyncio.Future is needed only by library authors; "ordinary" users are not expected to touch it. 8:45 aiohttp is not meant to speed up synchronous web frameworks (Django, Flask), since mixing synchronous and asynchronous code…
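A minimal sketch of this extraction pattern (the endpoints are hypothetical; reusing a single ClientSession and setting an explicit timeout follow the library's documented guidance):
```python
import asyncio
import aiohttp

# Hypothetical source endpoints; replace with your real extraction URLs.
URLS = ["https://example.com/api/orders", "https://example.com/api/users"]

async def fetch_json(session: aiohttp.ClientSession, url: str):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main():
    # One session is reused for all requests instead of one per request.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_json(session, u) for u in URLS))

results = asyncio.run(main())
```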
#briefly #aws Serverless data lake on AWS
https://telegra.ph/Serverless-data-lake-on-AWS-06-25
AWS Dev Day took place recently. One of the speakers shared his experience of building a serverless Data Lake.
The link leads to the most important points from the talk and from the hallway discussion afterwards. The post above contains the slides, which include the speaker's diagram of AWS services.
Telegraph
Serverless data lake on AWS
Micro-batch processing: data is processed with AWS Lambda, split into partitions — neither record by record nor a whole delivery at once, but at partition granularity. Partitioning is usually tied to the data delivery date. Task orchestration is implemented…
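A hedged sketch of what such a partition-granular Lambda handler could look like (the event shape, bucket layout, and prefix scheme are my assumptions, not the speaker's actual design):
```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation per partition; the partition key is the delivery date.
    bucket = event["bucket"]
    prefix = f"raw/dt={event['delivery_date']}/"

    # Process the whole partition as a single micro-batch:
    # not record by record, not the entire delivery at once.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # ... transform `body` and write it to the processed zone
```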
#cheatsheet Database classes and their traditional use cases
An unexpectedly useful cheat sheet from an AWS user survey. English and Russian versions.
My story about building an analytics infrastructure for Tproger has been published 🎉
Tproger now has its own analytics tool alongside the traditional Yandex.Metrica and Google Analytics. It is built on ClickHouse and Yandex.Cloud services.
In the article I share my experience of building and deploying the event tracker, explain which problems anyone who wants to repeat this journey will have to solve, and why you would develop your own solution: https://tproger.ru/articles/tproger-tracker-yandex-cloud/
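To make the idea concrete, here is a minimal sketch of a tracker's write path — not the implementation from the article — sending an event to ClickHouse over its HTTP interface; the schema, endpoint, and event shape are made up for illustration:
```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Hypothetical schema:
#   CREATE TABLE events (ts DateTime, name String, payload String)
#   ENGINE = MergeTree ORDER BY ts;
QUERY = urllib.parse.quote("INSERT INTO events FORMAT JSONEachRow")
URL = f"http://localhost:8123/?query={QUERY}"  # ClickHouse HTTP interface

def track(name, payload):
    row = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        "name": name,
        "payload": json.dumps(payload),
    }
    req = urllib.request.Request(URL, data=json.dumps(row).encode(), method="POST")
    urllib.request.urlopen(req)  # ClickHouse replies 200 on a successful insert

track("page_view", {"path": "/articles/tproger-tracker-yandex-cloud/"})
```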
Recommended reading for everyone who has tried NoSQL: https://medium.com/@nabtechblog/advanced-design-patterns-for-amazon-dynamodb-354f97c96c2 — this article literally expanded the boundaries of my mind!
It explains how to design tables in a NoSQL database, using AWS DynamoDB as the example (and with a strong dependence on it).
The mind-expanding part is the main technique discussed: storing completely different data related to one object in a single table in order to speed up access to it. In a relational database, a table holds data about "one facet" of each entity, which is logical, familiar, and justified. The idea of keeping essentially different data in one table sounds provocative, but the article makes a solid case for the approach.
At the end, the article shows how six relational tables were squeezed into a single NoSQL table, with access to the different "slices" provided by a Global Secondary Index. And that sounds both well-founded and fashionable 😉
The fundamentals of NoSQL, and DynamoDB in particular, are covered in the first part of the article: https://medium.com/@nabtechblog/advanced-design-patterns-for-amazon-dynamodb-c31d65d2e3de
Medium
Advanced Design Patterns for Amazon DynamoDB
Part two
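As a hedged illustration of the single-table idea (the key names, table, and index are made up, not the article's exact schema), here is how the access pattern looks with boto3:
```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table design: PK = "CUSTOMER#<id>", and SK encodes the
# item kind ("PROFILE", "ORDER#<id>", "ADDRESS#<id>", ...) -- all in one table.
table = boto3.resource("dynamodb").Table("app_table")

# One query fetches every facet of the customer: profile, orders, addresses...
customer_items = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#42")
)["Items"]

# A Global Secondary Index ("GSI1" here) exposes another "slice" of the same
# data, e.g. all open orders across customers, without a second table.
open_orders = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("ORDER_STATUS#OPEN"),
)["Items"]
```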
I have put together notes on a Spark course. They will help you refresh the essentials or save you the time of watching it. The course itself I do recommend.
#briefly #spark Spark Starter Kit
https://telegra.ph/Udemy-Spark-Starter-Kit-part-1-06-19
What's inside:
— Hadoop and Spark comparison: storage, MapReduce, speed, resource management.
— Challenges Spark tries to address.
— How Spark achieves high efficiency.
— How Spark achieves fault-tolerance.
— What is RDD.
Link to the course: Spark Starter Kit
Telegraph
Udemy: Spark Starter Kit, part 1
Spark vs Hadoop: who wins? Link to lecture. Hadoop = HDFS + MapReduce. Spark is not a replacement for Hadoop. In particular, Spark does not come with its own storage: it leverages an existing one, like HDFS, S3, etc. Distributed filesystems are preferred to accelerate…
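To make the RDD points concrete, a minimal PySpark sketch (my own example, not from the course; assumes a local Spark installation):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD is an immutable, partitioned collection; transformations are lazy.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
squares = rdd.map(lambda x: x * x)          # nothing executes yet
total = squares.reduce(lambda a, b: a + b)  # the action triggers the job

# Fault tolerance comes from lineage: Spark remembers parallelize -> map,
# so a lost partition is recomputed from its source, not restored from a replica.
print(total)
spark.stop()
```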
I have put together notes on an article by Shopify about an Airflow setup with 10 thousand DAGs and 150 thousand runs per day. It will save you reading time and help you refresh the details later.
#briefly #airflow Airflow: scaling out recommendations by Shopify
https://telegra.ph/Airflow-scaling-out-recommendations-by-Shopify-06-03
What's inside:
— Cloud Storage vs Network File System.
— Metadata retention policy.
— Manifest file.
— Consistent distribution of load.
— Concurrency management.
— Using different execution environments.
Origin: Lessons Learned From Running Apache Airflow at Scale
Telegraph
Airflow: scaling out recommendations by Shopify
Shopify runs over 10k DAGs. 150k runs per day. Over 400 tasks at a given moment on average. This is a brief overview of their approach. Link to source article. Fast file access Problem: reading DAGs files from Google Cloud Storage (through GCSFuse as a filesystem…
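As a hedged sketch of the concurrency levers the notes mention (the DAG id, pool name, and limits are illustrative, not Shopify's actual configuration; Airflow 2.x API):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_recsys_dag",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    max_active_runs=1,   # cap concurrent runs of this DAG
    max_active_tasks=4,  # cap concurrent tasks within the DAG
    catchup=False,
) as dag:
    # Pools (created via the UI/CLI) throttle load on a shared resource
    # across all DAGs -- one lever for distributing load consistently.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting",
        pool="shared_db_pool",  # hypothetical pool name
    )
```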
#article #ethereum Exporting the full history of Ethereum into S3
https://medium.com/@tony.bryzgaloff/how-to-dump-full-ethereum-history-to-s3-296fb3ad175 (author: @bryzgaloff)
What's inside:
— BigQuery public datasets with Ethereum data: how to transfer to S3 quickly.
— Alternative approach: exporting data from a public Ethereum node. No need to run your own node!
— Processing uint256 with AWS Athena.
— Processing realtime updates from Ethereum.
— Best Data Engineering practices to process Ethereum data.
A short summary inside 👇
Medium
How to dump a full history of Ethereum blockchain to S3
An efficient way to export blockchain data to a cloud storage, by Anton Bryzgalov
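A loose sketch of the BigQuery-to-S3 route (the public dataset name is real; the bucket, key layout, and column selection are illustrative):
```python
import json

import boto3
from google.cloud import bigquery

bq = bigquery.Client()  # requires GCP credentials
s3 = boto3.client("s3")

# Export one day of blocks; in practice you would iterate over date ranges.
rows = bq.query("""
    SELECT number, miner, gas_used
    FROM `bigquery-public-data.crypto_ethereum.blocks`
    WHERE DATE(timestamp) = '2022-01-01'
""").result()

body = "\n".join(
    json.dumps({"number": r.number, "miner": r.miner, "gas_used": r.gas_used})
    for r in rows
)
s3.put_object(
    Bucket="my-ethereum-lake",              # hypothetical bucket
    Key="blocks/dt=2022-01-01/blocks.json",
    Body=body.encode(),
)
```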
#briefly #ethereum How to export a full Ethereum history into S3, efficiently
https://blockchain.works-hub.com/learn/how-to-export-a-full-ethereum-history-into-s3-efficiently-f37df
A brief summary of my original article about building a Data Platform for Ethereum:
— Which node to use: a free public one, a node provider, or run your own?
— Start querying right away: public BigQuery datasets with Ethereum data.
— How large is the dataset and how to process it cost-efficiently?
— Implementing a real-time Ethereum data ingestion.
Give it a chance if the original article is too long for you but you are interested in the best practices for Ethereum data engineering 😉
Blockchain Works
How to export a full Ethereum history into S3, efficiently
Blockchain technologies are not a geeky thing anymore and many companies run their business around crypto assets. A popular scenario is to build a Data Platform operating blockchain data and launch analytical and realtime services on top of it. This is what…
#youtube #briefly #ethereum Ethereum Data Analysis and Ingestion in AWS
🎥 YouTube talk + transcription with slides
What's inside:
— Building a realtime API for calculating token balances.
— Public vs own Ethereum nodes comparison.
— Support for other EVM and non-EVM blockchains.
Other formats:
— 🎞 Pictures and text: a transcription of the talk with slides, Medium.
— ⚡️ A super-quick summary (a 2-minute read).
— 📰 The original article, Medium. Covers all this in detail.
YouTube
Ethereum Data Analysis and Ingestion in AWS | PyChain 2022
This is a video recording of the PyChain 2022 conference sessions.
Speaker: Anton Bryzgalov
Exporting the full history of Ethereum into S3
How to Export a Full History of Ethereum Blockchain to S3
What’s inside:
- BigQuery public datasets with Ethereum…
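For intuition, the core of such a balance calculation fits in a few lines — made-up event data, not the talk's implementation: an ERC-20 balance is the sum of incoming minus outgoing Transfer values per address.
```python
from collections import defaultdict

# Decoded Transfer events: (token, from_addr, to_addr, value); values made up.
transfers = [
    ("0xTOKEN", "0xalice", "0xbob", 100),
    ("0xTOKEN", "0xbob", "0xcarol", 40),
]

balances = defaultdict(int)
for token, src, dst, value in transfers:
    balances[(token, src)] -= value  # outgoing
    balances[(token, dst)] += value  # incoming

print(balances[("0xTOKEN", "0xbob")])  # 60
```
A realtime API would keep this aggregate updated from the stream of new blocks instead of recomputing it from scratch.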
Wow, did you know about this awesome hardware benchmark page for ClickHouse? 🤩 See it: https://benchmark.clickhouse.com/hardware/
Results are contributed by ClickHouse users with various setups: from local laptops and bare-metal machines to cloud filesystems like AWS EFS.
In particular, I was interested in the AWS EFS/EBS comparison: both are quite bad compared to bare metal (which is no surprise 🤓), but EBS has a huge advantage on cold runs 👍🏻
On hot runs, EFS and EBS performance is comparable: both are about 6 times slower than bare metal.
Thus, both options are good for a quick MVP: EC2+EBS is the simpler setup, while EFS can be attached to a disposable serverless ClickHouse container run as an ECS task.
#article #blockchain How to work with uint256 blockchain data type using SQL and other Data Analysis tools
https://betterprogramming.pub/how-to-work-with-uint256-blockchain-data-type-using-sql-and-other-data-analysis-tools-a6bb52b1fb97 (author: @bryzgaloff)
What's inside:
— Analyzing numeric blockchain data using SQL: how to work with huge uint256 numbers which do not fit traditional 64-bit data types.
— Choosing between: native uint256 support (ClickHouse), conversion to double, and long arithmetic in pure SQL 🤓
— Implementations with detailed explanations and illustrations.
— Big Data trade-off: precision, ease of use, or both.
Implementation tips and shorthand summaries for each approach inside 👇
Medium
How to work with uint256 blockchain data type using SQL and other Data Analysis tools
Efficiently aggregate uint256 values using popular tech stacks
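A tiny runnable illustration of the precision trade-off: Python ints are arbitrary-precision (like ClickHouse's native UInt256), while a double silently drops the low bits of anything past 2^53.
```python
value = 2**200 + 1          # a uint256-scale number
as_double = float(value)    # rounds to 2**200: the +1 is gone

print(value % 10)           # 7 -- exact arbitrary-precision arithmetic
print(int(as_double) % 10)  # 6 -- the low digits were lost in the double
```
In ClickHouse the native route is simply casting, e.g. `SELECT sum(toUInt256(value_str))`; the double and long-arithmetic routes are for engines without such a type.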
#article #coding Working with Code: 5 Ways AI Can Help
https://medium.com/@bryzgaloff/working-with-code-5-ways-ai-can-help-by-anton-bryzgalov-bf92395dfafd (author: @bryzgaloff)
What's inside:
— Code Simplification: Discover how AI can demystify complex code, making it accessible even for junior developers.
— Automated Documentation and Testing: Learn how AI streamlines code documentation and testing, enhancing codebase understanding and reliability.
— Code Generation: Explore the power of AI in generating code and accelerating the development process.
A 2-minute read 👇
Medium
Working with Code: 5 Ways AI Can Help
A short essay answering a question “How can AI help junior developers with new code?”
#article #clickhouse How to Implement Lambda Architecture Using ClickHouse
https://medium.com/@bryzgaloff/how-to-implement-lambda-architecture-using-clickhouse-9109e78c718b (author: @bryzgaloff)
What's inside:
— A brief overview of the Lambda approach.
— A practical, tried-and-true ClickHouse-only solution using materialized views and partition management.
— A step-by-step guide with SQL examples.
— Insights on potential improvements like more frequent batch updates, managing late data arrivals, and handling JOINs in materialized views.
Find a short summary of advantages in the article's conclusion 👇
Medium
How to Implement Lambda Architecture Using ClickHouse
How to combine batch and real-time data in ClickHouse using tables partitions atomic replacement and materialized views.
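A condensed sketch of the partition-swap trick the article describes (table names and schema are illustrative; statements go through ClickHouse's HTTP interface):
```python
import urllib.request

def ch(query):
    req = urllib.request.Request("http://localhost:8123/", data=query.encode(), method="POST")
    urllib.request.urlopen(req)

day = "2024-05-01"  # the day being rebuilt by the batch layer

# 1. Recompute the day's aggregates into a staging table with the same schema
#    (assumed truncated beforehand).
ch(f"""
    INSERT INTO events_daily_staging
    SELECT toDate(ts) AS day, name, count() AS cnt
    FROM events
    WHERE toDate(ts) = '{day}'
    GROUP BY day, name
""")

# 2. Atomically swap the partition into the serving table: readers see either
#    the old or the new version of the day, never a mix.
ch(f"ALTER TABLE events_daily REPLACE PARTITION '{day}' FROM events_daily_staging")
```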
🔥3👏3👍2
#article #clickhouse The Unbundling of the Cloud Data Warehouse
https://clickhouse.com/blog/the-unbundling-of-the-cloud-data-warehouse (author: @tbragin)
What's inside:
— Where ClickHouse's development is headed: real-time cloud data warehouse.
— Real-time data warehouse requirements:
1) continuous real-time data loading (e.g. from Apache Kafka),
2) continuously-updating materialized views,
3) quick filtering and aggregation,
4) BI integration,
5) archiving to an object store (e.g. AWS S3),
6) ad hoc queries to an object store.
— History of data solutions as bundling-unbundling cycles:
1) mainframes (bundled)
2) → relational databases (unbundled)
3) → traditional data warehouses (bundled)
4) → early cloud providers (unbundled)
5) → cloud data warehouses (bundled)
------ we are here!
6) → real-time cloud data warehouses (unbundled).
— What makes ClickHouse suitable for Gen AI applications.