It’s not reasonable to think in terms of ‘death or glory’ for either EDWs or Hadoop. Use the best tool for the job.
Source: https://www.dezyre.com/article/is-hadoop-going-to-replace-data-warehouse/256
#briefly Traditional ETL vs ELT on Hadoop
https://telegra.ph/ETL-vs-Hadoop-ELT-12-13
What's inside:
— Description of typical ETL/ELT processes.
— Advantages and disadvantages of both.
Telegraph
Traditional ETL vs ELT on Hadoop
Source: Bitwise Blog The data warehouse area has been dominated by RDBMSes and traditional ETL tools. But traditional ETL tools are limited by problems related to scalability and cost overruns. ETL: extract-transform-load The ETL process typically extracts…
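The difference boils down to where the transform step runs — on a dedicated ETL server before loading, or inside the target system after loading. A toy sketch of the two flows (all names are illustrative):
```python
# Toy illustration of ETL vs ELT; the "warehouse" and "lake" are plain lists here.

def extract():
    # Pretend these rows came from a source system.
    return [{"user_id": 1, "amount": "10.50"}, {"user_id": 2, "amount": "3.00"}]

def transform(rows):
    # Cleaning/typing step: parse string amounts into numbers.
    return [{**r, "amount": float(r["amount"])} for r in rows]

# ETL: transform on the way in; only clean data reaches the warehouse.
warehouse = transform(extract())

# ELT: load the raw data first, transform later inside the target --
# on Hadoop this step would be a Hive/Spark job over the raw files.
data_lake = extract()
curated = transform(data_lake)

print(warehouse == curated)  # True: same result, different place and time
```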
#briefly Aiohttp from its author
https://telegra.ph/Aiohttp-ot-avtora-PyCon-Russia-2018-04-13
Before data can land in a DWH, it has to be requested from the sources. I use aiohttp heavily for this.
The link leads to my notes on a talk by Andrew Svetlov, the developer of aiohttp, in which he shares advice on using the library correctly.
Telegraph
"Aiohttp from its author" (PyCon Russia 2018)
7:16 asyncio.Future is needed only by library authors; "ordinary" users are not expected to touch it. 8:45 aiohttp is not meant to speed up synchronous web frameworks (Django, Flask), since mixing synchronous and asynchronous code…
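A minimal sketch of this extraction pattern (the endpoints are hypothetical; reusing a single ClientSession and setting an explicit timeout follow the library's documented guidance):
```python
import asyncio
import aiohttp

# Hypothetical source endpoints; replace with your real extraction URLs.
URLS = ["https://example.com/api/orders", "https://example.com/api/users"]

async def fetch_json(session: aiohttp.ClientSession, url: str):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main():
    # One session is reused for all requests instead of one per request.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_json(session, u) for u in URLS))

results = asyncio.run(main())
```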
#briefly #aws Serverless data lake on AWS
https://telegra.ph/Serverless-data-lake-on-AWS-06-25
AWS Dev Day took place recently. One of the speakers shared his experience of building a serverless Data Lake.
The link leads to the most important points from the talk and from the hallway discussion afterwards. The post above contains the slides, which include the speaker's diagram of AWS services.
Telegraph
Serverless data lake on AWS
Micro-batch processing: data is processed with AWS Lambda, split into partitions — neither record by record nor a whole delivery at once, but at partition granularity. Partitioning is usually tied to the data delivery date. Task orchestration is implemented…
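A hedged sketch of what such a partition-granular Lambda handler could look like (the event shape, bucket layout, and prefix scheme are my assumptions, not the speaker's actual design):
```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation per partition; the partition key is the delivery date.
    bucket = event["bucket"]
    prefix = f"raw/dt={event['delivery_date']}/"

    # Process the whole partition as a single micro-batch:
    # not record by record, not the entire delivery at once.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # ... transform `body` and write it to the processed zone
```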
#cheatsheet Database classes and their traditional use cases
An unexpectedly useful cheat sheet from an AWS user survey. English and Russian versions.
My story about building an analytics infrastructure for Tproger has been published 🎉
Tproger now has its own analytics tool alongside the traditional Yandex.Metrica and Google Analytics. It is built on ClickHouse and Yandex.Cloud services.
In the article I share my experience of building and deploying the event tracker, explain which problems anyone who wants to repeat this journey will have to solve, and why you would develop your own solution: https://tproger.ru/articles/tproger-tracker-yandex-cloud/
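To make the idea concrete, here is a minimal sketch of a tracker's write path — not the implementation from the article — sending an event to ClickHouse over its HTTP interface; the schema, endpoint, and event shape are made up for illustration:
```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Hypothetical schema:
#   CREATE TABLE events (ts DateTime, name String, payload String)
#   ENGINE = MergeTree ORDER BY ts;
QUERY = urllib.parse.quote("INSERT INTO events FORMAT JSONEachRow")
URL = f"http://localhost:8123/?query={QUERY}"  # ClickHouse HTTP interface

def track(name, payload):
    row = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        "name": name,
        "payload": json.dumps(payload),
    }
    req = urllib.request.Request(URL, data=json.dumps(row).encode(), method="POST")
    urllib.request.urlopen(req)  # ClickHouse replies 200 on a successful insert

track("page_view", {"path": "/articles/tproger-tracker-yandex-cloud/"})
```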
Recommended reading for everyone who has tried NoSQL: https://medium.com/@nabtechblog/advanced-design-patterns-for-amazon-dynamodb-354f97c96c2 — this article literally expanded the boundaries of my mind!
It explains how to design tables in a NoSQL database, using AWS DynamoDB as the example (and with a strong dependence on it).
The mind-expanding part is the main technique discussed: storing completely different data related to one object in a single table in order to speed up access to it. In a relational database, a table holds data about "one facet" of each entity, which is logical, familiar, and justified. The idea of keeping essentially different data in one table sounds provocative, but the article makes a solid case for the approach.
At the end, the article shows how six relational tables were squeezed into a single NoSQL table, with access to the different "slices" provided by a Global Secondary Index. And that sounds both well-founded and fashionable 😉
The fundamentals of NoSQL, and DynamoDB in particular, are covered in the first part of the article: https://medium.com/@nabtechblog/advanced-design-patterns-for-amazon-dynamodb-c31d65d2e3de
Medium
Advanced Design Patterns for Amazon DynamoDB
Part two
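As a hedged illustration of the single-table idea (the key names, table, and index are made up, not the article's exact schema), here is how the access pattern looks with boto3:
```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table design: PK = "CUSTOMER#<id>", and SK encodes the
# item kind ("PROFILE", "ORDER#<id>", "ADDRESS#<id>", ...) -- all in one table.
table = boto3.resource("dynamodb").Table("app_table")

# One query fetches every facet of the customer: profile, orders, addresses...
customer_items = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#42")
)["Items"]

# A Global Secondary Index ("GSI1" here) exposes another "slice" of the same
# data, e.g. all open orders across customers, without a second table.
open_orders = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("ORDER_STATUS#OPEN"),
)["Items"]
```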
I have put together notes on a Spark course. They will help you refresh the essentials or save you the time of watching it. The course itself I do recommend.
#briefly #spark Spark Starter Kit
https://telegra.ph/Udemy-Spark-Starter-Kit-part-1-06-19
What's inside:
— Hadoop and Spark comparison: storage, MapReduce, speed, resource management.
— Challenges Spark tries to address.
— How Spark achieves high efficiency.
— How Spark achieves fault-tolerance.
— What is RDD.
Link to the course: Spark Starter Kit
Telegraph
Udemy: Spark Starter Kit, part 1
Spark vs Hadoop: who wins? Link to lecture. Hadoop = HDFS + MapReduce. Spark is not a replacement for Hadoop. In particular, Spark does not come with its own storage: it leverages an existing one, like HDFS, S3, etc. Distributed filesystems are preferred to accelerate…
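To make the RDD points concrete, a minimal PySpark sketch (my own example, not from the course; assumes a local Spark installation):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD is an immutable, partitioned collection; transformations are lazy.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
squares = rdd.map(lambda x: x * x)          # nothing executes yet
total = squares.reduce(lambda a, b: a + b)  # the action triggers the job

# Fault tolerance comes from lineage: Spark remembers parallelize -> map,
# so a lost partition is recomputed from its source, not restored from a replica.
print(total)
spark.stop()
```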
I have put together notes on an article by Shopify about an Airflow setup with 10 thousand DAGs and 150 thousand runs per day. It will save you reading time and help you refresh the details later.
#briefly #airflow Airflow: scaling out recommendations by Shopify
https://telegra.ph/Airflow-scaling-out-recommendations-by-Shopify-06-03
What's inside:
— Cloud Storage vs Network File System.
— Metadata retention policy.
— Manifest file.
— Consistent distribution of load.
— Concurrency management.
— Using different execution environments.
Origin: Lessons Learned From Running Apache Airflow at Scale
Telegraph
Airflow: scaling out recommendations by Shopify
Shopify runs over 10k DAGs. 150k runs per day. Over 400 tasks at a given moment on average. This is a brief overview of their approach. Link to source article. Fast file access Problem: reading DAGs files from Google Cloud Storage (through GCSFuse as a filesystem…
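As a hedged sketch of the concurrency levers the notes mention (the DAG id, pool name, and limits are illustrative, not Shopify's actual configuration; Airflow 2.x API):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_recsys_dag",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    max_active_runs=1,   # cap concurrent runs of this DAG
    max_active_tasks=4,  # cap concurrent tasks within the DAG
    catchup=False,
) as dag:
    # Pools (created via the UI/CLI) throttle load on a shared resource
    # across all DAGs -- one lever for distributing load consistently.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting",
        pool="shared_db_pool",  # hypothetical pool name
    )
```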
#article #ethereum Exporting the full history of Ethereum into S3
https://medium.com/@tony.bryzgaloff/how-to-dump-full-ethereum-history-to-s3-296fb3ad175 (author: @bryzgaloff)
What's inside:
— BigQuery public datasets with Ethereum data: how to transfer to S3 quickly.
— Alternative approach: exporting data from a public Ethereum node. No need to run your own node!
— Processing uint256 with AWS Athena.
— Processing realtime updates from Ethereum.
— Best Data Engineering practices to process Ethereum data.
A short summary inside 👇
Medium
How to dump a full history of Ethereum blockchain to S3
An efficient way to export blockchain data to a cloud storage, by Anton Bryzgalov
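A loose sketch of the BigQuery-to-S3 route (the public dataset name is real; the bucket, key layout, and column selection are illustrative):
```python
import json

import boto3
from google.cloud import bigquery

bq = bigquery.Client()  # requires GCP credentials
s3 = boto3.client("s3")

# Export one day of blocks; in practice you would iterate over date ranges.
rows = bq.query("""
    SELECT number, miner, gas_used
    FROM `bigquery-public-data.crypto_ethereum.blocks`
    WHERE DATE(timestamp) = '2022-01-01'
""").result()

body = "\n".join(
    json.dumps({"number": r.number, "miner": r.miner, "gas_used": r.gas_used})
    for r in rows
)
s3.put_object(
    Bucket="my-ethereum-lake",              # hypothetical bucket
    Key="blocks/dt=2022-01-01/blocks.json",
    Body=body.encode(),
)
```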
#briefly #ethereum How to export a full Ethereum history into S3, efficiently
https://blockchain.works-hub.com/learn/how-to-export-a-full-ethereum-history-into-s3-efficiently-f37df
A brief summary of my original article about building a Data Platform for Ethereum:
— Which node to use: a free public one, a node provider, or run your own?
— Start querying right away: public BigQuery datasets with Ethereum data.
— How large is the dataset and how to process it cost-efficiently?
— Implementing a real-time Ethereum data ingestion.
Give it a chance if the original article is too long for you but you are interested in the best practices for Ethereum data engineering 😉
Blockchain Works
How to export a full Ethereum history into S3, efficiently
Blockchain technologies are not a geeky thing anymore and many companies run their business around crypto assets. A popular scenario is to build a Data Platform operating blockchain data and launch analytical and realtime services on top of it. This is what…
#youtube #briefly #ethereum Ethereum Data Analysis and Ingestion in AWS
🎥 YouTube talk + transcription with slides
What's inside:
— Building a realtime API for calculating token balances.
— Public vs own Ethereum nodes comparison.
— Support for other EVM and non-EVM blockchains.
Other formats:
— 🎞 Pictures and text: a transcription of the talk with slides, Medium.
— ⚡️ A super-quick summary (a 2-minute read).
— 📰 The original article, Medium. Covers all this in detail.
YouTube
Ethereum Data Analysis and Ingestion in AWS | PyChain 2022
This is a video recording of the PyChain 2022 conference sessions.
Speaker: Anton Bryzgalov
Exporting the full history of Ethereum into S3
How to Export a Full History of Ethereum Blockchain to S3
What’s inside:
- BigQuery public datasets with Ethereum…
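For intuition, the core of such a balance calculation fits in a few lines — made-up event data, not the talk's implementation: an ERC-20 balance is the sum of incoming minus outgoing Transfer values per address.
```python
from collections import defaultdict

# Decoded Transfer events: (token, from_addr, to_addr, value); values made up.
transfers = [
    ("0xTOKEN", "0xalice", "0xbob", 100),
    ("0xTOKEN", "0xbob", "0xcarol", 40),
]

balances = defaultdict(int)
for token, src, dst, value in transfers:
    balances[(token, src)] -= value  # outgoing
    balances[(token, dst)] += value  # incoming

print(balances[("0xTOKEN", "0xbob")])  # 60
```
A realtime API would keep this aggregate updated from the stream of new blocks instead of recomputing it from scratch.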
Wow, did you know about this awesome hardware benchmark page for ClickHouse? 🤩 See it: https://benchmark.clickhouse.com/hardware/
Results are contributed by ClickHouse users with various setups: from local laptops and bare-metal machines to cloud filesystems like AWS EFS.
In particular, I was interested in the AWS EFS/EBS comparison: both are quite bad compared to bare metal (which is no surprise 🤓), but EBS has a huge advantage on cold runs 👍🏻
On hot runs, EFS and EBS performance is comparable: both are about 6 times slower than bare metal.
Thus, both options are good for a quick MVP: EC2+EBS is the simpler setup, while EFS can be attached to a disposable serverless ClickHouse container run as an ECS task.
#article #blockchain How to work with uint256 blockchain data type using SQL and other Data Analysis tools
https://betterprogramming.pub/how-to-work-with-uint256-blockchain-data-type-using-sql-and-other-data-analysis-tools-a6bb52b1fb97 (author: @bryzgaloff)
What's inside:
— Analyzing numeric blockchain data using SQL: how to work with huge uint256 numbers which do not fit traditional 64-bit data types.
— Choosing between: native uint256 support (ClickHouse), conversion to double, and long arithmetic in pure SQL 🤓
— Implementations with detailed explanations and illustrations.
— Big Data trade-off: precision, ease of use, or both.
Implementation tips and shorthand summaries for each approach inside 👇
Medium
How to work with uint256 blockchain data type using SQL and other Data Analysis tools
Efficiently aggregate uint256 values using popular tech stacks
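A tiny runnable illustration of the precision trade-off: Python ints are arbitrary-precision (like ClickHouse's native UInt256), while a double silently drops the low bits of anything past 2^53.
```python
value = 2**200 + 1          # a uint256-scale number
as_double = float(value)    # rounds to 2**200: the +1 is gone

print(value % 10)           # 7 -- exact arbitrary-precision arithmetic
print(int(as_double) % 10)  # 6 -- the low digits were lost in the double
```
In ClickHouse the native route is simply casting, e.g. `SELECT sum(toUInt256(value_str))`; the double and long-arithmetic routes are for engines without such a type.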
#article #coding Working with Code: 5 Ways AI Can Help
https://medium.com/@bryzgaloff/working-with-code-5-ways-ai-can-help-by-anton-bryzgalov-bf92395dfafd (author: @bryzgaloff)
What's inside:
— Code Simplification: Discover how AI can demystify complex code, making it accessible even for junior developers.
— Automated Documentation and Testing: Learn how AI streamlines code documentation and testing, enhancing codebase understanding and reliability.
— Code Generation: Explore the power of AI in generating code and accelerating the development process.
A 2-minute read 👇
Medium
Working with Code: 5 Ways AI Can Help
A short essay answering a question “How can AI help junior developers with new code?”
#article #clickhouse How to Implement Lambda Architecture Using ClickHouse
https://medium.com/@bryzgaloff/how-to-implement-lambda-architecture-using-clickhouse-9109e78c718b (author: @bryzgaloff)
What's inside:
— A brief overview of the Lambda approach.
— A practical, tried-and-true ClickHouse-only solution using materialized views and partition management.
— A step-by-step guide with SQL examples.
— Insights on potential improvements like more frequent batch updates, managing late data arrivals, and handling JOINs in materialized views.
Find a short summary of advantages in the article's conclusion 👇
Medium
How to Implement Lambda Architecture Using ClickHouse
How to combine batch and real-time data in ClickHouse using tables partitions atomic replacement and materialized views.
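A condensed sketch of the partition-swap trick the article describes (table names and schema are illustrative; statements go through ClickHouse's HTTP interface):
```python
import urllib.request

def ch(query):
    req = urllib.request.Request("http://localhost:8123/", data=query.encode(), method="POST")
    urllib.request.urlopen(req)

day = "2024-05-01"  # the day being rebuilt by the batch layer

# 1. Recompute the day's aggregates into a staging table with the same schema
#    (assumed truncated beforehand).
ch(f"""
    INSERT INTO events_daily_staging
    SELECT toDate(ts) AS day, name, count() AS cnt
    FROM events
    WHERE toDate(ts) = '{day}'
    GROUP BY day, name
""")

# 2. Atomically swap the partition into the serving table: readers see either
#    the old or the new version of the day, never a mix.
ch(f"ALTER TABLE events_daily REPLACE PARTITION '{day}' FROM events_daily_staging")
```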
🔥3👏3👍2
#article #clickhouse The Unbundling of the Cloud Data Warehouse
https://clickhouse.com/blog/the-unbundling-of-the-cloud-data-warehouse (author: @tbragin)
What's inside:
— Where ClickHouse's development is headed: real-time cloud data warehouse.
— Real-time data warehouse requirements:
1) continuous real-time data loading (e.g. from Apache Kafka),
2) continuously-updating materialized views,
3) quick filtering and aggregation,
4) BI integration,
5) archiving to an object store (e.g. AWS S3),
6) ad hoc queries to an object store.
— History of data solutions as bundling-unbundling cycles:
1) mainframes (bundled)
2) → relational databases (unbundled)
3) → traditional data warehouses (bundled)
4) → early cloud providers (unbundled)
5) → cloud data warehouses (bundled)
------ we are here!
6) → real-time cloud data warehouses (unbundled).
— What makes ClickHouse suitable for Gen AI applications.