Data Apps Design
In this blog I publish my takeaways and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
Airbyte Clickhouse destination

Airbyte has released a Clickhouse destination, which I already use to gather data from multiple sources.

By default it replicates all the data as JSON blobs (all the attributes inside one String field)

To get it flattened you either have to do it yourself or use Airbyte normalization.

1. Flattening manually with JSON functions

JSONExtract(_airbyte_data, 'id', 'UInt64') as id

Could be tricky and exhausting if you have a lot of attributes.

Works extremely fast
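
For example, a minimal sketch of such a flattening query, assuming the raw stream table is called _airbyte_raw_users and keeps the JSON blob in the _airbyte_data String column (the table and attribute names here are illustrative):

-- flatten the Airbyte JSON blob into typed columns
SELECT
    JSONExtract(_airbyte_data, 'id', 'UInt64') AS id,
    JSONExtractString(_airbyte_data, 'email') AS email,
    JSONExtract(_airbyte_data, 'is_active', 'UInt8') AS is_active,
    parseDateTimeBestEffort(JSONExtractString(_airbyte_data, 'created_at')) AS created_at
FROM _airbyte_raw_users

Wrapping it into a view (or a materialized view) keeps downstream queries clean.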

2. Airbyte normalization (= dbt underneath 😉)

It will flatten all your data automatically

Technically it is an auto-generated dbt project

Still a little bit buggy and looks like a work in progress.

I almost managed to get it working, but I use Yandex Managed Clickhouse, which requires an SSL / secure connection.

Unfortunately, Airbyte's dbt profiles.yml is hard-coded to secure: False at the moment.

I might create a PR to fix this when I have some time.

#airbyte #clickhouse #dbt
I will give normalization another try some other day 😄

Leave a comment / reaction if you are interested
Datalens from Yandex is quite a powerful BI tool.

Especially when you use it on top of Clickhouse, which makes analytics interactive with sub-second latency.

Among the outstanding features I've already tried:

— Advanced functions to build almost anything one can imagine: timeseries, arrays, geo, window functions
— Nice, customizable charts that integrate with dashboards
— Sharing with your team / anyone on the internet

— Useful docs with examples and how-tos
— Really friendly community here in Telegram (important!)
— It is free of charge!

The more I use it, the more I love it.

Take a look at how I managed to build a Year-over-Year analysis with the LAG function and draw different kinds of viz!

LAG([Выручка (₽)], 52 ORDER BY [Неделя] BEFORE FILTER BY [Неделя])
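
For reference, roughly the same 52-week shift can be expressed directly in Clickhouse with a window function ([Выручка (₽)] is revenue, [Неделя] is week). A sketch, assuming a weekly_revenue table with week and revenue columns; the names are illustrative:

SELECT
    week,
    revenue,
    -- revenue 52 weeks back, i.e. the same week of the previous year
    lagInFrame(revenue, 52) OVER (ORDER BY week ROWS BETWEEN 52 PRECEDING AND CURRENT ROW) AS revenue_prev_year
FROM weekly_revenue
ORDER BY week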

#datalens #clickhouse
Hi! A new post on Habr ⬇️⬇️⬇️
Give it some upvotes if you like the material, and I'm already preparing part two.
[RU] Harmful advice for building Analytics (Data Lake / DWH / BI) – what to avoid

Over the last few months I have been doing a lot of code base refactoring and optimizing processes and calculations in the Data Analytics field.

I wanted to use the “harmful advice” format to draw attention to a set of practices and approaches that can lead to rather unpleasant consequences and sometimes end up costing your company dearly.

The post covers:

– Using select * – everything at once
– Using an excessive number of CTEs (common table expressions)
– NOT DRY (Don't repeat yourself) – repetitive, kaleidoscopic calculations

#best_practices #dwh

Read on Habr →
What is the easiest way to write a custom data integration?

1. Fetch source data via API calls – the E of ELT
2. Store raw data in S3 / a Data Lake – the L of ELT
3. Transform data as you wish via dbt – the T of ELT (see the SQL sketch at the end of this post)

While the focus is mainly on Transformations, which are the most complex and interesting part and are all about delivering business value, it is still essential to perform Extract-Load in a clear and understandable way.

A shell script is the easiest way to perform EL, in my opinion (where possible 😉).

Take a look at the example of fetching exchange rates →

1. Useful shell options – debugging and safe exit.

set -x prints each command with its variables expanded, prefixed by a little + sign, before running it.

set -e instructs bash to exit immediately if any command returns a non-zero exit status.

2. Variables

Either assign them directly in the bash script or provide them as environment variables (preferred).

TS=$(date +"%Y-%m-%d-%H-%M-%S-%Z")

3. Chain or pipe commands

The output of one command can be the input of another.

The JSON response fetched from the API call is streamed to the AWS S3 bucket directly, without any intermediate storage:

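# '-' tells aws s3 cp to read the object from stdin, so the response is streamed to S3 without touching the local disk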
curl -H "Authorization: Token $OXR_TOKEN" \
"https://openexchangerates.org/api/historical/$BUSINESS_DT.json?base=$BASE_CURRENCY&symbols=$SYMBOLS" \
| aws s3 cp - s3://$BUCKET/$BUCKET_PATH/$BUSINESS_DT-$BASE_CURRENCY-$TS.json

4. Echo log messages

5. Schedule and monitor with Airflow.

Use templates, variables, loops, dynamic DAGs.

Do it the right way once and just monitor for errors. As simple as that.

6. Additional pros:

- Shell (bash, zsh) is already installed on most VMs
- No module importing / lib / dependency crap
- Ability to parallelize heavy commands and do it in an optimal way
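
And once the raw JSON files are in S3, the T step can stay in plain SQL. A hedged sketch of a Clickhouse query (it could live as a dbt model) that flattens the exchange-rate responses; the S3 endpoint, bucket, keys and path are placeholders:

-- read each raw JSON file from S3 as a single string, then unnest the rates object
SELECT
    JSONExtractString(raw, 'base') AS base_currency,
    kv.1 AS symbol,
    kv.2 AS rate,
    toDateTime(JSONExtract(raw, 'timestamp', 'UInt64')) AS rate_ts
FROM s3(
    'https://<s3-endpoint>/<bucket>/<path>/*.json',
    '<access_key>', '<secret_key>',
    'JSONAsString', 'raw String'
)
ARRAY JOIN JSONExtractKeysAndValues(raw, 'rates', 'Float64') AS kv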
I took part in the DevsTalk LIVE podcast - Data Science interviews: what, how, and what for?

An hour with interesting people flew by.

- What have you done for hip-hop Data Analytics in your day?
- Are you the same culture as me? 😂
- Who is an Analytics Engineer
- Do's and Don'ts of writing a CV
- Red flags when hiring people for the team

Watch on YouTube →
During one of the lessons we examined popular file formats. How do columnar storage, data encoding, and compression algorithms make a difference?

File formats comparison: CSV, JSON, Parquet, ORC

Key results

Whenever you need to store your data in S3 / a Data Lake / an external table, choose the file format wisely:

– Parquet / ORC are the best options due to their efficient data layout, compression, and indexing capabilities
– Columnar formats allow for column projection and partition pruning (reading only the relevant data!)
– Binary formats enable schema evolution, which is very useful in a constantly changing business environment

Inputs

– I used Clickhouse + the S3 table engine to compare the different file formats
– A single-node Clickhouse database was used – s2.small preset: 4 vCPU, 100% vCPU rate, 16 GB RAM
– Source data: TPC-H synthetic dataset for 1 year – 18.2M rows, 2 GB raw CSV size
– A single query is run at a time to ensure 100% dedicated resources

To reproduce it yourself you will need a Yandex.Cloud account, a Clickhouse database set up, and S3 keys generated. The source data is available via a public S3 link: https://storage.yandexcloud.net/otus-dwh/dbgen/lineorder.tbl.
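
For illustration, the kind of statements involved. A sketch with an abbreviated column list, placeholder bucket / credentials, and the assumption that the raw data is already loaded into a local lineorder table:

-- external table backed by S3, one per file format under test
CREATE TABLE lineorder_s3_parquet
(
    LO_ORDERKEY  UInt32,
    LO_ORDERDATE Date,
    LO_QUANTITY  UInt8,
    LO_REVENUE   UInt32
    -- ...the remaining lineorder columns
)
ENGINE = S3('https://storage.yandexcloud.net/<bucket>/lineorder.parquet', '<access_key>', '<secret_key>', 'Parquet');

-- serialization test: how long does it take to write the dataset out in this format?
INSERT INTO lineorder_s3_parquet
SELECT LO_ORDERKEY, LO_ORDERDATE, LO_QUANTITY, LO_REVENUE FROM lineorder;

-- query latency test: a simple analytical query over the external data
SELECT toYYYYMM(LO_ORDERDATE) AS month, sum(LO_REVENUE) AS revenue
FROM lineorder_s3_parquet
GROUP BY month
ORDER BY month;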

Comparison measures

1. Time to serialize / deserialize

How long does it take to write data to disk in a particular format?

– The compressed columnar formats (ORC, Parquet) take the lead here
– On average it takes about 6x longer to write JSON data to disk than columnar formats (120 sec vs 20 sec)
– The less data you write to disk, the less time it takes – no surprise

2. Storage size

What amount of disk space is used to store data?

– Unsurprisingly, the best results come from zstd-compressed ORC and Parquet
– The worst result is uncompressed JSON, which is almost 3 times larger than the source CSV data (the field names are repeated in every row!)
– zstd-compressed CSV also does well: 632 MB vs 2 GB of uncompressed data

3. Query latency (response time)

How fast can one get query results for a simple analytical query?

3.1. OLAP query over the whole dataset (12/12 months)

– Columnar formats outperform text formats because they read only the columns they need, so there is no excessive IO
– Compression means fewer IO operations and thus lower latency

3.2. OLAP query over a subset of rows (1/12 months)

– The results for this query are pretty much the same as for the previous one without the WHERE condition
– Although columnar formats allow for partition pruning and reading only the relevant rows (according to the WHERE condition), the filter is not pushed down
– The Clickhouse EXPLAIN command revealed that the filter is applied only after the whole dataset is fetched from S3 😕
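
A sketch of how to check this, assuming the same kind of external table as above; the finding corresponds to the Filter step sitting above the S3 read step in the plan, i.e. rows are filtered only after they arrive from S3:

-- look at where the Filter step sits relative to the S3 read step
EXPLAIN
SELECT sum(LO_REVENUE)
FROM lineorder_s3_parquet
WHERE toMonth(LO_ORDERDATE) = 1;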

See scripts and queries →