Data Apps Design
In this blog I publish my takeaways and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
Airbyte Clickhouse destination

Airbyte has released a Clickhouse destination, which I already use to gather data from multiple sources.

By default it replicates all the data as JSON blobs (all the attributes inside one String field)

To get it flattened you either have to do it yourself or use Airbyte normalization.

1. Flattening manually with JSON functions

JSONExtract(_airbyte_data, 'id', 'UInt64') as id

Could be tricky and exhausting if you have a lot of attributes.

Works extremely fast
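
For example, a minimal sketch of such a flattening query, assuming the raw stream table is called _airbyte_raw_users and keeps the JSON blob in the _airbyte_data String column (the table and attribute names here are illustrative):

-- flatten the Airbyte JSON blob into typed columns
SELECT
    JSONExtract(_airbyte_data, 'id', 'UInt64') AS id,
    JSONExtractString(_airbyte_data, 'email') AS email,
    JSONExtract(_airbyte_data, 'is_active', 'UInt8') AS is_active,
    parseDateTimeBestEffort(JSONExtractString(_airbyte_data, 'created_at')) AS created_at
FROM _airbyte_raw_users

Wrapping it into a view (or a materialized view) keeps downstream queries clean.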

2. Airbyte normalization (= dbt underneath 😉)

It will flatten all your data automatically

Technically it is an auto-generated dbt project

Still a little bit buggy and looks like a work in progress.

I almost managed to get it working, but I use Yandex Managed Clickhouse, which requires an SSL / secure connection.

Unfortunately, Airbyte's dbt profiles.yml is hard-coded to secure: False at the moment.

I might create a PR to fix this when I have some time.

#airbyte #clickhouse #dbt
I will give normalization another try some other day 😄

Leave a comment / reaction if you are interested
Datalens from Yandex is quite a powerful BI tool.

Especially when you use it on top of Clickhouse, which makes analytics interactive with sub-second latency.

Among the outstanding features I've already tried:

— Advanced functions to build almost anything one can imagine: timeseries, arrays, geo, window functions
— Nice, customizable charts that integrate with dashboards
— Sharing with your team / anyone on the internet

— Useful docs with examples and how-tos
— Really friendly community here in Telegram (important!)
— It is free of charge!

The more I use it, the more I love it.

Take a look at how I managed to build a Year-over-Year analysis with the LAG function and draw different kinds of viz!

LAG([Выручка (₽)], 52 ORDER BY [Неделя] BEFORE FILTER BY [Неделя])
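
For reference, roughly the same 52-week shift can be expressed directly in Clickhouse with a window function ([Выручка (₽)] is revenue, [Неделя] is week). A sketch, assuming a weekly_revenue table with week and revenue columns; the names are illustrative:

SELECT
    week,
    revenue,
    -- revenue 52 weeks back, i.e. the same week of the previous year
    lagInFrame(revenue, 52) OVER (ORDER BY week ROWS BETWEEN 52 PRECEDING AND CURRENT ROW) AS revenue_prev_year
FROM weekly_revenue
ORDER BY week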

#datalens #clickhouse
Hi! A new post on Habr ⬇️⬇️⬇️
Give it some upvotes if you like the material, and I'm already preparing part two.
[RU] Harmful advice for building Analytics (Data Lake / DWH / BI) – what to avoid

Over the last few months I have been doing a lot of code base refactoring and optimizing processes and calculations in the Data Analytics field.

I wanted to use the “harmful advice” format to draw attention to a set of practices and approaches that can lead to rather unpleasant consequences and sometimes end up costing your company dearly.

The post covers:

– Using select * – everything at once
– Using an excessive number of CTEs (common table expressions)
– NOT DRY (Don't repeat yourself) – repetitive, kaleidoscopic calculations

#best_practices #dwh

Read on Habr →
What is the easiest way to write a custom data integration?

1. Fetch source data via API calls – the E of ELT
2. Store raw data in S3 / a Data Lake – the L of ELT
3. Transform data as you wish via dbt – the T of ELT (see the SQL sketch at the end of this post)

While the focus is mainly on Transformations, which are the most complex and interesting part and are all about delivering business value, it is still essential to perform Extract-Load in a clear and understandable way.

A shell script is the easiest way to perform EL, in my opinion (where possible 😉).

Take a look at the example of fetching exchange rates →

1. Useful shell options – debugging and safe exit.

set -x prints each command with its variables expanded, prefixed by a little + sign, before running it.

set -e instructs bash to exit immediately if any command returns a non-zero exit status.

2. Variables

Either assign them directly in the bash script or provide them as environment variables (preferred).

TS=$(date +"%Y-%m-%d-%H-%M-%S-%Z")

3. Chain or pipe commands

The output of one command can be the input of another.

The JSON response fetched from the API call is streamed to the AWS S3 bucket directly, without any intermediate storage:

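# '-' tells aws s3 cp to read the object from stdin, so the response is streamed to S3 without touching the local disk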
curl -H "Authorization: Token $OXR_TOKEN" \
"https://openexchangerates.org/api/historical/$BUSINESS_DT.json?base=$BASE_CURRENCY&symbols=$SYMBOLS" \
| aws s3 cp - s3://$BUCKET/$BUCKET_PATH/$BUSINESS_DT-$BASE_CURRENCY-$TS.json

4. Echo log messages

5. Schedule and monitor with Airflow.

Use templates, variables, loops, dynamic DAGs.

Do it the right way once and just monitor for errors. As simple as that.

6. Additional pros:

- Shell (bash, zsh) is already installed on most VMs
- No module importing / lib / dependency crap
- Ability to parallelize heavy commands and do it in an optimal way
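
And once the raw JSON files are in S3, the T step can stay in plain SQL. A hedged sketch of a Clickhouse query (it could live as a dbt model) that flattens the exchange-rate responses; the S3 endpoint, bucket, keys and path are placeholders:

-- read each raw JSON file from S3 as a single string, then unnest the rates object
SELECT
    JSONExtractString(raw, 'base') AS base_currency,
    kv.1 AS symbol,
    kv.2 AS rate,
    toDateTime(JSONExtract(raw, 'timestamp', 'UInt64')) AS rate_ts
FROM s3(
    'https://<s3-endpoint>/<bucket>/<path>/*.json',
    '<access_key>', '<secret_key>',
    'JSONAsString', 'raw String'
)
ARRAY JOIN JSONExtractKeysAndValues(raw, 'rates', 'Float64') AS kv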
I took part in the DevsTalk LIVE podcast - Data Science interviews: what, how, and what for?

An hour with interesting people flew by.

- What have you done for hip-hop Data Analytics in your day?
- Are you the same culture as me? 😂
- Who is an Analytics Engineer
- Do's and Don'ts of writing a CV
- Red flags when hiring people for the team

Watch on YouTube →
During one of the lessons we examined popular file formats. How do columnar storage, data encoding, and compression algorithms make a difference?

File formats comparison: CSV, JSON, Parquet, ORC

Key results

Whenever you need to store your data in S3 / a Data Lake / an external table, choose the file format wisely:

– Parquet / ORC are the best options due to their efficient data layout, compression, and indexing capabilities
– Columnar formats allow for column projection and partition pruning (reading only the relevant data!)
– Binary formats enable schema evolution, which is very useful in a constantly changing business environment

Inputs

– I used Clickhouse + the S3 table engine to compare the different file formats
– A single-node Clickhouse database was used – s2.small preset: 4 vCPU, 100% vCPU rate, 16 GB RAM
– Source data: TPC-H synthetic dataset for 1 year – 18.2M rows, 2 GB raw CSV size
– A single query is run at a time to ensure 100% dedicated resources

To reproduce it yourself you will need a Yandex.Cloud account, a Clickhouse database set up, and S3 keys generated. The source data is available via a public S3 link: https://storage.yandexcloud.net/otus-dwh/dbgen/lineorder.tbl.
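
For illustration, the kind of statements involved. A sketch with an abbreviated column list, placeholder bucket / credentials, and the assumption that the raw data is already loaded into a local lineorder table:

-- external table backed by S3, one per file format under test
CREATE TABLE lineorder_s3_parquet
(
    LO_ORDERKEY  UInt32,
    LO_ORDERDATE Date,
    LO_QUANTITY  UInt8,
    LO_REVENUE   UInt32
    -- ...the remaining lineorder columns
)
ENGINE = S3('https://storage.yandexcloud.net/<bucket>/lineorder.parquet', '<access_key>', '<secret_key>', 'Parquet');

-- serialization test: how long does it take to write the dataset out in this format?
INSERT INTO lineorder_s3_parquet
SELECT LO_ORDERKEY, LO_ORDERDATE, LO_QUANTITY, LO_REVENUE FROM lineorder;

-- query latency test: a simple analytical query over the external data
SELECT toYYYYMM(LO_ORDERDATE) AS month, sum(LO_REVENUE) AS revenue
FROM lineorder_s3_parquet
GROUP BY month
ORDER BY month;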

Comparison measures

1. Time to serialize / deserialize

How long does it take to write data to disk in a particular format?

– The compressed columnar formats (ORC, Parquet) take the lead here
– On average it takes about 6x longer to write JSON data to disk than columnar formats (120 sec vs 20 sec)
– The less data you write to disk, the less time it takes – no surprise

2. Storage size

What amount of disk space is used to store data?

– Unsurprisingly, the best results come from zstd-compressed ORC and Parquet
– The worst result is uncompressed JSON, which is almost 3 times larger than the source CSV data (the field names are repeated in every row!)
– zstd-compressed CSV also does well: 632 MB vs 2 GB of uncompressed data

3. Query latency (response time)

How fast can one get query results for a simple analytical query?

3.1. OLAP query over the whole dataset (12/12 months)

– Columnar formats outperform text formats because they read only the columns they need, so there is no excessive IO
– Compression means fewer IO operations and thus lower latency

3.2. OLAP query over a subset of rows (1/12 months)

– The results for this query are pretty much the same as for the previous one without the WHERE condition
– Although columnar formats allow for partition pruning and reading only the relevant rows (according to the WHERE condition), the filter is not pushed down
– The Clickhouse EXPLAIN command revealed that the filter is applied only after the whole dataset is fetched from S3 😕
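
A sketch of how to check this, assuming the same kind of external table as above; the finding corresponds to the Filter step sitting above the S3 read step in the plan, i.e. rows are filtered only after they arrive from S3:

-- look at where the Filter step sits relative to the S3 read step
EXPLAIN
SELECT sum(LO_REVENUE)
FROM lineorder_s3_parquet
WHERE toMonth(LO_ORDERDATE) = 1;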

See scripts and queries →