I will try to overcome normalization another day 😄
Leave a comment / reaction if you are interested
DataLens from Yandex is quite a powerful BI tool.
Especially when you use it on top of ClickHouse, which makes analytics interactive with sub-second latency.
Among the outstanding features I've already tried:
— Advanced functions to build almost anything you can imagine: time series, arrays, geo, window functions
— Nice, customizable charts that integrate with dashboards
— Sharing with your team / anyone on the internet
— Useful docs with examples and how-tos
— A really friendly community here in Telegram (important!)
— It is free of charge!
The more I use it, the more I love it.
Take a look at how I managed to build a Year-over-Year analysis with the LAG function and draw different kinds of viz!
LAG([Выручка (₽)], 52 ORDER BY [Неделя] BEFORE FILTER BY [Неделя])
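In plain words: the formula takes [Выручка (₽)] 52 weekly points back, ordered by [Неделя], while BEFORE FILTER BY [Неделя] makes DataLens compute the offset before the week filter is applied, so last year's value is still available when the dashboard is filtered to the current period, which is exactly what a Year-over-Year comparison needs.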
#datalens #clickhouse
Hi! A new post on Habr ⬇️⬇️⬇️
Give it some upvotes if you like the material, and I'm already preparing part two.
[RU] Bad Advice for Building Analytics (Data Lake / DWH / BI) – What to Avoid
For the last few months I've been doing a lot of codebase refactoring and optimizing processes and calculations in the Data Analytics space.
In the "bad advice" format, I want to highlight a set of practices and approaches that can lead to rather unpleasant consequences and sometimes cost your company dearly.
In this post you'll find:
– Using select * – everything at once
– Overusing CTEs (common table expressions)
– NOT DRY (Don't Repeat Yourself) – repetitive, kaleidoscopic calculations
#best_practices #dwh
Read on Habr →
What is the easiest way to write a custom data integration?
1. Fetch source data via API calls – E of ELT
2. Store raw data via S3 / Data Lake – L of ELT
3. Transform data as you wish via dbt – T of ELT
While the focus is mainly on Transformations, which are the most complex and interesting part and all about delivering business value, it is still essential to perform Extract-Load in a clear and understandable way.
A shell script is the easiest way to perform EL, in my opinion (where possible 😉).
Take a look at the example of fetching exchange rates →
1. Useful shell options – debugging and safe exit.
set -x expands variables and prints a little + sign before the line.
set -e instructs bash to immediately exit if any command has a non-zero exit status.
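A minimal header for such a script might look like this (a sketch; the exact flags in the original gist may differ):
#!/bin/bash
# -e: stop on the first failing command; -x: print each command before running it
set -ex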
2. Variables
Either assign them directly in the bash script or provide them as environment variables (preferred).
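For illustration, the variables referenced in the curl call below could be provided like this (placeholder values; the names are taken from that command):
# Placeholder values – set these outside the script or in your scheduler
export OXR_TOKEN="<your openexchangerates token>"
export BASE_CURRENCY="USD"
export SYMBOLS="EUR,GBP,RUB"
export BUSINESS_DT="2021-10-01"
export BUCKET="<your-s3-bucket>"
export BUCKET_PATH="exchange_rates"
A value computed inside the script, like the timestamp below, works the same way.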
TS=`date +"%Y-%m-%d-%H-%M-%S-%Z"`
3. Chain or pipe commands
The result of one command can be the input to another command.
The JSON response fetched from the API call is transferred to an AWS S3 bucket directly, without any intermediate storage:
curl -H "Authorization: Token $OXR_TOKEN" \4. Echo log messages
"https://openexchangerates.org/api/historical/$BUSINESS_DT.json?base=$BASE_CURRENCY&symbols=$SYMBOLS" \
| aws s3 cp - s3://$BUCKET/$BUCKET_PATH/$BUSINESS_DT-$BASE_CURRENCY-$TS.json
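As a sketch (the exact wording is up to you), the log lines could be as simple as:
# Timestamped progress messages make scheduler logs easy to read
echo "$(date +%Y-%m-%dT%H:%M:%S) fetched $BASE_CURRENCY rates for $BUSINESS_DT"
echo "uploaded to s3://$BUCKET/$BUCKET_PATH/$BUSINESS_DT-$BASE_CURRENCY-$TS.json"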
5. Schedule and monitor with Airflow.
Use templates, variables, loops, dynamic DAGs.
Do it the right way once and just monitor for errors. As simple as that.
6. Additional pros:
- Shell (bash, zsh) is already installed on most VMs
- No module importing / lib / dependency crap
- Ability to parallelize heavy commands and do it in an optimal way (sketch below)
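For instance, several business dates could be fetched in parallel with plain background jobs and wait (an illustrative pattern; fetch_rates.sh is a hypothetical wrapper around the steps above):
# Launch one fetch per date in the background, then wait for all of them to finish
for BUSINESS_DT in 2021-09-29 2021-09-30 2021-10-01; do
  ./fetch_rates.sh "$BUSINESS_DT" &
done
wait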
I took part in the DevsTalk LIVE podcast – Interviews in Data Science: what, how, and what for?
An hour with interesting people flew by.
- What have you done in your time for the hip-hop of Data Analytics?
- Are you the same culture as me? 😂
- Who is an Analytics Engineer
- Do's and don'ts of writing a CV
- Red flags when hiring people for your team
Watch on YouTube →
During one of the lessons we examined popular file formats. How do columnar storage, data encoding, and compression algorithms make a difference?
File formats comparison: CSV, JSON, Parquet, ORC
Key results
Whenever you need to store your data in S3 / a Data Lake / an external table, choose the file format wisely:
– Parquet / ORC are the best options due to efficient data layout, compression, and indexing capabilities
– Columnar formats allow for column projection and partition pruning (reading only relevant data!)
– Binary formats enable schema evolution, which is very useful in a constantly changing business environment
Inputs
– I used ClickHouse + the S3 table engine to compare different file formats
– A single-node ClickHouse database was used – s2.small preset: 4 vCPU, 100% vCPU rate, 16 GB RAM
– Source data: TPC-H synthetic dataset for 1 year – 18.2M rows, 2 GB raw CSV size
– A single query is run at a time to ensure 100% dedicated resources
To reproduce it yourself you will need a Yandex.Cloud account, a ClickHouse database, and S3 keys. The source data is available via a public S3 link: https://storage.yandexcloud.net/otus-dwh/dbgen/lineorder.tbl.
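To get a quick feel for the source file before setting anything up, you can peek at its first rows straight from the public link (a throwaway one-liner, not part of the benchmark itself):
# Stream just the beginning of the ~2 GB file and print the first 3 rows
curl -s https://storage.yandexcloud.net/otus-dwh/dbgen/lineorder.tbl | head -n 3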
Comparison measures
1. Time to serialize / deserialize
How long does it take to write data to disk in a particular format?
– Compressed columnar formats (ORC, Parquet) take the lead here
– On average it takes ~6x longer to write JSON data to disk than columnar formats (120 sec vs 20 sec)
– The less data you write to disk, the less time it takes – no surprise
2. Storage size
How much disk space is used to store the data?
– Unsurprisingly, zstd-compressed ORC and Parquet show the best results
– The worst result is uncompressed JSON, which is almost 3x larger than the source CSV data (the schema is repeated for every row!)
– zstd-compressed CSV also does well: 632 MB vs 2 GB of uncompressed data
3. Query latency (response time)
How fast can one get query results for a simple analytical query?
3.1. OLAP query over the whole dataset (12/12 months)
– Columnar formats outperform text formats because they read only the required columns, so there is no excessive IO
– Compression means fewer IO operations and thus lower latency
3.2. OLAP query over a subset of rows (1/12 months)
– Results for this query are pretty much the same as for the previous one without the WHERE condition
– Although columnar formats allow partition pruning and reading only the relevant rows (per the WHERE condition), the filter is not pushed down
– The ClickHouse EXPLAIN command revealed that the filter is applied only after the whole result set is returned from S3 😕
See scripts and queries →
[RU] Bad Advice for Building Analytics (Data Lake / DWH / BI) – Goal-Setting and Results
I'm continuing the series of posts in the "bad advice" format, which aims to draw attention to (not-so-)best practices and approaches in building analytics services, with real examples and stories.
In this post you'll find:
– Doing tasks without a clear understanding of the goals – Question your customer;
– Ignoring the assessment of expected data characteristics – Assess data expectations;
– Neglecting documentation and explanations for the code – Ensure access and transparency.
#best_practices #dwh
Read on Habr →