AWS denounces its own error logs
Your post may include a non-inclusive word (master)
😭😂
Hey, long time no see 🙂
We have just started our engagement with dbt Labs at Wheely.
Their team will conduct an audit and help us improve our dbt deployment even further.
They already have access to:
– dbt Cloud (jobs)
– GitHub repo (code)
– Redshift (database)
– Looker (BI + monitoring)
– Slack (communication)
Yesterday we had the kick-off call. Overall, great impressions.
What is going to happen:
– Audit: Deployment + Performance
– Audit: Project Structure
– Features / fixes
I will keep you posted.
Airbyte ClickHouse destination
Airbyte has released a ClickHouse destination, which I already use to gather data from multiple sources.
By default it replicates all the data as JSON blobs (all attributes inside one String field).
To get it flattened, you either have to do it yourself or use Airbyte normalization.
1. Flattening manually with JSON functions (sketch after this list)
JSONExtract(_airbyte_data, 'id', 'UInt64') as id
➕ Works extremely fast
➖ Could be tricky and exhausting if you have a lot of attributes
2. Airbyte normalization (= dbt underneath 😉)
➕ It will flatten all your data automatically
Technically it is an auto-generated dbt project.
➖ Still a little bit buggy and looks like a work in progress.
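For option 1, here is a minimal sketch of manual flattening in ClickHouse. Only the _airbyte_data column name comes from Airbyte; the raw table name and the attributes are hypothetical and depend on your stream:

-- Hypothetical raw table written by the Airbyte ClickHouse destination
CREATE VIEW users_flat AS
SELECT
    JSONExtract(_airbyte_data, 'id', 'UInt64')                               AS id,
    JSONExtractString(_airbyte_data, 'email')                                AS email,
    parseDateTimeBestEffort(JSONExtractString(_airbyte_data, 'created_at'))  AS created_at
FROM _airbyte_raw_users;

Wrapping it in a view keeps the raw data untouched while giving analysts typed columns to query.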
I almost managed to get it done, but I use Yandex's Managed ClickHouse, which requires an SSL / secure connection.
Unfortunately, Airbyte's dbt profiles.yml is hard-coded to secure: False at the moment.
I might create a PR to fix this when I have some time.
#airbyte #clickhouse #dbt
I will try to overcome normalization another day 😄
Leave a comment / reaction if you are interested
DataLens from Yandex is quite a powerful BI tool.
Especially when you use it on top of ClickHouse, which makes analytics interactive with sub-second latency.
Among the outstanding features I've already tried:
— Advanced functions to build almost anything one can imagine: timeseries, arrays, geo, window functions
— Nice, customizable charts that integrate with dashboards
— Sharing with the team / anyone on the internet
— Useful docs with examples and how-tos
— Really friendly community here in Telegram (important!)
— It is free of charge!
The more I use it, the more I love it.
Take a look at how I managed to build a Year-over-Year analysis with the LAG function and draw different kinds of viz!
LAG([Выручка (₽)], 52 ORDER BY [Неделя] BEFORE FILTER BY [Неделя])
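For reference, here is roughly the same calculation in plain ClickHouse SQL ([Выручка (₽)] is revenue, [Неделя] is week). The weekly_revenue table and its columns are hypothetical, and this sketch does not reproduce BEFORE FILTER BY, which in DataLens makes the lag be computed before dashboard filters are applied:

-- Hypothetical table: weekly_revenue(week Date, revenue Float64)
SELECT
    week,
    revenue,
    lagInFrame(revenue, 52) OVER (
        ORDER BY week ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS revenue_year_ago
FROM weekly_revenue
ORDER BY week;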
#datalens #clickhouse
Hi! New post on Habr ⬇️⬇️⬇️
Give it an upvote if you like the material; part two is already in the works.
[RU] Bad advice for building Analytics (Data Lake / DWH / BI) – what to avoid
Over the last few months I have been doing a lot of refactoring of the code base and optimizing processes and calculations in the Data Analytics space.
In the spirit of "bad advice", I wanted to highlight a set of practices and approaches that can lead to rather unpleasant consequences and sometimes cost your company dearly.
The post covers:
– Using select * – everything at once (quick illustration after this list)
– Using an excessive number of CTEs (common table expressions)
– NOT DRY (Don't Repeat Yourself) – repetitive, kaleidoscopic calculations
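A quick illustration of the first item (hypothetical table and column names): naming the columns you actually need keeps a model readable and lets the warehouse scan less data.

-- Bad advice: pull everything and let downstream sort it out
select * from orders;

-- Better: name the columns the model actually needs
select order_id, customer_id, order_ts, amount
from orders;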
#best_practices #dwh
Read on Habr →
What is the easiest way to write custom data integration?
1. Fetch source data via API calls – E of ELT
2. Store raw data via S3 / Data Lake – L of ELT
3. Transform data as you wish via dbt – T of ELT (sketch at the end of this post)
While the focus is mainly on Transformations (the most complex and interesting part, and the one that delivers business value), it is still essential to perform Extract-Load in a clear and understandable way.
A shell script is the easiest way to perform EL, in my opinion (where possible 😉).
Take a look at the example of Fetching exchange rates →
1. Useful shell options – debugging and safe exit
set -x expands variables and prints a little + sign before each line.
set -e instructs bash to exit immediately if any command has a non-zero exit status.
2. Variables
Either assign directly in the bash script or provide as environment variables (preferred).
TS=`date +"%Y-%m-%d-%H-%M-%S-%Z"`
3. Chain or pipe commands
The result of one command can be the input to another command.
The JSON response fetched from the API call is transferred to an AWS S3 bucket directly, without any intermediate storage:
curl -H "Authorization: Token $OXR_TOKEN" \
"https://openexchangerates.org/api/historical/$BUSINESS_DT.json?base=$BASE_CURRENCY&symbols=$SYMBOLS" \
| aws s3 cp - s3://$BUCKET/$BUCKET_PATH/$BUSINESS_DT-$BASE_CURRENCY-$TS.json
4. Echo log messages
5. Schedule and monitor with Airflow.
Use templates, variables, loops, dynamic DAGs.
Do it the right way once and just monitor for errors. As simple as that.
6. Additional pros:
- Shell (bash, zsh) is already installed on most VMs
- No module imports / libs / dependency crap
- Ability to parallelize heavy commands and do it in an optimal way
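To close the loop on step 3 (the T), here is a minimal sketch of a dbt model over those raw files. Everything in it is an assumption: a source raw.exchange_rates exposing the S3 JSON files (e.g. via Redshift Spectrum) with one document per row in a payload column, and EUR as one of the requested symbols.

-- models/staging/stg_exchange_rates.sql (hypothetical dbt model, Redshift SQL)
select
    json_extract_path_text(payload, 'base')                 as base_currency,
    timestamp 'epoch'
        + json_extract_path_text(payload, 'timestamp')::bigint * interval '1 second'
                                                             as rate_ts,
    json_extract_path_text(payload, 'rates', 'EUR')::float8 as eur_rate
from {{ source('raw', 'exchange_rates') }}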