Andrew @Gunnnn just shared his thoughts on Databricks' Delta Live Tables.
After inspecting it, I can say it is pretty similar to so-called Materialized Views.
Recently I've been switching all of Wheely's models in the flatten and staging layers to MVs.
A detailed post is coming.
Briefly, Redshift Materialized Views:
- Pre-compute result sets (that is why they are called 'materialized')
- Can be refreshed automatically as soon as new data arrives in the source tables
- Support incremental refresh via internal Redshift algorithms
- Can be refreshed as simply as
REFRESH MATERIALIZED VIEW {name} ;
Now I can call it a new paradigm, where DAGs of transformations are triggered automatically by the database engine as soon as new data arrives in the source tables.
#materializedview
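For reference, a minimal sketch of what an auto-refreshing Materialized View can look like in Redshift (the view and column names here are hypothetical):
-- create an MV that Redshift refreshes automatically when the source table changes
CREATE MATERIALIZED VIEW flatten.flt_orders_mv
AUTO REFRESH YES
AS
SELECT order_id, status, created_at
FROM hevo.wheely_prod_orders ;
Keep in mind that Redshift restricts which queries are eligible for auto and incremental refresh, so more complex models may still need an explicit REFRESH.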
Schema evolution showcase
Data is ever-changing. A couple of examples you will definitely face while dealing with data pipelines:
– adding new attributes, removing old ones
– changing data types: int -> float, resizing text
– renaming columns (changing mappings)
I got a couple of new events with an attribute exceeding the current maximum column length in the database.
The simple approach is to resize the problem column:
ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;
A lot of ELT tools can do this automatically without DE attention. In my case it's tricky as I am using Materialized Views:
> SQL Error [500310] [0A000]: Amazon Invalid operation: cannot alter type of a column used by a materialized view
As a result I had to drop the dependent objects, resize the column, and then re-create the objects again:
-- drop the dependent materialized views in every target schema
DROP MATERIALIZED VIEW flatten.flt_orders CASCADE ;
DROP MATERIALIZED VIEW dbt_test.flt_orders CASCADE ;
DROP MATERIALIZED VIEW ci.flt_orders CASCADE ;
-- resize the problem column
ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;
# re-create the model and its first-degree children in each target (flt_orders+1)
dbt run -m flt_orders+1 --target prod
dbt run -m flt_orders+1 --target dev
dbt run -m flt_orders+1 --target ci
#pipelines #schemaevolution
One of the things that improves code quality is doing proper reviews:
– Require at least 2 reviews for a PR
– Assign reviewers automatically according to CODEOWNERS file
– Require status checks to pass before merging
CODEOWNERS could look as simple as:
* @owner1 @owner2 @owner3
/models/flatten/ @owner1
/models/staging/ @owner2
/models/marts/ @owner3
Another thing is testing the expectations of your data.
As soon as you face a bug or issue, add a test for it so you don't miss it next time!
My new expectation is simply a non-empty table – expect_table_row_count_to_be_between from the dbt_expectations package:
tests:
  - dbt_expectations.expect_table_row_count_to_be_between:
      min_value: 100
#codereview #codequality
Metabase + Clickhouse tutorial on Yandex.Cloud
1. Spin up a Virtual Machine
2. Configure VM: SSH, Docker
3. Download Clickhouse plug-in
4. Deploy Metabase
5. Connect to Clickhouse Playground
6. Visualize a question
https://gist.github.com/kzzzr/ecec7dca8bb70586a23569993df470e8
#bi #clickhouse
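Once Metabase is connected, a quick sanity check is to run a native question against ClickHouse's system tables (just a minimal sketch; any Playground dataset would do):
-- list a few tables visible through the connection
SELECT database, name, engine
FROM system.tables
LIMIT 10 ;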
Prior to diving deep into complex analytics modeling topics like Sessionization, Attribution, and RFM, one has to understand the motivation behind them.
It is the questions driving business decisions that come first; the instruments and practices for finding reliable answers to those questions come second.
Which questions does your business ask?
#analytics #modeling
[RU] Building a Data Vault on TPC-H data – Greenplum + dbtVault
In the post:
- Preparing the TPC-H dataset
- Spinning up a Greenplum cluster in Yandex.Cloud
- Diving into dbtVault code generation and macros
- Simulating incremental loading of the Data Vault
#dbt #dbtvault
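To give a flavour of the code generation described in the article, here is a minimal sketch of a dbtVault hub model; the staging model and column names are hypothetical, but the call follows the documented dbtvault.hub macro:
-- models/raw_vault/hub_customer.sql (hypothetical model)
{{ config(materialized='incremental') }}
{%- set source_model = 'v_stg_orders' -%}
{%- set src_pk = 'CUSTOMER_PK' -%}
{%- set src_nk = 'CUSTOMER_ID' -%}
{%- set src_ldts = 'LOAD_DATE' -%}
{%- set src_source = 'RECORD_SOURCE' -%}
{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                src_source=src_source, source_model=source_model) }}
The macro expands into the full incremental hub-loading SQL, which is exactly the code generation the article walks through.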
I just tried to upgrade from 21.08 to 21.10.
After 30 minutes, it looks like this cluster is never coming back.
I hope I can create a new one and restore a backup onto it.
We are in the middle of migrating from Amazon Redshift DC2 nodes (2nd gen) to RA3 nodes (3rd gen) at Wheely.
What this means for us:
– Almost unlimited disk space (RA3 separates compute and storage)
– Speeding up Data Marts to a 2-hour delay from real time
– Blue/green deployments
I will follow up as soon as we are finished.
Attached is a simplified checklist plan.
Any questions are welcome.
Hi! Today, November 18 at 15:00, I invite you to a webinar.
Semi-structured Data in Analytical Warehouses: Nested JSON + Arrays
- Sources of semi-structured data: Events, Webhooks, Logs
- Approaches: JSON functions, special data types, External tables (Lakehouse)
- Performance optimization
We will walk through examples in Amazon Redshift and Clickhouse.
Registration link: https://otus.ru/lessons/dwh/#event-1661
The link to the YouTube stream will be published here 5 minutes before the start.
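As a small teaser of the JSON-functions and special-data-types approaches, here is a minimal Redshift sketch; the table and attribute names are hypothetical:
-- option 1: JSON kept as plain text and parsed with JSON functions
SELECT json_extract_path_text(payload, 'rider', 'id') AS rider_id
FROM events_raw_text ;
-- option 2: the SUPER data type with PartiQL navigation into nested objects and arrays
CREATE TABLE events_raw (event_id BIGINT, payload SUPER) ;
SELECT event_id,
       payload.rider.id     AS rider_id,
       payload.stops[0].eta AS first_stop_eta
FROM events_raw ;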
[RU] Webinar: Semi-structured Data in Analytical Warehouses: Nested JSON + Arrays
Webinar slides: https://docs.google.com/presentation/d/1dUxzGkBgXAp6s-VrFKT8Qw8UF6eZVQywpeQT4-ZohPM/edit?usp=sharing
Webinar recording: https://youtu.be/dtu0yeFdxvY?t=276
Feedback survey: https://forms.gle/JPFqoDYhJJjvnMj7A