Data Apps Design
In this blog I publish my findings and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
Just received a small viz on Data professions, built in Metabase by a student as a homework assignment.

Simple but still incredible.

We could take it to the next level by:
– refreshing the data on a regular basis
– expanding to other professions like Web, Backend, Designer
– introducing additional dimensions and metrics
– granting public access to this dashboard

#bi #metabase #hh
Data Vault 2.0 + Greenplum + dbtVault assignment

Step-by-step instructions are available as a GitHub Gist:

1. Spin up Greenplum Cluster on Yandex.Cloud

2. Generate a TPC-H dataset (10GB)

3. Populate Data Vault with dbtVault package

https://gist.github.com/kzzzr/4ab36bec6897e48e44e792dc2e706de9
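
To give a flavour of step 3: with dbtVault a Data Vault hub boils down to a single macro call. A minimal sketch (file path, column and staging model names here are illustrative, the gist has the real ones):

-- models/raw_vault/hub_orders.sql
{{ dbtvault.hub(src_pk='ORDER_HK',
                src_nk='ORDER_KEY',
                src_ldts='LOAD_DATE',
                src_source='RECORD_SOURCE',
                source_model='stg_orders') }}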

We have been discussing this topic for 3 lessons so far (6 hours).
Anyone interested can try it out.

#datavault #greenplum #dbtvault
Identify Skewed Tables and Redistribute them

Classic issue with MPP and clustered databases

The whole idea in two pictures:

When data is skewed among cluster nodes, there's a bottleneck: total execution time equals the execution time of the slowest node.
Amazon Redshift offers its own Advisor to detect skewed tables, but you still have to choose the distribution key yourself.
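
You can also spot skew yourself with a quick query against svv_table_info, something like:

-- skew_rows = ratio of rows in the fullest slice to the emptiest one
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC
LIMIT 10 ;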

Using dbt allows changing table distribution style easily.
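
In dbt-redshift it's a one-line model config. A minimal sketch (the model body and ref name are mine):

{{ config(materialized='table', dist='even') }}

select * from {{ ref('stg_orders') }}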

Those who don't use dbt can run:

ALTER TABLE {table_name} ALTER DISTSTYLE EVEN ;

This one is settled.

#mpp #skew #redshift
Andrew @Gunnnn just shared his thoughts on Databricks' Delta Live Tables.

After inspection I can say it is pretty similar to so-called Materialized Views.

Recently I've been switching all of Wheely's models in the flatten and staging layers to MVs.
A detailed post is coming.

Briefly, Redshift Materialized Views:

- Pre-compute result sets (that is why they are called 'materialized')
- Can be refreshed automatically as soon as new data arrives in source tables
- Incremental refresh is possible via internal Redshift algorithms
- Run as simple as REFRESH MATERIALIZED VIEW {name} ;
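
For illustration, a minimal auto-refreshing MV could look like this (view name and column list are my assumptions):

CREATE MATERIALIZED VIEW flatten.flt_orders_mv
AUTO REFRESH YES -- the engine refreshes it as new data lands in the base table
AS
SELECT order_id, status, created_at -- columns assumed for the example
FROM hevo.wheely_prod_orders ;

REFRESH MATERIALIZED VIEW flatten.flt_orders_mv ;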

Now I can call it a new paradigm, where DAGs of transformations are triggered automatically by the database engine as soon as new data arrives in the source tables.

#materializedview
Schema evolution showcase

Data is ever-changing. A couple of cases you will definitely face while dealing with data pipelines:

– adding new attributes, removing old ones
– changing data types: int -> float, resizing text
– renaming columns (changing the mapping)

I got a couple of new events with an attribute exceeding the column's current maximum length in the database.

The simple approach is to resize the problem column:

ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;

A lot of ELT tools can do this automatically, without a Data Engineer's attention.

In my case it's tricky as I am using Materialized Views:

> SQL Error [500310] [0A000]: Amazon Invalid operation: cannot alter type of a column used by a materialized view

As a result I had to drop the dependent objects, resize the column, then re-create the objects:

DROP MATERIALIZED VIEW flatten.flt_orders CASCADE ;
DROP MATERIALIZED VIEW dbt_test.flt_orders CASCADE ;
DROP MATERIALIZED VIEW ci.flt_orders CASCADE ;

ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;

dbt run -m flt_orders+1 --target prod
dbt run -m flt_orders+1 --target dev
dbt run -m flt_orders+1 --target ci

#pipelines #schemaevolution
One of the ways to improve code quality is doing proper reviews:

– Require at least 2 reviews for a PR
– Assign reviewers automatically according to CODEOWNERS file
– Require status checks to pass before merging

CODEOWNERS could look as simple as:

*  @owner1 @owner2 @owner3

/models/flatten/ @owner1
/models/staging/ @owner2
/models/marts/ @owner3

Another thing is testing expectations of your data

As soon as you face any bug or issue, add a test for it so you don't miss it next time!

My new expectation is simply a non-empty table – expect_table_row_count_to_be_between from the dbt_expectations package:

tests:
  - dbt_expectations.expect_table_row_count_to_be_between:
      min_value: 100
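
Wired into a model's schema.yml it sits like this (model name is mine):

models:
  - name: flt_orders  # illustrative model name
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100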

#codereview #codequality
17 days. The end 😅
Metabase + Clickhouse tutorial on Yandex.Cloud

1. Spin up a Virtual Machine
2. Configure VM: SSH, Docker
3. Download Clickhouse plug-in
4. Deploy Metabase
5. Connect to Clickhouse Playground
6. Visualize a question

https://gist.github.com/kzzzr/ecec7dca8bb70586a23569993df470e8
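
Steps 3 and 4 are essentially a couple of shell commands (a sketch; grab clickhouse.metabase-driver.jar from the metabase-clickhouse-driver releases page):

mkdir -p ~/metabase/plugins
# put the downloaded clickhouse.metabase-driver.jar into ~/metabase/plugins, then:
docker run -d -p 3000:3000 \
  -v ~/metabase/plugins:/plugins \
  --name metabase metabase/metabase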

#bi #clickhouse