Data Apps Design – Telegram
In this blog I publish my takeaways and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
We have started vacuuming up GitHub events for the Wheely organization into our Warehouse. Integration via webhook:
– PushEvent
– PullRequestEvent
– ReleaseEvent
For now we are drawing on GitLab's experience – Centralized Engineering Metrics. Interesting metrics:
– MR Rate
– MRs vs…
Here's what GitHub has sent to our webhook over the past 5 days.

Events of most interest:

- Issues & Pull requests (#, who, when, where, how hard)
- Push (commit frequency and complexity by repos, teams)
- Workflows (Actions metrics)
- Checks (Continuous Integration metrics)

Detailed event payloads are described in the GitHub Docs.

This data is heavily nested, so the new SUPER data type in Redshift comes in really handy for unnesting and flattening it.
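A minimal PartiQL sketch, assuming the raw payloads land in a staging table github.webhook_events with a SUPER column named payload (these names are hypothetical):

SELECT
    e.payload.repository.full_name::varchar AS repo,
    e.payload.sender.login::varchar         AS author,
    c.message::varchar                      AS commit_message
FROM github.webhook_events AS e,
     e.payload.commits AS c          -- unnest the commits array of a PushEvent
WHERE e.payload.ref IS NOT NULL;     -- only push events carry a "ref"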

Soon I will build something worthwhile on top of this data.

#dwh #pipelines #github
Gave a brief Data Infrastructure overview today during an onboarding session for new Product Analysts @ Wheely

A follow-up to share with you all:

1. dbtCloud – invitations sent, start exploring
– Docs
– Data Sources
– Jobs definition

2. Redshift credentials sent to PM

3. Access Jupyter Hub via corporate email

4. Read more about dbt:
dbt basics
– Start with dbtCloud IDE
– Alternatively, install dbt@0.20.2 (use Homebrew on Mac) and use it with any local IDE (VSCode, PyCharm)

Shout-out to you guys!

#dataops #onboarding
Sometimes you have to test a lot of data quality expectations.

And sometimes tests might catch something glitchy and annoying over and over again, which in fact turns out to be OK.
For example, late-arriving data or an ELT process time lag.

Since 0.20.0, dbt has introduced the error_if + warn_if configs.

Now it won't fail with an ERROR, waking me up in the morning.

1st pic: error message in Slack
2nd pic: error details (failed tests)
3rd pic: the new config, which avoids raising errors when fewer than 10 rows fail
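The same idea as a minimal singular test sketch with a config block (test, model, and column names are hypothetical): warn on any failures, raise an error only at 10+ failing rows.

-- tests/assert_orders_completed_in_time.sql
{{ config(
    severity = 'error',
    warn_if  = '>0',
    error_if = '>=10'
) }}

select *
from {{ ref('fct_orders') }}
where completed_at is null
  and ordered_at < dateadd(hour, -2, getdate())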

#testing #dbt
I have updated Wheely's production workloads to a new version of dbt==0.21.0
Along with major improvements to performance, stability, and speed we now have:

– A dbt build command for multi-resource runs
– Handling for column schema changes in incremental models
– Defining configs in all the places you’d expect

An average prod job definition looks like: dbt seed + dbt run + dbt snapshot + dbt test
Now, with the single dbt build command, it's going to be much simpler and more convenient, building resource by resource, from left to right across your DAG.

The new on_schema_change parameter enables additional control when incremental model columns change (a config sketch follows this list). Possible strategies are:
– ignore (default): new columns will not appear in your target table.
– fail: triggers an error message when the source and target schemas diverge.
– append_new_columns: appends new columns to the existing table.
– sync_all_columns: adds any new columns to the existing table and removes any columns that are now missing.
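A minimal sketch of what that looks like in a model file, assuming a hypothetical incremental model over the hevo.wheely_prod_orders source (column names are illustrative):

{{ config(
    materialized     = 'incremental',
    unique_key       = 'order_id',
    on_schema_change = 'append_new_columns'
) }}

select
    order_id,
    status,
    updated_at
from {{ source('hevo', 'wheely_prod_orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}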

But not everything went smoothly; I've run into a couple of bugs these days:
– One that broke my materialization macro with new dispatch logic
– Serializable isolation violation (unrelated to dbt)

I will describe them in the next posts.

Meanwhile, read more about the upcoming dbt v1.0, coming in December 2021!

#dbt #release
Just received a small viz of Data professions, built in Metabase by a student as a homework assignment.

Simple but still incredible.

We might step up to the next level by:
– refreshing the data on a regular basis
– expanding to other professions like Web, Backend, Designer
– introducing some additional dimensions and metrics
– granting public access to this dashboard

#bi #metabase #hh
Data Vault 2.0 + Greenplum + dbtVault assignment

Step-by-step instructions are available as a GitHub Gist:

1. Spin up Greenplum Cluster on Yandex.Cloud

2. Generate TPCH dataset (10GB)

3. Populate the Data Vault with the dbtVault package (a Hub model sketch follows the link)

https://gist.github.com/kzzzr/4ab36bec6897e48e44e792dc2e706de9
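To give a flavour of the modelling, here is a minimal Hub sketch using the dbtvault hub macro; the staging view and column names are assumptions in the spirit of the gist's TPCH setup:

{{ config(materialized='incremental') }}

{%- set source_model = "v_stg_orders" -%}
{%- set src_pk       = "ORDER_HK" -%}
{%- set src_nk       = "ORDER_KEY" -%}
{%- set src_ldts     = "LOAD_DATE" -%}
{%- set src_source   = "RECORD_SOURCE" -%}

{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                src_source=src_source, source_model=source_model) }}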

We have been discussing this topic for 3 lessons so far (6 hours).
Anyone interested can try it out.

#datavault #greenplum #dbtvault
Identify Skewed Tables and Redistribute them

Classic issue with MPP and clustered databases

The whole idea in two pictures:

When there's data skew (among cluster nodes), there's a bottleneck, and the total execution time is the execution time of the slowest node.
Amazon Redshift offers its own Advisor to detect skewed tables, but you still have to decide which dist key to choose on your own.
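You can also spot skewed tables yourself via the SVV_TABLE_INFO system view (the threshold below is arbitrary):

-- skew_rows = rows on the fullest slice / rows on the emptiest slice
SELECT "schema", "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
WHERE skew_rows > 4
ORDER BY skew_rows DESC;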

Using dbt allows changing table distribution style easily.
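With dbt-redshift the distribution style is just a model config, so the change is a one-line diff (model and sort key are hypothetical):

{{ config(
    materialized = 'table',
    dist         = 'even',
    sort         = 'created_at'
) }}

select * from {{ ref('stg_orders') }}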

Those who don't use dbt can run:

ALTER TABLE {table_name} ALTER DISTSTYLE EVEN ;

This one is settled.

#mpp #skew #redshift
Andrew @Gunnnn just shared his thoughts on Databricks' Delta Live Tables.

After inspection, I can say it is pretty similar to what we know as Materialized Views.

Recently I've been switching all of Wheely's models in the flatten and staging layers to MVs.
A detailed post is coming.

Briefly, Redshift Materialized Views (see the sketch after this list):

- Pre-compute result sets (that is why they are called 'materialized')
- Can be refreshed automatically as soon as new data in source tables arrives
- Incremental refresh is possible via internal Redshift algorithms
- Refresh is as simple as REFRESH MATERIALIZED VIEW {name};
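A minimal sketch of such an auto-refreshing MV over a raw source table (the column list is illustrative):

CREATE MATERIALIZED VIEW flatten.flt_orders
AUTO REFRESH YES
AS
SELECT
    id,
    status,
    created_at
FROM hevo.wheely_prod_orders;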

Now I can call it a new paradigm, where DAGs of transformations are triggered automatically by the database engine as soon as new data arrives in source tables.

#materializedview
Schema evolution showcase

Data is ever-changing. A couple of examples you will definitely face while dealing with data pipelines:

– adding new attributes, removing old ones
– changing data types: int -> float, resizing text columns
– renaming columns (changing the mapping)

Got a couple of new events with an attribute exceeding the current maximum column length in the database.

The simple approach is to resize the problem column:

ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;

A lot of ELT tools can do this automatically without a Data Engineer's attention.

In my case it's tricky as I am using Materialized Views:

> SQL Error [500310] [0A000]: Amazon Invalid operation: cannot alter type of a column used by a materialized view

As a result, I had to drop the dependent objects, resize the column, and then re-create the objects once again:

DROP MATERIALIZED VIEW flatten.flt_orders CASCADE ;
DROP MATERIALIZED VIEW dbt_test.flt_orders CASCADE ;
DROP MATERIALIZED VIEW ci.flt_orders CASCADE ;

ALTER TABLE hevo.wheely_prod_orders ALTER COLUMN stops TYPE VARCHAR(4096) ;

dbt run -m flt_orders+1 --target prod
dbt run -m flt_orders+1 --target dev
dbt run -m flt_orders+1 --target ci

#pipelines #schemaevolution