[RU] Bad advice for building Analytics (Data Lake / DWH / BI) – goal-setting and results
I'm continuing my series of publications in the “bad advice” format, which aims to draw attention to (not-so-)best practices and approaches in building analytics services, illustrated with real examples and stories.
In this publication you will find:
– Doing tasks without a clear understanding of the goals - Question your customer;
– Ignoring the assessment of expected data characteristics - Assess data expectations;
– Neglecting documentation and explanations of the code - Ensure access and transparency.
#best_practices #dwh
Read on Habr →
⚡️Easily sync Google Sheets with Clickhouse
gSheets is the easiest way to maintain and edit mappings, target KPI values, custom dimensions and events.
It allows business users to edit data in real time and see the results without knowing anything about Big Data, SQL, or columnar databases.
What I love about Clickhouse is its rich data integration capabilities. Here's what you do in 3 minutes:
1. Create gSheet with any structured data.
2. Get a direct link to data export to one of the known formats (CSV, JSON).
3. Create a table with engine URL specifying schema, gSheet link and file format.
CREATE TABLE dbt.margin_rate (
    `date_month` Date
    , `margin_rate` Float64
)
ENGINE=URL('https://docs.google.com/spreadsheets/d/{key}/gviz/tq?tqx=out:csv&sheet=margin_rate', CSVWithNames)
;

4. Query it in real time, in a sub-second manner!

SELECT * FROM dbt.margin_rate LIMIT 200;
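A typical use for such a sheet-backed table is joining it onto warehouse data. Here is a hypothetical sketch: dbt.monthly_revenue and its columns are made up purely for illustration, only dbt.margin_rate comes from the example above.

-- Hypothetical usage: apply the sheet-maintained margin rate to revenue figures.
-- dbt.monthly_revenue is an assumed table; dbt.margin_rate is the URL-engine table above.
SELECT
    r.date_month,
    r.revenue,
    r.revenue * m.margin_rate AS estimated_margin
FROM dbt.monthly_revenue AS r
LEFT JOIN dbt.margin_rate AS m
    ON r.date_month = m.date_month;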
Data Warehouses have never been so interactive.

#dwh #clickhouse #pipelines
[Attachment: test_failure.png]
Hey, everyone!
Today I faced an issue with missing rows in my incremental dbt model.
A little background: I calculate GEO-zones for a large number of events, which is a pretty CPU-intensive operation. And of course I want to do it in an incremental way, hence only processing deltas - new and changed rows.
If you still don't use incremental models, you probably want to take a look at the dbt Labs docs on them.
So, first of all, it was a relationships test that gave me a clue I was missing some rows.
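For reference, a generic relationships test in dbt roughly boils down to a query like the sketch below - rows in one model whose key has no match in the other. The model names here are hypothetical, not the actual project code:

-- Sketch of what a relationships test checks (hypothetical model names):
-- source rows whose request_id has no match in the incremental model.
select src.request_id
from {{ ref('stg_orders') }} as src
left join {{ ref('fct_order_geo_zones') }} as tgt
    on src.request_id = tgt.request_id
where tgt.request_id is null

If this query returns any rows, the test fails.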
The initial incremental clause looked like this:
where true
{% if is_incremental() %}
and orders.__metadata_timestamp >=
(select max(__metadata_timestamp) as high_watermark from {{ this }})
{% endif %}
So I basically took only the rows whose metadata timestamp was newer (later) than the timestamps of the rows I already have in my table.
Then I realized that the metadata timestamp (the exact time when a row was fetched from the source system) does not always increase monotonically: some rows get into different batches and land in the DWH at different times. So I introduced a lookback period of 3 hours to make sure I don't miss anything:
where true
{% if is_incremental() %}
and orders.__metadata_timestamp >=
(select max(__metadata_timestamp) as high_watermark from {{ this }}) - interval '3 hours'
{% endif %}
That gives me more assurance that I will pick up all the missing rows, but on the other hand I will process a significant number of rows multiple times, as well as trigger more UPDATEs and DELETEs on my target table, since the incremental periods overlap.
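Side note: rows that fall into the overlapping window don't pile up as duplicates because the model is keyed. A minimal sketch of such a config, assuming request_id is the unique key - the actual config block isn't shown in this post:

-- Minimal sketch of an incremental model config (assumed, not the original):
-- with a unique_key, dbt deletes matching rows in the target and re-inserts them,
-- so rows reprocessed due to the lookback overlap are updated rather than duplicated.
{{ config(
    materialized='incremental',
    unique_key='request_id'
) }}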
It was all going OK until rare cases of EXTRACT-LOAD errors / pauses / outages showed that even a 3-hour lookback period is sometimes not sufficient.
One can increase the lookback interval to 12, 24 or even 48 hours, which can indeed cover most of the cases, but I decided to rewrite my query in a smarter way using a NOT EXISTS clause:
where true
{% if is_incremental() %}
and not exists (
select 1
from {{ this }}
where orders.request_id = {{ this }}.request_id
and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp
)
{% endif %}
That simply means:
– either take completely new rows (the ‘request_id’ does not exist in {{ this }})
– or take rows whose ‘request_id’ exists in {{ this }} but has a different __metadata_timestamp (the row has been modified)
I thought it was perfect, but Amazon Redshift didn’t think so 😅:
> This type of correlated subquery pattern is not supported due to internal error
So I took another approach – an anti-join:
{% if is_incremental() %}
left join {{ this }}
on orders.request_id = {{ this }}.request_id
and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp
where {{ this }}.request_id is null
{% endif %}

But unfortunately, the EXPLAIN plan revealed that Redshift performs the anti-join + filtering after performing the 2 GEO-spatial joins.
Which effectively means processing the full data set and negates any benefit of incrementality.
In older times I would just use a hint to make the joins run in a specific order and filter rows early; today, however, simply shuffling the join order was good enough!
from {{ ref('stg_orders_tmp') }} as orders
{% if is_incremental() %}
left join {{ this }}
on orders.request_id = {{ this }}.request_id
and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp
{% endif %}
left join {{ ref('stg_zones_tmp') }} as pickup
on ST_Intersects(
ST_Point(orders.pickup_position_lon, orders.pickup_position_lat), pickup.geometry)
left join {{ ref('stg_zones_tmp') }} as dropoff
on ST_Intersects(
ST_Point(orders.dropoff_position_lon, orders.dropoff_position_lat), dropoff.geometry)
{% if is_incremental() %}
where {{ this }}.request_id is null
{% endif %}

So, what do you think of it all?
Have you ever used dbt incremental models?
Let's discuss in the comments section.
#dbt #incremental #subquery
Have you ever built incremental tables in a DWH?
Anonymous Poll
– 29% – Yes, I did (and used dbt `incremental` models)
– 48% – Yes, I did (and didn't use dbt)
– 19% – No, I didn't, but I really need to
– 4% – No, I didn't, and there's no need for me
[RU] A strategy for incremental data mart population: why it's needed, how to implement it, and the pitfalls
So, you are working with Big Data:
— Processing this data takes a significant amount of time (and money 💰)
— Historical data does not change (or should not change) - as a rule, these are facts that have already happened
— If you manage to avoid reprocessing historical data - you save both time and money
#dbt #incremental #dwh
Read on Habr →
🌐 @data_apps | Channel navigation