Data Apps Design
In this blog I publish my findings and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
New post on Habr ⬇️⬇️⬇️
[RU] Bad advice for building Analytics (Data Lake / DWH / BI): goal-setting and results

I am continuing my series of posts in the "bad advice" format, whose goal is to draw attention to (not-so-)best practices and approaches in building analytical services, illustrated with real examples and stories.

In this post you will find:

– Doing tasks without a clear understanding of the goals - Question your customer;
– Ignoring the assessment of expected data characteristics - Assess data expectations;
– Neglecting documentation and code comments - Ensure access and transparency.

#best_practices #dwh

Read on Habr →
⚡️Easily sync Google Sheets with ClickHouse

gSheets is the easiest way to maintain and edit mappings, target KPI values, custom dimensions and events.

It lets business users edit data in real time and see the results without knowing anything about Big Data, SQL, or columnar databases.

What I love about ClickHouse is its rich data integration capabilities. Here's what you do in 3 minutes:

1. Create a gSheet with any structured data.

2. Get a direct link that exports the data in one of the supported formats (CSV, JSON).

3. Create a table with the URL engine, specifying the schema, the gSheet link, and the file format.

CREATE TABLE dbt.margin_rate (
    `date_month` Date
    , `margin_rate` Float64
)
-- the URL engine re-reads the sheet's CSV export on every query; {key} is the spreadsheet id
ENGINE = URL('https://docs.google.com/spreadsheets/d/{key}/gviz/tq?tqx=out:csv&sheet=margin_rate', CSVWithNames)
;

4. Query it in real time with sub-second latency!

SELECT * FROM dbt.margin_rate LIMIT 200 ;

Data Warehouses have never been this interactive.
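To show why this is handy for mappings and KPI targets, here is a minimal sketch of joining the sheet-backed table to a fact table; dbt.orders and its columns are hypothetical, only dbt.margin_rate comes from the example above:

SELECT
    date_month
    , sum(o.revenue)     AS revenue
    -- the target maintained by business users directly in the sheet
    , any(m.margin_rate) AS target_margin_rate
FROM
(
    SELECT toStartOfMonth(order_date) AS date_month, revenue
    FROM dbt.orders
) AS o
LEFT JOIN dbt.margin_rate AS m USING (date_month)
GROUP BY date_month
ORDER BY date_month ;

Every edit in the sheet is picked up on the next query, since the URL engine reads the export link each time.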

#dwh #clickhouse #pipelines
Pitfalls of incremental models in dbt ⬇️⬇️⬇️
[Attached image: test_failure.png]
Hey, everyone!

Today I faced an issue with missing rows in my incremental dbt model.

A little background: I calculate GEO-zones for a large number of events, which is a pretty CPU-intensive operation. Naturally, I want to do it incrementally, i.e. process only the deltas - new and changed rows.

If you don't use incremental models yet, you may want to take a look at the dbt Labs docs on them.
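For readers new to the feature, a minimal incremental model skeleton looks roughly like this; the unique_key and column list are my assumptions based on the snippets below, not a copy of the real model:

{{
    config(
        materialized = 'incremental',
        unique_key = 'request_id'
    )
}}

select
    orders.request_id
    , orders.__metadata_timestamp
    -- ... GEO-zone columns go here
from {{ ref('stg_orders_tmp') }} as orders

{% if is_incremental() %}
-- on incremental runs, process only the delta; the exact filter is what this post is about
where orders.__metadata_timestamp >= (select max(__metadata_timestamp) from {{ this }})
{% endif %}

On the first run (or with --full-refresh) dbt builds the whole table; on later runs only the rows passing the is_incremental() filter are processed and applied to the target by unique_key.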

So, first of all, it was a relationships test that gave me the clue that I was missing some rows.
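(For context: the relationships test is defined in YAML, but it boils down to a query like the sketch below, where orders_geo is a hypothetical stand-in for my incremental model; any returned rows mean missing references.)

-- rough equivalent of the failing check: staged orders that never landed in the model
select orders.request_id
from {{ ref('stg_orders_tmp') }} as orders
left join {{ ref('orders_geo') }} as geo
    on orders.request_id = geo.request_id
where geo.request_id is null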
The initial incremental clause looked like this:

    where true

    {% if is_incremental() %}
    and orders.__metadata_timestamp >=
        (select max(__metadata_timestamp) as high_watermark from {{ this }})
    {% endif %}


So I basically took only the rows whose metadata timestamp was newer (later) than the latest timestamp of the rows I already have in my table.

Then I realized that the metadata timestamp (the exact time when a row was fetched from the source system) does not always increase monotonically: some rows get into different batches and land in the DWH at different times. So I introduced a lookback period of 3 hours to make sure I don't miss anything:

    where true

    {% if is_incremental() %}
    and orders.__metadata_timestamp >=
        (select max(__metadata_timestamp) as high_watermark from {{ this }})
        - interval '3 hours'
    {% endif %}


That gives me more assurance that I will pick up all the missing rows, but on the other hand I will process a significant number of rows multiple times and trigger more UPDATEs and DELETEs on my target table, since the incremental periods overlap.
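(Side note: the lookback window does not have to be hard-coded. A hedged variation using a dbt var - the name lookback_hours is my invention, not part of the original model - keeps it tunable per run:)

    where true

    {% if is_incremental() %}
    -- override at run time, e.g. dbt run --vars '{lookback_hours: 12}'
    and orders.__metadata_timestamp >=
        (select max(__metadata_timestamp) as high_watermark from {{ this }})
        - interval '{{ var("lookback_hours", 3) }} hours'
    {% endif %}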

It was all going OK until rare extract-load errors, pauses, and outages showed that even a 3-hour lookback period is sometimes not enough.
One could increase the lookback interval to 12, 24, or even 48 hours, which would solve most of the cases, but I decided to rewrite my query in a smarter way using a NOT EXISTS clause:

    where true

    {% if is_incremental() %}
    and not exists (
        select 1
        from {{ this }}
        where orders.request_id = {{ this }}.request_id
          and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp
    )
    {% endif %}


That simply means:

– take completely new rows ('request_id' does not exist in {{ this }});
– or take rows whose 'request_id' exists in {{ this }} but with a different __metadata_timestamp (the row has been modified).

I thought it was perfect, but Amazon Redshift didn’t think so 😅:

> This type of correlated subquery pattern is not supported due to internal error
So I took another approach: an anti-join.

{% if is_incremental() %}
left join {{ this }}
    on orders.request_id = {{ this }}.request_id
    and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp

where {{ this }}.request_id is null
{% endif %}


But unfortunately, the EXPLAIN plan revealed that Redshift performs the anti-join and filtering only after performing the two GEO-spatial joins, which means it processes the full data set and negates any incrementality benefits.
In older times I would just use a hint to force a specific join order so that rows get filtered early; today, simply shuffling the join order was good enough:

    from {{ ref('stg_orders_tmp') }} as orders

    {% if is_incremental() %}
    -- anti-join against already-built rows: drop anything processed before
    left join {{ this }}
        on orders.request_id = {{ this }}.request_id
        and orders.__metadata_timestamp = {{ this }}.__metadata_timestamp
    {% endif %}

    left join {{ ref('stg_zones_tmp') }} as pickup
        on ST_Intersects(
            ST_Point(orders.pickup_position_lon, orders.pickup_position_lat), pickup.geometry)

    left join {{ ref('stg_zones_tmp') }} as dropoff
        on ST_Intersects(
            ST_Point(orders.dropoff_position_lon, orders.dropoff_position_lat), dropoff.geometry)

    {% if is_incremental() %}
    -- keep only rows that did not match, i.e. new or changed orders
    where {{ this }}.request_id is null
    {% endif %}

So, what do you think of it all?
Have you ever used dbt incremental models?
Let's discuss in the comments section.

#dbt #incremental #subquery
New post on Habr ⬇️⬇️⬇️
[RU] Incremental data mart loading strategy: necessity, implementation, pitfalls

So, you are working with Big Data:

— Processing this data takes significant time (and money 💰)
— Historical data does not change (or should not change) - as a rule, these are facts that have already happened
— If you manage to avoid reprocessing historical data, you save both time and money

#dbt #incremental #dwh

Read on Habr →

🌐 @data_apps | Channel navigation