Data Apps Design – Telegram
In this blog I publish my findings and opinions on working in Data:

— Data Integration
— Database engines
— Data Modeling
— Business Intelligence
— Semantic Layer
— DataOps and DevOps
— Orchestrating jobs & DAGs
— Business Impact and Value
We are in the middle of migrating from Amazon Redshift DC2 nodes (2nd gen) to RA3 nodes (3rd gen) at Wheely.

What this means for us:
– Almost unlimited disk space (RA3 separates compute and storage)
– Speeding up Data Marts to a 2-hour delay from real time
– Blue/green deployments

I will follow up as soon as we are finished.

A simplified checklist plan is attached.

Any questions are welcome.
Hi! Today, November 18, at 15:00, I invite you to a webinar.

Semi-structured Data in Analytical Warehouses: Nested JSON + Arrays

- Sources of semi-structured data: Events, Webhooks, Logs
- Approaches: JSON functions, special data types, External tables (Lakehouse)
- Performance optimization

We will look at examples in Amazon Redshift and Clickhouse.

Registration link: https://otus.ru/lessons/dwh/#event-1661
The link to the YouTube stream will be published here 5 minutes before the start.
So the Amazon Redshift cluster migration is almost complete.
The new cluster is way more powerful. Now I am looking for ways to fully utilize its resources 😄

I can state that not everything has gone as expected.
The most painful parts turned out to be:

– Migrating an S3 bucket with 1M+ files to a new region (took ~4-5 hours), which was really challenging
– Not losing data events while switching between clusters
– VPC and network issues (connecting from the BI tool)
– Hotfixing several Python UDFs that suddenly stopped working in the new environment

I will publish a detailed reflection on this process in a while.
A nice remark from Dmitry Anoshin @rockyourdata

How can one visualize one's own DWH ER (Entity-Relationship) model?

I would use these two ways (applicable to my DWH @ Wheely):

- DBeaver's ER diagram feature
- Looker's LookML Diagram

Both ways require relationships to be modeled in advance, i.e. by defining FOREIGN KEY / REFERENCES constraints or JOIN conditions.

Can anybody suggest more options?
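
To illustrate why relationships have to be declared first, here is a minimal sketch that reads FOREIGN KEY / REFERENCES constraints from a database catalog and turns them into Graphviz DOT edges. It uses an in-memory SQLite database purely so that it runs anywhere; the tables (users, orders, notes) and the helper names er_edges and to_dot are made up for this example, and this is not how DBeaver or Looker build their diagrams.

```python
import sqlite3


def er_edges(conn):
    """Return (child_table, child_col, parent_table, parent_col) per declared FK."""
    edges = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA foreign_key_list columns:
        # id, seq, table (parent), from (child col), to (parent col), ...
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            edges.append((table, fk[3], fk[2], fk[4]))
    return edges


def to_dot(edges):
    """Render the edges as a Graphviz DOT digraph (text only)."""
    lines = ["digraph er {"]
    for child, child_col, parent, parent_col in edges:
        lines.append(
            f'  "{child}" -> "{parent}" [label="{child_col} = {parent_col}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical schema: only `orders` declares its relationship to `users`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users (id)
    );
    CREATE TABLE notes (id INTEGER PRIMARY KEY, user_id INTEGER);
""")

edges = er_edges(conn)
dot = to_dot(edges)
print(dot)
```

Note that notes.user_id never shows up in the output: without a declared constraint, a catalog-based tool has nothing to draw, even though the column obviously points at users.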
A joy for the eye
[RU] Semi-structured Data in Analytical Warehouses

In recent years there has been a clear trend towards analyzing semi-structured data: all kinds of events, logs, API exports, and replicas of schemaless databases. For the familiar relational model, however, this means adopting a number of new approaches to working with data, which I will try to cover today.

In the publication:

– Advantages of a flexible schema and semi-structured data
– Sources of such data: Events, Logs, API
– Processing approaches: Special Data Types, Functions, Data Lakehouse
– Principles of performance optimization
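
As a rough illustration of what "Nested JSON + Arrays" means for a relational model, here is a minimal Python sketch that flattens nested objects into dot-separated columns and expands an array into one row per element. The event payload and the helper names flatten and explode are invented for this example; in a warehouse the same job is done by JSON functions, special data types or external tables, not by this code.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-separated column names; lists stay as-is."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def explode(row, array_col):
    """Produce one output row per array element.

    An empty or missing array yields no rows at all, so the parent
    record disappears from the result.
    """
    items = row.get(array_col) or []
    base = {k: v for k, v in row.items() if k != array_col}
    out = []
    for index, item in enumerate(items):
        new_row = dict(base)
        new_row[f"{array_col}.index"] = index
        if isinstance(item, dict):
            new_row.update(flatten(item, prefix=f"{array_col}."))
        else:
            new_row[array_col] = item
        out.append(new_row)
    return out


# Hypothetical event, shaped like a typical webhook payload.
event = json.loads("""
{"event": "order_created",
 "user": {"id": 42, "geo": {"city": "London"}},
 "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
""")

flat = flatten(event)
rows = explode(flat, "items")
for r in rows:
    print(r)
```

Each resulting row repeats the parent attributes (event, user.id, user.geo.city) next to one array element, which is the usual trade-off of unnesting: simple queries at the cost of duplicated values.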
Oh yeah, by the way our Tech Stack Viz (no bullshit 😏)
How to access Managed Clickhouse (Yandex.Cloud) from PowerBI

A Managed Clickhouse cluster with a public address is only reachable with SSL enabled, so:

1. Download and install the Yandex.Cloud certificate into Trusted Root Certification Authorities

2. Install Clickhouse ODBC driver

clickhouse-odbc-1.1.10-win64.msi

See more at clickhouse-odbc releases

3. Configure ODBC connection (Windows)

Get Data in PowerBI

4. From ODBC – choose your connection

Voila. By the way, I use a Mac, and to work with PowerBI I have to spin up a Windows VM 😒

#powerbi #bi #clickhouse
Clickhouse destination for Airbyte is coming

Soon they will meet:

– An Open Source pipeline tool with tens of connectors out of the box
– One of the fastest and most feature-rich Analytics Databases

Just imagine you won't need to overpay for black-box connector services, while you integrate all of your data:

– Performance marketing
– CRM
– Event analytics
– Engagement platforms

It isn't going to be that easy, of course.
But still, this is going to revolutionize the solutions I am currently working on.
Has anyone heard of Datafold?

I bet you use git diff regularly to compare code changes.
But how do these code changes affect your actual DWH data?

They offer a tool named Data Diff to compare changes at the Schema, PK, and Column profile levels.
Moreover, they can help you track Column-level lineage and set Metrics Alerts.

Seems to be very handy and useful.
I think I'm going to test it soon.

By the way, it integrates with dbt tightly.
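
To make the idea of comparing data at the Schema, PK and Column profile levels more concrete, here is a toy sketch in plain Python. It is not Datafold's implementation or API: the function names (profile, columns, data_diff), the sample rows, and the choice of profile metrics (row count, nulls, distinct values) are all assumptions made for illustration.

```python
def profile(rows, column):
    """Very small column profile: row count, nulls, distinct non-null values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def columns(rows):
    """Union of all column names seen in the rows."""
    return set().union(*(row.keys() for row in rows))


def data_diff(before, after, pk):
    """Compare two versions of a table on three levels: schema, PK, profiles."""
    cols_before, cols_after = columns(before), columns(after)
    keys_before = {row[pk] for row in before}
    keys_after = {row[pk] for row in after}
    shared = sorted(cols_before & cols_after)
    profiles = {c: (profile(before, c), profile(after, c)) for c in shared}
    return {
        "schema": {
            "added": sorted(cols_after - cols_before),
            "removed": sorted(cols_before - cols_after),
        },
        "pk": {
            "added": sorted(keys_after - keys_before),
            "removed": sorted(keys_before - keys_after),
        },
        # Keep only the columns whose profile actually changed.
        "profile_changes": {c: p for c, p in profiles.items() if p[0] != p[1]},
    }


# Hypothetical "before" and "after" states of the same data mart.
before = [
    {"id": 1, "city": "London", "fare": 10.0},
    {"id": 2, "city": "Paris", "fare": None},
]
after = [
    {"id": 2, "city": "Paris", "fare": 12.5, "currency": "EUR"},
    {"id": 3, "city": "Paris", "fare": 7.0, "currency": "EUR"},
]

report = data_diff(before, after, pk="id")
print(report)
```

On real DWH tables the same comparison would be pushed down into SQL rather than done in memory, but the three levels of the report stay the same.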