A nice remark from Dmitry Anoshin @rockyourdata
How can one visualize their own DWH ER (Entity-Relationship) model?
I would use these two approaches (applicable to my DWH @ Wheely):
- DBeaver's ER Diagram feature
- Looker's LookML Diagram
Both approaches require relationships to be modeled in advance, i.e. defining FOREIGN KEY / REFERENCES constraints or JOIN conditions.
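For example, with a hypothetical pair of tables, an explicit constraint like this is enough for both tools to draw the relationship (note that Redshift and most analytical DWHs treat such constraints as informational only, they are not enforced):

CREATE TABLE dim_customers (
    customer_id BIGINT PRIMARY KEY,
    full_name   VARCHAR(256)
);

CREATE TABLE fct_orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES dim_customers (customer_id),  -- this edge becomes a line on the ER diagram
    amount      NUMERIC(18, 2)
);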
Can anybody suggest more options?
[RU] Semi-Structured Data in Analytical Warehouses
In recent years there has been a clear trend toward analyzing semi-structured data – all kinds of events, logs, API exports, and replicas of schemaless databases. For the familiar relational model, though, this requires adopting a number of new approaches to working with data, which I will try to cover today.
In the post:
– Advantages of a flexible schema and semi-structured data
– Sources of such data: Events, Logs, API
– Processing approaches: Special Data Types, Functions, Data Lakehouse (a tiny sketch below)
– Performance optimization principles
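As a tiny illustration of the "special data types + functions" approach – a Redshift-flavored sketch with the SUPER type; table and field names are invented here:

CREATE TABLE raw_events (
    event_id BIGINT,
    payload  SUPER   -- raw JSON lands here as-is, no upfront schema required
);

SELECT
    event_id,
    payload.customer.id::VARCHAR              AS customer_id,  -- navigate nested fields with dot notation
    payload.properties.amount::DECIMAL(18, 2) AS amount
FROM raw_events
WHERE payload.event_type::VARCHAR = 'order_completed';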
How to access Managed Clickhouse (Yandex.Cloud) from PowerBI
A Managed ClickHouse cluster with a public address is only reachable with SSL enabled, so:
1. Download and install the Yandex.Cloud certificate into Trusted Root Certification Authorities
2. Install the ClickHouse ODBC driver (clickhouse-odbc-1.1.10-win64.msi) – see more at clickhouse-odbc releases
3. Configure the ODBC connection (Windows)
4. Get Data in PowerBI – from ODBC, choose your connection
Voila. By the way, I use a Mac, so to work with PowerBI I have to spin up a Windows VM 😒
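Once the DSN is configured, a quick sanity check from any ODBC client (or a native query in PowerBI) confirms the SSL connection is alive – a couple of trivial ClickHouse queries are enough:

SELECT version();
SELECT name FROM system.databases;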
#powerbi #bi #clickhouse
Clickhouse destination for Airbyte is coming
Soon they will come together:
– An Open Source pipeline tool with dozens of connectors out of the box
– One of the fastest and most feature-rich analytics databases
Just imagine: no need to overpay for black-box connector services while you integrate all of your data:
– Performance marketing
– CRM
– Event analytics
– Engagement platforms
It isn't going to be that easy, of course.
But still, this is going to revolutionize the solutions I am currently working on.
Has anyone heard of Datafold?
I bet you use git diff regularly to compare code changes.
But how do these code changes affect your actual DWH data?
They offer a tool named Data Diff to compare changes at the schema, primary key, and column-profile levels.
Moreover, they can help you track column-level lineage and set metric alerts.
Seems to be very handy and useful.
I think I'm going to test it soon.
By the way, it integrates tightly with dbt.
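For context, the hand-rolled version of such a diff (what you would do without the tool) usually boils down to a couple of queries like the sketch below; the prod/dev schema names and the model are hypothetical:

-- 1. Row counts and key coverage: did the change drop or duplicate primary keys?
SELECT 'prod' AS env, COUNT(*) AS row_cnt, COUNT(DISTINCT order_id) AS pk_cnt
FROM analytics.fct_orders
UNION ALL
SELECT 'dev', COUNT(*), COUNT(DISTINCT order_id)
FROM analytics_dev.fct_orders;

-- 2. Rows present in one build but not the other
SELECT * FROM analytics.fct_orders
EXCEPT
SELECT * FROM analytics_dev.fct_orders;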
Have you ever heard of Operational Analytics?
While your data resides in the DWH, it is passive: it waits for somebody to query it.
Operational Analytics is about making data actionable – not only available through SQL and BI on demand, but actually working on the day-to-day business in customer-facing workflows:
– Ad Networks (Facebook, Google Ads, …)
– E-mail Tools (Hubspot, Mailchimp, …)
– Lifecycle tools (Salesforce, Braze, …)
Reverse-ETL is an approach / class of tools for implementing Operational Analytics.
It enables data to flow out of your DWH into operational systems in a reliable, predictable, fault-tolerant way.
Real-world use-cases:
– Prioritizing leads for Account Managers – most valuable customers first
– Generating custom audiences for advertising campaigns (a SQL-side sketch follows this list)
– Delivering personal incentives and bonuses
– Communicating with customers about to churn
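To make the custom-audiences item concrete: on the DWH side it is usually nothing more than a query/model like the sketch below (table and column names are invented); the reverse-ETL tool then syncs these rows into the ad platform on a schedule.

-- Hypothetical audience: high-value customers with no order in the last 60 days
SELECT
    c.customer_id,
    c.email,
    c.phone
FROM dim_customers AS c
JOIN fct_orders AS o USING (customer_id)
GROUP BY 1, 2, 3
HAVING SUM(o.amount) > 500
   AND MAX(o.order_dt) < CURRENT_DATE - 60;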
#reverse_etl
Do you use reverse-ETL?
Anonymous Poll
– 29% – We use reverse-ETL
– 22% – Not using it yet, but want to
– 7% – We don't need it
– 42% – Never heard of it
There are a number of successful companies that aim to provide reverse-ETL as a service:
– Census
– Hightouch
– Grouparoo
However, there are multiple ways to do it yourself:
1. Writing your own scripts
– Every script adds complexity to your pipelines
– Non-scalable: you have to write new code for each integration
– Requires software engineering skills: not accessible to business users
Could be automated with:
– Python + Airflow DAGs
– Jupyter + cron + any Python libs (pandas, sklearn, …)
2. Using off-the-shelf integrations
– Leverages the built-in integration capabilities of software you already use
– Can be tweaked easily enough
Examples are:
– Looker Action Hub
– Redshift with its Spectrum, Datashares, and UNLOAD to S3 (a minimal example below)
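For the Redshift UNLOAD option, a bare-bones statement looks like the sketch below (the table, bucket, and IAM role are placeholders); the next post shows how it can be wired into dbt:

UNLOAD ('SELECT customer_id, label, value FROM analytics.dim_customer_labels')
TO 's3://my-bucket/exports/customer_labels/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
FORMAT AS PARQUET
ALLOWOVERWRITE;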
#reverse_etl
Let’s examine some integrations I currently use
UNLOADing customer/partner/chauffeur labels to S3 bucket files
Let’s say we want to exchange some data with other teams, be it Backend, Financial Services, or Frontend. You may also treat it as publishing data to whom it may concern, so there could be multiple consumers.
It can be easily automated with dbt + Redshift UNLOAD:
– Create a model with desired attributes and rows
– Materialize it as a view
– Add a dbt post-hook with an UNLOAD script (sketched below)
– Schedule it on a daily/hourly basis
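A sketch of what the model config can look like – the S3 path and IAM role are placeholders, and quoting/Jinja details may differ in your project:

{{ config(
    materialized='view',
    post_hook="UNLOAD ('SELECT * FROM {{ this }}')
               TO 's3://my-bucket/exports/chauffeur_labels/'
               IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
               FORMAT AS CSV HEADER
               ALLOWOVERWRITE"
) }}
-- the model's SELECT (e.g. the unpivot from the next message) goes below this config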
As a result:
– No excessive data is actually stored – materialized='view'
– Export is always up-to-date as it is scheduled with dbt jobs
– UNLOAD is performed only with production runs
– For consumers, either a snapshot or the full history is available
– Multiple file formats supported: CSV, JSON, Parquet
Continued:
Also, pay attention to how the model is built with the dbt_utils.unpivot macro:
{{ dbt_utils.unpivot(
relation=ref('dim_chauffeurs_labels'),
cast_to='varchar',
exclude=['actual_dt', 'chauffeur_id'],
field_name='label',
value_name='value'
) }}
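Roughly speaking, the macro compiles to one SELECT per unpivoted column glued together with UNION ALL – something like the sketch below (the label columns is_active and city are invented here for illustration):

SELECT actual_dt, chauffeur_id, 'is_active' AS label, CAST(is_active AS varchar) AS value
FROM analytics.dim_chauffeurs_labels
UNION ALL
SELECT actual_dt, chauffeur_id, 'city' AS label, CAST(city AS varchar) AS value
FROM analytics.dim_chauffeurs_labels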
The list of attributes unloaded can be expanded or narrowed without actual code changes. The resulting schema is ['actual_dt', 'chauffeur_id', 'label', 'value'].
Forwarded from Igor Kalinovskiy
#announcement
In this talk, Artemy will cover:
- Analyzing development teams' T2M based on GitHub events (Webhook)
- Analytics on MongoDB, a document-oriented database
- Using special functions and data types, and optimizing for speed
Add to calendar 🗓
(Anything can happen and the time may change, although we will do our best to make sure it doesn't! Still, keep an eye on the final schedule of talks on the day of the conference.)
#DevsCamp
#TeamLeadCamp
#TeamLead