Data Engineering / Инженерия данных / Data Engineer / DWH
2.28K subscribers
Data Engineering: ETL / DWH / Data Pipelines based on open-source software.

DWH / SQL
Python / ETL / ELT / dbt / Spark
Apache Airflow

No ads are posted here.
Questions: @iv_shamaev | datatalks.ru
Apache Spark / PySpark Tutorial: Basics In 15 Mins

This video gives an introduction to the Spark ecosystem and world of Big Data, using the Python Programming Language and its PySpark API. We also discuss the idea of parallel and distributed computing, and computing on a cluster of machines.

https://youtu.be/QLQsW8VbTN4
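The core idea the video covers, splitting data into partitions, processing each partition independently (which is what a cluster parallelizes), and then reducing the partial results, can be sketched in plain Python. This is not PySpark's actual API; the function names here are illustrative.

```python
# Plain-Python sketch of the partition -> map -> reduce idea behind Spark.
# Illustrative only: Spark would distribute partitions across a cluster,
# here we just fan them out to a local thread pool.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, n):
    """Split data into n roughly equal chunks (like Spark partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(chunk):
    # Per-partition work: square each element and sum locally.
    return sum(x * x for x in chunk)

data = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_partition, partition(data, 4)))

total = reduce(lambda a, b: a + b, partials)  # final reduce step
print(total)  # sum of squares of 1..100
```

Because each partition is processed independently, the map step scales out to as many workers (or machines) as there are partitions; only the small partial results travel to the reduce step.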
Source: https://www.linkedin.com/posts/timo-dechau_in-our-little-data-world-are-we-naming-things-activity-6925303646817529856-Nu-U/
---
In our little data world, are we naming things too much based on our marketing perspective? And is there serious over-selling going on?

Maybe yes.

Let’s do some examples:

dbt is not a data modeling tool. I see this notion quite often. It is first and foremost a SQL orchestration and testing tool. Of course, I can use it to build and manage a data model. But that requires me to do the thinking, not dbt.
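To make the "SQL orchestration and testing" point concrete: a dbt model is just a SELECT statement in a file, and dbt's job is to build models in dependency order and run tests against them. A minimal sketch (source, file, and column names here are made up for illustration):

```sql
-- models/stg_orders.sql
-- dbt materializes this SELECT as a view/table; the source() call is how
-- dbt learns the dependency graph it orchestrates.
select
    order_id,
    customer_id,
    amount
from {{ source('shop', 'raw_orders') }}
where amount is not null
```

The actual data-model thinking (grain, naming, relationships, which tests matter, e.g. `unique` and `not_null` on `order_id` in a schema.yml) is still on you.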

Snowflake and BigQuery are not data warehouses. Great people like Rogier Werschkull and Chad Sanderson remind us of that. They are analytical databases in the cloud. Of course, you can build a data warehouse with them. But this requires you to come up with a concept and an architecture.

Fivetran and Airbyte are not ELT tools - they extract and load for you, and you are in charge of the transformation. They are basically supermarkets with self-checkout: a great idea, but you have to do more yourself.

Segment and Rudderstack are not really CDPs - Arpit Choudhury has written a great piece about it - they are customer data infrastructure: the collection and identity-stitching layer.

Reverse ETL is just ETL.


Why is this important?

Because often these labels create expectations about the solution that these tools can’t fulfill.

When I set up Snowflake and think that I have a data warehouse now - I create huge expectations in my organization that I can’t fulfill.

Same with dbt - OK, we need a data model, let's use dbt for this. And then you add one SQL file to the next and call it a model.

Tools are tools, just that.
GitHub Actions - Introduction to CI/CD

00:00 - What the course is about
03:50 - GitHub intro
12:35 - Getting started with GitHub Actions
18:20 - Writing the first workflow
29:17 - Automatically testing React
37:57 - What Actions are
48:25 - A more complex workflow (practice)
53:40 - Job dependencies and their ordering
01:00:18 - Context & Events
01:21:19 - Adding a cache
01:28:13 - Matrix
01:35:44 - Artifacts
01:45:25 - Environment & Secrets

https://www.youtube.com/watch?v=e0A2hDObLmg
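A first workflow of the kind written in the course might look roughly like this (the repository layout and npm scripts are assumptions, not taken from the video):

```yaml
# .github/workflows/ci.yml - runs on every push and pull request
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4      # check out the repository
      - uses: actions/setup-node@v4    # install Node.js
        with:
          node-version: 20
          cache: npm                   # dependency caching, as covered later
      - run: npm ci                    # install dependencies
      - run: npm test                  # run the (React) test suite
```

Job ordering is declared with `needs:`, and a `strategy.matrix` block would run the same job across several Node versions, which is what the Matrix chapter covers.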
This software has an interesting monetization model: it is nominally open source, but there are sensible perks you can only get in the paid version (users and roles, plus support).
The very idea of low-code platforms emerging as open source is interesting in itself.
----
Tooljet | Open-source low-code platform to build internal tools

An extensible low-code framework for building business applications. Connect to databases, cloud storage, GraphQL, API endpoints, Airtable, etc., and build apps using a drag-and-drop application builder. Built with JavaScript/TypeScript.

https://www.tooljet.com/
Prescriber-ETL-data-pipeline

An end-to-end ETL data pipeline that leverages PySpark parallel processing to process about 25 million rows of data coming from a SaaS application. Apache Airflow serves as the orchestration tool, various data warehouse technologies store the results, and Apache Superset connects to the DWH to generate BI dashboards for weekly reports.

https://github.com/judeleonard/Prescriber-ETL-data-pipeline
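The repo's actual DAG and Spark jobs are not reproduced here, but the stage structure the description lists (extract -> transform -> load -> report source) can be sketched in plain Python. All names below are illustrative; in the real project these stages run on PySpark and are scheduled as Airflow tasks.

```python
# Illustrative sketch of the pipeline stages, NOT the repo's code.
from collections import defaultdict

def extract(raw_rows):
    """Parse raw SaaS export tuples into dicts (extract step)."""
    fields = ("prescriber_id", "state", "claims")
    return [dict(zip(fields, row)) for row in raw_rows]

def transform(rows):
    """Aggregate claim counts per state - the kind of rollup a weekly
    BI dashboard would read from the DWH."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["state"]] += row["claims"]
    return dict(totals)

def load(aggregates, warehouse):
    """Write aggregates into a dict standing in for a DWH table."""
    warehouse["claims_by_state"] = aggregates
    return warehouse

raw = [("P1", "NY", 10), ("P2", "NY", 5), ("P3", "CA", 7)]
dwh = load(transform(extract(raw)), {})
print(dwh["claims_by_state"])  # {'NY': 15, 'CA': 7}
```

In the Airflow version, each function would be one task, with the orchestrator handling scheduling, retries, and the dependency order between them.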