Data Engineering / Инженерия данных / Data Engineer / DWH – Telegram
Data Engineering: ETL / DWH / data pipelines based on open-source software.

DWH / SQL
Python / ETL / ELT / dbt / Spark
Apache Airflow

I do not publish ads
Questions: @iv_shamaev | datatalks.ru
Nico_Loubser_Software_Engineering_for_Absolute_Beginners_Your_Guide.epub
1.5 MB
Software Engineering for Absolute Beginners - 2021

What You Will Learn
🔹 Explore the concepts that you will encounter in the majority of companies doing software development
🔹 Write readable code that is both neat and well designed
🔹 Build code that is source controlled, containerized, and deployable
🔹 Secure your codebase
🔹 Optimize your workspace
🔥 Awesome Docker Compose samples

These samples provide a starting point for how to integrate different services using a Compose file and to manage their deployment with Docker Compose.

👉 @devops_dataops

https://github.com/docker/awesome-compose
GitHub - martandsingh/ApacheSpark: This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. The course ends with a few case studies.

https://github.com/martandsingh/ApacheSpark
Mara Pipelines

This package contains a lightweight data transformation framework with a focus on transparency and complexity reduction. It has a number of baked-in assumptions/principles:
- Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code.
- PostgreSQL as a data processing engine.
- Extensive web UI. The web browser is the main tool for inspecting, running and debugging pipelines.
- GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows.
- No in-app data processing: command line tools as the main tool for interacting with databases and data.
- Single machine pipeline execution based on Python's multiprocessing. No need for distributed task queues. Easy debugging and output logging.
- Cost based priority queues: nodes with higher cost (based on recorded run times) are run first.
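
Two of the principles above, make-style dependencies and cost-based priority, can be illustrated with a small standard-library sketch. This is not the Mara Pipelines API: the function `run_order`, the node names and the cost values are hypothetical, and the cost here is simply a per-node recorded run time, which may differ from how the package computes it.

```python
import heapq


def run_order(upstream, cost):
    """Return the order in which a single worker would run the nodes.

    upstream: dict mapping node -> set of nodes that must finish first
    cost:     dict mapping node -> recorded run time (hypothetical numbers)
    Among the nodes whose upstreams are all done, the costliest runs first.
    """
    remaining = {node: set(deps) for node, deps in upstream.items()}
    # heapq is a min-heap, so the cost is negated to pop the highest cost first.
    ready = [(-cost.get(n, 0), n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        del remaining[node]
        # Completing a node may unblock its downstream nodes.
        for other, deps in remaining.items():
            if node in deps:
                deps.discard(node)
                if not deps:
                    heapq.heappush(ready, (-cost.get(other, 0), other))
    if remaining:
        raise ValueError("cycle or unknown upstream: " + ", ".join(sorted(remaining)))
    return order


# Hypothetical pipeline: two loads feed a mart, the mart feeds a report.
UPSTREAM = {
    "load_orders": set(),
    "load_customers": set(),
    "build_mart": {"load_orders", "load_customers"},
    "refresh_report": {"build_mart"},
}
COST = {"load_orders": 120, "load_customers": 15, "build_mart": 60, "refresh_report": 5}

ORDER = run_order(UPSTREAM, COST)
```

With these numbers, `load_orders` is scheduled before `load_customers` because its recorded run time is higher, and `build_mart` only becomes eligible once both loads have completed.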

https://github.com/mara/mara-pipelines
Open Source Guides

Open source software is made by people just like you. Learn how to launch and grow your project.

https://opensource.guide/
Инженерия_машинного_обучения_Андрей_Бурков_2022.pdf
14.9 MB
Machine Learning Engineering (Инженерия машинного обучения)

Contains many recommendations and design patterns for building reliable and scalable machine learning solutions.