Amazon EMR now supports Apache Iceberg, a highly performant, concurrent, ACID-compliant table format for data lakes.
Cost Efficiency @ Scale in Big Data File Format
https://eng.uber.com/cost-efficiency-big-data/
I think I missed this one. So now both Athena and EMR work with Iceberg.
Amazon
Announcing Amazon Athena ACID transactions, powered by Apache Iceberg (Preview)
The Ubiquity of the Delta Standalone Project for Delta Lake - The Databricks Blog
https://databricks.com/blog/2022/01/28/the-ubiquity-of-delta-standalone-java-scala-hive-presto-trino-power-bi-and-more.html
I came across Argo in an AWS blog post, in particular Argo Workflows, an orchestration tool like Airflow that you can use if you already have a K8s cluster.
argoproj.github.io
Home
Open source Kubernetes native workflows, events, CI and CD
Kubernetes is probably the only major topic in our field that I have never had a chance to work with, but it seems to be turning into a meta-OS or abstraction layer for major platforms and projects, in data engineering and beyond.
Amazon Redshift announces public preview of Streaming Ingestion for Kinesis Data Streams
https://aws.amazon.com/about-aws/whats-new/2022/02/amazon-redshift-public-preview-streaming-ingestion-kinesis-data-streams/
👍1
This article combines several individually useful techniques. I especially like the idea of indexing S3 files with a cluster of Lambda functions.
Amazon
Doing more with less: Moving from transactional to stateful batch processing | Amazon Web Services
Amazon processes hundreds of millions of financial transactions each day, including accounts receivable, accounts payable, royalties, amortizations, and remittances, from over a hundred different business entities. All of this data is sent to the eCommerce…
I think there are three major platforms I would like to work or play with to get more experience:
1. Google Cloud Platform
2. Databricks (not just Spark)
3. Kubernetes (maybe to run Spark)
Google Cloud
Google Cloud Platform Services Summary
A complete list of services that form a part of Google Cloud.
Every product in the Google Cloud family described in the visual sketchnote format to grasp the capability of the tools quickly and easily.
GitHub
GitHub - priyankavergadia/GCPSketchnote: If you are looking to become a Google Cloud Engineer , then you are at the right place.…
If you are looking to become a Google Cloud Engineer , then you are at the right place. GCPSketchnote is series where I share Google Cloud concepts in quick and easy to learn format. - priyankaverg...
Long read about Apache Hudi internals with good visuals.
hudi.apache.org
Apache Hudi - The Data Lake Platform | Apache Hudi
As early as 2016, we set out a bold, new vision reimagining batch data processing through a new “incremental” data processing stack - alongside the existing batch and streaming stacks.
👍2