A very useful and relevant blog post about data deletion in a data lake. Besides the suggested solution, I would also mention Delta Lake as an alternative. It would also have been great if the author had mentioned cost considerations.
https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/
Amazon
How to delete user data in an AWS data lake | Amazon Web Services
General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure”…
Amazing statistics about data.
https://www.datanami.com/2020/09/04/10-big-data-statistics-that-will-blow-your-mind/?utm_source=rss&utm_medium=rss&utm_campaign=10-big-data-statistics-that-will-blow-your-mind
Datanami
10 Big Data Statistics That Will Blow Your Mind
They call it “big data” for a reason--it's really, really big. But getting your head wrapped around the growth of information digitization is not easy.
20x improvement compared to #Spark 2.4
https://techcommunity.microsoft.com/t5/azure-databricks/turbocharge-azure-databricks-with-photon-powered-delta-engine/ba-p/1694929
TECHCOMMUNITY.MICROSOFT.COM
Turbocharge Azure Databricks with Photon powered Delta Engine
Today we are excited to announce the preview of Photon powered Delta Engine on Azure Databricks – fast, easy, and collaborative Analytics and AI service. Built from scratch in C++ and fully compatible with Spark APIs, Photon is a vectorized query engine that…
Most of the subscribers know why I've paused posting in the channel. I think most of you are busy now with other important issues. So I would like to create a poll to ask you whether you would like to see new posts or not yet. Thank you for understanding.
Anonymous Poll
63%
Yes
37%
Not yet
#AWS released an open-source Python connector for Redshift that supports the Python Database API Specification v2.0 (DB-API 2.0). By the way, the Redshift Data API, a separate service, was also announced recently.
https://github.com/aws/amazon-redshift-python-driver
GitHub
GitHub - aws/amazon-redshift-python-driver: Redshift Python Connector. It supports Python Database API Specification v2.0.
Redshift Python Connector. It supports Python Database API Specification v2.0. - aws/amazon-redshift-python-driver
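Since the connector follows DB-API 2.0, code written against the spec's `cursor()` / `execute()` / `fetchall()` calls should work with it largely unchanged. Below is a minimal sketch of that idea. It uses the standard-library `sqlite3` module as a stand-in so that it runs without a cluster; the `fetch_rows` and `demo` helpers and the `events` table are hypothetical names for illustration only. With Redshift, the connection would instead come from the driver's own `connect` call (check the project README for the exact arguments), and note that parameter placeholder styles differ between drivers, so the sketch avoids query parameters.

```python
import sqlite3
from typing import Any, List, Tuple


def fetch_rows(conn: Any, sql: str) -> List[Tuple]:
    """Run a query through any DB-API 2.0 connection and return all rows.

    Hypothetical helper: it relies only on cursor(), execute(), fetchall()
    and close(), which the DB-API 2.0 specification defines.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # Drivers may return rows as tuples or lists; normalize to tuples.
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


def demo() -> List[Tuple]:
    # sqlite3 is only a stand-in here (assumption for a runnable example).
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE events (id INTEGER, name TEXT)")
        cur.execute("INSERT INTO events VALUES (1, 'signup'), (2, 'purchase')")
        conn.commit()
        cur.close()
        return fetch_rows(conn, "SELECT id, name FROM events ORDER BY id")
    finally:
        conn.close()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Keeping the query logic behind a plain DB-API connection object means the same helper can be exercised locally and then pointed at a warehouse connection later.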
It seems that #AWS is improving #Redshift on a weekly basis. Here is another cool feature.
https://aws.amazon.com/about-aws/whats-new/2020/11/amazon-redshift-announces-automatic-refresh-and-query-rewrite-for-materialized-views/
Amazon Web Services, Inc.
Amazon Redshift announces automatic refresh and query rewrite for materialized views
A comparison of data version control tools.
https://dagshub.com/blog/data-version-control-tools/
DagsHub Blog
Comparing Data Version Control Tools - 2020
Data versioning is one of the keys to automating a team's machine learning model development. While it can be very complicated if your team attempts to develop its own system to manage the process, this doesn’t need to be the case.
A short series of articles from Lyft about the gevent #Python library.
https://eng.lyft.com/what-the-heck-is-gevent-4e87db98a8
https://eng.lyft.com/gevent-part-2-correctness-22e3b7998382
https://eng.lyft.com/gevent-part-3-performance-e64303fa102b
https://eng.lyft.com/applying-gevent-learnings-to-deliver-value-to-users-part-4-of-4-36ad932deea8
Medium
What the heck is gevent?
Overview
Introduction to Apache Pinot, a real-time distributed OLAP datastore from LinkedIn and Uber
https://docs.pinot.apache.org/
docs.pinot.apache.org
Introduction | Apache Pinot Docs
Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.
Some important updates from #AWS:
✅ Amazon Kinesis Data Streams enables data stream retention up to one year.
✅ Now you can export your Amazon DynamoDB table data to your data lake in Amazon S3 to perform analytics at any scale.
✅ Amazon Redshift now supports modifying column compression encodings to optimize storage utilization and query performance
✅ Amazon Athena announces availability of engine version 2
Amazon
Amazon Kinesis Data Streams enables data stream retention up to one year
➡️ Discover the new syntax for implicits in #Scala 3.
➡️ Learn how to express extension methods, implicit parameters, implicit conversions, and typeclasses in #Scala 3!
https://t.co/BYFnTVc3yh
www.scala-lang.org
Explicit term inference with Scala 3
#AWS updates:
✅ Amazon EMR now provides up to 35% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances
✅ AWS Glue Streaming ETL jobs support reading records in the Apache Avro format
✅ Control the evolution of data streams using the AWS Glue Schema Registry
Amazon
Amazon EMR now provides up to 35% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances
Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.
https://github.com/donnemartin/system-design-primer
GitHub
GitHub - donnemartin/system-design-primer: Learn how to design large-scale systems. Prep for the system design interview. Includes…
Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards. - donnemartin/system-design-primer