Engineer Readings – Telegram
[article]

Many organizations are shifting to a data management paradigm called the “Lakehouse,” which implements the functionality of struc- tured data warehouses on top of unstructured data lakes. This presents new challenges for query execution engines. The execu- tion engine needs to provide good performance on the raw un- curated datasets that are ubiquitous in data lakes, and excellent performance on structured data stored in popular columnar file formats like Apache Parquet. Toward these goals, we present Pho- ton, a vectorized query engine for Lakehouse environments that we developed at Databricks. Photon can outperform existing cloud data warehouses in SQL workloads, but implements a more general exe- cution framework that enables efficient processing of raw data and also enables Photon to support the Apache Spark API. We discuss the design choices we made in Photon (e.g., vectorization vs. code generation) and describe its integration with our existing SQL and Apache Spark runtimes, its task model, and its memory manager. Photon has accelerated some customer workloads by over 10× and has recently allowed Databricks to set a new audited performance record for the official 100TB TPC-DS benchmark.

https://www-cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf