With big data technologies maturing, big data processing is becoming the norm. The data lake has already been adopted by many companies processing big data, and cloud vendors support the common big data processing engines, e.g. Hadoop and Spark. Recently there has been a trend of cloud vendors supporting the transactional data lake.
https://databricks.com/product/delta-lake-on-databricks
But what is a transactional data lake? Why do we need it?
The transaction is a basic concept in relational databases and MPP (massively parallel processing) databases: when you issue an UPDATE command, the changes either succeed as a unit or fail as a unit. This is usually implemented with some form of transaction log.
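To make the all-or-nothing behavior concrete, here is a toy sketch using Python's built-in sqlite3 module; the table and data are made up. If any step inside the transaction fails, the earlier UPDATE is rolled back too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending'), (2, 'pending')")
conn.commit()

try:
    # `with conn` opens a transaction: commit on success, rollback on error.
    with conn:
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The UPDATE was rolled back together with the failed step:
print(conn.execute("SELECT status FROM orders WHERE id = 1").fetchone())
# -> ('pending',)
```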
In the modern data platform, these technologies usually serve as the data warehouse. Another important feature is time travel (temporal capability): you can go back in time and fetch the exact state of the table at that point in time (usually by replaying the transaction log). This matters for machine learning built on data warehouse data. Imagine you want to reproduce a model: you can keep your code and configuration in a VCS and the results in MLflow, but storing every snapshot of the data used to train the model would require far too much storage. With time travel, you only need to store the point in time (see the sketch below).
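As an illustration, this is roughly what time travel looks like in Delta Lake, one of the transactional formats discussed later. A minimal sketch: the table path, version, and timestamp here are hypothetical, and it assumes the delta-spark package is installed.

```python
from pyspark.sql import SparkSession

# Delta-enabled Spark session (config keys are for Delta Lake 1.x+).
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/datalake/training/features"  # hypothetical existing Delta table

# Read the table exactly as it was at a given version or wall-clock time.
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)
df_then = (spark.read.format("delta")
           .option("timestampAsOf", "2021-06-01 00:00:00")
           .load(path))
```

To reproduce a training run, it is then enough to record the version number or timestamp alongside the code and configuration.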
However, the data warehouse is only part of the modern data platform [1, 2, 3, 4, 5]. The data warehouse is typically used for dashboarding and ad-hoc analysis, and stores processed, structured data. The bulk of the raw data lives in the data lake, from which a series of processing steps curates it into analytical data sets. Traditionally the data lake is built on HDFS and the recommended way of adding data is append-only. That does not cover every use case, because sometimes you need to update data, e.g. order information or customer records (see the upsert sketch below).
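For example, a transactional format such as Delta Lake lets you upsert directly into a data lake table instead of only appending. A minimal sketch, assuming a Delta-enabled SparkSession as in the earlier snippet; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # Delta configs as shown earlier

orders = DeltaTable.forPath(spark, "/datalake/orders")
updates = spark.read.parquet("/landing/order_updates")

# Upsert: update matching orders in place, insert the new ones.
(orders.alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```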
Hive introduced ACID capability to support such updates.
However, Hive is tied to HDFS, while many companies use cloud object storage as their data lake, e.g. S3, Google Cloud Storage, or Azure Data Lake Storage. An abstraction layer on top of the underlying storage is therefore beneficial, and this is exactly what the transactional data lake storage formats provide.
https://docs.delta.io/latest/delta-storage.html
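For instance, with Delta Lake the same table API works against S3 once the storage layer is configured. A sketch based on the delta-storage doc linked above; the exact LogStore keys and classes vary by Delta Lake release, the bucket name is hypothetical, and Hadoop's S3A connector plus AWS credentials are assumed to be set up:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         # LogStore suitable for single-cluster writes to S3; see the
         # delta-storage doc for multi-cluster options.
         .config("spark.delta.logStore.s3a.impl",
                 "io.delta.storage.S3SingleDriverLogStore")
         .getOrCreate())

df = spark.read.format("delta").load("s3a://my-bucket/datalake/orders")
```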
Now that we know why we need a transactional data lake, here are some articles about the available transactional data lake storage formats.