With big data technologies maturing, big data processing is becoming the norm. The data lake has already been adopted by many companies processing big data, and cloud vendors support the common big data processing engines, e.g. Hadoop and Spark. Recently there has been a trend of cloud vendors supporting the transactional data lake.
https://databricks.com/product/delta-lake-on-databricks
But what is a transactional data lake? Why do we need it?
The transaction is a basic concept in relational databases and MPP (massively parallel processing) databases: when you issue an UPDATE command, the changes either succeed as a unit or fail as a unit. This is usually implemented with some form of transaction log.
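To make the all-or-nothing behavior concrete, here is a toy sketch using Python's built-in sqlite3 module; the table and data are made up. If any step inside the transaction fails, the earlier UPDATE is rolled back too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending'), (2, 'pending')")
conn.commit()

try:
    # `with conn` opens a transaction: commit on success, rollback on error.
    with conn:
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The UPDATE was rolled back together with the failed step:
print(conn.execute("SELECT status FROM orders WHERE id = 1").fetchone())
# -> ('pending',)
```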
In the modern data platform, these technologies usually serve as the data warehouse. Another important feature is time travel (temporal capability): you can go back in time and fetch the exact state of the table at that point in time (usually by replaying the transaction log). This matters for machine learning built on data warehouse data. Imagine you want to reproduce a model: you can keep your code and configuration in a VCS and the results in MLflow, but storing every snapshot of the data used to train the model would require far too much storage. With time travel, you only need to store the point in time (see the sketch below).
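As an illustration, this is roughly what time travel looks like in Delta Lake, one of the transactional formats discussed later. A minimal sketch: the table path, version, and timestamp here are hypothetical, and it assumes the delta-spark package is installed.

```python
from pyspark.sql import SparkSession

# Delta-enabled Spark session (config keys are for Delta Lake 1.x+).
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/datalake/training/features"  # hypothetical existing Delta table

# Read the table exactly as it was at a given version or wall-clock time.
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)
df_then = (spark.read.format("delta")
           .option("timestampAsOf", "2021-06-01 00:00:00")
           .load(path))
```

To reproduce a training run, it is then enough to record the version number or timestamp alongside the code and configuration.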
However, the data warehouse is only part of the modern data platform [1, 2, 3, 4, 5]. The data warehouse is typically used for dashboarding and ad-hoc analysis, and stores processed, structured data. The bulk of the raw data lives in the data lake, from which a series of processing steps curates it into analytical data sets. Traditionally the data lake is built on HDFS and the recommended way of adding data is append-only. That does not cover every use case, because sometimes you need to update data, e.g. order information or customer records (see the upsert sketch below).
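For example, a transactional format such as Delta Lake lets you upsert directly into a data lake table instead of only appending. A minimal sketch, assuming a Delta-enabled SparkSession as in the earlier snippet; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # Delta configs as shown earlier

orders = DeltaTable.forPath(spark, "/datalake/orders")
updates = spark.read.parquet("/landing/order_updates")

# Upsert: update matching orders in place, insert the new ones.
(orders.alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```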
Hive introduced ACID capability to support such updates.
However, Hive is tied to HDFS, while many companies use cloud object storage as their data lake, e.g. S3, Google Cloud Storage, or Azure Data Lake Storage. An abstraction layer on top of the underlying storage is therefore beneficial, and this is exactly what the transactional data lake storage formats provide.
https://docs.delta.io/latest/delta-storage.html
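For instance, with Delta Lake the same table API works against S3 once the storage layer is configured. A sketch based on the delta-storage doc linked above; the exact LogStore keys and classes vary by Delta Lake release, the bucket name is hypothetical, and Hadoop's S3A connector plus AWS credentials are assumed to be set up:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         # LogStore suitable for single-cluster writes to S3; see the
         # delta-storage doc for multi-cluster options.
         .config("spark.delta.logStore.s3a.impl",
                 "io.delta.storage.S3SingleDriverLogStore")
         .getOrCreate())

df = spark.read.format("delta").load("s3a://my-bucket/datalake/orders")
```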
Now that we know why we need a transactional data lake, here are some articles about the available transactional data lake storage formats.