Data warehouse, data lake, data lakehouse, data fabric, data mesh

Xin Cheng
Oct 6, 2022

These five terms come up often in discussions of the data stack. The first two (data warehouse and data lake) are mostly about data storage, data lakehouse and data fabric are more about data infrastructure, and data mesh is about building data products in a decentralized way.

The data warehouse came first historically, when the main use cases were business intelligence, reporting, and visualization. It provides a “source of truth” for decision-makers.

Disadvantages of the data warehouse: (1) lack of data flexibility: it performs well with structured data but struggles with semi-structured and unstructured formats (images, audio, video), and it does not support machine learning use cases well; (2) cost: it cannot store raw data cheaply, even though keeping raw data could enable future use cases.

The data lake solves these problems of the data warehouse: it stores structured and unstructured data inexpensively and enables machine learning use cases. Its disadvantages are poor performance for business intelligence and reporting use cases, and a lack of data reliability.

The traditional approach is therefore to store raw data in a data lake, let machine learning workloads read it there directly, and process and store curated data in a data warehouse. As a result, you need to maintain two systems that are built on different technologies.
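The two-tier pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: a temporary directory stands in for the data lake, an in-memory SQLite database stands in for the data warehouse, and the `orders` events, field names, and function names are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events; note that their shapes differ from record to record.
RAW_EVENTS = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "note": "gift wrap"},
    {"order_id": 2, "region": "US", "amount": 80.5},
    {"order_id": 3, "region": "EU", "amount": 42.0, "tags": ["promo"]},
]


def land_in_lake(lake_dir, events):
    """Lake tier: store events as-is, one JSON document per line."""
    path = Path(lake_dir) / "orders.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path


def load_into_warehouse(raw_path, conn):
    """Warehouse tier: keep only the agreed columns under a fixed schema."""
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    with raw_path.open(encoding="utf-8") as f:
        rows = [(e["order_id"], e["region"], e["amount"]) for e in map(json.loads, f)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_region(conn):
    """A typical BI query, served from the curated warehouse table."""
    return dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))


def demo():
    with tempfile.TemporaryDirectory() as lake_dir:
        raw_path = land_in_lake(lake_dir, RAW_EVENTS)
        warehouse = sqlite3.connect(":memory:")
        load_into_warehouse(raw_path, warehouse)
        return revenue_by_region(warehouse)


print(demo())
```

Even in this toy version there are two storage formats and a copy step between them to keep in sync, which is the maintenance burden the paragraph above points at.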

The data lakehouse tries to use open source technologies to bring data warehouse performance and consistency (metadata, governance) to the data lake, at a lower cost than a data warehouse. For a comparison, refer to the article above and this one.

Data mesh is a paradigm, while lakehouse is a platform. Data mesh focuses on scaling data ownership. Traditionally, a central data platform team owns both the data infrastructure and the building of data pipelines. However, that team is usually neither the data owner nor the domain expert, so it has to work with the domain teams and can become a bottleneck. Data mesh gives data pipeline ownership back to the domain teams, while the central data platform team focuses on providing data infrastructure and a data pipeline framework that the domain teams can use to build high-quality pipelines. The data infrastructure can be based on a data lakehouse platform. This is why the article says that data mesh solves the scalability of data pipeline development, while data lakehouse (or data fabric) solves the scalability of use cases.

Appendix

Data workloads

Transactional, analytical, translytical (HTAP)
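As a rough illustration of the difference between these workloads, the sketch below runs a transactional operation and an analytical query against the same small table. SQLite is used only as a convenient stand-in (it is not an HTAP engine), and the `accounts` table and function names are hypothetical.

```python
import sqlite3


def make_bank():
    """Create a tiny in-memory table with hypothetical account data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "north", 100.0), (2, "north", 50.0), (3, "south", 75.0)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Transactional workload: a short read-write touching a few rows atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


def balance_by_branch(conn):
    """Analytical workload: a read-only scan and aggregate over many rows."""
    return dict(
        conn.execute("SELECT branch, SUM(balance) FROM accounts GROUP BY branch")
    )


bank = make_bank()
transfer(bank, 1, 3, 25.0)
print(balance_by_branch(bank))
```

A translytical (HTAP) system aims to serve both kinds of access from one store, so that the aggregate reflects the transfer immediately, without a separate copy step into an analytical system.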

History of data lake, data warehouse, data fabric, data mesh

http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

https://www.oracle.com/a/ocom/docs/datamesh-ebook.pdf

PlainConcepts

MAY 5, 2022

What Is A Data Mesh Organizational Architecture?

Data Mesh is an increasingly popular concept among data platform specialists. Technological innovations and the popularization of Big Data in companies lead to new paradigms for data decentralization and consumption. In this sense, the Data Mesh organizational approach can help corporations looking to organize data teams.

What Is Data Mesh

Data Mesh is a technical and organizational architecture approach aimed at the decentralization and large-scale management of an organization’s analytical data.

Why is Data Mesh Being Adopted

Blanca Mayayo is the Product Owner of Sidra Data Platform at Plain Concepts and has previously worked as an engineer and product leader at companies such as Adidas, Nestlé, and Telefónica. In her opinion, several trends are leading companies to take an interest in a new way of managing data:

  • Companies want to differentiate themselves and provide value thanks to the data they possess.
  • They want all areas of the company to take advantage of it, in an effective and efficient way.
  • At the same time, data governance and data sovereignty aspects are becoming increasingly relevant.

Problems that Data Mesh can solve

Data Mesh can address several data management problems that companies face, such as:

  • Lack of clear ownership or responsibility for the data. For example, in centralized data warehouses or data lakes, technical managers do not have the specialized business knowledge to take advantage of and optimize the data.
  • Lack of data metrics, which translates into distrust of the data when drawing conclusions or making decisions.
  • Difficulty in bringing engineering expertise to the rest of the organization. As a single team manages the centralized platform, this can lead to bottlenecks or friction between teams.

If these problems persist in the medium and long term, the situation leads to low use of data and difficulty in innovating or adding value.

Data Mesh Principles

The Data Mesh is built around four principles:

  1. Domain-oriented ownership
  2. Data as a product
  3. Self-service data platform or infrastructure
  4. Federated governance

The third and fourth principles are more technological approaches.

Domain-oriented ownership

A ‘domain’ is a department, section, or area of the company. Under the principle of domain-oriented ownership in Data Mesh, responsibility for data moves beyond the centralized data platform team to the teams where the data is generated (for example, the commercial area where customer information is ‘born’), which are also the teams best placed to extract broad, high-quality value from it.

Data as a product

The principle of data as a product in Data Mesh means treating data as a consumable product within the business.

These data products have input and output ports:

  • Input ports: Data-producing sources.
  • Output ports: Interfaces that expose the data so that other parts of the company or end users can consume it.

Beyond that, the products have to be easy to use and come with metrics and metadata. They are also offered as packages that include not only the data and metadata, but also the code and infrastructure with which they were produced.

DATSIS Principles

Within the data mesh, these products are governed by the DATSIS principles:

  • Discoverable. The product has to be easily found through some tool, such as a data catalog.
  • Addressable. Access to the product must follow generic or global conventions.
  • Trustworthy. To be trusted, the product must meet quality and service standards.
  • Secure. Effective granular access policies for the data must be defined.
  • Interoperable. Ideally, products should follow open standards, and it should be possible to search for and find the data through multiple interfaces.
  • Self-describing. The package must declare its input and output ports and include a product schema and up-to-date documentation.
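One way to make the DATSIS list concrete is to treat it as a checklist over a data product descriptor. The sketch below is a minimal illustration, not a standard: the `DataProduct` class, its field names, and the example values (catalog URL, address, policy name) are all hypothetical, and each check is reduced to "is this piece of metadata present".

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; every field name here is an assumption."""
    name: str
    catalog_entry: str = ""   # Discoverable: where the product is listed
    address: str = ""         # Addressable: a stable, convention-based location
    quality_slo: str = ""     # Trustworthy: stated quality/service standard
    access_policy: str = ""   # Secure: granular access policy identifier
    output_formats: list = field(default_factory=list)  # Interoperable
    input_ports: list = field(default_factory=list)     # Self-describing
    output_ports: list = field(default_factory=list)    # Self-describing
    schema: dict = field(default_factory=dict)          # Self-describing
    documentation: str = ""                             # Self-describing


def datsis_gaps(product):
    """Return the DATSIS principles for which the descriptor has no metadata."""
    checks = {
        "Discoverable": bool(product.catalog_entry),
        "Addressable": bool(product.address),
        "Trustworthy": bool(product.quality_slo),
        "Secure": bool(product.access_policy),
        "Interoperable": bool(product.output_formats),
        "Self-describing": bool(
            product.input_ports
            and product.output_ports
            and product.schema
            and product.documentation
        ),
    }
    return [principle for principle, ok in checks.items() if not ok]


orders = DataProduct(
    name="orders",
    catalog_entry="catalog://sales/orders",
    address="sales.orders.v1",
    quality_slo="refreshed daily, under 1% null order_id",
    access_policy="sales-analysts-read",
    output_formats=["parquet", "sql"],
    input_ports=["crm_export"],
    output_ports=["orders_table"],
    schema={"order_id": "int", "amount": "float"},
    documentation="Orders placed through the web shop.",
)
draft = DataProduct(name="returns", catalog_entry="catalog://sales/returns")

print(datsis_gaps(orders))
print(datsis_gaps(draft))
```

A presence check like this only catches missing metadata. Whether a product is actually trustworthy or secure still depends on the quality standards and access policies behind those fields.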

https://pages.matillion.com/rs/992-UIW-731/images/Ebook-Guide-to-the-Lakehouse.pdf

https://www.databricks.com/wp-content/uploads/2020/10/The-Modern-Cloud-Data-Platform-For-Dummies-Databricks-Special-Edition.pdf

https://www.databricks.com/wp-content/uploads/2021/10/Big-Book-of-Data-Engineering-Final.pdf

Data platform

Maturity assessment

http://www.cs.uu.nl/research/techreps/repo/CS-2010/2010-021.pdf

Requirement gathering and effort estimation

https://openproceedings.org/2015/conf/edbt/paper-295.pdf

Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified