Data Lakehouse Best Practices and Latest Trends

Articles that are interesting to read

Xin Cheng
9 min read · Aug 28, 2024

Previously I discussed data lake best practices. There is a trend toward unifying the data lake and the data warehouse in a single platform: the lakehouse. I have put together some best-practice resources.

Lakehouse

Data lakehouse guiding principles

  1. use a layered (or multi-hop) architecture to increase trust in the final data-as-a-product
  2. prevent data silos by leveraging data sharing to reduce data movement
  3. enable data value creation through self-service
  4. govern through data quality, data catalog, access control, audit, and data lineage
  5. stay open (interfaces, formats)
  6. optimize for performance and cost

The Well-Architected Framework was introduced by AWS for architecting cloud solutions; Microsoft and Google have since adopted similar frameworks. It originally had five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization
  • Sustainability (added Dec 2021)

Lakehouse-specific pillars

  • Data governance
  • Interoperability and usability

Data governance

Unify data and AI governance across management (data, feature store, and model assets; metadata management; discovery; lineage; audit), security, and quality (data quality, model test results). In Databricks this centers on Unity Catalog, with cloud-specific integrations where applicable (e.g. Microsoft Purview).

Interoperability and usability

Use open interfaces (REST APIs), open data formats and sharing protocols (Delta Lake, Delta Sharing), and open standards for ML lifecycle management (MLflow).
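As a concrete illustration of the open-standards point, here is a minimal sketch that tracks and registers a model with MLflow. It assumes mlflow and scikit-learn are installed; the toy model, metric, and the registered name "churn_model" are placeholders of my own, not from the referenced docs.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model and track it with MLflow, the open standard for the ML lifecycle.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Registering the model decouples its lifecycle from the training code.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_model")
```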

Operational excellence

  • Use enterprise source code management (SCM) (Databricks Git folders), DataOps/MLOps, and an environment isolation/catalog strategy
  • Use infrastructure as code (IaC) for deployments and maintenance (serverless compute, predefined compute templates, compute policies) and automated workflows for jobs (Databricks Jobs, Delta Live Tables, Auto Loader)
  • Use a model registry to decouple the code and model lifecycles
  • Use declarative management for complex data and ML projects (Databricks Asset Bundles)
  • Manage service limits and quotas (Databricks-native quota management, e.g. for pipelines and jobs, on top of the cloud platform's quotas)
  • Set up monitoring, alerting, and logging (cloud platform monitoring, Databricks Lakehouse Monitoring, SQL warehouse monitoring, Databricks SQL alerts, Auto Loader monitoring, job monitoring, Delta Live Tables monitoring, streaming monitoring, ML and AI monitoring)
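For the monitoring item in the list above, here is a small sketch using the Databricks SDK for Python to list jobs and their latest run state. It assumes the databricks-sdk package is installed and workspace authentication is configured (e.g. DATABRICKS_HOST/DATABRICKS_TOKEN); it is my own example, not code from the linked docs.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs and surface the latest run state of each, which is the kind of
# job monitoring the operational-excellence pillar calls for.
for job in w.jobs.list():
    runs = list(w.jobs.list_runs(job_id=job.job_id, limit=1))
    state = runs[0].state.life_cycle_state if runs else "NO RUNS"
    print(job.settings.name, state)
```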

Security, compliance, and privacy

Identity management and least privilege; protect data in transit and at rest; secure the network, and identify and protect endpoints; monitor security.
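Least privilege in the lakehouse mostly comes down to Unity Catalog grants. A minimal sketch, assuming a Databricks notebook (predefined spark) and placeholder catalog, schema, and group names:

```python
# Analysts get read-only access to the gold layer; engineers can also write to silver.
grants = [
    "GRANT USE CATALOG ON CATALOG main TO `analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.gold TO `analysts`",
    "GRANT SELECT ON SCHEMA main.gold TO `analysts`",
    "GRANT SELECT, MODIFY ON SCHEMA main.silver TO `data_engineers`",
]
for stmt in grants:
    spark.sql(stmt)
```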

Reliability

  • Design for failure: Delta Lake for ACID transactions, Apache Spark as a resilient distributed data engine, automatic retry policies in Databricks Jobs, failure recovery with Delta Live Tables, and managed services (serverless SQL warehouses, model serving, serverless jobs, serverless compute for notebooks, Delta Live Tables)
  • Manage data quality: Delta Lake supports schema validation and schema enforcement, and Delta tables support constraints and data expectations
  • Design for autoscaling: Delta Live Tables, SQL warehouses
  • Plan disaster recovery: relies on the specific cloud's DR capabilities
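For the data-quality point above, a small sketch of Delta constraints; the table and column names are placeholders and the snippet assumes a Databricks notebook with a predefined spark session.

```python
# Create a Delta table with a NOT NULL column.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.silver.orders (
    order_id BIGINT NOT NULL,
    amount   DOUBLE,
    order_ts TIMESTAMP
  ) USING DELTA
""")

# CHECK constraints make bad writes fail instead of silently landing.
spark.sql("""
  ALTER TABLE main.silver.orders
  ADD CONSTRAINT positive_amount CHECK (amount >= 0)
""")

# Delta also enforces the schema: an INSERT with an unexpected column or an
# incompatible type is rejected unless schema evolution is explicitly enabled.
```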

Performance efficiency

  • Prefer larger clusters
  • Use native Spark operations rather than UDFs where possible
  • Use Photon
  • Use the disk cache (formerly known as "Delta cache") and avoid Spark caching; use the query result cache
  • Use Delta Lake compaction and data skipping (Z-ordering, liquid clustering) and avoid over-partitioning
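A sketch of the layout-related items (compaction, Z-ordering, liquid clustering); the table and column names are placeholders and the snippet assumes a Databricks notebook with a predefined spark session.

```python
# File compaction plus Z-ordering for data skipping on an existing table.
spark.sql("OPTIMIZE main.silver.orders ZORDER BY (order_ts)")

# Newer alternative: liquid clustering declared at table creation time,
# which replaces both partitioning and Z-ordering for that table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.silver.events (
    event_id BIGINT,
    event_ts TIMESTAMP,
    country  STRING
  ) CLUSTER BY (event_ts, country)
""")
```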

Cost optimization

  • Use serverless compute (SQL warehouses, Mosaic AI Model Serving)
  • Use the right instance type
  • Use auto-scaling compute and auto termination
  • Use compute policies to control costs
  • Tag clusters for cost attribution
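A hedged sketch of a cluster policy that enforces auto termination, caps autoscaling, and tags clusters for cost attribution, created via the Databricks SDK for Python. The policy name, limits, and tag values are placeholders, and the exact policy-definition fields are worth checking against the docs.

```python
import json
from databricks.sdk import WorkspaceClient

# Cluster-policy definition: each key constrains one cluster attribute.
policy_definition = {
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

w = WorkspaceClient()
w.cluster_policies.create(
    name="cost-guardrails",
    definition=json.dumps(policy_definition),
)
```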

https://docs.databricks.com/en/lakehouse-architecture/reference.html

lakehouse reference architectures

Databricks Lakehouse Platform

The platform supports ingestion (batch, streaming, Auto Loader, COPY INTO), processing (ETL pipelines, data quality checks, automatic recovery, scheduling/orchestration/workflows, observability), and consumption (notebooks, charts, SQL, machine learning).
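For the ingestion piece, a minimal Auto Loader sketch, assuming a Databricks notebook with a predefined spark; all paths and the target table name are placeholders.

```python
# Auto Loader incrementally discovers new files in the landing path.
raw = (spark.readStream
       .format("cloudFiles")                         # Auto Loader source
       .option("cloudFiles.format", "json")          # raw files are JSON here
       .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
       .load("/Volumes/main/landing/orders"))

# Write the stream into a bronze table; availableNow processes what is
# currently available and then stops, so it can run as a scheduled job.
(raw.writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.bronze.orders"))
```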

Delta Live Tables/DLT

DLT is built on the idea that the developer focuses on business logic while the system provides the necessary infrastructure, computation, and workflow. DLT uses declarative programming, which expresses what to compute rather than how; the platform handles fault tolerance, state management, scheduling/dependencies/parallelism, CDC, schema evolution, optimization/partitioning, and operational issues (system regressions, cloud issues). Core abstractions: streaming tables (for ingestion) and materialized views (for transformations; precomputed and stored in Delta).
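A minimal sketch of the two core abstractions in Python; it only runs inside a DLT pipeline, and the paths, table names, and expectation are placeholders of my own.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Streaming table: incremental ingestion from raw files")
def raw_orders():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/orders"))

@dlt.table(comment="Materialized view: cleaned, precomputed transformation")
@dlt.expect_or_drop("valid_amount", "amount >= 0")   # declarative data quality check
def orders_clean():
    # Batch read of the upstream table makes this a materialized view.
    return dlt.read("raw_orders").withColumn("order_date", F.to_date("order_ts"))
```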

Terminology: streaming live table = streaming table, live table = materialized view; DLT in DBSQL

Materialized views precompute results for query performance (they require a Unity Catalog-enabled pro or serverless SQL warehouse and must be refreshed to include new data).

Streaming mode: manual, triggered on schedule, continuous

Streaming tables (and streaming joins) are stateful and do not recompute past data, so picking up logic or source changes requires a full refresh (REFRESH <table> FULL).

DLT supports Slowly Changing Dimension (SCD) types 1 and 2 (type 2 is good for querying long history, which is more efficient than relying on Delta time travel).
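A hedged sketch of SCD type 2 with DLT's APPLY CHANGES API; the source, target, key, and sequencing columns are placeholders, and the snippet only runs inside a DLT pipeline.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="CDC feed of customer changes")
def customers_cdc():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/customers_cdc"))

# Target streaming table that will hold the SCD type 2 history.
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=F.col("change_ts"),   # ordering column for out-of-order events
    stored_as_scd_type=2,             # keep full history instead of overwriting
)
```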

DLT capabilities: Declarative Syntax, Automatic Transformations, Table-based model, Data Quality Checks, Automatic incremental processing, Unified batch and streaming, idempotent logic, reliability

DLT serverless, serverless SQL, now serverless compute

Configuration-driven data pipelines with DLT (using Python decorators and dlt-meta)
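The configuration-driven idea boils down to generating table definitions from config rather than hand-writing them. A minimal sketch in that spirit (not the dlt-meta implementation itself); the config contents are placeholders and it only runs inside a DLT pipeline.

```python
import dlt

# Placeholder config; in practice this would come from dlt-meta or a JSON/YAML file.
SOURCES = {
    "orders":    {"path": "/Volumes/main/landing/orders",    "format": "json"},
    "customers": {"path": "/Volumes/main/landing/customers", "format": "csv"},
}

def make_bronze_table(name: str, conf: dict):
    # Factory function: each call registers one generated DLT table.
    @dlt.table(name=f"bronze_{name}", comment=f"Auto-generated ingestion for {name}")
    def _table():
        return (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", conf["format"])
                .load(conf["path"]))
    return _table

for source_name, source_conf in SOURCES.items():
    make_bronze_table(source_name, source_conf)
```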

unit testing, unit testing 2
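The usual trick for unit testing pipeline code is to keep transformations as plain functions that can be exercised with a local SparkSession. A small pytest sketch with a made-up transformation:

```python
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def add_order_date(df: DataFrame) -> DataFrame:
    """The transformation under test: derive order_date from order_ts."""
    return df.withColumn("order_date", F.to_date("order_ts"))

@pytest.fixture(scope="session")
def spark():
    # Local session so the test runs outside any pipeline or cluster.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_order_date(spark):
    df = (spark.createDataFrame([("2024-08-28 10:00:00",)], ["order_ts"])
          .withColumn("order_ts", F.to_timestamp("order_ts")))
    result = add_order_date(df)
    assert result.select("order_date").first()[0].isoformat() == "2024-08-28"
```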

Besides the declarative capabilities of DLT, the TPC-DI implementation runs about 2x faster on DLT than the non-DLT version. TPC-DI focuses on performance, cost, and consistency audits (Slowly Changing Dimensions via CDC, including SCD Type II, and referential integrity).

Databricks Asset Bundles

Databricks Asset Bundles are a CI/CD solution. There are many ways to deploy into production, but each has challenges (IaC with Terraform, dbx, REST APIs). A bundle's components: code, the execution environment (compute, workspace), and other resources (workflows, Delta Live Tables, MLflow assets).

Asset Bundle configurations, example

DLT with DAB

Use a data contract to create the data product spec, then generate SQL and tests from the data contract. The code is managed in a DAB.

Security operations

Account console

To enable the account console and establish your first account admin, you’ll need to engage someone who has the Microsoft Entra ID Global Administrator role.

Workspace

Manage users, groups, service principals

Governance

Unity Catalog at the center of a data mesh

https://www.databricks.com/sites/default/files/2023-10/final_data-and-ai-governance.6sept2023.pdf

Key data governance challenges: Fragmented data landscape (data silos), Complex access management, Inadequate monitoring and visibility, Limited cross-platform sharing and collaboration

data masking, privacy

Identify PII with Presidio (2, 3), an open-source data protection and de-identification SDK from Microsoft.

Custom solutions built on Databricks and UC: Microsoft Presidio plus an ML NER model.
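A minimal Presidio sketch for the detection/anonymization step; it assumes the presidio-analyzer and presidio-anonymizer packages are installed, and the sample text is made up.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at jane.doe@example.com or 555-123-4567."

# Detect PII entities in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    language="en",
)

# Replace the detected spans; by default each entity becomes a <ENTITY_TYPE> token.
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)
```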

Prophecy low code with PII_template

https://privacy.uw.edu/wp-content/uploads/sites/7/2021/03/DataAnonymization_Aug2019.pdf

Deletion, redaction or obfuscation: Direct identifiers (e.g. name, email) are covered, eliminated, removed or hidden.

Pseudonymization: Information from which direct identifiers have been eliminated, transformed or replaced by pseudonyms, but indirect identifiers (e.g. birth date, address) remain intact

De-identification: Direct and known indirect identifiers (perhaps contextually identified by a particular law or regulation, e.g. HIPAA) have been removed or mathematically manipulated to break the linkage to identities.

Anonymization: Direct and indirect identifiers are removed or manipulated together with mathematical and technical guarantees, often through aggregation, in order to prevent re-identification. Re-identification of anonymized data is not possible.
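As a small illustration of pseudonymization, direct identifiers can be replaced with a salted hash so records remain joinable on the pseudonym while the identifier itself is dropped. The columns and salt below are placeholders, and the snippet assumes a notebook with a predefined spark.

```python
from pyspark.sql import functions as F

# Placeholder salt; in practice keep it in a secret scope, not in code.
SALT = "rotate-me-and-store-in-a-secret-scope"

df = spark.createDataFrame(
    [("jane.doe@example.com", "1985-03-14", "98105")],
    ["email", "birth_date", "zip"],
)

pseudonymized = (df
    .withColumn("email_pseudonym", F.sha2(F.concat(F.col("email"), F.lit(SALT)), 256))
    .drop("email"))   # direct identifier removed; indirect identifiers (birth_date, zip) remain

pseudonymized.show(truncate=False)
```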

https://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf

Identity disclosure: identifying information can survive de-identification. In the search-log release discussed in the NIST report, each user was identified only by a single numeric code; because the code was a randomly generated pseudonym, it could not itself be tied back to the user's identity, yet identifying information appeared in the search queries themselves (for example, people who searched for information about their own property). Similarly, de-identifying search records might remove the searcher's name but leave an IP address, allowing the data to be linked against a database that maps IP addresses to names.

11-step process for deidentifying data, PHI

Metadata management

  1. Technical metadata
  2. Governance metadata
  3. Operational metadata
  4. Collaboration metadata
  5. Quality metadata
  6. Usage metadata

Unity Catalog

Pillars: discovery (tagging of column, table, schema, and catalog objects), access control, lineage, audit, monitoring, and sharing (Delta Sharing, Marketplace, Clean Rooms)

Working with file-based data sources: storage credentials, external locations, managed/external tables (prefer external tables if you have readers and writers outside of Databricks, specific storage naming or hierarchy requirements, infrastructure isolation requirements, or need non-Delta formats), and managed/external volumes. Working with databases: connections and foreign catalogs.

Object model

Key roles: account admin (manages workspaces and metastores), metastore admin (manages catalogs, ownership, and access control), workspace admin, and data owner

Configure the workspace to use Unity Catalog, and connect Unity Catalog to storage.

SQL functions can implement row-level security and column-level masking.
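A sketch of a column mask and a row filter implemented as Unity Catalog SQL functions; the catalog, schema, table, and group names are placeholders, and the snippet assumes a UC-enabled workspace with a predefined spark.

```python
# Column mask: members of hr_admins see the real SSN, everyone else sees a redacted value.
spark.sql("""
  CREATE OR REPLACE FUNCTION main.gov.ssn_mask(ssn STRING)
  RETURN CASE WHEN is_account_group_member('hr_admins') THEN ssn
              ELSE '***-**-****' END
""")
spark.sql("ALTER TABLE main.silver.employees ALTER COLUMN ssn SET MASK main.gov.ssn_mask")

# Row filter: admins see all rows, everyone else only sees the US region.
spark.sql("""
  CREATE OR REPLACE FUNCTION main.gov.region_filter(region STRING)
  RETURN is_account_group_member('admins') OR region = 'US'
""")
spark.sql("ALTER TABLE main.silver.employees SET ROW FILTER main.gov.region_filter ON (region)")
```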

Migrate to Databricks Unity Catalog

Migrate options:

Managed Hive tables in Delta, Parquet, or Iceberg format -> managed UC tables: CREATE TABLE CLONE / CTAS

Managed or external Hive tables -> external UC tables (prefer SQL): SYNC

Managed or external Hive tables -> external UC tables (done in the UI): upgrade wizard

Managed or external Hive tables -> managed or external UC tables (comprehensive, large data volumes): ucx
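A sketch of the SYNC and clone options above; the catalog, schema, and table names are placeholders and the snippet assumes a notebook with a predefined spark.

```python
# External upgrade: register an existing Hive metastore table in Unity Catalog in place.
spark.sql("""
  SYNC TABLE main.sales.orders
  FROM hive_metastore.sales.orders
""")

# Managed upgrade for a Delta table: copy the data into a UC-managed table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sales.customers
  DEEP CLONE hive_metastore.sales.customers
""")
```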

Steps

  1. Assess
  2. Migrate groups
  3. Attach metastore
  4. Migrate external tables
  5. Migrate SQL warehouse
  6. Migrate jobs
  7. Migrate managed tables
  8. Migrate code/notebooks

ucx flow: assess, group migration, table migration, code migration

ucx demo (assess, assign metastore, table migration (create table mapping, create missing principal, migrate credential, create external location, create catalog schema, create uber principal, migrate table), code migration (linting))

Technical details of UCX

Lakehouse Data Modeling

ELT pattern: DLT for core data ingestion (bronze -> silver), DBT for business transformation (silver -> gold): meshdallion :-)

Use identity columns as primary and foreign keys; in SQL they are declared with "GENERATED ALWAYS AS IDENTITY".
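A sketch of surrogate keys built from identity columns plus Unity Catalog's informational primary/foreign key constraints; the table and column names are placeholders and the snippet assumes a UC-enabled notebook with a predefined spark.

```python
# Dimension table with an auto-generated surrogate key.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.dim_customer (
    customer_sk   BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
    customer_id   STRING NOT NULL,
    customer_name STRING,
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk)
  )
""")

# Fact table referencing the dimension via an informational foreign key.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.fact_orders (
    order_id    BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
    customer_sk BIGINT NOT NULL,
    amount      DOUBLE,
    CONSTRAINT fk_customer FOREIGN KEY (customer_sk)
      REFERENCES main.gold.dim_customer (customer_sk)
  )
""")
```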

Data Design and Lakehouse Patterns in Microsoft Fabric: keep files as raw as possible in bronze

Appendix

