Data Lakehouse Best Practices and Latest Trends

Articles that are interesting to read

Xin Cheng
9 min read · Aug 28, 2024

Previously I discussed data lake best practices. There is a trend toward unifying the data lake and the data warehouse in a single platform: the lakehouse. I have put together some best-practice resources.

Lakehouse

Data lakehouse guiding principles

  1. use a layered (or multi-hop) architecture to increase trust in the final data-as-a-product
  2. prevent data silos by leveraging data sharing to reduce data movement
  3. enable data value creation through self-service
  4. govern through data quality, data catalog, access control, audit, and data lineage
  5. stay open (interfaces, formats)
  6. optimize for performance and cost

The Well-Architected Framework was introduced by AWS for architecting cloud solutions; Microsoft and Google have since adopted similar frameworks. It originally had five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization
  • Sustainability (added Dec 2021)

Lakehouse-specific pillars

  • Data governance
  • Interoperability and usability

Data governance

Unify data and AI governance across management (data, feature store, and model assets; metadata management; discovery; lineage; audit), security, and quality (data quality, model test results). In Databricks this centers on Unity Catalog, with cloud-specific integrations where applicable (e.g. Microsoft Purview).

Interoperability and usability

Use open interfaces (REST APIs), open data formats and sharing protocols (Delta Lake, Delta Sharing), and open standards for ML lifecycle management (MLflow).
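As a concrete illustration of the open-standards point, here is a minimal sketch that tracks and registers a model with MLflow. It assumes mlflow and scikit-learn are installed; the toy model, metric, and the registered name "churn_model" are placeholders of my own, not from the referenced docs.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model and track it with MLflow, the open standard for the ML lifecycle.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Registering the model decouples its lifecycle from the training code.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_model")
```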

Operational excellence

  • Use enterprise source code management (SCM) (Databricks Git folders), DataOps/MLOps, and an environment isolation/catalog strategy
  • Use infrastructure as code (IaC) for deployments and maintenance (serverless compute, predefined compute templates, compute policies) and automated workflows for jobs (Databricks Jobs, Delta Live Tables, Auto Loader)
  • Use a model registry to decouple the code and model lifecycles
  • Use declarative management for complex data and ML projects (Databricks Asset Bundles)
  • Manage service limits and quotas (Databricks-native quota management, e.g. for pipelines and jobs, on top of the cloud platform's quotas)
  • Set up monitoring, alerting, and logging (cloud platform monitoring, Databricks Lakehouse Monitoring, SQL warehouse monitoring, Databricks SQL alerts, Auto Loader monitoring, job monitoring, Delta Live Tables monitoring, streaming monitoring, ML and AI monitoring)
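For the monitoring item in the list above, here is a small sketch using the Databricks SDK for Python to list jobs and their latest run state. It assumes the databricks-sdk package is installed and workspace authentication is configured (e.g. DATABRICKS_HOST/DATABRICKS_TOKEN); it is my own example, not code from the linked docs.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs and surface the latest run state of each, which is the kind of
# job monitoring the operational-excellence pillar calls for.
for job in w.jobs.list():
    runs = list(w.jobs.list_runs(job_id=job.job_id, limit=1))
    state = runs[0].state.life_cycle_state if runs else "NO RUNS"
    print(job.settings.name, state)
```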

Security, compliance, and privacy

Identity management and least privilege; protect data in transit and at rest; secure the network, and identify and protect endpoints; monitor security.
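Least privilege in the lakehouse mostly comes down to Unity Catalog grants. A minimal sketch, assuming a Databricks notebook (predefined spark) and placeholder catalog, schema, and group names:

```python
# Analysts get read-only access to the gold layer; engineers can also write to silver.
grants = [
    "GRANT USE CATALOG ON CATALOG main TO `analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.gold TO `analysts`",
    "GRANT SELECT ON SCHEMA main.gold TO `analysts`",
    "GRANT SELECT, MODIFY ON SCHEMA main.silver TO `data_engineers`",
]
for stmt in grants:
    spark.sql(stmt)
```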

Reliability

  • Design for failure: Delta Lake for ACID transactions, Apache Spark as a resilient distributed data engine, automatic retry policies in Databricks Jobs, failure recovery with Delta Live Tables, and managed services (serverless SQL warehouses, model serving, serverless jobs, serverless compute for notebooks, Delta Live Tables)
  • Manage data quality: Delta Lake supports schema validation and schema enforcement, and Delta tables support constraints and data expectations
  • Design for autoscaling: Delta Live Tables, SQL warehouses
  • Plan disaster recovery: relies on the specific cloud's DR capabilities
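For the data-quality point above, a small sketch of Delta constraints; the table and column names are placeholders and the snippet assumes a Databricks notebook with a predefined spark session.

```python
# Create a Delta table with a NOT NULL column.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.silver.orders (
    order_id BIGINT NOT NULL,
    amount   DOUBLE,
    order_ts TIMESTAMP
  ) USING DELTA
""")

# CHECK constraints make bad writes fail instead of silently landing.
spark.sql("""
  ALTER TABLE main.silver.orders
  ADD CONSTRAINT positive_amount CHECK (amount >= 0)
""")

# Delta also enforces the schema: an INSERT with an unexpected column or an
# incompatible type is rejected unless schema evolution is explicitly enabled.
```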

Performance efficiency

  • Prefer larger clusters
  • Use native Spark operations rather than UDFs where possible
  • Use Photon
  • Use the disk cache (formerly known as "Delta cache") and avoid Spark caching; use the query result cache
  • Use Delta Lake compaction and data skipping (Z-ordering, liquid clustering) and avoid over-partitioning
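A sketch of the layout-related items (compaction, Z-ordering, liquid clustering); the table and column names are placeholders and the snippet assumes a Databricks notebook with a predefined spark session.

```python
# File compaction plus Z-ordering for data skipping on an existing table.
spark.sql("OPTIMIZE main.silver.orders ZORDER BY (order_ts)")

# Newer alternative: liquid clustering declared at table creation time,
# which replaces both partitioning and Z-ordering for that table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.silver.events (
    event_id BIGINT,
    event_ts TIMESTAMP,
    country  STRING
  ) CLUSTER BY (event_ts, country)
""")
```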

Cost optimization

  • Use serverless compute (SQL warehouses, Mosaic AI Model Serving)
  • Use the right instance type
  • Use auto-scaling compute and auto termination
  • Use compute policies to control costs
  • Tag clusters for cost attribution
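A hedged sketch of a cluster policy that enforces auto termination, caps autoscaling, and tags clusters for cost attribution, created via the Databricks SDK for Python. The policy name, limits, and tag values are placeholders, and the exact policy-definition fields are worth checking against the docs.

```python
import json
from databricks.sdk import WorkspaceClient

# Cluster-policy definition: each key constrains one cluster attribute.
policy_definition = {
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

w = WorkspaceClient()
w.cluster_policies.create(
    name="cost-guardrails",
    definition=json.dumps(policy_definition),
)
```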

https://docs.databricks.com/en/lakehouse-architecture/reference.html

lakehouse reference architectures

Databricks Lakehouse Platform

The platform supports ingestion (batch, streaming, Auto Loader, COPY INTO), processing (ETL pipelines, data quality checks, automatic recovery, scheduling/orchestration/workflows, observability), and consumption (notebooks, charts, SQL, machine learning).
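For the ingestion piece, a minimal Auto Loader sketch, assuming a Databricks notebook with a predefined spark; all paths and the target table name are placeholders.

```python
# Auto Loader incrementally discovers new files in the landing path.
raw = (spark.readStream
       .format("cloudFiles")                         # Auto Loader source
       .option("cloudFiles.format", "json")          # raw files are JSON here
       .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
       .load("/Volumes/main/landing/orders"))

# Write the stream into a bronze table; availableNow processes what is
# currently available and then stops, so it can run as a scheduled job.
(raw.writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.bronze.orders"))
```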

Delta Live Tables/DLT

DLT is built on the idea that the developer focuses on business logic while the system provides the necessary infrastructure, computation, and workflow. DLT uses declarative programming, which expresses what to compute rather than how; the platform handles fault tolerance, state management, scheduling/dependencies/parallelism, CDC, schema evolution, optimization/partitioning, and operational issues (system regressions, cloud issues). Core abstractions: streaming tables (for ingestion) and materialized views (for transformations; precomputed and stored in Delta).
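A minimal sketch of the two core abstractions in Python; it only runs inside a DLT pipeline, and the paths, table names, and expectation are placeholders of my own.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Streaming table: incremental ingestion from raw files")
def raw_orders():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/orders"))

@dlt.table(comment="Materialized view: cleaned, precomputed transformation")
@dlt.expect_or_drop("valid_amount", "amount >= 0")   # declarative data quality check
def orders_clean():
    # Batch read of the upstream table makes this a materialized view.
    return dlt.read("raw_orders").withColumn("order_date", F.to_date("order_ts"))
```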

Terminology: streaming live table = streaming table, live table = materialized view; DLT in DBSQL

Materialized views precompute results for query performance (they require a Unity Catalog-enabled pro or serverless SQL warehouse and must be refreshed to include new data).

Streaming mode: manual, triggered on schedule, continuous

Streaming tables (and streaming joins) are stateful and do not recompute past data, so picking up logic or source changes requires a full refresh (REFRESH <table> FULL).

DLT supports Slowly Changing Dimension (SCD) types 1 and 2 (type 2 is good for querying long history, which is more efficient than relying on Delta time travel).
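A hedged sketch of SCD type 2 with DLT's APPLY CHANGES API; the source, target, key, and sequencing columns are placeholders, and the snippet only runs inside a DLT pipeline.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="CDC feed of customer changes")
def customers_cdc():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/customers_cdc"))

# Target streaming table that will hold the SCD type 2 history.
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=F.col("change_ts"),   # ordering column for out-of-order events
    stored_as_scd_type=2,             # keep full history instead of overwriting
)
```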

DLT capabilities: Declarative Syntax, Automatic Transformations, Table-based model, Data Quality Checks, Automatic incremental processing, Unified batch and streaming, idempotent logic, reliability

DLT serverless, serverless SQL, now serverless compute

Configuration-driven data pipelines with DLT (using Python decorators and dlt-meta)
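The configuration-driven idea boils down to generating table definitions from config rather than hand-writing them. A minimal sketch in that spirit (not the dlt-meta implementation itself); the config contents are placeholders and it only runs inside a DLT pipeline.

```python
import dlt

# Placeholder config; in practice this would come from dlt-meta or a JSON/YAML file.
SOURCES = {
    "orders":    {"path": "/Volumes/main/landing/orders",    "format": "json"},
    "customers": {"path": "/Volumes/main/landing/customers", "format": "csv"},
}

def make_bronze_table(name: str, conf: dict):
    # Factory function: each call registers one generated DLT table.
    @dlt.table(name=f"bronze_{name}", comment=f"Auto-generated ingestion for {name}")
    def _table():
        return (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", conf["format"])
                .load(conf["path"]))
    return _table

for source_name, source_conf in SOURCES.items():
    make_bronze_table(source_name, source_conf)
```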

unit testing, unit testing 2
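The usual trick for unit testing pipeline code is to keep transformations as plain functions that can be exercised with a local SparkSession. A small pytest sketch with a made-up transformation:

```python
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def add_order_date(df: DataFrame) -> DataFrame:
    """The transformation under test: derive order_date from order_ts."""
    return df.withColumn("order_date", F.to_date("order_ts"))

@pytest.fixture(scope="session")
def spark():
    # Local session so the test runs outside any pipeline or cluster.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_order_date(spark):
    df = (spark.createDataFrame([("2024-08-28 10:00:00",)], ["order_ts"])
          .withColumn("order_ts", F.to_timestamp("order_ts")))
    result = add_order_date(df)
    assert result.select("order_date").first()[0].isoformat() == "2024-08-28"
```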

Besides the declarative capabilities of DLT, the TPC-DI implementation runs about 2x faster on DLT than the non-DLT version. TPC-DI focuses on performance, cost, and consistency audits (Slowly Changing Dimensions via CDC, including SCD Type II, and referential integrity).

Databricks Asset Bundles

Databricks Asset Bundles are a CI/CD solution. There are many ways to deploy into production, but each has challenges (IaC with Terraform, dbx, REST APIs). A bundle's components: code, the execution environment (compute, workspace), and other resources (workflows, Delta Live Tables, MLflow assets).

Asset Bundle configurations, example

DLT with DAB

Use a data contract to create the data product spec, then generate SQL and tests from the data contract. The code is managed in a DAB.

Security operations

Account console

To enable the account console and establish your first account admin, you’ll need to engage someone who has the Microsoft Entra ID Global Administrator role.

Workspace

Manage users, groups, service principals

Governance

Unity Catalog at the center of a data mesh

https://www.databricks.com/sites/default/files/2023-10/final_data-and-ai-governance.6sept2023.pdf

Key data governance challenges: Fragmented data landscape (data silos), Complex access management, Inadequate monitoring and visibility, Limited cross-platform sharing and collaboration

data masking, privacy

Identify PII with Presidio (2, 3), an open-source data protection and de-identification SDK from Microsoft.

Custom solutions built on Databricks and UC: Microsoft Presidio plus an ML NER model.
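A minimal Presidio sketch for the detection/anonymization step; it assumes the presidio-analyzer and presidio-anonymizer packages are installed, and the sample text is made up.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at jane.doe@example.com or 555-123-4567."

# Detect PII entities in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    language="en",
)

# Replace the detected spans; by default each entity becomes a <ENTITY_TYPE> token.
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)
```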

Prophecy low code with PII_template

https://privacy.uw.edu/wp-content/uploads/sites/7/2021/03/DataAnonymization_Aug2019.pdf

Deletion, redaction or obfuscation: Direct identifiers (e.g. name, email) are covered, eliminated, removed or hidden.

Pseudonymization: Information from which direct identifiers have been eliminated, transformed or replaced by pseudonyms, but indirect identifiers (e.g. birth date, address) remain intact

De-identification: Direct and known indirect identifiers (perhaps contextually identified by a particular law or regulation, e.g. HIPAA) have been removed or mathematically manipulated to break the linkage to identities.

Anonymization: Direct and indirect identifiers are removed or manipulated together with mathematical and technical guarantees, often through aggregation, in order to prevent re-identification. Re-identification of anonymized data is not possible.
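As a small illustration of pseudonymization, direct identifiers can be replaced with a salted hash so records remain joinable on the pseudonym while the identifier itself is dropped. The columns and salt below are placeholders, and the snippet assumes a notebook with a predefined spark.

```python
from pyspark.sql import functions as F

# Placeholder salt; in practice keep it in a secret scope, not in code.
SALT = "rotate-me-and-store-in-a-secret-scope"

df = spark.createDataFrame(
    [("jane.doe@example.com", "1985-03-14", "98105")],
    ["email", "birth_date", "zip"],
)

pseudonymized = (df
    .withColumn("email_pseudonym", F.sha2(F.concat(F.col("email"), F.lit(SALT)), 256))
    .drop("email"))   # direct identifier removed; indirect identifiers (birth_date, zip) remain

pseudonymized.show(truncate=False)
```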

https://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf

Identity disclosure: identifying information can survive de-identification. In the search-log release discussed in the NIST report, each user was identified only by a single numeric code; because the code was a randomly generated pseudonym, it could not itself be tied back to the user's identity, yet identifying information appeared in the search queries themselves (for example, people who searched for information about their own property). Similarly, de-identifying search records might remove the searcher's name but leave an IP address, allowing the data to be linked against a database that maps IP addresses to names.

11-step process for deidentifying data, PHI

Metadata management

  1. Technical metadata
  2. Governance metadata
  3. Operational metadata
  4. Collaboration metadata
  5. Quality metadata
  6. Usage metadata

Unity Catalog

Pillars: discovery (tagging of column, table, schema, and catalog objects), access control, lineage, audit, monitoring, and sharing (Delta Sharing, Marketplace, Clean Rooms)

Working with file-based data sources: storage credentials, external locations, managed/external tables (prefer external tables if you have readers and writers outside of Databricks, specific storage naming or hierarchy requirements, infrastructure isolation requirements, or need non-Delta formats), and managed/external volumes. Working with databases: connections and foreign catalogs.

Object model

Key roles: account admin (manages workspaces and metastores), metastore admin (manages catalogs, ownership, and access control), workspace admin, and data owner

Configure the workspace to use Unity Catalog, and connect Unity Catalog to storage.

SQL functions can implement row-level security and column-level masking.
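A sketch of a column mask and a row filter implemented as Unity Catalog SQL functions; the catalog, schema, table, and group names are placeholders, and the snippet assumes a UC-enabled workspace with a predefined spark.

```python
# Column mask: members of hr_admins see the real SSN, everyone else sees a redacted value.
spark.sql("""
  CREATE OR REPLACE FUNCTION main.gov.ssn_mask(ssn STRING)
  RETURN CASE WHEN is_account_group_member('hr_admins') THEN ssn
              ELSE '***-**-****' END
""")
spark.sql("ALTER TABLE main.silver.employees ALTER COLUMN ssn SET MASK main.gov.ssn_mask")

# Row filter: admins see all rows, everyone else only sees the US region.
spark.sql("""
  CREATE OR REPLACE FUNCTION main.gov.region_filter(region STRING)
  RETURN is_account_group_member('admins') OR region = 'US'
""")
spark.sql("ALTER TABLE main.silver.employees SET ROW FILTER main.gov.region_filter ON (region)")
```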

Migrate to Databricks Unity Catalog

Migrate options:

Managed Hive tables in Delta, Parquet, or Iceberg format -> managed UC tables: CREATE TABLE CLONE / CTAS

Managed or external Hive tables -> external UC tables (prefer SQL): SYNC

Managed or external Hive tables -> external UC tables (done in the UI): upgrade wizard

Managed or external Hive tables -> managed or external UC tables (comprehensive, large data volumes): ucx
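A sketch of the SYNC and clone options above; the catalog, schema, and table names are placeholders and the snippet assumes a notebook with a predefined spark.

```python
# External upgrade: register an existing Hive metastore table in Unity Catalog in place.
spark.sql("""
  SYNC TABLE main.sales.orders
  FROM hive_metastore.sales.orders
""")

# Managed upgrade for a Delta table: copy the data into a UC-managed table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sales.customers
  DEEP CLONE hive_metastore.sales.customers
""")
```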

Steps

  1. Assess
  2. Migrate groups
  3. Attach metastore
  4. Migrate external tables
  5. Migrate SQL warehouse
  6. Migrate jobs
  7. Migrate managed tables
  8. Migrate code/notebooks

ucx flow: assess, group migration, table migration, code migration

ucx demo (assess, assign metastore, table migration (create table mapping, create missing principal, migrate credential, create external location, create catalog schema, create uber principal, migrate table), code migration (linting))

Technical details of UCX

Lakehouse Data Modeling

ELT pattern: DLT for core data ingestion (bronze -> silver), DBT for business transformation (silver -> gold): meshdallion :-)

Use identity columns as primary and foreign keys; in SQL they are declared with "GENERATED ALWAYS AS IDENTITY".
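A sketch of surrogate keys built from identity columns plus Unity Catalog's informational primary/foreign key constraints; the table and column names are placeholders and the snippet assumes a UC-enabled notebook with a predefined spark.

```python
# Dimension table with an auto-generated surrogate key.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.dim_customer (
    customer_sk   BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
    customer_id   STRING NOT NULL,
    customer_name STRING,
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk)
  )
""")

# Fact table referencing the dimension via an informational foreign key.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.fact_orders (
    order_id    BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
    customer_sk BIGINT NOT NULL,
    amount      DOUBLE,
    CONSTRAINT fk_customer FOREIGN KEY (customer_sk)
      REFERENCES main.gold.dim_customer (customer_sk)
  )
""")
```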

Data Design and Lakehouse Patterns in Microsoft Fabric: keep files as raw as possible in bronze

Appendix

