Data Lake Best Practices

Articles that are interesting to read

Xin Cheng
8 min read · Aug 18, 2024

I previously wrote an article on data warehouses, data lakes, data lakehouses, data fabric, and data mesh. Since enterprise data spans multiple types (structured data; files such as text, PDF, and docs; images; audio; video), a data lake is usually the first entry point for an enterprise data platform. Let's survey best practices for data lakes.

Design/Architecture

Deciding the number of storage accounts (key considerations: customer geographic locations, isolated management policies, cost/billing, and the overhead of managing multiple accounts and copying data back and forth): enterprises whose analytics scenarios span multiple geographic regions can provision region-specific storage accounts to store each region's data and share specific data sets with other regions. Create different storage accounts (ideally in different subscriptions) for your development and production environments. Enterprises that operate a multi-tenant analytics platform serving multiple customers may provision individual data lakes for those customers in different subscriptions, so that customer data and the associated analytics workloads stay isolated from other customers and cost and billing models remain manageable.

Data organization: create different folders or containers (there are trade-offs between folders and containers) for the different data zones: raw, enriched, curated, and workspace data sets. Within a zone, organize data in folders by a logical separation, e.g. datetime, business unit, or both.
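
A hypothetical path convention following these guidelines (account, container, and folder names below are illustrative, not from the source):

```python
# Illustrative ADLS Gen2 path convention: <zone container>/<business unit>/<source>/<yyyy>/<mm>/<dd>/
# All names are placeholders.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/pos/2024/08/18/orders.avro"
enriched_path = "abfss://enriched@mydatalake.dfs.core.windows.net/sales/pos/2024/08/18/orders.parquet"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales/daily_revenue/2024/08/18/"
```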

Access control of data: RBAC (scoped to top-level resources such as storage accounts or containers) and ACLs (files and directories). Create security groups, add security principals to them, and assign ACLs to the security groups.
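
As a minimal sketch of this pattern, assuming the azure-storage-file-datalake SDK and a placeholder security group object ID, group-level ACLs might be applied like this:

```python
# Sketch: grant a security group read/execute on a directory tree via POSIX ACLs.
# Account URL, container, directory, and group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("curated").get_directory_client("sales")

# ACL entry format "<scope>:<object id>:<permissions>"; applied recursively to existing children.
directory.update_access_control_recursive(
    acl="group:00000000-0000-0000-0000-000000000000:r-x"
)
```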

Data format: the Avro file format is favored when I/O patterns are more write-heavy or when query patterns retrieve multiple rows of records in their entirety, e.g. a message bus such as Event Hubs or Kafka writing multiple events/messages in succession. The Parquet and ORC file formats are favored when I/O patterns are more read-heavy and/or when query patterns focus on a subset of columns in the records.
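
A minimal sketch of why columnar formats suit column-subset queries, using pyarrow (file and column names are illustrative):

```python
# Write a small Parquet file, then read back only one column; the other columns' data is skipped.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [9.5, 12.0, 3.25],
    "country": ["US", "DE", "JP"],
})
pq.write_table(table, "orders.parquet", compression="snappy")

amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts.to_pandas())
```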

Cost of the data lake: apply data lifecycle management, choose the right data redundancy, and optimize for more data per transaction, since read and write transactions are billed in 4 MB increments.

Data lake monitoring: Azure Storage logs in Azure Monitor. Log Analytics with KQL.
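
A hedged sketch of querying those logs with KQL from Python, assuming diagnostic settings already route StorageBlobLogs to a Log Analytics workspace (the workspace ID is a placeholder) and using the azure-monitor-query package:

```python
# Sketch: summarize storage blob operations over the last day with KQL.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
kql = """
StorageBlobLogs
| where TimeGenerated > ago(1d)
| summarize Requests = count() by OperationName, StatusCode
| order by Requests desc
"""
response = client.query_workspace("<workspace-id>", kql, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```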

Performance: use larger files (target at least 100 MB or more) for better performance, and leverage Query Acceleration.
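
One common way to get larger files is periodic compaction; a minimal PySpark sketch (paths and partition count are placeholders, not from the source):

```python
# Sketch: compact many small files into a handful of larger ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/pos/2024/08/")

# Pick the output partition count so each file lands in roughly the 100 MB-1 GB range.
df.coalesce(8).write.mode("overwrite").parquet(
    "abfss://enriched@mydatalake.dfs.core.windows.net/sales/pos/2024/08/"
)
```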

Folder layout of the raw, enriched, and curated data zones

AWS data lake cost: use S3 Storage Classes, data lifecycle policies, and S3 object tagging; combine small files and compress data to maximize data retention and reduce storage costs
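
As an illustration of lifecycle policies, a hedged boto3 sketch that tiers and expires objects under a raw prefix (bucket name, prefix, and day thresholds are placeholders):

```python
# Sketch: S3 lifecycle rule that moves raw data to cheaper tiers and eventually expires it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```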

Performance: Combine Small Files

Governance: Metadata management

https://library.si.edu/sites/default/files/pdf/rdm_best_practices.pdf

Mainly focuses on data storage: file formats, file organization, data dictionaries/metadata

Architecting an Open Data Lake for the Enterprise

Phase 1: capture and store raw data of different types at scale; Phase 2: augment the enterprise DW strategy. Elements: data preparation, data profiling, governance, self-service data ingest, metadata management, data lineage, data classification, security

Project vision: bring your own tool, decentralized data ownership, centralized processes/tools, capture all data as close to real time as possible, self-service access by authorized users, encryption compliance

Governance

Define clear roles and responsibilities (Data Owners, Data Stewards, Data Managers or Custodians, Data Consumers, Data Governance Council, Data Architects, Data Quality Analysts)

Data governance in Dataplex. Solving dark data: a data marketplace, tagging (with GenAI considered to assist metadata enrichment), and surfacing data value

Automatic cataloguing of data sources, column lineage

Data management

https://people.umass.edu/biostat690c/pdf/1.%20%20Principles%20of%20Data%20Management%202020.pdf

Data Management Plan (DMP) (outlines how to collect, organize, manage, store, describe, secure, backup, preserve, and share data), Manual of Operation Procedures (MOOP)

https://data.uq.edu.au/files/6833/Data%20Governance%20Essentials%20handbook%20August%202021.pdf

https://standards.ieee.org/wp-content/uploads/import/governance/iccom/bdgmm-standards-roadmap-2020.pdf

Lists challenges of big data management in a few industrial case studies.

Data governance playbook

Data quality

Inconsistent formats/units/languages, incomplete information, duplicates, inaccurate values, invalid values
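
A minimal pandas sketch of checks for these issue types (file, column names, and rules are hypothetical):

```python
# Sketch: count duplicates, missing values, and invalid values in a dataset.
import pandas as pd

df = pd.read_parquet("orders.parquet")

report = {
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_order_ids": int(df["order_id"].isna().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
    # Treat anything that is not a two-letter uppercase code (or is missing) as invalid.
    "invalid_country_codes": int((~df["country"].str.fullmatch(r"[A-Z]{2}", na=False)).sum()),
}
print(report)
```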

Security

Main challenges: access control (RBAC), data protection (data loss prevention, encryption), privacy, and compliance (data masking and tokenization, logs of data access and modifications, regular audits and compliance checks, anomaly detection and threat intelligence)

Access control (bucket policies, Lake Formation), encryption (S3 encryption with AWS KMS, replication, Object Lock), object tagging
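
A hedged boto3 sketch of one of these controls, default SSE-KMS bucket encryption (bucket name and KMS key ARN are placeholders):

```python
# Sketch: enforce default server-side encryption with a customer-managed KMS key.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```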

Metadata management

Metadata helps

  1. Data discovery (providing details such as its source, its structure, its meaning, its relationships with other data, and its usage.)
  2. Data quality (lineage of data (where it came from, who modified it, when, and why) can help track and fix data quality issues.)
  3. Data governance and compliance (comply with data privacy and protection regulations. It allows tracking of who has access to what data, what they can do with it, and what they have done with it.)
  4. Data integration (map data elements across different systems, enabling a consistent view of data across the enterprise.)
  5. Data security (helps implement security controls, such as access permissions and data masking.)
  6. AI and machine learning (helps algorithms understand what the data represents, which can improve their performance.)

Functionality of metadata management

  1. Data cataloging (record information about the source, format, structure, and content of the data, and the transformations applied; see the registration sketch after this list)
  2. Data lineage tracking (record where it came from, who has accessed it, what changes have been made to it, and where it has been used.)
  3. Semantic tagging (meaning and context of the data)
  4. Data quality monitoring (check whether the data conforms to specified formats, ranges, or other constraints, and alert users to any anomalies.)
  5. Data governance enforcement (control who has access to what data, what they can do with it, and what they have done with it)
  6. Integration with other tools (feed metadata to data integration tools to facilitate data mapping and transformation.)
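
To make the data cataloging point concrete, a hedged sketch that registers a Parquet dataset in the AWS Glue Data Catalog (database, table, columns, and S3 location are placeholders):

```python
# Sketch: register a curated Parquet dataset in the Glue Data Catalog with basic metadata.
import boto3

glue = boto3.client("glue")
glue.create_table(
    DatabaseName="sales",
    TableInput={
        "Name": "daily_revenue",
        "Parameters": {"classification": "parquet", "owner": "sales-data-stewards"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_date", "Type": "date"},
                {"Name": "revenue", "Type": "double"},
            ],
            "Location": "s3://my-data-lake/curated/sales/daily_revenue/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```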

Operational

Lifecycle management

JSON rules for lifecycle management policy on blobs (move between tiers, delete versions)
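
A hedged example of what such a rule can look like (prefix, thresholds, and rule name are placeholders), expressed as a Python dict that serializes to the policy JSON:

```python
# Sketch: lifecycle rule that tiers base blobs down over time and deletes old versions.
import json

policy = {
    "rules": [
        {
            "name": "tier-raw-and-prune-versions",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    },
                    "version": {"delete": {"daysAfterCreationGreaterThan": 90}},
                },
            },
        }
    ]
}
print(json.dumps(policy, indent=2))
```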

AWS CloudTrail logs every S3 API call, complementing S3 server access logging. S3 Inventory audits and reports on the replication and encryption status of data.

S3 Intelligent-Tiering provides automatic cost savings by moving data between frequent- and infrequent-access tiers as access patterns change

Main goals of DLM

  1. To protect the confidentiality of data
  2. To ensure data integrity
  3. To ensure the availability of data

https://static.tti.tamu.edu/tti.tamu.edu/documents/PRC-17-84-F.pdf

Lifecycle: Collect, Process, Store and secure, Use, Share and communicate, Archive, Destroy or re-use (phases can run concurrently). The Delta Lake retention period is configured for table versioning/history (time-travel queries).
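
A hedged PySpark sketch of setting Delta retention table properties for time travel (table name and intervals are placeholders):

```python
# Sketch: configure how much history a Delta table keeps for time-travel queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-retention").getOrCreate()

spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Time-travel read against the retained history.
previous = spark.sql("SELECT * FROM sales.orders VERSION AS OF 42")
```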

Data lifecycle management (formerly information governance) for M365 and Exchange data: retention policies (applied to all items in a specific location) retain or delete content; retention labels (applied to single items) choose what happens after the retention period and how items are disposed of; auto-labeling policies apply labels automatically. Locations: Exchange mailboxes, OneDrive accounts, M365 Group mailboxes, Skype for Business, Exchange public folders, Teams messages/chats, etc. Adaptive scopes can select a specific attribute and define its values.

Test data management

Archiving

Disaster recovery

DR strategies for

Databricks Objects: use CI/CD and Infrastructure as Code (IaC) tooling, e.g. Terraform (TF), Databricks Repos, Databricks REST APIs.

Databases and Tables: data replication using the underlying object storage (GRR; files that cannot be converted to Delta should rely on GRR), and Delta DEEP CLONE to simplify replication for DR (load multiple storage accounts onto the same cluster)
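
A hedged sketch of the DEEP CLONE approach mentioned above (storage paths are placeholders):

```python
# Sketch: replicate a Delta table to secondary-region storage for DR with DEEP CLONE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr-deep-clone").getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE delta.`abfss://dr@secondarylake.dfs.core.windows.net/sales/orders`
    DEEP CLONE delta.`abfss://prod@primarylake.dfs.core.windows.net/sales/orders`
""")
```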

Workspace replication: there is no automatic continuous sync between different workspaces; a custom solution is needed (e.g. batch sync, stream sync, file sync)

Use Databricks Terraform Provider to provision Infra

Databricks Sync (DBSync)

Additional regional disaster recovery topology support and a more concrete approach:

  • Provision multiple Azure Databricks workspaces in paired Azure regions
  • Use geo-redundant storage which by default matches Databricks workspaces
  • Migrate the users, user folders, notebooks, cluster configuration, jobs configuration, libraries, storage, init scripts, and reconfigure access control with scripts provided in the article

Appendix

https://docs.aws.amazon.com/pdfs/whitepapers/latest/building-data-lakes/building-data-lakes.pdf

AWS services integrated with S3 for ingestion: Amazon Kinesis Data Firehose, AWS Snow Family, AWS Glue, AWS DataSync, AWS Transfer Family, Storage Gateway, Apache Hadoop distributed copy command, AWS Database Migration Service

Governance: AWS Glue catalog and search, AWS Lake Formation

https://info.talend.com/rs/talend/images/WP_EN_BD_TDWI_DataLakes.pdf

http://pages.matillion.com/rs/992-UIW-731/images/2019C2%20-%20Data%20Lakes%20eBook.pdf


Written by Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified