Data Lake Best Practices

Articles that are interesting to read

Xin Cheng
8 min read · Aug 18, 2024

I previously wrote an article on data warehouses, data lakes, data lakehouses, data fabric, and data mesh. Since enterprise data spans multiple types (structured data; files such as text, PDF, and docs; images; audio; video), a data lake is usually the first entry point for an enterprise data platform. Let's survey best practices for data lakes.

Design/Architecture

Deciding the number of storage accounts (key considerations: customer geographic locations, isolated management policies, cost/billing, and the overhead of managing multiple accounts and copying data back and forth): enterprises whose analytics scenarios span multiple geographic regions can provision region-specific storage accounts to store each region's data and share specific data sets with other regions. Create different storage accounts (ideally in different subscriptions) for your development and production environments. Enterprises that operate a multi-tenant analytics platform serving multiple customers may provision individual data lakes for those customers in different subscriptions, so that customer data and the associated analytics workloads stay isolated from other customers and cost and billing models remain manageable.

Data organization: create different folders or containers (there are trade-offs between folders and containers) for the different data zones: raw, enriched, curated, and workspace data sets. Within a zone, organize data in folders by a logical separation, e.g. datetime, business unit, or both.
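
A hypothetical path convention following these guidelines (account, container, and folder names below are illustrative, not from the source):

```python
# Illustrative ADLS Gen2 path convention: <zone container>/<business unit>/<source>/<yyyy>/<mm>/<dd>/
# All names are placeholders.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/pos/2024/08/18/orders.avro"
enriched_path = "abfss://enriched@mydatalake.dfs.core.windows.net/sales/pos/2024/08/18/orders.parquet"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales/daily_revenue/2024/08/18/"
```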

Access control of data: RBAC (scoped to top-level resources such as storage accounts or containers) and ACLs (files and directories). Create security groups, add security principals to them, and assign ACLs to the security groups.
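
As a minimal sketch of this pattern, assuming the azure-storage-file-datalake SDK and a placeholder security group object ID, group-level ACLs might be applied like this:

```python
# Sketch: grant a security group read/execute on a directory tree via POSIX ACLs.
# Account URL, container, directory, and group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("curated").get_directory_client("sales")

# ACL entry format "<scope>:<object id>:<permissions>"; applied recursively to existing children.
directory.update_access_control_recursive(
    acl="group:00000000-0000-0000-0000-000000000000:r-x"
)
```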

Data format: the Avro file format is favored when I/O patterns are more write-heavy or when query patterns retrieve multiple rows of records in their entirety, e.g. a message bus such as Event Hubs or Kafka writing multiple events/messages in succession. The Parquet and ORC file formats are favored when I/O patterns are more read-heavy and/or when query patterns focus on a subset of columns in the records.
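
A minimal sketch of why columnar formats suit column-subset queries, using pyarrow (file and column names are illustrative):

```python
# Write a small Parquet file, then read back only one column; the other columns' data is skipped.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [9.5, 12.0, 3.25],
    "country": ["US", "DE", "JP"],
})
pq.write_table(table, "orders.parquet", compression="snappy")

amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts.to_pandas())
```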

Cost of the data lake: apply data lifecycle management, choose the right data redundancy, and optimize for more data per transaction, since read and write transactions are billed in 4 MB increments.

Data lake monitoring: Azure Storage logs in Azure Monitor. Log Analytics with KQL.
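
A hedged sketch of querying those logs with KQL from Python, assuming diagnostic settings already route StorageBlobLogs to a Log Analytics workspace (the workspace ID is a placeholder) and using the azure-monitor-query package:

```python
# Sketch: summarize storage blob operations over the last day with KQL.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
kql = """
StorageBlobLogs
| where TimeGenerated > ago(1d)
| summarize Requests = count() by OperationName, StatusCode
| order by Requests desc
"""
response = client.query_workspace("<workspace-id>", kql, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```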

Performance: use larger files (target at least 100 MB or more) for better performance, and leverage Query Acceleration.
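
One common way to get larger files is periodic compaction; a minimal PySpark sketch (paths and partition count are placeholders, not from the source):

```python
# Sketch: compact many small files into a handful of larger ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/pos/2024/08/")

# Pick the output partition count so each file lands in roughly the 100 MB-1 GB range.
df.coalesce(8).write.mode("overwrite").parquet(
    "abfss://enriched@mydatalake.dfs.core.windows.net/sales/pos/2024/08/"
)
```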

Folder layout of the raw, enriched, and curated data zones

AWS data lake cost: use S3 Storage Classes, data lifecycle policies, and S3 object tagging; combine small files and compress data to maximize data retention and reduce storage costs
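
As an illustration of lifecycle policies, a hedged boto3 sketch that tiers and expires objects under a raw prefix (bucket name, prefix, and day thresholds are placeholders):

```python
# Sketch: S3 lifecycle rule that moves raw data to cheaper tiers and eventually expires it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```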

Performance: Combine Small Files

Governance: Metadata management

https://library.si.edu/sites/default/files/pdf/rdm_best_practices.pdf

Mainly focuses on data storage: file formats, file organization, data dictionaries/metadata

Architecting an Open Data Lake for the Enterprise

Phase 1: capture and store raw data of different types at scale; Phase 2: augment the enterprise DW strategy. Elements: data preparation, data profiling, governance, self-service data ingest, metadata management, data lineage, data classification, security

Project vision: bring your own tool, decentralized data ownership, centralized processes/tools, capture all data as close to real time as possible, self-service access by authorized users, encryption compliance

Governance

Define clear roles and responsibilities (Data Owners, Data Stewards, Data Managers or Custodians, Data Consumers, Data Governance Council, Data Architects, Data Quality Analysts)

Data governance in Dataplex. Solving dark data: a data marketplace, tagging (with GenAI considered to assist metadata enrichment), and surfacing data value

Automatic cataloguing of data sources, column lineage

Data management

https://people.umass.edu/biostat690c/pdf/1.%20%20Principles%20of%20Data%20Management%202020.pdf

Data Management Plan (DMP) (outlines how to collect, organize, manage, store, describe, secure, backup, preserve, and share data), Manual of Operation Procedures (MOOP)

https://data.uq.edu.au/files/6833/Data%20Governance%20Essentials%20handbook%20August%202021.pdf

https://standards.ieee.org/wp-content/uploads/import/governance/iccom/bdgmm-standards-roadmap-2020.pdf

Lists challenges of big data management in a few industrial case studies.

Data governance playbook

Data quality

Inconsistent formats/units/languages, incomplete information, duplicates, inaccurate values, invalid values
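
A minimal pandas sketch of checks for these issue types (file, column names, and rules are hypothetical):

```python
# Sketch: count duplicates, missing values, and invalid values in a dataset.
import pandas as pd

df = pd.read_parquet("orders.parquet")

report = {
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_order_ids": int(df["order_id"].isna().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
    # Treat anything that is not a two-letter uppercase code (or is missing) as invalid.
    "invalid_country_codes": int((~df["country"].str.fullmatch(r"[A-Z]{2}", na=False)).sum()),
}
print(report)
```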

Security

Main challenges: access control (RBAC), data protection (data loss prevention, encryption), privacy, and compliance (data masking and tokenization, logs of data access and modifications, regular audits and compliance checks, anomaly detection and threat intelligence)

Access control (bucket policies, Lake Formation), encryption (S3 encryption with AWS KMS, replication, Object Lock), object tagging
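
A hedged boto3 sketch of one of these controls, default SSE-KMS bucket encryption (bucket name and KMS key ARN are placeholders):

```python
# Sketch: enforce default server-side encryption with a customer-managed KMS key.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```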

Metadata management

Metadata helps

  1. Data discovery (providing details such as its source, its structure, its meaning, its relationships with other data, and its usage.)
  2. Data quality (lineage of data (where it came from, who modified it, when, and why) can help track and fix data quality issues.)
  3. Data governance and compliance (comply with data privacy and protection regulations. It allows tracking of who has access to what data, what they can do with it, and what they have done with it.)
  4. Data integration (map data elements across different systems, enabling a consistent view of data across the enterprise.)
  5. Data security (helps implement security controls, such as access permissions and data masking.)
  6. AI and machine learning (helps algorithms understand what the data represents, which can improve their performance.)

Functionality of metadata management

  1. Data cataloging (record information about the source, format, structure, and content of the data, and the transformations applied; see the registration sketch after this list)
  2. Data lineage tracking (record where it came from, who has accessed it, what changes have been made to it, and where it has been used.)
  3. Semantic tagging (meaning and context of the data)
  4. Data quality monitoring (check whether the data conforms to specified formats, ranges, or other constraints, and alert users to any anomalies.)
  5. Data governance enforcement (control who has access to what data, what they can do with it, and what they have done with it)
  6. Integration with other tools (feed metadata to data integration tools to facilitate data mapping and transformation.)
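
To make the data cataloging point concrete, a hedged sketch that registers a Parquet dataset in the AWS Glue Data Catalog (database, table, columns, and S3 location are placeholders):

```python
# Sketch: register a curated Parquet dataset in the Glue Data Catalog with basic metadata.
import boto3

glue = boto3.client("glue")
glue.create_table(
    DatabaseName="sales",
    TableInput={
        "Name": "daily_revenue",
        "Parameters": {"classification": "parquet", "owner": "sales-data-stewards"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_date", "Type": "date"},
                {"Name": "revenue", "Type": "double"},
            ],
            "Location": "s3://my-data-lake/curated/sales/daily_revenue/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```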

Operational

Lifecycle management

JSON rules for lifecycle management policy on blobs (move between tiers, delete versions)
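
A hedged example of what such a rule can look like (prefix, thresholds, and rule name are placeholders), expressed as a Python dict that serializes to the policy JSON:

```python
# Sketch: lifecycle rule that tiers base blobs down over time and deletes old versions.
import json

policy = {
    "rules": [
        {
            "name": "tier-raw-and-prune-versions",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    },
                    "version": {"delete": {"daysAfterCreationGreaterThan": 90}},
                },
            },
        }
    ]
}
print(json.dumps(policy, indent=2))
```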

AWS CloudTrail logs every S3 API call, complementing S3 server access logging. S3 Inventory audits and reports on the replication and encryption status of data.

S3 Intelligent-Tiering provides automatic cost savings by moving data between frequent- and infrequent-access tiers as access patterns change

Main goals of DLM

  1. To protect the confidentiality of data
  2. To ensure data integrity
  3. To ensure the availability of data

https://static.tti.tamu.edu/tti.tamu.edu/documents/PRC-17-84-F.pdf

Lifecycle: Collect, Process, Store and secure, Use, Share and communicate, Archive, Destroy or re-use (phases can run concurrently). The Delta Lake retention period is configured for table versioning/history (time-travel queries).
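
A hedged PySpark sketch of setting Delta retention table properties for time travel (table name and intervals are placeholders):

```python
# Sketch: configure how much history a Delta table keeps for time-travel queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-retention").getOrCreate()

spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Time-travel read against the retained history.
previous = spark.sql("SELECT * FROM sales.orders VERSION AS OF 42")
```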

Data lifecycle management (formerly information governance) for M365 and Exchange data: retention policies (applied to all items in a specific location) retain or delete content; retention labels (applied to single items) choose what happens after the retention period and how items are disposed of; auto-labeling policies apply labels automatically. Locations: Exchange mailboxes, OneDrive accounts, M365 Group mailboxes, Skype for Business, Exchange public folders, Teams messages/chats, etc. Adaptive scopes can select a specific attribute and define its values.

Test data management

Archiving

Disaster recovery

DR strategies for

Databricks Objects: use CI/CD and Infrastructure as Code (IaC) tooling, e.g. Terraform (TF), Databricks Repos, Databricks REST APIs.

Databases and Tables: data replication using the underlying object storage (GRR; files that cannot be converted to Delta should rely on GRR), and Delta DEEP CLONE to simplify replication for DR (load multiple storage accounts onto the same cluster)
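
A hedged sketch of the DEEP CLONE approach mentioned above (storage paths are placeholders):

```python
# Sketch: replicate a Delta table to secondary-region storage for DR with DEEP CLONE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr-deep-clone").getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE delta.`abfss://dr@secondarylake.dfs.core.windows.net/sales/orders`
    DEEP CLONE delta.`abfss://prod@primarylake.dfs.core.windows.net/sales/orders`
""")
```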

Workspace replication: there is no automatic continuous sync between different workspaces; a custom solution is needed (e.g. batch sync, stream sync, file sync)

Use Databricks Terraform Provider to provision Infra

Databricks Sync (DBSync)

Additional regional disaster recovery topology support and a more concrete approach:

  • Provision multiple Azure Databricks workspaces in paired Azure regions
  • Use geo-redundant storage which by default matches Databricks workspaces
  • Migrate the users, user folders, notebooks, cluster configuration, jobs configuration, libraries, storage, init scripts, and reconfigure access control with scripts provided in the article

Appendix

https://docs.aws.amazon.com/pdfs/whitepapers/latest/building-data-lakes/building-data-lakes.pdf

AWS services integrated with S3 for ingestion: Amazon Kinesis Data Firehose, AWS Snow Family, AWS Glue, AWS DataSync, AWS Transfer Family, Storage Gateway, Apache Hadoop distributed copy command, AWS Database Migration Service

Governance: AWS Glue catalog and search, AWS Lake Formation

https://info.talend.com/rs/talend/images/WP_EN_BD_TDWI_DataLakes.pdf

http://pages.matillion.com/rs/992-UIW-731/images/2019C2%20-%20Data%20Lakes%20eBook.pdf


Written by Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified