Data Lake Best Practices
I previously wrote an article on data warehouses, data lakes, data lakehouses, data fabric, and data mesh. Since enterprise data comes in many types (e.g. structured data, files (text, PDF, docs), images, audio, video), the data lake is usually the first entry point for an enterprise data platform. Let's survey best practices for data lakes.
Design/Architecture
Deciding the number of storage accounts. Key considerations: customer geographic locations, isolation of management policies, and cost/billing boundaries versus the overhead of managing multiple accounts and copying data back and forth. When analytics scenarios span multiple geographic regions, provision region-specific storage accounts to hold each region's data and share only specific data sets with other regions. Create separate storage accounts (ideally in different subscriptions) for your development and production environments. Enterprises that operate a multi-tenant analytics platform serving multiple customers may provision individual data lakes per customer in different subscriptions, so that customer data and the associated analytics workloads stay isolated and cost/billing can be managed per customer.
Data organization: create different folders or containers (considerations on folders vs. containers below) for the different data zones (raw, enriched, curated, and workspace data sets). Within a zone, organize data in folders by a logical separation, e.g. datetime, business unit, or both.
Access control of data: RBAC (scoped to top-level resources such as storage accounts or containers) and ACLs (files and directories). Create security groups, add security principals to the security groups, and assign ACLs to the security groups rather than to individual principals.
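A minimal sketch of that pattern, assuming the azure-storage-file-datalake SDK and hypothetical account, container, directory, and security-group object ID values: the group gets read + execute on an existing directory via an ACL entry, while RBAC stays coarse-grained at the account/container level.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account, container (file system), and directory.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("curated").get_directory_client("sales")

# Grant a security group (placeholder object ID) read + execute on the
# directory and everything under it; new child items would also need a
# matching "default:group:<id>:r-x" entry to inherit the permission.
directory.update_access_control_recursive(
    "group:00000000-0000-0000-0000-000000000000:r-x"
)
```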
Data format: the Avro file format is favored when I/O patterns are write-heavy or when queries retrieve multiple rows of records in their entirety, e.g. a message bus such as Event Hubs or Kafka writing multiple events/messages in succession. Parquet and ORC are favored when I/O patterns are read-heavy and/or queries focus on a subset of columns in the records.
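To make the columnar point concrete, here is a small sketch with pyarrow and a hypothetical three-column data set: write Parquet, then read back only the columns a query needs.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical curated data set.
table = pa.table({
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "amount": [10.0, 24.5, 7.2],
})
pq.write_table(table, "orders.parquet", compression="snappy")

# Columnar formats let the reader fetch only the projected columns, which is
# why Parquet/ORC suit read-heavy, column-focused query patterns.
subset = pq.read_table("orders.parquet", columns=["order_id", "amount"])
print(subset)
```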
Cost of the data lake: apply data lifecycle management, pick the right data redundancy, and pack more data into each transaction, since read and write transactions are billed in 4 MB increments.
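A back-of-the-envelope sketch of the 4 MB billing increment (illustrative numbers only): batching the same data into fewer, larger reads/writes lowers the billed transaction count.

```python
import math

TXN_INCREMENT_MB = 4  # read/write operations are billed in 4 MB increments

def billed_ops(file_size_mb: float, n_files: int) -> int:
    """Billed operations to read/write n_files of file_size_mb each."""
    return n_files * max(1, math.ceil(file_size_mb / TXN_INCREMENT_MB))

# Sixteen 1 MB objects cost 16 billed operations,
# while the same 16 MB in a single object costs only 4.
print(billed_ops(1, 16), billed_ops(16, 1))
```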
Data lake monitoring: Azure Storage logs in Azure Monitor. Log Analytics with KQL.
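A sketch of pulling those logs programmatically, assuming diagnostic settings route StorageBlobLogs to a Log Analytics workspace and using the azure-monitor-query package; the workspace ID is a placeholder.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Summarize the last day of blob operations by operation name and status code.
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-guid>",  # placeholder
    query="""
        StorageBlobLogs
        | where TimeGenerated > ago(1d)
        | summarize requests = count() by OperationName, StatusCode
        | order by requests desc
    """,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```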
Performance: prefer larger files (target at least 100 MB) for better performance. Leverage Query Acceleration.
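A sketch of Query Acceleration with the azure-storage-blob SDK (hypothetical container, blob, and column names): the SQL-like filter is pushed down to the storage service so only matching rows are returned.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, DelimitedTextDialect

service = BlobServiceClient(
    account_url="https://mydatalake.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
blob = service.get_container_client("raw").get_blob_client("sales/2024/orders.csv")

# Only rows with amount > 100 leave the data lake; the full CSV is never downloaded.
dialect = DelimitedTextDialect(delimiter=",", quotechar='"', has_header=True)
reader = blob.query_blob(
    "SELECT order_id, amount FROM BlobStorage WHERE amount > 100",
    blob_format=dialect,
    output_format=dialect,
)
print(reader.readall().decode("utf-8"))
```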
Folder layout of the raw, enriched, and curated data zones
AWS data lake, cost: use S3 storage classes, data lifecycle policies, and S3 object tagging; combine small files and compress data to maximize data retention and reduce storage costs
Performance: Combine Small Files
Governance: Metadata management
https://library.si.edu/sites/default/files/pdf/rdm_best_practices.pdf
Mainly focuses on data storage: file formats, file organization, data dictionaries/metadata
Architecting an Open Data Lake for the Enterprise
Phase 1: capture and store raw data of different types at scale, Phase 2: augment enterprise DW strategy. Elements: data preparation, data profiling, governance, self-service data ingest, metadata management, data lineage, data classification, security
Project vision: Bring your own tool, decentralized data ownership, centralized processes/tools, capture all data as real-time as possible, self-service access by authorized user, encryption compliance
Governance
Define clear roles and responsibilities (Data Owners, Data Stewards, Data Managers or Custodians, Data Consumers, Data Governance Council, Data Architects, Data Quality Analysts)
Data governance in Dataplex. Solving dark data: data marketplace, tagging (consider using GenAI to assist metadata enrichment), surfacing data value
Automatic cataloguing of data sources, column lineage
Data management
https://people.umass.edu/biostat690c/pdf/1.%20%20Principles%20of%20Data%20Management%202020.pdf
Data Management Plan (DMP) (outlines how to collect, organize, manage, store, describe, secure, backup, preserve, and share data), Manual of Operation Procedures (MOOP)
https://data.uq.edu.au/files/6833/Data%20Governance%20Essentials%20handbook%20August%202021.pdf
Lists challenges of big data management in a few industrial case studies.
Data quality
Inconsistent formats/units/languages, incomplete information, duplicates, inaccurate values, invalid values
Security
Main challenges: Access Control (RBAC), Data Protection (data loss prevention, encryption), Privacy, and Compliance (Data Masking and Tokenization, Logs of Data Access and Modifications, Regular Audits and Compliance Checks, Anomaly Detection and Threat Intelligence)
Access control (bucket policies, Lake Formation), encryption (S3 encryption with AWS KMS, replication, object lock), object tagging
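As one concrete piece of that, a sketch with boto3 that sets default SSE-KMS encryption on a hypothetical bucket with a hypothetical KMS key alias.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-data-lake-bucket",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/my-data-lake-key",  # hypothetical key alias
                },
                "BucketKeyEnabled": True,  # reduce KMS request costs
            }
        ]
    },
)
```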
Metadata management
Metadata helps
- Data discovery (providing details such as its source, its structure, its meaning, its relationships with other data, and its usage.)
- Data quality (lineage of data (where it came from, who modified it, when, and why) can help track and fix data quality issues.)
- Data governance and compliance (comply with data privacy and protection regulations. It allows tracking of who has access to what data, what they can do with it, and what they have done with it.)
- Data integration (map data elements across different systems, enabling a consistent view of data across the enterprise.)
- Data security (helps implement security controls, such as access permissions and data masking.)
- AI and machine learning (helps algorithms understand what the data represents, which can improve their performance.)
Functionality of metadata management
- Data cataloging (record information about the source, format, structure, and content of the data, and the transformations applied; see the catalog sketch after this list)
- Data lineage tracking (record where it came from, who has accessed it, what changes have been made to it, and where it has been used.)
- Semantic tagging (meaning and context of the data)
- Data quality monitoring (check whether the data conforms to specified formats, ranges, or other constraints, and alert users to any anomalies.)
- Data governance enforcement (control who has access to what data, what they can do with it, and what they have done with it)
- Integration with other tools (feed metadata to data integration tools to facilitate data mapping and transformation.)
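As a sketch of the data-cataloging function, here is a boto3 call registering a Parquet data set in the AWS Glue Data Catalog; the database, table, columns, and S3 location are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Register a curated Parquet data set so it becomes discoverable through the catalog.
glue.create_table(
    DatabaseName="sales_curated",  # hypothetical Glue database
    TableInput={
        "Name": "orders",
        "Description": "Curated orders, partitioned by ingest date",
        "Parameters": {"classification": "parquet", "data_owner": "sales-team"},
        "PartitionKeys": [{"Name": "ingest_date", "Type": "string"}],
        "StorageDescriptor": {
            "Location": "s3://my-data-lake-bucket/curated/sales/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```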
Operational
Lifecycle management
JSON rules for lifecycle management policy on blobs (move between tiers, delete versions)
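A sketch of such a rule, expressed here as a Python dict mirroring the Azure Blob Storage lifecycle-policy JSON; the rule name and prefix are hypothetical. It tiers aging blobs to cool and then archive, deletes them after two years, and cleans up old versions.

```python
# The equivalent JSON can be pasted into the storage account's lifecycle
# management blade or applied via the management API/CLI.
lifecycle_policy = {
    "rules": [
        {
            "name": "tier-and-expire-raw-zone",  # hypothetical rule name
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["raw/"],  # hypothetical zone prefix
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 730},
                    },
                    "version": {"delete": {"daysAfterCreationGreaterThan": 90}},
                },
            },
        }
    ]
}
```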
AWS CloudTrail logs every API call made to S3; S3 server access logging records detailed request-level logs. S3 Inventory audits and reports the replication and encryption status of objects.
S3 Intelligent-Tiering provides automatic cost savings by moving data between frequent and infrequent access tiers when the access patterns change
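The S3 counterpart, sketched with boto3 under a hypothetical bucket and prefix: transition objects to Intelligent-Tiering after 30 days, to Glacier after a year, and expire old noncurrent versions.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-zone-tiering",
                "Filter": {"Prefix": "raw/"},  # hypothetical zone prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)
```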
Main goals of DLM
- To protect data confidentiality
- To ensure data integrity
- To ensure the availability of data
https://static.tti.tamu.edu/tti.tamu.edu/documents/PRC-17-84-F.pdf
Lifecycle: Collect, Process, Store and secure, Use, Share and communicate, Archive, Destroy or re-use (phases can run concurrently). The Delta Lake retention period is configured for table versioning/history (time-travel queries), as sketched below.
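A sketch of those Delta retention settings and a time-travel read, assuming a Spark session with Delta Lake (e.g. on Databricks) and a hypothetical `events` table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is predefined

# Control how much history (and therefore time travel) the table keeps.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',        -- time-travel window
        'delta.deletedFileRetentionDuration' = 'interval 7 days'  -- how long VACUUM keeps removed files
    )
""")

# Time-travel query against an earlier version of the table.
previous = spark.sql("SELECT * FROM events VERSION AS OF 12")
previous.show()
```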
Data lifecycle management (formerly information governance) in M365, Exchange data: retention policies (all items in a specific location): retain/delete; retention labels (on a single item) to choose what happens after the retention period/how to dispose; auto-labeling policies. Locations: Exchange mailboxes, OneDrive accounts, M365 group mailboxes, Skype for Business, Exchange public folders, Teams messages/chats, etc. Adaptive scopes (select a specific attribute and define its value).
Test data management
Archiving
Disaster recovery
DR strategies for:
Databricks Objects: use CI/CD and Infrastructure as Code (IaC) tooling, e.g. Terraform (TF), Databricks Repos, Databricks REST APIs.
Databases and Tables: data replication using the underlying object storage (GRR; files that cannot be converted to Delta should rely on GRR), and Delta DEEP CLONE to simplify replication for DR (load data from multiple storage accounts on the same cluster); see the sketch after this list.
Workspace replication: no automatic continuous sync between different workspaces; a custom solution is needed (e.g. batch sync, stream sync, file sync)
Use Databricks Terraform Provider to provision Infra
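A sketch of the DEEP CLONE approach on Databricks, with hypothetical table names and a hypothetical DR storage path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is predefined

# Copy the table's data and metadata into the DR region's storage account.
# Re-running the statement is incremental (only new or changed files are copied),
# so it can be scheduled as a periodic DR sync job.
spark.sql("""
    CREATE OR REPLACE TABLE dr_prod.sales_orders
    DEEP CLONE prod.sales_orders
    LOCATION 'abfss://dr@drdatalake.dfs.core.windows.net/delta/sales_orders'
""")
```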
Additional regional disaster recovery topology support and a more concrete approach:
- Provision multiple Azure Databricks workspaces in paired Azure regions
- Use geo-redundant storage, which is the default for the storage associated with Databricks workspaces
- Migrate the users, user folders, notebooks, cluster configurations, job configurations, libraries, storage, and init scripts, and reconfigure access control, with the scripts provided in the article
Appendix
https://docs.aws.amazon.com/pdfs/whitepapers/latest/building-data-lakes/building-data-lakes.pdf
AWS services integrated with S3 for ingestion: Amazon Kinesis Data Firehose, AWS Snow Family, AWS Glue, AWS DataSync, AWS Transfer Family, Storage Gateway, Apache Hadoop distributed copy command, AWS Database Migration Service
Governance: AWS Glue catalog and search, AWS Lake Formation
https://info.talend.com/rs/talend/images/WP_EN_BD_TDWI_DataLakes.pdf
http://pages.matillion.com/rs/992-UIW-731/images/2019C2%20-%20Data%20Lakes%20eBook.pdf