Modern data stack stories roundup 2023.2

Data engineering articles that are interesting to read

Xin Cheng
11 min read · Feb 24, 2023

General

Data Architect design area and platform/tools selection

Ingestion framework: ingest data from various sources (CSV, RDBMS, NoSQL, Kafka) using a no-code/low-code/GUI-based tool (AWS Glue, Azure Data Factory, Amazon EMR, Databricks, or Azure Synapse Analytics) or a managed ingestion SaaS (Fivetran or Airbyte).

ETL/ELT

Workflow Orchestration: AWS Step Functions or Azure Data Factory, Amazon MWAA (Amazon Managed Workflows for Apache Airflow) or Databricks Workflows, Astronomer, Dagster, Prefect.
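Several of these options (Amazon MWAA, Astronomer) are managed Apache Airflow, so a minimal Airflow DAG sketch gives a feel for the orchestration model; the DAG id, schedule, and task callables below are hypothetical placeholders.

    # Minimal Apache Airflow DAG sketch: two Python tasks run daily in sequence.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("extract from source")

    def load():
        print("load into the data lake")

    with DAG(
        dag_id="daily_ingestion",          # hypothetical pipeline name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task          # extract runs before load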

Designing Control and Audit Tables: enable operational dashboards showing, for example, the number of files pending, completed, or aborted and the number of records in each batch; define control table schemas and attributes; decide on the best approach to identify which records in the target table were inserted or updated by which batch.
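As a purely illustrative sketch (not the referenced article's design), a batch control table might look like the following; tagging target-table rows with the same batch_id is one way to trace which batch inserted or updated each record. All names are hypothetical.

    # Hypothetical batch control table sketch using SQLite.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("control.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS batch_control (
            batch_id      TEXT PRIMARY KEY,
            source_file   TEXT,
            status        TEXT,      -- pending / completed / aborted
            record_count  INTEGER,
            started_at    TEXT,
            finished_at   TEXT
        )
    """)

    # Register a new batch, then mark it completed once the load finishes.
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO batch_control VALUES (?, ?, 'pending', NULL, ?, NULL)",
        ("batch_20230224_001", "orders_2023-02-24.csv", now),
    )
    conn.execute(
        "UPDATE batch_control SET status = 'completed', record_count = ?, finished_at = ? "
        "WHERE batch_id = ?",
        (15000, datetime.now(timezone.utc).isoformat(), "batch_20230224_001"),
    )
    conn.commit()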

Data storage: for raw, transformed, and curated data, should it be a Data Lake, a Data Warehouse, or a Lakehouse?

File and Open Table Formats: Apache Hudi vs. Apache Iceberg vs. Delta Lake

Data Catalog: AWS Glue Catalog or Databricks Unity Catalog, Azure Purview, Atlan, Alation

Data Validations and Transformations: what are the “must-have” data validations? Options include Databricks DLT, AWS Deequ, Great Expectations, Soda, and data observability platforms.
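A minimal sketch of typical “must-have” checks (non-null keys, uniqueness, value ranges) using the classic Great Expectations pandas API; the API has changed across releases, so the exact calls here are an assumption tied to pre-1.0 versions, and the DataFrame is a placeholder.

    # Classic Great Expectations pandas API sketch (pre-1.0 style).
    import pandas as pd
    import great_expectations as ge

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.5, 20.0, 7.25],
    })

    ge_df = ge.from_pandas(df)

    # Typical must-have validations: not-null keys, uniqueness, value ranges.
    results = [
        ge_df.expect_column_values_to_not_be_null("order_id"),
        ge_df.expect_column_values_to_be_unique("order_id"),
        ge_df.expect_column_values_to_be_between("amount", min_value=0),
    ]

    print(all(r.success for r in results))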

Data Protection

Access Control and Sensitive Data Handling: define an access control strategy based on the roles of each consumer, using RBAC (role-based access control) or ABAC (attribute-based access control); options include AWS Lake Formation for column hiding, Snowflake/Databricks dynamic masking, or SaaS products such as Immuta or Raito. Consider table-level, column-level, and row-level access control.
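A purely illustrative sketch contrasting an RBAC table-level check with an ABAC row-level check; the roles, attributes, and policy logic are hypothetical and not tied to any of the products above.

    # Toy RBAC vs. ABAC access checks; everything here is hypothetical.
    from dataclasses import dataclass

    @dataclass
    class User:
        name: str
        role: str          # RBAC: what the user is
        department: str    # ABAC: attributes describing the user
        region: str

    def rbac_allows(user: User, table: str) -> bool:
        # RBAC: access is granted purely by role.
        allowed = {
            "analyst": {"sales_summary"},
            "admin": {"sales_summary", "customers_pii"},
        }
        return table in allowed.get(user.role, set())

    def abac_allows(user: User, row: dict) -> bool:
        # ABAC (row-level): access depends on attributes of the user and the data.
        return user.department == "sales" and row["region"] == user.region

    alice = User("alice", role="analyst", department="sales", region="EU")
    print(rbac_allows(alice, "customers_pii"))                 # False
    print(abac_allows(alice, {"region": "EU", "amount": 42}))  # True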

Hadoop challenges

Operational

  • Heavy and complex infrastructure maintenance.
  • High operational cost.
  • Frequent downtimes with aging systems.
  • Disaster recovery is challenging and complex.
  • Scalability of storage, compute, and memory requires new infrastructure, causing implementation delays.
  • Lack of enterprise-grade data governance and security tools.
  • Fine-grained auditing of data access and monitoring are complex to implement.

Technical

  • Multiple technologies are used to solve similar use cases.
  • Integration challenges with modern tools for ETL, visualization, and reporting.
  • Complexity in fine-grained access controls.
  • Limited scaling for advanced analytical and AI capabilities.

Databricks benefits

Operational

  • Cloud infrastructure reduces maintenance and operational cost.
  • Very high level of uptime guarantee from the cloud platform.
  • The platform provides capabilities for data governance, lineage tracking, monitoring, and auditing.
  • The ability to scale up/down ensures the required processing and storage capacity is always available.
  • Simpler DR implementation, backed by cloud platforms that guarantee data availability across regions.

Technical

  • Unified programming model for various use cases.
  • Simpler integration with other vendor tools, backed by cloud platform integration capabilities.
  • Better security, authorization, and governance controls.
  • Integrated ML, AI, and analytics capabilities.

Data governance

The article paints a step-by-step picture of why data mesh is needed and why each of its principles exists.

Domain-oriented ownership: this is the most important principle of data mesh; since a central data team cannot cope with fast-growing data demands, ownership is given back to the team that knows the domain data best. Side effect: even after decentralization, data is still siloed within each domain, raising concerns over the accessibility and usability of the data.

Data as a product: this principle addresses the data siloing and accessibility issues created by principle 1 and enables data consumers to easily discover, understand, and securely access data distributed across many domains. Side effect: a heavy cost of ownership to build, run, and maintain a data product, which slows innovation and increases the lead time for building new data products.

Self-serve data infrastructure as a platform: a shared data platform team provides data infrastructure to the domain teams, abstracting the complex underlying details away from them.

Federated computational governance: governance needs to be managed across domains, meaning a central team takes responsibility for cross-domain functionality and provides libraries, so that each domain team can focus on delivering business value rather than infrastructure.

The article explains how to evaluate data catalog products and describes the general features of a data catalog.

Data catalog generations

1st generation: basic software, similar to Excel, that syncs with your data warehouse.
2nd generation: software designed to help the data steward maintain data documentation (metadata), lineage, and treatments.
3rd generation: software designed to deliver business value to end users automatically within hours of deployment, then guide users to document in a collaborative, painless way.

Metric 5W1H framework: what, why, where, who, when, and how as the metadata for data documentation.
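A hypothetical example of applying 5W1H to document a single metric; every name and value below is invented for illustration.

    # 5W1H documentation for one (invented) metric.
    monthly_active_users = {
        "what":  "Count of distinct users with at least one session in the month",
        "why":   "Primary growth KPI reviewed by leadership",
        "where": "analytics.metrics.monthly_active_users (warehouse table)",
        "who":   "Owned by the growth data team",
        "when":  "Refreshed daily at 02:00 UTC",
        "how":   "Derived from raw session events, deduplicated by user_id",
    }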

Metadata/data catalog

A data catalog is to data what a library catalog is to books.

3 Types of metadata

  1. Technical metadata: provides information on the structure, format, and location of the data (the data dictionary), including schemas, tables, columns, file names, report names, and anything else documented by the source system.
  2. Business metadata: provides the meaning and business context of data, defined in everyday language regardless of technical implementation, including business descriptions, comments, annotations, classifications, fitness-for-use, and ratings.
  3. Operational metadata: describes details of the processing and accessing of data, including data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), access management, and monitoring information (warehouse usage statistics, profiling, error reports, and audit trails).
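As an illustration only, the three metadata types might be recorded for a single (hypothetical) table like this:

    # Technical, business, and operational metadata for one invented table.
    orders_metadata = {
        "technical": {
            "schema": "sales",
            "table": "orders",
            "columns": ["order_id INT", "customer_id INT", "amount DECIMAL(10,2)"],
            "location": "s3://datalake/curated/sales/orders/",
        },
        "business": {
            "description": "One row per confirmed customer order",
            "classification": "internal",
            "fitness_for_use": "Certified for revenue reporting",
        },
        "operational": {
            "lineage": ["raw.order_events -> curated.sales.orders"],
            "currency": "active",
            "last_load": "2023-02-24T02:00:00Z",
            "row_count": 1250000,
        },
    }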

A comprehensive list

The article tests two open source data catalog solutions:

  1. Magda: maintained by a small team. The author ran into several issues and eventually had to postpone further research of the Magda Data Catalog.
  2. Amundsen: according to its architecture, the components are:
    Frontend Service — a Flask application with a React frontend where users can search through the fetched metadata as in a data catalog.
    Search Service — leverages Elasticsearch for search capabilities, used to power frontend metadata searching; Elasticsearch is a search engine based on the Lucene library whose API is used here to connect the metadata from the Databuilder to the Search Service.
    Metadata Service — leverages Neo4j (or Apache Atlas) as the persistence layer to provide various metadata; Neo4j is a graph database management system, used here to fetch and display the metadata from the Databuilder and provide it to the Metadata Service.
    Databuilder ingestion: a data ingestion library inspired by Apache Gobblin. Its purpose is to transport the data from all the different metadata sources (such as Postgres, Hive, BigQuery, GCS, etc.) to the Neo4j and Elasticsearch services.
    According to the Amundsen extractors list, the extractors can retrieve schema information from sources that expose it (e.g. BigQuery, relational databases, the Kafka schema registry, AWS Glue (which covers S3), Delta Lake). Currently, it seems to be missing major connectors to the Microsoft Purview and Google Cloud data catalogs.
    Business Glossary: currently the only option Amundsen provides for a business glossary is the so-called Custom Descriptions, written in Markdown.
    Data Profiling: supported by pandas/pandas-profiling (see the sketch below).
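A minimal pandas-profiling sketch (the package has since been renamed ydata-profiling); the DataFrame is a placeholder.

    # Generate an HTML data profile for a small placeholder DataFrame.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

    profile = ProfileReport(df, title="orders profile", minimal=True)
    profile.to_file("orders_profile.html")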

Syntio continues to test open source data catalog solutions; this article covers OpenMetadata.

According to its architecture, the components are:

  • User Interface (UI) — a central place for users to discover and collaborate on all data.
  • Ingestion framework — a pluggable framework for integrating tools and ingesting metadata to the metadata store. The ingestion framework already supports well-known data warehouses.
  • Metadata APIs — for producing and consuming metadata built on schemas for User Interfaces and for Integrating tools, systems, and services.
  • Metadata/Entity store — stores a metadata graph that connects data assets and user and tool generated metadata.
  • Search Engine — Elasticsearch to support querying metadata

Connectors list: includes relational databases, BigQuery, and Kafka.

Business Glossary: seems more mature than Amundsen's. A glossary is a controlled vocabulary that helps establish consistent meaning for terms, build a common understanding, and form a knowledge base. Glossary terms can also help organize or discover data entities. OpenMetadata models a glossary that organizes terms with hierarchical, equivalent, and associative relationships.

Data Profiling: Data profiler is enabled as part of metadata ingestion, using a separate configuration file.

DataHub

Architecturally, DataHub is similar to OpenMetadata; the components are:

Ingestion Framework: a modular, extensible Python library for extracting metadata from external source systems (e.g. Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's Metadata Model, and writing it into DataHub either via Kafka or by using the Metadata Store REST APIs directly. To create an ingestion pipeline, you can use a configuration recipe file or Python.

Metadata Store: stores the Entities & Aspects comprising the Metadata Graph. It consists of a Spring Java service hosting a set of Rest.li API endpoints, along with MySQL, Elasticsearch, and Kafka for primary storage and indexing (the graph index can be Neo4j or Elasticsearch).

GraphQL API: provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata Graph simple. This API is consumed by the User Interface.

User Interface: a React UI to support data discovery and data governance.

Data Lineage: column-level lineage support was recently added, along with lineage from external sources such as Airflow, Spark, Snowflake, and BigQuery.

Analytics: usage analytics.

Data Profiling: SQL Profiler

The article provides data catalog vendors (open source, commercial and cloud service providers) from Gartner and DBMS Tools, as well as a very high-level overview of DataHub features.

DataHub offers a Helm chart, making deployment to Kubernetes straightforward. Authors used ArgoCD, the declarative GitOps continuous delivery tool for Kubernetes, to deploy the DataHub Helm charts to Amazon EKS.

In order for a data catalog to work, it needs to be able to connect to different sources to ingest metadata. DataHub uses a plugin architecture: plugins let you install only the data source dependencies you need. For example, if you want to ingest metadata from Amazon Athena, just install the Athena plugin: pip install 'acryl-datahub[athena]'. The DataHub CLI provides a local environment to test DataHub and run ingestion, which is useful during the development phase.

From this article: once you define the source, transformers, and sink, the ingestion pipeline will crawl all objects and populate the DataHub metadata store.
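A minimal recipe expressed in Python rather than YAML, as a sketch only: the connection details are placeholders, the exact config keys depend on the source plugin version, and it assumes the matching plugin (e.g. pip install 'acryl-datahub[postgres]') is installed.

    # Run a DataHub ingestion pipeline from a Python recipe dict.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create({
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",   # placeholder connection details
                "database": "mydb",
                "username": "datahub",
                "password": "example",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    })
    pipeline.run()
    pipeline.raise_from_status()   # fail loudly if ingestion reported errors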

It also mentions two more DataHub modules:

Metadata Change Event Consumer (optional): Its main function is to listen to change proposal events emitted by clients of DataHub which request changes to the Metadata Graph. It then applies these requests against DataHub’s storage layer: the Metadata Service.

Metadata Audit Event Consumer (optional): Its main function is to listen to change log events emitted as a result of changes made to the Metadata Graph, converting changes in the metadata model into updates against secondary search & graph indexes (among other things).

Monitoring and Observability

  • A comprehensive data observability layer extends beyond data into the application and infrastructure layers, so that a complete picture emerges of your data’s journey and its interactions with the infrastructure.
  • Traditionally, observability has focused on IT, infrastructure, and operations metrics (which do not consider data context), but the need for observability of business data is rising.

Like typical IT observability, the data observability lifecycle is: detect, alert, root cause analysis, remediate.

Evaluation Criteria

Data Sources and Collectors: richness of supported data sources, data pipeline sources, compute sources (e.g. Kafka, RabbitMQ)

Monitor and Measure Data Environments: supported data quality dimensions, pipeline monitoring (identify source schema drift and potential impact on downstream systems)

Analyze and Optimize Data Pipelines and Infrastructures: how detailed is the root cause analysis, level of remediation / recommendation.

Operate and Manage Data Operations: alerting and incident management capabilities (the ability to annotate lineage, provide a feedback loop, and open tickets in popular workflow systems such as ServiceNow and Jira), and integrations with other DataOps products or collaboration apps (Slack, Microsoft Teams).

Monitor not only infrastructure components and applications (for the SRE/platform team); most important is monitoring user experience and business transactions (for the application team).

4 Pillars of monitoring: computing infra, dependencies, networking infra, application

Data Observability key areas (infrastructure, service, data)

Data pipeline infrastructure monitoring

  • Server (CPU, memory)
  • Disk

Data pipeline service monitoring (SRE focus)

The Four Golden Signals are:
Latency — The time it takes for your service to fulfill a request
Traffic — How much demand is directed at your service
Errors — The rate at which your service fails
Saturation — A measure of how close to fully utilized the service’s resources are
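These signals map naturally onto exported metrics; below is a sketch instrumenting a pipeline service with prometheus_client, where the metric names and request handler are hypothetical.

    # Export the Four Golden Signals for a toy pipeline service.
    import random
    import time

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUESTS = Counter("pipeline_requests_total", "Traffic: requests received")
    ERRORS = Counter("pipeline_errors_total", "Errors: failed requests")
    LATENCY = Histogram("pipeline_request_seconds", "Latency: time to fulfill a request")
    SATURATION = Gauge("pipeline_queue_depth", "Saturation: items waiting to be processed")

    @LATENCY.time()
    def handle_request():
        REQUESTS.inc()
        SATURATION.set(random.randint(0, 100))  # stand-in for a real queue depth
        if random.random() < 0.05:              # simulate an occasional failure
            ERRORS.inc()

    if __name__ == "__main__":
        start_http_server(8000)  # metrics served at http://localhost:8000/metrics
        while True:
            handle_request()
            time.sleep(1)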

Five pillars of data observability (focused on data itself)

Freshness
Quality
Volume
Schema
Lineage
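A toy sketch of simple checks for three of these pillars (freshness, volume, schema) on a pandas DataFrame; the thresholds and column names are assumptions.

    # Toy freshness, volume, and schema checks on a placeholder DataFrame.
    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "updated_at": [pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=1)] * 3,
    })

    # Freshness: the most recent update should be within the last 24 hours.
    freshness_ok = (pd.Timestamp.now(tz="UTC") - df["updated_at"].max()) < pd.Timedelta(hours=24)

    # Volume: the row count should fall inside an expected band.
    volume_ok = 1 <= len(df) <= 1_000_000

    # Schema: the columns downstream consumers depend on must still exist.
    schema_ok = {"order_id", "updated_at"}.issubset(df.columns)

    print(freshness_ok, volume_ok, schema_ok)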

The same pillars are covered in this article and this one.

Data Observability patterns and metrics for Azure Databricks
