Modern data stack stories roundup 2023.2

Data engineering articles that are interesting to read

Xin Cheng
11 min read · Feb 24, 2023

General

Data Architect design area and platform/tools selection

Ingestion framework: ingest data from various sources (CSV, RDBMS, NoSQL, Kafka) using a no-code/low-code/GUI-based tool (AWS Glue, Azure Data Factory, Amazon EMR, Databricks, or Azure Synapse Analytics) or a managed ingestion SaaS (Fivetran or Airbyte).

ETL/ELT

Workflow Orchestration: AWS Step Functions or Azure Data Factory, Amazon MWAA (Amazon Managed Workflows for Apache Airflow) or Databricks Workflows, Astronomer, Dagster, Prefect.
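Several of these options (Amazon MWAA, Astronomer) are managed Apache Airflow, so a minimal Airflow DAG sketch gives a feel for the orchestration model; the DAG id, schedule, and task callables below are hypothetical placeholders.

    # Minimal Apache Airflow DAG sketch: two Python tasks run daily in sequence.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("extract from source")

    def load():
        print("load into the data lake")

    with DAG(
        dag_id="daily_ingestion",          # hypothetical pipeline name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task          # extract runs before load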

Designing Control and Audit Tables: enable operational dashboards showing, for example, the number of files pending, completed, or aborted and the number of records in each batch; define control table schemas and attributes; decide on the best approach to identify which records in the target table were inserted or updated by which batch.
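As a purely illustrative sketch (not the referenced article's design), a batch control table might look like the following; tagging target-table rows with the same batch_id is one way to trace which batch inserted or updated each record. All names are hypothetical.

    # Hypothetical batch control table sketch using SQLite.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("control.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS batch_control (
            batch_id      TEXT PRIMARY KEY,
            source_file   TEXT,
            status        TEXT,      -- pending / completed / aborted
            record_count  INTEGER,
            started_at    TEXT,
            finished_at   TEXT
        )
    """)

    # Register a new batch, then mark it completed once the load finishes.
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO batch_control VALUES (?, ?, 'pending', NULL, ?, NULL)",
        ("batch_20230224_001", "orders_2023-02-24.csv", now),
    )
    conn.execute(
        "UPDATE batch_control SET status = 'completed', record_count = ?, finished_at = ? "
        "WHERE batch_id = ?",
        (15000, datetime.now(timezone.utc).isoformat(), "batch_20230224_001"),
    )
    conn.commit()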

Data storage: for raw, transformed, and curated data, should it be a Data Lake, a Data Warehouse, or a Lakehouse?

File and Open Table Formats: Apache Hudi vs. Apache Iceberg vs. Delta Lake

Data Catalog: AWS Glue Catalog or Databricks Unity Catalog, Azure Purview, Atlan, Alation

Data Validations and Transformations: what are the “must-have” data validations? Options include Databricks DLT, AWS Deequ, Great Expectations, Soda, and data observability platforms.
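A minimal sketch of typical “must-have” checks (non-null keys, uniqueness, value ranges) using the classic Great Expectations pandas API; the API has changed across releases, so the exact calls here are an assumption tied to pre-1.0 versions, and the DataFrame is a placeholder.

    # Classic Great Expectations pandas API sketch (pre-1.0 style).
    import pandas as pd
    import great_expectations as ge

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.5, 20.0, 7.25],
    })

    ge_df = ge.from_pandas(df)

    # Typical must-have validations: not-null keys, uniqueness, value ranges.
    results = [
        ge_df.expect_column_values_to_not_be_null("order_id"),
        ge_df.expect_column_values_to_be_unique("order_id"),
        ge_df.expect_column_values_to_be_between("amount", min_value=0),
    ]

    print(all(r.success for r in results))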

Data Protection

Access Control and Sensitive Data Handling: define an access control strategy based on the roles of each consumer, using RBAC (role-based access control) or ABAC (attribute-based access control); options include AWS Lake Formation for column hiding, Snowflake/Databricks dynamic masking, or SaaS products such as Immuta or Raito. Consider table-level, column-level, and row-level access control.
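A purely illustrative sketch contrasting an RBAC table-level check with an ABAC row-level check; the roles, attributes, and policy logic are hypothetical and not tied to any of the products above.

    # Toy RBAC vs. ABAC access checks; everything here is hypothetical.
    from dataclasses import dataclass

    @dataclass
    class User:
        name: str
        role: str          # RBAC: what the user is
        department: str    # ABAC: attributes describing the user
        region: str

    def rbac_allows(user: User, table: str) -> bool:
        # RBAC: access is granted purely by role.
        allowed = {
            "analyst": {"sales_summary"},
            "admin": {"sales_summary", "customers_pii"},
        }
        return table in allowed.get(user.role, set())

    def abac_allows(user: User, row: dict) -> bool:
        # ABAC (row-level): access depends on attributes of the user and the data.
        return user.department == "sales" and row["region"] == user.region

    alice = User("alice", role="analyst", department="sales", region="EU")
    print(rbac_allows(alice, "customers_pii"))                 # False
    print(abac_allows(alice, {"region": "EU", "amount": 42}))  # True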

Hadoop challenges

Operational

  • Heavy and complex infrastructure maintenance.
  • High operational cost.
  • Frequent downtimes with aging systems.
  • Disaster recovery is challenging and complex.
  • Scalability of storage, compute, and memory requires new infrastructure, causing implementation delays.
  • Lack of enterprise-grade data governance and security tools.
  • Fine-grained auditing of data access and monitoring are complex to implement.

Technical

  • Multiple technologies are used to solve similar use cases.
  • Integration challenges with modern tools for ETL, visualization, and reporting.
  • Complexity in fine-grained access controls.
  • Limited scaling for advanced analytical and AI capabilities.

Databricks benefits

Operational

  • Cloud infrastructure reduces maintenance and operational cost.
  • Very high level of uptime guarantee from the cloud platform.
  • The platform provides capabilities for data governance, lineage tracking, monitoring, and auditing.
  • The ability to scale up/down ensures the required processing and storage capacity is always available.
  • Simpler DR implementation, backed by cloud platforms that guarantee data availability across regions.

Technical

  • Unified programming model for various use cases.
  • Simpler integration with other vendor tools, backed by cloud platform integration capabilities.
  • Better security, authorization, and governance controls.
  • Integrated ML, AI, and analytics capabilities.

Data governance

The article paints a step-by-step picture of why data mesh is needed and why each of its principles exists.

Domain-oriented ownership: this is the most important principle of data mesh; since a central data team cannot cope with fast-growing data demands, ownership is given back to the team that knows the domain data best. Side effect: even after decentralization, data is still siloed within each domain, raising concerns over the accessibility and usability of the data.

Data as a product: this principle addresses the data siloing and accessibility issues created by principle 1 and enables data consumers to easily discover, understand, and securely access data distributed across many domains. Side effect: a heavy cost of ownership to build, run, and maintain a data product, which slows innovation and increases the lead time for building new data products.

Self-serve data infrastructure as a platform: a shared data platform team provides data infrastructure to the domain teams, abstracting the complex underlying details away from them.

Federated computational governance: governance needs to be managed across domains, meaning a central team takes responsibility for cross-domain functionality and provides libraries, so that each domain team can focus on delivering business value rather than infrastructure.

The article explains how to evaluate data catalog products and describes the general features of a data catalog.

Data catalog generations

1st generation: basic software, similar to Excel, that syncs with your data warehouse.
2nd generation: software designed to help the data steward maintain data documentation (metadata), lineage, and treatments.
3rd generation: software designed to deliver business value to end users automatically within hours of deployment, then guide users to document in a collaborative, painless way.

Metric 5W1H framework: what, why, where, who, when, and how as the metadata for data documentation.
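A hypothetical example of applying 5W1H to document a single metric; every name and value below is invented for illustration.

    # 5W1H documentation for one (invented) metric.
    monthly_active_users = {
        "what":  "Count of distinct users with at least one session in the month",
        "why":   "Primary growth KPI reviewed by leadership",
        "where": "analytics.metrics.monthly_active_users (warehouse table)",
        "who":   "Owned by the growth data team",
        "when":  "Refreshed daily at 02:00 UTC",
        "how":   "Derived from raw session events, deduplicated by user_id",
    }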

Metadata/data catalog

A data catalog is to data what a library catalog is to books.

3 Types of metadata

  1. Technical metadata: provides information on the structure, format, and location of the data (the data dictionary), including schemas, tables, columns, file names, report names, and anything else documented by the source system.
  2. Business metadata: provides the meaning and business context of data, defined in everyday language regardless of technical implementation, including business descriptions, comments, annotations, classifications, fitness-for-use, and ratings.
  3. Operational metadata: describes details of the processing and accessing of data, including data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), access management, and monitoring information (warehouse usage statistics, profiling, error reports, and audit trails).
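As an illustration only, the three metadata types might be recorded for a single (hypothetical) table like this:

    # Technical, business, and operational metadata for one invented table.
    orders_metadata = {
        "technical": {
            "schema": "sales",
            "table": "orders",
            "columns": ["order_id INT", "customer_id INT", "amount DECIMAL(10,2)"],
            "location": "s3://datalake/curated/sales/orders/",
        },
        "business": {
            "description": "One row per confirmed customer order",
            "classification": "internal",
            "fitness_for_use": "Certified for revenue reporting",
        },
        "operational": {
            "lineage": ["raw.order_events -> curated.sales.orders"],
            "currency": "active",
            "last_load": "2023-02-24T02:00:00Z",
            "row_count": 1250000,
        },
    }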

A comprehensive list

The article tests two open source data catalog solutions:

  1. Magda: maintained by a small team. The author ran into several issues and eventually had to postpone further research of the Magda Data Catalog.
  2. Amundsen: according to its architecture, the components are:
    Frontend Service — a Flask application with a React frontend where users can search through the fetched metadata as in a data catalog.
    Search Service — leverages Elasticsearch for search capabilities, used to power frontend metadata searching; Elasticsearch is a search engine based on the Lucene library whose API is used here to connect the metadata from the Databuilder to the Search Service.
    Metadata Service — leverages Neo4j (or Apache Atlas) as the persistence layer to provide various metadata; Neo4j is a graph database management system, used here to fetch and display the metadata from the Databuilder and provide it to the Metadata Service.
    Databuilder ingestion: a data ingestion library inspired by Apache Gobblin. Its purpose is to transport the data from all the different metadata sources (such as Postgres, Hive, BigQuery, GCS, etc.) to the Neo4j and Elasticsearch services.
    According to the Amundsen extractors list, the extractors can retrieve schema information from sources that expose it (e.g. BigQuery, relational databases, the Kafka schema registry, AWS Glue (which covers S3), Delta Lake). Currently, it seems to be missing major connectors to the Microsoft Purview and Google Cloud data catalogs.
    Business Glossary: currently the only option Amundsen provides for a business glossary is the so-called Custom Descriptions, written in Markdown.
    Data Profiling: supported by pandas/pandas-profiling (see the sketch below).
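A minimal pandas-profiling sketch (the package has since been renamed ydata-profiling); the DataFrame is a placeholder.

    # Generate an HTML data profile for a small placeholder DataFrame.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

    profile = ProfileReport(df, title="orders profile", minimal=True)
    profile.to_file("orders_profile.html")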

Syntio continues to test open source data catalog solutions; this article covers OpenMetadata.

According to its architecture, the components are:

  • User Interface (UI) — a central place for users to discover and collaborate on all data.
  • Ingestion framework — a pluggable framework for integrating tools and ingesting metadata to the metadata store. The ingestion framework already supports well-known data warehouses.
  • Metadata APIs — for producing and consuming metadata built on schemas for User Interfaces and for Integrating tools, systems, and services.
  • Metadata/Entity store — stores a metadata graph that connects data assets and user and tool generated metadata.
  • Search Engine — Elasticsearch to support querying metadata

Connectors list: includes relational databases, BigQuery, and Kafka.

Business Glossary: seems more mature than Amundsen's. A glossary is a controlled vocabulary that helps establish consistent meaning for terms, build a common understanding, and form a knowledge base. Glossary terms can also help organize or discover data entities. OpenMetadata models a glossary that organizes terms with hierarchical, equivalent, and associative relationships.

Data Profiling: Data profiler is enabled as part of metadata ingestion, using a separate configuration file.

DataHub

Architecturally, DataHub is similar to OpenMetadata; the components are:

Ingestion Framework: a modular, extensible Python library for extracting metadata from external source systems (e.g. Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's Metadata Model, and writing it into DataHub either via Kafka or by using the Metadata Store REST APIs directly. To create an ingestion pipeline, you can use a configuration recipe file or Python.

Metadata Store: stores the Entities & Aspects comprising the Metadata Graph. It consists of a Spring Java service hosting a set of Rest.li API endpoints, along with MySQL, Elasticsearch, and Kafka for primary storage and indexing (the graph index can be Neo4j or Elasticsearch).

GraphQL API: provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata Graph simple. This API is consumed by the User Interface.

User Interface: a React UI to support data discovery and data governance.

Data Lineage: column-level lineage support was recently added, along with lineage from external sources such as Airflow, Spark, Snowflake, and BigQuery.

Analytics: usage analytics.

Data Profiling: SQL Profiler

The article provides data catalog vendors (open source, commercial and cloud service providers) from Gartner and DBMS Tools, as well as a very high-level overview of DataHub features.

DataHub offers a Helm chart, making deployment to Kubernetes straightforward. Authors used ArgoCD, the declarative GitOps continuous delivery tool for Kubernetes, to deploy the DataHub Helm charts to Amazon EKS.

In order for a data catalog to work, it needs to be able to connect to different sources to ingest metadata. DataHub uses a plugin architecture: plugins let you install only the data source dependencies you need. For example, if you want to ingest metadata from Amazon Athena, just install the Athena plugin: pip install 'acryl-datahub[athena]'. The DataHub CLI provides a local environment to test DataHub and run ingestion, which is useful during the development phase.

From this article: once you define the source, transformers, and sink, the ingestion pipeline will crawl all objects and populate the DataHub metadata store.
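A minimal recipe expressed in Python rather than YAML, as a sketch only: the connection details are placeholders, the exact config keys depend on the source plugin version, and it assumes the matching plugin (e.g. pip install 'acryl-datahub[postgres]') is installed.

    # Run a DataHub ingestion pipeline from a Python recipe dict.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create({
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",   # placeholder connection details
                "database": "mydb",
                "username": "datahub",
                "password": "example",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    })
    pipeline.run()
    pipeline.raise_from_status()   # fail loudly if ingestion reported errors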

It also mentions two more DataHub modules:

Metadata Change Event Consumer (optional): Its main function is to listen to change proposal events emitted by clients of DataHub which request changes to the Metadata Graph. It then applies these requests against DataHub’s storage layer: the Metadata Service.

Metadata Audit Event Consumer (optional): Its main function is to listen to change log events emitted as a result of changes made to the Metadata Graph, converting changes in the metadata model into updates against secondary search & graph indexes (among other things).

Monitoring and Observability

  • A comprehensive data observability layer extends beyond data into the application and infrastructure layers, so that a complete picture emerges of your data’s journey and its interactions with the infrastructure.
  • Traditionally, observability has focused on IT, infrastructure, and operations metrics (which do not consider data context), but the need for observability of business data is rising.

Like typical IT observability, the data observability lifecycle is: detect, alert, root cause analysis, remediate.

Evaluation Criteria

Data Sources and Collectors: richness of supported data sources, data pipeline sources, compute sources (e.g. Kafka, RabbitMQ)

Monitor and Measure Data Environments: supported data quality dimensions, pipeline monitoring (identify source schema drift and potential impact on downstream systems)

Analyze and Optimize Data Pipelines and Infrastructures: how detailed is the root cause analysis, level of remediation / recommendation.

Operate and Manage Data Operations: alerting and incident management capabilities (the ability to annotate lineage, provide a feedback loop, and open tickets in popular workflow systems such as ServiceNow and Jira), and integrations with other DataOps products or collaboration apps (Slack, Microsoft Teams).

Monitor not only infrastructure components and applications (for the SRE/platform team); most important is monitoring user experience and business transactions (for the application team).

4 Pillars of monitoring: computing infra, dependencies, networking infra, application

Data Observability key areas (infrastructure, service, data)

Data pipeline infrastructure monitoring

  • Server (CPU, memory)
  • Disk

Data pipeline service monitoring (SRE focus)

The Four Golden Signals are:
Latency — The time it takes for your service to fulfill a request
Traffic — How much demand is directed at your service
Errors — The rate at which your service fails
Saturation — A measure of how close to fully utilized the service’s resources are
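These signals map naturally onto exported metrics; below is a sketch instrumenting a pipeline service with prometheus_client, where the metric names and request handler are hypothetical.

    # Export the Four Golden Signals for a toy pipeline service.
    import random
    import time

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUESTS = Counter("pipeline_requests_total", "Traffic: requests received")
    ERRORS = Counter("pipeline_errors_total", "Errors: failed requests")
    LATENCY = Histogram("pipeline_request_seconds", "Latency: time to fulfill a request")
    SATURATION = Gauge("pipeline_queue_depth", "Saturation: items waiting to be processed")

    @LATENCY.time()
    def handle_request():
        REQUESTS.inc()
        SATURATION.set(random.randint(0, 100))  # stand-in for a real queue depth
        if random.random() < 0.05:              # simulate an occasional failure
            ERRORS.inc()

    if __name__ == "__main__":
        start_http_server(8000)  # metrics served at http://localhost:8000/metrics
        while True:
            handle_request()
            time.sleep(1)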

Five pillars of data observability (focused on data itself)

Freshness
Quality
Volume
Schema
Lineage
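A toy sketch of simple checks for three of these pillars (freshness, volume, schema) on a pandas DataFrame; the thresholds and column names are assumptions.

    # Toy freshness, volume, and schema checks on a placeholder DataFrame.
    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "updated_at": [pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=1)] * 3,
    })

    # Freshness: the most recent update should be within the last 24 hours.
    freshness_ok = (pd.Timestamp.now(tz="UTC") - df["updated_at"].max()) < pd.Timedelta(hours=24)

    # Volume: the row count should fall inside an expected band.
    volume_ok = 1 <= len(df) <= 1_000_000

    # Schema: the columns downstream consumers depend on must still exist.
    schema_ok = {"order_id", "updated_at"}.issubset(df.columns)

    print(freshness_ok, volume_ok, schema_ok)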

The same pillars are covered in this article and this one.

Data Observability patterns and metrics for Azure Databricks
