Modern data stack stories roundup 2023.3

Data engineering articles that are interesting to read

Xin Cheng
5 min read · Mar 28, 2023

Data Product

Discusses the importance of metadata in a data mesh architecture and introduces an open-source data product specification to standardize the Data Product definition. The specification aims to gather the crucial pieces of information related to all these aspects under the strict ownership of the Data Product Owner, including:

  • Output Ports: representing all the different interfaces of the Data Product to expose the data to consumers
  • Workloads: internal jobs/processes to feed the Data Product and to perform housekeeping (GDPR, regulation, audit, data quality, etc)
  • Storage Areas: internal data storages where the Data Product is deployed, not exposed to consumers
  • Observability: provides transparency to the data consumer about how the Data Product is currently working. This is not declarative; it exposes runtime data.

The open-source repo has a sample file for the Data Product definition.
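As a rough illustration only (the actual specification in the repo has its own format; these field names are assumptions, not taken from it), the kind of information such a definition gathers could be sketched in Python like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative only: field names are assumptions based on the four
# aspects listed above, not the actual open-source specification file.

@dataclass
class OutputPort:
    name: str                # interface exposing data to consumers
    technology: str          # e.g. "view", "api", "stream"
    schema: Dict[str, str]   # column name -> data type

@dataclass
class Workload:
    name: str                # internal job/process feeding the product
    purpose: str             # e.g. "ingestion", "gdpr-housekeeping", "data-quality"

@dataclass
class DataProduct:
    name: str
    owner: str                                                # Data Product Owner
    output_ports: List[OutputPort] = field(default_factory=list)
    workloads: List[Workload] = field(default_factory=list)
    storage_areas: List[str] = field(default_factory=list)    # internal, not exposed to consumers
    observability_endpoint: str = ""                          # runtime data, not declarative
```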

Talks about 18 principles for implementing data products:

  1. Clearly define your domain boundaries: each data domain must be distinct: Business concerns, processes, the solution area, and data that belong together must stay together and be maintained and managed within the domain
  2. Be concrete on your data products and interoperability standards. I think we should use a standardized Data Product specification to achieve this
  3. No raw data: whether the data comes from an internal or an external data provider, ask the provider to conform to agreed standards
  4. Define patterns for overlapping domains. Separate ways can be used as a design pattern if the associated cost of duplication is preferred over reusability; this pattern is typically chosen when reusability is sacrificed for higher flexibility and agility. A customer-supplier pattern can be used if one domain is strong and willing to take ownership of the data and the needs of downstream consumers; the drawback is conflicting concerns, forcing downstream teams to negotiate deliverables and schedule priorities. In the partnership model, the integration logic is coordinated in an ad hoc manner within a newly created domain; all teams cooperate with and regard each other’s needs, and a big commitment is needed from everybody because no team can change the shared logic freely. A conformist pattern can be used to conform all domains to all requirements; this pattern can be a choice 1) when the integration work is extremely complex, 2) when no other parties are allowed to have control, or 3) when vendor packages are used.
  5. A data product contains data that is made available for wide consumption: if the same data will be used repeatedly by different domain teams, those teams shouldn’t conform their data products to the specific needs of individual data consumers. A governance body oversees that data products aren’t created consumer-specific, and can step in to guide domain teams, organize walk-in and knowledge-sharing sessions, provide practical feedback, and resolve issues between teams.
  6. Create specific guidance on missing values, defaults, and data types
  7. Data products are semantically consistent across all delivery methods: batch, event-driven, and API-based
  8. Data products inherit the ubiquitous language; this could be expressed using the Data Product specification
  9. Data product attributes are atomic: represent the lowest level of granularity and have precise meaning or precise semantics, linked one-to-one to the items within your data catalog
  10. Data products remain compatible from the moment they are created: they remain stable and are decoupled from the operational/transactional application, which includes schema drift detection and versioning (see the sketch after this list)
  11. Abstract volatile reference data to less granular value ranges
  12. Optimized (transformed) for readability: complex application models are abstracted away, so the result looks like a denormalized star schema model
  13. Data products are directly captured from the source?
  14. Newly created data means new data products: immutability
  15. Encapsulate metadata for security
  16. You may want to introduce some enterprise consistency: enterprise consistency might help in a large-scale organization in which many domains rely on the same reference values, e.g. currency codes, country codes, product codes, client segmentation codes
  17. Addressing time-variant and non-volatile concerns
  18. Use data product blueprints
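Principle 10 mentions schema drift detection; here is a minimal sketch of what such a check might look like (the contract format and column names are made-up assumptions for illustration):

```python
# Hypothetical published contract for a data product; column names are made up.
published_contract = {
    "customer_id": "bigint",
    "segment_code": "string",
    "created_at": "timestamp",
}

def detect_schema_drift(contract: dict, current_schema: dict) -> dict:
    """Compare the published contract with the schema the source currently
    produces and report any drift (added, removed, or retyped columns)."""
    added = {c: t for c, t in current_schema.items() if c not in contract}
    removed = {c: t for c, t in contract.items() if c not in current_schema}
    changed = {
        c: (contract[c], current_schema[c])
        for c in contract.keys() & current_schema.keys()
        if contract[c] != current_schema[c]
    }
    return {"added": added, "removed": removed, "changed": changed}

drift = detect_schema_drift(
    published_contract,
    {"customer_id": "bigint", "segment_code": "int",
     "created_at": "timestamp", "email": "string"},
)
# Any non-empty drift should trigger a new, versioned release of the data
# product rather than silently breaking downstream consumers.
print(drift)
```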

The article also includes a checklist that ensures better data ownership, data usability, and data platform usage.

Talks about KPIs to measure data platform maturity. The single most important metric is Time-to-Reliable-Insights, which can depend on the following phase-level KPIs:

  • Discovery Phase KPIs: Time-to-Find (Time-to-Find-Sources, Time-to-Find-Attributes), Time-to-Interpret
  • Ingestion Phase KPIs: Time-Lag-to-Analyze, Time-to-Evolve
  • Governance Phase KPIs: Time-to-Compliance, Time-to-Quality, Time-to-Transform, Time-to-Standardize
  • Analyze Phase KPIs: Time-to-Iterate, Time-to-Query, Time-to-Optimize
  • Publish Phase KPIs: Time-to-Productionalize, Time-to-Retrain, Time-to-Last-Mile, Time-to-Resolve-Issues

Data Governance

Lineage is important to real-world data solutions and should not be treated as just a feature. Good lineage is hard to do, and automated data lineage has the potential to empower data professionals and even replace some of the boring aspects of their daily jobs.

The main questions to ask when evaluating data lineage solutions are:

  1. Metadata source support
  2. SQL support: parsing SQL
  3. Granularity: column-level
  4. Lineage support
  5. Scalability
  6. Integration with modern data integration tools, e.g. Fivetran, Airbyte, Spark/Databricks
  7. Richness of metadata
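To make the SQL-parsing and column-level granularity questions concrete, here is a small sketch using the open-source sqlglot parser (my own illustrative choice; the article does not prescribe a parser) to pull table- and column-level references out of a query:

```python
import sqlglot
from sqlglot import exp

sql = """
    SELECT o.customer_id, SUM(o.amount) AS total_amount
    FROM raw.orders AS o
    JOIN raw.customers AS c ON o.customer_id = c.id
    GROUP BY o.customer_id
"""

parsed = sqlglot.parse_one(sql)

# Table-level lineage: which upstream tables feed this query.
source_tables = {t.sql() for t in parsed.find_all(exp.Table)}

# Column-level hints: which columns are referenced anywhere in the query.
referenced_columns = {c.sql() for c in parsed.find_all(exp.Column)}

print(source_tables)       # e.g. {'raw.orders AS o', 'raw.customers AS c'}
print(referenced_columns)  # e.g. {'o.customer_id', 'o.amount', 'c.id'}
```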

5 Open-Source Data Catalogs

  1. Apache Atlas
  2. Lyft Amundsen
  3. LinkedIn Datahub
  4. Netflix Metacat
  5. OpenMetadata

Other

Spark data skew: use the Spark UI to view the job/stage details and identify big discrepancies between tasks.

Good (notice the Sort durations on the left and right sides are similar, and the three duration numbers are similar)

Skew (the Sort durations on the left and right have a big gap, and so do the three reported numbers: 4 ms vs. 481 ms, 64 KiB vs. 56 MiB)

Resolution: adjust the partition number to break the data into smaller partitions, broadcast the small table, or use salting (add a random number within a range, e.g. 0–9, to the key on the skewed side and expand the other side across the same range, so joining them still produces exactly one matching row per original key pair) to break hot keys into smaller partitions.
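A minimal PySpark sketch of the broadcast and salting mitigations described above (table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
facts = spark.table("sales")        # large table, skewed on customer_id
dims = spark.table("customers")     # small dimension table

# 1) Broadcast the small table so the skewed key is never shuffled.
joined = facts.join(F.broadcast(dims), on="customer_id")

# 2) Salting: add a random 0-9 salt to the large side and replicate the
#    small side across the same 0-9 range, so each (key, salt) pair still
#    matches exactly once while the hot key is split across 10 partitions.
salted_facts = facts.withColumn("salt", (F.rand() * 10).cast("int"))
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(10)]))
)
salted_join = salted_facts.join(
    salted_dims, on=["customer_id", "salt"]
).drop("salt")
```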

Written by Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified
