Modern data stack stories roundup 2023.3

Data engineering articles that are interesting to read

Xin Cheng
5 min read · Mar 28, 2023

Data Product

Discusses the importance of metadata in a data mesh architecture and introduces an open-source data product specification to standardize the Data Product definition. The specification aims to gather the crucial pieces of information related to all these aspects under the strict ownership of the Data Product Owner, including:

  • Output Ports: representing all the different interfaces of the Data Product to expose the data to consumers
  • Workloads: internal jobs/processes to feed the Data Product and to perform housekeeping (GDPR, regulation, audit, data quality, etc)
  • Storage Areas: internal data storages where the Data Product is deployed, not exposed to consumers
  • Observability: provides transparency to the data consumer about how the Data Product is currently working. This is not declarative; it exposes runtime data.

The open-source repo has a sample file for the Data Product definition.
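As a rough illustration only (the actual specification in the repo has its own format; these field names are assumptions, not taken from it), the kind of information such a definition gathers could be sketched in Python like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative only: field names are assumptions based on the four
# aspects listed above, not the actual open-source specification file.

@dataclass
class OutputPort:
    name: str                # interface exposing data to consumers
    technology: str          # e.g. "view", "api", "stream"
    schema: Dict[str, str]   # column name -> data type

@dataclass
class Workload:
    name: str                # internal job/process feeding the product
    purpose: str             # e.g. "ingestion", "gdpr-housekeeping", "data-quality"

@dataclass
class DataProduct:
    name: str
    owner: str                                                # Data Product Owner
    output_ports: List[OutputPort] = field(default_factory=list)
    workloads: List[Workload] = field(default_factory=list)
    storage_areas: List[str] = field(default_factory=list)    # internal, not exposed to consumers
    observability_endpoint: str = ""                          # runtime data, not declarative
```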

Talks about 18 principles for implementing data products:

  1. Clearly define your domain boundaries: each data domain must be distinct: Business concerns, processes, the solution area, and data that belong together must stay together and be maintained and managed within the domain
  2. Be concrete on your data products and interoperability standards. I think we should use a standardized Data Product specification to achieve this
  3. No raw data: whether the data comes from an internal or an external data provider, ask the provider to conform to agreed standards
  4. Define patterns for overlapping domains. Separate ways can be used as a design pattern if the associated cost of duplication is preferred over reusability; this pattern is typically chosen when reusability is sacrificed for higher flexibility and agility. A customer-supplier pattern can be used if one domain is strong and willing to take ownership of the data and the needs of downstream consumers; the drawback is conflicting concerns, forcing downstream teams to negotiate deliverables and schedule priorities. In the partnership model, the integration logic is coordinated in an ad hoc manner within a newly created domain; all teams cooperate with and regard each other’s needs, and a big commitment is needed from everybody because no team can change the shared logic freely. A conformist pattern can be used to conform all domains to all requirements; this pattern can be a choice 1) when the integration work is extremely complex, 2) when no other parties are allowed to have control, or 3) when vendor packages are used.
  5. A data product contains data that is made available for wide consumption: if the same data will be used repeatedly by different domain teams, those teams shouldn’t conform their data products to the specific needs of individual data consumers. A governance body oversees that data products aren’t created consumer-specific, and can step in to guide domain teams, organize walk-in and knowledge-sharing sessions, provide practical feedback, and resolve issues between teams.
  6. Create specific guidance on missing values, defaults, and data types
  7. Data products are semantically consistent across all delivery methods: batch, event-driven, and API-based
  8. Data products inherit the ubiquitous language; this could be expressed using the Data Product specification
  9. Data product attributes are atomic: represent the lowest level of granularity and have precise meaning or precise semantics, linked one-to-one to the items within your data catalog
  10. Data products remain compatible from the moment they are created: they remain stable and are decoupled from the operational/transactional application, which includes schema drift detection and versioning (see the sketch after this list)
  11. Abstract volatile reference data to less granular value ranges
  12. Optimized (transformed) for readability: complex application models are abstracted away, so the result looks like a denormalized star schema model
  13. Data products are directly captured from the source?
  14. Newly created data means new data products: immutability
  15. Encapsulate metadata for security
  16. You may want to introduce some enterprise consistency: enterprise consistency might help in a large-scale organization in which many domains rely on the same reference values, e.g. currency codes, country codes, product codes, client segmentation codes
  17. Addressing time-variant and non-volatile concerns
  18. Use data product blueprints
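Principle 10 mentions schema drift detection; here is a minimal sketch of what such a check might look like (the contract format and column names are made-up assumptions for illustration):

```python
# Hypothetical published contract for a data product; column names are made up.
published_contract = {
    "customer_id": "bigint",
    "segment_code": "string",
    "created_at": "timestamp",
}

def detect_schema_drift(contract: dict, current_schema: dict) -> dict:
    """Compare the published contract with the schema the source currently
    produces and report any drift (added, removed, or retyped columns)."""
    added = {c: t for c, t in current_schema.items() if c not in contract}
    removed = {c: t for c, t in contract.items() if c not in current_schema}
    changed = {
        c: (contract[c], current_schema[c])
        for c in contract.keys() & current_schema.keys()
        if contract[c] != current_schema[c]
    }
    return {"added": added, "removed": removed, "changed": changed}

drift = detect_schema_drift(
    published_contract,
    {"customer_id": "bigint", "segment_code": "int",
     "created_at": "timestamp", "email": "string"},
)
# Any non-empty drift should trigger a new, versioned release of the data
# product rather than silently breaking downstream consumers.
print(drift)
```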

The article also includes a checklist that ensures better data ownership, data usability, and data platform usage.

Talks about KPIs to measure data platform maturity. The single most important metric is Time-to-Reliable-Insights, which can depend on the following phase-level KPIs:

  • Discovery Phase KPIs: Time-to-Find (Time-to-Find-Sources, Time-to-Find-Attributes), Time-to-Interpret
  • Ingestion Phase KPIs: Time-Lag-to-Analyze, Time-to-Evolve
  • Governance Phase KPIs: Time-to-Compliance, Time-to-Quality, Time-to-Transform, Time-to-Standardize
  • Analyze Phase KPIs: Time-to-Iterate, Time-to-Query, Time-to-Optimize
  • Publish Phase KPIs: Time-to-Productionalize, Time-to-Retrain, Time-to-Last-Mile, Time-to-Resolve-Issues

Data Governance

Lineage is important to real-world data solutions and should not be treated as just a feature. Good lineage is hard to do, and automated data lineage has the potential to empower data professionals and even replace some of the boring aspects of their daily jobs.

The main questions to ask when evaluating data lineage solutions are:

  1. Metadata source support
  2. SQL support: parsing SQL
  3. Granularity: column-level
  4. Lineage support
  5. Scalability
  6. Integration with modern data integration tools, e.g. Fivetran, Airbyte, Spark/Databricks
  7. Richness of metadata
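To make the SQL-parsing and column-level granularity questions concrete, here is a small sketch using the open-source sqlglot parser (my own illustrative choice; the article does not prescribe a parser) to pull table- and column-level references out of a query:

```python
import sqlglot
from sqlglot import exp

sql = """
    SELECT o.customer_id, SUM(o.amount) AS total_amount
    FROM raw.orders AS o
    JOIN raw.customers AS c ON o.customer_id = c.id
    GROUP BY o.customer_id
"""

parsed = sqlglot.parse_one(sql)

# Table-level lineage: which upstream tables feed this query.
source_tables = {t.sql() for t in parsed.find_all(exp.Table)}

# Column-level hints: which columns are referenced anywhere in the query.
referenced_columns = {c.sql() for c in parsed.find_all(exp.Column)}

print(source_tables)       # e.g. {'raw.orders AS o', 'raw.customers AS c'}
print(referenced_columns)  # e.g. {'o.customer_id', 'o.amount', 'c.id'}
```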

5 Open-Source Data Catalogs

  1. Apache Atlas
  2. Lyft Amundsen
  3. LinkedIn Datahub
  4. Netflix Metacat
  5. OpenMetadata

Other

Spark data skew: use the Spark UI to view the job/stage details and identify big discrepancies between tasks.

Good (notice the Sort durations on the left and right sides are similar, and the three duration numbers are similar)

Skew (the Sort durations on the left and right have a big gap, and so do the three reported numbers: 4 ms vs. 481 ms, 64 KiB vs. 56 MiB)

Resolution: adjust the partition number to break the data into smaller partitions, broadcast the small table, or use salting (add a random number within a range, e.g. 0–9, to the key on the skewed side and expand the other side across the same range, so joining them still produces exactly one matching row per original key pair) to break hot keys into smaller partitions.
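A minimal PySpark sketch of the broadcast and salting mitigations described above (table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
facts = spark.table("sales")        # large table, skewed on customer_id
dims = spark.table("customers")     # small dimension table

# 1) Broadcast the small table so the skewed key is never shuffled.
joined = facts.join(F.broadcast(dims), on="customer_id")

# 2) Salting: add a random 0-9 salt to the large side and replicate the
#    small side across the same 0-9 range, so each (key, salt) pair still
#    matches exactly once while the hot key is split across 10 partitions.
salted_facts = facts.withColumn("salt", (F.rand() * 10).cast("int"))
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(10)]))
)
salted_join = salted_facts.join(
    salted_dims, on=["customer_id", "salt"]
).drop("salt")
```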

Written by Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified
