Modern data stack stories roundup 2023.1
General data pipeline
Talks about the data platform for Coupang Eats, an online food ordering and delivery service: a centralized data platform supporting training and serving of ML models and other data science services. Three types of data pipelines are described:
Non-real time: for ML feature production, user profiling and tag generation, and data visualization. At the last step, a configuration-driven pipeline pushes offline features and signals to online storage; the configuration defines the feature group to sync (from Hive tables), a generic Spark job definition, and scheduling information (a configuration sketch follows this list).
Near real-time: ingest from Kafka or from Hive on cloud storage into an OLAP engine; wide tables are then created by joining multiple source tables with OLAP engine SQL, and OLAP engine SQL is used again to generate metrics and signals (a SQL sketch follows this list).
Pure real-time: for low-latency scenarios like flood detection and risk control; a configuration-based pipeline reads from Kafka and outputs real-time features back to Kafka, calculating statistical aggregates like SUM, COUNT, UNIQUE COUNT, and TOPN (a streaming sketch follows this list).
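The article does not show the configuration format for the non-real-time sync, so here is a minimal sketch of the idea behind a configuration-driven offline-to-online feature push, assuming a hypothetical config dict and a generic PySpark job. The table, column, and online-store names are illustrative, not Coupang's actual setup.

```python
from pyspark.sql import SparkSession

# Hypothetical configuration: which Hive feature group to sync, how, and when.
# None of these keys come from the article; they only illustrate the idea of a
# configuration-driven "offline -> online" feature sync.
feature_sync_config = {
    "feature_group": "eats_user_features",        # Hive table with offline features
    "key_columns": ["user_id"],                   # key in the online store
    "feature_columns": ["order_cnt_30d", "avg_basket_size"],
    "online_store": {"type": "redis", "host": "redis.internal", "port": 6379},
    "schedule": "0 2 * * *",                      # cron expression for the daily sync
}


def run_feature_sync(config: dict) -> None:
    """Generic Spark job: read the offline feature group and push rows online."""
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    cols = config["key_columns"] + config["feature_columns"]
    df = spark.table(config["feature_group"]).select(*cols)

    def push_partition(rows):
        # A real job would open a client to the online store here (e.g. Redis)
        # and write each row as key -> feature map; this stub only iterates.
        for row in rows:
            _ = row.asDict()  # e.g. client.hset(row["user_id"], mapping=...)

    df.foreachPartition(push_partition)


if __name__ == "__main__":
    run_feature_sync(feature_sync_config)
```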
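The near-real-time path is essentially two SQL steps. The article names neither the OLAP engine nor the schemas, so the sketch below uses Spark SQL as a stand-in and made-up table names; in practice both statements would run inside the OLAP engine itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1: build a wide table by joining several ingested source tables
# (schemas and names are made up for illustration).
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.orders_wide AS
    SELECT o.order_id, o.user_id, o.store_id, o.amount, o.order_ts,
           u.city, s.category
    FROM ods.orders o
    JOIN ods.users  u ON o.user_id  = u.user_id
    JOIN ods.stores s ON o.store_id = s.store_id
""")

# Step 2: query the wide table again to produce metrics/signals.
metrics = spark.sql("""
    SELECT city,
           COUNT(*)                AS order_cnt,
           SUM(amount)             AS gmv,
           COUNT(DISTINCT user_id) AS active_users
    FROM dw.orders_wide
    WHERE order_ts >= date_sub(current_date(), 1)
    GROUP BY city
""")
metrics.show()
```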
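For the pure real-time path, here is a rough sketch of the kind of computation involved, written with Spark Structured Streaming (the article describes a configuration-based pipeline and does not name the engine). Broker, topic, and field names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("realtime_features_sketch").getOrCreate()

# Read order events from Kafka (broker, topic, and field names are placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order_events")
    .load()
    .select(F.from_json(
        F.col("value").cast("string"),
        "user_id STRING, store_id STRING, amount DOUBLE, event_ts TIMESTAMP").alias("e"))
    .select("e.*")
)

# Windowed statistical aggregates per user: SUM, COUNT, and an approximate
# UNIQUE COUNT (exact distinct counts and TOPN need heavier state management).
features = (
    events.withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"), "user_id")
    .agg(F.sum("amount").alias("amount_sum"),
         F.count("*").alias("order_count"),
         F.approx_count_distinct("store_id").alias("unique_stores"))
)

# Serialize each feature row as JSON and write it back to another Kafka topic.
query = (
    features.selectExpr("to_json(struct(*)) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "realtime_features")
    .option("checkpointLocation", "/tmp/chk/realtime_features")
    .outputMode("update")
    .start()
)
query.awaitTermination()
```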
Talks about Apache projects in the modern data stack, focusing on Apache SeaTunnel (incubating), a project for synchronizing data and connecting different systems, and Apache DolphinScheduler, a data orchestration system and Airflow alternative that is drag-and-drop based.
Apache SeaTunnel (incubating) is a project for data synchronization. It now supports 50+ connectors, including 20+ data sources, 20+ types of sinks, and 10+ types of transforms, covering systems like MySQL, Presto, PostgreSQL, TiDB, and Elasticsearch (as well as Hive, Hudi, and other data lake formats). SeaTunnel can run on Spark, Flink, or its own SeaTunnel engine, and supports CDC (e.g. MySQL CDC and Kafka, but not yet via Debezium).
DolphinScheduler can schedule SeaTunnel sink jobs, so batch data can be fed back to Oracle, SAP, SaaS systems, social media, and relational databases.
Data mesh: it isn’t a platform or a service that you can buy off the shelf. It’s a design concept built on ideas like distributed ownership, domain-based design, data discoverability, and data product shipping standards (my understanding is that it’s more of a paradigm, like microservices, which takes a long time to take root in people’s minds and then land). It is still in its early phases as teams figure out what implementing the data mesh really means, and the mesh tooling stack is still immature. Suggestion: it is important to stick to the first principles at a conceptual level, rather than buy into the hype.
Metrics layer (still hype to me for centralizing business metrics across the enterprise): although dbt Labs’ Semantic Layer launched in October 2022, along with integrations across the modern data stack from companies like Hex, Mode, ThoughtSpot, and Atlan, the change management process of getting people to write metrics is massive, and the switch to a metrics layer will more likely take years rather than months (is it really worth being a first-class citizen in the data space?).
Reverse ETL: previously I saw it as just another name for an application integration tool. Now vendors have shifted from talking about “pushing data” to actually driving customer use cases with data.
Active metadata: I think of it as ML-augmented metadata, where metadata is not entered manually but generated dynamically from code. Gartner, G2, Forrester, and the broader industry are starting to align on what a successful data catalog should look like, but truly third-gen data catalogs are yet to be seen.
Data Observability: not sure where data observability is heading, towards independence or a merger with data reliability, active metadata, or some other category. At a high level, though, it seems to be moving closer to data quality, with a focus on ensuring high-quality data rather than on active metadata.
The article talks about several work streams in an enterprise data mesh journey:
- Strategy Stream, which establishes key data mesh concepts and creates an implementation plan while mapping the key opportunities and risks (10–12 weeks). The main outputs are the architecture, roadmap, and risks; a successful outcome is gaining buy-in from a diverse set of stakeholders.
- Technology Stream, which defines and builds the technology foundation and industrialization activities required for the enterprise data mesh (16–24 weeks). Its two activities are (a) the buildout of the foundational technology components: a Registry/Catalog (to easily find, discover, observe, and operate data products), an access interface / Federated Query Platform (to consume and share data managed by data products), an API Platform (to consume data via APIs), and a Streaming/Event Platform; and (b) the industrialization of those components: Security, Operability, Observability, Support.
- Factory Stream, which introduces repeatable processes and templates to permit rapid scaling of the enterprise data mesh. Running in parallel with the Technology Stream, it delivers 3–5 MVPs (8–10 weeks) and creates a data product factory consisting of Repeatable Processes and Templates, DevSecOps Tools and Pipelines, and Secure Environments.
- Operating Model Stream, which defines the team structure, interactions, and governance techniques needed to build and operate the enterprise data mesh.
- Socialization Stream, which is used not only to communicate successes but also to continuously build the momentum required to build the enterprise data mesh. It can use any available communication vehicles (articles, blogs, podcasts, presentations, “office-hours” sessions, or “lunch-and-learn” sessions) to engage stakeholders and to find and sign up sponsors.
- Rollout Stream, which accelerates the adoption of data products within the enterprise data mesh.
Talks about the Medallion architecture, a popular architecture for building a lakehouse. Bronze: raw/unprocessed, immutable/append-only, using interval-partitioned tables, for example a YYYYMMDD or datetime folder structure. Silver: data quality rules applied (dealing with missing/duplicate/inconsistent/inaccurate values), optionally SCD handling, Delta as the storage format (or at least Parquet), and data enrichment. Gold: complex business rules applied, calculations, enrichments, and use-case-specific optimizations. A minimal sketch of the three layers follows.
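This is only a sketch of the layer-by-layer flow, assuming Delta Lake is available on the cluster and using made-up paths, table names, and columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("medallion_sketch")
         .getOrCreate())  # assumes the Delta Lake package is on the classpath

# Bronze: land raw data unmodified, append-only, partitioned by load date.
raw = spark.read.json("s3://lake/landing/orders/")           # illustrative path
(raw.withColumn("load_date", F.date_format(F.current_date(), "yyyyMMdd"))
    .write.format("delta").mode("append")
    .partitionBy("load_date")
    .save("s3://lake/bronze/orders"))

# Silver: apply data quality rules (drop duplicates, drop/fix missing values)
# and store in Delta so later updates and SCD handling are possible.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")
silver = (bronze.dropDuplicates(["order_id"])
                .na.drop(subset=["order_id", "amount"])
                .withColumn("amount", F.col("amount").cast("double")))
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/orders")

# Gold: apply business rules and aggregations optimized for a specific use case.
gold = (silver.groupBy("store_id")
              .agg(F.sum("amount").alias("gmv"),
                   F.countDistinct("order_id").alias("order_cnt")))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/store_sales")
```

The key property is that Bronze is never rewritten (append-only, partitioned by load date), so Silver and Gold can always be recomputed from it.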
Streaming platform
Kappa is mainstream at Uber, Shopify, Disney, and Twitter. In Lambda, a batch processing system periodically extracts from sources (typically transactional systems in enterprises) alongside the natural streaming sources (e.g. clickstream; device/sensor telemetry such as package tracking, temperature-sensitive goods monitoring, and workplace safety surveillance; Twitter; in-game player activity). The main difference in Kappa is that there is no batch layer: everything is turned into a stream first, and queries run directly on the real-time layer. This is made practical by Kafka tiered storage; otherwise storing long-lived data in Kafka is expensive. A small replay sketch follows.
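Reprocessing in Kappa amounts to replaying the topic from the beginning rather than maintaining a separate batch path. Below is a tiny Structured Streaming sketch of that idea (broker and topic names are placeholders); it is only economical when tiered storage keeps the full history in the topic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa_replay_sketch").getOrCreate()

# Reprocessing in Kappa = re-reading the same topic from the earliest offset;
# there is no separate batch extract, the stream is the system of record.
clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder broker
    .option("subscribe", "clickstream")                # placeholder topic
    .option("startingOffsets", "earliest")             # replay full history
    .load()
)

query = (
    clicks.selectExpr("CAST(value AS STRING) AS event")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/chk/kappa_replay")
    .start()
)
query.awaitTermination()
```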