Modern data stack stories roundup 2023.1
General data pipeline
Talks about the data platform for Coupang Eats, an online food ordering and delivery service: a centralized data platform supporting training and serving of ML models and other data science services. Three types of data pipelines are described:
Non-real time: for ML feature production, user profiling and tag generation, and data visualization. At the last step, a configuration-driven pipeline pushes offline features and signals to online storage; the configuration defines the feature group to sync (from Hive tables), a generic Spark job definition, and scheduling information (a configuration sketch follows this list).
Near real-time: ingest from Kafka or from Hive on cloud storage into an OLAP engine; wide tables are then created by joining multiple source tables with OLAP engine SQL, and OLAP engine SQL is used again to generate metrics and signals (a SQL sketch follows this list).
Pure real-time: for low-latency scenarios like flood detection and risk control; a configuration-based pipeline reads from Kafka and outputs real-time features back to Kafka, calculating statistical aggregates like SUM, COUNT, UNIQUE COUNT, and TOPN (a streaming sketch follows this list).
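The article does not show the configuration format for the non-real-time sync, so here is a minimal sketch of the idea behind a configuration-driven offline-to-online feature push, assuming a hypothetical config dict and a generic PySpark job. The table, column, and online-store names are illustrative, not Coupang's actual setup.

```python
from pyspark.sql import SparkSession

# Hypothetical configuration: which Hive feature group to sync, how, and when.
# None of these keys come from the article; they only illustrate the idea of a
# configuration-driven "offline -> online" feature sync.
feature_sync_config = {
    "feature_group": "eats_user_features",        # Hive table with offline features
    "key_columns": ["user_id"],                   # key in the online store
    "feature_columns": ["order_cnt_30d", "avg_basket_size"],
    "online_store": {"type": "redis", "host": "redis.internal", "port": 6379},
    "schedule": "0 2 * * *",                      # cron expression for the daily sync
}


def run_feature_sync(config: dict) -> None:
    """Generic Spark job: read the offline feature group and push rows online."""
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    cols = config["key_columns"] + config["feature_columns"]
    df = spark.table(config["feature_group"]).select(*cols)

    def push_partition(rows):
        # A real job would open a client to the online store here (e.g. Redis)
        # and write each row as key -> feature map; this stub only iterates.
        for row in rows:
            _ = row.asDict()  # e.g. client.hset(row["user_id"], mapping=...)

    df.foreachPartition(push_partition)


if __name__ == "__main__":
    run_feature_sync(feature_sync_config)
```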
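The near-real-time path is essentially two SQL steps. The article names neither the OLAP engine nor the schemas, so the sketch below uses Spark SQL as a stand-in and made-up table names; in practice both statements would run inside the OLAP engine itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1: build a wide table by joining several ingested source tables
# (schemas and names are made up for illustration).
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.orders_wide AS
    SELECT o.order_id, o.user_id, o.store_id, o.amount, o.order_ts,
           u.city, s.category
    FROM ods.orders o
    JOIN ods.users  u ON o.user_id  = u.user_id
    JOIN ods.stores s ON o.store_id = s.store_id
""")

# Step 2: query the wide table again to produce metrics/signals.
metrics = spark.sql("""
    SELECT city,
           COUNT(*)                AS order_cnt,
           SUM(amount)             AS gmv,
           COUNT(DISTINCT user_id) AS active_users
    FROM dw.orders_wide
    WHERE order_ts >= date_sub(current_date(), 1)
    GROUP BY city
""")
metrics.show()
```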
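For the pure real-time path, here is a rough sketch of the kind of computation involved, written with Spark Structured Streaming (the article describes a configuration-based pipeline and does not name the engine). Broker, topic, and field names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("realtime_features_sketch").getOrCreate()

# Read order events from Kafka (broker, topic, and field names are placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order_events")
    .load()
    .select(F.from_json(
        F.col("value").cast("string"),
        "user_id STRING, store_id STRING, amount DOUBLE, event_ts TIMESTAMP").alias("e"))
    .select("e.*")
)

# Windowed statistical aggregates per user: SUM, COUNT, and an approximate
# UNIQUE COUNT (exact distinct counts and TOPN need heavier state management).
features = (
    events.withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"), "user_id")
    .agg(F.sum("amount").alias("amount_sum"),
         F.count("*").alias("order_count"),
         F.approx_count_distinct("store_id").alias("unique_stores"))
)

# Serialize each feature row as JSON and write it back to another Kafka topic.
query = (
    features.selectExpr("to_json(struct(*)) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "realtime_features")
    .option("checkpointLocation", "/tmp/chk/realtime_features")
    .outputMode("update")
    .start()
)
query.awaitTermination()
```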
Talks about Apache projects in the modern data stack, focusing on Apache SeaTunnel (incubating), a project for synchronizing data and connecting different systems, and Apache DolphinScheduler, a data orchestration system and Airflow alternative that is drag-and-drop based.
Apache SeaTunnel (incubating) is a project for data synchronization. It now supports 50+ connectors, including 20+ data sources, 20+ types of sinks, and 10+ types of transforms, covering systems like MySQL, Presto, PostgreSQL, TiDB, and Elasticsearch (as well as Hive, Hudi, and other data lake formats). SeaTunnel can run on Spark, Flink, or its own SeaTunnel engine, and supports CDC (e.g. MySQL CDC and Kafka, but not yet via Debezium).
DolphinScheduler can schedule SeaTunnel sink jobs, so batch data can be fed back to Oracle, SAP, SaaS systems, social media, and relational databases.
Data mesh: it isn’t a platform or a service that you can buy off the shelf. It’s a design concept built on ideas like distributed ownership, domain-based design, data discoverability, and data product shipping standards (my understanding is that it’s more of a paradigm, like microservices, which takes a long time to take root in people’s minds and then land). It is still in its early phases as teams figure out what implementing the data mesh really means, and the mesh tooling stack is still immature. Suggestion: it is important to stick to the first principles at a conceptual level, rather than buy into the hype.
Metrics layer (still hype to me for centralizing business metrics across the enterprise): although dbt Labs’ Semantic Layer launched in October 2022, along with integrations across the modern data stack from companies like Hex, Mode, ThoughtSpot, and Atlan, the change management process of getting people to write metrics is massive, and the switch to a metrics layer will more likely take years rather than months (is it really worth being a first-class citizen in the data space?).
Reverse ETL: previously I saw it as just another name for an application integration tool. Now vendors have shifted from talking about “pushing data” to actually driving customer use cases with data.
Active metadata: I think of it as ML-augmented metadata, where metadata is not entered manually but generated dynamically from code. Gartner, G2, Forrester, and the broader industry are starting to align on what a successful data catalog should look like, but truly third-gen data catalogs are yet to be seen.
Data Observability: not sure where data observability is heading, towards independence or a merger with data reliability, active metadata, or some other category. At a high level, though, it seems to be moving closer to data quality, with a focus on ensuring high-quality data rather than on active metadata.
The article talks about several work streams in an enterprise data mesh journey:
- Strategy Stream, which establishes key data mesh concepts and creates an implementation plan while mapping the key opportunities and risks (10–12 weeks). The main outputs are the architecture, roadmap, and risks; a successful outcome is gaining buy-in from a diverse set of stakeholders.
- Technology Stream, which defines and builds the technology foundation and industrialization activities required for the enterprise data mesh (16–24 weeks). Its two activities are (a) the buildout of the foundational technology components: a Registry/Catalog (to easily find, discover, observe, and operate data products), an access interface / Federated Query Platform (to consume and share data managed by data products), an API Platform (to consume data via APIs), and a Streaming/Event Platform; and (b) the industrialization of those components: Security, Operability, Observability, Support.
- Factory Stream, which introduces repeatable processes and templates to permit rapid scaling of the enterprise data mesh. Running in parallel with the Technology Stream, it delivers 3–5 MVPs (8–10 weeks) and creates a data product factory consisting of Repeatable Processes and Templates, DevSecOps Tools and Pipelines, and Secure Environments.
- Operating Model Stream, which defines the team structure, interactions, and governance techniques needed to build and operate the enterprise data mesh.
- Socialization Stream, which is used not only to communicate successes but also to continuously build the momentum required to build the enterprise data mesh. It can use any available communication vehicles (articles, blogs, podcasts, presentations, “office-hours” sessions, or “lunch-and-learn” sessions) to engage stakeholders and to find and sign up sponsors.
- Rollout Stream, which accelerates the adoption of data products within the enterprise data mesh.
Talks about the Medallion architecture, a popular architecture for building a lakehouse. Bronze: raw/unprocessed, immutable/append-only, using interval-partitioned tables, for example a YYYYMMDD or datetime folder structure. Silver: data quality rules applied (dealing with missing/duplicate/inconsistent/inaccurate values), optionally SCD handling, Delta as the storage format (or at least Parquet), and data enrichment. Gold: complex business rules applied, calculations, enrichments, and use-case-specific optimizations. A minimal sketch of the three layers follows.
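This is only a sketch of the layer-by-layer flow, assuming Delta Lake is available on the cluster and using made-up paths, table names, and columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("medallion_sketch")
         .getOrCreate())  # assumes the Delta Lake package is on the classpath

# Bronze: land raw data unmodified, append-only, partitioned by load date.
raw = spark.read.json("s3://lake/landing/orders/")           # illustrative path
(raw.withColumn("load_date", F.date_format(F.current_date(), "yyyyMMdd"))
    .write.format("delta").mode("append")
    .partitionBy("load_date")
    .save("s3://lake/bronze/orders"))

# Silver: apply data quality rules (drop duplicates, drop/fix missing values)
# and store in Delta so later updates and SCD handling are possible.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")
silver = (bronze.dropDuplicates(["order_id"])
                .na.drop(subset=["order_id", "amount"])
                .withColumn("amount", F.col("amount").cast("double")))
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/orders")

# Gold: apply business rules and aggregations optimized for a specific use case.
gold = (silver.groupBy("store_id")
              .agg(F.sum("amount").alias("gmv"),
                   F.countDistinct("order_id").alias("order_cnt")))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/store_sales")
```

The key property is that Bronze is never rewritten (append-only, partitioned by load date), so Silver and Gold can always be recomputed from it.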
Streaming platform
Kappa is mainstream at Uber, Shopify, Disney, and Twitter. In Lambda, a batch processing system periodically extracts from sources (typically transactional systems in enterprises) alongside the natural streaming sources (e.g. clickstream; device/sensor telemetry such as package tracking, temperature-sensitive goods monitoring, and workplace safety surveillance; Twitter; in-game player activity). The main difference in Kappa is that there is no batch layer: everything is turned into a stream first, and queries run directly on the real-time layer. This is made practical by Kafka tiered storage; otherwise storing long-lived data in Kafka is expensive. A small replay sketch follows.
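Reprocessing in Kappa amounts to replaying the topic from the beginning rather than maintaining a separate batch path. Below is a tiny Structured Streaming sketch of that idea (broker and topic names are placeholders); it is only economical when tiered storage keeps the full history in the topic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa_replay_sketch").getOrCreate()

# Reprocessing in Kappa = re-reading the same topic from the earliest offset;
# there is no separate batch extract, the stream is the system of record.
clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder broker
    .option("subscribe", "clickstream")                # placeholder topic
    .option("startingOffsets", "earliest")             # replay full history
    .load()
)

query = (
    clicks.selectExpr("CAST(value AS STRING) AS event")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/chk/kappa_replay")
    .start()
)
query.awaitTermination()
```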