Databricks Data Intelligence Platform

Latest about Unified Data AI Platform

Xin Cheng
Sep 26, 2024

At Databricks Data + AI Summit 2024, we saw exciting new and enhanced features for data and AI practitioners across the platform, data engineering, data warehousing, and AI/GenAI.

Platform overview

The speaker first lays out the challenges: multimodal requirements for intelligence, data silos, and data integration

Then goes over Databricks Data Intelligence Platform

Unity Catalog capabilities, with a demo of data asset discovery

Databricks Assistant (available at no additional cost), which assists with writing PySpark and SQL code

AI functions, predictive I/O, Gen AI

Serverless

Classic compute takes time to start. Serverless starts in about 10 seconds, with no knobs for version, instance type, scaling/spot parameters, instance profiles, Spark config, environments, etc. Not available for streaming yet

Data engineering

Gartner Magic Quadrant leader

Slightly modified Medallion Architecture: bronze (raw, schema validated), silver (cleansed & validated, conformed/clean), gold (analytical models, reporting models)
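To make the three layers concrete, here is a minimal plain-Python sketch (no Spark or DLT involved). The order records, column names, and function names are all hypothetical and only illustrate what each layer is responsible for.

```python
from datetime import date

# Hypothetical raw order records as they might land from a source system.
RAW_ORDERS = [
    {"order_id": "1", "amount": "19.90", "day": "2024-06-10"},
    {"order_id": "2", "amount": "n/a", "day": "2024-06-10"},   # unparseable amount
    {"order_id": "3", "amount": "5.10", "day": "2024-06-11"},
    {"order_id": "3", "amount": "5.10", "day": "2024-06-11"},  # duplicate
    {"order_id": "4", "day": "2024-06-11"},                     # missing column
]
REQUIRED_COLUMNS = {"order_id", "amount", "day"}


def to_bronze(rows):
    """Bronze: keep raw values, reject only rows that fail the schema check."""
    return [r for r in rows if REQUIRED_COLUMNS <= r.keys()]


def to_silver(bronze):
    """Silver: cast types, drop unparseable rows, de-duplicate on order_id."""
    seen, out = set(), []
    for r in bronze:
        try:
            row = {
                "order_id": int(r["order_id"]),
                "amount": float(r["amount"]),
                "day": date.fromisoformat(r["day"]),
            }
        except ValueError:
            continue  # cleansing step: skip rows that cannot be cast
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(row)
    return out


def to_gold(silver):
    """Gold: a reporting model, here revenue per day."""
    totals = {}
    for r in silver:
        totals[r["day"]] = totals.get(r["day"], 0.0) + r["amount"]
    return totals
```

Chaining `to_gold(to_silver(to_bronze(RAW_ORDERS)))` leaves only orders 1 and 3 in silver and one revenue figure per day in gold.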

Real-world zones: landing, raw, base, enriched, curated, semantic

Leader in Stream Processing and Cloud Data Pipelines

Multiple native, third-party data sources in “data ingestion” UI

Announced serverless DLT (in addition to serverless notebooks, jobs/workflows, pipelines, SQL warehouses)

Databricks workflows to orchestrate anything

Integration with other partners

Demos GenAI-assisted metadata capture/code development

Databricks Workflows plays the role that Airflow plays elsewhere, but on Databricks

For streaming DLT, the SQL is “CREATE OR REFRESH STREAMING TABLE” or “ALTER STREAMING TABLE”

DLT and dbt pipelines support materialized views (MVs) and streaming tables (STs). DLT supports flows and sinks (the sample shows a Kafka sink, in private preview).

APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables. SCD Type 2 support.
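As a rough illustration of what SCD Type 2 means, the plain-Python sketch below replays change events in sequence order and keeps every historical version of a row. It is not the DLT API: the function name, the `key` and `sequence_by` parameters, and the sample events are assumptions for illustration, and the `__START_AT`/`__END_AT` column names are only chosen to resemble the validity columns DLT writes for SCD Type 2 targets.

```python
# Hypothetical change events from a source system (may arrive out of order).
EVENTS = [
    {"id": 1, "city": "Oslo", "seq": 1},
    {"id": 1, "city": "Bergen", "seq": 3},
    {"id": 2, "city": "Lima", "seq": 2},
    {"id": 1, "city": "Bergen", "seq": 4},  # no attribute change
]


def apply_changes_scd2(events, key, sequence_by):
    """Build SCD Type 2 history: one row per version of each key.

    The open (current) version of a key has __END_AT set to None.
    """
    history = []  # every version, closed or still open
    current = {}  # key value -> its open version
    for ev in sorted(events, key=lambda e: e[sequence_by]):
        k, seq = ev[key], ev[sequence_by]
        attrs = {c: v for c, v in ev.items() if c not in (key, sequence_by)}
        open_row = current.get(k)
        if open_row is not None:
            if all(open_row.get(c) == v for c, v in attrs.items()):
                continue  # nothing changed, keep the open version as is
            open_row["__END_AT"] = seq  # close the previous version
        new_row = {key: k, **attrs, "__START_AT": seq, "__END_AT": None}
        history.append(new_row)
        current[k] = new_row
    return history
```

Calling `apply_changes_scd2(EVENTS, key="id", sequence_by="seq")` yields three versions: Oslo (closed at sequence 3), Lima (open), and Bergen (open). SCD Type 1 would instead overwrite the row and keep only the latest version.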

Vertical autoscaling: serverless DLT pipelines add vertical autoscaling on top of the horizontal autoscaling provided by Databricks Enhanced Autoscaling, automatically allocating the most cost-efficient instance type (mainly to resolve out-of-memory errors). DLT operational dashboard

Incremental refresh for MVs with cost-based optimization by Enzyme. Serverless DLT: 4x throughput and 32% lower TCO vs. classic DLT. DLT sample with NASA data.

Other videos 1, 2

Databricks workflows

Modern data engineering requires modern data orchestration: ETL, BI dashboard refresh, real-time application, ML Model training

Budget monitoring, alerting

Trigger types: scheduled, file arrival, table update, continuous

Serverless key technologies: warm pool of VMs (fast up- and down-scaling), horizontal autoscaler, versionless (latest DBR and Photon features applied automatically), Lakeguard, environment caching, smart fail-over, query insights (lineage), automatic vertical scaling

Ingestion Connectors: simplify previous steps of ETL, structured streaming, notebook

Databricks LakeFlow (like mapping data flows in Azure Data Factory): a unified solution for data engineering, comprising Connect (ingest), Pipelines (transform), and Jobs (orchestrate), built on declarative Delta Live Tables. Databricks LakeFlow is native to the Data Intelligence Platform, providing serverless compute and unified governance with Unity Catalog.

Data warehousing

Data warehousing platform

Leader in the Gartner Cloud Database Management Systems MQ

Leader in Forrester Wave for Data Lakehouses

Databricks SQL features: Intelligent Workload Management, predictive I/O, new features

Starrocks 4.6 times faster than Trino

Lakehouse Federation

Lakehouse Federation, smart pushdown, sharing for Lakehouse Federation (share data from any DB without ETL)

Manage connections

foreign catalogs

HMS federation

Monitoring

Challenges in adopting DQ tools: high friction, hard to scale, noisy alerts

Data and AI asset types: bronze/silver/gold, time series, MVs, STs, features, model, inference table (profiling, data drift)

Demos how to set up a table monitor to capture quality and statistical changes, expectations on STs and MVs, a health dashboard, intelligent forecasting, and classifiers for 16 types of PII

6 dimensions of DQ and techniques to handle them: consistency, accuracy, validity, completeness, uniqueness, timeliness
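Several of these dimensions can be measured with simple ratios. The sketch below covers four of them (completeness, uniqueness, validity, timeliness) in plain Python. The sample rows, column names, and the email rule are hypothetical, and this is not how Lakehouse Monitoring computes its metrics.

```python
from datetime import datetime, timedelta

# Hypothetical customer rows with typical quality problems.
ROWS = [
    {"id": 1, "email": "a@x.io", "updated": datetime(2024, 6, 10)},
    {"id": 2, "email": None, "updated": datetime(2024, 6, 10)},   # missing email
    {"id": 2, "email": "bad", "updated": datetime(2024, 6, 1)},   # dup id, stale
    {"id": 4, "email": "d@x.io", "updated": datetime(2024, 6, 9)},
]


def non_null(rows, column):
    """Values of a column, ignoring missing or None entries."""
    return [r[column] for r in rows if r.get(column) is not None]


def completeness(rows, column):
    """Share of rows where the column is populated."""
    return len(non_null(rows, column)) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among the populated values."""
    values = non_null(rows, column)
    return len(set(values)) / len(values)


def validity(rows, column, rule):
    """Share of populated values that satisfy a business rule."""
    values = non_null(rows, column)
    return sum(1 for v in values if rule(v)) / len(values)


def timeliness(rows, column, now, max_age):
    """Share of populated timestamps no older than max_age."""
    values = non_null(rows, column)
    return sum(1 for v in values if now - v <= max_age) / len(values)
```

For example, `completeness(ROWS, "email")` is 0.75 because one of the four rows has no email, and `validity(ROWS, "email", lambda v: "@" in v)` is 2/3. Consistency and accuracy usually need a second dataset or a source of truth to compare against, so they are left out here.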

Discusses DataOps, MLOps, LLMOps and tooling

AI/GenAI

Compound AI system, components to drive accuracy: models (embedding, LLMs), data (retrieval), tools (function calling, tools in unity catalog), evaluation (tracing, LLM & human judges), serving, monitoring, governance (model, embedding, tools, access) and guardrails

Demos AutoML on Databricks Mosaic AI

Data Science and Machine Learning Platforms

Query performance depends on many factors: file sizes (AI-optimized file sizes based on data characteristics and query patterns), data layout (partitioning/Z-order, the small-files problem, liquid clustering that intelligently balances clustering against file size, row-level concurrency to avoid conflicts between concurrent partition updates, automatic liquid clustering key selection with CLUSTER BY AUTO), and continuous maintenance with ANALYZE/VACUUM (predictive optimization, automatic statistics)

Delta table system table managed_tables: table metadata

Smart Manufacturing with Sight Machine

GenAI

Supports prompt engineering, RAG, fine-tuning, pre-training

AI Playground to experiment with LLMs

On Dataiku LLMOps tooling

125 LLMs available in model landscape

LLM: non-deterministic (LLM-as-a-judge), need human review (guardrails), cost (cost review)

Weighting system for LLM-as-a-judge: e.g. 60% correctness, 20% faithfulness, 20% professionalism
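The weighting is a plain weighted sum over per-metric judge scores. A minimal sketch, assuming each metric is scored by the judge on the same numeric scale (the 1 to 5 scale and the function name are assumptions, not something stated in the session):

```python
# Weights from the example above; they must add up to 1.
WEIGHTS = {"correctness": 0.6, "faithfulness": 0.2, "professionalism": 0.2}


def weighted_judge_score(scores, weights=WEIGHTS):
    """Combine per-metric judge scores into one number using the weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing judge scores for: {sorted(missing)}")
    return sum(weights[m] * scores[m] for m in weights)


# Hypothetical judge output for one answer, on an assumed 1-5 scale.
example = {"correctness": 4, "faithfulness": 5, "professionalism": 3}
overall = weighted_judge_score(example)  # 0.6*4 + 0.2*5 + 0.2*3, about 4.0
```

Because correctness carries 60% of the weight, a polished but wrong answer still scores low overall.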

Evaluation

Mentions ChainPoll: LLM as a judge, CoT explanation, correctness for open-domain errors, context adherence for closed-domain errors. Guardrails metrics: toxicity, PII info, tone, sexism, custom metrics.

Metrics: groundedness (based on the retrieved context), correctness, safety, relevance

Mosaic AI agent evaluation: built-in LLM as a judge; framework: serving, lakehouse monitoring

Vector search, RAG, Agent

Explains that vector search is based on latent features (uses age, gender, and royalty features with king, queen, prince, and princess to explain similarities in 3D space). Databricks Vector Search: fully managed, native lakehouse integration, supports hybrid search (semantic, keyword), re-ranking
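The king/queen illustration can be reproduced with hand-made vectors. In the sketch below the three coordinates stand for (age, gender, royalty); the numeric values are made up for illustration, whereas real embeddings are learned and have hundreds of dimensions whose meaning is not labeled.

```python
import math

# Hypothetical 3-D "embeddings": (age, gender, royalty), each between 0 and 1.
WORDS = {
    "king": (0.9, 0.1, 1.0),
    "queen": (0.9, 0.9, 1.0),
    "prince": (0.2, 0.1, 1.0),
    "princess": (0.2, 0.9, 1.0),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, k=1, exclude=()):
    """Return the k words whose vectors are most similar to the query vector."""
    candidates = [w for w in WORDS if w not in exclude]
    candidates.sort(key=lambda w: cosine(query, WORDS[w]), reverse=True)
    return candidates[:k]


# Analogy by vector arithmetic: king - prince + princess keeps the age and
# royalty of "king" but takes the gender of "princess".
analogy = tuple(
    k - p + q for k, p, q in zip(WORDS["king"], WORDS["prince"], WORDS["princess"])
)
closest = nearest(analogy)
```

With these toy values `closest` is `["queen"]`. A vector index does the same nearest-neighbor lookup, only over many more vectors and with approximate search to stay fast.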

Two types of Vector Search indexes on the Databricks platform:

  • Delta Sync Index, which automatically syncs with a source Delta Table, automatically and incrementally updating the index as the underlying data in the Delta Table changes.
  • Direct Vector Access Index, which supports direct read and write of vectors and metadata. The user is responsible for updating this table using the REST API or the Python SDK. This type of index is created using REST API or the SDK.

Delta Sync Index provides an easy-to-use, automatic, managed ETL pipeline that keeps the vector index up to date, and it can use Databricks-managed embeddings computation.

Explains RAG app development, deployment process on Databricks

Discusses the RAG pipeline and evaluation-driven development (an answer can be one of many right ones or merely more right, so a human or model judge is needed). Covers evaluation sets, Agent Framework, Agent Evaluation, and the Databricks Generative AI Cookbook. Uses an evaluation set and MLflow to find answers that are grounded but incorrect, then corrects the chunking strategy, which increases accuracy and reduces cost/latency.
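The "grounded but incorrect" bucket is worth isolating because it points at retrieval rather than generation: the model followed its context faithfully, but the context was wrong or incomplete. A minimal sketch of that triage step, assuming each evaluation record already carries boolean judge verdicts (the record layout and field names are hypothetical, not the Agent Evaluation output schema):

```python
from collections import Counter

# Hypothetical judged evaluation records for one version of a RAG chain.
RESULTS = [
    {"question": "q1", "grounded": True, "correct": True},
    {"question": "q2", "grounded": True, "correct": False},
    {"question": "q3", "grounded": False, "correct": False},
    {"question": "q4", "grounded": True, "correct": False},
]


def bucket(record):
    """Classify one judged answer by where the failure most likely sits."""
    if record["grounded"] and record["correct"]:
        return "ok"
    if record["grounded"]:
        # Faithful to the retrieved context but still wrong:
        # suspect retrieval or chunking, not the LLM.
        return "grounded_but_incorrect"
    if record["correct"]:
        return "correct_but_ungrounded"
    return "ungrounded_and_incorrect"


def summarize(results):
    """Count how many evaluation records fall into each bucket."""
    return Counter(bucket(r) for r in results)


def questions_in(results, name):
    """List the questions in one bucket, for manual review of their chunks."""
    return [r["question"] for r in results if bucket(r) == name]
```

Re-running the same evaluation set after a chunking change and comparing the bucket counts gives a direct before/after measure.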

Hitachi Solutions discusses centralized/automated vectorization with a Delta Sync Index

Basic introduction to flow engineering with LangGraph, LCEL, and LangChain Hub templates (e.g. Self-RAG retriever grader, Self-RAG question rewriter)

Responsible AI

3 Pillars: Transparency, Effectiveness, Reliability

Microsoft Fabric

Similar to the Databricks Data Intelligence Platform (unified storage, DI, DE, DS, DW, BI experience), plus Kusto query/KQL for real-time intelligence and Data Activator, which resembles Power Automate.

Recommended MS Fabric YouTube channels and blogs: Advancing Analytics, KratosBI, Data Mozart, Learn Microsoft Fabric with Will, Azure Synapse Analytics, fabric.guru

Capacities: the foundation for the simplicity and flexibility of Fabric’s licensing model.

V-Order: a write-time optimization to the Parquet file format that enables lightning-fast reads under the Microsoft Fabric compute engines, such as Power BI, SQL, Spark, and others.

Apache Iceberg support: bi-directional data access between Snowflake and Fabric

OneLake shortcuts: enables data to be reused without copying it, like federation

Mirroring: a low-cost and low-latency solution to bring data from various systems together into a single analytics platform

External data sharing

High Concurrency Mode: allows sharing of Spark compute across multiple notebooks and allows their queries to execute in parallel.

ML model training, tracking, and scoring

Data Activator’s items are called reflexes.

Semantic link: allows you to establish a connection between semantic models and Synapse Data Science in Microsoft Fabric.

Microsoft Purview to govern Microsoft Fabric

Microsoft Fabric roadmap

Appendix

Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified