Databricks Data Intelligence Platform
At Databricks Data + AI Summit 2024, we saw exciting new and enhanced features for data and AI practitioners across platform, data engineering, data warehousing, and AI/GenAI.
Platform overview
The speaker first lays out the challenges: multimodal requirements for intelligence, data silos, and data integration
Then walks through the Databricks Data Intelligence Platform
Unity Catalog capabilities, with a demo of data asset discovery
Databricks Assistant (available at no additional cost), which assists with writing PySpark and SQL code
AI functions, predictive I/O, GenAI
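A minimal sketch (not from the talk) of calling a Databricks SQL AI function from PySpark; the reviews table and column are hypothetical, and `spark` is the session available in a Databricks notebook:

```python
# Hedged sketch: AI Functions in Databricks SQL, invoked from PySpark.
# `reviews` and `review_text` are hypothetical; ai_analyze_sentiment is one of the
# built-in AI functions (ai_query, ai_classify, ai_extract, ... follow the same pattern).
df = spark.sql("""
    SELECT review_text,
           ai_analyze_sentiment(review_text) AS sentiment
    FROM reviews
    LIMIT 10
""")
df.show(truncate=False)
```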
Serverless
Classic compute takes time to start; serverless starts in about 10 seconds (no knobs for version, instance type, scaling/spot parameters, instance profiles, Spark config, environments, etc.). Not yet available for streaming
Data engineering
Gartner Magic Quadrant leader
A slightly modified Medallion Architecture: bronze (raw, schema-validated), silver (cleansed and validated, conformed/clean), gold (analytical and reporting models); a minimal DLT sketch follows the zones note below
Real-world zones: landing, raw, base, enriched, curated, semantic
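A minimal bronze/silver/gold sketch using the DLT Python API; the source path, table names, and columns are my own illustrative assumptions, not from the session:

```python
# Hedged sketch of a medallion pipeline in Delta Live Tables (Python API).
# The source path, table names, and columns are illustrative assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw, schema-validated ingest via Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/demo/landing/orders"))

@dlt.table(comment="Silver: cleansed, validated, conformed")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return (dlt.read_stream("orders_bronze")
            .withColumn("order_ts", F.to_timestamp("order_ts")))

@dlt.table(comment="Gold: analytical/reporting aggregate")
def daily_revenue_gold():
    return (dlt.read("orders_silver")
            .groupBy(F.to_date("order_ts").alias("order_date"))
            .agg(F.sum("amount").alias("revenue")))
```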
Leader in Stream Processing and Cloud Data Pipelines
Multiple native and third-party data sources in the “Data Ingestion” UI
Announced serverless DLT (in addition to serverless notebooks, jobs/workflows, pipelines, and SQL warehouses)
Databricks workflows to orchestrate anything
Integration with other partners
Demos GenAI-assisted metadata capture/code development
Databricks Workflows is, in effect, Airflow on Databricks
For streaming tables in DLT, the SQL is “CREATE OR REFRESH STREAMING TABLE” or “ALTER STREAMING TABLE”
DLT and dbt pipelines support MVs and STs. DLT supports flows and sinks (the sample shows a Kafka sink, in private preview).
APPLY CHANGES APIs: simplify change data capture with Delta Live Tables, with SCD Type 2 support.
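A hedged sketch of the same APPLY CHANGES idea in the DLT Python API; the source table, key, and sequencing column are illustrative:

```python
# Hedged sketch: CDC with the APPLY CHANGES API (DLT Python), stored as SCD Type 2.
# `customers_cdc_feed`, the key, and the sequencing column are assumptions.
import dlt

dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc_feed",     # upstream CDC streaming table
    keys=["customer_id"],
    sequence_by="change_timestamp",  # ordering of change events
    stored_as_scd_type=2,            # keep full history (SCD Type 2)
)
```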
Vertical autoscaling: serverless DLT pipelines add to the horizontal autoscaling provided by Databricks Enhanced Autoscaling by automatically allocating the most cost-efficient instance type (mainly to solve out-of-memory errors). Also a DLT operational dashboard
Incremental refresh for MVs with cost-based optimization by Enzyme. Serverless DLT claims 4x throughput and 32% lower TCO vs. classic DLT. A DLT sample with NASA data.
Databricks workflows
Modern data engineering requires modern data orchestration: ETL, BI dashboard refreshes, real-time applications, ML model training
Trigger types: scheduled, file arrival, table update, continuous
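A hedged sketch of creating a job with a file-arrival trigger through the Jobs 2.1 REST API; host, token, notebook path, and the storage URL are placeholders, compute settings are omitted (serverless assumed), and the payload shape should be checked against current docs:

```python
# Hedged sketch: Jobs 2.1 API call with a file-arrival trigger (placeholders throughout).
import requests

payload = {
    "name": "ingest-on-file-arrival",
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Workspace/Users/me@example.com/ingest"},
        # compute settings omitted for brevity (serverless jobs assumed)
    }],
    "trigger": {
        "file_arrival": {"url": "abfss://landing@mystorage.dfs.core.windows.net/orders/"}
    },
}

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
print(resp.json())
```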
Serverless key technologies: warm pool of VMs (fast up- and down-scaling), horizontal autoscaler, versionless (automatically on the latest DBR features, Photon), Lakeguard, environment caching, smart failover, query insights (lineage), automatic vertical scaling
Ingestion connectors: simplify what previously required hand-written ETL, Structured Streaming, or notebook code
Databricks LakeFlow (similar to Mapping Data Flows in Azure Data Factory): a unified solution for data engineering, including Connect (ingest), Pipelines (transform), and Jobs (orchestrate), built on declarative Delta Live Tables. Databricks LakeFlow is native to the Data Intelligence Platform, providing serverless compute and unified governance with Unity Catalog.
Data warehousing
Leader in the Gartner Cloud Database Management Systems MQ
Leader in Forrester Wave for Data Lakehouses
Databricks SQL features: Intelligent Workload Management, predictive I/O, and other new features
StarRocks is 4.6x faster than Trino
Lakehouse Federation
Lakehouse Federation, smart pushdown, and sharing for Lakehouse Federation (share data from any database without ETL)
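A hedged sketch of wiring up Lakehouse Federation to a PostgreSQL source; the connection, catalog, secret scope, and table names are illustrative:

```python
# Hedged sketch: Lakehouse Federation setup via SQL from PySpark (names are placeholders).
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS postgres_conn TYPE postgresql
    OPTIONS (
      host 'pg.example.com',
      port '5432',
      user 'reader',
      password secret('demo_scope', 'pg_password')
    )
""")

spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS postgres_sales
    USING CONNECTION postgres_conn
    OPTIONS (database 'sales')
""")

# Queries against the foreign catalog are pushed down to the source where possible.
spark.sql("SELECT region, COUNT(*) AS orders FROM postgres_sales.public.orders GROUP BY region").show()
```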
Monitoring
Challenges in adopting DQ tools: high friction, hard to scale, noisy alerts
Data/AI asset types: bronze/silver/gold tables, time series, MVs, STs, features, models, inference tables (profiling, data drift)
Demos how to set up a table monitor to capture quality and statistical changes, expectations on STs and MVs, a health dashboard, intelligent forecasting, and classifiers for 16 types of PII
Six dimensions of DQ and techniques for handling them: consistency, accuracy, validity, completeness, uniqueness, timeliness
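A hedged sketch of how DLT expectations can encode a few of these dimensions (timeliness, completeness, validity) on a streaming table; the rule names and columns are my own assumptions:

```python
# Hedged sketch: DLT expectations as data-quality gates on a streaming table.
import dlt

@dlt.table(comment="Silver customers with data-quality gates")
@dlt.expect("timely_record", "ingest_date >= current_date() - INTERVAL 7 DAYS")  # timeliness: record metric only
@dlt.expect_or_drop("complete_email", "email IS NOT NULL")                       # completeness: drop failing rows
@dlt.expect_or_fail("valid_customer_id", "customer_id IS NOT NULL")              # validity: fail the update
def customers_silver():
    return dlt.read_stream("customers_bronze")
```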
Discusses DataOps, MLOps, LLMOps and tooling
AI/GenAI
Compound AI systems, with components that drive accuracy: models (embeddings, LLMs), data (retrieval), tools (function calling, tools in Unity Catalog), evaluation (tracing, LLM and human judges), serving, monitoring, governance (models, embeddings, tools, access), and guardrails
Demos AutoML on Databricks Mosaic AI
Data Science and Machine Learning Platforms
Query performance involves many factors: file sizes (AI-optimized file sizes based on data characteristics and query patterns); data layout (partitioning/Z-ordering and the small-files problem, liquid clustering intelligently balancing clustering vs. file size, row-level concurrency to avoid concurrent partition-update conflicts, automatic liquid clustering key selection with CLUSTER BY AUTO); continuous maintenance with ANALYZE/VACUUM (predictive optimization, automatic statistics)
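A hedged sketch of the corresponding SQL; the table and columns are placeholders, and CLUSTER BY AUTO assumes predictive optimization is enabled:

```python
# Hedged sketch: layout and maintenance commands from the session, run via spark.sql.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        sale_id BIGINT,
        sale_date DATE,
        amount DECIMAL(10, 2)
    ) CLUSTER BY AUTO  -- automatic liquid clustering key selection
""")

# Or choose liquid clustering keys explicitly instead of partitioning/ZORDER:
spark.sql("ALTER TABLE sales_fact CLUSTER BY (sale_date)")

# Maintenance that predictive optimization can otherwise schedule automatically:
spark.sql("OPTIMIZE sales_fact")
spark.sql("VACUUM sales_fact")
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS")
```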
Delta table system table managed_tables: table metadata
Smart Manufacturing with Sight Machine
GenAI
Supports prompt engineering, RAG, fine-tuning, pre-training
AI Playground to experiment with LLMs
On Dataiku's LLMOps tooling
125 LLMs available in the model landscape
LLM challenges and mitigations: non-determinism (LLM-as-a-judge), need for human review (guardrails), cost (cost review)
Weighting system for LLM-as-a-judge: e.g., 60% correctness, 20% faithfulness, 20% professionalism
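A tiny sketch of that weighting arithmetic; the criterion names and weights come from the note above, and the 0-1 score scale is an assumption:

```python
# Hedged sketch: combining per-criterion LLM-as-a-judge scores with the quoted weights.
WEIGHTS = {"correctness": 0.60, "faithfulness": 0.20, "professionalism": 0.20}

def overall_score(judge_scores: dict) -> float:
    """Weighted sum of per-criterion scores, each assumed to be in [0, 1]."""
    return sum(weight * judge_scores[criterion] for criterion, weight in WEIGHTS.items())

# 0.6*0.9 + 0.2*1.0 + 0.2*0.8 = 0.90
print(overall_score({"correctness": 0.9, "faithfulness": 1.0, "professionalism": 0.8}))
```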
Evaluation
Mentions ChainPoll: LLM as a judge, CoT explanation, correctness for open-domain errors, context adherence for closed-domain errors. Guardrails metrics: toxicity, PII info, tone, sexism, custom metrics.
Metrics: groundedness (based on the retrieved context), correctness, safety, relevance
Mosaic AI Agent Evaluation: built-in LLM-as-a-judge; the framework also covers serving and Lakehouse Monitoring
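A hedged sketch of running Mosaic AI Agent Evaluation's built-in judges via MLflow on a pre-computed response; the evaluation-set rows are invented, and the exact schema and returned metrics should be checked against current docs:

```python
# Hedged sketch: Mosaic AI Agent Evaluation's built-in judges through mlflow.evaluate.
import mlflow
import pandas as pd

eval_set = pd.DataFrame([{
    "request": "How do I keep a vector index in sync with a Delta table?",
    "response": "Create a Delta Sync Index so the index updates incrementally with the table.",
    "expected_response": "Use a Delta Sync Index, which incrementally syncs with the source Delta table.",
}])

results = mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",  # enables the built-in LLM judges
)
print(results.metrics)
```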
Vector search, RAG, Agent
Explains that vector search is based on latent features (uses age, gender, and royalty features with king, queen, prince, and princess to illustrate similarity in 3D space). Databricks Vector Search: fully managed, native lakehouse integration, supports hybrid search (semantic + keyword) and re-ranking
Two types of Vector Search indexes on the Databricks platform:
- Delta Sync Index, which automatically syncs with a source Delta table, automatically and incrementally updating the index as the underlying data changes.
- Direct Vector Access Index, which supports direct read and write of vectors and metadata; the user is responsible for creating and updating the index via the REST API or the Python SDK.
A Delta Sync Index provides an easy-to-use, automatic, managed ETL pipeline that keeps the vector index up to date, and it can use Databricks-managed embedding computation.
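A hedged sketch of both halves with the Python SDK: creating a Delta Sync Index with Databricks-managed embeddings, then querying it. The endpoint, table, and model names are placeholders, and the Vector Search endpoint is assumed to already exist:

```python
# Hedged sketch: Delta Sync Index with managed embeddings, then a similarity query.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                 # assumed existing Vector Search endpoint
    index_name="main.rag.docs_index",
    source_table_name="main.rag.docs",           # source Delta table (Change Data Feed enabled)
    pipeline_type="TRIGGERED",
    primary_key="doc_id",
    embedding_source_column="text",              # Databricks computes embeddings from this column
    embedding_model_endpoint_name="databricks-gte-large-en",
)

results = index.similarity_search(
    query_text="How does serverless DLT differ from classic DLT?",
    columns=["doc_id", "text"],
    num_results=3,
)
```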
Explains the RAG app development and deployment process on Databricks
Discusses the RAG pipeline and evaluation-driven development (many answers can be “right” or “more right,” so a human or model judge is needed). Covers the evaluation set, Agent Framework, Agent Evaluation, and the Databricks Generative AI Cookbook. Uses an evaluation set and MLflow to find responses that are grounded but incorrect, corrects the chunking strategy, and increases accuracy while reducing cost/latency.
Hitachi Solutions presents a solution for centralized/automated vectorization using a Delta Sync Index
Basic introduction to flow engineering with LangGraph, using LCEL and LangChain Hub templates (e.g., self-RAG retrieval grader, self-RAG question rewriter)
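A minimal LangGraph sketch in that spirit (retrieve → grade → rewrite-or-generate); the node bodies are stubs and the state keys are my own assumptions, not the Hub templates:

```python
# Hedged sketch: a self-RAG-style flow with LangGraph (stubbed node logic).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    documents: list[str]
    answer: str

def retrieve(state: RAGState) -> RAGState:
    # placeholder: call a retriever (e.g., a vector search index) here
    return {**state, "documents": ["retrieved chunk ..."]}

def grade_documents(state: RAGState) -> RAGState:
    # placeholder: an LLM grader would score document relevance here
    return state

def decide(state: RAGState) -> str:
    # route to rewrite when retrieval looks weak, else generate
    return "generate" if state["documents"] else "rewrite"

def rewrite_question(state: RAGState) -> RAGState:
    return {**state, "question": state["question"] + " (rewritten)"}

def generate(state: RAGState) -> RAGState:
    return {**state, "answer": "draft answer grounded in the retrieved documents"}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade_documents)
graph.add_node("rewrite", rewrite_question)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", decide, {"rewrite": "rewrite", "generate": "generate"})
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()
print(app.invoke({"question": "What is a Delta Sync index?", "documents": [], "answer": ""}))
```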
Responsible AI
3 Pillars: Transparency, Effectiveness, Reliability
Microsoft Fabric
Similar to the Databricks Data Intelligence Platform (unified storage plus DI, DE, DS, DW, and BI experiences), with the addition of Kusto queries/KQL for Real-Time Intelligence and Data Activator, which resembles Power Automate.
Recommended MS Fabric YouTube channels and blogs: Advancing Analytics, KratosBI, Data Mozart, Learn Microsoft Fabric with Will, Azure Synapse Analytics, fabric.guru
Capacities are the foundation of the simplicity and flexibility of Fabric's licensing model.
V-Order, 2: a write-time optimization of the Parquet file format that enables lightning-fast reads under Microsoft Fabric compute engines such as Power BI, SQL, Spark, and others.
Apache Iceberg support: bi-directional data access between Snowflake and Fabric
OneLake shortcuts: enable data to be reused without copying it, similar to federation
Mirroring, 2: a low-cost, low-latency solution to bring data from various systems together into a single analytics platform
High Concurrency Mode: allows sharing of Spark compute across multiple notebooks and allows their queries to execute in parallel.
ML model training, tracking, and scoring/2
Data Activator’s items are called reflexes.
Semantic link/2: allows you to establish a connection between semantic models and Synapse Data Science in Microsoft Fabric.
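A hedged sketch of Semantic Link from a Fabric notebook; the semantic model, table, measure, and column names are placeholders, and the SemPy call signatures should be checked against current docs:

```python
# Hedged sketch: reading a Power BI semantic model with Semantic Link (SemPy).
import sempy.fabric as fabric

models = fabric.list_datasets()                      # semantic models visible to this workspace
orders = fabric.read_table("Sales Model", "Orders")  # table from the model as a FabricDataFrame

revenue_by_region = fabric.evaluate_measure(
    "Sales Model",
    measure="Total Revenue",
    groupby_columns=["Orders[Region]"],
)
print(revenue_by_region.head())
```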
Microsoft Purview to govern Microsoft Fabric