
00-DBT-on-databricks


Adding an ingestion step before the dbt pipeline: the "Extract" part of your ETL

dbt doesn't offer a direct way to ingest data from external sources. Typical ingestion sources include:

  • Files delivered on blob storage (S3/ADLS/GCS...)
  • Message queues (Kafka)
  • External databases...

The Databricks Lakehouse fills this gap: you can leverage all our connectors, including partner solutions (e.g. Fivetran), to incrementally load new incoming data.

In this demo, our workflow will have 3 tasks:

  • 01: a task that incrementally extracts files from blob storage using Databricks Autoloader and saves the data in our raw layer (see the sketch after this list).
  • 02: a task that runs the dbt pipeline, performing the transformations on the data from these raw tables.
  • 03: a final task for downstream operations (e.g. ML predictions or refreshing a dashboard).
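
As a rough illustration, here is a minimal sketch of what task 01 could look like: Databricks Autoloader (cloudFiles) incrementally picks up new files from a landing folder and appends them to a raw Delta table. The paths and table name used here are placeholders, not the actual values from the demo.

```python
# Minimal Autoloader sketch: incrementally ingest new files into a raw table.
# The landing path, checkpoint location and table name are placeholders.
raw_path = "/demo/raw_incoming"                       # hypothetical landing folder on blob storage
checkpoint = "/demo/checkpoints/raw_transactions"     # hypothetical checkpoint & schema location

(spark.readStream
      .format("cloudFiles")                           # Databricks Autoloader
      .option("cloudFiles.format", "json")            # incoming file format (assumption)
      .option("cloudFiles.schemaLocation", checkpoint)
      .load(raw_path)
  .writeStream
      .option("checkpointLocation", checkpoint)
      .trigger(availableNow=True)                     # process all pending files, then stop
      .toTable("main.dbt_demo.raw_transactions"))     # hypothetical raw table consumed by dbt
```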

Accessing the dbt pipeline & demo content

A workflow with all the steps has been created for you

Click here to access your Workflow job; it was set up and started when you installed the demo.

The dbt project has been loaded as part of your Repos

Because the dbt integration works with git repos, we loaded the demo dbt project into your Repos folder:

Click here to explore the dbt pipeline installed as a repository
The workflow has been set up to use this repo.
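
For reference, the dbt task inside that Workflows job is essentially defined like the sketch below (expressed as a Python dict mirroring the Jobs API payload). The repo URL, warehouse id and schema are placeholder values, not the ones created by the demo installer.

```python
# Sketch of a Workflows job definition with a dbt task pointing at a git repo.
# The git URL, warehouse_id and schema below are placeholders.
job_settings = {
    "name": "dbt-on-databricks-demo",
    "git_source": {
        "git_url": "https://github.com/<your-org>/dbt-databricks-demo",  # hypothetical repo
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "02_dbt_run",
            "dbt_task": {
                "commands": ["dbt deps", "dbt run"],    # dbt CLI commands executed by the task
                "schema": "dbt_demo",                   # target schema (placeholder)
                "warehouse_id": "<sql-warehouse-id>",   # SQL warehouse running the models
            },
        }
    ],
}
```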

Going further with dbt

A note on Delta Live Tables

Delta Live Tables is a declarative framework built by Databricks. It can also be used to build data pipelines within Databricks and provides, among other things:

  • Ingestion capabilities to load data from any source within your pipeline, with no need for an external step (see the sketch after this list)
  • Out-of-the-box streaming capabilities for near-real-time inference
  • Incremental support: ingest and transform new data as it arrives
  • Advanced capabilities (simple Change Data Capture, SCD Type 2, etc.)
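
As a rough sketch (with hypothetical paths and table names), a DLT pipeline covering both the ingestion step and a transformation could look like this:

```python
import dlt
from pyspark.sql import functions as F

# Sketch of a DLT pipeline combining ingestion (Autoloader) and a transformation.
# The landing path and table names are hypothetical.

@dlt.table(comment="Raw files ingested incrementally with Autoloader")
def raw_transactions():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/demo/raw_incoming"))

@dlt.table(comment="Cleaned, incrementally updated version of the raw data")
def clean_transactions():
    return (dlt.read_stream("raw_transactions")
               .filter(F.col("amount") > 0))           # hypothetical cleaning rule
```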

If you wish to know more about DLT, install the DLT demo: dbdemos.install('dlt-loan')
