
00-DBT-on-databricks


Adding an ingestion step before the dbt pipeline: the "Extract" part of your ETL

dbt doesn't offer a direct way to ingest data from external sources. Typical ingestion sources include:

  • Files delivered on blob storage (S3/ADLS/GCS...)
  • Message queues (Kafka)
  • External databases...

The Databricks Lakehouse fills this gap: you can leverage all our connectors, including partner solutions (e.g. Fivetran), to incrementally load new incoming data.

In this demo, our workflow will have 3 tasks:

  • 01: a task that incrementally extracts files from blob storage using Databricks Autoloader and saves the data in our raw layer (see the sketch after this list).
  • 02: a task that runs the dbt pipeline, performing the transformations on the data from these raw tables.
  • 03: a final task for downstream operations (e.g. ML predictions or refreshing a dashboard).
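
As a rough illustration, here is a minimal sketch of what task 01 could look like: Databricks Autoloader (cloudFiles) incrementally picks up new files from a landing folder and appends them to a raw Delta table. The paths and table name used here are placeholders, not the actual values from the demo.

```python
# Minimal Autoloader sketch: incrementally ingest new files into a raw table.
# The landing path, checkpoint location and table name are placeholders.
raw_path = "/demo/raw_incoming"                       # hypothetical landing folder on blob storage
checkpoint = "/demo/checkpoints/raw_transactions"     # hypothetical checkpoint & schema location

(spark.readStream
      .format("cloudFiles")                           # Databricks Autoloader
      .option("cloudFiles.format", "json")            # incoming file format (assumption)
      .option("cloudFiles.schemaLocation", checkpoint)
      .load(raw_path)
  .writeStream
      .option("checkpointLocation", checkpoint)
      .trigger(availableNow=True)                     # process all pending files, then stop
      .toTable("main.dbt_demo.raw_transactions"))     # hypothetical raw table consumed by dbt
```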

Accessing the dbt pipeline & demo content

A workflow with all the steps has been created for you

Click here to access your Workflow job; it was set up and started when you installed the demo.

The dbt project has been loaded as part of your Repos

Because the dbt integration works with git repos, we loaded the demo dbt project into your Repos folder:

Click here to explore the dbt pipeline installed as a repository
The workflow has been set up to use this repo.
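
For reference, the dbt task inside that Workflows job is essentially defined like the sketch below (expressed as a Python dict mirroring the Jobs API payload). The repo URL, warehouse id and schema are placeholder values, not the ones created by the demo installer.

```python
# Sketch of a Workflows job definition with a dbt task pointing at a git repo.
# The git URL, warehouse_id and schema below are placeholders.
job_settings = {
    "name": "dbt-on-databricks-demo",
    "git_source": {
        "git_url": "https://github.com/<your-org>/dbt-databricks-demo",  # hypothetical repo
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "02_dbt_run",
            "dbt_task": {
                "commands": ["dbt deps", "dbt run"],    # dbt CLI commands executed by the task
                "schema": "dbt_demo",                   # target schema (placeholder)
                "warehouse_id": "<sql-warehouse-id>",   # SQL warehouse running the models
            },
        }
    ],
}
```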

Going further with dbt

A note on Delta Live Tables

Delta Live Tables is a declarative framework built by Databricks. It can also be used to build data pipelines within Databricks and provides, among other things:

  • Ingestion capabilities to load data from any source within your pipeline, with no need for an external step (see the sketch after this list)
  • Out-of-the-box streaming capabilities for near-real-time inference
  • Incremental support: ingest and transform new data as it arrives
  • Advanced capabilities (simple Change Data Capture, SCD Type 2, etc.)
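
As a rough sketch (with hypothetical paths and table names), a DLT pipeline covering both the ingestion step and a transformation could look like this:

```python
import dlt
from pyspark.sql import functions as F

# Sketch of a DLT pipeline combining ingestion (Autoloader) and a transformation.
# The landing path and table names are hypothetical.

@dlt.table(comment="Raw files ingested incrementally with Autoloader")
def raw_transactions():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/demo/raw_incoming"))

@dlt.table(comment="Cleaned, incrementally updated version of the raw data")
def clean_transactions():
    return (dlt.read_stream("raw_transactions")
               .filter(F.col("amount") > 0))           # hypothetical cleaning rule
```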

If you wish to know more about DLT, install the DLT demo: dbdemos.install('dlt-loan')
