%md-sandbox
## Simple ETL with DLT

DLT makes Data Engineering accessible for all. Just declare your transformations in SQL or Python, and DLT will handle the Data Engineering complexity for you.

<img style="float:right" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/dlt-golden-demo-loan-1.png" width="700"/>

**Accelerate ETL development** <br/>
Enable analysts and data engineers to innovate rapidly with simple pipeline development and maintenance

**Remove operational complexity** <br/>
By automating complex administrative tasks and gaining broader visibility into pipeline operations

**Trust your data** <br/>
With built-in quality controls and quality monitoring to ensure accurate and useful BI, Data Science, and ML

**Simplify batch and streaming** <br/>
With self-optimization and auto-scaling data pipelines for batch or streaming processing

## Simple ingestion with Lakeflow Connect

Lakeflow Connect offers built-in data ingestion connectors for popular SaaS applications, databases and file sources, such as Salesforce, Workday, and SQL Server, to build incremental data pipelines at scale, fully integrated with Databricks.

To give it a try, check our [Lakeflow Connect Product Tour](https://www.databricks.com/resources/demos/tours/platform/discover-databricks-lakeflow-connect-demo).

## Our DLT pipeline

We'll be using as input a raw dataset containing information on our customers' loans and historical transactions. Our goal is to ingest this data in near real time and build tables for our Analyst team while ensuring data quality.

**Your DLT Pipeline is ready!** Your pipeline was started using this notebook and is <a dbdemos-pipeline-id="dlt-loans" href="/#joblist/pipelines/460f840c-9ecc-4d19-a661-f60fd3a88297">available here</a>.

<!-- Collect usage data (view). Remove it to disable collection. View README for more details. -->
<img width="1px" src="https://ppxrzfxige.execute-api.us-west-2.amazonaws.com/v1/analytics?category=data-engineering&notebook=01-DLT-Loan-pipeline-SQL&demo_name=dlt-loans&event=VIEW">
%md
Our datasets come from 3 different systems and are saved under a cloud storage folder (S3/ADLS/GCS):

* `loans/raw_transactions` (new loan transactions uploaded here every few minutes)
* `loans/ref_accounting_treatment` (reference table, mostly static)
* `loans/historical_loans` (loans from a legacy system, new data added every week)

Let's ingest this data incrementally, and then compute a couple of aggregates that we'll need for our final dashboard to report our KPIs.
-- %fs ls /Volumes/main__build/dbdemos_dlt_loan/raw_data/raw_transactions
%md-sandbox
## Bronze layer: incrementally ingest data leveraging Databricks Autoloader

<img style="float: right; padding-left: 10px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/dlt-golden-demo-loan-2.png" width="600"/>

Our raw data is being sent to blob storage.

Autoloader simplifies this ingestion, including schema inference and schema evolution, while scaling to millions of incoming files.

Autoloader is available in SQL using the `read_files` function and can be used with a variety of formats (json, csv, avro...).

For more details on Autoloader, you can see `dbdemos.install('auto-loader')`

#### STREAMING LIVE TABLE

Defining tables as `STREAMING` guarantees that you only consume new incoming data. Without `STREAMING`, you will scan and ingest all the data available at once.

See the [documentation](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-incremental-data.html) for more details
CREATE STREAMING TABLE raw_txs
  COMMENT "New raw loan data incrementally ingested from cloud object storage landing zone"
AS SELECT * FROM STREAM read_files(
  '/Volumes/main__build/dbdemos_dlt_loan/raw_data/raw_transactions',
  format => 'json',
  inferColumnTypes => true)
CREATE STREAMING TABLE ref_accounting_treatment
  COMMENT "Lookup mapping for accounting codes"
AS SELECT * FROM STREAM delta.`/Volumes/main__build/dbdemos_dlt_loan/raw_data/ref_accounting_treatment`
-- as this is only refreshed on a weekly basis, we can lower the trigger interval
CREATE STREAMING TABLE raw_historical_loans
  TBLPROPERTIES ("pipelines.trigger.interval"="6 hour")
  COMMENT "Raw historical transactions"
AS SELECT * FROM STREAM read_files(
  '/Volumes/main__build/dbdemos_dlt_loan/raw_data/historical_loans',
  format => 'csv',
  inferColumnTypes => true)
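%md
`read_files` accepts additional options beyond `format` and `inferColumnTypes`. As an illustrative sketch only (not part of this pipeline), the statement below assumes you want to pin a few column types with `schemaHints` while letting Autoloader infer the rest; the table name is hypothetical.

```sql
-- Illustrative sketch only (hypothetical table name): pin selected column types
-- with schemaHints and let Autoloader infer everything else from the incoming JSON files.
CREATE STREAMING TABLE raw_txs_with_hints
  COMMENT "Example only: raw transactions ingested with explicit schema hints"
AS SELECT * FROM STREAM read_files(
  '/Volumes/main__build/dbdemos_dlt_loan/raw_data/raw_transactions',
  format => 'json',
  inferColumnTypes => true,
  schemaHints => 'balance DOUBLE, next_payment_date DATE')
```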
%md-sandbox
## Silver layer: joining tables while ensuring data quality

<img style="float: right; padding-left: 10px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/dlt-golden-demo-loan-3.png" width="600"/>

Once the bronze layer is defined, we'll create the silver layer by joining data. Note that bronze tables are referenced directly by name within the pipeline.

To consume only the increments from a Bronze table like `raw_txs`, we'll be using the `stream` keyword: `stream(raw_txs)`

Note that we don't have to worry about compactions, DLT handles that for us.

#### Expectations

By defining expectations (`CONSTRAINT <name> EXPECT <condition>`), you can enforce and track your data quality. See the [documentation](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-expectations.html) for more details
CREATE STREAMING LIVE VIEW new_txs
  COMMENT "Livestream of new transactions"
AS SELECT txs.*, ref.accounting_treatment as accounting_treatment
  FROM stream(raw_txs) txs
  INNER JOIN ref_accounting_treatment ref ON txs.accounting_treatment_id = ref.id
CREATE STREAMING TABLE cleaned_new_txs (
  CONSTRAINT `Payments should be this year`  EXPECT (next_payment_date > date('2020-12-31')),
  CONSTRAINT `Balance should be positive`    EXPECT (balance > 0 AND arrears_balance > 0) ON VIOLATION DROP ROW,
  CONSTRAINT `Cost center must be specified` EXPECT (cost_center_code IS NOT NULL) ON VIOLATION FAIL UPDATE
)
  COMMENT "Livestream of new transactions, cleaned and compliant"
AS SELECT * from STREAM(new_txs)
-- This is the inverse condition of the above statement to quarantine incorrect data for further analysis.
CREATE STREAMING TABLE quarantine_bad_txs (
  CONSTRAINT `Payments should be this year` EXPECT (next_payment_date <= date('2020-12-31')),
  CONSTRAINT `Balance should be positive`   EXPECT (balance <= 0 OR arrears_balance <= 0) ON VIOLATION DROP ROW
)
  COMMENT "Incorrect transactions requiring human analysis"
AS SELECT * from STREAM(new_txs)
CREATE MATERIALIZED VIEW historical_txs
  COMMENT "Historical loan transactions"
AS SELECT l.*, ref.accounting_treatment as accounting_treatment
  FROM raw_historical_loans l
  INNER JOIN ref_accounting_treatment ref ON l.accounting_treatment_id = ref.id
%md-sandbox
## Gold layer

<img style="float: right; padding-left: 10px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/dlt-golden-demo-loan-4.png" width="600"/>

Our last step is to materialize the Gold Layer.

Because these tables will be queried at scale using a SQL Endpoint, we'll add Z-Ordering at the table level using `pipelines.autoOptimize.zOrderCols` to ensure faster queries, and DLT will handle the rest.
CREATE MATERIALIZED VIEW total_loan_balances
  COMMENT "Combines historical and new loan data for unified rollup of loan balances"
  TBLPROPERTIES ("pipelines.autoOptimize.zOrderCols" = "location_code")
AS SELECT sum(revol_bal) AS bal, addr_state AS location_code FROM historical_txs GROUP BY addr_state
  UNION
  SELECT sum(balance) AS bal, country_code AS location_code FROM cleaned_new_txs GROUP BY country_code
CREATE MATERIALIZED VIEW new_loan_balances_by_cost_center
  COMMENT "Live table of new loan balances for consumption by different cost centers"
AS SELECT sum(balance) as sum_balance, cost_center_code
  FROM cleaned_new_txs
  GROUP BY cost_center_code
CREATE MATERIALIZED VIEW new_loan_balances_by_country
  COMMENT "Live table of new loan balances per country"
AS SELECT sum(count) as sum_count, country_code
  FROM cleaned_new_txs
  GROUP BY country_code
%md
## Next steps

Your DLT pipeline is ready to be started. <a dbdemos-pipeline-id="dlt-loans" href="/#joblist/pipelines/460f840c-9ecc-4d19-a661-f60fd3a88297">Click here to access the pipeline</a> created for you using this notebook.

To create a new one, open the DLT menu, create a pipeline and select this notebook to run it. To generate sample data, please run the [companion notebook]($./_resources/00-Loan-Data-Generator) (make sure the paths where you read and write the data are the same!)

Data Analysts can start using DBSQL to analyze the data and track our loan metrics. Data Scientists can also access the data to start building models to predict payment default or other more advanced use-cases.
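%md
As an illustration of the kind of DBSQL query an analyst could run on the gold layer, here is a minimal sketch over `total_loan_balances`. The fully qualified name is an assumption (it depends on the catalog and schema your pipeline publishes to); adjust it to your setup.

```sql
-- Illustrative sketch: top 10 locations by total loan balance from the gold table above.
-- The catalog/schema prefix is assumed; replace with wherever your pipeline publishes its tables.
SELECT location_code, bal
FROM main__build.dbdemos_dlt_loan.total_loan_balances
ORDER BY bal DESC
LIMIT 10
```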
%md-sandbox
## Tracking data quality

Expectation stats are automatically available as a system table. This information lets you monitor your data ingestion quality.

You can leverage DBSQL to query these tables and build custom alerts based on the metrics your business is tracking.

See [how to access your DLT metrics]($./03-Log-Analysis)

<img width="500" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/dlt/dlt-loans-dashboard.png?raw=true">

<a dbdemos-dashboard-id="dlt-expectations" href='/sql/dashboardsv3/01ef00cc36721f9e9f2028ee75723cc1' target="_blank">Data Quality Dashboard example</a>
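%md
As a minimal sketch of such a DBSQL query, the statement below aggregates expectation pass/fail counts from the pipeline event log via the `event_log` table-valued function. It assumes the pipeline publishes `cleaned_new_txs` to Unity Catalog under the same catalog/schema as the raw data; adjust the table name to your setup.

```sql
-- Illustrative sketch: aggregate expectation metrics from the DLT event log.
-- The fully qualified table name is an assumption; point it at a table your pipeline publishes.
SELECT
  row_expectations.dataset            AS dataset,
  row_expectations.name               AS expectation,
  SUM(row_expectations.passed_records) AS passing_records,
  SUM(row_expectations.failed_records) AS failing_records
FROM (
  SELECT explode(
           from_json(
             details:flow_progress.data_quality.expectations,
             "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
           )
         ) AS row_expectations
  FROM event_log(TABLE(main__build.dbdemos_dlt_loan.cleaned_new_txs))
  WHERE event_type = 'flow_progress'
)
GROUP BY row_expectations.dataset, row_expectations.name
```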