Delta Live Tables - Unit testing

Why testing?



Deploying tests on your DLT pipelines helps ensure that your ingestion remains stable and future-proof.

The tests can be deployed as part of a traditional CI/CD pipeline and run before deploying a new version, ensuring that the new version doesn't introduce a regression.

This is critical in the Lakehouse ecosystem, as the data we produce will then be leveraged downstream:

* By Data Analysts for reporting/BI
* By Data Scientists to build ML models for downstream applications

Unit testing strategy with DLT



Delta Live Tables logic can be unit tested by leveraging Expectations.

At a high level, the DLT pipelines can be constructed as follows:

* The ingestion step (the first step of the pipeline, on the left) is written in a separate notebook. This corresponds to the **green** (prod) and **blue** (test) input sources on the left.
* The Production pipeline is defined with the PROD ingestion notebook [./ingestion_profile/DLT-ingest_prod]($./ingestion_profile/DLT-ingest_prod) and connects to the live data sources (e.g. Kafka server, staging blob storage).
* The Test pipeline (only used to run the unit tests) is defined with the TEST ingestion notebook [./ingestion_profile/DLT-ingest_test]($./ingestion_profile/DLT-ingest_test) and consumes local files used for our unit tests (e.g. ad hoc CSV files). See the sketch after this list.
* A common DLT pipeline logic is used for both the prod and the test pipelines (the **yellow** section in the graph).
* An additional notebook containing all the unit tests is used in the TEST pipeline (the **blue `TEST_xxx` tables** in the image, on the right side).
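
To make this concrete, here is a minimal, hypothetical sketch of how the two ingestion notebooks can expose the same `raw_user_data` view, once backed by the live source and once backed by the unit-test files. Paths, formats and options below are placeholders, not the demo's actual values:

```
import dlt

# PROD ingestion notebook (sketch): expose the live source as the `raw_user_data` view.
@dlt.view(name="raw_user_data", comment="Raw user data, production source")
def raw_user_data_prod():
    return (
        spark.readStream.format("cloudFiles")            # Auto Loader on the landing zone
        .option("cloudFiles.format", "json")
        .load("/mnt/prod/landing/users")                 # placeholder prod location
    )

# TEST ingestion notebook (sketch): expose the same view name, but backed by the
# small hand-crafted test files instead of the production source.
@dlt.view(name="raw_user_data", comment="Raw user data, unit-test source")
def raw_user_data_test():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/tmp/dlt_unit_test/users/")               # placeholder test location
    )
```

Because downstream tables only reference the `raw_user_data` view, the exact same pipeline logic runs unchanged against either source.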


Accessing the DLT pipeline



Your pipeline has been created! You can directly access the Delta Live Tables pipeline for the unit-test demo.


Main Pipeline definition





This notebook contains the main pipeline definition, the one we want to test (in yellow in the diagram).

For this example, we centralized our main expectations in a metadata table that we'll use in the table definition.

These expectations are your usual expectations, used to ensure and track data quality during the ingestion process.

We can then build a DBSQL dashboard on top of them and trigger alerts when we see errors in our data (e.g. an incompatible schema, or a growing number of expectation violations).
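
The exact layout of that metadata table is up to you. Below is a minimal sketch assuming a hypothetical `quality_rules` table with `tag`, `name` and `constraint` columns, plus a small helper turning the rules for a given table into the `{name: constraint}` dictionary that the DLT expectation decorators accept:

```
# Hypothetical metadata table `quality_rules`, one row per rule, e.g.:
#   tag              | name      | constraint
#   user_silver_dlt  | valid_id  | id IS NOT NULL AND id > 0

def get_rules(tag):
    """Return the expectations defined for `tag` as a {name: constraint} dict."""
    df = spark.table("quality_rules").filter(f"tag = '{tag}'")
    return {row["name"]: row["constraint"] for row in df.collect()}
```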

1/ Loading our data



This is the first step of the pipeline. Note that we consume the data from the `raw_user_data` view.

This view is defined in the ingestion notebooks:
* For PROD: [./ingestion_profile/DLT-ingest_prod]($./ingestion_profile/DLT-ingest_prod), reading from the prod system (e.g. Kafka)
* For TEST: [./ingestion_profile/DLT-ingest_test]($./ingestion_profile/DLT-ingest_test), consuming the test dataset (CSV files)

Start by reviewing the notebooks to see how the data is ingested.


*Note: DLT is available in SQL or Python; this example uses Python.*
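
As a reference, a bronze table built on that view can be as simple as the sketch below (the table name is illustrative):

```
import dlt

# Minimal sketch: the bronze table only references the `raw_user_data` view,
# so the same code works against both the prod and the test ingestion notebooks.
@dlt.table(comment="Raw user data, ingested incrementally")
def user_bronze_dlt():
    return dlt.read_stream("raw_user_data")
```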

2/ Customer Silver layer


The silver layer consumes **incremental** data from the bronze layer and cleans up some of the information.

We're also adding an expectation on the ID. As the ID will be used in the next join operation, it should never be null and should be positive.

Note that the expectations have been defined in the metadata expectation table under `user_silver_dlt`.
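
For illustration, applying the rules stored under `user_silver_dlt` could look like the sketch below, reusing the hypothetical `get_rules` helper sketched earlier. Whether offending rows are tracked, dropped or fail the update depends on the decorator you pick (`expect_all`, `expect_all_or_drop`, `expect_all_or_fail`):

```
import dlt
from pyspark.sql import functions as F

# Sketch only: the cleanup logic and table name are illustrative.
@dlt.table(comment="Cleaned user data")
@dlt.expect_all_or_drop(get_rules("user_silver_dlt"))    # e.g. {"valid_id": "id IS NOT NULL AND id > 0"}
def user_silver_dlt():
    return (
        dlt.read_stream("user_bronze_dlt")
        .withColumn("id", F.col("id").cast("int"))       # a non-numeric STRING id becomes NULL and gets dropped
    )
```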

3/ Ingest spend information



This is the same logic as for the customer data: we consume from the view defined in the TEST/PROD ingestion notebooks.

We're also adding an expectation on the ID column: as we'll join the 2 tables on this field, we want to track its data quality.

4/ Joining the 2 tables to create the gold layer


We can now join the 2 tables on customer ID to create our final gold table.

As our ML model will be using `age`, `annual_income` and `spending_score`, we're adding expectations to only keep valid entries.
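
A minimal sketch of such a gold table is shown below; the table names are assumptions, and the expectations mirror the three columns mentioned above:

```
import dlt

# Sketch only: `user_silver_dlt`, `spend_silver_dlt` and `user_gold_dlt` are illustrative names.
@dlt.table(comment="Final user table with spend information, ready for ML")
@dlt.expect_all_or_drop({
    "valid_age": "age IS NOT NULL",
    "valid_income": "annual_income IS NOT NULL",
    "valid_score": "spending_score IS NOT NULL",
})
def user_gold_dlt():
    return dlt.read("user_silver_dlt").join(dlt.read("spend_silver_dlt"), ["id"], "inner")
```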

Our pipeline is now ready to be tested!



Our pipeline is now entirely defined.

Here are a couple of examples of what we might want to test:

* Are we safely handling wrong data types as input (e.g. a customer ID sent as an incompatible STRING)?
* Are we resilient to NULL values in our primary keys?
* Are we enforcing uniqueness on our primary keys?
* Are we properly applying business logic (e.g. proper aggregation, anonymization of PII fields, etc.)?

Creating the test dataset



The next step is to create a test dataset.

Creating the test dataset is a critical step. As with any unit test, we need to cover all the relevant data variations to ensure our logic is properly implemented.

As an example, let's make sure we ingest data containing NULL ids or ids sent as strings.

Open the [./test/DLT-Test-Dataset-setup]($./test/DLT-Test-Dataset-setup) notebook to see how this is done.
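
As a rough illustration (the real dataset is built in the notebook above; the path and columns here are hypothetical), such a test file could include a valid row, a NULL id, an id sent as a string and a duplicated row:

```
# Hypothetical test CSV: one valid row, one NULL id, one STRING id, one duplicate.
test_users_csv = """id,age,annual_income,spending_score
1,25,45000,81
,35,52000,40
not_an_id,29,61000,73
1,25,45000,81
"""

# Write it where the TEST ingestion notebook expects to find it (placeholder path).
dbutils.fs.put("/tmp/dlt_unit_test/users/users.csv", test_users_csv, True)
```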

Defining the Unit Tests



We now have the data ready.

The final step is creating the actual test.

Open the [./test/DLT-Tests]($./test/DLT-Tests) notebook to see how this is done!
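
To give an idea of what such a test looks like, the sketch below defines a `TEST_xxx` table whose `expect_or_fail` expectation makes the whole pipeline update fail if the silver layer lets a bad id through. The names and the exact check are illustrative; see the notebook for the real tests:

```
import dlt
from pyspark.sql import functions as F

# A unit test is just another DLT table: its expectation fails the update
# if the logic under test misbehaves.
@dlt.table(comment="TEST: the silver layer must not keep NULL or non-positive ids")
@dlt.expect_or_fail("no_bad_id_left", "bad_id_count = 0")
def TEST_user_silver_dlt_id():
    return (
        dlt.read("user_silver_dlt")
        .filter("id IS NULL OR id <= 0")                 # rows that should have been dropped
        .agg(F.count("*").alias("bad_id_count"))
    )
```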

That's it! Our pipeline is fully ready & tested.



We can then proceed as usual: build dashboards to track production metrics (e.g. data quality & quantity), but also BI reporting & Data Science for the final business use-cases, leveraging the Lakehouse.

Here is a full example of the test pipeline definition.

Note that we have 3 notebooks in the DLT pipeline:

* **DLT-ingest_test**: ingesting our test datasets
* **DLT-pipeline-to-test**: the actual pipeline we want to test
* **test/DLT-Tests**: the test definition

Remember that you'll have to schedule a FULL REFRESH every time you run the pipeline to get accurate test results (we want to consume the whole test dataset from scratch).

This test pipeline can be scheduled to run within a Workflow, or as part of a CI/CD step (e.g. triggered after a git commit); a sketch of such a trigger is shown after the pipeline configuration below.

```
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "advanced",
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/xxxx/Delta-Live-Table-Unit-Test/ingestion_profile/DLT-ingest_test"
      }
    },
    {
      "notebook": {
        "path": "/Repos/xxxx/Delta-Live-Table-Unit-Test/DLT-pipeline-to-test"
      }
    },
    {
      "notebook": {
        "path": "/Repos/xxxx/Delta-Live-Table-Unit-Test/test/DLT-Tests"
      }
    }
  ],
  "name": "dbdemos_dlt_unit_test_{{CATALOG}}_{{SCHEMA}}",
  "catalog": "{{CATALOG}}",
  "target": "{{SCHEMA}}"
}
```
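
For reference, here is a hedged sketch of what the CI/CD trigger could look like, calling the Delta Live Tables REST API to start a full-refresh update of the test pipeline. The host, token and pipeline id are placeholders to be taken from your environment or secret store:

```
import requests

host = "https://<workspace-url>"           # placeholder
token = "<databricks-token>"               # placeholder, e.g. injected by the CI secret store
pipeline_id = "<test-pipeline-id>"         # placeholder

# Start a full-refresh update so the whole test dataset is re-processed from scratch.
resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print("Triggered update:", resp.json()["update_id"])
```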

Going further with DLT

Checking your data quality metrics with Delta Live Tables


Delta Live Tables tracks all your data quality metrics. You can leverage the expectations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required. This lets you build the following dashboards:

Building our first business dashboard with Databricks SQL



Once the data is ingested, we switch to Databricks SQL to build a new dashboard based on all the data we ingested.

Here is an example: