Implement CDC: Change Data Capture
Use-case: Synchronize your SQL Database with your Lakehouse
Delta Lake is an open-source storage layer that brings transactional capabilities and improved performance to your data lake.
Delta Lake is designed to support CDC workloads by providing support for UPDATE, DELETE, and MERGE operations.
In addition, Delta tables can enable the Change Data Feed (CDF) to capture internal changes and propagate them downstream.
Note that this is a fairly advanced demo. Before going into this content, we recommend you get familiar with Delta Lake: `dbdemos.install('delta-lake')`.
Simplifying CDC with Delta Live Table
As you'll see, implementing a CDC pipeline from scratch is slightly advanced.
To simplify these operations and implement a full CDC flow with SQL expressions, we strongly advise using Delta Live Tables with `APPLY CHANGES`: `dbdemos.install('delta-live-table')` (including native SCD Type 2 support).
As you'll see, `APPLY CHANGES` handles the MERGE INTO + deduplication complexity for you.
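For reference, here is a minimal sketch of what such a pipeline could look like with the DLT Python API; the source view, key, and sequencing columns below are assumptions for illustration, not part of this demo:

```python
import dlt
from pyspark.sql.functions import col, expr

# Hypothetical raw CDC source (table name and columns are illustrative assumptions)
@dlt.view
def retail_client_cdc_raw():
    return spark.readStream.table("clients_cdc")

# Target table that APPLY CHANGES will keep in sync
dlt.create_streaming_table("retail_client_silver")

# APPLY CHANGES handles the MERGE + deduplication based on the sequencing column
dlt.apply_changes(
    target = "retail_client_silver",
    source = "retail_client_cdc_raw",
    keys = ["id"],
    sequence_by = col("operation_date"),
    apply_as_deletes = expr("operation = 'DELETE'"),
    except_column_list = ["operation", "operation_date"]
)
```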


CDC flow
Here is the flow we'll implement, consuming CDC data from an external database. Note that the incoming data could come in any format, including from a message queue such as Kafka.

Bronze: Incremental data loading using Auto Loader

Working with an external system can be challenging due to schema changes: the external database can update its schema, adding or modifying columns, and our system must be robust against these changes.
Databricks Auto Loader (`cloudFiles`) handles schema inference and evolution out of the box.
For more details on Auto Loader, run `dbdemos.install('auto-loader')`
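As a sketch, the Bronze ingestion could look like the following, assuming the raw CDC files land as CSV; the paths, file format, and checkpoint locations are illustrative assumptions:

```python
# Incrementally ingest the raw CDC files into the bronze table with Auto Loader.
bronze_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")                                   # raw CDC file format (assumption)
    .option("cloudFiles.schemaLocation", "/tmp/cdc_demo/bronze_schema")   # enables schema inference & evolution
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/tmp/cdc_demo/raw_cdc"))

(bronze_stream.writeStream
    .option("checkpointLocation", "/tmp/cdc_demo/bronze_checkpoint")
    .option("mergeSchema", "true")          # let the bronze Delta table evolve with the source schema
    .trigger(availableNow=True)             # process available files incrementally, then stop
    .toTable("clients_cdc"))
```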
Silver: Materialize the table

The Silver `retail_client_silver` table will contain the most up-to-date view. It'll be a replica of the original MySQL table.
Because we'll propagate the `MERGE` operations downstream to the `GOLD` layer, we need to enable Delta Lake CDF: `delta.enableChangeDataFeed = true`
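Here is a minimal sketch of how this could be done with a `foreachBatch` MERGE, assuming the bronze table is `clients_cdc` and the CDC events carry `id`, `name`, `address`, `email`, `operation`, and `operation_date` columns (these column names and paths are assumptions for illustration):

```python
from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Create the silver table with Change Data Feed enabled so downstream layers can consume its changes
spark.sql("""
  CREATE TABLE IF NOT EXISTS retail_client_silver (id BIGINT, name STRING, address STRING, email STRING)
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

def upsert_to_silver(batch_df, batch_id):
    # Keep only the most recent CDC event per id before merging (deduplication)
    w = Window.partitionBy("id").orderBy(col("operation_date").desc())
    latest = (batch_df.withColumn("rn", row_number().over(w))
                      .filter("rn = 1")
                      .drop("rn"))

    (DeltaTable.forName(spark, "retail_client_silver").alias("t")
        .merge(latest.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.operation = 'DELETE'")
        .whenMatchedUpdate(set={"name": "s.name", "address": "s.address", "email": "s.email"})
        .whenNotMatchedInsert(condition="s.operation != 'DELETE'",
                              values={"id": "s.id", "name": "s.name",
                                      "address": "s.address", "email": "s.email"})
        .execute())

# Incrementally apply the bronze CDC events to the silver table
(spark.readStream.table("clients_cdc")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/tmp/cdc_demo/silver_checkpoint")
    .trigger(availableNow=True)
    .start())
```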
Testing the first CDC layer
Let's send a new CDC entry to simulate an update and a DELETE for IDs 1 and 2.
Gold: capture and propagate Silver modifications downstream

We need to add a final Gold layer based on the data from the Silver table. If a row is DELETED or UPDATED in the SILVER layer, we want to apply the same modification in the GOLD layer.
To do so, we need to capture all the tables changes from the SILVER layer and incrementally replicate the changes to the GOLD layer.
This is very simple using Delta Lake CDF from our SILVER table!
Delta Lake CDF provides the `table_changes('<table_name>', <delta_version>)` function that you can use to select all the table's modifications from a specific Delta version to another one:
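For example (the starting version `2` below is an illustrative assumption):

```python
# SQL function: all changes of the silver table from version 2 onwards
display(spark.sql("SELECT * FROM table_changes('retail_client_silver', 2)"))

# Equivalent DataFrame API using the Change Data Feed reader options
display(spark.read.format("delta")
             .option("readChangeFeed", "true")
             .option("startingVersion", 2)
             .table("retail_client_silver"))
```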
Working with Delta Lake CDF
Delta CDF table_changes output
`table_changes` returns 4 CDC types in the `_change_type` column:
| CDC Type | Description |
|----------------------|---------------------------------------------------------------------------|
| **update_preimage** | Content of the row before an update |
| **update_postimage** | Content of the row after the update (what you want to capture downstream) |
| **delete** | Content of a row that has been deleted |
| **insert** | Content of a new row that has been inserted |
Therefore, 1 update will result in 2 rows in the CDC stream (one row with the previous values, one with the new values).
Synchronizing our downstream GOLD table based on the Silver changes
Let's now say that we want to perform another table enhancement and propagate these changes downstream.
To keep this example simple, we'll just add a column named `gold_data` with random data, but in the real world this could be an aggregation, a join with another data source, an ML model, etc.
The same logic as in the Silver layer must be implemented. Since we now consume the CDF data, we also need to perform a deduplication stage. Let's do it using the Python API this time for the example.
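Here is a sketch of that logic, streaming the Silver table's Change Data Feed into the Gold table; the Gold table name, column list, and checkpoint path are assumptions for illustration:

```python
from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, rand

# Gold table with the extra enrichment column (schema is an assumption)
spark.sql("""
  CREATE TABLE IF NOT EXISTS retail_client_gold
  (id BIGINT, name STRING, address STRING, email STRING, gold_data DOUBLE)
""")

def upsert_to_gold(batch_df, batch_id):
    # Drop pre-update images and keep only the latest change per id (deduplication)
    w = Window.partitionBy("id").orderBy(col("_commit_version").desc())
    latest = (batch_df.filter("_change_type != 'update_preimage'")
                      .withColumn("rn", row_number().over(w))
                      .filter("rn = 1").drop("rn"))

    # Hypothetical enrichment: in a real pipeline this could be a join, an aggregation or an ML model
    latest = latest.withColumn("gold_data", rand())

    (DeltaTable.forName(spark, "retail_client_gold").alias("t")
        .merge(latest.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s._change_type = 'delete'")
        .whenMatchedUpdate(set={"name": "s.name", "address": "s.address",
                                "email": "s.email", "gold_data": "s.gold_data"})
        .whenNotMatchedInsert(condition="s._change_type != 'delete'",
                              values={"id": "s.id", "name": "s.name", "address": "s.address",
                                      "email": "s.email", "gold_data": "s.gold_data"})
        .execute())

# Stream the Change Data Feed of the silver table and propagate the changes to gold
(spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("retail_client_silver")
    .writeStream
    .foreachBatch(upsert_to_gold)
    .option("checkpointLocation", "/tmp/cdc_demo/gold_checkpoint")
    .trigger(availableNow=True)
    .start())
```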
*Note: Streaming operations with CDC are supported from DBR 8.1+*
Support for data sharing and Data Mesh organization

As we've seen during this demo, you can track all the changes (INSERT/UPDATE/DELETE) from any Delta table using the Change Data Feed option.
It's then easy to subscribe to the table modifications as an incremental process.
This makes the Data Mesh implementation easy: each mesh can publish a set of tables, and other meshes can subscribe to the original changes.
They are then in charge of propagating the changes (e.g. GDPR DELETE) to their own Data Mesh.
Data is now ready for BI & ML use-cases!

We now have our final table, updated based on the initial CDC information we received.
As a next step, we can leverage the Databricks Lakehouse Platform to start creating SQL queries, dashboards, or ML models.
Next step: [Implement a CDC pipeline for multiple tables]($./02-CDC-CDF-full-multi-tables)