00-Lakeflow-Declarative-Pipeline-Introduction

Intro to Lakeflow Declarative Pipelines

Declarative Pipelines simplify batch and streaming ETL with automated reliability and built-in data quality. Let's give it a try!

Optimizing our bike rental business - ETL pipeline

Our fictional company operates bike rental stations across the city. The primary goal of this data pipeline is to transform raw operational data—such as ride logs, maintenance records, and weather information—into a structured and refined format, enabling comprehensive analytics.

This allows us to track key business metrics like total revenue, forecast future earnings, understand revenue contributions from members versus non-members, analyze customer behavior and lifetime value, and crucially, identify and quantify revenue loss due to maintenance issues.

By providing these insights, the pipeline empowers us to optimize operations, improve bike availability, and ultimately maximize profitability.

Our input is a raw dataset containing information from our ride tracking system, data from our maintenance system, weather data, and customer CDC events. Our goal is to ingest this data in near real time and build tables for our analyst team while ensuring data quality.

Getting started with the new pipeline editor

Databricks provides a [rich editor](https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/declarative-pipelines/declarative-pipelines-0.png?raw=true) to help you build and navigate through your different pipeline steps!

1/ Exploring the data

First, open the [notebook in the Exploration folder]($./explorations/01-Exploring-the-Data) to discover our dataset.

We'll consume data from 4 sources, all available to us as raw CSV, JSON, or Parquet files in our schema volume:

- **maintenance_logs** (all the maintenance details, as CSV files)
- **rides** (the ride information, including comments from users of the mobile application)
- **weather** (current and forecast, as JSON files)
- **customers** (customer CDC data for Auto CDC processing, as Parquet files)
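
The customers source carries change data capture (CDC) events rather than plain snapshots. As a rough illustration of what keeping a history of such changes can look like, here is a minimal plain-Python sketch that folds a few CDC events into versioned rows (the idea behind SCD Type 2). It is not the Declarative Pipelines API, and the field names (`customer_id`, `op`, `email`, `seq`, `valid_from`, `valid_to`) and the sample events are hypothetical; the real dataset may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerVersion:
    customer_id: str
    email: str
    valid_from: int                  # sequence number at which this version became current
    valid_to: Optional[int] = None   # None means the version is still current


def apply_cdc_scd2(events):
    """Fold CDC events into a list of versions, one row per historical state."""
    history = []
    current = {}  # customer_id -> currently open CustomerVersion
    # Process events in sequence order so late-arriving records do not corrupt history.
    for ev in sorted(events, key=lambda e: e["seq"]):
        open_version = current.pop(ev["customer_id"], None)
        if open_version is not None:
            open_version.valid_to = ev["seq"]  # close the previous version
        if ev["op"] != "DELETE":
            version = CustomerVersion(ev["customer_id"], ev["email"], ev["seq"])
            history.append(version)
            current[ev["customer_id"]] = version
    return history


# Hypothetical sample events, deliberately out of order.
events = [
    {"customer_id": "c1", "op": "INSERT", "email": "ana@example.com", "seq": 1},
    {"customer_id": "c1", "op": "UPDATE", "email": "ana@bikes.example", "seq": 3},
    {"customer_id": "c2", "op": "INSERT", "email": "bo@example.com", "seq": 2},
    {"customer_id": "c2", "op": "DELETE", "email": None, "seq": 4},
]

history = apply_cdc_scd2(events)
for version in history:
    print(version)
```

Every update closes the previous version instead of overwriting it, so an analyst can still ask what a customer record looked like at any earlier point in the sequence.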

2/ Get started with Streaming Tables and Materialized Views

Creating your pipeline is super simple! If you're new to Declarative Pipelines, it's best to start with the [UI introduction from the documentation](https://docs.databricks.com/aws/en/dlt/dlt-multi-file-editor)!

**Your Lakeflow Declarative Pipeline has been installed and started for you!** Open the Bike Rental Declarative Pipeline to see it in action.

*(Note: The pipeline starts automatically once the initialization job completes, which might take a few minutes. Check the installation logs for more details.)*

3/ Ingesting and transforming your data

Now that we've reviewed the data available to us, it's time to start creating our pipeline! We'll do it one step at a time.

Open the [00-pipeline-tutorial notebook]($./transformations/00-pipeline-tutorial) if you want to start with the basics behind Streaming Tables and Materialized Views.

**Bronze: raw data ingested into Delta tables.** Our bronze layer contains our raw data, loaded into tables with minimal schema changes using Auto Loader. Tables in our bronze layer:

- maintenance_logs_raw
- rides_raw
- weather_raw
- customers_cdc_raw

Open transformations/01-bronze.sql

**Silver: cleaned and enriched with data quality rules.** We filter out invalid rides and maintenance logs, enrich the data with ride revenue, categorize maintenance issues, and process customer CDC events using Auto CDC for SCD Type 2 (historical tracking). Tables in our silver layer:

- maintenance_logs
- rides
- weather
- customers (SCD Type 2)

Open transformations/01-silver.sql

**Gold: curated for analytics & AI.** We aggregate data for reporting by pre-calculating how much revenue each station makes as an origin and as a destination, and by calculating how much revenue loss each maintenance event costs. Tables in our gold layer:

- maintenance_events
- stations
- bikes

Open transformations/01-gold.sql
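
To make the three layers concrete before opening the SQL files, here is a minimal plain-Python sketch of the same flow on a handful of in-memory rides: raw records (bronze), a validity filter plus a revenue column (silver), and revenue per station as origin and destination (gold). In the actual pipeline these steps are declared in SQL, with data quality rules expressed as expectations. Everything specific here is an assumption made for illustration: the field names, the validity rule, and the per-minute pricing are hypothetical and not taken from the demo's code.

```python
from collections import defaultdict

# Bronze: raw records kept as ingested (hypothetical fields and values).
rides_raw = [
    {"ride_id": "r1", "start_station": "A", "end_station": "B", "minutes": 30, "member": True},
    {"ride_id": "r2", "start_station": "A", "end_station": "C", "minutes": -5, "member": False},  # invalid duration
    {"ride_id": "r3", "start_station": "B", "end_station": "A", "minutes": 20, "member": False},
    {"ride_id": None, "start_station": "C", "end_station": "A", "minutes": 10, "member": True},   # missing id
]

# Hypothetical pricing in cents per minute (integers avoid float rounding).
RATE_CENTS_PER_MINUTE = {True: 10, False: 25}


def is_valid(ride):
    """Data quality rule (assumed): a ride needs an id and a positive duration."""
    return ride["ride_id"] is not None and ride["minutes"] > 0


def to_silver(raw):
    """Silver: drop invalid rides and enrich the rest with revenue."""
    silver = []
    for ride in raw:
        if not is_valid(ride):
            continue
        revenue = ride["minutes"] * RATE_CENTS_PER_MINUTE[ride["member"]]
        silver.append({**ride, "revenue_cents": revenue})
    return silver


def to_gold(silver):
    """Gold: total revenue per station, as origin and as destination."""
    stations = defaultdict(lambda: {"origin_revenue_cents": 0, "destination_revenue_cents": 0})
    for ride in silver:
        stations[ride["start_station"]]["origin_revenue_cents"] += ride["revenue_cents"]
        stations[ride["end_station"]]["destination_revenue_cents"] += ride["revenue_cents"]
    return dict(stations)


rides = to_silver(rides_raw)
stations = to_gold(rides)
for name, totals in sorted(stations.items()):
    print(name, totals)
```

Because bronze keeps the invalid records untouched, a quality rule can be changed later and the silver and gold layers recomputed without ingesting the source files again.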

4/ Visualizing the data with Databricks AI/BI

Business Dashboard

- Bike Rental Data Monitoring Dashboard
- Bike Rental Pipeline Operational Dashboard