Getting started with Delta Lake





[Delta Lake](https://delta.io/) is an open storage format used to save your data in your Lakehouse. It provides an abstraction layer on top of data files and is the storage foundation of your Lakehouse.

Why Delta Lake?



Running ingestion pipelines directly on cloud storage can be very challenging. Data teams typically face the following challenges:

* Hard to append data *(adding newly arrived data leads to incorrect reads)*
* Modification of existing data is difficult *(GDPR/CCPA requires making fine-grained changes to the existing data lake)*
* Jobs failing midway *(half of the data appears in the data lake, the rest is missing)*
* Real-time operations *(mixing streaming and batch leads to inconsistency)*
* Costly to keep historical versions of the data *(regulated environments require reproducibility, auditing, and governance)*
* Difficult to handle large metadata *(for large data lakes, the metadata itself becomes difficult to manage)*
* “Too many files” problems *(data lakes are not great at handling millions of small files)*
* Hard to get great performance *(partitioning the data for performance is error-prone and difficult to change)*
* Data quality issues *(it’s a constant headache to ensure that all the data is correct and high quality)*

These challenges have a real impact on team efficiency and productivity: engineers spend unnecessary time fixing low-level technical issues instead of focusing on the high-level business implementation.

Because Delta Lake solves these low-level technical challenges of storing petabytes of data in your lakehouse, it lets you focus on implementing simple data pipelines while providing blazing-fast query answers for your BI & Analytics reports.

In addition, because Delta Lake is a fully open-source project under the Linux Foundation, adopted by most of the data players, you know you own your own data and won't have vendor lock-in.

Delta Lake capabilities




You can think of Delta as a file format that your engine can leverage to bring the following capabilities out of the box:





* ACID transactions
* Support for DELETE/UPDATE/MERGE
* Unify batch & streaming
* Time Travel
* Zero-copy clone
* Generated partitions
* CDF - Change Data Feed (Databricks Runtime)
* Blazing-fast queries
* ...

Let's explore these capabilities! *We'll mainly use SQL, but all the operations are available in Python/Scala.*
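For instance, here is a minimal sketch of a table creation followed by an upsert (the `customers` table and `customers_updates` source are hypothetical):

```sql
-- Create a Delta table (names are illustrative).
CREATE TABLE IF NOT EXISTS customers (id BIGINT, email STRING) USING delta;

-- Upsert: update matching rows and insert new ones in a single ACID transaction.
MERGE INTO customers AS target
USING customers_updates AS source
  ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.email = source.email
WHEN NOT MATCHED THEN INSERT (id, email) VALUES (source.id, source.email);
```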


![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) 1/ Getting started with Delta Lake



Start learning about Delta Lake:

* Table creation & migration
* Streaming
* Time Travel (sketched below)
* Upsert (merge)
* Enforce quality with constraints
* Clone & Restore (sketched below)
* Advanced:
  * PK/FK support
  * Share data with Delta Sharing Open Protocol
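As a quick taste of the time travel and restore topics above, here's a minimal sketch against the hypothetical `customers` table (version numbers and timestamp are illustrative):

```sql
-- Time travel: query the table as it was at an earlier version or point in time.
SELECT * FROM customers VERSION AS OF 1;
SELECT * FROM customers TIMESTAMP AS OF '2024-01-01';

-- Roll the whole table back to a previous version if needed.
RESTORE TABLE customers TO VERSION AS OF 1;
```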

Open the first [01-Getting-Started-With-Delta-Lake notebook]($./01-Getting-Started-With-Delta-Lake)

![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) 2/ Speed up your queries with Liquid Clustering





Delta Lake lets you define Liquid Clustering columns (similar to indexes).

Delta then automatically adapts your data layout accordingly and drastically accelerates your reads, providing state-of-the-art performance.

Liquid Clustering makes Hive-style partitioning issues, such as skew and small files, a thing of the past.
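As a minimal sketch (the `events` table and its columns are illustrative), clustering columns can be declared at creation time or changed later:

```sql
-- Declare the clustering columns when creating the table...
CREATE TABLE events (event_date DATE, user_id BIGINT, payload STRING)
CLUSTER BY (event_date, user_id);

-- ...or change them later; OPTIMIZE incrementally reclusters the data.
ALTER TABLE events CLUSTER BY (user_id);
OPTIMIZE events;
```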

Open [the 02-Delta-Lake-Performance notebook]($./02-Delta-Lake-Performance) to explore Liquid Clustering.

![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) 3/ Delta Lake Uniform (Universal Format)





While Delta Lake includes unique features to simplify data management and provide the best performance (see the [CIDR benchmark paper](https://petereliaskraft.net/res/cidr_lakehouse.pdf)), external systems might need to read other formats such as Apache Iceberg or Apache Hudi.

Because your Lakehouse is open, Delta Lake lets you write your Delta tables with the additional metadata required to support these formats.

**This makes Delta Lake the de-facto standard for all your lakehouse tables, leveraging its unique capabilities such as Liquid Clustering to speed up queries, while making sure you won't have any lock-in.**
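As a minimal sketch (the `sales` table is illustrative; property names follow the Delta UniForm documentation), UniForm is enabled through table properties:

```sql
-- Write Iceberg metadata alongside the Delta transaction log,
-- so Iceberg clients can read this table too.
CREATE TABLE sales (id BIGINT, amount DOUBLE)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```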

For more details, open [the 03-Delta-Lake-Uniform notebook]($./03-Delta-Lake-Uniform) to explore Delta Lake Uniform.

![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) 4/ Change Data Capture with Delta Lake CDF



Delta Lake makes it easy to capture changes on a table.

External consumers can stream the row-level modifications, making it easy to capture UPDATE, APPEND, or DELETE operations and apply these changes downstream.

This is key to sharing data across the organization and building a Delta Mesh, including DELETE propagation to support GDPR compliance.

CDC is also available through Delta Sharing, making it easy to share data with external organizations.
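As a minimal sketch (the `customers` table and starting version are illustrative), the Change Data Feed is enabled per table and read with `table_changes`:

```sql
-- Enable the Change Data Feed on an existing table.
ALTER TABLE customers SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read the row-level changes since version 2, including the change type
-- (insert / update_preimage / update_postimage / delete) and commit version.
SELECT * FROM table_changes('customers', 2);
```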


For more details, open [the 04-Delta-Lake-CDF notebook]($./04-Delta-Lake-CDF) to Capture your Delta Lake changes.

![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) 5/ Explore advanced Delta Lake internals



Want to know more about Delta Lake and its underlying metadata?
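For a first peek (using the hypothetical `customers` table), two commands expose most of a table's metadata:

```sql
-- Storage location, file count, size, and enabled table features.
DESCRIBE DETAIL customers;

-- The transaction log: every operation ever applied to the table.
-- The log itself is stored as files under the table's _delta_log/ directory.
DESCRIBE HISTORY customers;
```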

Open [the 05-Advanced-Delta-Lake-Internal notebook]($./05-Advanced-Delta-Lake-Internal) for more details.