What is Databricks Auto Loader?

[Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) lets you scan a cloud storage folder (S3, ADLS, GCS) and only ingest the new data that arrived since the previous run.
This is called **incremental ingestion**.
Auto Loader can be used in a near real-time stream or in a batch fashion, e.g., running every night to ingest daily data.
Auto Loader provides a strong guarantee when used with a Delta sink (the data will only be ingested once).
How Auto Loader simplifies data ingestion
Ingesting data from cloud storage at scale can be really hard. Auto Loader makes it easy, offering these benefits:
* **Incremental** & **cost-efficient** ingestion (removes unnecessary listing or state handling)
* **Simple** and **resilient** operation: no tuning or manual code required
* Scalable to **billions of files**
  * Using incremental listing (recommended, relies on filename order)
  * Leveraging notification + message queue (when incremental listing can't be used)
* **Schema inference** and **schema evolution** are handled out of the box for most formats (csv, json, avro, images...)

Auto Loader basics
Let's create a new Auto Loader stream that will incrementally ingest new incoming files.
In this example we will specify the full schema. We will also use `cloudFiles.maxFilesPerTrigger` to take one file at a time, simulating a process that adds files one by one.
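Below is a minimal sketch of such a stream; the input path (`/demo/users_json`), checkpoint location, and target table name are placeholders for this example:

```python
# Minimal sketch: incrementally ingest JSON files with an explicit schema,
# one file per micro-batch. Paths and table names are placeholders.
user_schema = "id int, email string, firstname string, lastname string, creation_date string"

bronze_df = (spark.readStream
               .format("cloudFiles")
               .option("cloudFiles.format", "json")
               .option("cloudFiles.maxFilesPerTrigger", "1")  # take one file at a time
               .schema(user_schema)
               .load("/demo/users_json"))

(bronze_df.writeStream
   .option("checkpointLocation", "/demo/checkpoints/user_bronze")
   .toTable("user_bronze"))
```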
Schema inference
Specifying the schema manually can be a challenge, especially with dynamic JSON. Notice that we are missing the "age" data because we overlooked specifying this column in the schema.
* Schema inference has always been expensive and slow at scale, but not with Auto Loader. Auto Loader efficiently samples data to infer the schema and stores it under `cloudFiles.schemaLocation` in your bucket.
* Additionally, `cloudFiles.inferColumnTypes` will determine the proper data type from your JSON.
Let's redefine our stream with these features. Notice that we now have all of the JSON fields.
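A sketch of the redefined stream (same placeholder paths as above):

```python
# Sketch: infer the schema automatically and persist it under cloudFiles.schemaLocation,
# with proper column types instead of strings.
bronze_df = (spark.readStream
               .format("cloudFiles")
               .option("cloudFiles.format", "json")
               .option("cloudFiles.schemaLocation", "/demo/schemas/user_bronze")  # placeholder path
               .option("cloudFiles.inferColumnTypes", "true")
               .load("/demo/users_json"))
```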
*Notes:*
* *With Delta Live Tables you don't even have to set this option, the engine manages the schema location for you.*
* *Sampling size can be changed with `spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes`*
Schema hints
You might need to enforce a part of your schema, e.g., to convert a timestamp. This can easily be done with Schema Hints.
In this case, we'll make sure that the `id` is read as `bigint` and not `int`:
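A sketch using `cloudFiles.schemaHints` on top of the inferred schema (placeholder paths):

```python
# Sketch: keep schema inference, but force the id column to bigint.
bronze_df = (spark.readStream
               .format("cloudFiles")
               .option("cloudFiles.format", "json")
               .option("cloudFiles.schemaLocation", "/demo/schemas/user_bronze")
               .option("cloudFiles.inferColumnTypes", "true")
               .option("cloudFiles.schemaHints", "id bigint")  # enforce bigint instead of inferred int
               .load("/demo/users_json"))
```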
Schema evolution
Incorrect schema
Auto Loader automatically recovers from an incorrect schema and conflicting types: values that can't be parsed with the expected type are saved in the `_rescued_data` column.
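A quick way to inspect the rescued values (a sketch, assuming the `bronze_df` stream defined above):

```python
# Sketch: rows whose values don't match the expected type keep the raw value
# in _rescued_data; inspect them to spot schema issues.
display(bronze_df.filter("_rescued_data IS NOT NULL"))
```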
Adding a new column
By default, the stream will trigger an `UnknownFieldException` when a new column appears. You then have to restart the stream to include the new column.
Make sure your previous stream is still running and run the next cell.
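That cell isn't shown here; a hypothetical sketch of what it might do is to drop a JSON file containing an extra column into the ingestion folder (file path and column name are placeholders):

```python
# Hypothetical sketch: add a file with a new column to trigger schema evolution.
dbutils.fs.put(
    "/demo/users_json/user_new_column.json",
    '{"id": 1000, "email": "new.user@example.com", "new_column": "some value"}',
    True)  # overwrite if it already exists
# The running stream will fail with an UnknownFieldException; once restarted, it
# picks up the new column from the updated schema in cloudFiles.schemaLocation.
```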
*Notes*:
* *See `cloudFiles.schemaEvolutionMode` for different behaviors and more details.*
* *Don't forget to add `.writeStream.option("mergeSchema", "true")` to dynamically add new columns when writing to a Delta table, as shown in the sketch below.*
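A sketch of the corresponding write (placeholder checkpoint location and table name):

```python
# Sketch: write the stream to a Delta table and let the table schema evolve
# as new columns appear in the source.
(bronze_df.writeStream
   .option("checkpointLocation", "/demo/checkpoints/user_bronze")
   .option("mergeSchema", "true")
   .toTable("user_bronze"))
```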
Ingesting a high volume of input files
Scanning folders with many files to detect new data is an expensive operation, leading to ingestion challenges and higher cloud storage costs.
To solve this issue and support an efficient listing, Databricks Auto Loader offers two modes:
- Incremental listing with `cloudFiles.useIncrementalListing` (recommended), which relies on the alphabetical order of the file paths to only scan new data (e.g., `ingestion_path/YYYY-MM-DD`)
- Notification system, which sets up a managed cloud notification service sending new file names to a queue (used when we can't rely on file name order). See `cloudFiles.useNotifications` for more details.

Use the incremental listing option whenever possible; Databricks Auto Loader will try to auto-detect and use the incremental approach when it can.
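A sketch showing where these options go (values are illustrative, paths are placeholders):

```python
# Sketch: choose the listing mode through cloudFiles options.
bronze_df = (spark.readStream
               .format("cloudFiles")
               .option("cloudFiles.format", "json")
               # Incremental listing: only list lexicographically newer paths ("auto" lets
               # Auto Loader decide whether the file layout supports it).
               .option("cloudFiles.useIncrementalListing", "auto")
               # Alternative when file names are not ordered: managed notifications.
               # .option("cloudFiles.useNotifications", "true")
               .option("cloudFiles.schemaLocation", "/demo/schemas/user_bronze")
               .load("/demo/users_json"))
```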
Support for images
Databricks Auto Loader provides native support for images and binary files.

Just set the format accordingly and the engine will do the rest: `.option("cloudFiles.format", "binaryFile")`
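A minimal sketch, assuming a `/demo/images` input folder and an `images_bronze` target table (both placeholders):

```python
# Sketch: ingest raw image/binary files into a Delta table.
images_df = (spark.readStream
               .format("cloudFiles")
               .option("cloudFiles.format", "binaryFile")
               .load("/demo/images"))

(images_df.writeStream
   .option("checkpointLocation", "/demo/checkpoints/images_bronze")
   .toTable("images_bronze"))
```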
Use-cases:
- ETL images into a Delta table using Auto Loader
- Automatically ingest continuously arriving new images
- Easily retrain ML models on new images
- Perform distributed inference using a pandas UDF directly from Delta
Deploying robust ingestion jobs in production
Let's see how to use Auto Loader to ingest JSON files, support schema evolution, and automatically restart when a new column is found.
If you need your job to be resilient with regard to an evolving schema, you have multiple options:
* Let the full job fail & configure Databricks Workflow to restart it automatically
* Leverage Delta Live Tables to simplify all the setup (DLT handles everything for you out of the box)
* Wrap your stream call so it automatically restarts when the new column appears.
Here is an example:
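The example below is a sketch of that last option, reusing the placeholder paths from earlier; the retry logic and exception check are illustrative, not a library API:

```python
# Sketch: wrap the stream so it restarts when a new column causes a failure.
def start_user_bronze_stream():
    df = (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/demo/schemas/user_bronze")
            .option("cloudFiles.inferColumnTypes", "true")
            .load("/demo/users_json"))
    return (df.writeStream
              .option("checkpointLocation", "/demo/checkpoints/user_bronze")
              .option("mergeSchema", "true")
              .toTable("user_bronze"))

def run_with_schema_evolution_restart(max_restarts=2):
    for attempt in range(max_restarts + 1):
        query = start_user_bronze_stream()
        try:
            query.awaitTermination()
            return
        except Exception as e:
            # On a new column, Auto Loader fails with an UnknownFieldException after
            # updating the schema location; restarting picks up the new schema.
            if "UnknownFieldException" in str(e) and attempt < max_restarts:
                print("New column detected, restarting the stream...")
            else:
                raise

run_with_schema_evolution_restart()
```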
Conclusion
We've seen how Databricks Auto Loader can be used to easily ingest your files, solving the most common ingestion challenges!
You're ready to use it in your projects!