Incredibly serious ham sandwich custom config code examples#
-- Cluster: classic cluster, 15.4LTS, Shared access mode
CREATE CATALOG IF NOT EXISTS demo_sample_data;
USE CATALOG demo_sample_data;
CREATE DATABASE IF NOT EXISTS tables;
USE DATABASE tables;
Adding custom config to a Delta table#
Steps
Create a table and add in the table properties at the same time
Show the table properties
CREATE OR REPLACE TABLE demo_sample_data.tables.ham_types (
ham_type STRING,
is_edible BOOLEAN
)
TBLPROPERTIES("ham" = "sandwich")
SHOW TBLPROPERTIES demo_sample_data.tables.ham_types;
key | value |
---|---|
I have a real job | false |
delta.enableDeletionVectors | true |
delta.feature.deletionVectors | supported |
delta.minReaderVersion | 3 |
delta.minWriterVersion | 7 |
ham | sandwich |
Adding custom config to a Delta commit message#
Steps
Set the commitInfo settings to our ham related info
Do something to cause a commit and add to our delta log
Access the detla log to view the commit message
SET spark.databricks.delta.commitInfo.userMetadata=add-ham-types;
key | value |
---|---|
spark.databricks.delta.commitInfo.userMetadata | add-ham-types |
INSERT INTO demo_sample_data.tables.ham_types (ham_type, is_edible)
VALUES ('sandwich', 1),('West', 0);
num_affected_rows | num_inserted_rows |
---|---|
2 | 2 |
%python
spark.sql("DESCRIBE HISTORY demo_sample_data.tables.ham_types")\
.select("userMetadata")\
.limit(1) \
.display()
userMetadata |
---|
add-ham-types |
No but seriously, why would you do this?#
Spark#
Passing variables between languages Switching between python, scala an d SQL has its benefits, but passing variables between them can be tricky. Putting the output into a config value means you can share it across the session.
When I originally posted this, I got questions about distributing values to worker nodes. This is similar to broadcasting small-ish datasets to speed up joins. Distributing variables allegedly speeds up queries by reducing network traffic …but I could only see that being of use if you had very very large variables.
Delta#
Adding custom config or labels to a table allows the tags to always be stored with the table, no matter who’s querying it, with what engine or what platform. This can give you:
Better auditability: Name and shame those bad data sources contributing gibberish to your curated tables
Better discoverability: Labelling your Gold / Consumption layers means people can search for verified quality data
Automation super powers: For example, as soon as you label a table with PII it kicks off a process that every 30 days reviews who’s got access to the table and emails the owner to check the access level is correct
Credits#
Author: holly@databricks.com
Inspiration: Sean Owen 🍿