Incredibly serious ham sandwich custom config code examples#

-- Cluster: classic cluster, 15.4LTS, Shared access mode
CREATE CATALOG IF NOT EXISTS demo_sample_data;
USE CATALOG demo_sample_data;
CREATE DATABASE IF NOT EXISTS tables;
USE DATABASE tables;

Sharing a variable between python and scala#

Steps:

  • Make a table containing some values

  • Set the config

  • Use the config in a Python query using f-strings

  • Use the config in a Scala query

CREATE OR REPLACE TABLE demo_sample_data.tables.ham_types (
  ham_type STRING,
  is_edible BOOLEAN
);

INSERT INTO demo_sample_data.tables.ham_types (ham_type, is_edible)
VALUES ('sandwich', 1),('West', 0);
num_affected_rows    num_inserted_rows
2                    2
%python
spark.conf.set("ham", "sandwich")
print(spark.conf.get("ham"))
sandwich
%python
display(spark.read.table("demo_sample_data.tables.ham_types") \
.filter(f"ham_type = '{spark.conf.get('ham')}'"))
ham_type    is_edible
sandwich    true
%scala
import org.apache.spark.sql.functions.col

display(spark.read.table("demo_sample_data.tables.ham_types")
  .filter(col("ham_type") =!= spark.conf.get("ham")))
ham_type    is_edible
West        false

Adding custom config to a Delta table#

Steps:

  • Create a table and add in the table properties at the same time

  • Show the table properties

CREATE OR REPLACE TABLE demo_sample_data.tables.ham_types (
  ham_type STRING,
  is_edible BOOLEAN
)
TBLPROPERTIES("ham" = "sandwich");
SHOW TBLPROPERTIES demo_sample_data.tables.ham_types;
key                              value
I have a real job                false
delta.enableDeletionVectors      true
delta.feature.deletionVectors    supported
delta.minReaderVersion           3
delta.minWriterVersion           7
ham                              sandwich

Adding custom config to a Delta commit message#

Steps:

  • Set the commitInfo settings to our ham related info

  • Do something to cause a commit and add to our delta log

  • Access the delta log to view the commit message

SET spark.databricks.delta.commitInfo.userMetadata=add-ham-types;
key                                               value
spark.databricks.delta.commitInfo.userMetadata    add-ham-types
INSERT INTO demo_sample_data.tables.ham_types (ham_type, is_edible)
VALUES ('sandwich', 1),('West', 0);
num_affected_rows    num_inserted_rows
2                    2
%python
spark.sql("DESCRIBE HISTORY demo_sample_data.tables.ham_types")\
    .select("userMetadata")\
    .limit(1) \
    .display()
userMetadata
add-ham-types

No but seriously, why would you do this?#

Spark#

Passing variables between languages: switching between Python, Scala and SQL has its benefits, but passing variables between them can be tricky. Putting the output into a config value means you can share it across the session.
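
For example, here's a minimal sketch of pulling a query result into a config value so a later Scala (or SQL) cell can read it. It reuses the demo table from above; the column and key names are just the ones from this post:

%python
# Pull a single value out of a query result...
edible_ham = (
    spark.read.table("demo_sample_data.tables.ham_types")
    .filter("is_edible = true")
    .first()["ham_type"]
)

# ...and stash it in the session config so any other language in this session can read it
spark.conf.set("ham", edible_ham)
print(spark.conf.get("ham"))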

When I originally posted this, I got questions about distributing values to worker nodes. This is similar to broadcasting small-ish datasets to speed up joins. Distributing variables allegedly speeds up queries by reducing network traffic… but I could only see that being of use if you had very, very large variables.
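
For comparison, this is the kind of broadcasting I mean: a join hint that ships a small DataFrame to every worker so the join can skip a shuffle. It's about data, not config values, and the small lookup DataFrame here is made up purely for illustration.

%python
from pyspark.sql.functions import broadcast

# Made-up lookup data, small enough to copy to every worker
meals = spark.createDataFrame(
    [("sandwich", "lunch"), ("West", "never")], ["ham_type", "meal"]
)

# The broadcast hint sends 'meals' to each worker instead of shuffling both sides
display(
    spark.read.table("demo_sample_data.tables.ham_types")
    .join(broadcast(meals), "ham_type")
)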

Delta#

Adding custom config or labels to a table means the tags are always stored with the table, no matter who's querying it, with what engine, or on what platform. This can give you:

  • Better auditability: Name and shame those bad data sources contributing gibberish to your curated tables

  • Better discoverability: Labelling your Gold / Consumption layers means people can search for verified quality data

  • Automation super powers: for example, as soon as you label a table as containing PII, it kicks off a process that reviews who has access every 30 days and emails the owner to check the access level is still correct (see the sketch below)
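
A rough sketch of what that last automation hook could look like, assuming a made-up pii table property and a notify_owner() helper you'd have to supply yourself:

%python
# Read the table properties back into a dict (SHOW TBLPROPERTIES returns key/value rows)
props = {
    row["key"]: row["value"]
    for row in spark.sql("SHOW TBLPROPERTIES demo_sample_data.tables.ham_types").collect()
}

# 'pii' and 'owner' are hypothetical labels - use whatever tagging convention your org agrees on
if props.get("pii") == "true":
    owner = props.get("owner", "data-team@example.com")
    # notify_owner() is a placeholder for your own email / ticketing integration
    notify_owner(owner, "Time for the 30 day access review on ham_types")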

Credits#

Author: holly@databricks.com

Inspiration: Sean Owen 🍿