Incredibly serious ham sandwich custom config code examples#

-- Cluster: classic cluster, 15.4LTS, Shared access mode
CREATE CATALOG IF NOT EXISTS demo_sample_data;
USE CATALOG demo_sample_data;
CREATE DATABASE IF NOT EXISTS tables;
USE DATABASE tables;

Sharing a variable between python and scala#

Steps:

  • Make a table containing some values

  • Set the config

  • Use the config in a Python query using f-strings

  • Use the config in a Scala query

CREATE OR REPLACE TABLE demo_sample_data.tables.ham_types (
  ham_type STRING,
  is_edible BOOLEAN
);

INSERT INTO demo_sample_data.tables.ham_types (ham_type, is_edible)
VALUES ('sandwich', 1),('West', 0);
num_affected_rows    num_inserted_rows
2                    2
%python
spark.conf.set("ham", "sandwich")
print(spark.conf.get("ham"))
sandwich
%python
display(spark.read.table("demo_sample_data.tables.ham_types") \
.filter(f"ham_type = '{spark.conf.get('ham')}'"))
ham_type    is_edible
sandwich    true
%scala
import org.apache.spark.sql.functions.col

display(spark.read.table("demo_sample_data.tables.ham_types")
  .filter(col("ham_type") =!= spark.conf.get("ham")))
ham_type    is_edible
West        false

Adding custom config to a Delta table#

Steps:

  • Create a table and add in the table properties at the same time

  • Show the table properties

CREATE OR REPLACE TABLE demo_sample_data.tables.ham_types (
  ham_type STRING,
  is_edible BOOLEAN
)
TBLPROPERTIES("ham" = "sandwich");
SHOW TBLPROPERTIES demo_sample_data.tables.ham_types;
key                              value
I have a real job                false
delta.enableDeletionVectors      true
delta.feature.deletionVectors    supported
delta.minReaderVersion           3
delta.minWriterVersion           7
ham                              sandwich

Adding custom config to a Delta commit message#

Steps:

  • Set the commitInfo settings to our ham related info

  • Do something to cause a commit and add to our delta log

  • Access the delta log to view the commit message

SET spark.databricks.delta.commitInfo.userMetadata=add-ham-types;
key                                               value
spark.databricks.delta.commitInfo.userMetadata    add-ham-types
INSERT INTO demo_sample_data.tables.ham_types (ham_type, is_edible)
VALUES ('sandwich', 1),('West', 0);
num_affected_rows    num_inserted_rows
2                    2
%python
spark.sql("DESCRIBE HISTORY demo_sample_data.tables.ham_types")\
    .select("userMetadata")\
    .limit(1) \
    .display()
userMetadata
add-ham-types

No but seriously, why would you do this?#

Spark#

Passing variables between languages: switching between Python, Scala and SQL has its benefits, but passing variables between them can be tricky. Putting the output into a config value means you can share it across the session.
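
For example, here's a minimal sketch of pulling a query result into a config value so a later Scala (or SQL) cell can read it. It reuses the demo table from above; the column and key names are just the ones from this post:

%python
# Pull a single value out of a query result...
edible_ham = (
    spark.read.table("demo_sample_data.tables.ham_types")
    .filter("is_edible = true")
    .first()["ham_type"]
)

# ...and stash it in the session config so any other language in this session can read it
spark.conf.set("ham", edible_ham)
print(spark.conf.get("ham"))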

When I originally posted this, I got questions about distributing values to worker nodes. This is similar to broadcasting small-ish datasets to speed up joins. Distributing variables allegedly speeds up queries by reducing network traffic… but I could only see that being of use if you had very, very large variables.
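
For comparison, this is the kind of broadcasting I mean: a join hint that ships a small DataFrame to every worker so the join can skip a shuffle. It's about data, not config values, and the small lookup DataFrame here is made up purely for illustration.

%python
from pyspark.sql.functions import broadcast

# Made-up lookup data, small enough to copy to every worker
meals = spark.createDataFrame(
    [("sandwich", "lunch"), ("West", "never")], ["ham_type", "meal"]
)

# The broadcast hint sends 'meals' to each worker instead of shuffling both sides
display(
    spark.read.table("demo_sample_data.tables.ham_types")
    .join(broadcast(meals), "ham_type")
)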

Delta#

Adding custom config or labels to a table means the tags are always stored with the table, no matter who's querying it, with what engine, or on what platform. This can give you:

  • Better auditability: Name and shame those bad data sources contributing gibberish to your curated tables

  • Better discoverability: Labelling your Gold / Consumption layers means people can search for verified quality data

  • Automation super powers: for example, as soon as you label a table as containing PII, it kicks off a process that reviews who has access every 30 days and emails the owner to check the access level is still correct (see the sketch below)
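
A rough sketch of what that last automation hook could look like, assuming a made-up pii table property and a notify_owner() helper you'd have to supply yourself:

%python
# Read the table properties back into a dict (SHOW TBLPROPERTIES returns key/value rows)
props = {
    row["key"]: row["value"]
    for row in spark.sql("SHOW TBLPROPERTIES demo_sample_data.tables.ham_types").collect()
}

# 'pii' and 'owner' are hypothetical labels - use whatever tagging convention your org agrees on
if props.get("pii") == "true":
    owner = props.get("owner", "data-team@example.com")
    # notify_owner() is a placeholder for your own email / ticketing integration
    notify_owner(owner, "Time for the 30 day access review on ham_types")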

Credits#

Author: holly@databricks.com

Inspiration: Sean Owen 🍿