Databricks Autoloader and Medallion Architecture… Pt 2.

Matthew Salminen
4 min read · Aug 13, 2023

In my last post, Ingest Data with Databricks Autoloader, I introduced Autoloader as a way to ingest raw data and incrementally load it into your data pipelines:

# Import the functions you need

from pyspark.sql.functions import col, current_timestamp

# Create and define your variables, which include the file path, table name,
# checkpoint location, and schema location

file_format = "csv"
file_path = "your_file_path"
table_name = "your_table_name"
checkpoint_location = "your_checkpoint_location"
schema_location = "your_schema_location"

# Confirm the files within your cloud storage file path (optional)

dbutils.fs.ls(file_path)

# Configure Autoloader to ingest CSV data into your Delta table

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", file_format)
  .option("cloudFiles.schemaLocation", schema_location)
  .option("cloudFiles.inferColumnTypes", "true")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_location)
  .trigger(availableNow=True)
  .toTable(table_name)
)
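
Once the availableNow trigger has worked through the backlog, it is worth a quick sanity check that the data actually landed. A minimal sketch, assuming the same table_name variable defined above:

# Read the Delta table back to confirm the inferred schema and row count
df = spark.table(table_name)
df.printSchema()
print(df.count())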

This works well when Autoloader keeps ingesting files that share the same schema, but as you transform your data through a medallion architecture you may hit barriers when new data arrives. That new data may carry a different schema or different data types, which forces you to modify your existing Databricks notebooks to handle the new or changed columns.
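Autoloader's schema evolution settings can soften this. Here is a minimal sketch, assuming the same variables as above; the column names in cloudFiles.schemaHints (id, event_time) are hypothetical placeholders. cloudFiles.schemaEvolutionMode set to "addNewColumns" records newly seen columns in the schema location, and mergeSchema on the write side lets the Delta table accept them:

# Hedged sketch: let Autoloader evolve the schema as new columns arrive
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", file_format)
  .option("cloudFiles.schemaLocation", schema_location)
  # Track newly detected columns instead of silently dropping them
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # Pin the types of columns you already know; the rest are inferred
  .option("cloudFiles.schemaHints", "id BIGINT, event_time TIMESTAMP")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_location)
  # Allow the Delta table schema to change as the incoming schema evolves
  .option("mergeSchema", "true")
  .trigger(availableNow=True)
  .toTable(table_name)
)

Keep in mind that with addNewColumns the stream still stops the first time it sees an unexpected column; on restart it picks up the updated schema from the schema location, which is why this mode pairs well with a job configured to retry automatically.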

What is Medallion Architecture?
