Databricks Autoloader and Medallion Architecture… Pt 2.
Aug 13, 2023
In my last post, Ingest Data with Databricks Autoloader, I introduced Autoloader as a way to ingest raw data and incrementally load it into your data pipelines:
# Import the functions you need
from pyspark.sql.functions import col, current_timestamp
# Create and define your variables, which include the file format, file path, table name, checkpoint location, and schema location
file_format = "csv"
file_path = "your_file_path"
table_name = "your_table_name"
checkpoint_location = "your_checkpoint_location"
schema_location = "your_schema_location"
# Confirm the files within your cloud storage file path (optional)
dbutils.fs.ls(file_path)
# Configure Autoloader to ingest CSV data into your Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", file_format)
  .option("cloudFiles.schemaLocation", schema_location)
  .option("cloudFiles.inferColumnTypes", "true")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_location)
  .trigger(availableNow=True)
  .toTable(table_name)
)
This works great when you are using Autoloader to ingest files that all share the same schema. As you transform your data within a medallion architecture, however, you may hit barriers when new data arrives with a different schema or different data types, forcing you to modify your existing Databricks notebooks to handle the new or changed columns.
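If you expect the schema to drift over time, Autoloader's schema evolution settings can cut down on that manual rework. The snippet below is a minimal sketch, reusing the variable names from the example above; with cloudFiles.schemaEvolutionMode set to "addNewColumns", the stream stops when it detects new columns and then picks them up on the next restart, and mergeSchema lets the target Delta table evolve to match.
# Illustrative sketch only; reuses file_format, file_path, table_name,
# checkpoint_location, and schema_location from the example above
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", file_format)
  .option("cloudFiles.schemaLocation", schema_location)
  .option("cloudFiles.inferColumnTypes", "true")
  # Track new columns in the schema location; the stream fails once when
  # new columns appear, then reads them after a restart
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_location)
  # Allow the target Delta table's schema to evolve with the incoming data
  .option("mergeSchema", "true")
  .trigger(availableNow=True)
  .toTable(table_name)
)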
What is Medallion Architecture?