
Optimize your Delta Tables & ETLs with Change Data Feed (CDF) in Databricks

Matthew Salminen
7 min read · Oct 8, 2023
Image Source: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html

After explaining what Delta Live Tables are and going in depth on how to record data source changes to those tables with Change Data Capture (CDC), there is yet another useful feature for your Delta Tables: Change Data Feed, or CDF. This feature records changes in your data at the row level while also improving the performance of your ETL pipelines.
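As a quick preview, here is a minimal sketch of what working with CDF looks like: you enable it as a table property, then read the recorded row-level changes. The table name my_table and the starting version are assumptions for illustration:

# Enable Change Data Feed on an existing Delta table
# (the table name "my_table" is hypothetical)
spark.sql(
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# Read the row-level changes recorded since table version 0; each row
# carries the _change_type, _commit_version, and _commit_timestamp columns
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("my_table")
)
changes.show()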

But before I explain Change Data Feed, for those of you reading my articles for the first time, let me provide a brief summary of what Delta Tables, Delta Live Tables, and Change Data Capture are:

What are Delta Tables in Databricks?

Remember that all things Delta in Databricks refer to the storage layer of the Delta Lake, which can handle both real-time and batch big data. A Delta Table is the default data table structure used within data lakes for data ingestion via streaming or batches. A general way of creating a Delta Table in Databricks is provided below. Note that you do not have to import and initialize your Spark session, since Databricks already does this for you, but I am including it for reference:

# Import libraries for spark session
from pyspark.sql import SparkSession
from delta import DeltaTable

# Initialize spark session with the Delta Lake extensions
# (Databricks already does this for you; shown here for reference)
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)
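With the session in place, a minimal way to create a Delta Table is simply to write a DataFrame in Delta format; the table name sales and its columns below are illustrative:

# Create a small DataFrame and save it as a managed Delta Table
# (the table name "sales" and its columns are made up for this example)
data = [(1, "2023-10-01", 100.0), (2, "2023-10-02", 250.5)]
df = spark.createDataFrame(data, ["id", "sale_date", "amount"])

# saveAsTable registers the table in the metastore; on Databricks,
# Delta is already the default table format
df.write.format("delta").mode("overwrite").saveAsTable("sales")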

