Optimizing Apache Spark File Compression with LZ4 or Snappy

Matthew Salminen
Dec 17, 2023

One challenge you may face when working with Apache Spark is writing data to a final destination such as S3 or another cloud service, only to find that the latency and time to completion are longer than anticipated. Often you are working with large datasets or source tables that require long processing times once all of your table transformations are complete. This is where file compression comes in handy in your data pipelines.

Compression reduces the size of the files you are working with so that less data needs to be stored or processed. When you are writing to an AWS S3 bucket, for example, compression gives you smaller storage footprints and, more importantly, faster data transfer. Compression removes redundancy without reducing the quality of your data.
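As a concrete illustration, here is a minimal PySpark sketch of writing a DataFrame to S3 as compressed Parquet. The bucket and paths are hypothetical, and LZ4 availability for Parquet output depends on the codec support in your Spark/Hadoop environment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-example").getOrCreate()

# Hypothetical source path for illustration only.
df = spark.read.parquet("s3a://my-bucket/raw/events/")

# Write Parquet with Snappy compression (Spark's default codec for Parquet).
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("s3a://my-bucket/curated/events_snappy/"))

# Or swap in LZ4 by changing the codec name.
(df.write
   .mode("overwrite")
   .option("compression", "lz4")
   .parquet("s3a://my-bucket/curated/events_lz4/"))
```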

I came across a very interesting scatter plot comparing generation time against compression ratio for popular file compression algorithms:

Scatter plot of generation time vs. compression ratio for common compression algorithms (source: https://www.adaltas.com/en/2021/03/22/performance-comparison-of-file-formats/)

What I noticed is that two compression algorithms stood out as popular in Spark: LZ4 and Snappy. Let me go over what these are. But before I do, let me briefly…
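For reference, choosing between these two codecs in Spark usually comes down to a couple of configuration settings. The sketch below (assuming PySpark) shows the relevant options: spark.io.compression.codec for Spark's internal data such as shuffle spills and broadcasts, and spark.sql.parquet.compression.codec for the default Parquet write codec:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("codec-config-example")
         # Compression for internal data (shuffle, broadcasts); LZ4 is the
         # default in recent Spark versions.
         .config("spark.io.compression.codec", "lz4")
         # Default codec for Parquet writes; Snappy is Spark's default.
         .config("spark.sql.parquet.compression.codec", "snappy")
         .getOrCreate())
```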
