Linear Regression using Apache Spark ML vs scikit-learn

Matthew Salminen
7 min read · Nov 12, 2023
Image Source: https://www.analyticsvidhya.com/blog/2022/05/an-end-to-end-guide-on-ml-pipeline-using-apache-spark-in-python/

In my last article, I did my best to predict when the two-hour marathon barrier would be broken using machine learning. I am not an ML expert by any means, but I wanted to dig as deep as I could into linear, log, and polynomial regression. Although I was able to build a model with the popular scikit-learn library, the prediction didn’t seem to support my assumptions, so I wanted to take a step back and rethink the approach. After some research into other ML libraries and platforms, I considered Apache Spark ML as an alternative framework.

What is Apache Spark ML?

Apache Spark ML is a package introduced in Apache Spark 1.2 that helps users create finely tuned ML pipelines. Apache Spark itself is a robust engine for big-data processing. Big-data platforms like Databricks use Apache Spark under the hood for their core data processing and pipelines, making it a valuable tool in machine learning applications.

My initial linear regression model used scikit-learn, a very popular Python-based ML library:

Using Linear Regression to predict marathon finishing times
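
For readers who have not seen that article, here is a minimal sketch of what fitting a linear regression with scikit-learn generally looks like. It is not the model from the earlier article: the data is synthetic, generated from the line y = 2x + 1 purely for illustration, and the variable names are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from y = 2x + 1 (illustrative only;
# these are NOT the marathon times from the earlier article).
X = np.arange(10, dtype=float).reshape(-1, 1)  # shape: (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0

# Fit an ordinary least squares model.
model = LinearRegression()
model.fit(X, y)

# Inspect the fitted line and predict for an unseen x value.
slope = float(model.coef_[0])
intercept = float(model.intercept_)
prediction = float(model.predict(np.array([[12.0]]))[0])

print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction at x=12: {prediction:.3f}")
```

Because the synthetic data lies exactly on a line, the fit should recover that line; real data like race times would be noisy, which is where the choice of model starts to matter.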

Matthew Salminen

Marathoner | Trail Runner | Data Engineer | living in Irvine, CA