Linear Regression using Apache Spark ML vs Sci-Kit Learn

7 min read · Nov 12, 2023

Image Source: https://www.analyticsvidhya.com/blog/2022/05/an-end-to-end-guide-on-ml-pipeline-using-apache-spark-in-python/

In my last article, I did my best to predict when the two-hour marathon barrier would be broken using machine learning. I am not an ML expert by any means, but I wanted to dig as deep as I could into linear, log, and polynomial regression. Although I was able to build a model with the popular scikit-learn library, my prediction didn’t seem to support my assumptions, so I wanted to take a step back and think it over. After some research into other ML libraries and platforms, I considered Apache Spark ML as an alternative framework.

What is Apache Spark ML?

Apache Spark ML is a package introduced in Apache Spark 1.2 that helps users create finely tuned ML pipelines. Apache Spark itself is a robust engine for big-data processing. Big data platforms like Databricks use Apache Spark under the hood for their core data processing and pipelines, making it a valuable tool in machine learning applications.

My initial linear regression model used scikit-learn, a very popular Python-based ML library:
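The original listing is not reproduced here, so what follows is only a minimal sketch of a scikit-learn linear regression, not the model from the earlier article. The data is synthetic placeholder data generated from an exact straight line (it is not real marathon records), and the variable names (`years`, `times_min`, `year_at_target`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data (NOT real marathon records):
# finishing time in minutes falls by 0.1 per year starting from 125 in 2000.
years = np.array([2000, 2005, 2010, 2015, 2020], dtype=float).reshape(-1, 1)
times_min = 125.0 - 0.1 * (years.ravel() - 2000.0)

# Fit an ordinary least-squares line: time = slope * year + intercept.
model = LinearRegression()
model.fit(years, times_min)

slope = model.coef_[0]
intercept = model.intercept_

# Solve the fitted line for the year at which it reaches 120 minutes (two hours).
target_minutes = 120.0
year_at_target = (target_minutes - intercept) / slope

print(f"slope={slope:.4f} min/year, intercept={intercept:.2f}")
print(f"fitted line reaches {target_minutes:.0f} minutes around {year_at_target:.1f}")
```

Because the placeholder data lies exactly on a line, the fit recovers that line. Real record progressions are noisier and not necessarily linear, which is why the choice of regression form matters for this kind of prediction.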

Written by Matthew Salminen