This is the first article in a series that will discuss the mechanisms behind Apache Spark and how this data-processing framework disrupted the Big Data ecosystem, while giving you key recommendations to fine-tune your Spark jobs.
Spark does things fast. That has been the framework’s main selling point ever since it was first introduced back in 2010.
Offering a memory-based alternative to MapReduce gave the Big Data ecosystem a major boost, and over the past few years it has been one of the key reasons why companies adopted Big Data systems.
With its vast range of use cases, its ease-of-use, and its record-setting capabilities, Spark rapidly became everyone’s go-to framework when it comes to data processing within a Big Data architecture.
One of Spark’s key components is its SparkSQL module, which lets you write batch Spark jobs as SQL-like queries. To do so, Spark relies behind the scenes on a complex mechanism to run these queries through the execution engine. This mechanism’s centerpiece is Catalyst: Spark’s query optimizer, which does much of the heavy lifting by generating the job’s physical execution plan.
Even though every step of this process was meticulously refined to optimize every aspect of the job, there is still plenty you can do from your end of the chain to make your Spark jobs run even faster. But before getting into that, let’s use this article to take a deeper dive into how Catalyst does things.
First of all, let’s start with the basics
Spark offers multiple ways to interact with its SparkSQL interfaces, with the main APIs being DataSet and DataFrame. These high-level APIs were built upon the object-oriented RDD API, and they kept its main characteristics while adding certain key features like the usage of schemas. (For a detailed comparison, please refer to this article on the Databricks blog).
The choice of which API to use depends mainly on the language you’re using. DataSet is only available in Scala and Java, where it has replaced DataFrame since the release of Spark 2.0 (a DataFrame is now simply a DataSet of Row objects). Each API offers certain perks and advantages. The good news is that Spark uses the same execution engine under the hood to run your computations, so you can switch easily from one API to another without worrying about what’s happening on the execution level.
That means that no matter which API you’re using, when you submit your job it’ll go through a unified optimization process.
In the upcoming articles of the series, we’ll go through each step of that process to understand how Spark manages to run batch jobs with such astonishing speed. But first, let’s see what our jobs look like to Spark.
How Spark sees the world
The operations you can do within your Spark application are divided into two types:
- Transformations: these are the operations that, when applied to an RDD, return a reference to a new RDD created via the transformation. Some of the most used transformations are filter and map. (Here’s a complete list of the available transformations)
- Actions: when applied to an RDD, these operations return a non-RDD value. A good example would be the count action, that returns the number of elements within an RDD to the Spark driver, or collect, an action that sends the contents of an RDD to the driver. (Please refer to this link for a complete list of the actions that can be applied on RDDs)
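To make the distinction concrete, here is a minimal pure-Python sketch. It does not use Spark: MiniRDD is a hypothetical stand-in class invented for this illustration, and it only mimics the shape of the real API, where filter and map hand back a new RDD while count and collect hand back plain values.

```python
from typing import Callable, Iterable, List


class MiniRDD:
    """Hypothetical toy stand-in for an RDD (not part of Spark)."""

    def __init__(self, records: Iterable[int]) -> None:
        self._records: List[int] = list(records)

    # --- Transformations: each one returns a *new* MiniRDD ---
    def filter(self, predicate: Callable[[int], bool]) -> "MiniRDD":
        return MiniRDD(x for x in self._records if predicate(x))

    def map(self, fn: Callable[[int], int]) -> "MiniRDD":
        return MiniRDD(fn(x) for x in self._records)

    # --- Actions: each one returns a non-RDD value ---
    def count(self) -> int:
        return len(self._records)

    def collect(self) -> List[int]:
        return list(self._records)


numbers = MiniRDD([1, 2, 3, 4, 5, 6])
evens = numbers.filter(lambda x: x % 2 == 0)   # transformation -> MiniRDD
squares = evens.map(lambda x: x * x)           # transformation -> MiniRDD
total = squares.count()                        # action -> int
values = squares.collect()                     # action -> list
print(type(squares).__name__, total, values)
```

The point of the sketch is the return types only: chaining transformations always yields another RDD-like object, and it takes an action to get an ordinary value back.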
The DataFrame and DataSet operations are divided into the same categories since these APIs are built upon the RDD mechanism.
The next differentiation to make is between the two types of transformations, which are the following:
- Narrow transformations: When these transformations are applied on an RDD, there is no data movement between partitions. The transformation is applied on the data of each partition of the RDD and results in a new RDD with the same number of partitions, as demonstrated in the below illustration. For example, filter is a narrow transformation, because a filter is applied on the data of each partition and the resulting data represents a partition within the newly created RDD.
- Wide transformations: These transformations necessitate data movement between partitions, or what is known as shuffle. The data is moved across the network and the partitions of the newly-created RDD are based on the data of multiple input partitions, as illustrated below. A good example would be the sortBy operation, where data from all of the input partitions is sorted based on a certain column in a process that generates an RDD with new partitions.
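The difference between the two kinds of transformations can be sketched with plain Python lists standing in for partitions. This is a toy model under stated assumptions, not Spark code: narrow_filter and wide_sort_by are hypothetical helper names, and the even-sized slicing used to rebuild partitions after the sort is a simplification of how Spark actually picks partition boundaries.

```python
from typing import Callable, List

Partitions = List[List[int]]  # one inner list per partition


def narrow_filter(partitions: Partitions,
                  predicate: Callable[[int], bool]) -> Partitions:
    """Narrow: each output partition is built from exactly one input partition."""
    return [[x for x in part if predicate(x)] for part in partitions]


def wide_sort_by(partitions: Partitions,
                 key: Callable[[int], int],
                 num_partitions: int) -> Partitions:
    """Wide: records from every input partition are brought together
    (the "shuffle") and then split into new, sorted partitions."""
    all_records = sorted((x for part in partitions for x in part), key=key)
    size = -(-len(all_records) // num_partitions)  # ceiling division
    return [all_records[i * size:(i + 1) * size] for i in range(num_partitions)]


data = [[5, 1, 8], [3, 9, 2], [7, 4, 6]]

evens = narrow_filter(data, lambda x: x % 2 == 0)
sorted_parts = wide_sort_by(data, key=lambda x: x, num_partitions=3)

print(evens)         # same number of partitions, no record changed partition
print(sorted_parts)  # each new partition mixes records from several inputs
```

In the filtered result every surviving record stays in the partition it started in, whereas the first sorted partition contains 1, 2 and 3, which came from three different input partitions.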
So when you submit a job to Spark, what you’re submitting is basically a set of actions and transformations that Catalyst turns into the job’s logical plan, from which it then generates the physical plan it judges to be the most efficient.
Now that we know how Spark sees the jobs that we submit, in the next article in the series we’ll take a look at how it runs them.