This is the third and last article of the Spark-centered series. Reading the first and second parts is highly recommended before going through this one, in which we’ll discuss how you could optimize Spark jobs from your end of the spectrum.
In the previous article we examined Spark’s sophisticated optimization process, and it’s now clear that Spark relies on a meticulously crafted mechanism to achieve its mind-boggling speed. But it would be a mistake to think that Spark will give you optimal results no matter how you do things on your side.
The assumption is easily made, especially when migrating from another data-processing tool. A 50% reduction in processing time compared to the tool you’ve been using could make you believe that Spark is running at full speed and that you can’t reduce the execution time any further. The thing is, you can.
Spark SQL and its optimizer, Catalyst, can do wonders on their own, via the process we discussed in the second article of the series, but through some twists and techniques, you can take Spark to the next level.
Always take a look under the hood
The first thing to keep in mind when working with Spark is that the execution time doesn’t have much significance on its own. To evaluate the job’s performance, it’s important to know what’s happening under the hood while it’s running. During the development and testing phases, you need to frequently use the explain function to see the physical plan generated from the statements you wish to analyze, and for an in-depth analysis you could add the extended flag to see the different plans that Spark SQL went through (from the parsed logical plan to the physical plan). This is a great way to detect potential problems and unnecessary stages without even having to actually execute the job.
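As a rough illustration of the two calls mentioned above, here is a minimal PySpark sketch. The helper name print_plans, the demo function, and the sample orders data are all hypothetical and exist only for this example; only DataFrame.explain and its extended flag come from Spark itself.

```python
def print_plans(df):
    """Print the plans Spark built for df without executing the job."""
    # Physical plan only.
    df.explain()
    # Parsed, analyzed and optimized logical plans, then the physical plan.
    df.explain(extended=True)


def demo():
    # Hypothetical local demo; pyspark is imported here so the helper above
    # stays importable on machines without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]").appName("explain-demo").getOrCreate()
    )
    # Made-up sample data, used only to have something to plan.
    orders = spark.createDataFrame(
        [(1, "FR", 20.0), (2, "US", 35.5), (3, "FR", 12.0)],
        ["order_id", "country", "amount"],
    )
    totals = (
        orders.filter(F.col("amount") > 15)
        .groupBy("country")
        .agg(F.sum("amount").alias("total"))
    )
    # No action is triggered here: explain only prints the plans.
    print_plans(totals)
    spark.stop()
```

Calling demo() on a machine with a local PySpark installation prints both outputs one after the other, which makes it easy to compare what you wrote with what Spark actually plans to run.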
Know when to use the cache
Caching is very important when dealing with large datasets and complex jobs. It allows you to save the datasets that you plan on using in subsequent stages so that Spark doesn’t create them again from scratch. This advantage sometimes pushes developers into “over-caching”, to the point where the cached datasets become a burden that slows down the job instead of optimizing it. To decide which datasets you need to cache, you have to lay out your job in its entirety, and then figure out through testing which datasets are actually worth caching and at which point you could unpersist them to free up the memory they occupy. Using the cache efficiently can allow Spark to run certain computations up to 10 times faster, which could dramatically reduce the total execution time of your job.
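One possible way to keep caching under control is to tie the cache and unpersist calls to the exact span where a dataset is reused. The sketch below assumes a hypothetical helper, run_with_cache, and a made-up events dataset; it is one pattern among others, not a prescribed Spark idiom.

```python
def run_with_cache(df, actions):
    """Cache df, run every action against it, then free the cached data."""
    # cache() is lazy: nothing is stored until the first action runs.
    cached = df.cache()
    try:
        return [action(cached) for action in actions]
    finally:
        # Release the memory as soon as the reuse is over, even on failure.
        cached.unpersist()


def demo():
    # Hypothetical local demo; requires a PySpark installation.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
    )
    events = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
    # This filtered dataset feeds two separate actions, so it is a
    # candidate for caching; whether it pays off should be measured.
    filtered = events.filter(F.col("id") % 3 == 0)
    row_count, per_bucket = run_with_cache(
        filtered,
        [
            lambda d: d.count(),
            lambda d: d.groupBy("bucket").count().collect(),
        ],
    )
    print(row_count, len(per_bucket))
    spark.stop()
```

A dataset that feeds a single action gains nothing from this wrapper, which is why it is worth timing the job with and without the cache before keeping it.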
Know your cluster, and your data
A key element to getting the most out of Spark is fine-tuning its configuration according to your cluster. Relying on the default configuration may be the way to go in certain situations, but usually you’re one parameter away from getting even more impressive results. The number of executors, the number of cores per executor, and the memory size of each executor can all greatly influence the performance of your jobs, so don’t hesitate to run benchmarks to see whether certain parameters could be optimized.
Finally, an important factor to keep in mind is that you need to know the data that you’re dealing with and what to expect from every operation. When one stage is taking too long even though it’s dealing with less data than other stages, you should inspect what that stage is actually doing with your data. Spark is great when it comes to doing the heavy lifting and running your code, but only you can detect business-related issues that stem from the way you defined your job.
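To give a starting point for such benchmarks, the sketch below derives the three settings from a cluster description. The function name and the numbers baked into it (one core and 1 GB reserved per node, five cores per executor, one executor slot left for the driver, 10% of memory kept aside for overhead) are rule-of-thumb assumptions, not values from this article or guarantees from Spark; only the three configuration keys are real Spark properties.

```python
def suggest_executor_config(nodes, cores_per_node, mem_gb_per_node,
                            cores_per_executor=5, overhead_fraction=0.10):
    """Return a first-guess executor layout to benchmark against the defaults."""
    # Assumption: keep 1 core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for a single executor")

    # Assumption: leave one executor slot free for the driver.
    total_executors = max(1, nodes * executors_per_node - 1)

    # Split node memory between executors, keeping a share for off-heap overhead.
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    heap_gb = int(mem_per_executor_gb * (1 - overhead_fraction))
    if heap_gb < 1:
        raise ValueError("not enough memory per node for a single executor")

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
    }


# Hypothetical cluster: 10 worker nodes with 16 cores and 64 GB each.
config = suggest_executor_config(nodes=10, cores_per_node=16, mem_gb_per_node=64)
for key, value in config.items():
    print(f"--conf {key}={value}")
```

The printed lines can be pasted into a spark-submit command as a baseline, then adjusted one parameter at a time while measuring the effect on your own jobs.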
If you apply all of these rules while developing and implementing your Spark jobs, you can expect the record-breaking processing tool to reward you with jaw-dropping results.
These recommendations are merely a first step towards mastering Apache Spark. In upcoming articles we’ll discuss its different modules in detail to get a better understanding of how Spark functions.