When it comes to using threads in a pyspark script, it might seem confusing at first. It took me a while to realize that part of the answer to my tuning problem was the time spent writing data to HDFS. So I wondered whether I could apply some sort of parallelism where, given my use case, results could be generated simultaneously, regardless of how the program was distributed.
As the Apache website itself puts it:
“Spark is a fast and powerful framework that provides an API to perform massive distributed processing over resilient sets of data.”
But why would anyone try to use threads in a distributed process?
Imagine lines of code, organized into functions. We build a framework where calls are made line by line, each one starting only after the previous function returns. Now, imagine that you have a file and need to perform two well-known types of processing on it: wordCount and charCount. It might seem simple, but when you apply this example to a larger and more complex project, the use of threads becomes natural, especially for independent actions: writing several different parquet files from a single dataframe, or creating a new stream, alongside the main one, that applies rules and agreements over an RDD up until its final write to HDFS, while the main stream keeps running normally. The process is distributed, but it is not running in parallel at the level of code lines. When we apply threads, depending on the use case, the result can be awesome.
Let’s look at one more explanation.
By definition, a thread is:
“A small program that works as a subsystem; a way for a process to divide itself into two or more tasks. It is another term for a line or sequence of execution. These multiple tasks can run concurrently, finishing faster than a single-block program, or in parallel, so fast that they seem to be working at the same time.”
(Source: O que é Thread)
In other words, a thread lets your tasks run in parallel, simultaneously, and thus faster than tasks executed line after line. Imagine that, given a business rule, you are able to perform tasks in parallel, such as filtering a dataframe to separate bad and good records, and then performing actions such as .write() on each result. These outputs are generated from a single variable and have no dependency on each other.
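As an illustration (not code from the article), a minimal sketch of that pattern might look like the following. The `status` column, the filter conditions, and the output paths are all hypothetical:

```python
import threading

# Hypothetical sketch: two independent writes derived from one cached
# dataframe, each performed by its own thread. The `status` column and
# the output paths are illustrative assumptions.
def write_good(df):
    df.filter(df.status == "ok").write.parquet("/out/good")

def write_bad(df):
    df.filter(df.status != "ok").write.parquet("/out/bad")

def write_both(df):
    # Both writes read from the same cached dataframe, so they can run
    # at the same time without depending on each other.
    t1 = threading.Thread(target=write_good, args=(df,))
    t2 = threading.Thread(target=write_bad, args=(df,))
    t1.start(); t2.start()
    t1.join(); t2.join()
```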
Spark itself works in a distributed way; however, at the level of code lines, the process is not smart enough to start the first function and the following ones in parallel so that they are seamlessly processed at the same time. But it is smart enough to assign resources to smaller tasks, freeing them up as soon as a process finishes. Therefore, when using threads, these functions execute in parallel, using the resources dynamically. (Especially when we switch spark.scheduler.mode from FIFO to FAIR.)
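As a configuration sketch (the app name and pool name below are my own examples, and this assumes a working Spark installation), the FAIR mode can be set when building the context, and each thread can then claim a named pool:

```python
# Configuration sketch; not runnable without a Spark installation.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("threaded-job") \
                  .set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=conf)

# Inside a worker thread, assign that thread's jobs to a named pool:
sc.setLocalProperty("spark.scheduler.pool", "pool_a")
```

The same setting can be passed on the command line instead, with --conf spark.scheduler.mode=FAIR in spark-submit.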
How to apply threads in a pyspark process?
The script below provides a simple example of how to apply threads in a pyspark scope: loading a file and processing two functions in parallel. First, we import the Python threading package.
import threading
· A valuable point when using threads is combining their execution shortly after a .cache(). With actions and transformations in Spark, keep in mind that a method that is not optimally designed can cause an action to recompute the same transformations countless times. Worse still, in a parallel process the same records can be accessed simultaneously. By putting your variable in memory, the tasks use the data in parallel without having to load it again.
file = sc.textFile("/inFiles/shakespeare.txt").flatMap(lambda line: line.split(" ")).cache()
· We instantiate thread variables with the functions, passing the variable (e.g. an RDD) as input parameter.
#Instantiating thread variables with functions in memory.
T1 = threading.Thread(target=wordCount, args=(file,))
T2 = threading.Thread(target=charCount, args=(file,))
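For completeness, here is a hedged sketch of what the two functions referenced above might look like; the bodies are my assumption, since they are not shown here, and any RDD-like object with the same methods would work:

```python
# Assumed bodies for the two functions the threads call. `rdd` is the
# cached RDD of words built above.
def wordCount(rdd):
    n = rdd.count()  # number of words in the file
    print("words:", n)
    return n

def charCount(rdd):
    total = rdd.map(lambda w: len(w)).sum()  # total characters
    print("chars:", total)
    return total
```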
· We start execution with .start() and wait for completion with .join(); without the .join(), the threads keep running in parallel until they finish, alongside the main stream.
# Starting threads execution.
T1.start()
T2.start()
# Pausing the main stream until the threads finish.
T1.join()
T2.join()
With the script ready, it is possible to watch the execution and see how the parallel processes behave within the stream.
In the example below, I am performing five tasks in parallel, where two of them are writing data to HDFS and one is performing a .count():
The first time threads appeared in one of my spark scripts was when we needed to apply tuning, beyond what the documentation covers, and shorten the execution time. Some considerations came out of this work:
· You need to be familiar with your stream before considering threads, which may or may not make your process faster. In small test scenarios, the running time of a parallel stream was identical to a line-after-line execution;
· If you are not using classes and objects, or a global variable, it won’t be possible to use a result inside and outside a thread, especially when it runs in parallel with the main stream, because the variable might still be being created at the moment it is used;
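To make that concrete, here is a minimal pure-Python sketch (runnable without Spark) of sharing a thread's results through a mutable container; in a real job the worker functions would call actions such as .count() on the cached RDD:

```python
import threading

# Shared, mutable container: each thread writes its result here so the
# main stream can read it after .join().
results = {}

def count_words(words, out):
    out["words"] = len(words)

def count_chars(words, out):
    out["chars"] = sum(len(w) for w in words)

data = ["to", "be", "or", "not", "to", "be"]
t1 = threading.Thread(target=count_words, args=(data, results))
t2 = threading.Thread(target=count_chars, args=(data, results))
t1.start(); t2.start()
t1.join(); t2.join()  # without join(), results might not be ready yet

print(results["words"], results["chars"])  # 6 13
```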
· Whenever possible, use .cache()! Threads that use a variable already loaded in memory can further reduce processing time, because actions and transformations are faster on a dataframe held in memory;
· Create pools! By using Spark’s job scheduling, it is possible to view your tasks running in parallel in the ApplicationMaster. You change the way Spark schedules work by setting spark.scheduler.mode=FAIR (default: FIFO) in spark-submit, or within the script through the SparkContext, among other possibilities, as shown by the following image:
Finally, after a year working with spark, I was able to realize just how powerful it is, and also that it is still a new tool for many, one that few, in fact, understand deeply. Many hours of hard research and testing later, I felt the information could be explained simply and concisely, so that any layman could understand and apply it (the documentation will not teach you how to start a project from scratch). The big data world has never been easier to put into practice, and after my own experience I was motivated to share topics that were hard to find at the beginning of this journey.
So, from now on, I hope to share a series of articles with you, working out the details in a clear, simple and complete way. I hope it will be a two-way street so that, together, we can exchange ideas and experiences. Please send your comments and thoughts.