Persist and Cache in Apache Spark | Spark Optimization Technique


In today's chapter, we are going to learn yet another important interview question: what is the difference between persist and cache in Apache Spark? We will discuss in detail why persist and cache matter, then get hands-on with the concept and learn how to use cache() and persist() in PySpark. Come, let's get started.

Spark Optimization:



Optimizing a Spark application plays a significant role in handling and processing huge volumes of data. A badly designed job wastes compute power and degrades the performance of your Spark application. To make a Spark application run faster and use resources effectively, we need to optimize the Spark code. One such optimization technique that delivers a noticeable improvement in the performance of a Spark application or job is Spark persist and cache.

Let's assume our Spark application has a few sequential steps, and the data produced by one of those steps is used in two or three upcoming steps. In this case, Spark recomputes that data each and every time and repeats the same operation for every one of those later steps. We can avoid this repeated computation by computing the data once, storing it in memory, and then using the stored data for the upcoming steps. Apache Spark provides this feature through the cache() and persist() methods.



One can use cache() or persist() to store an intermediate dataset and reuse it in upcoming actions.
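As a quick illustration, here is a minimal sketch, assuming a hypothetical events.csv input file (with hypothetical status and user_id columns), of a DataFrame that feeds several actions; caching it means the expensive read and parse happens only once instead of once per action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

#base DataFrame reused by several actions (hypothetical input path)
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()

#first action computes df and stores its partitions
print(df.count())

#later actions reuse the cached partitions instead of re-reading the CSV
print(df.filter("status = 'ok'").count())
print(df.select("user_id").distinct().count())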

Apache Spark Persist Vs Cache:


Both persist() and cache() are Spark optimization techniques used to store intermediate data. The only difference is that cache() always uses the default storage level (MEMORY_ONLY for an RDD, MEMORY_AND_DISK for a DataFrame), whereas with persist() the developer can choose the storage level and keep the data in memory, on disk, or both.




Cache() - Overview with Syntax:


When you cache a DataFrame or RDD, Spark stores its data in memory. cache() takes the default storage level, which is MEMORY_ONLY for an RDD and MEMORY_AND_DISK for a DataFrame. When the data is cached, Spark stores the partition data in the JVM memory of each node and reuses it in upcoming actions. The cached data on each node is fault-tolerant: even if a node loses the data stored in its memory, Spark automatically recomputes it using the original DAG.

Spark caching internally invokes persist() to store the resulting DataFrame or RDD. The syntax to call cache() on an RDD or DataFrame is as follows.

Syntax:

#cache RDD to store data with its default level, MEMORY_ONLY
rdd.cache()

#cache DataFrame to store data with its default level, MEMORY_AND_DISK
df.cache()

To check whether a DataFrame is cached, we can use df.is_cached or df.storageLevel.useMemory. Both return a boolean value, True or False.



Example:

Let us consider the input data as a CSV file; we read it into a DataFrame and then call cache() on it.
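A minimal sketch of that flow, assuming a hypothetical sample.csv input file, might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

#read the input CSV as a DataFrame (hypothetical file name)
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

print(df.is_cached)                #False - nothing cached yet
df.cache()                         #mark the DataFrame for caching
print(df.is_cached)                #True
print(df.storageLevel.useMemory)   #True - the storage level keeps data in memory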


Here, we can notice that before cache() the boolean check returns False, and after caching it returns True.

Persist() - Overview with Syntax:

For a Spark DataFrame, persist() by default uses the MEMORY_AND_DISK storage level (for an RDD, the default is MEMORY_ONLY). With MEMORY_AND_DISK, Spark initially stores the data in JVM memory, and when the data needs more room than memory can accommodate, it pushes the excess partitions to disk and reads them back from disk when they are required. Because that spill involves disk I/O, reading those partitions back is considerably slower than reading purely from memory.

There are many storage levels that can be used with persist(). Before moving on to the syntax, let us look at each of them. The following code shows the class definition underlying every storage level used by persist().

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)


      • MEMORY_ONLY
      • MEMORY_AND_DISK
      • MEMORY_ONLY_SER
      • MEMORY_AND_DISK_SER
      • DISK_ONLY
      • MEMORY_ONLY_2
      • MEMORY_AND_DISK_2
      • DISK_ONLY_2
      • MEMORY_ONLY_SER_2
      • MEMORY_AND_DISK_SER_2

To check the storage level of a DataFrame or RDD, we can use rdd.getStorageLevel() or df.storageLevel.

Among the levels above:

  • Any level ending with _2 means the persisted data is replicated, with each partition stored on two cluster nodes.
  • Any level containing _SER means the DataFrame or RDD objects persisted in memory or on disk are stored in serialized form. Serialization makes the data more space-efficient because it consumes less memory, but it takes some extra CPU work to deserialize it when it is read back. The sketch after this list shows how these named levels map onto the StorageLevel constructor.
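To make the mapping concrete, here is a small sketch showing how a named level corresponds to the StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication) constructor shown earlier:

from pyspark import StorageLevel

#a level ending in _2 keeps two replicas of every partition
print(StorageLevel.MEMORY_AND_DISK_2)

#the same level can also be built directly from the constructor
custom_level = StorageLevel(True, True, False, False, 2)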

The image below gives an overall view of all the storage levels available with persist().


Syntax:

There are two ways to use persist: one without specifying any storage level, persist(), and another that takes a storage level, persist(storage_level). The syntax for both methods is given below.

from pyspark import StorageLevel

#persist dataframe with the default storage level
df.persist()

#persist dataframe with MEMORY_AND_DISK_2
df.persist(StorageLevel.MEMORY_AND_DISK_2)



Example:

Just as with cache, we will try to persist a DataFrame.
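A minimal sketch, again assuming the same hypothetical sample.csv input file, could look like this:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-example").getOrCreate()

#read the input CSV as a DataFrame (hypothetical file name)
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

#persist with an explicit storage level: memory plus disk, replicated on two nodes
df.persist(StorageLevel.MEMORY_AND_DISK_2)

print(df.is_cached)     #True - the DataFrame is marked as persisted
print(df.storageLevel)  #shows the storage level chosen above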


How to Remove cached or Persisted Data:

To remove data cached in memory or on disk, we can use the syntax below.

#remove the cached or persisted data
df.unpersist()
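Continuing the earlier sketches, unpersisting the DataFrame flips the cached flag back:

df.unpersist()        #drop the cached or persisted partitions
print(df.is_cached)   #False - the data is no longer stored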

Significance of Cache and Persistence in Spark:

Both persist() and cache() play an important role in Spark optimization. They:

  • reduce operational cost (cost-efficient),
  • reduce execution time (faster processing), and
  • improve the overall performance of the Spark application.

Hope you all enjoyed this article on cache and persist using PySpark. Please do subscribe to my new YouTube channel for more videos. Post your comments and doubts in the comment box below.

Happy Learning!!!
