Spark API Resilient Distributed Dataset (RDD) | What is RDD in Spark? | Apache Spark Tutorial


What is RDD?


To interact with and process data, Apache Spark creates a specialized data structure called the RDD, which is the basic building block of any Spark application. An RDD is an immutable collection of data distributed across the cluster as logical partitions. In this blog we will familiarize ourselves with the concept of the RDD and also get some hands-on knowledge of how to work with RDDs in Spark.

MapReduce jobs suffer from slowness due to replication, disk I/O and serialization; most of the time in a Hadoop application is spent on HDFS read-write operations. To overcome this kind of issue, the key idea used in Spark is the RDD. It supports in-memory computation, which reduces HDFS read-write operations on intermediate results, i.e., intermediate state is kept in memory as an object across the job, and this object can be shared across different jobs of the same application. Sharing data in memory lets us process it 10 to 100 times faster than sharing it over the network or disk.

Features of Spark API - RDD:

Fault tolerance:

RDDs keep track of the operations performed on the data, i.e., they maintain data lineage information. With the help of this lineage, Spark recovers lost data automatically in the event of a failure. This is what makes an RDD "resilient".
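
As a quick illustration (a minimal sketch, assuming a SparkContext "sc" like the one created in the hands-on section below), toDebugString() prints the lineage Spark keeps for an RDD:

# Minimal sketch: inspect the lineage Spark tracks for an RDD.
# Assumes a SparkContext "sc", as created in the hands-on section below.
rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.toDebugString())  # prints the chain of parent RDDs used for recovery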



Parallelism: 

An RDD spreads the data to be processed across multiple nodes of a cluster, which makes it possible to run tasks/jobs in parallel across multiple worker nodes.
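
A minimal sketch of this (again assuming the SparkContext "sc" from the hands-on section) is to spread a collection over a fixed number of partitions, so each partition can be processed as a separate task:

# Minimal sketch: distribute a collection over 4 partitions for parallel processing.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())  # 4 -- each partition can run as a separate task on a worker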



Lazy Evaluation: 

RDDs never start execution until an action is called, i.e., data is not loaded into an RDD even after transformations are defined on it. Computation starts only when an action is performed on top of that RDD.
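
A minimal sketch: nothing is computed when the map() transformation is defined; the work happens only when the collect() action is called.

# Minimal sketch: transformations are only recorded, actions trigger execution.
rdd = sc.parallelize([1, 2, 3]).map(lambda x: x * 10)  # transformation -- nothing runs yet
print(rdd.collect())  # action -- computation happens here, prints [10, 20, 30]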

Immutability:

Once an RDD is created, it cannot be modified or changed; the data in an RDD is read-only. However, we can create another RDD from an existing RDD by applying a transformation on top of it.
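
A minimal sketch: a transformation returns a new RDD and leaves the original untouched.

# Minimal sketch: the original RDD is unchanged after a transformation.
rdd_a = sc.parallelize([1, 2, 3])
rdd_b = rdd_a.map(lambda x: x + 1)       # new RDD derived from rdd_a
print(rdd_a.collect(), rdd_b.collect())  # [1, 2, 3] [2, 3, 4]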


In-memory Computation:

Hand in hand with immutability comes in-memory computation. RDDs keep intermediate results in RAM (internal memory) instead of spilling them to disk. Processing in memory makes the computation much faster compared to a legacy MapReduce application.
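
A minimal sketch: cache() asks Spark to keep the RDD in memory, so a second action reuses the in-memory data instead of recomputing it.

# Minimal sketch: keep an intermediate RDD in RAM for reuse across actions.
rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.cache()         # mark the RDD to be kept in memory once computed
print(rdd.count())  # first action computes and caches the data
print(rdd.sum())    # second action reads from the in-memory cache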

Partitioning:

Partitions can be created on top of an existing RDD. The partitions created are logical chunks of the data and are themselves immutable.
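
A minimal sketch: glom() groups the data per partition, so we can see the logical chunks an RDD is split into.

# Minimal sketch: view the logical partitions of an RDD.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)
print(rdd.glom().collect())  # e.g. [[1, 2], [3, 4], [5, 6]] -- one list per partition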


Hands-On - Ways to Create RDDs:


As part of the hands-on, let us look at the multiple ways of creating RDDs in PySpark using a Jupyter notebook. As a first step, create the SparkSession and SparkContext as shown below.

# Create SparkSession and SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master("local") \
                    .appName('createRDD') \
                    .getOrCreate()
sc = spark.sparkContext


Using parallelize():


We can create an RDD from a collection of objects/elements, which here is nothing but a Python list, using the parallelize() method. Use the code snippet below to achieve this.

Here, we create a list with six elements in it and apply parallelize() on top of the list, which outputs a distributed dataset called an RDD. Notice how the type changes once the collection of elements is parallelized.

# Input collection of elements
list_in = ["Way", "to", "create", "RDD", "in", "pySpark"]
type(list_in)

Out[]:list

rdd1 = sc.parallelize(list_in)
type(rdd1)

Out[]:pyspark.rdd.RDD 

Reading from Data File:


We can create RDDs by reading an external data file. The file can be of any format, such as CSV, JSON, TXT, etc. Below is a code snippet to load a data file as an RDD.

Input File:

We created a dummy input file, inputfile.txt, with one sentence in it: "Hi, snippet to create RDD from file in pyspark."


Code Snippet:

Type in the code snippet below to read data from the dummy input file we created. Notice that the type of "rdd2" is pyspark.rdd.RDD.

rdd2 = sc.textFile('inputfile.txt')
type(rdd2)

Out[]:pyspark.rdd.RDD 

Output:

To see the content of the RDD, we can use the collect() method.

rdd2.collect()

Out[]:['Hi, snippet to create RDD from file in pyspark.'] 



From Existing RDD:


As we know, RDDs are immutable and cannot be modified, so the last way to create an RDD is from an existing RDD, by applying a transformation such as map() or flatMap() on top of it.

Code Snippet:

We apply the map() operation on "rdd1" to create "rdd3". Notice that the type of rdd3 is a PipelinedRDD, which is still an RDD.

rdd3 = rdd1.map(lambda x: x.split(','))
type(rdd3)

Out[]:pyspark.rdd.PipelinedRDD 

Output:

Collect rdd3 to see the result.

rdd3.collect()

Out[]:[['Way'], ['to'], ['create'], ['RDD'], ['in'], ['pySpark']] 
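
Since flatMap() was mentioned above, here is a small sketch (the name "rdd4" is ours) showing how it differs from map() on the same "rdd1": it flattens the nested lists into a single list of elements.

rdd4 = rdd1.flatMap(lambda x: x.split(','))
rdd4.collect()

Out[]:['Way', 'to', 'create', 'RDD', 'in', 'pySpark']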

Limitations of RDD:


Though we say RDDs are the basic building block of Spark, they do have some limitations.

RDD on Structured Data:

RDDs cannot be used efficiently with structured datasets. An RDD can be created from any kind of data format, but RDDs do not hold on to the schema of the data file, so we cannot run SQL queries on top of an RDD. We will see how to handle this use case in upcoming chapters.


In-Memory (RAM) Size:

If the size of the data exceeds the available memory and there is not enough room in RAM, falling back to disk will degrade performance.
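
One way to soften this (a sketch, not a complete fix) is to persist with a storage level that allows spilling, so partitions that do not fit in RAM are written to disk instead of being recomputed.

# Minimal sketch: allow partitions that do not fit in RAM to spill to disk.
from pyspark import StorageLevel

big_rdd = sc.parallelize(range(10**6))
big_rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(big_rdd.count())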

Full Program:
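
Putting the snippets above together, the full program would look roughly like this (using the same inputfile.txt as earlier):

# Full program: the hands-on snippets above assembled into one script.
from pyspark.sql import SparkSession

# Create SparkSession and SparkContext
spark = SparkSession.builder \
                    .master("local") \
                    .appName('createRDD') \
                    .getOrCreate()
sc = spark.sparkContext

# 1. Create an RDD from a Python list using parallelize()
list_in = ["Way", "to", "create", "RDD", "in", "pySpark"]
rdd1 = sc.parallelize(list_in)
print(type(rdd1))

# 2. Create an RDD by reading a data file
rdd2 = sc.textFile('inputfile.txt')
print(rdd2.collect())

# 3. Create an RDD from an existing RDD using a transformation
rdd3 = rdd1.map(lambda x: x.split(','))
print(rdd3.collect())

spark.stop()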


Hope you all understood Spark's building block, the RDD. Leave a comment if you need any clarification on the theory or hands-on above. In an upcoming article, we will look in detail at the operations that can be performed on an RDD.

Happy Learning!!!
