Spark Architecture | Apache Spark Tutorial | LearntoSpark

Spark Architecture




In this tutorial, we will look in detail at the Apache Spark architecture: the components available in the architecture of Spark, the role of each component, and the working model of the Spark architecture. Coding can be learnt easily, but understanding the working of Spark and its architecture is far from trivial. Come, let us move on to today's session.

Introduction:


Apache Spark is a powerful parallel computing framework that utilizes in-memory computation to run applications up to 100 times faster than Hadoop MapReduce computation. It is an open-source framework that trends in the IT market because of its features, such as high-speed computation at low cost, high scalability, the ability to handle large volumes of data (even petabytes can be processed easily), and fault tolerance. Spark supports both batch-mode processing and real-time data processing.


Spark Architecture - Overview


Apache Spark follows the typical master-slave architecture, similar to Hadoop HDFS. The driver node in the cluster acts as the single master node, and all the other nodes behave as slave/worker nodes. These two are the main daemons responsible for the tasks submitted to the cluster for processing. Processing inside the cluster is based on two main abstractions in the architecture. They are:

  • RDD - Resilient Distributed Dataset
  • DAG - Directed Acyclic Graph



Resilient Distributed Dataset:


RDDs are the building blocks of the Spark core component, with the ability to split a collection of data blocks into a number of partitions. These partitions are stored in-memory on each of the slave nodes or executors for processing. An RDD can be created by reading a file from HDFS, or by parallelizing a collection of elements. To gain more insight into RDDs, I would recommend you go through the chapter on RDD by following this link: Insights on RDD.
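To make partitioning concrete, here is a minimal pure-Python sketch (not PySpark itself) of how a parallelized collection might be split into near-equal partitions, in the spirit of `sc.parallelize(data, numSlices)`:

```python
def parallelize(data, num_partitions):
    """Split a collection into roughly equal partitions,
    mimicking how Spark distributes a parallelized collection."""
    n = len(data)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic gives each partition a near-equal slice.
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        partitions.append(data[start:end])
    return partitions

rdd_like = parallelize(list(range(10)), 3)
print(rdd_like)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In real Spark, each of these partitions would live in memory on a different executor, so transformations on them can run in parallel.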


Directed Acyclic Graph:


Spark permits two types of operations: Transformations and Actions. The DAG maintains the order, or the sequence, in which the steps need to be executed. Each step executed on a worker node takes an RDD partition as input, and its output is also an RDD partition. A series of transformations gets executed only when an action is triggered on top of the execution steps delivered to the cluster for processing. This property of Apache Spark is called Lazy Evaluation. The DAG also helps Spark with its multi-stage execution approach, which improves performance over Hadoop computation.
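Lazy evaluation can be sketched with a toy class (an illustration, not Spark's actual implementation): transformations only record steps into a lineage, and nothing runs until an action such as `collect` is called.

```python
class LazyRDD:
    """A toy RDD: transformations only record steps; an action runs them."""
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []   # the recorded lineage (a linear DAG)

    def map(self, fn):
        # Transformation: nothing is computed yet, the step is just recorded.
        return LazyRDD(self.data, self.steps + [("map", fn)])

    def filter(self, fn):
        return LazyRDD(self.data, self.steps + [("filter", fn)])

    def collect(self):
        # Action: only now is the recorded sequence of steps executed.
        out = self.data
        for kind, fn in self.steps:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(rdd.steps))   # 2 -- two transformations recorded, nothing executed yet
print(rdd.collect())    # [20, 30, 40]
```

Because the whole chain is known before anything runs, Spark is free to optimize it, which is exactly what the DAG scheduler does.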



Components of Spark Architecture:


As discussed previously, Apache Spark consists of three main components:

Master Node - (Spark Driver Program)
Worker Node - (Spark Executor)
Cluster Manager - (Resource Manager)




Working Model of Spark Architecture:



Let us learn the working of each component inside the architecture with one simple example. Consider the diagram shown above.

Let us assume we submitted Spark code with some transformations and actions to the cluster using spark-submit. The submitted code might be a jar file or a py file.
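A typical spark-submit invocation looks like the following. These are real spark-submit flags, but the master URL, file names, class name, and resource sizes below are illustrative placeholders, not values from this tutorial:

```shell
# Submitting a Python application (values are illustrative placeholders)
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  my_app.py

# Submitting a packaged Scala/Java application as a jar
spark-submit \
  --master yarn \
  --class com.example.MyApp \
  my_app.jar
```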

Roles of Spark-driver in Spark Architecture:


The first step is that the submitted code is moved to the Spark driver. The Spark driver is the central unit of processing, responsible for initiating the SparkContext and executing the submitted application: the DAG scheduler splits the application code into a sequence of jobs, splits each job into stages, and each stage into small tasks. These tasks are handed over to the task scheduler.

Once the code is submitted, the driver analyzes it and transforms it into a logical plan called the DAG. It also performs optimizations on the flow, like pipelining tasks and transformations. This logical plan undergoes a series of steps to produce an optimized physical execution plan for our code. The physical plan is split into stages, and each stage is bundled into tasks, which are forwarded to the cluster.
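The stage-splitting idea can be sketched as follows. This is a deliberately simplified model (the real DAG scheduler works on a dependency graph, not a flat list): narrow operations are pipelined into one stage, and a new stage begins at each wide (shuffle) dependency.

```python
def plan_stages(ops):
    """Group a linear sequence of operations into stages, cutting a new
    stage at each wide (shuffle) operation -- a simplified view of how
    the DAG scheduler pipelines narrow transformations together."""
    WIDE = {"reduceByKey", "groupByKey", "join"}  # operations needing a shuffle
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:          # a shuffle boundary ends the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

ops = ["map", "filter", "reduceByKey", "map", "collect"]
print(plan_stages(ops))  # [['map', 'filter', 'reduceByKey'], ['map', 'collect']]
```

Note how `map` and `filter` are pipelined into the same stage: each task applies both to its partition without materializing intermediate results, which is one source of Spark's performance advantage.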

The driver co-ordinates with the cluster manager to get the resources to execute the tasks that are ready to be launched by the task scheduler. It submits the code, or the physical execution plan, to each executor allocated by the cluster manager, based on data placement. The driver also schedules the future tasks of the executors based on data placement, and finally collects the final output and passes it on to the client who made the spark-submit.


Roles of Cluster Manager in Spark Architecture:


The Spark driver submits a request to the cluster manager for the worker nodes or executors available in the cluster to perform the scheduled tasks. The cluster manager maintains the details of the worker nodes available in the cluster. The cluster manager in Apache Spark allocates the resources requested by the Spark driver program and instructs the slave/worker nodes to execute the tasks. It tracks the performance of each executor on its particular task and sends the status of job completion back to the driver. In case of a worker node failure, or if a node is lost in between, the cluster manager will allocate the same task to another executor that is ready to take that task for execution.

Roles of Executors in Spark Architecture:


Executors are the low-level laborers that hold the data and execute the optimized tasks assigned to them. Parallelism is directly proportional to the number of executors available in the cluster. If the number of worker nodes is increased, the available in-memory RAM increases, which in turn improves performance. Executors are assigned to the driver program by the cluster manager, and once assigned, the worker nodes automatically register themselves with the Spark driver and start processing the assigned units of work. After successful completion of the assigned tasks, the executors return the final output to the driver.
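A back-of-envelope way to see why parallelism scales with executors: each executor core runs one task at a time, so the number of "waves" needed to process all partitions is partitions divided by total slots. The numbers below are illustrative, not from the tutorial.

```python
import math

def waves(num_partitions, num_executors, cores_per_executor):
    """Number of sequential 'waves' of tasks: each core runs one task
    at a time, so parallel slots = executors * cores per executor."""
    slots = num_executors * cores_per_executor
    return math.ceil(num_partitions / slots)

# 100 partitions on 4 executors with 5 cores each -> 20 slots -> 5 waves
print(waves(100, 4, 5))  # 5
# Doubling the executors cuts the number of waves down
print(waves(100, 8, 5))  # 3
```

Fewer waves means less wall-clock time for the same job, which is the intuition behind adding worker nodes to improve performance.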

Hope you got the big picture of what happens in the background when a Spark application is submitted. Let me know your doubts or queries in the comment section.


Happy Learning!!!
