Spark Performance Tuning - Catalyst Optimizer:
Our primary objective in this chapter is to understand the optimization framework of Spark SQL, which lets Spark application developers express complex queries in just a few lines of code. We also learn how the Spark engine uses the Catalyst Optimizer to efficiently reduce query execution time. Finally, we will look at some examples to get a better understanding of optimization in Spark SQL.
What is Optimization in Apache Spark:
Optimization is the process by which a long-running application or system is fine-tuned so that it manages resources effectively and reduces processing time.
Apache Spark 2.0 was a major release that brought drastic changes to many of the framework's APIs and libraries. In earlier versions of Apache Spark, developers had to change the Spark source code to add their own optimizations. That solved the problem only for a particular user or account, and it is not advisable to change the code base of the framework. This led to the idea of enhancing the optimization features in later versions so that they act as a catalyst for running applications efficiently. In the current Apache Spark framework, we have a pluggable mechanism that lets us define a set of optimization rules and add them to Catalyst.
Spark SQL Catalyst Optimizer:
As you may know, transformations performed directly on RDDs are not that efficient, and the Spark SQL DataFrame and Dataset APIs outperform the RDD API. Spark SQL runs with an optimization engine called the Catalyst optimizer, which helps developers optimize queries built on top of both DataFrames and Datasets without making any changes to the source code.
Catalyst is a modular library within Spark SQL that supports both rule-based and cost-based optimization. Rule-based optimization applies a set of rules to enumerate the different ways a query can be executed, whereas cost-based optimization computes a cost for each of those candidate plans and helps the engine choose the best-suited approach for executing the SQL query. The Catalyst optimizer also offers extension points that can deal with external data sources and UDFs.
Components of Spark Catalyst optimizer:
Since Apache Spark is built in Scala, the Catalyst optimizer gets built-in Scala features such as pattern matching and runtime metaprogramming, which allow developers to specify complex optimizations. Catalyst is broadly divided into two parts, namely
- Abstract Syntax Trees (AST)
- Rules
Trees:
In the Catalyst Optimizer (CO), trees play a significant role, as a tree is its main data type. As shown in the diagram below, a tree is composed of node objects, and each node has a node type and zero or more children. Node objects are immutable, so we can process them easily using functional transformations.
For a better understanding of trees, let us consider one simple expression and try to draw the tree for it.
Expression: .where("tutorial = 'learntospark'").limit(2)
Here, the nodes are the limit and filter operations, the attribute is the column named tutorial, and the child literals are 2 and learntospark. So the tree structure will be as follows.
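To make this concrete, here is a minimal sketch, assuming a hypothetical DataFrame with a tutorial column (the session name, sample data and column values are made up), that prints the parsed logical plan for the expression above so the Limit and Filter nodes and their children become visible.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data, used only to visualise the tree.
val spark = SparkSession.builder()
  .appName("catalyst-tree-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("learntospark", 1), ("other", 2)).toDF("tutorial", "id")
val query = df.where("tutorial = 'learntospark'").limit(2)

// The parsed logical plan is a tree: a Limit node sitting on top of a Filter
// node whose condition references the attribute 'tutorial' and the literal
// 'learntospark'.
println(query.queryExecution.logical.numberedTreeString)
```

The exact node names (for example GlobalLimit/LocalLimit) vary slightly between Spark versions, but each printed line corresponds to one node of the tree.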
Rules:
Using rules, we can manipulate the trees described above. Rules are written as sets of pattern-matching functions that replace subtrees having a specific structure. This is done by applying the transform method on a Catalyst tree: the transformation recursively matches the pattern against each node and replaces matching nodes with the result. Ultimately, a rule is nothing but Scala code, which gives Catalyst more expressive power than other domain-specific languages; a simplified sketch follows.
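The Expression hierarchy and the transform helper below are made up for illustration and are far simpler than Catalyst's real TreeNode classes, but they show how a rule is just a pattern-matching function applied recursively over a tree.

```scala
// A toy expression tree, loosely modelled on Catalyst's TreeNode idea.
sealed trait Expression
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

// A "rule" is a partial function; transform applies it bottom-up to every node.
def transform(expr: Expression)(rule: PartialFunction[Expression, Expression]): Expression = {
  val withNewChildren = expr match {
    case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
    case leaf      => leaf
  }
  rule.applyOrElse(withNewChildren, (e: Expression) => e)
}

// Constant folding: replace Add(Literal(a), Literal(b)) with Literal(a + b).
val constantFold: PartialFunction[Expression, Expression] = {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
}

val tree   = Add(Attribute("x"), Add(Literal(1), Literal(2)))
val folded = transform(tree)(constantFold)
println(folded)   // Add(Attribute(x),Literal(3))
```

Catalyst's own transform method works on the same principle: you pass a partial function, and the engine walks the tree and rewrites every node that the pattern matches.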
Spark Catalyst - Execution plan:
Now that we have learnt in detail about query optimization using the Spark SQL Catalyst optimizer, let us move ahead and learn how a query gets executed, i.e. the execution plan for a query in Spark. The execution plan can be understood clearly by learning about the four components below.
- Analyzer
- Optimizer
- Physical planning
- Code generator
Let us look into each component one by one. Before that, let us see the diagrammatic representation of the execution flow through the Catalyst optimizer.
From the flow, we can get an idea of how much work Spark Catalyst does in the background to run the queries we execute. If your application contains SQL queries, they are first converted into DataFrame APIs. We can use the DataFrame's explain() method to understand the various plans Spark generates to execute a query.
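As a quick sketch, reusing the hypothetical tutorial DataFrame from the earlier tree example, explain(true) prints the parsed, analyzed, optimized and physical plans for a query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-explain-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("learntospark", 1), ("other", 2)).toDF("tutorial", "id")

// explain(true) prints the parsed logical, analyzed logical, optimized logical
// and physical plans that Catalyst produces for this query.
df.where("tutorial = 'learntospark'").limit(2).explain(true)
```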
1. Analyzer:
The Spark SQL Catalyst analyzer resolves the data types and names of the attributes in an unresolved query by using the catalog, which holds metadata and statistics about the available data. For example, let us assume a small problem statement like
SELECT (col_name * 2) as col_out from SQLtable;
By checking the catalog, the Spark Catalyst analyzer confirms that col_name is a valid column available in the table and also checks the type of col_name, so it knows whether a cast is required to perform the SQL operation. In simple words, the analyzer does the following (a small example follows this list):
- Searches the catalog for the relation mapping
- Determines the data type of each named attribute
- Checks casting requirements by walking over the expressions
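As a small sketch of the analyzer at work, assuming a hypothetical temporary view named SQLtable with an integer col_name column, the analyzed plan can be inspected through queryExecution.analyzed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-analyzer-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical table backing the query from the example above.
Seq(1, 2, 3).toDF("col_name").createOrReplaceTempView("SQLtable")

val query = spark.sql("SELECT (col_name * 2) AS col_out FROM SQLtable")

// The analyzed plan has every attribute resolved against the catalog,
// with its data type filled in (col_name resolved as an integer column here).
println(query.queryExecution.analyzed.treeString)
```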
2. Optimizer:
Now the logical plan that we got from the analyzer has to be converted into an optimized logical plan; the optimizer generates the optimized logical plan from the analyzed logical plan. Each operation in a query is represented as a tree, and when the analyzed plan passes through the optimizer, the tree is recursively transformed by applying a set of optimization rules until it reaches a fixed point. The output of the optimizer is the optimized logical plan.
One feature of the Catalyst optimizer is that we can write custom rules for our expressions and plug them into the framework, as sketched below. We will look into this topic in detail in later chapters.
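As a minimal sketch of that plug-in point: Rule and LogicalPlan are internal Catalyst classes (not a stable public API, and their details vary between Spark versions), and the rule below deliberately rewrites nothing, it only prints the plan it receives, just to show where a custom rule hooks in via spark.experimental.extraOptimizations.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A pass-through rule: a real rule would return a rewritten plan instead of
// the same plan it was given.
object PrintPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    println(s"Custom rule saw plan:\n${plan.treeString}")
    plan
  }
}

val spark = SparkSession.builder()
  .appName("catalyst-custom-rule-demo")
  .master("local[*]")
  .getOrCreate()

// Register the rule; Catalyst runs it as an extra optimizer batch.
spark.experimental.extraOptimizations = Seq(PrintPlanRule)
```

In newer Spark versions, SparkSessionExtensions (for example injectOptimizerRule) is the more formal way to register such rules.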
3. Physical Planning:
Now that we have the optimized logical plan, we need to execute it on the cluster. Physical planning is the conversion of the optimized logical plan into a physical plan that can be executed on the cluster. This is done by passing the logical plan through a set of SparkStrategies to produce candidate physical plans.
After the physical plans are generated, a set of rules manipulates them and estimates the cost of each plan tree recursively, end to end. Physical optimizations such as pipelining are applied using rule-based optimization, and operations that support predicate or projection pushdown are pushed down to the data source; a small pushdown example follows. Finally, Spark uses cost-based optimization to select the best-suited physical plan to execute on the cluster for the given data source.
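As a sketch of pushdown in the physical plan, assuming a hypothetical Parquet data set at /tmp/events.parquet with a status column, explain() typically shows the filter listed under PushedFilters inside the file scan node:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-pushdown-demo")
  .master("local[*]")
  .getOrCreate()

// Hypothetical Parquet data set; the path and column name are illustrative only.
val events = spark.read.parquet("/tmp/events.parquet")

// In the physical plan, the filter on 'status' is usually pushed into the
// Parquet scan and shows up in the PushedFilters list of the scan node.
events.filter("status = 'done'").explain()
```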
4. Code Generator:
The final phase of the execution plan in Spark Catalyst optimization is code generation. To run the plan on each node of the cluster, the plan needs to be compiled into Java bytecode. In earlier versions, Spark used Scala quasiquotes to compile SQL expressions into bytecode; after version 1.5, this was replaced with the Janino compiler. Later, Apache Spark introduced whole-stage code generation, which fuses a chain of SQL operators into a single generated function. Quasiquotes allow ASTs to be constructed in Scala and handed to the Scala compiler at run time to generate bytecode.
The generated code is then compiled to JVM bytecode and executed on the cluster.
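To see whole-stage code generation in action, a sketch like the following can be used; explain("codegen") is available from Spark 3.0 onwards, and the operators fused into one generated function are marked with an asterisk in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-codegen-demo")
  .master("local[*]")
  .getOrCreate()

val query = spark.range(0, 1000).selectExpr("id * 2 AS doubled").filter("doubled > 10")

// Spark 3.0+: prints the physical plan along with the Java code generated for
// each whole-stage-codegen subtree (the operators fused into a single function).
query.explain("codegen")
```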
Finishing Touch:
These are just a few things from my own experience, and there is a lot more on the plate to learn when it comes to Spark's Catalyst. We learnt about the internal workings of Spark Catalyst and how its query optimization helps developers write better code.
Hope you enjoyed learning a new concept through my post. Leave a comment if you need any further assistance.
Happy Learning!!!