How to Use QUALIFY Keyword in Databricks SQL - Real-time Example and Performance Optimization




In this blog, we will learn how to use the QUALIFY keyword, understand its real-time usage, and see its performance advantage.

QUALIFY Clause:

The QUALIFY keyword is generally available (GA) from Databricks Runtime 10.0 onwards. It is a SQL extension used exclusively to filter the results of window functions. This Spark SQL clause comes in handy when you have a complex analytical query and need a filter on top of a window function.

General Syntax:

QUALIFY <boolean_expression>

Parameter: 

boolean_expression

Any expression that evaluates to a BOOLEAN result type. Two or more expressions may be combined using the logical operators (AND, OR). The expressions specified in the QUALIFY clause cannot contain aggregate functions.

To use the QUALIFY keyword in a SQL statement, follow the syntax below:


SELECT <columns>
FROM <table>
WHERE <condition>
QUALIFY <Filter condition on top of window function>
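
For instance, here is a minimal sketch of the clause in action, assuming a hypothetical events table with user_id and event_time columns; it keeps the three most recent events per user:

-- Hypothetical "events" table: keep the 3 most recent events per user
SELECT user_id, event_time
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) <= 3;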

Video Explanation and Demo - QUALIFY Keyword:



Real-Time Example:

Find the Top Performer By Category

Consider that the sales data of a company is provided as shown below, and the problem statement for you to solve is:
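
The original post displays the data as an image; an illustrative pipe-delimited sample with the same columns (hypothetical values, matching the schema used in the queries below) would look like this:

Date|Category|Product|Sales
2023-07-02|Electronics|Laptop|1200
2023-07-02|Electronics|Phone|950
2023-07-02|Clothing|Jacket|300
2023-07-02|Clothing|Shirt|150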



Question: Find the top-selling product in each category for a specific date.

Solution:

We will explore both solutions,

  • With the QUALIFY keyword, i.e., on Databricks Runtime 10.0 and above
  • The pre-Databricks Runtime 10.0 solution

Let us start by reading the input file and creating a temporary view out of it so that we can use SQL syntax.


# Read the input file into a dataframe
df1 = (
    spark.read.format("csv")
         .option("header", "true")
         .option("delimiter", "|")
         .option("inferSchema", "true")  # ensure Sales is read as a number, not a string
         .load("sales_trans.csv")
)

df1.display()

# Create a temp view
df1.createOrReplaceTempView("sales_transaction")
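
As an optional sanity check, you can confirm that the Sales column came through as a numeric type, since the ranking below orders by it (ordering a string column would sort lexicographically):

-- Verify the view's column types; Sales should be numeric, not string
DESCRIBE sales_transaction;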

Let us look into the solutions one by one.

Before Databricks Runtime 10.0

Method 1: Using PySpark API

from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# Rank products by sales (highest first) within each category
window_spec = Window.partitionBy("Category").orderBy(df1["Sales"].desc())

# Add a rank column based on sales within each category
df_with_rank = (
    df1.filter("Date = '2023-07-02'")
       .withColumn("Rank", rank().over(window_spec))
)

# Filter to get the top-selling products in each category
top_products_df = df_with_rank.filter(df_with_rank["Rank"] == 1)

top_products_df.display()
Out[]:


Method 2: Using Spark SQL


SELECT Category, Product, Sales
FROM
(
  SELECT Category, Product, Sales,
         RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS rnk
  FROM sales_transaction
  WHERE Date = '2023-07-02'
)
WHERE rnk = 1
Out[]:


Using the QUALIFY Keyword - Databricks Runtime 10.0 and Above


SELECT Category, Product, Sales
FROM sales_transaction
WHERE Date = '2023-07-02'
QUALIFY RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) = 1;
Out[]:
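
One subtlety worth noting: RANK() returns every tied row, so if two products in a category share the top sales figure, both appear in the result. If you want exactly one row per category regardless of ties, a ROW_NUMBER() variant (a sketch on the same view) would be:

-- Ties are broken arbitrarily; exactly one row per category is returned
SELECT Category, Product, Sales
FROM sales_transaction
WHERE Date = '2023-07-02'
QUALIFY ROW_NUMBER() OVER (PARTITION BY Category ORDER BY Sales DESC) = 1;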


Performance Advantage:

With the QUALIFY keyword in Spark SQL, there can be a performance advantage compared to the approaches using the PySpark API or subqueries. By utilizing this keyword, you can filter on window-function results within a single SQL query, minimizing the need for multiple subqueries or temporary tables. This streamlined approach can improve query execution time and overall performance.
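
If you want to verify this on your own data, you can compare the physical plans of the two versions with EXPLAIN; for example:

-- Compare this plan with the EXPLAIN output of the subquery version
EXPLAIN FORMATTED
SELECT Category, Product, Sales
FROM sales_transaction
WHERE Date = '2023-07-02'
QUALIFY RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) = 1;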

Hope you all understood the syntax and usage of the QUALIFY keyword with an example. Try solving it in your own workspace for better understanding.


Happy Learning!!!
