How to Filter Data in Apache Spark | Spark Dataframe Filter using PySpark



In this article, we will learn how to filter data in Apache Spark. Filtering is a simple transformation, and we will demonstrate how to apply a filter on a Spark dataframe, along with a hands-on demo video of the code execution to understand the concept. We will use a local setup of Spark with Jupyter Notebook. You can also use Databricks Community Edition to practice this kind of problem statement, as it gives you the feel of a cluster setup. To build your own cluster, create a free account with Databricks Community Edition by following the link to set up Databricks for free.



Coding and Execution - Demo Video:

Have a look at the video prepared for this topic. Please share your feedback in the comments box; it encourages me to make more videos in the future. Also, if you like my content, subscribe to my YouTube channel.




Problem Statement:

Consider that we have students' academic data as a CSV file, and we need to filter the records based on the MARKS_OBTAINED column. The sample CSV file for our demo is as shown below.


Problem: filter and fetch the records of students whose MARKS_OBTAINED is greater than or equal to 35.

Apply Filter using PySpark:

Filtering data from a dataframe is a basic transformation in Apache Spark, and it can be applied in different ways:
  • A filter can be applied directly on a Spark dataframe using the filter() API.
  • We can use the where() function on a Spark dataframe.
  • We can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply the filter.
We can also combine multiple conditions with and/or to apply filter conditions, as shown in the sketch below.
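As a minimal sketch of combining conditions (assuming df is the dataframe read in the snippets below, and the upper bound of 75 is only an illustrative value):

#Combine conditions inside the filter string with and/or
df_range=df.filter("MARKS_OBTAINED >= 35 and MARKS_OBTAINED <= 75")

#The same filter written with column expressions and the & operator
from pyspark.sql.functions import col
df_range=df.filter((col("MARKS_OBTAINED") >= 35) & (col("MARKS_OBTAINED") <= 75))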

Syntax:

           df.filter("Filter condition")

Let us look into these approaches one by one. Open a new Jupyter notebook and initiate the Spark Session and Spark Context.
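A minimal sketch of that setup is shown below; the application name is arbitrary, and local[*] simply uses all local cores:

from pyspark.sql import SparkSession

#Create (or reuse) a Spark session running in local mode
spark=SparkSession.builder.appName('filter-demo').master('local[*]').getOrCreate()

#The Spark context is available from the session
sc=spark.sparkContext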

Coding:

Method 1: Using Temp view

Step 1: Read the input file as a dataframe.
Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a SQL query and assign the output to a dataframe as shown below.

Snippet:

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Create temporary view on the dataframe
df.createOrReplaceTempView('filter_view')

#Write a SQL query to filter and fetch records
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
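To check the result, you can display the filtered dataframe; show() prints the first 20 rows by default:

#Display the filtered records
df_filter.show()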


Method 2: Using filter() or where()

Step 1: Read the given raw file and create a Spark dataframe from it.
Step 2: Use the filter() or where() API on top of the Spark dataframe. Refer to the filter syntax mentioned in the section above.

The code snippet will be as follows,

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Apply the filter API with the filter condition
df_filter=df.filter("marks_obtained >= 35")
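As a small sketch of the alternatives, where() is an alias of filter(), and the condition can also be written as a column expression; both lines below assume the same df:

from pyspark.sql.functions import col

#where() is an alias of filter(), so this is equivalent
df_filter=df.where("marks_obtained >= 35")

#The same condition written as a column expression
df_filter=df.filter(col("marks_obtained") >= 35)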


Output:

The filtered output is as shown below. I would recommend you try this on your own to get a better understanding of the concept.

Out[]: (records of students with MARKS_OBTAINED >= 35)

Full program:
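Putting the pieces together, here is a minimal end-to-end sketch, assuming the same pipe-delimited input.csv used in the snippets above (the app name is arbitrary):

from pyspark.sql import SparkSession

#Initiate the Spark session in local mode
spark=SparkSession.builder.appName('filter-demo').master('local[*]').getOrCreate()

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Method 1: register a temp view and filter with SQL
df.createOrReplaceTempView('filter_view')
df_filter_sql=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")

#Method 2: filter directly on the dataframe
df_filter_api=df.filter("MARKS_OBTAINED >= 35")

#Display the filtered records
df_filter_sql.show()
df_filter_api.show()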




Hope you enjoyed learning from this article. If you have any doubts or comments, please drop them in the comment box below.

Happy Learning!!!
