How to Filter Data in Apache Spark | Spark Dataframe Filter using PySpark



In this article, we will learn how to filter data in Apache Spark. Filtering is a simple transformation, and we will demonstrate how to apply a filter on a Spark dataframe, along with a hands-on demo video of the code execution to understand the concept. We will use a local setup of Spark with Jupyter Notebook. You can also use Databricks Community Edition to practice this kind of problem statement, as it gives you the feel of a cluster setup. To build your own cluster, create a free account with Databricks Community Edition by following the link to set up Databricks for free.



Coding and Execution - Demo Video:

Have a look at the video prepared for this topic. Please share your feedback in the comments box; it encourages me to make more videos in the future. Also, if you like my content, subscribe to my YouTube channel.




Problem Statement:

Consider that we have students' academic data as a CSV file, and we need to filter the records based on the MARKS_OBTAINED column. The sample CSV file for our demo is as shown below.


Problem: filter and fetch the records of students whose MARKS_OBTAINED is greater than or equal to 35.

Apply Filter using PySpark:

Filtering data from a dataframe is a basic transformation in Apache Spark, and it can be applied in different ways:
  • A filter can be applied directly on a Spark dataframe using the filter() API.
  • We can use the where() function on a Spark dataframe.
  • We can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply the filter.
We can also combine multiple conditions with and/or to apply filter conditions, as shown in the sketch below.
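As a minimal sketch of combining conditions (assuming df is the dataframe read in the snippets below, and the upper bound of 75 is only an illustrative value):

#Combine conditions inside the filter string with and/or
df_range=df.filter("MARKS_OBTAINED >= 35 and MARKS_OBTAINED <= 75")

#The same filter written with column expressions and the & operator
from pyspark.sql.functions import col
df_range=df.filter((col("MARKS_OBTAINED") >= 35) & (col("MARKS_OBTAINED") <= 75))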

Syntax:

           df.filter("Filter condition")

Let us look into these approaches one by one. Open a new Jupyter notebook and initiate the Spark Session and Spark Context.
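A minimal sketch of that setup is shown below; the application name is arbitrary, and local[*] simply uses all local cores:

from pyspark.sql import SparkSession

#Create (or reuse) a Spark session running in local mode
spark=SparkSession.builder.appName('filter-demo').master('local[*]').getOrCreate()

#The Spark context is available from the session
sc=spark.sparkContext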

Coding:

Method 1: Using Temp view

Step 1: Read the input file as a dataframe.
Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a SQL query and assign the output to a dataframe as shown below.

Snippet:

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Create temporary view on the dataframe
df.createOrReplaceTempView('filter_view')

#Write a SQL query to filter and fetch records
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
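To check the result, you can display the filtered dataframe; show() prints the first 20 rows by default:

#Display the filtered records
df_filter.show()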


Method 2: Using filter() or where()

Step 1: Read the given raw file and create a Spark dataframe from it.
Step 2: Use the filter() or where() API on top of the Spark dataframe. Refer to the filter syntax mentioned in the section above.

The code snippet will be as follows,

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Apply the filter API with the filter condition
df_filter=df.filter("marks_obtained >= 35")
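As a small sketch of the alternatives, where() is an alias of filter(), and the condition can also be written as a column expression; both lines below assume the same df:

from pyspark.sql.functions import col

#where() is an alias of filter(), so this is equivalent
df_filter=df.where("marks_obtained >= 35")

#The same condition written as a column expression
df_filter=df.filter(col("marks_obtained") >= 35)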


Output:

The filtered output is as shown below. I would recommend you try this on your own to get a better understanding of the concept.

Out[]: (records of students with MARKS_OBTAINED >= 35)

Full program:
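Putting the pieces together, here is a minimal end-to-end sketch, assuming the same pipe-delimited input.csv used in the snippets above (the app name is arbitrary):

from pyspark.sql import SparkSession

#Initiate the Spark session in local mode
spark=SparkSession.builder.appName('filter-demo').master('local[*]').getOrCreate()

#Read the pipe-delimited input file into a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Method 1: register a temp view and filter with SQL
df.createOrReplaceTempView('filter_view')
df_filter_sql=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")

#Method 2: filter directly on the dataframe
df_filter_api=df.filter("MARKS_OBTAINED >= 35")

#Display the filtered records
df_filter_sql.show()
df_filter_api.show()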




Hope you enjoyed learning from this article. If you have any doubts or comments, please drop them in the comment box below.

Happy Learning!!!
