How to Find Duplicates in Spark | Apache Spark Window Function


In this article, we will work through a scenario-based question in Spark to understand the concept of window functions.

Problem Statement:

Consider a CSV file that contains some duplicate records, as shown in the picture. Our requirement is to find the duplicate rows in a Spark DataFrame and report them as an output DataFrame, as shown in the diagram (output DataFrame).

Solution:

We can find the duplicate rows using two methods:
  • PySpark GroupBy
  • PySpark Window Rank Function



For an explanation and a demo of the two methods above, please watch the video embedded below.



Subscribe to my YouTube channel for more Spark-related questions.

Code snippets:

Step 1:

Initialize the SparkSession and read the sample CSV file

import findspark
findspark.init()

# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Report_Duplicate").getOrCreate()

# Read the CSV file
in_df = spark.read.csv("duplicate.csv", header=True)
in_df.show()

Out[]:

Approach 1: GroupBy


in_df.groupBy("Name", "Age", "Education", "Year") \
     .count() \
     .where("count > 1") \
     .drop("count").show()

Out[]:


Approach 2: Window Ranking Function




from pyspark.sql.window import Window
from pyspark.sql.functions import col,row_number

# Create a window: partition by Name, latest Year first
win = Window.partitionBy("Name").orderBy(col("Year").desc())

in_df.withColumn("rank", row_number().over(win)) \
     .filter("rank > 1") \
     .drop("rank").dropDuplicates().show()

Out[]:



Happy Learning !!!


4 Comments

  1. Kindly let me know how to do it in spark scala

  2. The df.dropDuplicates() is enough.
    There is still a bug; try the data set below:

    Name,Age,Education,Year
    RAM,28,BE,2012
    Rakesh,53,MBA,1985
    Madhu,22,B.Com,2018
    Rakesh,53,MBA,1985
    Bill,32,ME,2007
    Madhu,22,B.Com,2018
    Rakesh,53,MBA,1985
    RAM,25,MA,2012


    Now you can see there are two RAM rows, but they are different students, not duplicates.

    Replies
    1. This is not a bug in the code. You will have to add all the required columns inside dropDuplicates(). In your scenario, adding Name and Age will help. Hope you understood this concept.

  3. Hello brother, I have learned the theory about Spark, that it includes Spark SQL and PySpark, but where can I learn it step by step? Please reply.
