How to Find Duplicates in Spark | Apache Spark Window Function


In this article, we will work through a scenario-based question in Spark to understand the concept of window functions.

Problem Statement:

Consider a CSV file that contains some duplicate records, as shown in the picture. Our requirement is to find the duplicate rows in a Spark DataFrame and report them as an output DataFrame, as shown in the diagram (output DataFrame).

Solution:

We can find the duplicate rows using two methods:
  • PySpark GroupBy
  • PySpark Window Rank Function



For an explanation and a demo of the two methods above, please watch the video embedded below.



Subscribe to my YouTube channel for more Spark-related questions.

Code snippets:

Step 1:

Initialize the SparkSession and read the sample CSV file

import findspark
findspark.init()

# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Report_Duplicate").getOrCreate()

# Read the CSV file
in_df = spark.read.csv("duplicate.csv", header=True)
in_df.show()

Out[]:

Approach 1: GroupBy


in_df.groupBy("Name", "Age", "Education", "Year") \
     .count() \
     .where("count > 1") \
     .drop("count").show()

Out[]:


Approach 2: Window Ranking Function




from pyspark.sql.window import Window
from pyspark.sql.functions import col,row_number

# Create a window: partition by Name, latest Year first
win = Window.partitionBy("Name").orderBy(col("Year").desc())

in_df.withColumn("rank", row_number().over(win)) \
     .filter("rank > 1") \
     .drop("rank").dropDuplicates().show()

Out[]:



Happy Learning !!!


4 Comments

  1. Kindly let me know how to do it in spark scala

  2. The df.dropDuplicates() is enough.
    There is still a bug; try the data set below:

    Name,Age,Education,Year
    RAM,28,BE,2012
    Rakesh,53,MBA,1985
    Madhu,22,B.Com,2018
    Rakesh,53,MBA,1985
    Bill,32,ME,2007
    Madhu,22,B.Com,2018
    Rakesh,53,MBA,1985
    RAM,25,MA,2012


    Now you can see there are two RAM rows, but they are different students, not duplicates.

    Replies
    1. This is not a bug in the code. You will have to add all the required columns inside dropDuplicates(). In your scenario, adding Name and Age will help. Hope you understood this concept.

  3. Hello brother, I have learned the theory about Spark, that it includes Spark SQL and PySpark, but where can I learn it step by step? Please reply.
