In this tutorial, we will look at a scenario-based problem that comes up in real-world Spark implementations and learn how to solve it using PySpark. Please share your inputs and comments in the comment box below. Let us start with our problem statement.
Problem Statement:
Consider an input Spark DataFrame, as shown in the above figure, which is derived from a nested JSON file. You can download the sample dataset from this link: sample dataset. We can observe that there is a duplicate column named name; our requirement is to rename one of the duplicate (ambiguous) columns in the DataFrame. The output with the renamed column is shown in the above figure.
Before scrolling down to check the answer, I recommend trying to solve it on your own. Post your answers in the comment box below.
Impact of Ambiguous Columns in Spark
An ambiguous column in a Spark DataFrame blocks any transformation that refers to it by name: Spark cannot tell which of the duplicate columns is meant, so it throws the error message shown in the below figure.
Solution:
Renaming one of the ambiguous columns to a different name sorts out this issue. But in Spark, we don't have a direct method to handle this use case, so we make use of df.columns to get the count and index of each duplicated column name and then rename the duplicate column in the Spark DataFrame.
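As a rough illustration of the "count and index" idea, the sketch below works on a plain Python list of column names, which is what df.columns returns. The column names and the helper name find_duplicate_columns are made up for this example; they are not taken from the sample dataset or from the Spark API.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return {name: [indexes]} for every column name that occurs more than once."""
    counts = Counter(columns)
    duplicates = {}
    for index, name in enumerate(columns):
        if counts[name] > 1:
            duplicates.setdefault(name, []).append(index)
    return duplicates


# Hypothetical column list, standing in for df.columns
example_columns = ["id", "name", "address", "name"]
duplicate_positions = find_duplicate_columns(example_columns)
print(duplicate_positions)  # {'name': [1, 3]}
```

Knowing the positions matters because the duplicated columns cannot be told apart by name, only by where they sit in the column list.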
Code Snippets to Solve the Ambiguity:
Step1:
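# Create (or reuse) a Spark session; notebooks usually provide `spark` already
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the input JSON file and flatten the data to replicate the use case
df = spark.read.json('input1.json', multiLine=True)
df1 = df.select("*", "Delivery.*").drop("Delivery")
df1.show()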
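To see why the flattening step produces a duplicate, here is a small pure-Python sketch of how the column list ends up after select("*", "Delivery.*").drop("Delivery"). Only name and Delivery come from this tutorial; the other field names and the helper flattened_column_names are assumptions, since the actual schema of input1.json is not shown here.

```python
def flattened_column_names(top_level_fields, struct_name, struct_fields):
    """Mimic the column names left after select("*", "<struct>.*").drop("<struct>")."""
    # "*" keeps every top-level column; "<struct>.*" appends the struct's own fields
    selected = list(top_level_fields) + list(struct_fields)
    # drop("<struct>") removes the original nested column
    return [name for name in selected if name != struct_name]


# Assumed schema: a top-level `name` plus a `name` nested inside `Delivery`
top_level = ["id", "name", "Delivery"]
delivery_fields = ["name", "address"]
columns = flattened_column_names(top_level, "Delivery", delivery_fields)
print(columns)  # ['id', 'name', 'name', 'address']
```

The nested field loses its Delivery prefix when it is promoted, so it collides with the top-level column of the same name.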
Step2:
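# Collect the index of the first occurrence of each column name that appears twice
lst = []
df_cols = df1.columns
for i in df_cols:
    if df_cols.count(i) == 2:
        ind = df_cols.index(i)
        lst.append(ind)
# Remove repeated indexes, since each duplicated name is visited twice
lst1 = list(set(lst))
# Add a suffix to the first occurrence so that every column name is unique
for i in lst1:
    df_cols[i] = df_cols[i] + '_0'
# Rebuild the DataFrame with the updated column names
df1 = df1.toDF(*df_cols)
df1.show()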
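The loop in Step2 assumes that a name appears exactly twice and renames only its first occurrence. If a name can repeat more often, one possible generalisation is sketched below. It behaves differently from Step2: every occurrence of a repeated name gets a numbered suffix. The helper name make_unique is hypothetical, not a built-in Spark function, and the sketch does not guard against a suffixed name clashing with a column that already exists.

```python
from collections import Counter


def make_unique(columns):
    """Suffix every occurrence of a repeated name with _0, _1, ... in column order."""
    totals = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how many occurrences have been renamed so far
    unique = []
    for name in columns:
        if totals[name] > 1:
            unique.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            unique.append(name)
    return unique


# Hypothetical column list with a name that occurs three times
print(make_unique(["id", "name", "name", "city", "name"]))
# ['id', 'name_0', 'name_1', 'city', 'name_2']
```

With a DataFrame, the result would be applied in the same way as in Step2: df1 = df1.toDF(*make_unique(df1.columns)).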
Hope you guys learnt a technique to rename a duplicated column in a Spark DataFrame. I would recommend you try this on your own machine to understand the concept better. Let me know if you face any hurdles while executing it; I will be happy to help you.
Happy learning !!!
1 Comment
Hi azar,
What if there are two name columns inside Delivery? When I convert it to a DataFrame, one of the columns shows null values throughout.