How to Replace a String in Spark DataFrame | Spark Scenario Based Question

In this tutorial, we will see how to solve the problem statement and get required output as shown in the below picture. We will learn, how to replace a character or String in Spark Dataframe using both PySpark and Spark with Scala as a programming language. We use Databricks community Edition for our demo. Let us move on to the problem statement.

Problem Statement:

Consider we have a dataframe with columns as shown in the below figure (Input_DF). Our requirement is to replace the string value Checking in column called Card_type to Cash. The output that we are expected to workout is shown in the below figure for your reference.

Dataset:

Dataset can be downloaded from the given Github link. Click on the link personal_transaction.csv to download dataset. It is a simple CSV file with transaction details in it.

Solution with Demo:

We have different ways to achieve our expected output. We will see all the approach one by one to get the final required output. Before proceeding with the answer I would recommend you to give it a try on your own to solve this problem.

For more videos on Spark Scenario Based Interview Question, please do subscribe to my YouTube channel.

Solution - Using PySpark:

trans_path="dbfs:/FileStore/shared_uploads/personal_transactions.csv"

df1 = spark.read.format("csv").option("header","true").load(trans_path)

display(df1)

Out[]:

Method 1: Using na.replace

We can use na.replace to replace a string in any column of the Spark dataframe.

na_replace_df=df1.na.replace("Checking","Cash")

na_replace_df.show()

Out[]:

From the above output we can observe that the highlighted value Checking is replaced with Cash.

Method 2: Using regular expression replace

The most common method that one uses to replace a string in Spark Dataframe is by using Regular expression Regexp_replace function. The Code Snippet to achieve this, as follows.

#import the required function

from pyspark.sql.functions import regexp_replace

reg_df=df1.withColumn("card_type_rep",regexp_replace("Card_type","Checking","Cash"))

reg_df.show()

Out[]:

Method 3: Using Case When

The traditional method that fits in to solve many problem is to simply write case when condition. The Code Snippet to replace the string using case when is given below.

from pyspark.sql.functions import when,col,lit

when_df=df1.withColumn("card_type_repl",when(col("Card_type").rlike("Checking"),lit("Cash")).otherwise(col("Card_type")))

when_df.show()

Out[]:

Solution - Using Spark With Scala:

Let us see how we can approach this problem using Spark with Scala. There will be a minimal syntax change with respect to the above pyspark code for implementing using Scala. As a first step let us read the csv file that we have

val trans_path="dbfs:/FileStore/shared_uploads/personal_transactions.csv"

val df1 = spark.read.format("csv").option("header","true").load(trans_path)

df1.show()

Out[]:

Method 1: Using Regexp Replace

import org.apache.spark.sql.functions.{regexp_replace,lit}

val reg_df=df1.withColumn("card_type_repl",regexp_replace($"Card_type",lit("Checking"),lit("Cash")))

reg_df.show()

Out[]:

Method 2: Using Case When

Case When Statement is similar to the Pyspark approach that we saw above with some minor change to the syntax. The code snippet as follows.

import org.apache.spark.sql.functions.{when,lit,col}

val when_df=df1.withColumn("card_type_repl",when(col("Card_type").rlike("Checking"),lit("Cash")).otherwise(col("Card_type")))

when_df.show()

Out[]:

Method 3: Using UDF

We don't have na.replace function in Scala. We can write our own custom function to replace the character in the dataframe using native Scala functions. The code snippet for UDF is given below

val replace = udf((data: String , rep : String, withrep:String)=>data.replaceAll(rep, withrep))

val udf_df=df1.withColumn("card_type_repl",replace($"Card_type",lit("Checking"),lit("Cash"))

udf_df.show()

Out[]:

Hope you understood the concept of replacing the string in Spark Dataframe. Kindly try all this method in your own setup, let me know if you face any issues, I am happy to help you.

Happy Learning !!!

How to Replace a String in Spark DataFrame | Spark Scenario Based Question

Problem Statement:

Dataset:

Solution with Demo:

Solution - Using PySpark:

Method 1: Using na.replace

Method 2: Using regular expression replace

Method 3: Using Case When

Solution - Using Spark With Scala:

Method 1: Using Regexp Replace

Method 2: Using Case When

Method 3: Using UDF

Post a Comment

1 Comments

Featured Post

Spark Interview Question - Online Assessment Coding Test Round | Using Spark with Scala

Subscribe to LeantoSpark

Follow on FB Page

Most Viewed Post

Word Count Program Using PySpark