How to Convert an RDD to a Spark DataFrame in Apache Spark



One of the most common and frequently asked questions in a Spark interview is "How do you convert an RDD to a DataFrame in Apache Spark?", often followed by "Can we convert the DataFrame back to an RDD?". In today's blog, we will learn to answer both questions with one simple example, and we will also gain some hands-on experience along the way. Hopefully you already have Spark installed on your Windows machine; if not, kindly follow the link "Setup Spark in Windows" before starting with this chapter.

RDD to DF & DataFrame to RDD:


Problem Statement:


Consider a file in text format, which we read as an RDD using the SparkContext. It is a pipe-delimited file with four columns, and its header row looks like this:

"STUDENT_ID"|"SUBJECT"|"MARKS"|"RESULT" 



Our requirement is to convert this RDD into a DataFrame. We also need to check whether we can convert the DataFrame back into an RDD. Come, let's get started!


Solution:


Convert RDD to DF:


To start with the hands-on part, open a new Jupyter notebook and establish a SparkSession and SparkContext, then read the input file as an RDD, as shown below. To view the collection of elements present in our RDD, we can use either the in_rdd.collect() or the in_rdd.take(n) command, where collect() returns the entire collection of elements and take() returns only the first 'n' records from the input collection.
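A minimal sketch of that setup, assuming a plain local installation (the application name is just illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the SparkContext comes along with it
spark = SparkSession.builder.appName("rdd_to_df_demo").getOrCreate()
sc = spark.sparkContext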

in_rdd=sc.textFile('input.txt')
in_rdd.collect()
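If the file is large, take() is the lighter-weight way to peek, since it pulls back only the first few elements instead of the whole dataset:

in_rdd.take(2)   # returns just the first two lines as plain strings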

From the output, we can notice that each entire row is represented as a single string element. So, before we convert to a DataFrame in Spark, we need to split each element into its four columns. To achieve this, we use the map function. The code snippet for your reference is given below.

input_split=in_rdd.map(lambda x: x.split('|'))
input_split.collect()

Now, one can observe that each element is split into the four required columns. The next step is to define the columns for our DataFrame. We can create a DataFrame from an RDD without defining a schema, in which case Spark will assign default column names to the DataFrame ("_1, _2, _3" and so on), as the quick illustration below shows. It is good practice to define the schema explicitly instead. To understand how we define a schema, follow the link "define schema to Spark Dataframe".
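A one-line peek at those default names (illustrative only; we define a proper schema next):

input_split.toDF().printSchema()   # columns show up as _1, _2, _3, _4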

from pyspark.sql.types import StructType, StructField, StringType
#Construct Schema using StructField and StructType
data_schema=[ StructField("STUDENT_ID", StringType(), True),
              StructField("SUBJECT", StringType(), True),
              StructField("MARKS", StringType(), True),
              StructField("RESULT", StringType(), True) ]
#Provide the Struct Fields to the Struct Type
struct_schema=StructType(fields=data_schema)


Now, the final step in the conversion is to use one of two methods, either toDF() or createDataFrame(). Let's have a look at both methods of conversion one by one.

Using toDF():


The code snippet for this method is as follows:

df_out=input_split.toDF(struct_schema)
df_out.show()


Using createDataFrame():

The code snippet for converting the input_split RDD into a DataFrame using createDataFrame() is given below:

df_out1=spark.createDataFrame(input_split,struct_schema)
df_out1.show()



Thus, we converted an RDD into a DataFrame in Apache Spark. Now let's also examine whether we can convert the DataFrame back to an RDD.

Convert DataFrame to RDD in Spark:


We might end up with a requirement that, after processing a DataFrame, the resulting DataFrame needs to be saved back as a text file, and for doing so, we need to convert the DataFrame into an RDD first. We can convert a DataFrame to an RDD in Spark using df.rdd (note that rdd is a property, not a method, so there are no parentheses). The code snippet for doing the same is as follows:

out_rdd=df_out1.rdd.map(lambda x: '|'.join(x))
out_rdd.collect()


Here, if we didn't apply the map function to join the columns, df_out1.rdd would result in an RDD of Row objects, as the quick peek below shows.
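An illustrative peek (the field values are placeholders, not real data):

df_out1.rdd.take(1)
# e.g. [Row(STUDENT_ID='...', SUBJECT='...', MARKS='...', RESULT='...')]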


To save the file as a text file, we can use rdd.saveAsTextFile(). Here, coalesce(1) merges the RDD into a single partition so that the output is written as one part file instead of several:

out_rdd.coalesce(1).saveAsTextFile("output")

Full Program:
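Putting all of the snippets above together (the session setup and application name are the same illustrative assumptions as earlier):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("rdd_to_df_demo").getOrCreate()
sc = spark.sparkContext

# Read the pipe-delimited file as an RDD and split each line into columns
in_rdd = sc.textFile('input.txt')
input_split = in_rdd.map(lambda x: x.split('|'))

# Define the schema explicitly
data_schema = [ StructField("STUDENT_ID", StringType(), True),
                StructField("SUBJECT", StringType(), True),
                StructField("MARKS", StringType(), True),
                StructField("RESULT", StringType(), True) ]
struct_schema = StructType(fields=data_schema)

# RDD -> DataFrame (either method works)
df_out = input_split.toDF(struct_schema)
df_out1 = spark.createDataFrame(input_split, struct_schema)
df_out1.show()

# DataFrame -> RDD, then save as a single text file
out_rdd = df_out1.rdd.map(lambda x: '|'.join(x))
out_rdd.coalesce(1).saveAsTextFile("output")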



Hope you will now be able to answer this question in your next Spark interview. Leave your comments below if you need any assistance with the above steps.

Happy Learning!!!
