How to Create Spark Dataframe Using PySpark | Apache Spark Tutorial


What is Spark SQL DataFrame?

Our previous chapter was fully focused on what an RDD is, its features and limitations, and included some hands-on coding to create a Spark RDD using PySpark. If you missed it, I would recommend going through it by clicking the link here, Spark API - RDD. In this chapter, we will learn about another Spark SQL API, the DataFrame. A Spark DataFrame is a distributed collection of data organized into named columns, resembling a SQL table, which helps Spark application developers perform SQL-style operations easily. It is well suited for developers who would rather not learn coding in Scala or Python but still have a strong SQL skill set.

Create Spark Dataframe Using PySpark:

We can construct DataFrames in Spark in several ways, as listed below:
  • from a structured file format, 
  • from an existing RDD, 
  • from a distributed file system, 
  • from Hive tables, or by reading from an external database

Let us get some hands-on experience creating a DataFrame from an existing RDD and from a structured file of type CSV.

From Structured File:

Let's assume we have a CSV file with five columns in it, as shown in the figure below, and see how to read the file as a DataFrame using PySpark.


The first step is to open a fresh Jupyter notebook. [To install Spark on a Windows machine, follow the steps here: Install Spark in Windows]. If you need some basic hands-on experience with Spark through a word count program, follow Word count program in pyspark.

Program:

1. Create the entry points, a SparkSession and a sparkContext. Refer to the code snippet below.

# Spark-Session creation
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local")\
                                   .appName('createDF').getOrCreate()

# Spark-Context creation 
sc=spark.sparkContext


2. Read the input file using the spark.read command.

#Read the input file
df=spark.read.csv('input_file')
type(df)

Here, we notice that type(df) returns pyspark.sql.dataframe.DataFrame, i.e. df is a Spark DataFrame.

3. Show a sample of the DataFrame's contents.
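
A minimal call to display a few sample rows might look like this (the exact output depends on your file):

#Display the first few rows of the dataframe in table format
df.show(5)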

We are able to display the data in table format, and since we didn't provide any schema while reading the file as a DataFrame, the column names are automatically assigned as _c0 ... _c4. If the CSV file has a header row in it, we can pick up the column names by setting the header attribute while reading the file, as shown below.

#Read the input file
df=spark.read.csv('input_file', header=True)
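
To verify that the header row supplied the column names, you can print the schema, for example:

# Print the schema; the column names should now come from the header row
df.printSchema()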

Alternatively, one can define their own schema for the DataFrame as a struct type (or a case class in Scala), or use the inferSchema attribute to let Spark infer the types while reading. We will look at this step in detail in a separate blog, but a brief sketch follows below.
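
As a quick preview, a sketch of both approaches might look like the following; the column names and types here are only placeholders, since they depend on your actual file:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Placeholder schema for a five-column file; adjust names and types to your data
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("dept", StringType(), True),
    StructField("city", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df_schema = spark.read.csv('input_file', header=True, schema=schema)

# Or let Spark guess the column types from the data with inferSchema
df_infer = spark.read.csv('input_file', header=True, inferSchema=True)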

From Existing RDD:

The next hands-on step for today is creating a DataFrame from an existing RDD. If you are a beginner and new to RDDs, I would strongly recommend you read how to create RDD before moving on to the next step. Let's move on to the program. Assume we have a text file with a few sentences in it, as shown below.


Program:

1. Read the text file using the sparkContext to create an RDD.

# Read the input file
rdd2=sc.textFile('inputfile.txt')
# Print the type of rdd2
type(rdd2)

We can observe that rdd2 is of type RDD.

2. To convert the RDD to a DataFrame, you can either use the toDF() method or the createDataFrame() method, as shown in the snippets below.

from pyspark.sql.types import StringType
df2=spark.createDataFrame(rdd2, StringType())

Using toDF(),

Code Snippet:

from pyspark.sql import Row
row = Row("col_1")
df1= rdd2.map(row).toDF()

Here, you can observe that we import the StringType and Row classes to achieve our result. The reason is that SparkSession.createDataFrame, which is used under the hood, requires the RDD to contain elements of type Row, tuple, list, dict or pandas.DataFrame, unless a schema with a DataType is provided. Alternatively, we can use the map function to apply a transformation that wraps each element in a Row; this converts the RDD into a PipelinedRDD and lets us convert it to a DataFrame easily with toDF().
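
To see this for yourself, you can check the type of the mapped RDD, something like:

# map() wraps each line in a Row and yields a PipelinedRDD,
# which toDF() can then convert to a DataFrame
mapped_rdd = rdd2.map(row)
type(mapped_rdd)    # pyspark.rdd.PipelinedRDD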

3. Finally, to display the full content of the DataFrame, use df.show() with the truncate keyword, as shown in the snippet below.
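
For example, for the DataFrames created above:

# Show the full column content without truncating long strings
df1.show(truncate=False)
df2.show(truncate=False)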

Full Program:
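
Putting the pieces together, a consolidated sketch of the steps covered above looks like this (the file names are the same placeholders used earlier):

# Full program sketch: create DataFrames from a CSV file and from an RDD
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StringType

# Entry points
spark = SparkSession.builder.master("local")\
                    .appName('createDF').getOrCreate()
sc = spark.sparkContext

# DataFrame from a structured CSV file
df = spark.read.csv('input_file', header=True)
df.show()

# DataFrame from an existing RDD
rdd2 = sc.textFile('inputfile.txt')
df2 = spark.createDataFrame(rdd2, StringType())
row = Row("col_1")
df1 = rdd2.map(row).toDF()
df1.show(truncate=False)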


We created DataFrames successfully using an RDD and a sample CSV file. In our upcoming chapter, we will learn in detail about creating a DataFrame by inferring the schema with a case class or a struct type.

Leave your comments if you have any doubts about any of the above steps.

Happy Learning!!!
