How to Define Schema in Spark | InferSchema with StructType and StructField | Using PySpark



In this chapter, we discuss how to provide or define a schema for a dataframe in PySpark. In the previous chapter we learnt about the different ways of creating a dataframe in Spark; if you have not yet gone through it, I would recommend reading it and trying it hands-on first by visiting "Spark API - Dataframe". A dataframe is a table-structured object that lets users perform SQL-like operations such as select, filter, group by, aggregate etc. very easily.



Explanation with Demo:



Problem Statement:


Consider creating a Spark dataframe from a CSV file that does not have a header row. Since the file has no header, the dataframe will be created with default column names such as _c0, _c1 etc., as shown in the figure below. This naming convention looks awkward, and it is difficult for developers to write query statements using these column names. It would be helpful if we could create the dataframe with meaningful column names.

Solution to Infer / Define Schema in PySpark:

We can apply a schema to the dataframe using the StructType clause. For better understanding, let's create a sample CSV input file as below and learn how to define a schema in Spark using StructType and StructField.


Open a fresh Jupyter notebook and create a SparkSession. If you are not so familiar with the Spark Dataframe API, then I would recommend you go through the post on Dataframe first. We have an input CSV file with no header, and by observing the data we can see that meaningful column names would be,

"ID", "NAME", "EXPERTISE", "ADDRESS", "MOBILE"



StructType Clause:

The StructType clause is used to provide a schema to the Spark dataframe. A StructType object contains a list of StructField objects, each of which defines a column's name, its datatype, and a flag to indicate nullability. We can create the schema as a StructType and merge it with the data that we have. To do this, import everything from pyspark.sql.types, build a list of StructField entries with each column's name and datatype, also providing whether the column is nullable or not, and then create a StructType from those StructFields as shown in the code snippet below.

from pyspark.sql.types import *

# Each StructField takes the column name, its datatype,
# and a flag indicating whether the column is nullable
data_schema = [StructField("ID", IntegerType(), True),
               StructField("NAME", StringType(), True),
               StructField("EXPERTISE", StringType(), True),
               StructField("ADDRESS", StringType(), True),
               StructField("MOBILE", StringType(), True)]
struct_schema = StructType(fields=data_schema)
print(struct_schema)


We can see that the schema with five columns of StructType is ready and can be merged with the data.


To apply the schema to the data, follow the code snippet below.

df=spark.read.csv('input_file', schema=struct_schema)
df.show(truncate=0)

Output:

Now we can see that the column names for the input data in the Spark dataframe are taken from the StructType we defined.


Full Program:



Hope you have learnt how to infer or define a schema for a Spark dataframe. If you still face any issues while defining a schema, leave a comment describing the problem you face.

Happy Learning !!!
