How to Define Schema in Spark | InferSchema with StructType and StructField | Using PySpark

In this chapter, we discuss how to provide or define a schema for a dataframe in PySpark. In the previous chapter, we learnt about different ways of creating a dataframe in Spark; if you have not yet gone through it, I would recommend reading it and having a hands-on before proceeding by visiting "Spark API - Dataframe". A dataframe is a table-structured object that lets the user perform SQL-like operations such as select, filter, group by and aggregate very easily.

Explanation with Demo:

Problem Statement:

Consider creating a Spark dataframe from a CSV file that does not have a header row. Since the file has no header, the Spark dataframe will be created with default column names _c0, _c1, etc., as shown in the figure below. This naming convention looks awkward and makes it difficult for developers to prepare query statements using these column names. It would be helpful if we could create the dataframe with meaningful column names.

Solution to Infer / Define Schema in PySpark:

We can apply a schema to the dataframe using the StructType clause. For better understanding, let's create a sample CSV input file as below and learn how to define a schema in Spark using StructType and StructField.

Open a fresh Jupyter notebook and create a SparkSession. If you are not so familiar with Spark API - Dataframe, then I would recommend you to go through the post on Dataframe. We have an input CSV file with no header, and by observing the data we can see that meaningful column names would be ID, NAME, EXPERTISE, ADDRESS and MOBILE.


StructType Clause:

The StructType clause is used to provide a schema to the Spark dataframe. A StructType object contains a list of StructField objects, each of which defines a column's name, datatype and a flag to indicate null-ability. We can create the schema as a struct type and merge this schema with the data that we have. To do this, we import the sql.types module, build a list with one StructField per column (giving its name, datatype and whether it is nullable), and then create the StructType from those StructFields, as shown in the below code snippet.

from pyspark.sql.types import *

data_schema=[ StructField("ID", IntegerType(), True),
              StructField("NAME", StringType(), True),
              StructField("EXPERTISE", StringType(), True),
              StructField("ADDRESS", StringType(), True),
              StructField("MOBILE", StringType(), True) ]

struct_schema = StructType(fields=data_schema)

We can see that the schema with five columns of StructType is ready and can be merged with the data.

To attach the schema to the data, pass it to the CSV reader as shown in the below code snippet.

df = spark.read.csv('input_file', schema=struct_schema)


Now, we can see that the column names in the Spark dataframe are taken from the StructType we defined for the input data.

Full Program:

Hope you learnt how to infer or define a schema for a Spark dataframe. If you still face any issues while defining a schema, leave a comment about the problem you face.

Happy Learning !!!
