In this chapter, we discuss how to define a schema for a dataframe in PySpark. In the previous chapter we learnt about the different ways of creating a dataframe in Spark; if you have not yet gone through it, I would recommend reading it and doing the hands-on exercises before proceeding by visiting "Spark API - Dataframe". A dataframe is a table-structured object that lets users perform SQL-like operations such as select, filter, group by, and aggregate very easily.
Explanation with Demo:
Problem Statement:
Consider creating a Spark dataframe from a CSV file that does not have a header row. Since the file has no header, the dataframe will be created with default column names such as _c0, _c1, and so on, as shown in the figure below. This naming convention looks awkward and makes it difficult for developers to write query statements against these columns. It would be helpful if we could create the dataframe with meaningful column names instead.
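As a quick sketch of the problem (assuming a SparkSession named spark already exists and 'input_file' is a placeholder path to a five-column headerless CSV), you can see the default names like this:

# Read the headerless CSV without providing any schema
df = spark.read.csv('input_file')
print(df.columns)   # ['_c0', '_c1', '_c2', '_c3', '_c4']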
Solution to Infer / Define Schema in PySpark:
We can apply a schema to the dataframe using the StructType clause. For better understanding, let's create a sample CSV input file as below and learn how to define a schema in Spark using StructType and StructField. Open a fresh Jupyter notebook and create a SparkSession; if you are not yet familiar with the Spark dataframe API, I would recommend going through the post on Dataframe. We have an input CSV file with no header, and by inspecting the data we can see that meaningful column names would be,
"ID", "NAME", "EXPERTISE", "ADDRESS", "MOBILE"
StructType Clause:
A StructType clause is used to provide a schema to a Spark dataframe. A StructType object contains a list of StructField objects, each of which defines a column's name, its datatype, and a flag indicating whether it is nullable. We can create the schema as a StructType and apply it to the data we have. To do this, import everything from pyspark.sql.types, build a list of StructField entries with each column's name, datatype, and nullability, and then create a StructType from those fields, as shown in the code snippet below.

from pyspark.sql.types import *

# One StructField per column: (name, datatype, nullable)
data_schema = [StructField("ID", IntegerType(), True),
               StructField("NAME", StringType(), True),
               StructField("EXPERTISE", StringType(), True),
               StructField("ADDRESS", StringType(), True),
               StructField("MOBILE", StringType(), True)]

# Wrap the list of fields into a StructType schema
struct_schema = StructType(fields=data_schema)
print(struct_schema)
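The exact formatting of the printed schema depends on your Spark version; in recent releases it looks similar to:

StructType([StructField('ID', IntegerType(), True), StructField('NAME', StringType(), True), StructField('EXPERTISE', StringType(), True), StructField('ADDRESS', StringType(), True), StructField('MOBILE', StringType(), True)])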
We can see that the StructType schema with five columns is ready and can be applied to the data.
To apply the schema to the data, use the code snippet below.
df = spark.read.csv('input_file', schema=struct_schema)
df.show(truncate=0)
Output:
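With the illustrative sample records shown earlier, df.show(truncate=0) would print output resembling:

+--+-----+---------+---------+----------+
|ID|NAME |EXPERTISE|ADDRESS  |MOBILE    |
+--+-----+---------+---------+----------+
|1 |Arun |Spark    |Chennai  |9876543210|
|2 |Priya|Python   |Bengaluru|9123456789|
+--+-----+---------+---------+----------+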
Now we can see that the column names of the Spark dataframe are taken from the StructType instead of the default _c0, _c1 names.
Full Program:
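Putting the snippets together, the complete script might look like this ('input_file' remains a placeholder path, and the app name is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a SparkSession
spark = SparkSession.builder.appName('SchemaDemo').getOrCreate()

# Define the schema as a list of StructField objects: (name, datatype, nullable)
data_schema = [StructField("ID", IntegerType(), True),
               StructField("NAME", StringType(), True),
               StructField("EXPERTISE", StringType(), True),
               StructField("ADDRESS", StringType(), True),
               StructField("MOBILE", StringType(), True)]

# Wrap the fields in a StructType
struct_schema = StructType(fields=data_schema)

# Read the headerless CSV with the schema applied
df = spark.read.csv('input_file', schema=struct_schema)
df.show(truncate=0)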
Hope you learnt how to infer or define a schema for a Spark dataframe. If you still face any issues while defining a schema, leave a comment describing the problem.
Happy Learning !!!