InferSchema Vs StructType in Spark | How to Define Schema in Apache Spark




In this tutorial, we will learn a simple Spark optimization trick that can improve the performance and runtime of your Spark jobs. We will discuss the best approach to define a schema in production in order to get the best efficiency out of it, with a simple demo using PySpark. We will use Databricks Community Edition for the demo; if you haven't created an account yet, please follow the link Setup Databricks for Free and keep the setup ready.

Problem Statement:

Some of the most commonly asked Spark interview questions are:
  • What performance tuning have you tried? or
  • What Spark optimization techniques have you used in your project? or
  • How does the inferSchema option work internally in Spark?

Our problem statement for today is: what is the optimal approach to define a schema for a Spark DataFrame?
  • Using the inferSchema option while loading the CSV file (or)
  • Defining the schema using StructType and passing it while reading the CSV file

Video Explanation with Answer:

The video below walks you through the answer.




Please subscribe to my YT channel, and like and share this video if you find it useful.

Solution:

In this demo, I will be using US Census population data. The data can be downloaded from the given link Download US Data for your practice.


We will develop a code snippet using both methods and see which one is more efficient and improves the performance of our Spark code.

Method 1 - Code Snippet Using InferSchema:

# location of the CSV file in DBFS
file_location = "/FileStore/tables/uspopulation.csv"

# read the pipe-delimited CSV with a header row and let Spark infer the datatypes
df = spark.read.option("header", True) \
  .option("delimiter", "|") \
  .option("inferSchema", True) \
  .csv(file_location)

Here, we can observe that the above command triggers a Spark job that reads the entire CSV file in order to determine a matching datatype for each column. From the image below, we can see that a Spark job gets triggered, and the DAG visualization also shows that inferSchema reads the entire dataset to derive the column datatypes. This code snippet took around 1.75 seconds to complete.
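
To double-check what inferSchema came up with, you can print the resulting schema; the datatypes it infers should line up with the ones we define manually in Method 2 below.

# inspect the datatypes that inferSchema assigned to each column
df.printSchema()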



Method 2 - Code Snippet Using StructType:

# import the datatypes used to build the schema
from pyspark.sql.types import *

# predefined schema using StructType
sch = StructType([
  StructField("2019_rank", IntegerType(), True),
  StructField("City", StringType(), True),
  StructField("State_Code", StringType(), True),
  StructField("2019_estimate", IntegerType(), True),
  StructField("2010_Census", IntegerType(), True),
  StructField("Change", StringType(), True),
])

# read the data with the predefined schema
df = spark.read \
  .option("header", True) \
  .option("delimiter", "|") \
  .schema(sch) \
  .csv(file_location)

The result of the above code is shown in the diagram below. We can see from the figure that no Spark job gets triggered. This is because the predefined schema lets Spark determine the columns and datatypes without reading through the entire file, which improves the performance of your Spark code when you are dealing with a huge volume of data. Also, the code snippet with the StructType schema took just 0.45 seconds to complete, which is far better than inferSchema.
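
As a side note, instead of building a StructType, the schema() method also accepts a DDL-formatted string, which can be more concise. A minimal sketch for the same columns (names starting with a digit are wrapped in backticks):

# equivalent read using a DDL-formatted schema string
df = spark.read \
  .option("header", True) \
  .option("delimiter", "|") \
  .schema("`2019_rank` INT, City STRING, State_Code STRING, "
          "`2019_estimate` INT, `2010_Census` INT, Change STRING") \
  .csv(file_location)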


Summary:

From the above results, we can conclude that using StructType to define the schema while reading a CSV file is the more efficient way to improve Spark performance. I would recommend using the same approach in production to make your Spark code run faster.
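
If writing the StructType by hand feels tedious, for example on very wide tables, one common pattern is to run inferSchema once on a small sample file, save the schema it produces, and reuse that schema for every subsequent read. A minimal sketch, assuming a hypothetical sample file at /FileStore/tables/uspopulation_sample.csv:

import json
from pyspark.sql.types import StructType

# one-time step: infer the schema from a small sample file (hypothetical path)
sample_df = spark.read.option("header", True) \
  .option("delimiter", "|") \
  .option("inferSchema", True) \
  .csv("/FileStore/tables/uspopulation_sample.csv")

# serialize the inferred schema so it can be stored alongside your code
schema_json = sample_df.schema.json()

# every subsequent read: rebuild the schema and skip inferSchema entirely
sch = StructType.fromJson(json.loads(schema_json))
df = spark.read.option("header", True) \
  .option("delimiter", "|") \
  .schema(sch) \
  .csv(file_location)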

Hope you understood the concept. Try this code snippet and let me know if you face any hurdles in the comment section below.

Happy Learning!!!
