How to Read MultiLine JSON in Apache Spark | Handle JSON File Format using PySpark

How to Read and Write JSON File Format using PySpark:



Today in this chapter, we are going to answer a frequently asked interview question on Apache Spark. The question we are taking up today is how to read a JSON file in Spark and how to handle nested data in JSON using PySpark. Hope you have already made the Spark setup on your Windows machine; if not yet configured, go through the link Install Spark on Windows and get the setup ready before moving forward.



Problem Statement:


Transform the multiline JSON file into a readable Spark DataFrame as shown in the diagram.


JSON File Format:


JSON, which stands for JavaScript Object Notation, is a semi-structured file format consisting of data in the form of key-value pairs and array data types. It is a lightweight data interchange format that is human-readable. Data that we extract from a mobile application or by scraping a website is often in JSON format, as it makes it very easy for systems to interact with web service APIs and also provides long-term storage. After scraping the website or mobile application, we need to derive insights from the data, extract the relevant information, and transform it into useful data for our analytics. But in real-world scenarios, developers often face difficulties in handling JSON data; with the release of Apache Spark 1.1, Spark SQL supports the JSON file format for doing exactly this task.


Input File:


If you are a beginner and have newly made your Spark setup, then before going further try the word count program in Spark by clicking here: Word Count in Spark. Let us create a sample input file in JSON format as given in the problem statement and place it in the file system. We can see two records in the file with two attributes each, "Name" and "Education". We can also observe that the value of the "Education" attribute is further nested into "Qualification" and "Year", as shown in the sample below.
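For reference, a sample input.json consistent with that description might look like this (the names, qualifications, and years below are only placeholder values):

[
  {
    "Name": "Alex",
    "Education": [
      {"Qualification": "B.Tech", "Year": 2015},
      {"Qualification": "M.Tech", "Year": 2017}
    ]
  },
  {
    "Name": "Priya",
    "Education": [
      {"Qualification": "B.Sc", "Year": 2016}
    ]
  }
]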



Come, let's get started with our hands-on.


How to Load JSON File using PySpark:


We can read a JSON file in PySpark using spark.read.json(filepath). Sample code to read JSON by parallelizing the data is given below.
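A minimal sketch of that approach, assuming a SparkSession named spark and a single-line sample record (both are assumptions, not the original snippet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSON").getOrCreate()

# each element of the list is one complete JSON record on a single line
data = ['{"Name": "Alex", "Education": [{"Qualification": "B.Tech", "Year": 2015}]}']

# spark.read.json also accepts an RDD of JSON strings
df = spark.read.json(spark.sparkContext.parallelize(data))
df.show()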

PySpark _corrupt_record: If the records in the input file are each on a single line, as shown above, then spark.read.json will give us the expected output. If a single record spans multiple lines, then the same command will show a "_corrupt_record" column instead.
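For example, reading the multiline input.json without the multiLine option produces a schema like the one sketched below (a hedged illustration, not the exact console output):

# reading a multiline JSON file without multiLine=True
bad_df = spark.read.json('input.json')
bad_df.printSchema()
# root
#  |-- _corrupt_record: string (nullable = true)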



To overcome this corrupt record issue, we need to set the multiLine parameter to True while reading the JSON file. The code snippet to do so is as follows.

# read multiline JSON file
input_df = spark.read.json('input.json', multiLine=True)

Out[]:


We can see the schema of a DataFrame in Spark using the function printSchema(). Here we can notice that the column "Education" is of type array and has a nested group named element, which is of type struct.
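A quick sketch of the call and the kind of schema tree it prints for this file (exact field order and nullability flags may vary):

input_df.printSchema()
# root
#  |-- Education: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- Qualification: string (nullable = true)
#  |    |    |-- Year: long (nullable = true)
#  |-- Name: string (nullable = true)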


Explode Array Column in Spark SQL DF:


Our next step is to flatten the array column, i.e. the Education column. We explode this column using the Spark SQL DataFrame API and the SQL function explode().

Code snippet:

from pyspark.sql.functions import explode

# explode the Education array into one row per element
flat = input_df.select('name', explode('Education').alias('education_flat'))

Out[]:


Now, we can observe that the Education column is flattened as education_flat, with three records in the DataFrame. By checking the schema of the Spark DataFrame, one can observe that the education_flat column is of type struct.
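A sketch of checking that schema (field casing follows the source data and the alias used above):

flat.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- education_flat: struct (nullable = true)
#  |    |-- Qualification: string (nullable = true)
#  |    |-- Year: long (nullable = true)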


Flatten Struct Columns in Spark SQL Dataframe:


Our final task is to convert the struct column into two different columns, Qualification and Year. For doing this we are again using Spark SQL, and the code snippet to achieve this is as follows.

Code Snippet:

# flattening data
out_df = flat.select('name', 'education_flat.Qualification', 'education_flat.year')

Out[]:

Thus we achieved our goal of loading a JSON file as a structured DataFrame in PySpark. From this structured DataFrame we can easily execute our SQL operations or transformations such as map, flatMap, and filter.

Full Program:
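A minimal end-to-end sketch combining the steps above (the SparkSession app name is an assumption; the input file name follows the earlier snippet):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("MultiLineJSON").getOrCreate()

# read the multiline JSON file
input_df = spark.read.json('input.json', multiLine=True)
input_df.printSchema()

# explode the Education array into one row per element
flat = input_df.select('name', explode('Education').alias('education_flat'))

# flatten the struct into separate columns
out_df = flat.select('name', 'education_flat.Qualification', 'education_flat.year')
out_df.show()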




In this way, we can read the JSON file format, process the data, and write the result back as JSON to the target location. Hope you enjoyed learning this session, and if you have any queries regarding the above topic, let me know by commenting below.
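For the writing part, a hedged one-liner using the DataFrameWriter (the output path below is just a placeholder):

# write the flattened dataframe back as JSON to a target location
out_df.write.mode('overwrite').json('output/flattened_json')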

Happy Learning!!!
