ORC File Format | Spark Performance Tuning

Spark Performance Tuning | ORC File Format:

In today's chapter, we look at how to read and write the ORC file format in Apache Spark. We will also take a deep dive into the ORC structure and the advantages of using this file format when writing files to a target location in PySpark.

"Which file format do you use, and why?" has become a commonly asked question in Spark interviews. Several file formats are widely used with big data, such as Parquet, SequenceFile, Avro, CSV, and JSON. Each has its own advantages and its own typical use. Knowing the file formats helps Spark application developers choose the one that fits their use case. Today we will learn in detail about the ORC file format in Apache Spark. Come, let's get started.

ORC - Role in Spark Performance Tuning:

File format is an important factor in optimizing the efficiency of an application written in Spark. A developer should take the following factors into account when choosing a file format, before planning a new application:
  • Data stored in the HDFS location should be retrieved quickly for processing, i.e. reads should be fast.
  • Processed data should be saved quickly, and without any data loss, to any location such as HDFS, an AWS S3 bucket, Azure Blob storage, or GCP.
  • The format should be splittable, so that multiple tasks can run in parallel across the cluster.
  • The format should adapt to future schema changes, i.e. it should be flexible enough to retro-feed data with the updated schema.
  • The format should support a proper compression technique that enables efficient block storage and processing.

ORC - Overview:

Apache ORC is very similar to Apache Parquet, which we learnt about in the last chapter. It is a free and open-source column-oriented data storage format inspired by Facebook, who demonstrated that ORC is much faster than RC files. It is efficient in handling huge streaming data, as it provides good support for finding the required rows quickly. In ORC, data is stored in a row columnar format that lets the reader read, decompress, and process only the values that are required for the current query.

The ORC file format can be broadly divided into three parts, namely
  • Header
  • Stripes
  • Footer

The header stores the 3-byte magic string 'ORC', which tells the system that the file is of type ORC.
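A quick way to see the header in practice is to look at the first bytes of a file. The snippet below is a minimal sketch, assuming the magic is the three ASCII bytes 'ORC' as defined in the ORC specification. The helper name looks_like_orc is made up for this illustration and is not part of Spark or ORC.

```python
ORC_MAGIC = b"ORC"  # the three ASCII bytes expected at the start of an ORC file


def looks_like_orc(path):
    """Return True if the file at `path` starts with the ORC magic bytes.

    `looks_like_orc` is a hypothetical helper for illustration only.
    It checks the header and nothing else, so it does not validate
    the stripes or the footer.
    """
    with open(path, "rb") as handle:
        return handle.read(len(ORC_MAGIC)) == ORC_MAGIC


# Example usage (the path is a placeholder):
#   looks_like_orc("/tmp/students_orc/part-00000.orc")
```

This only tells us that a file claims to be ORC. Spark performs the full parsing of stripes and footer when the file is actually read.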

The data block, as shown in the above diagram, comprises the actual data elements and their indexes. Data in an ORC file is stored as a group of stripes. Each stripe holds rows of data. The default stripe size was 250 MB in early Hive releases, while newer ORC writers default to 64 MB.

Each stripe is further sub-divided into three sections, namely
    • Index data
    • Row data
    • Stripe footer

Index data consists of statistical information about the data, such as the count and the minimum and maximum values of every column, as well as the positions of rows within each column. The reader uses these indexes to select only the stripes and row groups that can contain the required data.

The stripe footer records the encoding of each column and a directory of the data streams with their exact locations.

The footer consists of three sections, namely metadata, file footer, and postscript.

  • The metadata section contains statistics for each column at the stripe level. These statistics help eliminate input splits through predicate pushdown evaluated against each stripe.

  • The file footer provides the list of stripes in the file, the number of rows in each stripe, and the data type of each column. It also holds column-level aggregates such as count, min, max, and sum.

  • The postscript contains information about the file such as the file footer's length and the metadata length, the file version, the compression technique used (e.g. none, zlib, or snappy), and the size of the compressed footer.
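Because the postscript sits at the very end of the file, a reader works backwards from the tail. The sketch below assumes the layout described in the ORC specification, where the final byte of the file holds the length of the postscript. The function name postscript_span is hypothetical, and decoding the postscript itself (a Protocol Buffers message) is left out.

```python
import os


def postscript_span(path):
    """Return (start_offset, length) of the postscript in an ORC file.

    Assumption: the last byte of the file stores the postscript length,
    and the postscript ends one byte before the end of the file.
    The file is assumed to be non-empty.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(file_size - 1)      # jump to the final byte
        length = handle.read(1)[0]      # postscript length in bytes
    start_offset = file_size - 1 - length
    return start_offset, length


# Example usage (the path is a placeholder):
#   start, length = postscript_span("/tmp/students_orc/part-00000.orc")
```

From the postscript a reader learns the footer and metadata lengths, which is how it finds those sections without scanning the stripes.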

Save as ORC using PySpark:

It's time for hands-on. Let us consider that we have a dataframe with some student data, as shown below, and we need to save this dataframe in the ORC file format.

The code snippet to achieve the same is given below.

#Save as orc file (target_path is a placeholder for your output location)
input_df.coalesce(1).write \
                    .format('orc') \
                    .mode('overwrite') \
                    .save(target_path)

Here, coalesce(1) merges all the partitions into a single partition, so that a single file is written to the target. We can observe that the file is saved in ORC format in the target location, as shown below.


Read the ORC file using PySpark:

Now, we will learn how to read the ORC file that we wrote in the previous step. We can observe that an ORC file, like Parquet or any other RC-format file, is not human-readable, i.e. one cannot see the sample data by running the cat command in a command window. We can read the ORC file in two ways. One is to use the dataframe API and the other is to run a select query on a Hive table built on top of the ORC file.

#Read the orc file format
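A minimal sketch of both read paths is given here. It assumes an active SparkSession is passed in as spark. The function names, the path, and the table name are placeholders made up for this illustration. The Hive variant additionally assumes that the table already exists and is stored as ORC.

```python
def read_orc(spark, path, columns=None):
    """Read ORC files at `path` with the dataframe API.

    `spark` is assumed to be an active SparkSession. Selecting only
    the needed columns lets the columnar reader skip the rest.
    """
    df = spark.read.format("orc").load(path)
    if columns:
        df = df.select(*columns)
    return df


def read_orc_via_sql(spark, table_name):
    """Read through a select query on a Hive table built on ORC files.

    `table_name` is a hypothetical, already existing Hive table.
    """
    return spark.sql(f"SELECT * FROM {table_name}")


# Example usage (names are placeholders):
#   students_df = read_orc(spark, "/tmp/students_orc", columns=["name"])
#   students_df.show()
#   read_orc_via_sql(spark, "students").show()
```

spark.read.orc(path) is a shorthand for the format('orc').load(path) chain used above.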


Advantages of ORC:

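Predicate pushdown efficiency: Predicate pushdown is the feature that applies a query's filter conditions at the data storage level itself. ORC stores insights about the data, such as count, min, and max values, alongside the data, so the reader can skip stripes and row groups that cannot match the filter. ORC supports predicate pushdown, and with it data analytics can be done much faster.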
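The idea can be sketched in plain Python. This is a simplified model and not the actual ORC reader: the StripeStats class and the sample numbers are made up for illustration, and only a single "value greater than" filter on one integer column is handled.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical per-stripe statistics for one integer column."""
    stripe_id: int
    min_value: int
    max_value: int


def stripes_to_read(stripes, lower_bound):
    """Return ids of stripes that may hold rows with value > lower_bound.

    A stripe whose max is not above the bound cannot contain a match,
    so it is skipped without reading its row data.
    """
    return [s.stripe_id for s in stripes if s.max_value > lower_bound]


# Made-up statistics for a 'marks' column spread over three stripes
stripes = [
    StripeStats(stripe_id=0, min_value=1, max_value=40),
    StripeStats(stripe_id=1, min_value=35, max_value=72),
    StripeStats(stripe_id=2, min_value=70, max_value=99),
]

# Filter: marks > 75, so only the last stripe needs to be read
print(stripes_to_read(stripes, 75))  # [2]
```

The statistics only rule stripes out. Rows inside the remaining stripes still have to be checked against the filter.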

Highly compressed data: ORC is a more compression-efficient data storage format than the other file formats. Hortonworks compared the compression achieved by the file formats and published a report stating that ORC achieves the highest compression, 78%, compared to Parquet, which compresses the data by up to 62%.
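The codec can be chosen at write time. The sketch below assumes Spark's ORC writer and its compression option. The function name write_orc and the choice of zlib as its default are made up for this illustration, and the set of supported codecs depends on the Spark version.

```python
def write_orc(df, path, compression="zlib"):
    """Write `df` as ORC at `path` with an explicit compression codec.

    `df` is assumed to be a Spark dataframe. Commonly available codec
    names include 'none', 'snappy', and 'zlib'.
    """
    (df.write
       .format("orc")
       .option("compression", compression)
       .mode("overwrite")
       .save(path))


# Example usage (the path is a placeholder):
#   write_orc(input_df, "/tmp/students_orc", compression="snappy")
```

As a rule of thumb, zlib gives smaller files while snappy is faster to write and read, so the better choice depends on whether storage or speed matters more.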

ACID support: ORC supports ACID transactions and snapshot isolation. This property is helpful for developers who deal with streaming data.

Complex datatypes: ORC can handle simple datatypes such as int, float, and string, as well as complex datatypes such as struct, list, map, and union.

Indexes: ORC provides three levels of indexing within each data file to make reads faster. They are,
    • File indexing - Stats about the values in each column across the entire file
    • Stripe indexing - Stats about the values in each column for each stripe
    • Row indexing - Stats about the values in each column for every 10,000 rows within a stripe
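How the three levels relate can be sketched in plain Python. This is a toy model and not the ORC writer: real stripes are sized in bytes rather than rows, so the rows_per_stripe parameter is a simplifying assumption, and all function names are hypothetical.

```python
def summarize(values):
    """Stats for one row group (assumes a non-empty list of values)."""
    return {"count": len(values), "min": min(values), "max": max(values)}


def merge(summaries):
    """Combine lower-level stats into stats for the level above."""
    return {
        "count": sum(s["count"] for s in summaries),
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }


def build_indexes(values, rows_per_stripe, rows_per_group=10_000):
    """Build row-group, stripe, and file level stats for one column."""
    stripes = []
    for start in range(0, len(values), rows_per_stripe):
        stripe_rows = values[start:start + rows_per_stripe]
        groups = [
            summarize(stripe_rows[i:i + rows_per_group])
            for i in range(0, len(stripe_rows), rows_per_group)
        ]
        stripes.append({"row_groups": groups, "stats": merge(groups)})
    return {"file": merge([s["stats"] for s in stripes]), "stripes": stripes}


# 25,000 made-up values split into stripes of 20,000 rows each
index = build_indexes(list(range(25_000)), rows_per_stripe=20_000)
print(index["file"])
print(len(index["stripes"]), len(index["stripes"][0]["row_groups"]))
```

A reader can check the file stats first, then the stripe stats, and finally the row-group stats, narrowing down the rows to read at each level.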

In this session, we learnt about the ORC file format, which is used in many Spark applications. I hope you now have a good understanding of it. Post a comment if you face any hurdles in understanding the concept or in the hands-on steps to read and write ORC using PySpark.

Happy Learning !!!
