How to Remove First N lines from Header Using PySpark | Apache Spark

Problem Statement:


Consider a report of web-page traffic generated every day, containing analytics information such as sessions, pageviews, unique views, etc. A sample report is shown in the figure below. To process the data and load it into a Spark DataFrame, we need to remove the first seven lines of the file, since they are header text rather than relevant data. Our problem statement: how would you handle this sort of file, and how would you load the data into a Spark DataFrame after removing the first seven lines, as shown in the diagram?


This question often comes up in Spark interviews. Let us see how to solve it using PySpark.

Video Explanation with Answer:



For more scenario-based Spark interview questions, please subscribe to my YT Channel.

Solution:


Data-Set:


The sample dataset can be downloaded from the given link: Download data. I would recommend trying to solve this problem on your own before looking at the solution. If you face any issues, let me know the hurdle in the comment box; I am happy to help.

Code Snippet:

Step 1: 

Create a SparkSession and a SparkContext as in the snippet below.

from pyspark.sql import SparkSession

# Build a local SparkSession; the SparkContext is derived from it.
spark = SparkSession.builder.master("local").appName("Remove N lines").getOrCreate()
sc = spark.sparkContext

Step 2: 

Read the file as an RDD with two partitions, splitting each record on commas. Refer to the code snippet below (collect() is safe here only because the sample file is small).

inp=sc.textFile("pageview.csv",2).map(lambda x: x.split(','))
inp.getNumPartitions()
inp.collect()
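One caveat on the split above: str.split(',') does not understand CSV quoting, so a quoted field containing a comma would be broken apart. Python's csv module parses such lines correctly. The sample line below is made up for illustration:

```python
import csv
import io

# A hypothetical record whose first field contains a quoted comma.
line = '"/home,page",10/01/2020,120,80,60'

naive = line.split(',')                       # breaks the quoted field apart
proper = next(csv.reader(io.StringIO(line)))  # respects CSV quoting

print(naive[0])   # '"/home'
print(proper[0])  # '/home,page'
```

For a plain file like this one the naive split is fine, but it is worth knowing where it stops working.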

Out[]:


Step 3: 

We apply the mapPartitionsWithIndex transformation to inspect each partition's index and drop the records at index 0 to 7, but only when the partition index is 0, i.e. the first partition of the Spark RDD.

# Note: the iterator argument is renamed so it does not shadow the built-in iter().
rdd_drop = inp.mapPartitionsWithIndex(lambda idx, it: list(it)[8:] if idx == 0 else it)
rdd_drop.collect()

Out[]:



From the output image, we can see the difference: the initial header lines are gone from the rdd_drop RDD.
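The per-partition logic can be illustrated in plain Python, without a cluster. This is a sketch of what mapPartitionsWithIndex does here: only the partition with index 0 has its leading records skipped (the iterator-based skip below also avoids materializing the whole partition the way list(it)[8:] does):

```python
def drop_leading(part_index, iterator, n=8):
    """Skip the first n records, but only in the first partition."""
    it = iter(iterator)
    if part_index == 0:
        for _ in range(n):
            next(it, None)  # discard a header record, if any remain
    return it

# Simulate a 14-record file split into two partitions of the RDD.
partitions = [list(range(10)), list(range(10, 14))]
result = [rec for idx, part in enumerate(partitions)
          for rec in drop_leading(idx, part)]
print(result)  # [8, 9, 10, 11, 12, 13]
```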

Step 4:

Convert the RDD to a DataFrame.

schema=['Page','Date','Pageviews','unique_views','session']
out_df=spark.createDataFrame(rdd_drop,schema)
out_df.show(10,truncate=0)

Out[]:


Thus, from the output, we can see that the first few lines were removed from the file before it was loaded as a Spark DataFrame. I hope this helps you answer a common Spark interview question.

Happy Learning !!!

1 Comment

  1. How are you sure that the first seven lines will go into partition 0? Isn't it possible that the first 5 records go into partition 0 and, from record 6 on, into partition 1? In that case we would end up printing lines 6 and 7.

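The commenter's concern is valid: sc.textFile splits a file by byte ranges, so there is no general guarantee that every header line lands in partition 0 (it usually holds for a small file, but it is not contractual). A partition-safe alternative is RDD.zipWithIndex, which assigns a global index to every record regardless of partitioning, e.g. inp.zipWithIndex().filter(lambda kv: kv[1] >= 8).map(lambda kv: kv[0]). The snippet below mimics that global-index approach in plain Python; the partition layout is made up to show headers spilling across partitions:

```python
# 'h1'..'h8' stand for the eight header lines, 'row*' for real data.
# Hypothetical layout where the header spills into partition 1.
partitions = [["h1", "h2", "h3"],
              ["h4", "h5", "h6", "h7", "h8", "row1"],
              ["row2"]]

# zipWithIndex numbers records globally, in partition order.
flat = [rec for part in partitions for rec in part]
kept = [rec for i, rec in enumerate(flat) if i >= 8]  # filter on the global index
print(kept)  # ['row1', 'row2']
```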