How to Remove First N lines from Header Using PySpark | Apache Spark

Problem Statement:


Consider a report of web-page traffic generated every day, containing analytics information such as sessions, pageviews, unique views, etc. A sample report is shown in the figure below. To process the data and load it into a Spark DataFrame, we need to remove the first seven lines of the file, since they are header text rather than relevant data. Our problem statement: how would you handle this sort of file, and how would you load the data into a Spark DataFrame after removing the first seven lines, as shown in the diagram?


This question often comes up in Spark interviews. Let us see how to solve it using PySpark.

Video Explanation with Answer:



For more scenario-based Spark interview questions, please subscribe to my YT Channel.

Solution:


Data-Set:


The sample dataset can be downloaded from the given link: Download data. I would recommend trying to solve this problem on your own before looking at the solution. If you face any issues, let me know the hurdle in the comment box; I am happy to help.

Code Snippet:

Step 1: 

Create a SparkSession and a SparkContext as in the snippet below.

from pyspark.sql import SparkSession

# Build a local SparkSession; the SparkContext is derived from it.
spark = SparkSession.builder.master("local").appName("Remove N lines").getOrCreate()
sc = spark.sparkContext

Step 2: 

Read the file as an RDD with two partitions, splitting each record on commas. Refer to the code snippet below (collect() is safe here only because the sample file is small).

inp=sc.textFile("pageview.csv",2).map(lambda x: x.split(','))
inp.getNumPartitions()
inp.collect()
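One caveat on the split above: str.split(',') does not understand CSV quoting, so a quoted field containing a comma would be broken apart. Python's csv module parses such lines correctly. The sample line below is made up for illustration:

```python
import csv
import io

# A hypothetical record whose first field contains a quoted comma.
line = '"/home,page",10/01/2020,120,80,60'

naive = line.split(',')                       # breaks the quoted field apart
proper = next(csv.reader(io.StringIO(line)))  # respects CSV quoting

print(naive[0])   # '"/home'
print(proper[0])  # '/home,page'
```

For a plain file like this one the naive split is fine, but it is worth knowing where it stops working.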

Out[]:


Step 3: 

We apply the mapPartitionsWithIndex transformation to inspect each partition's index and drop the records at index 0 to 7, but only when the partition index is 0, i.e. the first partition of the Spark RDD.

# Note: the iterator argument is renamed so it does not shadow the built-in iter().
rdd_drop = inp.mapPartitionsWithIndex(lambda idx, it: list(it)[8:] if idx == 0 else it)
rdd_drop.collect()

Out[]:



From the output image, we can see the difference: the initial header lines are gone from the rdd_drop RDD.
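The per-partition logic can be illustrated in plain Python, without a cluster. This is a sketch of what mapPartitionsWithIndex does here: only the partition with index 0 has its leading records skipped (the iterator-based skip below also avoids materializing the whole partition the way list(it)[8:] does):

```python
def drop_leading(part_index, iterator, n=8):
    """Skip the first n records, but only in the first partition."""
    it = iter(iterator)
    if part_index == 0:
        for _ in range(n):
            next(it, None)  # discard a header record, if any remain
    return it

# Simulate a 14-record file split into two partitions of the RDD.
partitions = [list(range(10)), list(range(10, 14))]
result = [rec for idx, part in enumerate(partitions)
          for rec in drop_leading(idx, part)]
print(result)  # [8, 9, 10, 11, 12, 13]
```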

Step 4:

Convert the RDD to a DataFrame.

schema=['Page','Date','Pageviews','unique_views','session']
out_df=spark.createDataFrame(rdd_drop,schema)
out_df.show(10,truncate=0)

Out[]:


Thus, from the output, we can see that the first few lines were removed from the file before it was loaded as a Spark DataFrame. I hope this helps you answer a common Spark interview question.

Happy Learning !!!

1 Comment

  1. How are you sure that the first seven lines will go into partition 0? Isn't it possible that the first 5 records go into partition 0 and, from record 6 on, into partition 1? In that case we would end up printing lines 6 and 7.

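The commenter's concern is valid: sc.textFile splits a file by byte ranges, so there is no general guarantee that every header line lands in partition 0 (it usually holds for a small file, but it is not contractual). A partition-safe alternative is RDD.zipWithIndex, which assigns a global index to every record regardless of partitioning, e.g. inp.zipWithIndex().filter(lambda kv: kv[1] >= 8).map(lambda kv: kv[0]). The snippet below mimics that global-index approach in plain Python; the partition layout is made up to show headers spilling across partitions:

```python
# 'h1'..'h8' stand for the eight header lines, 'row*' for real data.
# Hypothetical layout where the header spills into partition 1.
partitions = [["h1", "h2", "h3"],
              ["h4", "h5", "h6", "h7", "h8", "row1"],
              ["row2"]]

# zipWithIndex numbers records globally, in partition order.
flat = [rec for part in partitions for rec in part]
kept = [rec for i, rec in enumerate(flat) if i >= 8]  # filter on the global index
print(kept)  # ['row1', 'row2']
```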