Word Count Program Using PySpark

Word Count Using PySpark:




In this chapter we will get familiar with using the Jupyter notebook with PySpark through a word count example. I recommend following the steps in this chapter and practicing them to get comfortable with the environment. In the previous chapter we installed all the software required to start with PySpark. If your setup is not ready yet, please follow the steps in Install pyspark in jupyter notebook before starting.

Let's write our first PySpark code in a Jupyter notebook. Open your Anaconda prompt and type "jupyter notebook" to open a web page, then choose "New > Python 3" as shown below to start a fresh notebook for our program.


Note: If you have Python 2 installed on your machine then you will see Python instead of Python 3.





Demo with Explanation:


Start Coding Word Count Using PySpark:

Our requirement is to write a small program that displays the number of occurrences of each word in a given input file. Let us create a dummy file with a few sentences in it.

Input file:
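If you would rather create the dummy file from the notebook itself, a minimal sketch follows. The sentences are made up for illustration and write_sample_file is a hypothetical helper; the file name firstprogram.txt matches the one the program reads later.

```python
from pathlib import Path

# Hypothetical sample sentences; any short text works.
SAMPLE_TEXT = (
    "hello spark\n"
    "hello pyspark word count\n"
    "spark counts each word\n"
)


def write_sample_file(path="firstprogram.txt"):
    """Write the sample sentences to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(SAMPLE_TEXT, encoding="utf-8")
    return target


# Create the file in the notebook's working directory and show its content
created = write_sample_file()
print(created.read_text(encoding="utf-8"))
```

Run this cell once before the word count program so that the file exists in the notebook's working directory.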
Program:
To locate the Spark installation on our machine and make PySpark importable from the notebook, type in the lines below.

# Find the path where PySpark is installed
import findspark
findspark.init()


The next step is to create a SparkSession and a SparkContext. When creating the SparkSession, we need to specify the execution mode (the master) and the application name. The snippet below creates both.

# Create SparkSession and SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .master("local") \
                    .appName("Firstprogram") \
                    .getOrCreate()
sc = spark.sparkContext

Note: We will look at SparkSession in detail in an upcoming chapter. For now, remember it as the entry point for running a Spark application.

Our next step is to read the input file as an RDD and apply transformations to calculate the count of each word in the file. To learn about RDDs and how to create them, go through the article What is RDD?



Below is the snippet to read the file as an RDD and count the words.

# Read the input file and calculate the word counts
text_file = sc.textFile("firstprogram.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

Note that here "text_file" is an RDD, and we used the "flatMap", "map", and "reduceByKey" transformations.
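To see what the three transformations compute, here is a rough pure-Python imitation that runs without Spark. It is only a sketch of the logic, not how Spark executes it: word_count is a hypothetical helper and the sample lines are made up.

```python
def word_count(lines):
    """Imitate flatMap -> map -> reduceByKey on a plain list of lines."""
    # flatMap: split every line and flatten the pieces into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with the number 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values that share a key using x + y
    totals = {}
    for word, n in pairs:
        totals[word] = totals[word] + n if word in totals else n
    return totals


sample_lines = ["spark is fast", "spark is fun"]  # made-up sample data
print(word_count(sample_lines))
# prints {'spark': 2, 'is': 2, 'fast': 1, 'fun': 1}
```

One detail worth knowing: line.split(" ") produces empty strings when a line contains consecutive spaces, so an empty "word" can show up in the counts. Calling line.split() with no argument splits on any run of whitespace and avoids that.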

Finally, call an action to collect the final result and print it. Transformations are lazy, so nothing is computed until an action runs. Here collect is the action that gathers the output into a Python list. Use the snippet below to do it.

# Printing each word with its respective count
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
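Because collect returns an ordinary Python list of (word, count) tuples, the result can be post-processed with plain Python. As an optional sketch, the hypothetical helper sort_by_count below orders the pairs so that the most frequent words come first; the sample list is made up and only stands in for the real collected output.

```python
def sort_by_count(pairs):
    """Return (word, count) pairs, highest count first, ties ordered by word."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Made-up stand-in for the list that counts.collect() would return
sample_output = [("is", 2), ("fast", 1), ("spark", 2), ("fun", 1)]
for word, count in sort_by_count(sample_output):
    print("%s: %i" % (word, count))
```

In the notebook you would pass the real list instead, for example sort_by_count(output).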


The output of the word count program looks something like this:


After all the execution steps are complete, don't forget to stop the SparkSession. Use the code below to end the SparkSession and SparkContext that we created.

# Stopping Spark-Session and Spark context
sc.stop()
spark.stop()


Congratulations, you have created your first PySpark program using a Jupyter notebook.


I hope you learned how to start coding in PySpark with the help of this word count example. If you have any doubts or problems with the code or the topic above, kindly let me know by leaving a comment.


Happy Learning!!!
