How to Use Pandas API in Spark | Spark Performance Tuning - Apache Spark 3.2.0

In this session, we will look at the new Pandas On Spark feature introduced in the Apache Spark 3.2.0 release. We will walk through a demo to understand how this feature works, learn what advantages this update brings, and see how it boosts performance over the native pandas library.

Pandas On Spark:


If you are a Data Scientist or Data Analyst involved in building machine learning models, then your go-to language in most cases will be Python, and in particular you will be using the Pandas API to perform various cleansing and modelling operations. But the native Pandas API has its own limitations, as it cannot scale with huge data volumes. For example, Pandas might fail with an OOM (Out Of Memory) error while trying to read a volume of data that does not fit into the memory available on a single machine.

Now, with the release of Apache Spark version 3.2.0, we have a new feature that brings the Pandas API to Spark, i.e., we can perform the same pandas operations without any change to the syntax and leverage the advantage of running the code on a Spark cluster.
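To get a feel for how little the syntax changes, here is a minimal sketch (the file path is just a placeholder): the only real difference between the two approaches is the import statement, while the DataFrame calls remain the same.

import pandas as pd                       # native pandas: single-machine, in-memory
pdf = pd.read_csv('/path/to/data.csv')    # placeholder path

import pyspark.pandas as pds              # pandas API on Spark: same calls, distributed execution
psdf = pds.read_csv('/path/to/data.csv')  # placeholder path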

Demo on Pandas on Spark:

Working Of Pandas on Spark:


There are no major syntactical changes from the native pandas library, which makes migrating from native pandas to pandas on Spark easy. Let us understand the working of Pandas on Spark with the example below.

Let us consider a CSV file containing some data on the Google Play Store. You can go ahead and download the input data from this link: GooglePlayStore.CSV. We will read this file using native pandas and also with pandas on Spark to understand how they work and the changes to the syntax.


Native Pandas:

import pandas as pd
# Read the CSV into a native pandas DataFrame (processed entirely on the driver)
pdf = pd.read_csv('file:/tmp/googleplaystore.csv')
# Count records for a specific genre
pdf[pdf.Genres == 'Entertainment;Music & Video'].count()

Out[]:

Pandas on Spark:

import pyspark.pandas as pds
# Read the CSV into a pandas-on-Spark DataFrame (executed as a distributed Spark job)
pdf_s = pds.read_csv('dbfs:/FileStore/shared_uploads/googleplaystore.csv')
# Same pandas syntax: count records for a specific genre
pdf_s[pdf_s.Genres == 'Entertainment;Music & Video'].count()

Out[]:

From the above code, one can see that there is not much difference in syntax, except for the imported package. But at the process level, if one checks the output, the native Pandas API works only on the driver's memory and does not trigger any distributed task, whereas the output of the Pandas on Spark API clearly shows that a Spark job is triggered and the task is executed in a distributed fashion.
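If you want to verify this programmatically as well, pandas-on-Spark DataFrames expose a spark accessor. The following is a minimal sketch, reusing the pdf_s DataFrame from the snippet above, that prints the underlying Spark execution plan for the pandas-style filter:

# Print the Spark plan behind the pandas-style filter
pdf_s[pdf_s.Genres == 'Entertainment;Music & Video'].spark.explain()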

SQL Queries on Top of Pandas on Spark DataFrame:

The Pandas on Spark API holds many advantages over the native Pandas API. We can overcome the major Out Of Memory issue while dealing with large amounts of data. Another major advantage of migrating to the Pandas on Spark API is that we can run SQL queries directly on top of a pandas-on-Spark DataFrame, as shown in the snippet below.

We will use the pandas-on-Spark DataFrame that we created in the step above and reference it directly in the SQL query.

Code:
# The pandas-on-Spark DataFrame is referenced by name inside the query
pds.sql("select * from {pdf_s} where Genres in ('Entertainment;Music & Video')")

Out[]:
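The result returned by pds.sql() is itself a pandas-on-Spark DataFrame, so it can be chained with further pandas-style operations or converted into a regular Spark DataFrame. Here is a minimal sketch of both, reusing pdf_s from above:

# Store the SQL result and keep working with it in pandas style
result = pds.sql("select * from {pdf_s} where Genres in ('Entertainment;Music & Video')")
result.head(5)

# Convert to a regular Spark DataFrame when needed
result.to_spark().show(5)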

Performance Testing:

Let's quickly compare the performance of both the native and the Spark approach with 100 million records.
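As a rough sketch of how such a comparison can be run, the snippet below times the same count on both APIs. The path and record count are hypothetical placeholders, and the actual timings will depend entirely on the data and the cluster (native pandas may even fail with OOM at this scale).

import time
import pandas as pd
import pyspark.pandas as pds

path = '/tmp/large_dataset.csv'    # hypothetical file with ~100 million records

start = time.time()
pd.read_csv(path).count()          # native pandas: single machine
print('Native pandas   : %.1f s' % (time.time() - start))

start = time.time()
pds.read_csv(path).count()         # pandas on Spark: distributed
print('Pandas on Spark : %.1f s' % (time.time() - start))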





From the above comparison, it is evident that the Pandas on Spark API, with the help of distributed computing, makes the processing faster than the native Pandas API.

Hope you understood the concept. Try out this new feature over the existing Pandas API and let me know the performance and your thoughts in the comment box.

Happy Learning !!!

References:

  • https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
