Apache Spark SQL Date and Timestamp Functions Using PySpark

In this tutorial, we will cover the most commonly used Spark SQL date and timestamp functions and understand how each of them works with the help of a demo.


In almost every production use case we face a scenario around dates and timestamps that needs to be sorted out. The issue might be casting a string column into a date or timestamp column, or it might be extracting information from the given column. These can be easily handled using the built-in Spark SQL functions.

Spark version used - 3.1.2

Basics of Spark Date and Time Functions:



Input Dataset:

We will create our own dataset with the help of a list of date-time values and use that list to create a Spark DataFrame as shown below.



#Create input Spark Dataframe
list_data=[
           ["2022/03/31 01:55 AM"],
           ["2022/03/30 01:15 AM"],
           ["2022/03/29 02:15 PM"],
           ["2022/04/01 04:15 PM"],
          ]

list_schema=["inp_col"]

#Create DataFrame from the list 
df1=spark.createDataFrame(list_data,list_schema)

df1.printSchema()

df1.show()

Out[]:
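The snippet above assumes a SparkSession object named spark is already available, as it is in Databricks notebooks and the pyspark shell. If you are running the code elsewhere, here is a minimal sketch to create one yourself (the application name below is only illustrative):

#Create a SparkSession when one is not pre-created for you
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("date-time-functions-demo") \
        .getOrCreate()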


Note: If you are using a Databricks cluster with a Spark 3.0 runtime, then set spark.sql.legacy.timeParserPolicy to LEGACY as shown below.
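For example, from a notebook cell:

#Fall back to the Spark 2.x date/time parser (only needed on Spark 3.0 runtimes)
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")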

As a first step, let us convert the input string column into a date or timestamp column.

Cast String to DateType or TimestampType Using Spark:

To cast the string column into a date or timestamp, we can use any of the below functions from Apache Spark. Note that to_date and to_timestamp parse a string into a date or timestamp, while date_format works in the other direction and formats a date or timestamp into a string.

1) to_date

    Syntax - to_date(col, format=None) 

2) to_timestamp

    Syntax - to_timestamp(col, format=None)

3) date_format

    Syntax - date_format(date, format)


Code Snippet:

#Convert string to dateType or timestamp

from pyspark.sql.functions import to_date,date_format,to_timestamp

df2 = df1.withColumn("date",to_date("inp_col","yyyy/MM/dd")) \
         .withColumn("time",to_timestamp("inp_col","yyyy/MM/dd hh:mm a"))


df2.printSchema()
df2.display()
Out[]:
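The third function listed above, date_format, is used in the opposite direction: it formats the date or timestamp column back into a string. A minimal sketch reusing the time column created above (the names df2_str and time_str are only illustrative):

#Format the timestamp column back into a string (date_format returns a StringType column)
from pyspark.sql.functions import date_format

df2_str = df2.withColumn("time_str", date_format("time", "yyyy-MM-dd HH:mm"))

df2_str.printSchema()
df2_str.display()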



Extract Year, Month, Day, Hour, Minute, Quarter, Week of year From Spark DF:



#Import all the required functions
from pyspark.sql.functions import year,month,dayofmonth,hour,minute,second,dayofweek,quarter,weekofyear  

# To Get Year from date or Time column
df3 = df2.withColumn("year",year("time"))

# To Get Month from date or Time column
df3 = df3.withColumn("month",month("time"))

# To Get day from date or Time column
df3 = df3.withColumn("day",dayofmonth("time"))

# To Get hour from Time column
df3 = df3.withColumn("hour",hour("time"))

# To Get Minute from Time column
df3 = df3.withColumn("min",minute("time"))

# To Get Quarter from date or Time column
df3 = df3.withColumn("quarter-of-year",quarter("time"))

# To Get week of year from date or Time column
df3 = df3.withColumn("week-of-year",weekofyear("time")) 
   

df3.display()
Out[]:
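The same extractions can also be written as a single select instead of a chain of withColumn calls; a sketch with the same column names (df3_alt is only an illustrative name):

#Equivalent single-pass version of the extractions above
df3_alt = df2.select(
    "inp_col", "date", "time",
    year("time").alias("year"),
    month("time").alias("month"),
    dayofmonth("time").alias("day"),
    hour("time").alias("hour"),
    minute("time").alias("min"),
    quarter("time").alias("quarter-of-year"),
    weekofyear("time").alias("week-of-year"),
)

df3_alt.display()

The chained withColumn version is easier to read step by step, while the single select keeps the whole projection in one place.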



Day and Month in Words in Spark DataFrame:

In the below code snippet we will see how to get the day and month of the given date or timestamp in words using Spark functions. Here we can simply use the date_format function and fill in the format parameter as shown below.


For more info on date format patterns, check the links added in the Reference section.


Code Snippet:

#Day and Month in Words Spark Dataframe

df3=df2.withColumn("dayofweek",dayofweek("time")) \
   .withColumn("dayinwords",date_format("time", "EEE")) \
   .withColumn("monthinwords",date_format("time", "LLL")) 

df3.display()
Out[]:
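If you need the full names rather than the abbreviations, the same date_format call can be used with the longer patterns "EEEE" (full day name) and "MMMM" (full month name); a small sketch (df4 and the new column names are only illustrative):

#Full day and month names using the longer format patterns
df4 = df2.withColumn("day_full", date_format("time", "EEEE")) \
         .withColumn("month_full", date_format("time", "MMMM"))

df4.display()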


Miscellaneous Functions on Spark Date and Time:

We will also see how to get the current date, find the difference in days between two dates, and get the date N days after or before a given date.

Code Snippet:


#import the required functions
from pyspark.sql.functions import current_date,datediff,date_add,date_sub,date_trunc


#Get Todays date
df3=df2.withColumn("cur_date",current_date()) 

#Get date difference in days (datediff(end, start) returns end minus start)
df3=df3.withColumn("datedif",datediff("date","cur_date"))

#Add N days to date
df3=df3.withColumn("dateadd",date_add("date",5))

#Subtract N days from date
df3=df3.withColumn("datesub",date_sub("date",5))

#date truncate
df3=df3.withColumn("datetrnc",date_trunc('mm',"time"))
   
df3.display()
Out[]:



GitHub Link:




Reference:


Happy Learning !!!
