How to Cast String Datatype to Date/Timestamp in Apache Spark

Cast String to DateTime:


We often encounter situations where we need to cast columns in a Spark dataframe from one datatype to another. In this blog, we discuss one such problem that frequently comes up while developing Spark projects. Today's problem statement is to convert a column of datatype string to a date or datetime column using Spark with Python.

Let us assume the input dataframe df_in is created with five columns, all of type string. To view the schema of the dataframe, we can run df_in.printSchema(), or simply df_in.schema to get the StructType schema; doing so shows that every column, including datatime, is of type string.
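Since we will refer to this dataframe throughout, here is a minimal sketch of such an input; only the datatime column name comes from the discussion below, the other four column names and the rows are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("string-to-timestamp").getOrCreate()

# Illustrative input: five string columns; "datatime" holds MM/dd/yyyy HH:mm strings.
# All column names except "datatime" are assumptions for this example.
df_in = spark.createDataFrame([
    ("1", "John", "04/01/2018 03:00", "60", "NY"),
    ("2", "Anna", "02/01/2018 03:03", "66", "CA"),
], ("id", "name", "datatime", "counts", "city"))

df_in.printSchema()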


Problem:


In the input dataframe above, consider the column datatime. Its datatype is string, and our requirement is to convert it to a timestamp column. The simple way to do this in Spark is to import TimestampType from pyspark.sql.types and cast the column with the snippet below:

from pyspark.sql.types import TimestampType

df_conv = df_in.withColumn("datatime", df_in["datatime"].cast(TimestampType()))

But due to a problem with this cast, we might sometimes get null values, as shown below.
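With the illustrative df_in above, the cast runs without error but silently produces nulls; the output looks roughly like:

df_conv.show()

+---+----+--------+------+----+
| id|name|datatime|counts|city|
+---+----+--------+------+----+
|  1|John|    null|    60|  NY|
|  2|Anna|    null|    66|  CA|
+---+----+--------+------+----+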



Reason:


The reason for the null values above is that Spark can cast from string to datetime only if the given string value is in the format yyyy-MM-dd HH:mm:ss, whereas the format of our datatime column is MM/dd/yyyy HH:mm. Let us verify this by converting a sample column whose strings are in the prescribed format. I created a sample dataframe to do this; check out Ways to create DF in Apache Spark.

df = spark.createDataFrame([
    ("John", "2018-04-01 03:00:00", "60"),
    ("John", "2018-02-01 03:03:00", "66"),
], ("ID", "timestamp", "counts"))
df.printSchema()
df.show()


Result of the above snippet:
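With these two rows, the expected console output is roughly:

root
 |-- ID: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- counts: string (nullable = true)

+----+-------------------+------+
|  ID|          timestamp|counts|
+----+-------------------+------+
|John|2018-04-01 03:00:00|    60|
|John|2018-02-01 03:03:00|    66|
+----+-------------------+------+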

from pyspark.sql.types import TimestampType

df_chk = df.withColumn("timestamp", df["timestamp"].cast(TimestampType()))
df_chk.printSchema()
df_chk.show()

Result of the above snippet:
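This time the cast succeeds because the strings are already in yyyy-MM-dd HH:mm:ss format; the expected output is roughly:

root
 |-- ID: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- counts: string (nullable = true)

+----+-------------------+------+
|  ID|          timestamp|counts|
+----+-------------------+------+
|John|2018-04-01 03:00:00|    60|
|John|2018-02-01 03:03:00|    66|
+----+-------------------+------+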

From the two outputs above, it is clear that if the string column is not in yyyy-MM-dd HH:mm:ss format, we will not be able to convert it to datetime with a plain cast: the cast itself appears to succeed, but the resulting dataframe holds null values in that column.

Solution:


To overcome this sort of issue while casting columns in a dataframe, we can leverage the unix_timestamp function from pyspark.sql.functions: first convert the column to a unix timestamp using its actual source format, then apply the cast on top of that to get the expected timestamp column.

Code Snippet:

from pyspark.sql.types import TimestampType
from pyspark.sql.functions import unix_timestamp
df_conv=df_in.withColumn("datatime",unix_timestamp("datatime", 'MM/dd/yyyy HH:mm')\                         .cast(TimestampType()))
df_conv.printSchema()
df_conv.show(5)

Output of above snippet:
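With the illustrative df_in from earlier, the datatime column now comes out as a proper timestamp:

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- datatime: timestamp (nullable = true)
 |-- counts: string (nullable = true)
 |-- city: string (nullable = true)

+---+----+-------------------+------+----+
| id|name|           datatime|counts|city|
+---+----+-------------------+------+----+
|  1|John|2018-04-01 03:00:00|    60|  NY|
|  2|Anna|2018-02-01 03:03:00|    66|  CA|
+---+----+-------------------+------+----+

As a side note, on Spark 2.2 and later the same conversion can be done in one step with to_timestamp from pyspark.sql.functions, e.g. to_timestamp("datatime", 'MM/dd/yyyy HH:mm').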


We can also derive a date-only column from the timestamp with the lines of code below. For doing so, we need to import the date_format function.

from pyspark.sql.functions import date_format

df_in.printSchema()
df1 = df_in.withColumn(
    "DateOnly",
    date_format(unix_timestamp("datatime", 'MM/dd/yyyy HH:mm').cast(TimestampType()), "yyyyMMdd")
)
df1.show(2)


The above code results in:
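With the sample df_in, the new DateOnly column holds yyyyMMdd strings, while the original datatime column is left untouched:

+---+----+----------------+------+----+--------+
| id|name|        datatime|counts|city|DateOnly|
+---+----+----------------+------+----+--------+
|  1|John|04/01/2018 03:00|    60|  NY|20180401|
|  2|Anna|02/01/2018 03:03|    66|  CA|20180201|
+---+----+----------------+------+----+--------+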


Hope this simple trick in Spark is useful to you. If you encounter any similar issues while casting columns in Spark, leave us a comment below.

Happy Learning !!!
