
Pyspark 1.5 How To Truncate Timestamp To Nearest Minute From Seconds

I am using PySpark. I have a column ('dt') in a dataframe ('canon_evt') that is a timestamp. I am trying to remove the seconds from a DateTime value. It is originally read in from Parquet.

Solution 1:

Spark >= 2.3

You can use date_trunc

from pyspark.sql.functions import col, date_trunc

df.withColumn("dt_truncated", date_trunc("minute", col("dt"))).show()

## +-------------------+-------------------+
## |                 dt|       dt_truncated|
## +-------------------+-------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00|
## |2015-09-16 05:39:46|2015-09-16 05:39:00|
## |2015-09-16 05:40:46|2015-09-16 05:40:00|
## |2016-03-05 02:00:10|2016-03-05 02:00:00|
## +-------------------+-------------------+
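date_trunc accepts other units as well (hour, day, week, month, quarter, year), so the same pattern covers coarser truncation. A minimal sketch, assuming a dataframe df with a timestamp column dt as in the output above:

from pyspark.sql.functions import col, date_trunc

# Truncate to the start of the hour and to the start of the day
df.withColumn("dt_hour", date_trunc("hour", col("dt"))) \
  .withColumn("dt_day", date_trunc("day", col("dt"))) \
  .show()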

Spark < 2.3

Converting to Unix timestamps and basic arithmetic should do the trick:

from pyspark.sql import Row
from pyspark.sql.functions import col, unix_timestamp, round

df = sc.parallelize([
    Row(dt='1970-01-01 00:00:00'),
    Row(dt='2015-09-16 05:39:46'),
    Row(dt='2015-09-16 05:40:46'),
    Row(dt='2016-03-05 02:00:10'),
]).toDF()


## unix_timestamp converts string to Unix timestamp (bigint / long)
## in seconds. Divide by 60, round, multiply by 60 and cast
## should work just fine.
dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
    .cast("timestamp"))

df.withColumn("dt_truncated", dt_truncated).show(10, False)
## +-------------------+---------------------+
## |dt                 |dt_truncated         |
## +-------------------+---------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
## +-------------------+---------------------+
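Note that round snaps to the nearest minute, which is why 05:39:46 becomes 05:40:00 in the output above. If the goal is to drop the seconds entirely rather than round, swapping round for floor in the same arithmetic should do it; a sketch under the same setup:

from pyspark.sql.functions import col, floor, unix_timestamp

# floor instead of round discards the seconds (05:39:46 -> 05:39:00)
dt_floored = (floor(unix_timestamp(col("dt")) / 60) * 60).cast("timestamp")

df.withColumn("dt_truncated", dt_floored).show(10, False)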

Solution 2:

This question was asked a few years ago, but if anyone else comes across it: as of Spark v2.3 this has been added as a built-in feature. It is now as simple as the following (this assumes canon_evt is a dataframe with a timestamp column dt from which we want to remove the seconds):

from pyspark.sql.functions import date_trunc

canon_evt = canon_evt.withColumn('dt', date_trunc('minute', canon_evt.dt))
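A quick sanity check, assuming an active SparkSession named spark and reusing timestamps from the examples above:

from pyspark.sql.functions import col, date_trunc

canon_evt = spark.createDataFrame(
    [("2015-09-16 05:39:46",), ("2016-03-05 02:00:10",)], ["dt"]
).withColumn("dt", col("dt").cast("timestamp"))

# Seconds are zeroed out; the column keeps its timestamp type
canon_evt = canon_evt.withColumn("dt", date_trunc("minute", canon_evt.dt))
canon_evt.show(truncate=False)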

Solution 3:

I think zero323 has the best answer. It's kind of annoying that Spark doesn't support this natively, given how easy it is to implement. For posterity, here is a function that I use:

def trunc(date, format):
    """Wraps Spark's trunc function to support day, minute, and hour"""
    import re
    import pyspark.sql.functions as func

    # Hack to get the column name from a Column object or a plain string:
    try:
        colname = re.match(r"Column<.?'(.*)'>", str(date)).groups()[0]
    except AttributeError:
        colname = date

    alias = "trunc(%s, %s)" % (colname, format)

    if format in ('year', 'YYYY', 'yy', 'month', 'mon', 'mm'):
        return func.trunc(date, format).alias(alias)
    elif format in ('day', 'DD'):
        return func.date_sub(date, 0).alias(alias)
    elif format in ('min', ):
        return ((func.round(func.unix_timestamp(date) / 60) * 60).cast("timestamp")).alias(alias)
    elif format in ('hour', ):
        return ((func.round(func.unix_timestamp(date) / 3600) * 3600).cast("timestamp")).alias(alias)

Solution 4:

To truncate the timestamp to some other interval instead, say 5, 10, or 7 minutes, you can group rows into time windows:

from pyspark.sql.functions import *
df = spark.createDataFrame(
    [("2016-03-11 09:00:07", 1, 5), ("2016-03-11 09:00:57", 2, 5)]
).toDF("date", "val1", "val2")

w = df.groupBy("val2", window("date", "5 minutes")).agg(sum("val1").alias("sum"))
w.select(
    w.window.start.cast("string").alias("start"),
    w.window.end.cast("string").alias("end"),
    "sum",
    "val2",
).show(10, False)
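If you only need to snap each row down to an N-minute boundary without aggregating, the unix_timestamp arithmetic from Solution 1 generalizes to any interval. A sketch with a 5-minute interval, reusing df from above:

from pyspark.sql.functions import col, floor, unix_timestamp

interval = 5 * 60  # interval in seconds: 5 minutes here; 10 * 60 or 7 * 60 also work

df.withColumn(
    "date_truncated",
    (floor(unix_timestamp(col("date")) / interval) * interval).cast("timestamp"),
).show(10, False)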
