I want to write a query to find the busiest time of day, on average, in 1-hour intervals.

My DataFrame has a date column in the format "%d/%b/%Y:%H:%M:%S".

I started like this:

mostBusyTimeDF = logDF.groupBy("date") ...

Example input:

               date
 2015-12-01 21:04:00
 2015-12-01 10:04:00
 2015-12-01 21:07:00
 2015-12-01 21:34:00

Expected output (hour 21 contains three events, hour 10 just one):

               date    count (1-hour interval)
 2015-12-01 21:04:00                          3
 2015-12-01 10:04:00                          1

After that, I don't know how to proceed.

Can you help me?

Thanks a lot


1 Answer

You can use built-in Spark date functions:

from pyspark.sql.functions import year, month, dayofmonth, hour, count

logDF = sqlContext.createDataFrame(
    [("2015-12-01 21:04:00", 1), ("2015-12-01 10:04:00", 2),
     ("2015-12-01 21:07:00", 9), ("2015-12-01 21:34:00", 1)],
    ['somedate', 'someother'])

# Count the rows that fall into each (year, month, day, hour) bucket
busyTimeDF = (logDF
    .groupBy(year("somedate").alias("cnt_year"),
             month("somedate").alias("cnt_month"),
             dayofmonth("somedate").alias("cnt_day"),
             hour("somedate").alias("cnt_hour"))
    .agg(count("*").alias("cntHour")))

# Join the hourly counts back onto the original rows
cond = [busyTimeDF.cnt_year == year(logDF.somedate),
        busyTimeDF.cnt_month == month(logDF.somedate),
        busyTimeDF.cnt_day == dayofmonth(logDF.somedate),
        busyTimeDF.cnt_hour == hour(logDF.somedate)]

busyTimeDF.join(logDF, cond).select('somedate', 'cntHour').show()
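
Note that the question's raw dates are strings in "%d/%b/%Y:%H:%M:%S" format (e.g. "01/Dec/2015:21:04:00"), so they have to be parsed into a standard timestamp before the date functions above can bucket them. A minimal sketch, assuming the string column is named somedate; the name parsedDF is just for illustration:

from pyspark.sql.functions import from_unixtime, unix_timestamp

# Parse "01/Dec/2015:21:04:00" -> "2015-12-01 21:04:00"
parsedDF = logDF.withColumn(
    "somedate",
    from_unixtime(unix_timestamp("somedate", "dd/MMM/yyyy:HH:mm:ss")))

On Spark 2.0+ the hourly counts can also be computed without the manual groupBy/join, using the built-in window function. A sketch under the assumption that somedate can be cast to a timestamp (ts and hourlyDF are illustrative names):

from pyspark.sql.functions import col, count, window

# Count rows per fixed, non-overlapping 1-hour window
hourlyDF = (logDF
    .withColumn("ts", col("somedate").cast("timestamp"))
    .groupBy(window("ts", "1 hour"))
    .agg(count("*").alias("cntHour")))
hourlyDF.show(truncate=False)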
