Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I'm preparing a dataset to train a machine learning model using PySpark. The dataframe I'm working on contains thousands of records of presences registered inside rooms of different buildings and cities on different days. The presences are in this format:

+----+--------+----+---+-----+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
|   1|       1|   1|  9|   11|     8|      27|     13|       15|
|   1|       1|   1|  9|   11|     8|      28|     13|        5|
|   1|       1|   1|  9|   11|     8|      32|     13|        7|
|   1|       1|   1|  9|   11|     8|      32|      8|       50|
|   1|       1|   1|  9|   11|     8|      32|      8|       48|
+----+--------+----+---+-----+------+--------+-------+---------+

inHour and inMinute stand for the hour and minute of access, and outHour and outMinute refer to the time of exit. Hours are in 0-23 format. All columns contain only integer values.

What I'm missing is the target value for my machine learning model, which is the number of persons for each combination of room, building, city, day, month and time interval. To explain better: the first row refers to a presence with access hour 8 and exit hour 13, so it should be counted in the records for the intervals 8-9, 9-10, 10-11, 11-12, 12-13 and also 13-14. What I want to accomplish is something like the following:

+----+--------+----+---+-----+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
|   1|       1|   1|  9|   11|     8|      9|    X|
|   1|       1|   1|  9|   11|     9|     10|    X|
|   1|       1|   1|  9|   11|    10|     11|    X|
|   1|       1|   1|  9|   11|    11|     12|    X|
|   1|       1|   1|  9|   11|    12|     13|    X|
+----+--------+----+---+-----+------+-------+-----+

So the 4th row of the first table should be counted in the 1st row of this table and so on...
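The counting rule described above can be checked by hand with a few lines of plain Python. This is only a sketch for working out the expected numbers, not a PySpark solution: it assumes that a presence counts once in every hourly interval from its hour of entry up to and including its hour of exit, and it leaves out room, building, city, day and month because all five sample rows share the same values. The names presences and counts are made up for the illustration.

```python
from collections import Counter

# Sample presences copied from the first table:
# (inHour, inMinute, outHour, outMinute).
presences = [
    (8, 27, 13, 15),
    (8, 28, 13, 5),
    (8, 32, 13, 7),
    (8, 32, 8, 50),
    (8, 32, 8, 48),
]

counts = Counter()
for in_hour, _in_minute, out_hour, _out_minute in presences:
    # Assumption: a presence is counted once in every hourly interval it
    # touches, from the hour of entry up to and including the hour of exit.
    for hour in range(in_hour, out_hour + 1):
        counts[(hour, hour + 1)] += 1

# Print one line per interval, e.g. "8-9: 5".
for (time_in, time_out), n in sorted(counts.items()):
    print(f"{time_in}-{time_out}: {n}")
```

With the five sample rows, the interval 8-9 gets a count of 5 (every presence starts in that hour) and each interval from 9-10 to 13-14 gets a count of 3.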

Question from: https://stackoverflow.com/questions/65645790/manipulate-a-complex-dataframe-in-pyspark


1 Answer

You can explode a sequence of hours for each row (e.g. the first row would yield [8, 9, 10, 11, 12, 13]), group by the hour together with the other columns, and take the count of each group. Here hour corresponds to timeIn. I don't think timeOut needs to be in the result dataframe, because it is always timeIn + 1.

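import pyspark.sql.functions as F

# One output row per hour between inHour and outHour (both inclusive)
df2 = df.withColumn(
    'hour',
    F.explode(F.sequence('inHour', 'outHour'))
).groupBy(
    # count the presences per room, building, city, day, month and hour
    'room', 'building', 'city', 'day', 'month', 'hour'
).count().orderBy('hour')

df2.show()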
+----+--------+----+---+-----+----+-----+
|room|building|city|day|month|hour|count|
+----+--------+----+---+-----+----+-----+
|   1|       1|   1|  9|   11|   8|    5|
|   1|       1|   1|  9|   11|   9|    3|
|   1|       1|   1|  9|   11|  10|    3|
|   1|       1|   1|  9|   11|  11|    3|
|   1|       1|   1|  9|   11|  12|    3|
|   1|       1|   1|  9|   11|  13|    3|
+----+--------+----+---+-----+----+-----+
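If the timeIn/timeOut layout from the question is wanted anyway, the result can be reshaped afterwards. The following is a minimal sketch, assuming df2 is the dataframe produced above; the helper name add_interval_columns is hypothetical and not part of PySpark.

```python
def add_interval_columns(df2):
    """Rename 'hour' to 'timeIn' and add 'timeOut' = timeIn + 1.

    Assumes df2 has the columns room, building, city, day, month,
    hour and count, as in the result shown above.
    """
    # Imported inside the function so that merely defining the helper
    # does not require PySpark to be importable.
    from pyspark.sql import functions as F

    return (
        df2.withColumnRenamed("hour", "timeIn")
        .withColumn("timeOut", F.col("timeIn") + 1)
        .select(
            "room", "building", "city", "day", "month",
            "timeIn", "timeOut", "count",
        )
    )
```

Calling add_interval_columns(df2).show() should then print the columns in the order requested in the question. Note that with this rule a presence in hour 23 gets timeOut 24 rather than 0.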
