i'm preparing a dataset to train machine learning model using PySpark. The dataframe on which i'm working contains a thousands of records about presences registered inside rooms of different buildings and cities in different days. These presences are in this format:
+----+----+----+---+---------+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
| 1| 1| 1| 9| 11| 8| 27| 13| 15|
| 1| 1| 1| 9| 11| 8| 28| 13| 5|
| 1| 1| 1| 9| 11| 8| 32| 13| 7|
| 1| 1| 1| 9| 11| 8| 32| 8| 50|
| 1| 1| 1| 9| 11| 8| 32| 8| 48|
+----+--------+----+---+-----+------+--------+-------+---------+
inHour and inMinute stands for the hour and minute of access and, of course, outHour and outMinute refers to time of exit. The hours are considered in a 0-23 format. All the column contains just integer values.
What i'm missing is the target value of my machine learning model which is the number of persons for the combination of room, building, city, day, month and a time interval. I will try to explain better, the first row refers to a presence with access time 8 and exit time 13 so it should be counted in the record with the interval 8-9, 9-10, 10-11, 11-12 and also 13-14. What i want to accomplish is something like the following:
+----+----+----+---+---------+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
| 1| 1| 1| 9| 11| 8| 9| X|
| 1| 1| 1| 9| 11| 9| 10| X|
| 1| 1| 1| 9| 11| 10| 11| X|
| 1| 1| 1| 9| 11| 11| 12| X|
| 1| 1| 1| 9| 11| 12| 13| X|
+----+--------+----+---+-----+------+-------+-----+
So the 4th row of the first table should be counted in the 1st row of this table and so on...
question from:https://stackoverflow.com/questions/65645790/manipulate-a-complex-dataframe-in-pyspark