I have 14.5 years of budget data (payments, contract IDs, project types, etc.) from which I am trying to build an 18-month time-series forecast. The data started out as individual payments on contract IDs by (non-continuous) date. Using Excel, I pivoted it into total payments by month; later on I will include total active contracts in a month, composition of contract type, etc. There are a total of 3134 day-rows (out of 5296) on which payments were made; days on which no payments were made are not recorded in this data*.
The features I'm currently using are listed and structured as follows (not all features are below, just trying to get a model piped together using a linear t for now):
> head(exp)
     Amount Day Month Year t
1  269909.4   5     7 2000 1
2  792078.6   6     7 2000 2
3  140065.5   7     7 2000 3
4  190553.2  11     7 2000 4
5  119208.6  12     7 2000 5
6 1068156.3  16     7 2000 6
> str(exp)
'data.frame': 3134 obs. of 5 variables:
$ Amount: num 269909 792079 140066 190553 119209 ...
$ Day : int 5 6 7 11 12 16 17 21 26 28 ...
$ Month : int 7 7 7 7 7 7 7 7 7 7 ...
$ Year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ t : int 1 2 3 4 5 6 7 8 9 10 ...
I'm running into these problems / questions:
dplyr is not at all liking the ts() objects I've used in my data.frame, so filtering and sorting by month/contract/contract type isn't working. What's the best approach here? I'm unsure about the pros/cons of using ts vs. timeSeries, especially as they relate to compatibility with other packages.
*Is this easier if I start with a vector of all 5296 days between 7/1/00 and 12/31/14, as well as a t <- 1:5296, and key these 3134 days of payments to that full list of days?
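To make the footnote idea concrete, here is a minimal sketch of keying the recorded payment days to a full calendar of days. It assumes one row per payment day (as in the str() output above) and only rebuilds the six rows shown by head(exp) as toy data. The names all_days, daily, monthly and monthly_ts are just illustrative, and filling missing days with 0 is my assumption about what "no payment" should mean. The data stays in an ordinary data.frame with a Date column, and a ts() object is only created at the end for the modelling step.

```r
# Toy data: the six rows shown by head(exp) above.
exp <- data.frame(
  Amount = c(269909.4, 792078.6, 140065.5, 190553.2, 119208.6, 1068156.3),
  Day    = c(5, 6, 7, 11, 12, 16),
  Month  = c(7, 7, 7, 7, 7, 7),
  Year   = c(2000, 2000, 2000, 2000, 2000, 2000)
)

# Build a proper Date column from the Year / Month / Day columns.
exp$Date <- as.Date(ISOdate(exp$Year, exp$Month, exp$Day))

# Full calendar of days covering the period of interest.
all_days <- data.frame(
  Date = seq(as.Date("2000-07-01"), as.Date("2014-12-31"), by = "day")
)

# Left join: keep every calendar day, attach payments where they exist.
# Assumes at most one row per Date in exp.
daily <- merge(all_days, exp[, c("Date", "Amount")], by = "Date", all.x = TRUE)

# Assumption: a day with no recorded payment means a payment of 0.
daily$Amount[is.na(daily$Amount)] <- 0

# Linear time index over the full calendar.
daily$t <- seq_len(nrow(daily))

# Monthly totals, still in a plain data.frame.
daily$YearMonth <- format(daily$Date, "%Y-%m")
monthly <- aggregate(Amount ~ YearMonth, data = daily, FUN = sum)

# Convert to ts only when a modelling function needs it.
monthly_ts <- ts(monthly$Amount, start = c(2000, 7), frequency = 12)
```

Because daily and monthly hold only ordinary columns (Date, numeric, character), filtering and sorting them with dplyr or base R should behave as usual; the ts object is kept separate rather than stored inside the data.frame.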