I have a dataset with these columns:
ID Cancer.Date Age Gender Col1 Col2
15 1998-03-26 35 F Yes No
53 NA 65 F No Yes
37 1996-11-10 84 M Yes No
58 NA 90 F Yes No
60 2016-12-08 70 M Yes No
12 2000-04-29 20 M No Yes
46 NA 72 F Yes No
59 2008-05-26 34 F Yes No
99 NA 89 M Yes No
46 2009-06-22 87 M No Yes
35 2000-02-20 24 F Yes Yes
26 NA 80 F Yes No
43 2001-02-20 74 M No No
77 NA 81 F No Yes
16 2015-11-03 52 F No Yes
04 NA 27 M Yes No
82 2004-05-08 45 M No No
01 2006-04-25 49 F No Yes
92 2004-10-26 40 F Yes Yes
67 2002-09-20 67 F No No
My goal is to perform the following tasks.
Step 1: Sort the rows by the Cancer.Date column in ascending order, with the earliest date on top. In this case, that is the row with date 1996-11-10.
Step 2: Check whether the date is NA. If it is not NA, find the 3 other observations that have the same Gender as that row and are closest to it in Age.
For example, after sorting by date (earliest first), the third row (ID 37) becomes the first row, with Gender = M and Age = 84. The three IDs with the same gender and the closest ages are (ID 46, Gender = M, Age = 87), (ID 99, Gender = M, Age = 89), and (ID 43, Gender = M, Age = 74).
Step 3: Repeat Step 2 for all rows where Cancer.Date is not NA (not missing).
The expected output:
ID Cancer.Date Age Gender Col1 Col2 Match.ID
37 1996-11-10 84 M Yes No 46,99,43
15 1998-03-26 35 F Yes No 59,92,35
. . . . . . .
Perhaps I could do this with for-loops, subsetting by Gender and computing the distance in Age, but I suspect this would be painfully slow. I would appreciate any suggestions on accomplishing this more efficiently.
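For reference, the subset-by-Gender, distance-by-Age idea could be sketched roughly as follows in base R. This is only a minimal sketch under some assumptions, not a definitive solution: the data frame name `df`, the helper `find_matches`, and the names `sorted` and `result` are hypothetical, candidates are drawn from all rows (including those with an NA date, as in the ID 37 example), ties in Age are broken by row order, and ID is read as character so that values such as 04 keep their leading zero.

```r
# Example data copied from the table above; ID is read as character
# so that leading zeros (e.g. "04") are preserved.
df <- read.table(text = "
ID Cancer.Date Age Gender Col1 Col2
15 1998-03-26 35 F Yes No
53 NA 65 F No Yes
37 1996-11-10 84 M Yes No
58 NA 90 F Yes No
60 2016-12-08 70 M Yes No
12 2000-04-29 20 M No Yes
46 NA 72 F Yes No
59 2008-05-26 34 F Yes No
99 NA 89 M Yes No
46 2009-06-22 87 M No Yes
35 2000-02-20 24 F Yes Yes
26 NA 80 F Yes No
43 2001-02-20 74 M No No
77 NA 81 F No Yes
16 2015-11-03 52 F No Yes
04 NA 27 M Yes No
82 2004-05-08 45 M No No
01 2006-04-25 49 F No Yes
92 2004-10-26 40 F Yes Yes
67 2002-09-20 67 F No No
", header = TRUE, colClasses = c(ID = "character"), stringsAsFactors = FALSE)

df$Cancer.Date <- as.Date(df$Cancer.Date)

# Step 1: sort by date, earliest first; rows with an NA date go last.
sorted <- df[order(df$Cancer.Date, na.last = TRUE), ]

# Hypothetical helper: IDs of the k rows that share the Gender of row i
# and have the smallest absolute Age difference (row i itself is excluded).
find_matches <- function(i, data, k = 3) {
  candidates <- which(data$Gender == data$Gender[i] & seq_len(nrow(data)) != i)
  age_gap <- abs(data$Age[candidates] - data$Age[i])
  nearest <- candidates[order(age_gap)][seq_len(min(k, length(candidates)))]
  paste(data$ID[nearest], collapse = ",")
}

# Steps 2 and 3: only rows with a non-missing date receive a Match.ID.
has_date <- !is.na(sorted$Cancer.Date)
sorted$Match.ID <- NA_character_
sorted$Match.ID[has_date] <- vapply(which(has_date), find_matches,
                                    character(1), data = sorted)

result <- sorted[has_date, ]
print(result)
```

With the sample data, the first row of `result` is ID 37 with Match.ID 46,99,43. This version still compares every dated row against every row of the same gender, so on a large dataset it would probably be worth splitting the data by Gender once up front rather than filtering inside the helper.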
Question from: https://stackoverflow.com/questions/65880695/r-finding-a-match-based-on-age-and-sex