How can we filter a join in R using DPLYR

Question

Welcome To Ask or Share your Answers For Others

How can we filter a join in R using DPLYR

asked Feb 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am working on a project to find the number of times a service has to be re-worked but only within the first 30 days of the end of the original service. Using the sample datasets below:

work <- tribble(
  ~Client ,  ~Initial_Date_In, ~Initial_Date_Out,
  'Car1',     '2019-01-01', '2019-01-02',
  'Car1',     '2019-06-01', '2019-06-03',
  'Car2',     '2019-01-01', '2019-01-02',
  'Truck1',     '2019-06-01', '2019-06-02',
  'Truck2',     '2019-04-01', '2019-04-02',
  'Truck3',     '2019-08-01', '2019-08-04',
  'Van1',     '2019-01-01', '2019-01-02',
  'Van2',     '2019-06-01', '2019-06-02',
  'Truck4',     '2019-10-05', '2019-10-09',
  
  
)

rework <- tribble(
  ~Client ,  ~Date_In, ~Date_Out,
  'Car1',     '2019-01-05', '2019-01-07',
  'Car1',     '2019-01-09', '2019-01-10',
  'Car1',     '2019-06-09', '2019-06-10',
  'Truck3',     '2019-08-07', '2019-08-08',
  'Van1',     '2019-03-01', '2019-03-01',
  'Van2',     '2019-06-09', '2019-06-11',
  'Truck4',     '2019-11-01', '2019-11-02',
  
  
)

work$Initial_Date_In <- lubridate::as_date(work$Initial_Date_In)
work$Initial_Date_Out <- lubridate::as_date(work$Initial_Date_Out)
rework$Date_In <- lubridate::as_date(rework$Date_In)
rework$Date_Out <- lubridate::as_date(rework$Date_Out)

I tried multiple things, such as :

work %>% 
  left_join(select(rework, Client,Date_Out), by = ("Client" = "Client")) %>% 
  group_by(Client) %>% 
  filter(max(Date_Out) >Initial_Date_Out & max(Date_Out) < (Initial_Date_Out+duration(30,"days")))

This code seems to pull the max Initial_Date_In(Out), but I want all from the work table where there is an appropriate entry in rework (displaying on the max Date_Out from the rework table)

The expected output would be something like:

Client    Initial_Date_In    Initial_Date_Out    Date_Out    #_Of_Rework_Visits
Car1      2019-01-01         2019-01-02          2019-01-10    2
Car1      2019-06-01         2019-06-03          2019-06-10    1
Truck3    2019-08-01         2019-08-04          2019-08-08    1
Van1      2019-06-01         2019-06-02          2019-06-11    1
Truck4    2019-10-05         2019-10-09          2019-11-02    1

Any tips on what I am doing wrong?

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

158 views

1 Answer

深蓝 · Answer 1 · 2021-02-16T17:36:49+0000

Update
Here's an approach than avoids doing any kind of formal join.
Stack work and rework with bind_rows, then use a service_ix column (index for service appts within a given Client) as a grouping/summarizing variable:

work %>% 
  mutate(service = "original") %>% 
  rename(Date_In = "Initial_Date_In", Date_Out = "Initial_Date_Out") %>% 
  group_by(Client) %>% 
  mutate(service_ix = row_number()) %>% 
  bind_rows(rework %>% 
              mutate(service = "rework") %>% 
              select(Client, Date_In, Date_Out, service)) %>% 
  arrange(Client, Date_In) %>% 
  fill(service_ix) %>% 
  group_by(Client, service_ix) %>% 
  summarise(initial_date_in = Date_In[service == "original"],
            initial_date_out = Date_Out[service == "original"],
            max_date_out = max(Date_Out),
            n_reworks = sum(service == "rework")) %>%
  filter(n_reworks > 0) %>% 
  ungroup()

Unpacking a bit:

Added a service column indicating initial vs rework. Here's the combined df after bind_rows():

# A tibble: 16 x 5
# Groups:   Client [8]
   Client Date_In    Date_Out   service  service_ix
   <chr>  <date>     <date>     <chr>         <int>
 1 Car1   2019-01-01 2019-01-02 original          1
 2 Car1   2019-01-05 2019-01-07 rework           NA
 3 Car1   2019-01-09 2019-01-10 rework           NA
 4 Car1   2019-06-01 2019-06-03 original          2
 5 Car1   2019-06-09 2019-06-10 rework           NA
 6 Car2   2019-01-01 2019-01-02 original          1
 7 Truck1 2019-06-01 2019-06-02 original          1
 8 Truck2 2019-04-01 2019-04-02 original          1
 9 Truck3 2019-08-01 2019-08-04 original          1
10 Truck3 2019-08-07 2019-08-08 rework           NA
11 Truck4 2019-10-05 2019-10-09 original          1
12 Truck4 2019-11-01 2019-11-02 rework           NA
13 Van1   2019-01-01 2019-01-02 original          1
14 Van1   2019-03-01 2019-03-01 rework           NA
15 Van2   2019-06-01 2019-06-02 original          1
16 Van2   2019-06-09 2019-06-11 rework           NA

The fill step replaces the service_ix NA values:

   Client Date_In    Date_Out   service  service_ix
   <chr>  <date>     <date>     <chr>         <int>
 1 Car1   2019-01-01 2019-01-02 original          1
 2 Car1   2019-01-05 2019-01-07 rework            1
 3 Car1   2019-01-09 2019-01-10 rework            1
 4 Car1   2019-06-01 2019-06-03 original          2
 5 Car1   2019-06-09 2019-06-10 rework            2

Now group by Client and service_ix, and summarize as needed.

# A tibble: 6 x 6
  Client service_ix initial_date_in initial_date_out max_date_out n_reworks
  <chr>       <int> <date>          <date>           <date>           <int>
1 Car1            1 2019-01-01      2019-01-02       2019-01-10           2
2 Car1            2 2019-06-01      2019-06-03       2019-06-10           1
3 Truck3          1 2019-08-01      2019-08-04       2019-08-08           1
4 Truck4          1 2019-10-05      2019-10-09       2019-11-02           1
5 Van1            1 2019-01-01      2019-01-02       2019-03-01           1
6 Van2            1 2019-06-01      2019-06-02       2019-06-11           1

Categories

How can we filter a join in R using DPLYR

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags