Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Suppose that I have two sets of identifiers id1 and id2 in a data frame. How can I create a new identifier id3 that works as follows:

I consider id1 the stricter key, so observations are grouped first by id1 and then by id2. If two sets of rows have different values of id2 but some of their rows share the same id1, both sets should get the same value of id3 (the exact value of id3 doesn't matter much).

df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7),
                 id3 = c(1, 1, 2, 2, 2, 3))

Rows 1 and 2 are grouped together because they have the same id1. Rows 3, 4 and 5 are grouped together because rows 3 and 4 share an id1 and rows 4 and 5 share an id2.

Can someone help? I would rather have a solution with dplyr that encompasses a general case in which there is an arbitrary number of possible values in the id columns.


1 Answer

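This is a graph theory problem. Each distinct id1 value and each distinct id2 value is a separate node, and each row of df is a link between one id1 node and one id2 node. The id values are prefixed with their column name so that, say, id1 = 2 and id2 = 2 are not treated as the same node. You are looking for the connected component that each id belongs to; since the links have no direction, the graph is undirected.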

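library(dplyr)
library(igraph)
# Prefix each id with its column name so equal values in id1 and id2 stay distinct nodes
df <- df %>% mutate(from = paste0('id1', '_', id1), to = paste0('id2', '_', id2))
# Each row becomes an undirected edge between its id1 node and its id2 node
dg <- graph_from_data_frame(df %>% select(from, to), directed = FALSE)
# membership is named by vertex, so it can be looked up by the node label in `from`
df <- df %>% mutate(id3 = components(dg)$membership[from])
df %>% select(id1, id2, id3)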

#>   id1 id2 id3
#> 1   1   4   1
#> 2   1   3   1
#> 3   2   1   2
#> 4   2   2   2
#> 5   5   2   2
#> 6   6   7   3
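
If you would rather not depend on igraph, the same grouping can be sketched in base R by repeatedly pushing the smallest label through every id1 group and every id2 group until nothing changes. This is only a rough alternative, not part of the answer above: `link_ids` is a made-up helper name, and the sketch assumes neither id column contains NA.

```r
# Hypothetical helper (not from the answer above): label rows by connected component.
# Assumes id1 and id2 have the same length and contain no NA values.
link_ids <- function(id1, id2) {
  # Start with every row in its own group, labelled by its row index
  label <- seq_along(id1)
  repeat {
    previous <- label
    # Rows sharing an id1 take the smallest label found in that id1 group
    label <- ave(label, id1, FUN = min)
    # Rows sharing an id2 take the smallest label found in that id2 group
    label <- ave(label, id2, FUN = min)
    # Stop once a full pass changes nothing
    if (identical(label, previous)) break
  }
  # Renumber the labels as 1, 2, 3, ... in order of first appearance
  match(label, unique(label))
}

df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7))
df$id3 <- link_ids(df$id1, df$id2)
df$id3
# 1 1 2 2 2 3
```

Because the helper takes two plain vectors, it should also drop into a dplyr pipeline as df %>% mutate(id3 = link_ids(id1, id2)). For large data with long chains of linked ids the loop may need many passes, so the igraph version is likely the better choice there.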
