Ultimately, I think the method you seek is to allow max_dist to be a vector of distances, one per join column, where you might do stringdist_inner_join(..., max_dist=c(0,2)). Unfortunately, while that has been requested (in 2017: https://github.com/dgrtwo/fuzzyjoin/issues/36 and https://github.com/dgrtwo/fuzzyjoin/issues/21), it does not appear to be implemented yet.
A work-around, if you can afford the larger intermediate join product, is to let the fuzzy join match on every column and then filter out the rows where decade did not match exactly.
Lacking your data, I'll demonstrate using ggplot2::diamonds. Here, I want normal stringdist functionality for cut and exact matches for clarity.
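library(fuzzyjoin)

# lookup table with misspelled cut values
d <- data.frame(cut = c("Idea", "Premiums", "Premioom", "VeryGood", "VeryGood", "Faiir"),
                clarity = rep(c("SI1", "SI2"), 3),
                type = 1:6)
data("diamonds", package = "ggplot2")
diamonds <- diamonds[1:10,]
# fuzzy-match on both columns (default max_dist = 2), so clarity can match inexactly too
joined <- stringdist_inner_join(diamonds, d, by = c("cut", "clarity"))
joined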
# # A tibble: 8 x 13
# carat cut.x color clarity.x depth table price x y z cut.y clarity.y type
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <chr> <int>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 Idea SI1 1
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premiums SI2 2
# 3 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premioom SI1 3
# 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Premiums SI2 2
# 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 VeryGood SI2 4
# 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 VeryGood SI1 5
# 7 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Faiir SI2 6
# 8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 VeryGood SI1 5
# keep only the rows where clarity matched exactly
subset(joined, clarity.x == clarity.y)
# # A tibble: 2 x 13
# carat cut.x color clarity.x depth table price x y z cut.y clarity.y type
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <chr> <int>
# 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premioom SI1 3
# 2 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 VeryGood SI1 5
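If it helps to see what that join-then-filter step does, here is a minimal base-R sketch of the same logic without fuzzyjoin. It is only an illustration under assumptions: the toy data frames x and y are made up, and I use utils::adist (generalized Levenshtein distance) as a stand-in for the distance that stringdist_inner_join computes, so results could differ on inputs where the two metrics disagree.

```r
# Hypothetical toy data: fuzzy-match on cut, exact-match on clarity.
x <- data.frame(cut = c("Ideal", "Premium", "Very Good"),
                clarity = c("SI2", "SI1", "SI1"),
                stringsAsFactors = FALSE)
y <- data.frame(cut = c("Idea", "Premioom", "VeryGood"),
                clarity = c("SI1", "SI1", "SI1"),
                type = 1:3,
                stringsAsFactors = FALSE)

# Cartesian product of x and y; shared column names get .x/.y suffixes.
pairs <- merge(x, y, by = NULL, suffixes = c(".x", ".y"))

# Edit distance between the two cut values of each candidate pair.
# adist() is an assumption here, standing in for fuzzyjoin's metric.
cut_dist <- mapply(function(a, b) adist(a, b)[1, 1], pairs$cut.x, pairs$cut.y)

# Keep pairs that are close on cut AND identical on clarity.
result <- pairs[cut_dist <= 2 & pairs$clarity.x == pairs$clarity.y, ]
result
```

The full cross join makes the cost of the work-around explicit: every pair of rows is formed before any filtering, which is why it only suits cases where that intermediate product fits in memory.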