Is there a better solution for checking the elements of one character vector against another using the tidyverse?

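Problem description

Hello!
My goal is to compare two character vectors: one of synonyms and one of mixture names. The strings in mixNames do not exactly match those in synonyms, so some string comparison is needed. I want to extract the elements of synonyms that look like entries in mixNames. I tried to do this using only the tidyverse, but failed. I found a solution using base R. I know there is a better way, but I can't figure it out...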

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.6.1
#> Warning: package 'tidyr' was built under R version 3.6.1
#> Warning: package 'dplyr' was built under R version 3.6.1

# Acetaminophen

synonyms <- c("Pediatrix","Percocet-5","Percocet-Demi","Perdolan Mono","Perfalgan", 
              "Phenaphen","Phenaphen W/Codeine","Phenipirin","Phogoglandin","Pinex", 
              "Piramin","Pirinasol","Plicet","Polmofen","Predimol","Predualito",
              "Prodol","Prontina","Puernol","Pulmofen", "Pyregesic-C")

mixNames <- c("Liquiprin","Midol Maximum Strength","Midol PM Night Time Formula",
              "Midol Regular Strength" ,"Midol Teen Formula","Naldegesic",
              "Ornex Severe Cold Formula","Percocet","Percogesic with Codeine",
              "Propacet" )

Failed attempts:

#####STUFF THAT DIDNT WORK!!!!

# cross2(
#   .x = synonyms, .y = mixNames  #lists - each list has 2 lists - each of those is an atomic vector of 1
# ) %>% 
#   map_dfc(lift(str_detect)) #lift - modifies function to take a list of arguments - works for nested lists 

#this returns a df just like the apply 

# mix_syn_lgl_df <- map_dfc(
#   mixNames,
#   ~ map_lgl(synonyms, str_detect, pattern = .x)
# )

# colnames(mix_syn_lgl_df) <- mixNames
# 
# mix_syn_lgl_df$synonyms <- synonyms

This actually works:


#remove mixture names from synonyms

mix_syn_lgl_mat <- sapply(mixNames, function(x){
  str_detect(string = synonyms, pattern = x)
}) #returns a matrix 21x10 of logicals while preserving colnames

rownames(mix_syn_lgl_mat) <- synonyms #add synonyms as rownames
#create a new object with a new col of sum of TRUES in row
mix_syn_lgl_mat2 <- cbind(mix_syn_lgl_mat, rowSums(mix_syn_lgl_mat)) 
#take the numerical matrix mix_syn_lgl_mat2 and return the row names where the last col (rowsums) > 0
badNames <- row.names(mix_syn_lgl_mat2[mix_syn_lgl_mat2[, ncol(mix_syn_lgl_mat2)] > 0, ])
#filter out those names from the synonyms vector
pureSyn <- synonyms[!(synonyms %in% badNames)]

Created on 2019-10-29 by the reprex package (v0.3.0)

Tags: r, tidyverse, string-comparison, purrr

Solution


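It looks like you want the synonyms vector without any values that overlap with mixNames. You can subset synonyms to remove the matches. Here str_c / paste with collapse = "|" joins all of the mixNames into a single regular-expression pattern. Then you just use partial string matching (i.e., str_detect or grepl).

Here using stringr, which is slightly tidier: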

synonyms[str_detect(synonyms, str_c(mixNames, collapse = "|"), negate = TRUE)]

Or using base R functions:

synonyms[!grepl(paste(mixNames, collapse = "|"), synonyms)]
# OR
grep(paste(mixNames, collapse = "|"), synonyms, value = TRUE, invert = TRUE)
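
One caveat with the collapsed pattern is that each mixture name is interpreted as a regular expression, so a name containing characters such as ( or . may not match the way you expect. Below is a minimal base R sketch of a literal-matching variant; the "DrugX" entries are made-up names used only to illustrate the point, and the example vectors are shortened. In stringr, wrapping a pattern in fixed() plays the same role as fixed = TRUE here.

```r
# Shortened example vectors. The "DrugX" entries are hypothetical and exist
# only to show a name that contains regex metacharacters.
synonyms <- c("Percocet-5", "Perfalgan", "DrugX (PM) Caplets")
mixNames <- c("Percocet", "DrugX (PM)")

# Regex matching treats "(PM)" as a capture group, so the DrugX synonym
# is not detected and would wrongly be kept.
regex_hits <- grepl(paste(mixNames, collapse = "|"), synonyms)

# Literal matching: one logical vector per mixture name (fixed = TRUE),
# then combine them element-wise with OR.
hits <- lapply(mixNames, function(p) grepl(p, synonyms, fixed = TRUE))
is_mixture <- Reduce(`|`, hits)

pureSyn <- synonyms[!is_mixture]
pureSyn
```

For the vectors in the question this makes no difference, since none of the mixture names contain regex metacharacters, but it may be the safer default if the names come from an external source.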

As a side note, if you want to look at alternative ways of matching strings, check out stringdist or other string-distance functions/packages.
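
To give a rough idea of what a distance-based comparison looks like, here is a small sketch that uses base R's adist() (from the utils package) in place of stringdist, so it runs without extra packages. The vectors are shortened versions of the ones in the question, and how to turn the distances into a keep/drop rule (for example, a cutoff value) is left open because it depends on your data.

```r
synonyms <- c("Percocet-5", "Percocet-Demi", "Perfalgan")
mixNames <- c("Percocet", "Propacet")

# Levenshtein (edit) distance between every synonym and every mixture name.
# adist() does not label the result from the values, so set dimnames by hand.
d <- adist(synonyms, mixNames)
dimnames(d) <- list(synonyms, mixNames)

# For each synonym, the closest mixture name and its distance.
closest <- mixNames[apply(d, 1, which.min)]
nearest <- data.frame(
  synonym  = synonyms,
  closest  = closest,
  distance = apply(d, 1, min)
)
nearest
```

A small distance suggests the synonym is a near-duplicate of a mixture name, which can catch spelling variants that a substring match would miss.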

