Removing rows with similar (but not identical) strings in R

Problem description

I have a large number of Word files that were imported into R as text (one report per cell), and each subject has an ID.

I then used the distinct function from dplyr to remove the duplicates.

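However, some reports are the same apart from minor differences (for example, extra or missing words, extra spaces, and so on), so dplyr does not treat them as duplicates. Is there an efficient way to remove "highly similar" items in R?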

This creates a sample dataset (greatly simplified compared with the original data I am working with):

d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))

Here is the dplyr code that removes exact duplicates. However, you will notice that the rows with IDs 3, 7 and 8 are almost identical:

library(dplyr)

d %>% 
  distinct(text, .keep_all = T) %>% 
  View()
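Some of the differences described above (letter case, extra spaces) can be removed before any fuzzy matching is attempted. The following is a minimal sketch in base R, assuming a hypothetical helper named normalize_text and a small stand-in data frame rather than the original reports. It does not catch substituted words, such as "however" for "but".

```r
# Stand-in data (hypothetical, not the original reports)
reports <- data.frame(
  ID = 1:3,
  text = c("All plastics are polymers.",
           "all plastics   are polymers. ",
           "Plastics are poor conductors."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, trim the ends, and collapse
# every run of whitespace into a single space
normalize_text <- function(x) gsub("\\s+", " ", trimws(tolower(x)))

# Keep the first row for each normalized text; the original text is untouched
deduped <- reports[!duplicated(normalize_text(reports$text)), ]
deduped
```

With dplyr, the same key could be added with mutate() and passed to distinct(key, .keep_all = TRUE).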

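There appears to be a like function that can be used with dplyr, but I could not find out how to apply it here exactly (it also seems to work only for short strings, such as words): dplyr filter() with SQL-like %wildcard%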

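There is also a package, tidystringdist, that can calculate how similar two strings are, but I could not find a way to apply it here to remove items that are similar but not identical. https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html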

Any suggestions or guidance at this point?

Update:

It looks like the stringdist package might solve this problem, as suggested by the users below.

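This question from the RStudio community site deals with a similar problem, although the desired output is a little different. I applied their code to my data and was able to identify the similar entries. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2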

library(tidystringdist)
library(tidyverse)

# First remove any exact duplicates
# (no View() at the end: it returns NULL, which would overwrite d):
d <- d %>% 
  distinct(text, .keep_all = TRUE)

# this will identify the similar ones and place them in one vector called match: 
match <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match

# create negate function of %in%:

 `%!in%` = Negate(`%in%`)

# this will remove those in the `match` out of `d` :
d2 = d %>% 
  filter(text %!in% match) %>% 
  arrange(text)


With the code above, d2 has no duplicate or similar entries at all, but I would like to keep one copy.

Any ideas on how to keep one copy (for example, only the first occurrence)?

Tags: r, filter, dplyr, duplicates, similarity

Solution


library(stringdist)


# Character vector of the distinct report texts (the column is named 'text')
dd <- d[ !duplicated( d[['text']] ) , 'text' ]
dd
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."                                                                                                                                                                              
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."                                                                                                                                                                                                          
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."    
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains." 
[5] "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."

unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0

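The distance is related to string length, so longer strings can have larger maximum distances, but for this case an upper bound of 20 seems sufficient. A proper solution would use some ratio of the "distance" to the nchar of that vector element.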
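The ratio idea can be sketched as follows. This is only an illustration under stated assumptions: it uses base R's adist (generalized Levenshtein distance) instead of stringdist with method="dl", the strings are short stand-ins, and dividing by the length of the longer string in each pair is one possible choice of ratio, not the one the answer prescribes.

```r
# Short stand-in strings (hypothetical)
x <- c("the quick brown fox",
       "the quick brown fax",
       "a completely different sentence")

# Pairwise edit distances as a square matrix (base R, utils::adist)
dist_mat <- adist(x)

# Length of the longer string for every pair
len <- nchar(x)
longer <- outer(len, len, pmax)

# Distance as a fraction of the longer string's length (0 = identical)
ratio <- dist_mat / longer
round(ratio, 2)
```

To stay with the Damerau-Levenshtein distance used above, stringdist::stringdistmatrix(x, x, method = "dl") could be substituted for adist(x).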

This is not offered as a finished solution, but rather as steps 1 and 2 of a 4-step process.
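The answer does not say what the remaining steps are. One possible continuation, given here only as a sketch, is to walk through the strings in order and drop every later string that is too close to one already kept, so that the first occurrence survives. The helper name keep_first_similar and the 0.1 threshold are assumptions, and base R's adist stands in for stringdist.

```r
# Hypothetical helper: TRUE for strings to keep, FALSE for later near-duplicates
keep_first_similar <- function(x, max_ratio = 0.1) {
  len <- nchar(x)
  # Edit distance divided by the length of the longer string in each pair
  ratio <- adist(x) / outer(len, len, pmax)
  keep <- rep(TRUE, length(x))
  for (i in seq_along(x)) {
    if (!keep[i]) next                # already dropped as a near-duplicate
    later <- seq_along(x) > i         # only compare with strings after i
    # which() ignores NA ratios (for example, two empty strings)
    keep[which(later & ratio[i, ] <= max_ratio)] <- FALSE
  }
  keep
}

x <- c("the quick brown fox",
       "the quick brown fax",
       "something else entirely",
       "the quick  brown fox")
keep <- keep_first_similar(x)
x[keep]
```

Applied to the data frame from the question, the call would be d[keep_first_similar(d$text), ]. The pairwise matrix grows with the square of the number of reports, so this is only practical for moderately sized inputs.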

