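r - Remove rows with similar (but not identical) strings in R
Problem description
I have a large number of Word files that were imported into R as text (one report per cell), and each subject has an ID.
I then use the distinct function from dplyr to remove the duplicates.
However, some reports are the same apart from minor differences (for example, extra or missing words, extra spaces, and so on), so dplyr does not treat them as duplicates. Is there an efficient way to remove "highly similar" items in R?
This creates an example dataset (greatly simplified from the original data I am working with):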
d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))
This is the dplyr code that removes exact duplicates. However, you will notice that items 3, 7, and 8 are almost identical:
library(dplyr)
d %>%
distinct(text, .keep_all = TRUE) %>%
View()
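As a cheap first pass before any fuzzy matching, differences in case and spacing alone can be removed by comparing a normalized key instead of the raw text. The sketch below is an illustration on toy data and is not part of the original code; `normalize_text` is a hypothetical helper, and it will not catch texts where words were changed or added.

```r
# Hypothetical helper: build a comparison key that ignores case and spacing
normalize_text <- function(x) {
  x <- tolower(x)                    # ignore upper/lower case
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # drop leading/trailing whitespace
}

# Toy data (made up for illustration)
toy <- data.frame(
  ID = 1:3,
  text = c("All plastics are polymers.",
           "all  plastics are polymers. ",
           "Plastics are insulators."),
  stringsAsFactors = FALSE
)

# Keep the first row for each normalized key; the original text is untouched
res <- toy[!duplicated(normalize_text(toy$text)), ]
res
```

Rows 1 and 2 share the same key, so only the first of them is kept.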
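There appears to be a like function in dplyr, but I could not work out exactly how to apply it here (it also seems to work only on short strings such as words): dplyr filter() with SQL-like %wildcard%
There is also a package, tidystringdist, that can calculate how similar two strings are, but I could not find a way to apply it here to remove the items that are similar but not identical.
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html
Any suggestions or guidance at this point?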
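Update:
It looks like the stringdist package may solve this, as suggested by the users below.
This question from the RStudio community site deals with a similar problem, although the desired output is a bit different. I applied their code to my data and was able to identify the similar texts. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2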
library(tidystringdist)
library(tidyverse)
# First remove exact duplicates. Do not pipe into View() here,
# or d would be overwritten with View()'s return value:
d <- d %>%
  distinct(text, .keep_all = TRUE)
# This identifies the similar texts and collects them in a vector called match:
match <- d %>%
  tidy_comb_all(text) %>%
  tidy_stringdist() %>%
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>%
  .$match
# Create a negated version of %in%:
`%!in%` <- Negate(`%in%`)
# This removes every text listed in `match` from `d`:
d2 <- d %>%
  filter(text %!in% match) %>%
  arrange(text)
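With the code above, d2 contains none of the duplicated or similar texts at all, but I would like to keep one copy.
Any ideas on how to keep one copy (for example, only the first occurrence)?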
Solution
library(stringdist)
# Drop exact duplicates, keeping the first occurrence of each text
dd <- d[["text"]][ !duplicated( d[["text"]] ) ]
dd
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[5] "all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0
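The distances scale with string length (the largest possible distance between two strings is the length of the longer one), so a fixed cutoff is stricter for long strings than for short ones. For this case, though, an upper limit of 20 looks sufficient. A proper solution would use some ratio of the "distance" to the nchar of that vector element.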
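The ratio idea can be sketched as follows. This is only an illustration on made-up strings: dividing by the length of the longer string in each pair is one possible choice of denominator, not something the answer specifies.

```r
library(stringdist)

# Toy strings standing in for the deduplicated reports (made up for illustration)
x <- c("Plastics are poor conductors of heat.",
       "plastics are poor conductors of heat.",
       "Most plastics are insulators.")

# Raw Damerau-Levenshtein distances between every pair of strings
raw <- stringdistmatrix(x, x, method = "dl")

# Divide each distance by the length of the longer string in the pair,
# giving a ratio between 0 (identical) and 1 (nothing in common)
len <- nchar(x)
ratio <- raw / outer(len, len, pmax)

round(ratio, 2)
```

A cutoff on `ratio` (for example, a small fraction such as 0.1, which is an assumed value) then means the same relative amount of change for short and long texts.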
This is not offered as a finished solution, but rather as steps 1 and 2 of 4.
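The remaining steps are not spelled out in the answer. One possible way to finish, sketched here under assumptions, is to group texts whose distance falls under a cutoff and keep only the first row of each group. `keep_first_similar` and its `max_dist` argument are hypothetical names; the default of 20 follows the remark above, and single-linkage clustering is a choice made for this sketch.

```r
library(stringdist)

# Hypothetical helper: keep the first row of each near-duplicate group.
# `max_dist` is an assumed cutoff on the raw edit distance.
keep_first_similar <- function(df, col = "text", max_dist = 20) {
  txt <- df[[col]]
  if (length(txt) < 2) return(df)  # hclust() needs at least two items
  # Pairwise Damerau-Levenshtein distances, returned as a `dist` object
  dm <- stringdistmatrix(txt, method = "dl")
  # Single linkage: texts connected by a chain of pairs within
  # `max_dist` of each other end up in the same group
  grp <- cutree(hclust(dm, method = "single"), h = max_dist)
  # duplicated() flags every group member after the first one
  df[!duplicated(grp), , drop = FALSE]
}

# Toy data (made up for illustration)
toy <- data.frame(
  ID = 1:4,
  text = c("Plastics are poor conductors of heat and electricity.",
           "All plastics are polymers but not all polymers are plastic.",
           "Plastics are poor conductors of heat or electricity.",
           "all plastics are polymers however not all polymers are plastic."),
  stringsAsFactors = FALSE
)

# The toy strings are short, so a lower cutoff than 20 is used here
res <- keep_first_similar(toy, max_dist = 10)
res
```

Rows 3 and 4 are near-copies of rows 1 and 2, so only the first occurrences remain. Single linkage can chain together texts that are each close to a neighbour but far from one another, so the cutoff deserves a check against real data.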