Grouping values in a column by common words

Problem description

I have a data frame:

ID    message
1     request body: <?xml version="2.0",<code> dwfkjn34241
2     request body: <?xml version="2.0",<code> jnwg3425
3     request body: <?xml version="2.0", <PlatCode>, <code> qwefn2
4     received an error
5     <MarkCheckMSG>
6     received an error
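
For anyone who wants to reproduce the example, the table above could be built in R roughly as follows. This is only a sketch: the variable name `df` is an assumption, chosen to match the call shown in the answer, and the strings are copied from the table as-is.

```r
# Hypothetical reconstruction of the example data frame shown above.
# Single-quoted R strings are used because the messages contain double quotes.
df <- data.frame(
  ID = 1:6,
  message = c(
    'request body: <?xml version="2.0",<code> dwfkjn34241',
    'request body: <?xml version="2.0",<code> jnwg3425',
    'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
    'received an error',
    '<MarkCheckMSG>',
    'received an error'
  ),
  stringsAsFactors = FALSE  # keep messages as character, not factor
)

str(df)
```

Keeping `message` as a character column matters here, because string distance functions expect character input rather than factors.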

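I want to group the values in a column by the words they have in common. The first three rows of the message column can be treated as one group even though they differ slightly, and the fourth and sixth rows belong to a second group. How can I group the values in the message column using word and structural similarity as the criteria? What would be a good approach? The data frame above is only an example, so I am more interested in a method that fits the problem conceptually than in a regex-based solution.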

Tags: r, dataframe, group-by, cluster-computing

Solution


Perhaps try a k-medoids clustering analysis with a string distance measure?

library(cluster)
library(stringdist)

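# Cluster strings with k-medoids (PAM) on a string distance matrix. Every k from
# k_from upward is tried; the fit with the largest average silhouette width wins.
find_medoids <- function(x, k_from, method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)) {
  diss <- stringdist::stringdistmatrix(x, x, method = method, weight = weight)
  dimnames(diss) <- list(x, x)
  # pam() requires k < number of observations, so cap the upper bound at length(x) - 1
  k_to <- min(length(unique(x)), length(x) - 1L)
  trials <- lapply(
    seq(from = k_from, to = k_to),
    function(i) cluster::pam(diss, i, diss = TRUE)
  )
  sel <- which.max(vapply(trials, `[[`, numeric(1L), c("silinfo", "avg.width")))
  trials[[sel]]
}

# Look up the cluster id of each string by name in the fitted PAM object.
map_cluster <- function(x, med_obj) {
  unname(med_obj$clustering[x])
}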

Output

> map_cluster(df$message, find_medoids(df$message, 2, "cosine"))
[1] 1 1 1 2 3 2

For your real data, you may have to adjust some parameters such as the string distance method (the example above used cosine distance).
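
One way to explore that choice is to fit the same number of clusters under several distance methods and compare the average silhouette widths. The sketch below is an illustration under assumptions, not part of the original answer: the candidate methods, `q = 2` for the q-gram based distances, and the fixed `k = 3` are all arbitrary picks, and the names `msgs`, `methods` and `widths` are hypothetical.

```r
library(cluster)
library(stringdist)

# Example messages copied from the question
msgs <- c(
  'request body: <?xml version="2.0",<code> dwfkjn34241',
  'request body: <?xml version="2.0",<code> jnwg3425',
  'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
  'received an error',
  '<MarkCheckMSG>',
  'received an error'
)

# Candidate string distance methods (an arbitrary selection)
methods <- c("osa", "cosine", "jaccard", "jw")

# Average silhouette width of a 3-cluster PAM fit under each method.
# q only affects the q-gram based methods (here cosine and jaccard).
widths <- vapply(methods, function(m) {
  d <- stringdist::stringdistmatrix(msgs, msgs, method = m, q = 2)
  cluster::pam(d, k = 3, diss = TRUE)$silinfo$avg.width
}, numeric(1L))

print(round(widths, 3))
print(names(widths)[which.max(widths)])  # method with the widest silhouette
```

A higher average silhouette width suggests tighter, better separated clusters, but on real data it is worth inspecting the resulting groups by eye as well, since different distances emphasise different kinds of similarity.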

