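r - Clustering observations based on multiple variables
Question
I am looking for an R function that creates clusters in my dataset based on two variables (I hope "cluster" is the right name for what I want to do). Any two observations that share the same value of either variable_1 or variable_2 should be in the same cluster. In the short example below, I group the data frame df by variable_1 and variable_2.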
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like this:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
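Note that the first cluster contains every observation whose variable_1 equals "a", "b", "c", or "d": "a" and "b" are merged because they share "g1" (rows 1 and 3); "a" and "c" are merged because they share "g2" (rows 2 and 5); and "c" and "d" are merged because they share "g4" (rows 6 and 7). The last cluster contains only the observations with variable_1 == "e", because they share no variable_2 value with anyone else.
To clarify what I intend to do, here is more detail on my problem. I am pairing counties with nearby tourist attractions. Different counties have different tourist attractions (TAs) around them, and a single county can have many TAs around it. However, these "tourism clusters" of counties and TAs are sparsely distributed across the country. Note that, because of the "chain" effect in how counties connect to tourist attractions, some distant counties may end up in the same cluster. So I want to find those "clusters" based on the ids of the counties and the tourist attractions.
This seems simple, but I cannot figure out how to implement it.
Many thanks.
Answer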
igraph solution
Disclaimer: I am completely new to igraph, so there is probably a better solution to this problem. However, this seems to work.
With the igraph package we can build a graph from the data using the graph_from_data_frame() function, and then extract the clusters with components(). You get the added advantage of being able to visualise the clusters.
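library(igraph)

# Each row becomes an undirected edge between a variable_1 value and a variable_2 value.
graph <- graph_from_data_frame(df[, 1:2], directed = FALSE)
# components() labels every vertex with the id of its connected component.
cmp <- components(graph)$membership
# Look up by vertex name; as.character() stops a factor column from being used as integer positions.
df$cluster <- cmp[as.character(df$variable_1)]
plot(graph)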
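To make the idea behind the graph approach concrete: every row links one variable_1 value to one variable_2 value, and a cluster is a set of values connected through such links. The following is a minimal base R sketch of that idea using a simple union-find, not what igraph does internally. The function name find_clusters_base and the "x:" / "y:" prefixes are made up for illustration.

```r
# Hypothetical helper: label connected components with a small union-find.
find_clusters_base <- function(x, y) {
  # Prefix the keys so a value that occurs in both columns is not treated as one node.
  kx <- paste0("x:", as.character(x))
  ky <- paste0("y:", as.character(y))
  nodes <- unique(c(kx, ky))
  parent <- seq_along(nodes)  # every node starts as its own root

  # Follow parent pointers until reaching a root (a node that is its own parent).
  find_root <- function(i) {
    while (parent[[i]] != i) i <- parent[[i]]
    i
  }

  # Each row is an edge: merge the components of its two endpoints.
  for (k in seq_along(kx)) {
    ra <- find_root(match(kx[k], nodes))
    rb <- find_root(match(ky[k], nodes))
    if (ra != rb) parent[[max(ra, rb)]] <- min(ra, rb)
  }

  # Cluster of a row = root of its variable_1 node, renumbered 1, 2, ... by first appearance.
  roots <- vapply(match(kx, nodes), find_root, integer(1))
  match(roots, unique(roots))
}

x <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
y <- c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8")
find_clusters_base(x, y)
# [1] 1 1 1 1 1 1 1 1 2 2
```

For large data the igraph version is likely to be faster; this sketch is only meant to show what "connected component" means here.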
Wrapping it up into a function
If you wanted to wrap it up as a function, something like this works:
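find_clusters <- function(x, y) {
  # One undirected edge per row, linking a value of x to a value of y.
  edges <- data.frame(from = x, to = y)
  graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
  cmp <- igraph::components(graph)$membership
  # Index by vertex name (not by factor code) and drop the names from the result.
  unname(cmp[as.character(x)])
}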
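One caveat, as far as I can tell: graph_from_data_frame() identifies vertices by name, so a value that occurs in both columns (for example county id 1 and attraction id 1) would be treated as the same vertex and could merge unrelated clusters. A possible guard, sketched below, is to prefix each column's values before building the graph. The name find_clusters_safe and the "v1:" / "v2:" prefixes are hypothetical.

```r
# Hypothetical variant: keep the two id spaces apart by prefixing the vertex names.
find_clusters_safe <- function(x, y) {
  from <- paste0("v1:", as.character(x))
  to <- paste0("v2:", as.character(y))
  edges <- data.frame(from = from, to = to, stringsAsFactors = FALSE)
  graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
  cmp <- igraph::components(graph)$membership
  # Look up each row's cluster through its (prefixed) x vertex.
  unname(cmp[from])
}

# "b" and "c" occur in both columns but mean different things in each.
x <- c("a", "b", "c")
y <- c("b", "z", "c")
length(unique(find_clusters_safe(x, y)))
```

Without the prefixes, the rows for "a" and "b" would be joined through the shared name "b"; with them, the three rows stay in three separate clusters.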
Using the additional example you posted as a comment above, we thus have the following workflow:
library(dplyr)

df <- data.frame(
  variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
  variable_2 = c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12"),
  value = rnorm(12)
)

df %>%
  mutate(cluster = find_clusters(variable_1, variable_2))
#    variable_1 variable_2       value cluster
# 1           a         g1 -0.03410073       1
# 2           a         g2  0.51261548       1
# 3           b         g1  0.06470451       1
# 4           b         g3 -1.97228101       1
# 5           c         g2 -0.39751063       1
# 6           c         g4  0.17761619       1
# 7           d         g4 -0.13771207       1
# 8           d         g6 -0.72183017       1
# 9           e         g7  0.09012701       2
# 10          e         g8  0.45763593       2
# 11          f         g9 -0.83172613       3
# 12          f        g12  2.83480352       3
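The question asked for labels such as "clu1" rather than integer ids. Assuming the integer membership ids shown in the cluster column, one way to get such labels is paste0(); the vector below is typed in by hand for illustration rather than computed.

```r
# Illustrative membership ids for the ten rows of the question's example.
cluster_id <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2)

# Turn the integer ids into labels of the form "clu1", "clu2", ...
cluster_label <- paste0("clu", cluster_id)
cluster_label

# Number of observations per cluster.
table(cluster_label)
```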