r - R - Time-efficient way to select identical rows present in a list of matrices above a defined threshold
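Problem description
I have a list of 68 matrices. Each matrix is essentially an edge list with three columns and thousands of rows. The first two columns, named Node1 and Node2, contain gene names. Each row represents an edge in a graph, i.e. an interaction between two genes. The third column contains the weight of each edge.
The goal is to obtain a final table in which edges that are present in 75% or more of the matrices, with different weights, are collapsed into a single row. The weight of each final edge is the mean of the weights of the identical edges.
I am looking for more time-efficient code for comparing large matrices with millions of rows.
Example
Matrices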
edgelist1 <- matrix(data = c("ABCD1", "EFGH1", "DFEC", "JEKC4", 0.1314, 1.1231),
                    nrow = 2, ncol = 3,
                    dimnames = list(c(), c("Node1", "Node2", "Weight")))
edgelist1
edgelist2 <- matrix(data = c("ABCD1", "DEIR3", "CGESL", "DFEC", "KMN3", "PME2", 1.7564, 0.6573, 0.5478),
                    nrow = 3, ncol = 3,
                    dimnames = list(c(), c("Node1", "Node2", "Weight")))
edgelist2
edgelist3 <- matrix(data = c("ACCD1", "DEIR3", "GUESL", "DFEC", "KMN3", "PMKE2", 1.264, 0.8573, 0.7458),
                    nrow = 3, ncol = 3,
                    dimnames = list(c(), c("Node1", "Node2", "Weight")))
edgelist3
edgelist4 <- matrix(data = c("KPF2", "NDM1", "GUESL", "ABCD1", "KMN3", "PMKE2", "LTRC5", "DFEC", 1.142, 0.9273, 0.1358, 0.3456),
                    nrow = 4, ncol = 3,
                    dimnames = list(c(), c("Node1", "Node2", "Weight")))
edgelist4
List
list <- list(edgelist1, edgelist2, edgelist3, edgelist4)
Expected output
finaledgelist <- matrix(c("ABCD1", "DFEC", "0.7445"),
                        nrow = 1, ncol = 3,
                        dimnames = list(c(), c("Node1", "Node2", "Weight")))
finaledgelist
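To make the expected value concrete: in the example data, the edge ABCD1 / DFEC occurs in edgelist1, edgelist2 and edgelist4, that is in 3 of the 4 matrices, which meets the 75% threshold. The short check below reproduces that count and the averaged weight by hand. The variable names `weights`, `n_matrices` and `min_count` are illustrative only and do not appear in the question.

```r
# Weights of the edge ABCD1 / DFEC in edgelist1, edgelist2 and edgelist4
weights <- c(0.1314, 1.7564, 0.3456)

# Number of matrices in the example list and the 75% threshold
n_matrices <- 4
min_count <- 0.75 * n_matrices

# The edge occurs in 3 matrices, so it passes the threshold
passes <- length(weights) >= min_count

# Mean weight of the collapsed edge, rounded as in the expected output
mean_weight <- round(mean(weights), 4)

print(passes)       # TRUE
print(mean_weight)  # 0.7445
```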
My code
library(dplyr)
# Combine all edge lists into one matrix
alledges <- do.call(rbind, list)
# Merge column 1 and column 2 into a single edge label
alledges <- data.frame(Edges = paste(alledges[, 1], alledges[, 2]), Weights = alledges[, 3])
# Table with the frequency of appearance of each edge
frequencies <- as.data.frame(table(alledges$Edges))
# Select the edges present in 75% or more of the original edge lists (3 of 4 here)
selection <- frequencies[frequencies$Freq >= 3, ]
# Keep every row whose edge appears three or more times
repeated <- alledges[alledges$Edges %in% selection$Var1, ]
# Collapse by edge name and compute the mean of the weights
finaledgelist <- repeated %>%
  group_by(Edges) %>%
  dplyr::summarize(Weights = mean(as.numeric(as.character(Weights)), na.rm = TRUE))
# Final edge list as a data frame: split the edge label back into Node1 and Node2
finaledgelist <- as.data.frame(cbind(
  Node1 = unlist(strsplit(as.vector(finaledgelist$Edges), split = " "))[2 * (1:nrow(finaledgelist)) - 1],
  Node2 = unlist(strsplit(as.vector(finaledgelist$Edges), split = " "))[2 * (1:nrow(finaledgelist))],
  Weights = finaledgelist$Weights))
finaledgelist$Weights <- as.numeric(as.character(finaledgelist$Weights))
Solution
Here is one approach using the tidyverse. In the code below, list1 is the list of edge-list matrices (the object named list in the question).
library(tidyverse)
do.call(rbind, list1) %>% # bind all matrices together
  as.data.frame %>% # convert to data frame
  group_by(Node1, Node2) %>% # group by nodes
  mutate(n1 = n()) %>% # count members of each group
  filter(n1 >= (0.75 * length(list1))) %>% # keep edges present in at least 75% of list elements
  summarise(weight = mean(as.numeric(as.character(Weight)))) # mean weight of the edges that are left
# output
# A tibble: 1 x 3
# Groups:   Node1 [?]
  Node1 Node2 weight
  <fct> <fct>  <dbl>
1 ABCD1 DFEC   0.744
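Since the question is about speed on millions of rows, the same group-count-filter-average idea can also be sketched with the data.table package, which is often used for grouped aggregation on large tables. This is a hedged alternative, not part of the original answer, and it has not been benchmarked here. It assumes data.table is installed. The names `list1`, `alledges`, `source`, `min_count` and `result` are illustrative. Unlike the pipeline above, this sketch counts the number of distinct matrices an edge appears in (via the `source` column), so an edge duplicated inside a single matrix is not counted twice.

```r
library(data.table)

# Example matrices from the question
cols <- list(c(), c("Node1", "Node2", "Weight"))
edgelist1 <- matrix(c("ABCD1", "EFGH1", "DFEC", "JEKC4", 0.1314, 1.1231),
                    nrow = 2, ncol = 3, dimnames = cols)
edgelist2 <- matrix(c("ABCD1", "DEIR3", "CGESL", "DFEC", "KMN3", "PME2",
                      1.7564, 0.6573, 0.5478),
                    nrow = 3, ncol = 3, dimnames = cols)
edgelist3 <- matrix(c("ACCD1", "DEIR3", "GUESL", "DFEC", "KMN3", "PMKE2",
                      1.264, 0.8573, 0.7458),
                    nrow = 3, ncol = 3, dimnames = cols)
edgelist4 <- matrix(c("KPF2", "NDM1", "GUESL", "ABCD1", "KMN3", "PMKE2", "LTRC5", "DFEC",
                      1.142, 0.9273, 0.1358, 0.3456),
                    nrow = 4, ncol = 3, dimnames = cols)
list1 <- list(edgelist1, edgelist2, edgelist3, edgelist4)

# Stack all matrices into one data.table; `source` records the index of the
# matrix each row came from
alledges <- rbindlist(lapply(list1, as.data.table), idcol = "source")

# The matrices are character matrices, so convert the weights to numeric
alledges[, Weight := as.numeric(Weight)]

# An edge must appear in at least 75% of the matrices
min_count <- 0.75 * length(list1)

# Per edge: number of distinct matrices containing it and its mean weight,
# then keep only the edges that reach the threshold
result <- alledges[, .(n = uniqueN(source), Weight = mean(Weight)),
                   by = .(Node1, Node2)][n >= min_count, .(Node1, Node2, Weight)]

print(result)
```

On the example data this should leave the single edge ABCD1 / DFEC with a mean weight of about 0.7445. Aggregating once per edge and filtering afterwards avoids carrying a per-row count through the whole table, which may matter at the scale described in the question.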