Iterate over ID column, creating a new graph for each unique ID

Problem description

Imagine I have the following data:

dat <- read.table(text="TrxID Items Quant Team_Id
Trx1 A 3 11
Trx1 B 1 11
Trx1 C 1 12
Trx2 E 3 13
Trx2 B 1 13
Trx3 B 1 14
Trx3 C 4 14
Trx4 D 1 15
Trx4 E 1 15
Trx4 A 1 15
Trx5 F 5 18
Trx5 B 3 13
Trx5 C 2 19
Trx5 D 1 20", header = TRUE)

dat[1, ]$Team_Id <- paste0(c('11','19'), collapse = ',')
dat[6, ]$Team_Id <- paste0(c('14','13'), collapse = ',')

Some people are on more than one team, so they have multiple team_ids stored in a list. I can generate an adjacency matrix of all the occurrences, and turn it into a graph to perform network analysis like so:

tabbed <- xtabs(~ TrxID + Items, data=dat, sparse = TRUE)
co_occur <- crossprod(tabbed, tabbed)
diag(co_occur) <- 0
co_occur

library(igraph)  # provides graph.adjacency() and simplify()

g <- graph.adjacency(co_occur, weighted=TRUE, mode ='undirected')
g <- simplify(g)

However, what I want to do is to group by the team_id column, and to generate the above adjacency matrix and graph objects for every unique team_id. I tried using a for loop to achieve this, but I don't believe it is feasible given the size of my dataset. Moreover, it cannot handle the cases when people are on more than one team (as it would require another for loop to iterate over each element in a list).

For example,

complete_teams <- data.frame(team_id = c(11, 12, 13, 14, 15, 18, 19, 20))

for (i in complete_teams$team_id) {
  if (i %in% dat$Team_Id) {
    newdata <- subset(dat, Team_Id == i)
    tabbed <- xtabs(~ TrxID + Items, data = newdata, sparse = TRUE)
    co_occur <- crossprod(tabbed, tabbed)
    diag(co_occur) <- 0
    print(co_occur)
    g <- graph.adjacency(co_occur, weighted = TRUE, mode = 'undirected')
    g <- simplify(g)
  }
}

So, what I'm wondering is

  1. what is the best way to generate separate networks for each team_id?
  2. how should the resultant graph objects for each team_id be stored in order to do analysis on them later?

If there is a more obvious way of doing this within the network analysis paradigm, please let me know.

Tags: r, networking, igraph

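Solution

Here is one approach using by(). Before applying it, I preprocess the data by splitting the comma-separated Team_Id column. The graph construction itself goes into a function: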

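# Build an undirected, weighted item co-occurrence graph from one subset of rows.
create_g <- function(dx){
  tabbed <- xtabs(~ TrxID + Items, data=dx, sparse = TRUE)  # transactions x items
  co_occur <- crossprod(tabbed, tabbed)                       # items x items counts
  diag(co_occur) <- 0                                         # ignore self-pairs
  g <- graph.adjacency(co_occur, weighted=TRUE, mode ='undirected')
  g <- simplify(g)
  g
}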

I use data.table to split the column, since this is a grouped (per-ID) operation:

library(data.table)
out <- setDT(dat)[, {
  data.table(new_id = unlist(strsplit(Team_Id, ",")), .SD)
}, by = Team_Id]
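
To make the effect of this splitting step concrete, here is a rough base R sketch of the same idea: every row is repeated once per comma-separated team id. The helper name split_team_ids and the toy data are hypothetical and not part of the original answer, and the sketch assumes a plain data.frame rather than a data.table.

```r
# Hypothetical helper: one output row per (original row, team id) pair.
split_team_ids <- function(df, col = "Team_Id") {
  ids <- strsplit(as.character(df[[col]]), ",", fixed = TRUE)
  n   <- lengths(ids)                                    # number of teams on each row
  res <- df[rep(seq_len(nrow(df)), n), , drop = FALSE]   # repeat each row n times
  res$new_id <- trimws(unlist(ids))                      # one team id per repeated row
  rownames(res) <- NULL
  res
}

# Toy data, made up for illustration only.
toy <- data.frame(TrxID   = c("Trx1", "Trx1", "Trx3"),
                  Items   = c("A", "B", "B"),
                  Team_Id = c("11,19", "11", "14,13"),
                  stringsAsFactors = FALSE)

expanded <- split_team_ids(toy)
print(expanded)
```

Because each row is expanded on its own, this variant does not rely on a multi-team string appearing on only one row of its group.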

We can no longer use the data.table framework to apply create_g, because the result is not a nested list, so by() is used instead:

by(out, out$new_id, FUN = create_g)
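
On the second question (how to keep the graphs around for later analysis), one possible option is a plain named list keyed by team id. The sketch below is only an illustration under some assumptions: the igraph package is installed, a dense co-occurrence matrix is acceptable, and graph_from_adjacency_matrix() is used, which is the current name for graph.adjacency() in recent igraph releases. The function name build_graph and the toy data are made up for the example.

```r
library(igraph)

# Illustrative stand-in for create_g, using a dense matrix for simplicity.
build_graph <- function(dx) {
  tabbed   <- unclass(table(dx$TrxID, dx$Items))  # transactions x items counts
  co_occur <- crossprod(tabbed)                    # items x items co-occurrence
  diag(co_occur) <- 0                              # ignore self-pairs
  graph_from_adjacency_matrix(co_occur, weighted = TRUE, mode = "undirected")
}

# Toy data, already split so that each row carries a single team id.
toy <- data.frame(
  TrxID  = c("Trx1", "Trx1", "Trx2", "Trx2", "Trx5", "Trx5"),
  Items  = c("A",    "B",    "E",    "B",    "B",    "C"),
  new_id = c("11",   "11",   "13",   "13",   "13",   "13"),
  stringsAsFactors = FALSE
)

# One graph per team, stored in a named list.
graphs <- lapply(split(toy, toy$new_id), build_graph)

# Later analysis can look graphs up by team id or loop over all of them.
print(names(graphs))
print(sapply(graphs, ecount))
print(E(graphs[["13"]])$weight)
```

A named list keeps each team's graph addressable by its id and works directly with lapply() or sapply() for per-team metrics.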
