首页 > 解决方案 > 如何优化矩阵中行和列的交集?

问题描述

例如,在矩阵中M1,行是国家,列是年份。这些国家在同一年份没有观察结果。我想找到给我最多国家的年份的“最佳”交叉点。最少年数和最少国家将被预先定义。结果中包括哪些国家/地区无关紧要,年份不必是连续的。

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,]   NA   NA   NA 2004   NA 2006   NA 2008 2009    NA  2011  2012    NA    NA    NA
 [2,]   NA 2002   NA 2004   NA   NA 2007   NA   NA  2010  2011    NA  2013  2014    NA
 [3,]   NA   NA   NA 2004 2005 2006 2007 2008 2009    NA    NA  2012  2013    NA  2015
 [4,]   NA 2002   NA 2004 2005 2006 2007 2008   NA  2010  2011    NA  2013    NA    NA
 [5,] 2001   NA   NA   NA 2005 2006 2007 2008   NA  2010    NA  2012  2013  2014    NA
 [6,] 2001   NA 2003 2004 2005 2006 2007 2008 2009  2010  2011  2012    NA  2014    NA
 [7,] 2001 2002   NA   NA 2005   NA 2007   NA 2009    NA  2011    NA    NA  2014  2015
 [8,] 2001 2002   NA 2004 2005 2006   NA   NA   NA  2010    NA    NA  2013    NA  2015
 [9,]   NA 2002   NA 2004 2005   NA 2007   NA   NA  2010  2011    NA    NA    NA    NA
[10,] 2001 2002   NA 2004   NA   NA   NA   NA   NA  2010    NA  2012    NA  2014  2015

因为没有明显的交叉点,所以一次Reduce(intersect...)尝试是行不通的,我通过连续排除一个国家到定义的阈值来反复这样做n.row。结果至少过滤了几年n.col。我写了这个函数,

findBestIntersect <- function(M, min.row=5, min.col=3) {
  ## min.row: minimum number of rows (countries) to analyze
  ## min.col: minimum number of complete columns (years)
  # put matrices with row combn into list (HUGE!)
  L1 <- lapply(min.row:(nrow(M) - 1), function(x)
    combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
  # select lists w/ def. number of complete columns
  slc <- sapply(L1, function(y)  # numbers of lists
    which(sapply(y, function(x)
      sum(!(apply(x, 2, function(i) any(is.na(i))))))
      >= min.col))
  # list selected lists
  L2 <- Map(function(x, i)
    x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
  # find intersects
  L3 <- rapply(L2, function(l)
    as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
    how="list")
  return(unique(unlist(L3, recursive=FALSE)))
}

这给了我想要的结果M1

> system.time(best.yrs.1 <- findBestIntersect(M1))
   user  system elapsed 
   0.06    0.00    0.07 

> best.yrs.1
[[1]]
[1] 2002 2004 2010

然而,性能M2只是可接受的(RAM 使用量约为 1.1 GB),

> system.time(best.yrs.2 <- findBestIntersect(M2))
   user  system elapsed 
  79.90    0.39   82.76 
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015

[[2]]
[1] 2002 2014 2015

[[3]]
[1] 2003 2009 2010

而且您不想尝试使用M3类似于我的真实矩阵的(爆炸 32 GB RAM):

# best.yrs.3 <- findBestIntersect(M3)

可能这个功能最大的缺陷就是L1很快就变得太大了。

所以,我的问题是,是否有更好的方法也适用于M3?“奖金”将最大化国家和年份。如果可能的话,我想在没有额外软件包的情况下做到这一点。

数据

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

标签: ralgorithmperformanceintersection

解决方案


这是一个使用混合整数规划的小问题,即使使用弱开源求解器(如glpk. 我正在使用ompr包进行数学建模(有关ompr的更多信息)并将模型逻辑作为注释包含在代码中。请注意,由于我猜的 R 版本不同,我的随机数据与 OP 不同。

M3当模型设置为最大化数据最多 15 年时,总运行时间约为一分钟(即实际求解时间甚至更少) 。这种方法很容易扩展到更大的实例。

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15 

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # cell can be kept only when row is not selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
  # cell can be kept only when column is not selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # keep at most n_years columns i.e. remove at least (nc - n_years) columns
  # I used >= instead of == to avoid infeasiblity
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))

结果 -

solver_status(model)
# [1] "optimal"    <- indicates guaranteed optimum (at least one of the many possible)

# get rows to remove
rm_rows <- model %>% 
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% pull(i) %>% print()

# [1]  1  2  3  4  6  8  9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31

# get columns to remove
rm_cols <- model %>% 
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% pull(j) %>% print()

# [1]  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48

result <- M3[-rm_rows, -rm_cols, drop = F]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011

推荐阅读