r - 如何优化矩阵中行和列的交集?
问题描述
例如,在矩阵中M1
,行是国家,列是年份。这些国家在同一年份没有观察结果。我想找到给我最多国家的年份的“最佳”交叉点。最少年数和最少国家将被预先定义。结果中包括哪些国家/地区无关紧要,年份不必是连续的。
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA
[2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA
[3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015
[4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA
[5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA
[6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA
[7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015
[8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015
[9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA
[10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015
因为没有明显的交叉点,所以一次Reduce(intersect...)
尝试是行不通的,我通过连续排除一个国家到定义的阈值来反复这样做n.row
。结果至少过滤了几年n.col
。我写了这个函数,
findBestIntersect <- function(M, min.row=5, min.col=3) {
## min.row: minimum number of rows (countries) to analyze
## min.col: minimum number of complete columns (years)
# put matrices with row combn into list (HUGE!)
L1 <- lapply(min.row:(nrow(M) - 1), function(x)
combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
# select lists w/ def. number of complete columns
slc <- sapply(L1, function(y) # numbers of lists
which(sapply(y, function(x)
sum(!(apply(x, 2, function(i) any(is.na(i))))))
>= min.col))
# list selected lists
L2 <- Map(function(x, i)
x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
# find intersects
L3 <- rapply(L2, function(l)
as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
how="list")
return(unique(unlist(L3, recursive=FALSE)))
}
这给了我想要的结果M1
。
> system.time(best.yrs.1 <- findBestIntersect(M1))
user system elapsed
0.06 0.00 0.07
> best.yrs.1
[[1]]
[1] 2002 2004 2010
然而,性能M2
只是可接受的(RAM 使用量约为 1.1 GB),
> system.time(best.yrs.2 <- findBestIntersect(M2))
user system elapsed
79.90 0.39 82.76
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015
[[2]]
[1] 2002 2014 2015
[[3]]
[1] 2003 2009 2010
而且您不想尝试使用M3
类似于我的真实矩阵的(爆炸 32 GB RAM):
# best.yrs.3 <- findBestIntersect(M3)
可能这个功能最大的缺陷就是L1
很快就变得太大了。
所以,我的问题是,是否有更好的方法也适用于M3
?“奖金”将最大化国家和年份。如果可能的话,我想在没有额外软件包的情况下做到这一点。
数据
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
解决方案
这是一个使用混合整数规划的小问题,即使使用弱开源求解器(如glpk
. 我正在使用ompr
包进行数学建模(有关ompr的更多信息)并将模型逻辑作为注释包含在代码中。请注意,由于我猜的 R 版本不同,我的随机数据与 OP 不同。
M3
当模型设置为最大化数据最多 15 年时,总运行时间约为一分钟(即实际求解时间甚至更少) 。这种方法很容易扩展到更大的实例。
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n_years columns i.e. remove at least (nc - n_years) columns
# I used >= instead of == to avoid infeasiblity
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>%
# solve using free glpk solver
solve_model(with_ROI(solver = "glpk"))
结果 -
solver_status(model)
# [1] "optimal" <- indicates guaranteed optimum (at least one of the many possible)
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>% pull(i) %>% print()
# [1] 1 2 3 4 6 8 9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>% pull(j) %>% print()
# [1] 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48
result <- M3[-rm_rows, -rm_cols, drop = F]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011
推荐阅读
- c# - 如何将 SignalR 添加到 .NET Core Windows 服务?
- javascript - Firebase 获取集合内的文档长度
- laravel - 播种工厂关系时,Morph 字段没有默认值
- typescript - TypeScript 泛型类型推断失败
- asp.net - 创建容器时无法连接到 azure nuget feed
- r - 使用 R 对数字字符串进行操作
- haskell - Haskell - 排序子列表
- heroku - 如何将 Siteground 的域指向 Heroku 托管的应用程序?
- eclipse - 在 Web 服务客户端上运行简单功能时 Eclipse 出错
- haskell - Haskell 快速排序,但对于元组第二个元素,不使用 map、sortby 或任何更高阶或任何需要导入的东西