r - Efficiently picking combinations of Integers
问题描述
Let's say we have a 5x5 matrix, filled with 0s.
myMatrix <- matrix(rep(0, 25), ncol = 5)
Now, let's pick a triplet of integers between 1 and 5.
triplet <- c(1,2,3)
For all combinations of this triplet we now add 1 in the matrix, with this function:
addCombinationsToMatrix <- function(.matrix, .triplet){
indexesToChange <- as.matrix(expand.grid(.triplet, .triplet))
.matrix[indexesToChange] <- .matrix[indexesToChange] + 1
.matrix
}
Using the function, we go from
myMatrix
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0
to
myMatrix <- addCombinationsToMatrix(myMatrix, triplet)
myMatrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 0
[2,] 1 1 1 0 0
[3,] 1 1 1 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0
If we pick another triplet we move on to
nextTriplet <- 2:4
myMatrix <- addCombinationsToMatrix(myMatrix, nextTriplet)
myMatrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 0
[2,] 1 2 2 1 0
[3,] 1 2 2 1 0
[4,] 0 1 1 1 0
[5,] 0 0 0 0 0
So, row-column combinations represent how often two integers have been shown together in a triplet: 3 and 4 have been shown together once, 2 and 3 have been shown together twice.
Question: How can one pick triplets, such that every combination (1-2, 1-3, 1-4...) was picked at least once and the number of triplets is minimized.
I'm looking for an algorithm here that picks the next triplet.
Ideally it can be extended to
- arbitrarily big matrices (10x10, 100x100 ...)
- arbitrarily big vectors (quadruplets, quintuplets, n-tuplets)
- an arbitrary number of times a combination must have been picked at least
Example:
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, 1:3)
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, 3:5)
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, c(1,4,5))
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, c(2,4,5))
myMatrix
EDIT:
Just to be sure: the answer doesn't have to be R
code. It can be some other language as well or even pseudo code.
EDIT 2: It occured to me now, that there are different ways of measuring efficiency. I actually meant, the algorithm should take as little iterations as possible. The algorithm being fast is also very cool, but not the main goal here.
解决方案
Great question! This comes up in survey design, where you want a few different versions of the survey that each only contain a subset of the questions, but you want every pair (or t-tuple) of questions to have been asked at least once.
This is called covering design, and is a variant of the classic set cover problem. As you can read in an excellent Mathematics Stack Exchange post on the topic, folks use notation C(v, k, t) indicating the minimum number of k-element subsets you need to draw (k=3 in your case) from a v-element set (v=5 in your case) such that every t-element subset in the entire set (t=2 in your case) is contained within one of your selected subsets. Folks have evaluated this function for many different (v, k, t) tuples; see, for instance, https://ljcr.dmgordon.org/cover/table.html . We can read from that table that C(5, 3, 2) = 4, with the following as one possible design:
1 2 3
1 4 5
2 3 4
2 3 5
First and foremost, this problem is NP-hard, so all known exact algorithms will scale exponentially in inputs v, k, and t. So while you may be able to solve small instances exactly by enumeration or some more clever exact method (e.g. integer programming), you will likely need to resort to heuristic methods as the problem size gets very large.
One possibility in this direction is lexicographic covering, as proposed in https://arxiv.org/pdf/math/9502238.pdf (you will note that many of the solutions on the site linked above list "lex covering" as the method of construction). Basically you list out all possible k-tuples in lexicographic order:
123
124
125
134
135
145
234
235
245
345
Then you greedily add the k-tuple that covers the most previously uncovered t-tuples, breaking ties using the lexicographic ordering.
Here's how the algorithm works in our case:
At the beginning every 3-tuple covers 3 different 2-tuples, so we add
123
since it is lexicographically earliest.After doing this, the 2-tuples of
12
,13
, and23
have been covered, while all remaining 2-tuples are not covered. A number of 3-tuples cover 3 more 2-tuples, e.g.145
and245
. We pick145
, since it is lexicographically first, covering14
,45
, and15
.Now we have 4 remaining uncovered 2-tuples --
24
,25
,34
, and35
. No 3-tuple covers 3 of these, but several cover 2, e.g.234
and345
. We select234
as the lexicographically earliest.We have two remaining uncovered 2-tuples --
25
and35
. We select235
as the only 3-tuple that covers both.
We end up with the exact solution shown above. Importantly, this is just a heuristic method -- it doesn't give any guarantee that 4 is the smallest number of 3-tuples needed to cover all pairs in a set with 5 elements. In this case, a lower bound by Schönheim (a reference is provided in the linked article above) convinces us that, in fact, C(5, 3, 2) cannot be smaller than 4. We conclude that the solution from lexicographic covering is in fact optimal.
You would need a tweak to cover each t-tuple a certain number of times r. One obvious one would just be to repeat each tuple to be covered "r" times, and then run lex covering as usual (so for instance in the first step above each 3-tuple would cover 9 2-tuples with r=3). Of course this remains a heuristic for your overall problem due to the use of lex covering.
推荐阅读
- pandas - 如何将 numpy 一维数组转换为 Pandas Series 或 Dataframe
- html - 使用自定义 css 类作为 css 文件中的值
- javascript - 在点击时调用 Jquery 中的函数第一次在页面加载时有效,之后则无效
- python - Python——努力理解如何更新函数,因为while循环中的变量被更新
- linux - 如何让 WSL Ubuntu 1804 明白在 Windows 中启用了 WSL?
- python - 导入张量流:DLL 加载失败
- node.js - express --view=hbs myapp getting [ERR_INVALID_CALLBACK]: 回调必须是函数
- php - 如何从外部源中的 url 编码字符串中获取数据
- actions-on-google - Google 助理和 Dialogflow 请求配额
- python - 在 for 循环中改变亮度只接受第一个参数