Selecting samples that maximize correlation

Problem Description

I have an optimization problem that I am trying to code in R. I have a panel dataset in which weight and height were observed twice for each worker.

worker weight height weight2 height2
1      120    60     125     60
2      152    55     156     66
3      222    55     100     20

I want to determine the maximum correlation between weight and height, regardless of the time at which the data were collected. For example, I want to determine whether the correlation is stronger when I use worker 1's weight and height or worker 1's weight2 and height2, and so on. What is the best way to determine the sample that maximizes this correlation?

Tags: r, optimization, correlation

Solution

Here are some possible approaches. The first uses a global MINLP solver (not in R); the second demonstrates R's GA (genetic algorithm) heuristic.

Solving as a non-convex MINLP

A MINLP (Mixed-Integer Nonlinear Programming) model for this problem can look like:

 max cor(h,w)
 h[i] = height1[i]*(1-x[i]) + height2[i]*x[i]
 w[i] = weight1[i]*(1-x[i]) + weight2[i]*x[i]
 x[i] ∈ {0,1}

I.e., x[i]=0 selects the first observation and x[i]=1 selects the second observation.

This needs a global solver, such as Baron, Antigone, or Couenne.
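For very small instances we can also verify optimality by brute force in R. A minimal sketch (the helper name best_cor_assignment is mine; enumeration grows as 2^n, so this is only practical up to roughly n = 20):

 # Enumerate all 2^n 0/1 selection vectors x and keep the best one.
 best_cor_assignment <- function(h1, w1, h2, w2) {
   n <- length(h1)
   grid <- as.matrix(expand.grid(rep(list(0:1), n)))  # all 2^n assignments
   best <- list(cor = -Inf, x = NULL)
   for (k in seq_len(nrow(grid))) {
     x <- grid[k, ]
     h <- h1*(1-x) + h2*x   # pick height1 or height2
     w <- w1*(1-x) + w2*x   # pick weight1 or weight2
     r <- cor(w, h)
     if (!is.na(r) && r > best$cor) best <- list(cor = r, x = x)
   }
   best
 }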

Here is an example using Baron:

----     27 PARAMETER data  

        height1     weight1     height2     weight2

i1    67.433285  168.262871   67.445523  163.692389
i2    70.638374  174.437750   68.649190  160.084811
i3    71.317794  159.909672   69.503911  164.720010
i4    59.850261  145.704159   61.175728  142.708300
i5    65.341938  155.586984   68.483909  165.564991
i6    64.142009  154.335001   68.568683  166.169507
i7    67.030368  158.768813   65.780803  153.721717
i8    73.672863  175.126951   73.236515  164.704340
i9    65.203516  157.593587   63.279277  149.784500
i10   69.001848  160.063428   68.786656  162.278007
i11   64.455422  159.039195   63.930208  152.827710
i12   70.719334  164.885704   69.666096  157.356595
i13   65.688428  151.223468   63.614565  150.071072
i14   66.569252  160.978671   70.533320  160.722483
i15   78.417676  172.298652   80.070076  172.695207
i16   65.396154  158.234709   67.404942  158.310596
i17   62.504967  150.899428   61.000439  154.094647
i18   62.122630  150.024298   63.634554  153.644324
i19   70.598400  165.086523   72.999194  166.771223
i20   74.935107  170.820610   76.622182  169.013550
i21   63.233956  154.331546   60.372876  149.152520
i22   72.550105  173.961915   76.748649  167.462369
i23   74.086553  168.190867   75.433331  171.773607
i24   65.379648  163.577697   65.717553  160.134888
i25   64.003038  155.357607   67.301426  158.713710

----     68 VARIABLE x.L  select 1 or 2

i1  1.000000,    i2  1.000000,    i3  1.000000,    i8  1.000000,    i9  1.000000,    i11 1.000000,    i13 1.000000
i14 1.000000,    i16 1.000000,    i19 1.000000,    i21 1.000000,    i22 1.000000,    i23 1.000000,    i24 1.000000
i25 1.000000


----     68 VARIABLE z.L                   =     0.956452  objective

----     72 PARAMETER corr  

all1    0.868691,    all2    0.894532,    optimal 0.956452

Notes:

  • Zeros in x are not printed.
  • The parameter corr shows the correlation for three cases: (a) all x[i]=0, (b) all x[i]=1, and (c) the optimal values of x.
  • I ran this with GAMS/Baron (i.e., outside of R).

Solving as a non-convex MIQCP

With some effort we can reformulate our MINLP model as a non-convex quadratic model. That model can be solved with a solver like Gurobi (which is available under R). See the link for more details.
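The idea, sketched at a high level (this is my own outline of such a reformulation, not necessarily the exact model behind the link): introduce variables for the means and standard deviations and cross-multiply the correlation definition so that every constraint is at most quadratic:

 max z
 u = z*sh                               (bilinear)
 u*sw = sum(i, (h[i]-mh)*(w[i]-mw))     (bilinear)
 sh^2 = sum(i, (h[i]-mh)^2)
 sw^2 = sum(i, (w[i]-mw)^2)
 mh = sum(i, h[i])/n
 mw = sum(i, w[i])/n
 sh, sw ≥ 0

with h[i] and w[i] linear in x[i] as before. All nonlinearities are now products of two variables, which a non-convex MIQCP solver such as Gurobi accepts.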

Using a metaheuristic

If we just want to find a good solution rather than a proven optimal one, we can use a metaheuristic such as a genetic algorithm. Here is a try with ga from the GA package:

> df <- read.table(text="
+ id      height1     weight1     height2     weight2
+ i1    67.433285  168.262871   67.445523  163.692389
+ i2    70.638374  174.437750   68.649190  160.084811
+ i3    71.317794  159.909672   69.503911  164.720010
+ i4    59.850261  145.704159   61.175728  142.708300
+ i5    65.341938  155.586984   68.483909  165.564991
+ i6    64.142009  154.335001   68.568683  166.169507
+ i7    67.030368  158.768813   65.780803  153.721717
+ i8    73.672863  175.126951   73.236515  164.704340
+ i9    65.203516  157.593587   63.279277  149.784500
+ i10   69.001848  160.063428   68.786656  162.278007
+ i11   64.455422  159.039195   63.930208  152.827710
+ i12   70.719334  164.885704   69.666096  157.356595
+ i13   65.688428  151.223468   63.614565  150.071072
+ i14   66.569252  160.978671   70.533320  160.722483
+ i15   78.417676  172.298652   80.070076  172.695207
+ i16   65.396154  158.234709   67.404942  158.310596
+ i17   62.504967  150.899428   61.000439  154.094647
+ i18   62.122630  150.024298   63.634554  153.644324
+ i19   70.598400  165.086523   72.999194  166.771223
+ i20   74.935107  170.820610   76.622182  169.013550
+ i21   63.233956  154.331546   60.372876  149.152520
+ i22   72.550105  173.961915   76.748649  167.462369
+ i23   74.086553  168.190867   75.433331  171.773607
+ i24   65.379648  163.577697   65.717553  160.134888
+ i25   64.003038  155.357607   67.301426  158.713710
+ ", header=T)
> 
> #
> # print obvious cases 
> #
> cor(df$weight1,df$height1)
[1] 0.8686908
> cor(df$weight2,df$height2)
[1] 0.894532
> 
> #
> # fitness function
> #
> f <- function(x) {
+   w <- df$weight1*(1-x) + df$weight2*x
+   h <- df$height1*(1-x) + df$height2*x
+   cor(w,h) 
+ }
> 
> library(GA)
> res <- ga(type=c("binary"),fitness=f,nBits=25,seed=123)
GA | iter = 1 | Mean = 0.8709318 | Best = 0.9237155
GA | iter = 2 | Mean = 0.8742004 | Best = 0.9237155
GA | iter = 3 | Mean = 0.8736450 | Best = 0.9237155
GA | iter = 4 | Mean = 0.8742228 | Best = 0.9384788
GA | iter = 5 | Mean = 0.8746517 | Best = 0.9384788
GA | iter = 6 | Mean = 0.8792048 | Best = 0.9486227
GA | iter = 7 | Mean = 0.8844841 | Best = 0.9486227
GA | iter = 8 | Mean = 0.8816874 | Best = 0.9486227
GA | iter = 9 | Mean = 0.8805522 | Best = 0.9486227
GA | iter = 10 | Mean = 0.8820974 | Best = 0.9486227
GA | iter = 11 | Mean = 0.8859074 | Best = 0.9486227
GA | iter = 12 | Mean = 0.8956467 | Best = 0.9486227
GA | iter = 13 | Mean = 0.8989140 | Best = 0.9486227
GA | iter = 14 | Mean = 0.9069327 | Best = 0.9486227
GA | iter = 15 | Mean = 0.9078787 | Best = 0.9486227
GA | iter = 16 | Mean = 0.9069163 | Best = 0.9489443
GA | iter = 17 | Mean = 0.9104712 | Best = 0.9489443
GA | iter = 18 | Mean = 0.9169900 | Best = 0.9489443
GA | iter = 19 | Mean = 0.9175285 | Best = 0.9489443
GA | iter = 20 | Mean = 0.9207076 | Best = 0.9489443
GA | iter = 21 | Mean = 0.9210288 | Best = 0.9489443
GA | iter = 22 | Mean = 0.9206928 | Best = 0.9489443
GA | iter = 23 | Mean = 0.9210399 | Best = 0.9489443
GA | iter = 24 | Mean = 0.9208985 | Best = 0.9489443
GA | iter = 25 | Mean = 0.9183778 | Best = 0.9511446
GA | iter = 26 | Mean = 0.9217391 | Best = 0.9511446
GA | iter = 27 | Mean = 0.9274271 | Best = 0.9522764
GA | iter = 28 | Mean = 0.9271156 | Best = 0.9522764
GA | iter = 29 | Mean = 0.9275347 | Best = 0.9522764
GA | iter = 30 | Mean = 0.9278315 | Best = 0.9522764
GA | iter = 31 | Mean = 0.9300289 | Best = 0.9522764
GA | iter = 32 | Mean = 0.9306409 | Best = 0.9528777
GA | iter = 33 | Mean = 0.9309087 | Best = 0.9528777
GA | iter = 34 | Mean = 0.9327691 | Best = 0.9528777
GA | iter = 35 | Mean = 0.9309344 | Best = 0.9549574
GA | iter = 36 | Mean = 0.9341977 | Best = 0.9549574
GA | iter = 37 | Mean = 0.9374437 | Best = 0.9559043
GA | iter = 38 | Mean = 0.9394410 | Best = 0.9559043
GA | iter = 39 | Mean = 0.9405482 | Best = 0.9559043
GA | iter = 40 | Mean = 0.9432749 | Best = 0.9564515
GA | iter = 41 | Mean = 0.9441814 | Best = 0.9564515
GA | iter = 42 | Mean = 0.9458232 | Best = 0.9564515
GA | iter = 43 | Mean = 0.9469625 | Best = 0.9564515
GA | iter = 44 | Mean = 0.9462313 | Best = 0.9564515
GA | iter = 45 | Mean = 0.9449716 | Best = 0.9564515
GA | iter = 46 | Mean = 0.9444071 | Best = 0.9564515
GA | iter = 47 | Mean = 0.9437149 | Best = 0.9564515
GA | iter = 48 | Mean = 0.9446355 | Best = 0.9564515
GA | iter = 49 | Mean = 0.9455424 | Best = 0.9564515
GA | iter = 50 | Mean = 0.9456497 | Best = 0.9564515
GA | iter = 51 | Mean = 0.9461382 | Best = 0.9564515
GA | iter = 52 | Mean = 0.9444960 | Best = 0.9564515
GA | iter = 53 | Mean = 0.9434671 | Best = 0.9564515
GA | iter = 54 | Mean = 0.9451851 | Best = 0.9564515
GA | iter = 55 | Mean = 0.9481903 | Best = 0.9564515
GA | iter = 56 | Mean = 0.9477778 | Best = 0.9564515
GA | iter = 57 | Mean = 0.9481829 | Best = 0.9564515
GA | iter = 58 | Mean = 0.9490952 | Best = 0.9564515
GA | iter = 59 | Mean = 0.9505670 | Best = 0.9564515
GA | iter = 60 | Mean = 0.9499329 | Best = 0.9564515
GA | iter = 61 | Mean = 0.9509299 | Best = 0.9564515
GA | iter = 62 | Mean = 0.9505341 | Best = 0.9564515
GA | iter = 63 | Mean = 0.9519624 | Best = 0.9564515
GA | iter = 64 | Mean = 0.9518618 | Best = 0.9564515
GA | iter = 65 | Mean = 0.9523598 | Best = 0.9564515
GA | iter = 66 | Mean = 0.9516766 | Best = 0.9564515
GA | iter = 67 | Mean = 0.9521926 | Best = 0.9564515
GA | iter = 68 | Mean = 0.9524419 | Best = 0.9564515
GA | iter = 69 | Mean = 0.9532865 | Best = 0.9564515
GA | iter = 70 | Mean = 0.9535871 | Best = 0.9564515
GA | iter = 71 | Mean = 0.9536049 | Best = 0.9564515
GA | iter = 72 | Mean = 0.9534035 | Best = 0.9564515
GA | iter = 73 | Mean = 0.9532859 | Best = 0.9564515
GA | iter = 74 | Mean = 0.9521064 | Best = 0.9564515
GA | iter = 75 | Mean = 0.9534997 | Best = 0.9564515
GA | iter = 76 | Mean = 0.9539987 | Best = 0.9564515
GA | iter = 77 | Mean = 0.9536670 | Best = 0.9564515
GA | iter = 78 | Mean = 0.9526224 | Best = 0.9564515
GA | iter = 79 | Mean = 0.9531871 | Best = 0.9564515
GA | iter = 80 | Mean = 0.9527495 | Best = 0.9564515
GA | iter = 81 | Mean = 0.9526061 | Best = 0.9564515
GA | iter = 82 | Mean = 0.9525577 | Best = 0.9564515
GA | iter = 83 | Mean = 0.9525084 | Best = 0.9564515
GA | iter = 84 | Mean = 0.9519052 | Best = 0.9564515
GA | iter = 85 | Mean = 0.9518549 | Best = 0.9564515
GA | iter = 86 | Mean = 0.9511299 | Best = 0.9564515
GA | iter = 87 | Mean = 0.9505129 | Best = 0.9564515
GA | iter = 88 | Mean = 0.9518203 | Best = 0.9564515
GA | iter = 89 | Mean = 0.9537234 | Best = 0.9564515
GA | iter = 90 | Mean = 0.9531017 | Best = 0.9564515
GA | iter = 91 | Mean = 0.9514525 | Best = 0.9564515
GA | iter = 92 | Mean = 0.9505517 | Best = 0.9564515
GA | iter = 93 | Mean = 0.9524752 | Best = 0.9564515
GA | iter = 94 | Mean = 0.9533879 | Best = 0.9564515
GA | iter = 95 | Mean = 0.9519166 | Best = 0.9564515
GA | iter = 96 | Mean = 0.9524416 | Best = 0.9564515
GA | iter = 97 | Mean = 0.9526676 | Best = 0.9564515
GA | iter = 98 | Mean = 0.9523745 | Best = 0.9564515
GA | iter = 99 | Mean = 0.9523710 | Best = 0.9564515
GA | iter = 100 | Mean = 0.9519255 | Best = 0.9564515
> res@solution
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25
[1,]  1  1  1  0  0  0  0  1  1   0   1   0   1   1   0   1   0   0   1   0   1   1   1   1   1
> res@fitnessValue
[1] 0.9564515

This actually found the optimal solution for this small dataset.
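To reconstruct the selected sample from the GA result in R (reusing the res and df objects from the session above):

 x <- as.vector(res@solution[1, ])      # 0/1 selection vector
 h <- df$height1*(1-x) + df$height2*x   # selected heights
 w <- df$weight1*(1-x) + df$weight2*x   # selected weights
 cor(w, h)                              # 0.9564515, matches res@fitnessValue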

Conclusion

The optimally selected data that maximizes the Pearson correlation coefficient is:

----     92 PARAMETER result  selected observations

        height1     weight1     height2     weight2

i1                            67.445523  163.692389
i2                            68.649190  160.084811
i3                            69.503911  164.720010
i4    59.850261  145.704159
i5    65.341938  155.586984
i6    64.142009  154.335001
i7    67.030368  158.768813
i8                            73.236515  164.704340
i9                            63.279277  149.784500
i10   69.001848  160.063428
i11                           63.930208  152.827710
i12   70.719334  164.885704
i13                           63.614565  150.071072
i14                           70.533320  160.722483
i15   78.417676  172.298652
i16                           67.404942  158.310596
i17   62.504967  150.899428
i18   62.122630  150.024298
i19                           72.999194  166.771223
i20   74.935107  170.820610
i21                           60.372876  149.152520
i22                           76.748649  167.462369
i23                           75.433331  171.773607
i24                           65.717553  160.134888
i25                           67.301426  158.713710
