r - Selecting the sample that maximizes a correlation
Problem description
I have an optimization problem that I am struggling to code in R. I have a panel data set in which weight and height were observed twice for each worker.
worker weight height weight2 height2
1 120 60 125 60
2 152 55 156 66
3 222 55 100 20
I want to find the maximum correlation between weight and height, regardless of when the data were collected. For example, I want to determine whether the correlation is stronger when I use worker 1's weight and height or worker 1's weight2 and height2, and so on. What is the best way to determine the sample that maximizes this correlation?
Solution
Here are some possible approaches. The first uses a global MINLP solver (not R); the second demonstrates a GA (genetic algorithm) heuristic in R.
Solving as a nonconvex MINLP
A MINLP (mixed-integer nonlinear programming) model for this problem could look like:
max cor(h,w)
h[i] = height1[i]*(1-x[i]) + height2[i]*x[i]
w[i] = weight1[i]*(1-x[i]) + weight2[i]*x[i]
x[i] ∈ {0,1}
That is, x[i]=0 selects the first observation and x[i]=1 selects the second observation.
This requires a global solver such as Baron, Antigone, or Couenne.
Here is an example using Baron:
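For small instances the model can also be checked without a MINLP solver, since there are only 2^n possible selections. Below is a minimal brute-force sketch (in Python rather than R, using made-up data for the toy 3-worker example plus one extra row; not part of the original answer):

```python
from itertools import product

# Hypothetical two-period data: (height1, weight1, height2, weight2) per worker
data = [
    (60.0, 120.0, 60.0, 125.0),
    (55.0, 152.0, 66.0, 156.0),
    (55.0, 222.0, 20.0, 100.0),
    (62.0, 140.0, 64.0, 150.0),
]

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def select(sel):
    # sel[i]=0 picks observation 1, sel[i]=1 picks observation 2
    h = [r[0] if s == 0 else r[2] for s, r in zip(sel, data)]
    w = [r[1] if s == 0 else r[3] for s, r in zip(sel, data)]
    return h, w

# Enumerate all 2^n selections and keep the one with the highest correlation
best_sel = max(product([0, 1], repeat=len(data)),
               key=lambda sel: pearson(*select(sel)))
best_cor = pearson(*select(best_sel))
print(best_sel, round(best_cor, 4))
```

This scales as 2^n, so it is only a sanity check for tiny data sets; for the 25-observation example below that is already 2^25 ≈ 33 million combinations.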
---- 27 PARAMETER data
height1 weight1 height2 weight2
i1 67.433285 168.262871 67.445523 163.692389
i2 70.638374 174.437750 68.649190 160.084811
i3 71.317794 159.909672 69.503911 164.720010
i4 59.850261 145.704159 61.175728 142.708300
i5 65.341938 155.586984 68.483909 165.564991
i6 64.142009 154.335001 68.568683 166.169507
i7 67.030368 158.768813 65.780803 153.721717
i8 73.672863 175.126951 73.236515 164.704340
i9 65.203516 157.593587 63.279277 149.784500
i10 69.001848 160.063428 68.786656 162.278007
i11 64.455422 159.039195 63.930208 152.827710
i12 70.719334 164.885704 69.666096 157.356595
i13 65.688428 151.223468 63.614565 150.071072
i14 66.569252 160.978671 70.533320 160.722483
i15 78.417676 172.298652 80.070076 172.695207
i16 65.396154 158.234709 67.404942 158.310596
i17 62.504967 150.899428 61.000439 154.094647
i18 62.122630 150.024298 63.634554 153.644324
i19 70.598400 165.086523 72.999194 166.771223
i20 74.935107 170.820610 76.622182 169.013550
i21 63.233956 154.331546 60.372876 149.152520
i22 72.550105 173.961915 76.748649 167.462369
i23 74.086553 168.190867 75.433331 171.773607
i24 65.379648 163.577697 65.717553 160.134888
i25 64.003038 155.357607 67.301426 158.713710
---- 68 VARIABLE x.L select 1 or 2
i1 1.000000, i2 1.000000, i3 1.000000, i8 1.000000, i9 1.000000, i11 1.000000, i13 1.000000
i14 1.000000, i16 1.000000, i19 1.000000, i21 1.000000, i22 1.000000, i23 1.000000, i24 1.000000
i25 1.000000
---- 68 VARIABLE z.L = 0.956452 objective
---- 72 PARAMETER corr
all1 0.868691, all2 0.894532, optimal 0.956452
Notes:
- Zeros in x are not printed.
- The parameter corr shows the correlation for three cases: (a) all x[i]=0, (b) all x[i]=1, and (c) the optimal x.
- I ran this with GAMS/Baron (i.e., outside R).
Solving as a nonconvex MIQCP
With some effort, we can reformulate our MINLP model as a nonconvex quadratic model. That model can be solved with a solver like Gurobi (available under R). See the link for more details.
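One possible shape for such a quadratic reformulation (a sketch; the linked post may differ in the details) is to replace the cor() expression by bilinear constraints:

max z
h[i], w[i] as before (linear in x[i])
hbar = (1/n)·Σ h[i], wbar = (1/n)·Σ w[i]
c = Σ (h[i]-hbar)·(w[i]-wbar)
vh = Σ (h[i]-hbar)², vw = Σ (w[i]-wbar)²
s² = vh·vw, s ≥ 0
z·s = c

Here z = c/s is exactly the Pearson correlation (the missing 1/(n-1) factors cancel). All nonlinearities are now products of two variables, which is what makes the model a nonconvex MIQCP that solvers like Gurobi can handle.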
Using a metaheuristic
If we only want a good solution rather than a proven optimal one, we can use a metaheuristic such as a genetic algorithm. Here is a trial with ga from the GA package:
> df <- read.table(text="
+ id height1 weight1 height2 weight2
+ i1 67.433285 168.262871 67.445523 163.692389
+ i2 70.638374 174.437750 68.649190 160.084811
+ i3 71.317794 159.909672 69.503911 164.720010
+ i4 59.850261 145.704159 61.175728 142.708300
+ i5 65.341938 155.586984 68.483909 165.564991
+ i6 64.142009 154.335001 68.568683 166.169507
+ i7 67.030368 158.768813 65.780803 153.721717
+ i8 73.672863 175.126951 73.236515 164.704340
+ i9 65.203516 157.593587 63.279277 149.784500
+ i10 69.001848 160.063428 68.786656 162.278007
+ i11 64.455422 159.039195 63.930208 152.827710
+ i12 70.719334 164.885704 69.666096 157.356595
+ i13 65.688428 151.223468 63.614565 150.071072
+ i14 66.569252 160.978671 70.533320 160.722483
+ i15 78.417676 172.298652 80.070076 172.695207
+ i16 65.396154 158.234709 67.404942 158.310596
+ i17 62.504967 150.899428 61.000439 154.094647
+ i18 62.122630 150.024298 63.634554 153.644324
+ i19 70.598400 165.086523 72.999194 166.771223
+ i20 74.935107 170.820610 76.622182 169.013550
+ i21 63.233956 154.331546 60.372876 149.152520
+ i22 72.550105 173.961915 76.748649 167.462369
+ i23 74.086553 168.190867 75.433331 171.773607
+ i24 65.379648 163.577697 65.717553 160.134888
+ i25 64.003038 155.357607 67.301426 158.713710
+ ", header=T)
>
> #
> # print obvious cases
> #
> cor(df$weight1,df$height1)
[1] 0.8686908
> cor(df$weight2,df$height2)
[1] 0.894532
>
> #
> # fitness function
> #
> f <- function(x) {
+ w <- df$weight1*(1-x) + df$weight2*x
+ h <- df$height1*(1-x) + df$height2*x
+ cor(w,h)
+ }
>
> library(GA)
> res <- ga(type=c("binary"),fitness=f,nBits=25,seed=123)
GA | iter = 1 | Mean = 0.8709318 | Best = 0.9237155
GA | iter = 2 | Mean = 0.8742004 | Best = 0.9237155
GA | iter = 3 | Mean = 0.8736450 | Best = 0.9237155
GA | iter = 4 | Mean = 0.8742228 | Best = 0.9384788
GA | iter = 5 | Mean = 0.8746517 | Best = 0.9384788
GA | iter = 6 | Mean = 0.8792048 | Best = 0.9486227
GA | iter = 7 | Mean = 0.8844841 | Best = 0.9486227
GA | iter = 8 | Mean = 0.8816874 | Best = 0.9486227
GA | iter = 9 | Mean = 0.8805522 | Best = 0.9486227
GA | iter = 10 | Mean = 0.8820974 | Best = 0.9486227
GA | iter = 11 | Mean = 0.8859074 | Best = 0.9486227
GA | iter = 12 | Mean = 0.8956467 | Best = 0.9486227
GA | iter = 13 | Mean = 0.8989140 | Best = 0.9486227
GA | iter = 14 | Mean = 0.9069327 | Best = 0.9486227
GA | iter = 15 | Mean = 0.9078787 | Best = 0.9486227
GA | iter = 16 | Mean = 0.9069163 | Best = 0.9489443
GA | iter = 17 | Mean = 0.9104712 | Best = 0.9489443
GA | iter = 18 | Mean = 0.9169900 | Best = 0.9489443
GA | iter = 19 | Mean = 0.9175285 | Best = 0.9489443
GA | iter = 20 | Mean = 0.9207076 | Best = 0.9489443
GA | iter = 21 | Mean = 0.9210288 | Best = 0.9489443
GA | iter = 22 | Mean = 0.9206928 | Best = 0.9489443
GA | iter = 23 | Mean = 0.9210399 | Best = 0.9489443
GA | iter = 24 | Mean = 0.9208985 | Best = 0.9489443
GA | iter = 25 | Mean = 0.9183778 | Best = 0.9511446
GA | iter = 26 | Mean = 0.9217391 | Best = 0.9511446
GA | iter = 27 | Mean = 0.9274271 | Best = 0.9522764
GA | iter = 28 | Mean = 0.9271156 | Best = 0.9522764
GA | iter = 29 | Mean = 0.9275347 | Best = 0.9522764
GA | iter = 30 | Mean = 0.9278315 | Best = 0.9522764
GA | iter = 31 | Mean = 0.9300289 | Best = 0.9522764
GA | iter = 32 | Mean = 0.9306409 | Best = 0.9528777
GA | iter = 33 | Mean = 0.9309087 | Best = 0.9528777
GA | iter = 34 | Mean = 0.9327691 | Best = 0.9528777
GA | iter = 35 | Mean = 0.9309344 | Best = 0.9549574
GA | iter = 36 | Mean = 0.9341977 | Best = 0.9549574
GA | iter = 37 | Mean = 0.9374437 | Best = 0.9559043
GA | iter = 38 | Mean = 0.9394410 | Best = 0.9559043
GA | iter = 39 | Mean = 0.9405482 | Best = 0.9559043
GA | iter = 40 | Mean = 0.9432749 | Best = 0.9564515
GA | iter = 41 | Mean = 0.9441814 | Best = 0.9564515
GA | iter = 42 | Mean = 0.9458232 | Best = 0.9564515
GA | iter = 43 | Mean = 0.9469625 | Best = 0.9564515
GA | iter = 44 | Mean = 0.9462313 | Best = 0.9564515
GA | iter = 45 | Mean = 0.9449716 | Best = 0.9564515
GA | iter = 46 | Mean = 0.9444071 | Best = 0.9564515
GA | iter = 47 | Mean = 0.9437149 | Best = 0.9564515
GA | iter = 48 | Mean = 0.9446355 | Best = 0.9564515
GA | iter = 49 | Mean = 0.9455424 | Best = 0.9564515
GA | iter = 50 | Mean = 0.9456497 | Best = 0.9564515
GA | iter = 51 | Mean = 0.9461382 | Best = 0.9564515
GA | iter = 52 | Mean = 0.9444960 | Best = 0.9564515
GA | iter = 53 | Mean = 0.9434671 | Best = 0.9564515
GA | iter = 54 | Mean = 0.9451851 | Best = 0.9564515
GA | iter = 55 | Mean = 0.9481903 | Best = 0.9564515
GA | iter = 56 | Mean = 0.9477778 | Best = 0.9564515
GA | iter = 57 | Mean = 0.9481829 | Best = 0.9564515
GA | iter = 58 | Mean = 0.9490952 | Best = 0.9564515
GA | iter = 59 | Mean = 0.9505670 | Best = 0.9564515
GA | iter = 60 | Mean = 0.9499329 | Best = 0.9564515
GA | iter = 61 | Mean = 0.9509299 | Best = 0.9564515
GA | iter = 62 | Mean = 0.9505341 | Best = 0.9564515
GA | iter = 63 | Mean = 0.9519624 | Best = 0.9564515
GA | iter = 64 | Mean = 0.9518618 | Best = 0.9564515
GA | iter = 65 | Mean = 0.9523598 | Best = 0.9564515
GA | iter = 66 | Mean = 0.9516766 | Best = 0.9564515
GA | iter = 67 | Mean = 0.9521926 | Best = 0.9564515
GA | iter = 68 | Mean = 0.9524419 | Best = 0.9564515
GA | iter = 69 | Mean = 0.9532865 | Best = 0.9564515
GA | iter = 70 | Mean = 0.9535871 | Best = 0.9564515
GA | iter = 71 | Mean = 0.9536049 | Best = 0.9564515
GA | iter = 72 | Mean = 0.9534035 | Best = 0.9564515
GA | iter = 73 | Mean = 0.9532859 | Best = 0.9564515
GA | iter = 74 | Mean = 0.9521064 | Best = 0.9564515
GA | iter = 75 | Mean = 0.9534997 | Best = 0.9564515
GA | iter = 76 | Mean = 0.9539987 | Best = 0.9564515
GA | iter = 77 | Mean = 0.9536670 | Best = 0.9564515
GA | iter = 78 | Mean = 0.9526224 | Best = 0.9564515
GA | iter = 79 | Mean = 0.9531871 | Best = 0.9564515
GA | iter = 80 | Mean = 0.9527495 | Best = 0.9564515
GA | iter = 81 | Mean = 0.9526061 | Best = 0.9564515
GA | iter = 82 | Mean = 0.9525577 | Best = 0.9564515
GA | iter = 83 | Mean = 0.9525084 | Best = 0.9564515
GA | iter = 84 | Mean = 0.9519052 | Best = 0.9564515
GA | iter = 85 | Mean = 0.9518549 | Best = 0.9564515
GA | iter = 86 | Mean = 0.9511299 | Best = 0.9564515
GA | iter = 87 | Mean = 0.9505129 | Best = 0.9564515
GA | iter = 88 | Mean = 0.9518203 | Best = 0.9564515
GA | iter = 89 | Mean = 0.9537234 | Best = 0.9564515
GA | iter = 90 | Mean = 0.9531017 | Best = 0.9564515
GA | iter = 91 | Mean = 0.9514525 | Best = 0.9564515
GA | iter = 92 | Mean = 0.9505517 | Best = 0.9564515
GA | iter = 93 | Mean = 0.9524752 | Best = 0.9564515
GA | iter = 94 | Mean = 0.9533879 | Best = 0.9564515
GA | iter = 95 | Mean = 0.9519166 | Best = 0.9564515
GA | iter = 96 | Mean = 0.9524416 | Best = 0.9564515
GA | iter = 97 | Mean = 0.9526676 | Best = 0.9564515
GA | iter = 98 | Mean = 0.9523745 | Best = 0.9564515
GA | iter = 99 | Mean = 0.9523710 | Best = 0.9564515
GA | iter = 100 | Mean = 0.9519255 | Best = 0.9564515
> res@solution
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25
[1,] 1 1 1 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 1 1
> res@fitnessValue
[1] 0.9564515
This actually finds the optimal solution for this small data set.
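The GA is just one choice of metaheuristic; on a binary problem of this size, even simple bit-flip local search with random restarts tends to do well. A sketch of that alternative (in Python, on synthetic data; the data, seed, and restart count are all made up for illustration):

```python
import random

random.seed(42)

# Synthetic two-period data (height1, weight1, height2, weight2) per worker
n = 25
data = [(random.gauss(68, 4), random.gauss(160, 8),
         random.gauss(68, 4), random.gauss(160, 8)) for _ in range(n)]

def pearson(xs, ys):
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def fitness(sel):
    # Same objective as the GA fitness function above
    h = [r[0] if s == 0 else r[2] for s, r in zip(sel, data)]
    w = [r[1] if s == 0 else r[3] for s, r in zip(sel, data)]
    return pearson(h, w)

def hill_climb(sel):
    # Flip single bits as long as some flip improves the correlation
    best = fitness(sel)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            sel[i] ^= 1
            f = fitness(sel)
            if f > best:
                best, improved = f, True
            else:
                sel[i] ^= 1  # undo a non-improving flip
    return best

# Restart from the two trivial selections plus a handful of random ones
starts = [[0] * n, [1] * n] + \
         [[random.randint(0, 1) for _ in range(n)] for _ in range(20)]
best_cor = max(hill_climb(s) for s in starts)
print(round(best_cor, 4))
```

Like the GA, this offers no optimality guarantee; each restart only stops at a local optimum where no single flip helps.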
Conclusion
The selection of observations that maximizes the Pearson correlation coefficient is:
---- 92 PARAMETER result selected observations
height1 weight1 height2 weight2
i1 67.445523 163.692389
i2 68.649190 160.084811
i3 69.503911 164.720010
i4 59.850261 145.704159
i5 65.341938 155.586984
i6 64.142009 154.335001
i7 67.030368 158.768813
i8 73.236515 164.704340
i9 63.279277 149.784500
i10 69.001848 160.063428
i11 63.930208 152.827710
i12 70.719334 164.885704
i13 63.614565 150.071072
i14 70.533320 160.722483
i15 78.417676 172.298652
i16 67.404942 158.310596
i17 62.504967 150.899428
i18 62.122630 150.024298
i19 72.999194 166.771223
i20 74.935107 170.820610
i21 60.372876 149.152520
i22 76.748649 167.462369
i23 75.433331 171.773607
i24 65.717553 160.134888
i25 67.301426 158.713710