How do I select a value in a df based on three conditions on three different variables?

Problem description

I have a data frame like the following:

set.seed(123)
df <- data.frame(Delay=rep(-5:5, times=4, each=1),
                 ID= rep(c("A","B","C","D"), times=1, each=11),
                 variable=rep(c("R2","SE"), times=11, each=1),
                 value=sample(seq(0, 1, by=0.01), 44, replace=TRUE))

df$ID <- as.factor(df$ID)
df$variable <- as.factor(df$variable)

head(df)
  Delay ID variable value
1    -5  A       R2  0.30
2    -4  A       SE  0.78
3    -3  A       R2  0.50
4    -2  A       SE  0.13
5    -1  A       R2  0.66
6     0  A       SE  0.41

I want to get the value of Delay for the row where ID=="B", variable=="R2", and value is at its minimum.

How can I find this value?

Tags: r

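Solution

The solution itself does not depend on the R version, but the result (here) is sensitive to the random data, which evidently changed somewhere between R-3.5.3 and R-4.0.0.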

R-3.5.3

with(df[order(df$value),], Delay[ID == "B" & variable == "R2"])
# [1] -2  0  2 -4  4
with(df[order(df$value),], Delay[ID == "B" & variable == "R2"][1])
# [1] -2

dput(df)
# structure(list(Delay = c(-5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L), ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), variable = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), class = "factor", .Label = c("R2", "SE")), value = c(0.29, 0.79, 0.41, 0.89, 0.94, 0.04, 0.53, 0.9, 0.55, 0.46, 0.96, 0.45, 0.68, 0.57, 0.1, 0.9, 0.24, 0.04, 0.33, 0.96, 0.89, 0.69, 0.64, 1, 0.66, 0.71, 0.54, 0.6, 0.29, 0.14, 0.97, 0.91, 0.69, 0.8, 0.02, 0.48, 0.76, 0.21, 0.32, 0.23, 0.14, 0.41, 0.41, 0.37)), row.names = c(NA, -44L), class = "data.frame")

R-4.0.0

with(df[order(df$value),], Delay[ID == "B" & variable == "R2"])
# [1]  4 -4 -2  0  2
with(df[order(df$value),], Delay[ID == "B" & variable == "R2"][1])
# [1] 4

dput(df)
# structure(list(Delay = c(-5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L, -5L, -4L, -3L, -2L, -1L, 0L, 1L, 2L, 3L, 4L, 5L), ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), variable = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("R2", "SE"), class = "factor"), value = c(0.3, 0.78, 0.5, 0.13, 0.66, 0.41, 0.49, 0.42, 1, 0.13, 0.24, 0.89, 0.9, 0.68, 0.9, 0.56, 0.91, 0.08, 0.92, 0.98, 0.71, 0.25, 0.06, 0.41, 0.08, 0.82, 0.35, 0.77, 0.8, 0.42, 0.75, 0.14, 0.31, 0.06, 0.08, 0.4, 0.73, 0.22, 0.26, 0.59, 0.52, 0.06, 0.52, 0.26)), row.names = c(NA, -44L), class = "data.frame")
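As a variation on the one-liner above, one can filter on the two conditions first and then use which.min(), rather than sorting the whole data frame. The sketch below is only an illustration: it hand-copies the ID == "B" rows from the R-4.0.0 dput() output so that it does not depend on the RNG, and the names b, sub and best_delay are hypothetical, not part of the original answer.

```r
# Rows for ID "B", copied by hand from the R-4.0.0 dput() output above.
b <- data.frame(
  Delay    = -5:5,
  ID       = "B",
  variable = rep(c("SE", "R2"), length.out = 11),
  value    = c(0.89, 0.90, 0.68, 0.90, 0.56, 0.91, 0.08, 0.92, 0.98, 0.71, 0.25)
)

# Keep only the rows that satisfy both conditions ...
sub <- b[b$ID == "B" & b$variable == "R2", ]

# ... then take the Delay on the row where value is smallest.
best_delay <- sub$Delay[which.min(sub$value)]
best_delay
# [1] 4
```

Note that which.min() returns only the first minimum, so ties are resolved by row order, just as indexing with [1] does after order().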

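Where they differ

The "randomness" of the data is sensitive to the R version.

If you are curious, the three (non-random) columns on the left are identical; only the value column differs. Combining the two dfs (and naming the value columns after the R version) gives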

df
#    Delay ID variable R-3.5.3 R-4.0.0
# 1     -5  A       R2    0.29    0.30
# 2     -4  A       SE    0.79    0.78
# 3     -3  A       R2    0.41    0.50
# 4     -2  A       SE    0.89    0.13
# 5     -1  A       R2    0.94    0.66
# 6      0  A       SE    0.04    0.41
# 7      1  A       R2    0.53    0.49
# 8      2  A       SE    0.90    0.42
# 9      3  A       R2    0.55    1.00
# 10     4  A       SE    0.46    0.13
# 11     5  A       R2    0.96    0.24
# 12    -5  B       SE    0.45    0.89
# 13    -4  B       R2    0.68    0.90
# 14    -3  B       SE    0.57    0.68
# 15    -2  B       R2    0.10    0.90
# 16    -1  B       SE    0.90    0.56
# 17     0  B       R2    0.24    0.91
# 18     1  B       SE    0.04    0.08
# 19     2  B       R2    0.33    0.92
# 20     3  B       SE    0.96    0.98
# 21     4  B       R2    0.89    0.71
# 22     5  B       SE    0.69    0.25
# 23    -5  C       R2    0.64    0.06
# 24    -4  C       SE    1.00    0.41
# 25    -3  C       R2    0.66    0.08
# 26    -2  C       SE    0.71    0.82
# 27    -1  C       R2    0.54    0.35
# 28     0  C       SE    0.60    0.77
# 29     1  C       R2    0.29    0.80
# 30     2  C       SE    0.14    0.42
# 31     3  C       R2    0.97    0.75
# 32     4  C       SE    0.91    0.14
# 33     5  C       R2    0.69    0.31
# 34    -5  D       SE    0.80    0.06
# 35    -4  D       R2    0.02    0.08
# 36    -3  D       SE    0.48    0.40
# 37    -2  D       R2    0.76    0.73
# 38    -1  D       SE    0.21    0.22
# 39     0  D       R2    0.32    0.26
# 40     1  D       SE    0.23    0.59
# 41     2  D       R2    0.14    0.52
# 42     3  D       SE    0.41    0.06
# 43     4  D       R2    0.41    0.52
# 44     5  D       SE    0.37    0.26
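The answer does not show how the combined table was built. One possible way, sketched here under the assumption that the two data frames are available side by side, is to join them on the three non-random columns with merge(). The names keys, df_353, df_400 and both, and the column suffixes, are made up for this illustration, and only the first three rows are typed in.

```r
# First three rows only; values copied from the two dput() outputs above.
keys   <- data.frame(Delay = -5:-3, ID = "A", variable = c("R2", "SE", "R2"))
df_353 <- cbind(keys, value = c(0.29, 0.79, 0.41))  # as generated by R-3.5.3
df_400 <- cbind(keys, value = c(0.30, 0.78, 0.50))  # as generated by R-4.0.0

# Join on the three key columns; the suffixes label the two value columns.
both <- merge(df_353, df_400,
              by = c("Delay", "ID", "variable"),
              suffixes = c("_R353", "_R400"))
both
```

Because the key columns are identical and in the same order in both data frames, a plain cbind() of the second value column would work too; merge() is just the safer choice when row order is not guaranteed.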

Why they differ

As @KonradRudolph suggested, this changed in R 3.6, where, as I read the release notes:

    * The default method for generating from a discrete uniform
      distribution (used in sample(), for instance) has been changed.
      This addresses the fact, pointed out by Ottoboni and Stark, that
      the previous method made sample() noticeably non-uniform on large
      populations.  See PR#17494 for a discussion.  The previous method
      can be requested using RNGkind() or RNGversion() if necessary for
      reproduction of old results.  Thanks to Duncan Murdoch for
      contributing the patch and Gabe Becker for further assistance.

      The output of RNGkind() has been changed to also return the
      'kind' used by sample().

(Source: https://stat.ethz.ch/pipermail/r-announce/2019/000641.html and https://cran.r-project.org/doc/manuals/r-release/NEWS.3.html)
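The release note says the previous method can be requested through RNGkind() or RNGversion(). A minimal sketch of that, assuming R 3.6.0 or later (where the sample.kind argument exists), is shown below; the variable names old_style and new_style are illustrative only.

```r
# Requires R >= 3.6.0. Ask for the pre-3.6 sampler; R warns that the
# "Rounding" sampler is non-uniform, which is silenced here.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(123)
old_style <- sample(seq(0, 1, by = 0.01), 44, replace = TRUE)

# Switch back to the current default sampler.
RNGkind(sample.kind = "Rejection")
set.seed(123)
new_style <- sample(seq(0, 1, by = 0.01), 44, replace = TRUE)

head(old_style)  # should correspond to the R-3.5.3 value column above
head(new_style)  # should correspond to the R-4.0.0 value column above
RNGkind()        # third element reports the sample() kind in use
```

Calling suppressWarnings(RNGversion("3.5.0")) before set.seed() is an alternative that resets all generator settings to those of the older release at once.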

