Subsetting a data frame by another data frame does not give the expected result

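Problem description

I have two data frames, df1 and df2.

df1 contains 2 columns, t1 and data1. t1 runs from 0.0001 to 75 in increments of 0.0001, so it looks like 0.0001, 0.0002, 0.0003, ..., 74.9999, 75.0000. data1 is just some numbers between 0 and 1.

df2 also contains 2 columns, t2 and data2, but each column has length 114. The time column holds only selected values between 0.0001 and 75, e.g. 14.6000, 15.2451, ..., 73.4568. data2 is again some random numbers, of length 114. I extracted the values of t2 from another data set.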

t2<- c(14.6000, 14.6001, 14.6002, 14.6002, 14.6007, 14.6011, 14.6016, 14.602, 14.6037, 14.6055, 14.6072, 14.6089, 14.6151, 14.6214, 14.6277, 14.6339, 14.6402, 14.6545, 14.6688, 14.6831, 14.6974, 14.7117, 14.7261, 14.7573, 14.7886, 14.8199, 14.8511, 14.8824, 14.9137, 14.9681, 15.0225, 15.0768, 15.1312, 15.1856, 15.24, 15.3233, 15.4065, 15.4897, 15.573, 15.6562, 15.7394, 15.8768, 16.0142, 16.1516, 16.289, 16.4264, 16.5638, 16.7676, 16.9715, 17.1753, 17.3792, 17.583, 17.7868, 17.9907, 18.3366, 18.6826, 19.0285, 19.3745, 19.7204, 20.0664, 20.4124, 20.9122, 21.412, 21.9118, 22.4116, 22.9114, 23.4112, 23.911, 24.5965, 25.282, 25.9675, 26.653, 27.3385, 28.024, 29.1158, 30.2075, 31.2993, 32.3911, 33.4828, 34.6828, 35.8828, 37.0828, 38.2828, 39.4828, 40.6828, 41.8828, 43.0828, 44.2828, 45.4828, 46.6828, 47.8828, 49.0828, 50.2828, 51.4828, 52.6828, 53.8828, 55.0828, 56.2828, 57.4828, 58.6828, 59.8828, 61.0828, 62.2828, 63.4828, 64.6828, 65.8828, 67.0828, 68.2828, 69.4828, 70.6828, 71.8828, 73.0828, 74.2828,74.6000)


df1<- data.frame("t1"=seq(0.0001,75,0.0001), "data1"=c(rnorm(750000)))

df2<- data.frame("t2"=t2, "data2"=c(rnorm(length(t2))))

I want to create a new data frame, df_new, containing the values of t2 together with the corresponding data1 from df1:

df_new<- subset(df1,t1 %in% df2$t2)

When I do this, df_new has only 74 observations instead of 114. Am I doing something wrong here?

Tags: r, dataframe, subset

Solution


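This looks like a floating-point arithmetic issue. See the two examples below. In general, comparing floating-point numbers directly like this is not robust, because most decimal fractions cannot be represented exactly in binary. I picked out the first element of df2$t2 that does not line up as expected. You would expect the first == comparison to return TRUE, but it does not. all.equal, which (somewhat confusingly) tests for "near equality", does return TRUE for the two values I pulled out. You can see that the values differ by changing the number of digits printed with options.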
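The same effect can be seen in a much smaller case. The following is only a minimal sketch, unrelated to the asker's data; the values 0.1, 0.2 and 0.3 and the tolerance 1e-9 are arbitrary choices made for illustration.

```r
# Two values that are "the same" on paper but differ in their last bits
a <- 0.1 + 0.2
b <- 0.3

exact_equal <- a == b                    # exact, bit-for-bit comparison
near_equal  <- isTRUE(all.equal(a, b))   # comparison within all.equal's default tolerance
tol_equal   <- abs(a - b) < 1e-9         # explicit tolerance (1e-9 is an assumption)

print(c(exact = exact_equal, near = near_equal, tol = tol_equal))
```

Wrapping all.equal in isTRUE matters in real code, because all.equal returns a character description rather than FALSE when the values differ.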

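One way to get the expected result is to use round, so that all the numbers you want to match become identical. Note that the output has only 113 rows, because the df2$t2 you provided contains only 113 unique values (14.6002 appears twice). You could also consider converting to integers (in a correspondingly smaller unit).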

t2<- c(14.6000, 14.6001, 14.6002, 14.6002, 14.6007, 14.6011, 14.6016, 14.602, 14.6037, 14.6055, 14.6072, 14.6089, 14.6151, 14.6214, 14.6277, 14.6339, 14.6402, 14.6545, 14.6688, 14.6831, 14.6974, 14.7117, 14.7261, 14.7573, 14.7886, 14.8199, 14.8511, 14.8824, 14.9137, 14.9681, 15.0225, 15.0768, 15.1312, 15.1856, 15.24, 15.3233, 15.4065, 15.4897, 15.573, 15.6562, 15.7394, 15.8768, 16.0142, 16.1516, 16.289, 16.4264, 16.5638, 16.7676, 16.9715, 17.1753, 17.3792, 17.583, 17.7868, 17.9907, 18.3366, 18.6826, 19.0285, 19.3745, 19.7204, 20.0664, 20.4124, 20.9122, 21.412, 21.9118, 22.4116, 22.9114, 23.4112, 23.911, 24.5965, 25.282, 25.9675, 26.653, 27.3385, 28.024, 29.1158, 30.2075, 31.2993, 32.3911, 33.4828, 34.6828, 35.8828, 37.0828, 38.2828, 39.4828, 40.6828, 41.8828, 43.0828, 44.2828, 45.4828, 46.6828, 47.8828, 49.0828, 50.2828, 51.4828, 52.6828, 53.8828, 55.0828, 56.2828, 57.4828, 58.6828, 59.8828, 61.0828, 62.2828, 63.4828, 64.6828, 65.8828, 67.0828, 68.2828, 69.4828, 70.6828, 71.8828, 73.0828, 74.2828,74.6000)

set.seed(12345)
df1<- data.frame("t1"=seq(0.0001,75,0.0001), "data1"=c(rnorm(750000)))

df2<- data.frame("t2"= t2, "data2"=c(rnorm(length(t2))))

df2$t2[2]
#> [1] 14.6001
df1$t1[146001]
#> [1] 14.6001

df1$t1[146001] == df2$t2[2]
#> [1] FALSE
all.equal(df1$t1[146001], df2$t2[2])
#> [1] TRUE

options(digits = 22)
df2$t2[2]
#> [1] 14.600099999999999
df1$t1[146001]
#> [1] 14.600100000000001

df_new_rnd <- subset(df1, round(t1, 4) %in% round(df2$t2, 4))
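# round() before as.integer(): as.integer() truncates, so a product such as 146000.99999 would otherwise become 146000
df_new_int <- subset(df1, as.integer(round(t1 * 10000)) %in% as.integer(round(df2$t2 * 10000)))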
nrow(df_new_rnd)
#> [1] 113
nrow(df_new_int)
#> [1] 113

Created on 2018-05-22 by the reprex package (v0.2.0).
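If the goal is one row per row of df2 (114 rows, duplicates included) with data1 and data2 side by side, one option is to look up each t2 in df1 through an integer key. This is only a sketch under stated assumptions: the small df1 and df2 below are hypothetical stand-ins for the real data, the key column and the idx variable are names introduced here, and the factor 10000 assumes times are given to four decimal places.

```r
# Small hypothetical stand-ins for df1 and df2 (not the original data)
df1 <- data.frame(t1 = seq(0.0001, 1, 0.0001), data1 = seq_len(10000))
df2 <- data.frame(t2 = c(0.0003, 0.3, 0.3, 0.7001), data2 = c(10, 20, 30, 40))

# Integer keys in units of 1e-4; round() first because as.integer() truncates
df1$key <- as.integer(round(df1$t1 * 10000))
df2$key <- as.integer(round(df2$t2 * 10000))

# match() gives, for each row of df2, the position of its key in df1
# (NA if a t2 value has no counterpart in df1)
idx <- match(df2$key, df1$key)

# One output row per row of df2, so duplicated t2 values are kept
df_new <- data.frame(t2 = df2$t2, data1 = df1$data1[idx], data2 = df2$data2)
print(df_new)
```

Unlike subset with %in%, which returns each matching row of df1 once, match keeps the order and multiplicity of df2.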

