I generated a dummy dataset (the Cartesian product of iris x iris; it is meaningless, but in essence it is just a 22500 x 10 dataset).


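library(rbenchmark)
library(data.table)

iris_big <- merge(x = iris, y = iris, by = NULL)

iris_big_dt <- as.data.table(iris_big) #for data.table

benchmark("Base R" = {
            iris_big[order("Petal.Width.y")]
          },
          "dplyr" = {
            dplyr::arrange(iris_big, "Petal.Width.y")
          },
          "data.table" = {
            data.table::setorder(iris_big_dt, "Petal.Width.y")
          },
          replications = 30,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))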


| test       | replications   |elapsed|...|sys.self|
| --------   | -------------- |----   |---|---|
| Base R     | 30             |0.00   |...|0.00|
| data.table | 30             |0.04   |...|0.02|
| dplyr      | 30             |1.55   |...|0.00|

Why is base R so fast? Why is dplyr so slow? Am I doing something wrong? Thanks.

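  • iris_big[..] (i.e., the base::`[` primitive) without a comma selects columns, not rows. Add a trailing comma.

  • base::order("Petal.Width.y"), whether or not it is inside iris_big[..], always returns the single static value 1, because it is ordering a character vector of length 1 (i.e., c("Petal.Width.y"); it does not care that the string might refer to a column name in an enclosing frame). It therefore returns the first column without changing the row order. The fact that the return value has the wrong dimensions should be a strong hint that something is broken. (Thanks to @DonaldSeinen for the start of this comment.)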


    iris_big[1]     # just the first column
    iris_big[1,]    # just the first row


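  • Similarly, dplyr::arrange(iris_big, "Petal.Width.y") is broken in the same way: the quoted string is treated as a constant, so the rows are not reordered. A quick check that the column is non-decreasing fails for the quoted call and passes once the quotes are removed: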

    dplyr::arrange(iris_big, "Petal.Width.y") %>%
      summarize(nondecr = all(diff(Petal.Width.y) >= 0))
    #   nondecr
    # 1   FALSE


    dplyr::arrange(iris_big, Petal.Width.y) %>%
    summarize(nondecr = all(diff(Petal.Width.y) >= 0))
    #   nondecr
    # 1    TRUE
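The two base R pitfalls from the bullets above can be reproduced on the built-in iris data alone. This is only a minimal sketch; the names idx_wrong, idx_right and sorted are illustrative and do not come from the question.

```r
# order() on a string literal orders that one-element character vector,
# so the result is the single index 1 regardless of the data.
idx_wrong <- order("Petal.Width")
idx_wrong
# [1] 1

# Passing the column itself yields one index per row.
idx_right <- order(iris$Petal.Width)
length(idx_right)
# [1] 150

# Without a comma the index selects columns; with a comma it selects rows.
dim(iris[idx_wrong])      # first column only
# [1] 150   1
dim(iris[idx_wrong, ])    # first row only
# [1] 1 5
dim(iris[idx_right, ])    # all rows, reordered
# [1] 150   5

# Confirm that the correct form really sorted the column.
sorted <- iris[idx_right, ]
all(diff(sorted$Petal.Width) >= 0)
# [1] TRUE
```

Checking dim() on the result is a cheap guard: a sort should never change the number of rows or columns.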

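The "quoting" problem in the base and dplyr variants is confounded by the fact that base R does not use non-standard evaluation (NSE) here, dplyr requires NSE in arrange, and data.table::setorder appears to work with either quoted or unquoted names (despite its documentation saying "Do not quote column names"; see ?setorder).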
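When the column name really is held in a string, each package has its own way of looking it up. The sketch below is one possible approach rather than the only one; the variable col is hypothetical, and the dplyr and data.table parts run only if those packages are installed.

```r
col <- "Petal.Width"   # hypothetical: a column name stored as a string

# Base R uses standard evaluation, so extract the column explicitly.
by_base <- iris[order(iris[[col]]), ]

# dplyr: a bare string inside arrange() is just a constant;
# the .data pronoun looks the column up by name instead.
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_dplyr <- dplyr::arrange(iris, .data[[col]])
}

# data.table: setorderv() is the variant that takes column names as strings.
# It reorders dt by reference.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(iris)
  data.table::setorderv(dt, col)
}

all(diff(by_base[[col]]) >= 0)
# [1] TRUE
```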

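(The missing comma from the first bullet is also muddied by a data.table shortcut: iris_big_dt[1] returns the first row, not the first column. It probably saves thousands (?) of otherwise-unnecessary commas every year, but at the cost of ambiguity when reading base/data.table code.)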
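The difference is easy to see by comparing dimensions. This is a small sketch on a three-row slice of iris; df and dt are illustrative names, and the data.table part runs only if that package is installed.

```r
df <- head(iris, 3)

dim(df[1])      # data.frame, no comma: first column
# [1] 3 1
dim(df[1, ])    # data.frame, with comma: first row
# [1] 1 5

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  print(dim(dt[1]))   # data.table, no comma: first row
}
```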



ret1wrong1 <- iris_big[base::order("Petal.Width.y")]
ret1wrong2 <- iris_big[base::order("Petal.Width.y"),]      # add comma
ret1 <- iris_big[base::order(iris_big$Petal.Width.y),]     # unquote, add comma
ret2wrong <- dplyr::arrange(iris_big, "Petal.Width.y")
ret2 <- dplyr::arrange(iris_big, Petal.Width.y)            # unquote
ret3 <- data.table::setorder(iris_big_dt, "Petal.Width.y")

range(iris_big$Petal.Width.y) # informative
# [1] 0.1 2.5

head(ret1wrong1)          # wrong, single column
#   Sepal.Length.x
# 1            5.1
# 2            4.9
# 3            4.7
# 4            4.6
# 5            5.0
# 6            5.4
ret1wrong2                # wrong, single row
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
head(ret1)                # CORRECT
#      Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1351            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1352            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1353            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1354            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1355            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1356            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret1$Petal.Width.y) >= 0)
# [1] TRUE

head(ret2wrong)           # first petal.Width.y is 0.2 not 0.1
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 2            4.9           3.0            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 3            4.7           3.2            1.3           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 4            4.6           3.1            1.5           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 5            5.0           3.6            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 6            5.4           3.9            1.7           0.4    setosa            5.1           3.5            1.4           0.2    setosa
all(diff(ret2wrong$Petal.Width.y) >= 0)
# [1] FALSE
head(ret2)                # CORRECT
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret2$Petal.Width.y) >= 0)
# [1] TRUE

head(ret3)                # CORRECT
#    Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
#             <num>         <num>          <num>         <num>    <fctr>          <num>         <num>          <num>         <num>    <fctr>
# 1:            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2:            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3:            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4:            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5:            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6:            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret3$Petal.Width.y) >= 0)
# [1] TRUE

all.equal(ret1, ret2, check.attributes = FALSE)
# [1] TRUE
all.equal(ret1, ret3, check.attributes = FALSE)
# [1] TRUE




library(data.table)
iris_big_dt1 <- as.data.table(iris_big) #for data.table
iris_big_dt2 <- as.data.table(iris_big) #for data.table

bench::mark(
  "Base R" = {
    iris_big[base::order(iris_big$Petal.Width.y),]
  },
  "dplyr" = {
    dplyr::arrange(iris_big, Petal.Width.y)
  },
  "data.table 1" = {
    data.table::setorder(iris_big_dt1, "Petal.Width.y")
  },
  "data.table 2" = {
    data.table::setorder(copy(iris_big_dt2), "Petal.Width.y")
  },
  min_iterations = 1000,
  check = FALSE)
# # A tibble: 4 x 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time             gc                  
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>           <list>              
# 1 Base R         3.33ms   3.61ms      262.    1.97MB    5.08    981    19      3.74s <NULL> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 2 dplyr          3.75ms   4.32ms      216.    1.63MB    3.74    983    17      4.55s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 3 data.table 1   1.19ms   1.37ms      713.   87.94KB    0.714   999     1       1.4s <NULL> <Rprofmem[,3] [1 x 3]>  <bch:tm [1,000]> <tibble [1,000 x 3]>
# 4 data.table 2   2.66ms   3.26ms      304.    1.84MB    5.56    982    18      3.23s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>

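all(diff(iris_big_dt1$Petal.Width.y) >= 0)   # sorted in place by the benchmark
# [1] TRUE
all(diff(iris_big_dt2$Petal.Width.y) >= 0)   # only copies were sorted
# [1] FALSE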

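I included two versions of the data.table variant because one could argue that sorting an already-sorted table (a consequence of its by-reference, in-place operation) makes every sort after the first one faster. Even with the added overhead of copy-ing the data on every iteration, the data.table 2 variant is still faster than Base R and dplyr in this run (median 3.26ms versus 3.61ms and 4.32ms).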
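The by-reference behaviour that motivates the second variant can be checked directly. The following is a minimal sketch on the built-in iris data, assuming data.table is installed; is_sorted is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: TRUE when a numeric vector is non-decreasing.
is_sorted <- function(x) all(diff(x) >= 0)

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(iris)
  before <- is_sorted(dt$Petal.Width)        # FALSE: original row order

  # Sorting a copy leaves dt itself untouched.
  tmp <- data.table::setorder(data.table::copy(dt), Petal.Width)
  after_copy <- is_sorted(dt$Petal.Width)    # still FALSE

  # Sorting dt directly reorders it in place, with no assignment needed.
  data.table::setorder(dt, Petal.Width)
  after_inplace <- is_sorted(dt$Petal.Width) # TRUE

  print(c(before = before, after_copy = after_copy,
          after_inplace = after_inplace))
}
```

This is why a benchmark that calls setorder on the same table repeatedly measures mostly already-sorted input after the first iteration.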
