How to get same grouping result using data.table comparing to the sqldf?

Problem description

I am trying to implement an SQL query using both sqldf and data.table.
I need to do this separately with each of the two libraries.
Unfortunately, I cannot reproduce the sqldf result using data.table.

library(sqldf)
library(data.table)

Id       <- c(1,2,3,4)
HasPet   <- c(0,0,1,1)
Age      <- c(20,1,14,10)

Posts <- data.table(Id, HasPet, Age)

# sqldf way
ref <- sqldf("
      SELECT Id, HasPet, MAX(Age) AS MaxAge
      FROM Posts
      GROUP BY HasPet
  ")

# data.table way
res <- Posts[,
      list(Id, HasPet, MaxAge=max(Age)),
      by=list(HasPet)]

head(ref)
head(res)

Output for sqldf is:

> head(ref)
  Id HasPet MaxAge
1  1      0     20
2  3      1     14

while the output for data.table is different:

> head(res)
   HasPet Id HasPet MaxAge
1:      0  1      0     20
2:      0  2      0     20
3:      1  3      1     14
4:      1  4      1     14

Please note that the SQL query cannot be modified.

Tags: sql, r, data.table, sqldf

Solution


This comes up a lot with data.table. If you want the row holding the max or min by group, the best way is a self-join. It's fast, and only a little arcane.

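You can build it up step by step: in data.table, you select rows in i, compute in j, and group with by. So the first step is to find the thing we want within each level of the group: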

Posts[, Age == max(Age), by = HasPet]
#    HasPet    V1
# 1:      0  TRUE
# 2:      0 FALSE
# 3:      1  TRUE
# 4:      1 FALSE

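We can use .I to get the row number of each row in the original table, and then index it within each group by the TRUE/FALSE logical vector that was V1 above, so that we keep only the row numbers of the rows holding the max value per group.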

Posts[, .I[Age == max(Age)], by=HasPet]

# From the data.table special symbols help:
# .I is an integer vector equal to seq_len(nrow(x)). While grouping,
# it holds for each item in the group, its row location in x. This is useful
# to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].

#    HasPet V1
# 1:      0  1
# 2:      1  3

Then we use the V1 column of row numbers we just made to pull those specific rows (1 and 3) out of the data.table. That's it!

Posts[Posts[, .I[Age == max(Age)], by=HasPet]$V1]
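
One thing to watch: the subset keeps the original column names (Id, HasPet, Age), while the sqldf result calls the last column MaxAge. Here is a minimal sketch of the whole thing with the rename added, assuming you only need the names and rows to match. The names idx, Ties, all_max and first_max are mine, not from the question, and the tie data is made up for illustration.

```r
library(data.table)

# Same toy data as in the question
Posts <- data.table(Id     = c(1, 2, 3, 4),
                    HasPet = c(0, 0, 1, 1),
                    Age    = c(20, 1, 14, 10))

# Row numbers (in Posts) of the max-Age row within each HasPet group
idx <- Posts[, .I[Age == max(Age)], by = HasPet]$V1

# Subsetting by i returns a new data.table, so renaming res leaves Posts alone
res <- Posts[idx]
setnames(res, "Age", "MaxAge")
print(res)

# Hypothetical data with a tie for the max in the HasPet == 0 group
Ties <- data.table(Id = 1:4, HasPet = c(0, 0, 1, 1), Age = c(20, 20, 14, 10))

# Age == max(Age) keeps every tied row; which.max() keeps only the first one
all_max   <- Ties[, .I[Age == max(Age)], by = HasPet]$V1
first_max <- Ties[, .I[which.max(Age)], by = HasPet]$V1
```

If your real data can have ties, the which.max() form is probably closer to what the SQL query does, since the query returns a single row per group. Which of the tied rows SQLite picks is not something I would rely on, though.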
