R data.table: sort by group, with "other" at the bottom of each group

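Problem description

I can't quite get the syntax right. I have a data.table that I want to sort first by a grouping column g1 (an ordered factor), and then by another column n in descending order. The only catch is that I want rows labeled "other" in a third column, g2, to appear at the bottom of each group, regardless of their value of n.

Example: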

library(data.table)

dt <- data.table(g1 = factor(rep(c('Australia', 'Mexico', 'Canada'), 3), levels = c('Australia', 'Canada', 'Mexico')),
                 g2 = rep(c('stuff', 'things', 'other'), each = 3),
                 n = c(1000, 2000, 3000, 5000, 100, 3500, 10000, 10000, 0))

This is the expected output: within each g1, rows are sorted by n in descending order, except that rows with g2 == 'other' always come last:

         g1     g2     n
1: Australia things  5000
2: Australia  stuff  1000
3: Australia  other 10000
4:    Canada things  3500
5:    Canada  stuff  3000
6:    Canada  other     0
7:    Mexico  stuff  2000
8:    Mexico things   100
9:    Mexico  other 10000

Tags: r, data.table

Solution


Use order() inside the data.table i argument, where a leading - on a column reverses its sort direction:

dt[order(g1, g2 == "other", -n), ]
#           g1     g2     n
#       <fctr> <char> <num>
# 1: Australia things  5000
# 2: Australia  stuff  1000
# 3: Australia  other 10000
# 4:    Canada things  3500
# 5:    Canada  stuff  3000
# 6:    Canada  other     0
# 7:    Mexico  stuff  2000
# 8:    Mexico things   100
# 9:    Mexico  other 10000

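We add g2 == "other" because you said "other" should always come last. For example, if "stuff" were "abc" instead, we can see the difference in behavior with and without that term: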

dt[ g2 == "stuff", g2 := "abc" ]
dt[order(g1, -n), ]
#           g1     g2     n
#       <fctr> <char> <num>
# 1: Australia  other 10000
# 2: Australia things  5000
# 3: Australia    abc  1000
# 4:    Canada things  3500
# 5:    Canada    abc  3000
# 6:    Canada  other     0
# 7:    Mexico  other 10000
# 8:    Mexico    abc  2000
# 9:    Mexico things   100

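# Note: after the g2 == "other" key, this call sorts by descending g2 (-g2)
# rather than by -n, which is why Mexico lists things (100) before abc (2000).
dt[order(g1, g2 == "other", -g2), ]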
#           g1     g2     n
#       <fctr> <char> <num>
# 1: Australia things  5000
# 2: Australia    abc  1000
# 3: Australia  other 10000
# 4:    Canada things  3500
# 5:    Canada    abc  3000
# 6:    Canada  other     0
# 7:    Mexico things   100
# 8:    Mexico    abc  2000
# 9:    Mexico  other 10000

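One drawback of this approach is that setorder does not work directly, because it accepts only column names and not expressions: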

setorder(dt, g1, g2 == "other", -n)
# Error in setorderv(x, cols, order, na.last) : 
#   some columns are not in the data.table: ==,other

So we need to reorder with order() and assign the result back to dt.
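A minimal sketch of two ways to do that, using the example data from the question. The first builds a sorted copy that can be assigned back to dt. The second is an assumption about what might be acceptable in practice: it stores the condition in a temporary column so that setorder can sort by reference. The names sorted_copy, dt_inplace and is_other are hypothetical and introduced only for illustration.

```r
library(data.table)

dt <- data.table(
  g1 = factor(rep(c("Australia", "Mexico", "Canada"), 3),
              levels = c("Australia", "Canada", "Mexico")),
  g2 = rep(c("stuff", "things", "other"), each = 3),
  n  = c(1000, 2000, 3000, 5000, 100, 3500, 10000, 10000, 0)
)

# Option 1: build the sorted copy; `dt <- dt[order(...)]` would assign it back.
sorted_copy <- dt[order(g1, g2 == "other", -n)]

# Option 2: setorder() only accepts column names, so materialise the
# condition as a temporary column, sort by reference, then drop it.
# `is_other` is a hypothetical helper name, not part of the original data.
dt_inplace <- copy(dt)
dt_inplace[, is_other := g2 == "other"]
setorder(dt_inplace, g1, is_other, -n)
dt_inplace[, is_other := NULL]

# Both approaches should give the same row order.
identical(sorted_copy$g2, dt_inplace$g2)
```

Option 1 is the simpler choice for small tables; the helper-column variant may be worth considering only when copying the table is a concern.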

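As an aside: this works because g2 == "other" evaluates to a logical vector, and when sorting, its values are treated as 0 (FALSE) and 1 (TRUE), so rows where the condition is false come before rows where it is true.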
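A small base R illustration of that ordering, independent of data.table. The vector name flags is made up for this sketch.

```r
# A logical vector sorts with FALSE before TRUE.
flags <- c(TRUE, FALSE, TRUE, FALSE)

as.integer(flags)  # 1 0 1 0: the numeric values the ordering corresponds to
order(flags)       # 2 4 1 3: positions of the FALSE values come first
sort(flags)        # FALSE FALSE TRUE TRUE
```

This is why placing g2 == "other" before -n in order() pushes the "other" rows to the end of each g1 group.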

