python - How to apply aggregations (sum, mean, max, min, etc.) across columns in pydatatable?

Problem description

I have a datatable Frame:


import datatable as dt
from datatable import f, by

DT_X = dt.Frame({
    'issue': ['cs-1', 'cs-2', 'cs-3', 'cs-1', 'cs-3', 'cs-2'],
    'speech': [1, 1, 1, 0, 1, 1],
    'narrative': [1, 0, 1, 1, 1, 0],
    'thought': [0, 1, 1, 0, 1, 1]
})

It looks like this:

Out[5]: 
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          1        0
 1 | cs-2        1          0        1
 2 | cs-3        1          1        1
 3 | cs-1        0          1        0
 4 | cs-3        1          1        1
 5 | cs-2        1          0        1

[6 rows x 4 columns]

I now compute a grouped sum over the values of all 3 columns:

DT_X[:,{'speech': dt.sum(f.speech),
        'narrative': dt.sum(f.narrative),
        'thought': dt.sum(f.thought)},
        by(f.issue)]

which produces this output:

Out[6]: 
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          2        0
 1 | cs-2        2          0        2
 2 | cs-3        2          2        2

[3 rows x 4 columns]

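Here I spelled out each field name and the aggregation function (dt.sum) by hand. With only 3 columns that is easy enough, but what if I have to handle more than 10, 20, or even more fields?

Is there another way to do this?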

For reference, R data.table offers the same functionality as follows:

DT[,lapply(.SD,sum),by=.(issue),.SDcols=c('speech','narrative','thought')]

Tags: python, py-datatable

Solution

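Most functions in datatable, including sum(), are automatically applied to every column when they are given a multi-column set as an argument. So R's lapply(.SD, sum) becomes simply sum(.SD), except that Python has no .SD; instead we use the f symbol and combinations of it. In your case, f[:] selects all columns other than the groupby column, so it is essentially equivalent to .SD.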

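Second, all unary functions (i.e. functions that act on a single column, as opposed to binary ones such as + or corr) pass through the names of their columns. Therefore sum(f[:]) produces a set of columns with the same names as in f[:].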

Putting this all together:

>>> from datatable import by, sum, f, dt

>>> DT_X[:, sum(f[:]), by(f.issue)]
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          2        0
 1 | cs-2        2          0        2
 2 | cs-3        2          2        2

[3 rows x 4 columns]
