How to group and sum certain columns of an array based on a classification (e.g. grouping cities by country)

Problem description

Question

I have arrays that track certain items over time. The items belong to certain categories. I want to compute sums by time and category, for example going from a table by time and city to a table by time and country.

I have found several ways to do it, but they all seem clumsy. There must be a better way! Surely I'm not the first person to run into this problem? Maybe using np.where?

More specifically:

I have a number of numpy arrays of shape (p x i), where p is the number of periods and i is the number of items I track over time. I then have a separate array of shape i that classifies the items into categories (red, green, yellow, etc.).

What I want to do is compute an array of shape (p x number of unique categories) that sums the values of the big array by time and category. In a picture:

[figure: the (periods x items) array collapsed to a (periods x categories) array by summing each category's columns]
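To make the desired operation concrete, here is a small sketch with made-up numbers, using a plain boolean-mask sum per category (nothing optimised):

import numpy as np

# toy data: 3 periods, 4 items
values = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2]])
categories = np.array(['red', 'green', 'red', 'green'])

# one output column per unique category (sorted), each holding the sum of
# that category's item columns for every period
cats = np.unique(categories)   # ['green', 'red']
result = np.column_stack([values[:, categories == c].sum(axis=1) for c in cats])
# result:
# [[ 6  4]
#  [14 12]
#  [ 2 10]]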

I need the code to be as efficient as possible, because I need to do this many times on arrays of up to 400 x 1,000,000.

What I've tried:

This question covers a number of ways of grouping without resorting to pandas. I like the scipy.ndimage approach, but AFAIK it only works in one dimension.
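For reference, this is roughly what that 1-D grouping looks like; a sketch reusing the toy numbers from above (the string labels have to be mapped to integer codes first, and the 2-D case needs a Python-level loop over the rows):

import numpy as np
from scipy import ndimage

values = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2]])
categories = np.array(['red', 'green', 'red', 'green'])

# ndimage.sum groups a 1-D array by integer labels, so the string categories
# are mapped to integer codes and each period (row) is handled in its own call
cats, codes = np.unique(categories, return_inverse=True)
result = np.array([ndimage.sum(row, labels=codes, index=np.arange(len(cats)))
                   for row in values])
# gives the same (periods x unique categories) table as the sketch above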

I tried a solution with pandas.

I also tried a set of loops, optimized with numba.

My findings

[image: timing results]

PS: I know about the pitfalls of premature optimization and all that; I'm only looking into this because a lot of time is spent doing exactly this.

Code

import numpy as np
import pandas as pd
import time
import numba

periods = 300
n = int(2000)
categories = np.tile(['red','green','yellow','brown'],n)
my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
# my_arrays will have shape (periods x (n * number of categories))


#---- pandas
start = time.time()

df_categories = pd.DataFrame(data = categories).reset_index().rename(columns ={'index':'item',0:'category'})
df = pd.DataFrame(data = my_array)
unpiv = pd.melt(df.reset_index(), id_vars ='index', var_name ='item', value_name ='value').rename( columns = {'index':'time'})
unpiv = pd.merge(unpiv, df_categories, on='item' )
crosstab = pd.crosstab( unpiv['time'], unpiv['category'], values = unpiv['value'], aggfunc='sum' )

print("panda crosstab in:")
print(time.time() - start)
# yep, I know that timeit.timer would have been better, but I was in a hurry :)
print("")


#---- numba
@numba.jit(nopython = True, parallel = True, nogil = True)
def numba_classify(x, categories):
    cat_uniq = np.unique(categories)
    num_categories = len(cat_uniq)
    num_items = x.shape[1]
    periods = x.shape[0]
    categories_converted = np.zeros(len(categories), dtype = np.int32)
    out = np.zeros(( periods, num_categories))
    
    
    # before running the actual classification, I must convert the categories, which can be strings, to
    # the corresponding number in cat_uniq, e.g. if brown is the first category by alphabetical sorting, then
    # brown --> 0, etc
    
    for i in numba.prange(num_items):
        for c in range(num_categories):
            if categories[i] == cat_uniq[c]:
                categories_converted[i] = c
      
        
    # accumulate: parallelise over periods so each thread writes to its own
    # row of `out`, avoiding a race on the += when items share a category
    for p in numba.prange(periods):
        for i in range(num_items):
            out[p, categories_converted[i]] += x[p, i]


    return out

start = time.time()

numba_out = numba_classify(my_array, categories)
print("numba done in:")
print(time.time() - start)
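As a quick sanity check, the two results can be compared directly; a small check assuming crosstab and numba_out from the code above (pd.crosstab sorts its columns and np.unique sorts the categories, so the two tables should line up column for column):

assert np.allclose(crosstab.values, numba_out)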

Tags: python, pandas, dataframe, numba

Solution


You can get a big speed-up by using df.groupby(categories, axis=1).sum().

import numpy as np
import pandas as pd
import time


def make_data(periods, n):
    categories = np.tile(['red','green','yellow','brown'],n)
    my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
    
    return categories, pd.DataFrame(my_array)

for n in (200, 2000, 20000):
    categories, df = make_data(300, n)
    true_n = n * 4
    
    start = time.time()
    tabulation = df.groupby(categories, axis=1).sum()
    elapsed = time.time() - start
    
    print(f"300 x {true_n:5}: {elapsed:.3f} seconds")

# prints:
300 x   800: 0.005 seconds
300 x  8000: 0.021 seconds
300 x 80000: 0.673 seconds
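
Note that the axis=1 argument to groupby is deprecated in newer pandas releases; an equivalent that avoids it (a sketch assuming the same categories and df produced by make_data above) is to transpose, group the items, and transpose back:

# group the item columns by transposing first, then transpose the sums back
tabulation = df.T.groupby(categories).sum().T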
