pandas: generating a score based on multiple columns

Problem description

import random, string, time
import pandas as pd

random.seed(1)
toy_set = pd.DataFrame({'group': [str(i)+'_'+str(j) for i in range(40000) for j in range(25)],
                        'feature1': random.choices(string.ascii_letters, k = 1000000),
                        'feature2': random.choices(string.ascii_letters, k = 1000000),
                        'feature3': random.choices(range(10), k=1000000)
                        })

# Create a hypothetical nested scoring dict:
# eventScores[feature1][feature2][feature3] -> score
eventScores = {}
for k in toy_set.groupby(['feature1', 'feature2','feature3']).groups.keys():
    if k[0] not in eventScores:
        eventScores[k[0]] = {}
    if k[1] not in eventScores[k[0]]:
        eventScores[k[0]][k[1]] = {}
    eventScores[k[0]][k[1]][k[2]] = random.randint(1,10)   

def calc_x(subset):
    return subset.apply(lambda x: eventScores[x['feature1']][x['feature2']][x['feature3']],
                            axis =1)

t = time.time()
toy_set['x'] = calc_x(toy_set) 
print(round(time.time() - t))

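I have a df with 3 features, and based on these features I generate a score for each row (in this example the score for each combination is assigned at random, purely for illustration).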

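Is there a faster way to generate x than doing nested dict lookups row by row? (This toy set currently takes about 30 seconds on my Windows 10 i7 machine, and the real data set is 15x larger.)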

Tags: python, pandas

Solution


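Try using a dict comprehension to flatten eventScores into a single-level dict keyed by the concatenated feature values, then use Series.map on the concatenated feature columns: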

d_map = {f"{k1}_{k2}_{k3}":v3 for k1, v1 in eventScores.items() for k2, v2 in v1.items() for k3, v3 in v2.items()}

toy_set['x'] = (toy_set['feature1'].astype(str) + '_' + 
                toy_set['feature2'].astype(str) + '_' + 
                toy_set['feature3'].astype(str)).map(d_map)
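
A variant of the same idea, offered only as a sketch: instead of building concatenated string keys (which could become ambiguous if a feature value ever contained `_`), the nested dict can be flattened into a small lookup table and left-joined onto the frame with `DataFrame.merge`. The miniature `event_scores` and `df` below are made-up stand-ins for `eventScores` and `toy_set`, and the sketch assumes the dict is always exactly three levels deep.

```python
import pandas as pd

# Hypothetical miniature stand-ins for eventScores and toy_set.
event_scores = {
    "a": {"b": {1: 7, 2: 3}},
    "c": {"d": {1: 5}},
}
df = pd.DataFrame({
    "feature1": ["a", "c", "a"],
    "feature2": ["b", "d", "b"],
    "feature3": [2, 1, 1],
})

# Flatten the nested dict into one row per (feature1, feature2, feature3) key.
lookup = pd.DataFrame(
    [
        (k1, k2, k3, score)
        for k1, level2 in event_scores.items()
        for k2, level3 in level2.items()
        for k3, score in level3.items()
    ],
    columns=["feature1", "feature2", "feature3", "x"],
)

# A left join keeps the row order of df; keys missing from lookup would get NaN.
scored = df.merge(lookup, on=["feature1", "feature2", "feature3"], how="left")
print(scored["x"].tolist())  # [3, 5, 7]
```

Note that `merge` returns a frame with a fresh integer index, so when the original frame has a non-default index it may be safer to assign the result back with `scored["x"].to_numpy()`.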

Timings

# This method
898 ms ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Original method
25.3 s ± 497 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
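
The figures above look like IPython `%timeit` output. Outside IPython, a rough comparison could be sketched with `time.perf_counter` as below. The sample size `n`, the helper `build_keys`, and the flat `d_map` built directly from the observed keys are assumptions made for this illustration, and the absolute timings will differ from machine to machine.

```python
import random
import string
import time

import pandas as pd

random.seed(1)
n = 20_000  # assumed sample size, far smaller than the 1,000,000 rows above
df = pd.DataFrame({
    "feature1": random.choices(string.ascii_letters, k=n),
    "feature2": random.choices(string.ascii_letters, k=n),
    "feature3": random.choices(range(10), k=n),
})


def build_keys(frame):
    """Concatenate the three feature columns into one string key per row."""
    return (frame["feature1"] + "_" + frame["feature2"] + "_"
            + frame["feature3"].astype(str))


# Random score for every key combination that occurs in the sample.
d_map = {key: random.randint(1, 10) for key in build_keys(df).unique()}

# Row-wise lookup with DataFrame.apply (the slow pattern from the question).
start = time.perf_counter()
slow = df.apply(
    lambda row: d_map[f"{row['feature1']}_{row['feature2']}_{row['feature3']}"],
    axis=1,
)
apply_seconds = time.perf_counter() - start

# Vectorised key construction followed by Series.map.
start = time.perf_counter()
fast = build_keys(df).map(d_map)
map_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.3f}s, map: {map_seconds:.3f}s")
print("same scores:", bool((slow == fast).all()))
```

Checking that both approaches return the same scores before comparing their speed guards against a faster but subtly different lookup.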
