首页 > 解决方案 > 使用 apply() 处理 Pandas 行非常慢

问题描述

我有两个 DataFrame,df1(500 万行)和 df2(大约 150 行)。我想查找 df2 并根据 df2 中的数据更新 df1。基本上我有几个条件需要以 DataFrame df1 的形式应用于 df1。如果 df2 中的条件在 df1 中满足,则设置一个标志。

data1 = {'type':['1','2','3'],
         'code':['A','B','C'],
         'type_2':['A1', 'B1', 'C1'],
         'num1': ['101','102', '103'],
         'num2': ['','',''],
         'p_cd':['AA', 'BB', 'CC']
        }

df1 = pd.DataFrame(data1)

data2 = {'type':['1','2'],
         'code':['A','B'],
         'type_2':['', 'B1'],
         'num1': ['','102'],
         'num2': ['',''],
         'custom_expression':["p_cd=='AA'", ''],
         'delete_flag':['Y', 'Y']
        }

df2 = pd.DataFrame(data2)

df_cols = df1.columns.tolist()

def delete_flag(row_from_outlier):
    for index, row_from_delete in  df2.iterrows():
        if row_from_delete['custom_expression'].strip():
            for col in df_cols:
                if row_from_delete['custom_expression'].find(col) != -1:
                    print(row_from_delete['custom_expression'])
                    cust_exp = row_from_delete['custom_expression'].replace(col, "row_from_outlier['"+col+"']" )
        else:
            cust_exp = '1==1'
            print(cust_exp)

        if (
                ( (not row_from_delete['type'].strip()  ) or (row_from_delete['type'] ==row_from_outlier['type'] )  )
            and ( (not row_from_delete['code'].strip()  ) or (row_from_delete['code'] == row_from_outlier['code'] )               )
            and ( (not row_from_delete['type_2'].strip()) or (row_from_delete['type_2'] == row_from_outlier['type_2'] ) )
            and ( (not row_from_delete['num1'].strip()  ) or (row_from_delete['num1'] == row_from_outlier['num1'] ) )
            and ( (not row_from_delete['num2'].strip()  ) or (row_from_delete['num2'] == row_from_outlier['num2'] ) )
            and eval(cust_exp)
        ) :
            del_flg = row_from_delete['delete_flag']
    if not 'del_flg' in locals():
        del_flg = 'N/A'
    return del_flg

df1['delete_flag'] = df1.apply(lambda row:delete_flag(row), axis=1)

上面的代码在测试一小组数据时运行良好,但速度不足以处理 500 万条记录。

发现这个Cython、Numba 和 pandas.eval()建议可能有效,但我对 Python 还是很陌生,不知道如何将上述代码转换为使用 CYTHON

标签: pythonpandaslambdacython

解决方案


一些可以帮助您加快速度的观察结果。

你反复做很多事情:

剥离内部delete_flag

只需在开始时在函数之外执行一次

for col in df1.columns:
    df1[col] = df1[col].str.strip()
for col in df2.columns:
    df2[col] = df2[col].str.strip()

建筑cust_exp

这部分功能

    ...
    for index, row_from_delete in df2.iterrows():
        if row_from_delete['custom_expression'].strip():
            for col in df_cols:
                if row_from_delete['custom_expression'].find(col) != -1:
                    cust_exp = row_from_delete['custom_expression'].replace(col, "row_from_outlier['"+col+"']" )
        else:
            cust_exp = '1==1'
    ...

实际上独立于函数的参数row_from_outlier!所以你cust_exp一遍又一遍地构建相同的 s 。这意味着你应该只做一次,在 function 之外。为此,您可以尝试

df2['cust_exp'] = df2.custom_expression.str.replace(
                        '|'.join(str(col) for col in df1.columns),
                        lambda m: 'row_from_outlier["' + m.group(0) + '"]',
                        regex=True
                    )
df2.cust_exp[df2.cust_exp == ''] = 'True'

将结果附加到df2( print(df2.cust_exp)):

0    row_from_outlier["p_cd"]=='AA'
1                              True

要进一步“去循环”该功能,您可以执行以下操作:

from functools import reduce
from operator import and_

def delete_flag(row_from_outlier):
    scope = locals()
    ser = reduce(and_, [df2[col].isin(['', row_from_outlier[col]])
                        for col in df2.columns[:5]]
                       + [df2.cust_exp.apply(lambda s: eval(s, scope))])
    if ser.any():
        return df2.delete_flag[ser].to_list()[-1]
    return 'N/A'

在这里我用过这个

( (not row_from_delete['type'].strip()  )
  or (row_from_delete['type'] == row_from_outlier['type'] ) )

相当于

row_from_delete['type'].isin(['', row_from_outlier['type']])

并且该操作可以按列进行

df2['type'].isin(['', row_from_outlier['type']])

并且and/&列的聚合是由 完成的reduce(and_, ...)

最后但并非最不重要的一个问题:在我看来,您的代码中有del_flg可能被多次分配,并且只有最后一个被使用?这似乎很奇怪。是有意的吗?


推荐阅读