Pandas: vectorizing the application of function objects across a whole DataFrame

Problem description

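I have a base DataFrame, and I want to apply to each of its elements a specific function given by another DataFrame with the same index and columns, for example: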

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]])
df_format = pd.DataFrame([[lambda x: f'{x:.1f}', lambda x: f'{x:.2f}'], 
                          [lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']])

for i in range(2):
    for j in range(2):
        df.iloc[i,j] = df_format.iloc[i,j](df.iloc[i,j])
        
print(df)
       0       1
0    1.0    2.00
1  3.000  4.0000

This is not vectorized, and I would like to know whether there is a more efficient way to do it, especially for larger DataFrames.

Tags: pandas, performance

Solution


You can make the code much faster by working on whole columns and using NumPy's vectorize function. Direct element access to a Pandas DataFrame (using iloc) or to the underlying NumPy arrays (using arr[i]) is slow, and Python loops are slow too. Moreover, data is stored by column internally, which makes column-wise operations faster than row-wise ones. Here is a way to vectorize your operation:

import numpy as np

def callOn(func, value):
    return func(value)

for j in range(2):
    # np.vectorize(callOn) builds a function that calls callOn(x, y) for
    # each input pair (x, y) of zip(df_format[j], df[j]).
    df[j] = np.vectorize(callOn)(df_format[j], df[j])

Note, however, that NumPy does not truly vectorize the calls internally, since it is dealing with Python objects and functions. This limitation comes inherently from the assumption that every lambda could be different and is defined as a plain Python object.
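As a minimal, self-contained sketch of the approach above, the following wraps the column-wise call in a helper that returns a new DataFrame instead of overwriting the input. The names `call_on` and `apply_formats` are made up for illustration and are not part of pandas or NumPy. Passing `otypes=[object]` is an assumption of this sketch: it tells NumPy the output type up front so that it does not need a trial call on the first element to infer it.

```python
import numpy as np
import pandas as pd


def call_on(func, value):
    # Call one formatter on one value.
    return func(value)


def apply_formats(df, df_format):
    # Hypothetical helper: for every column, pair each formatter in df_format
    # with the value at the same position in df and call it.
    apply_pairwise = np.vectorize(call_on, otypes=[object])
    columns = {
        col: apply_pairwise(df_format[col].to_numpy(), df[col].to_numpy())
        for col in df.columns
    }
    return pd.DataFrame(columns, index=df.index)


df = pd.DataFrame([[1, 2], [3, 4]])
df_format = pd.DataFrame([[lambda x: f'{x:.1f}', lambda x: f'{x:.2f}'],
                          [lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']])
formatted = apply_formats(df, df_format)
print(formatted)
```

Returning a new frame keeps the numeric input intact, which can be convenient if the unformatted values are still needed afterwards.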

On my machine, this code is about 200 times faster than the original one, using the following setup:

nRows, nCols = 1000, 20
fList = [lambda x: f'{x:.1f}', lambda x: f'{x:.2f}', lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']
df = pd.DataFrame(np.random.randint(10, size=(nRows, nCols)))
df_format = pd.DataFrame(np.random.choice(fList, size=(nRows, nCols)))
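To reproduce a comparison like this on your own machine, a timing harness along the following lines may help. It is only a sketch: the function names (`make_data`, `loop_version`, `vectorized_version`, `compare`) are invented here, the data is generated with a seeded NumPy generator rather than the exact calls in the setup above, and the loop writes into an object-dtype copy so that strings are never stored in an integer column. The measured ratio will depend on the machine and on the pandas and NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd


def make_data(n_rows, n_cols, seed=0):
    # Random integers in [0, 10) plus a same-shaped table of randomly chosen formatters.
    rng = np.random.default_rng(seed)
    formatters = [lambda x: f'{x:.1f}', lambda x: f'{x:.2f}',
                  lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']
    df = pd.DataFrame(rng.integers(10, size=(n_rows, n_cols)))
    picks = rng.integers(len(formatters), size=(n_rows, n_cols))
    df_format = pd.DataFrame([[formatters[k] for k in row] for row in picks])
    return df, df_format


def loop_version(df, df_format):
    # Element-by-element loop from the question, writing into an object-dtype copy.
    out = df.astype(object)
    for i in range(df.shape[0]):
        for j in range(df.shape[1]):
            out.iloc[i, j] = df_format.iloc[i, j](df.iloc[i, j])
    return out


def vectorized_version(df, df_format):
    # Column-wise np.vectorize, following the answer.
    call = np.vectorize(lambda func, value: func(value), otypes=[object])
    columns = {j: call(df_format[j].to_numpy(), df[j].to_numpy()) for j in df.columns}
    return pd.DataFrame(columns, index=df.index)


def compare(n_rows=1000, n_cols=20, repeats=3):
    # Return the best-of-`repeats` wall-clock time in seconds for each version.
    df, df_format = make_data(n_rows, n_cols)
    t_loop = min(timeit.repeat(lambda: loop_version(df, df_format),
                               number=1, repeat=repeats))
    t_vec = min(timeit.repeat(lambda: vectorized_version(df, df_format),
                              number=1, repeat=repeats))
    return t_loop, t_vec
```

Calling `compare()` returns the two timings, so the speedup can be read off as `t_loop / t_vec`. Smaller arguments such as `compare(100, 5)` give a quicker sanity check.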
