首页 > 解决方案 > 在 Pandas Dataframe 中用更快的替代方法替换大型数据集多条件循环

问题描述

我正在尝试在 Dataframe 上执行嵌套循环,但遇到了严重的速度问题。本质上,我有一个唯一值列表,我想通过它循环,所有这些都需要在四个不同的列上进行迭代。代码如下所示:

def get_avg_val(temp_df, col):
    temp_df = temp_df.replace(0, np.NaN)
    avg_val = temp_df[col].mean()

    return (0 if math.isnan(avg_val) else avg_val)

Final_df = pd.DataFrame(rows_list, columns=col_names)

""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()

""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']

for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]    

    for col in col_list:
        amended_val = get_avg_val (temp_df, col)

        """ The below identifies columns where Unique code is and there is an NaN - via mask; afterwards np.where replaces the value in the cell with the amended value"""
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])

“掩码”部分指定何时在数据帧中满足两个条件,并且 np.where 替换用修正值标识的单元格中的值(它本身是一个执行平均值的函数)。

现在这通常可以工作,但是有超过 400k 行和十几列,速度真的很慢。有什么推荐的方法来改进这两个“For..”吗?因为我相信这些是代码需要一些时间的原因。

谢谢大家!

标签: pythonpandasnumpydataframefor-loop

解决方案


更新 - 找到了一种通过字典执行修改的替代方法,任务现在需要 1.5 分钟而不是 35 分钟。

代码如下。这里的不同方法允许将 DataFrame 过滤成更小的数据帧,在这些数据帧上执行一系列操作。这次新数据随后被存储到字典中,并通过循环向其添加更多数据。最后,字典被传输回初始 DataFrame,完全用更新的数据集替换它。

""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)

""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()

""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']

""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}

for unique_val in unique_list:
    condition = Final_df['Group_SecCode'].isin([unique_val])
    temp_df = Final_df[condition].replace(0, np.NaN)

    for col in col_list:
        """ Perform Amendments at Filtered Dataframe - by column """
        """ 1. Replace NaN values with Median for the Datapoints encountered """
        #amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
        #mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        #Final_df[col] = np.where(mask, amended_val, Final_df[col])
        amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
        mask = temp_df[col].isnull()
        temp_df[col] = np.where(mask, amended_val, temp_df[col])

        """ 2. Perform Validation Checks via Function defined on line 36 """
        temp_df = val_checks (temp_df,col)

    """ Updates Dictionary with updated data at Unique Value level """
    final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe

""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)

推荐阅读