python - 在 Pandas Dataframe 中用更快的替代方法替换大型数据集多条件循环
问题描述
我正在尝试在 Dataframe 上执行嵌套循环,但遇到了严重的速度问题。本质上,我有一个唯一值列表,我想通过它循环,所有这些都需要在四个不同的列上进行迭代。代码如下所示:
def get_avg_val(temp_df, col):
temp_df = temp_df.replace(0, np.NaN)
avg_val = temp_df[col].mean()
return (0 if math.isnan(avg_val) else avg_val)
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
for unique_val in unique_list:
temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
for col in col_list:
amended_val = get_avg_val (temp_df, col)
""" The below identifies columns where Unique code is and there is an NaN - via mask; afterwards np.where replaces the value in the cell with the amended value"""
mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
Final_df[col] = np.where(mask, amended_val, Final_df[col])
“掩码”部分指定何时在数据帧中满足两个条件,并且 np.where 替换用修正值标识的单元格中的值(它本身是一个执行平均值的函数)。
现在这通常可以工作,但是有超过 400k 行和十几列,速度真的很慢。有什么推荐的方法来改进这两个“For..”吗?因为我相信这些是代码需要一些时间的原因。
谢谢大家!
解决方案
更新 - 找到了一种通过字典执行修改的替代方法,任务现在需要 1.5 分钟而不是 35 分钟。
代码如下。这里的不同方法允许将 DataFrame 过滤成更小的数据帧,在这些数据帧上执行一系列操作。这次新数据随后被存储到字典中,并通过循环向其添加更多数据。最后,字典被传输回初始 DataFrame,完全用更新的数据集替换它。
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
condition = Final_df['Group_SecCode'].isin([unique_val])
temp_df = Final_df[condition].replace(0, np.NaN)
for col in col_list:
""" Perform Amendments at Filtered Dataframe - by column """
""" 1. Replace NaN values with Median for the Datapoints encountered """
#amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
#mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
#Final_df[col] = np.where(mask, amended_val, Final_df[col])
amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
mask = temp_df[col].isnull()
temp_df[col] = np.where(mask, amended_val, temp_df[col])
""" 2. Perform Validation Checks via Function defined on line 36 """
temp_df = val_checks (temp_df,col)
""" Updates Dictionary with updated data at Unique Value level """
final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)
推荐阅读
- java - java util将cron表达式从一个时区转换为另一个时区
- javascript - 在对象数组中的对象内部过滤对象内部的数组
- reactjs - 在单独的文件中具有 React 上下文,无法让组件不重新渲染
- c# - 编译为单个 .exe 时如何保留某些文件?
- java - 如何在 auditAware 中更新身份验证
- node.js - 使用在 localhost 上运行的 API 从 React 应用程序发出 http 请求
- entity-framework-core - EF 核心请求无法唤醒 Azure Sql(无服务器 sku)数据库并超时
- laravel-blade - 如何在livewire中获取菜单活动元素的应用程序的当前url
- python - 无法使用 anaconda navigator 在我的 Windows10 系统上打开 jupyter notebook
- nestjs - NestJs 管道与过滤器