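python-3.x - Normalize column values by monthly mean with an added group dimension
Problem description
Initial note
I already have this running, but it takes a very long time to execute. My DataFrame is roughly 500 MB. I would like feedback on how to make this run as fast as possible.
Problem statement
I want to normalize the DataFrame columns by the mean of each column's values within each month. An added complication is that I have a column named group, which identifies the different sensors that measured the parameters (the columns). The analysis therefore needs to iterate over each group and each month.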
Example DF
                     X  Y  Z  group
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
...
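Code (functional, but slow)
Here is the code I am using. The code comments describe most of the lines. I recognize that the three for loops are what cause the runtime problem, but I do not see a way around them. Does anyone know of one?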
# Get mean monthly values for each group
# (the how= argument of resample was removed from pandas; chain .mean() instead)
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
Edit: sample data
Here is the data in CSV format.
EDIT2: current code (based on the current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
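The code runs perfectly for one iteration, then fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it then iterates over the same row again and fails. According to the df.apply() documentation, the first row is always run twice. I am just not sure why it fails the second time.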
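A possible explanation, offered as an assumption because the full traceback is not shown: normalize_cols is built after month and year have been added to df, so it still contains those two names. Both are grouping keys, so they end up in the index of agg rather than in its columns, and a lookup such as agg.loc[..., 'month'] would raise KeyError: 'month'. The sketch below reproduces that on a small made-up frame shaped like the example DF and shows one way to build the column list. The names missing and safe_cols are illustrative only.

```python
import pandas as pd

# Small made-up frame shaped like the example DF (values are illustrative only).
idx = pd.to_datetime(["2019-02-01 09:30:07",
                      "2019-02-01 09:30:23",
                      "2019-02-01 09:30:38"])
df = pd.DataFrame({"X": [1, 2, 3], "Y": [2, 4, 6], "Z": [1, 3, 5],
                   "group": ["grp1", "grp2", "grp1"]}, index=idx)
df["month"] = df.index.month
df["year"] = df.index.year

agg = df.groupby(by=["group", "month", "year"], as_index=True).mean()

# Built the same way as in EDIT2: 'month' and 'year' are still in the list.
normalize_cols = df.columns.tolist()
normalize_cols.remove("group")

# The grouping keys are index levels of agg, not columns,
# so these names cannot be looked up as columns of agg.
missing = [c for c in normalize_cols if c not in agg.columns]

# One way to avoid the failing lookup: drop the helper columns as well.
safe_cols = [c for c in normalize_cols if c not in ("month", "year")]
```

If this is indeed the cause, passing safe_cols instead of normalize_cols to df.apply should avoid the error. Computing agg once outside find_norm, rather than on every row, would also remove a large amount of repeated work.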
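Solution
Assuming the requirement is to normalize the columns by their mean per group and per month, here is another approach:
- Create new columns from the index: month and year. df.index.month can be used for this, provided the index is of type DatetimeIndex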
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
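- Now group by (grp, month, year) and aggregate to find the mean of every column. (Year is included on the assumption that grouping is per grp, per month, per year. If year is not to be considered, this column is not needed.)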
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
- Use a function to calculate the normalized values and apply() it to the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x

df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
                     A  B  C  grp  month  year     normA  normB  normC
2019-02-01 09:30:07  1  2  3    1      2  2019  0.666667    0.8    1.5
2019-03-02 09:30:07  2  3  4    1      3  2019  1.000000    1.0    1.0
2019-02-01 09:40:07  2  3  1    2      2  2019  1.000000    1.0    1.0
2019-02-01 09:38:07  2  3  1    1      2  2019  1.333333    1.2    0.5
Alternatively, for step 3, one can join the agg and df dataframes and compute the norm from the joined result. Hope this helps!
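The join alternative is only described in words, so here is a minimal sketch of how it might look. It rebuilds the four example rows from the df2 table, and the names keys, value_cols, joined and the _mean suffix are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

# The four example rows from the df2 table, rebuilt by hand.
idx = pd.to_datetime(["2019-02-01 09:30:07", "2019-03-02 09:30:07",
                      "2019-02-01 09:40:07", "2019-02-01 09:38:07"])
df = pd.DataFrame({"A": [1, 2, 2, 2], "B": [2, 3, 3, 3], "C": [3, 4, 1, 1],
                   "grp": [1, 1, 2, 1]}, index=idx)
df["month"] = df.index.month
df["year"] = df.index.year

keys = ["grp", "month", "year"]   # grouping keys (illustrative name)
value_cols = ["A", "B", "C"]      # columns to normalize (illustrative name)

# Step 2 of the answer: one mean per (grp, month, year) and column.
agg = df.groupby(by=keys, as_index=True)[value_cols].mean()

# Join the means back onto every row; the key columns of df are matched
# against the MultiIndex of agg, and the mean columns get a "_mean" suffix.
joined = df.join(agg, on=keys, rsuffix="_mean")

# Divide each value column by its matching mean column in one vectorized step.
for col in value_cols:
    joined["norm" + col] = joined[col] / joined[col + "_mean"]

df2 = joined.drop(columns=[col + "_mean" for col in value_cols])
```

Because the division runs on whole columns instead of row by row, this form avoids calling a Python function once per row, which is likely to matter on a 500 MB frame.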
Here is what the code looks like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x

df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
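Since the question is mainly about speed, it may be worth noting that a row-wise apply still runs Python code once per row. The sketch below is not the method used in this answer. It swaps in pandas' groupby(...).transform('mean'), which returns the group mean aligned to every original row, so the division becomes a single vectorized operation. The data is the four-row example from the df2 table, and the names value_cols, means and normed are illustrative.

```python
import pandas as pd

# The four example rows from the df2 table, rebuilt by hand.
idx = pd.to_datetime(["2019-02-01 09:30:07", "2019-03-02 09:30:07",
                      "2019-02-01 09:40:07", "2019-02-01 09:38:07"])
df = pd.DataFrame({"A": [1, 2, 2, 2], "B": [2, 3, 3, 3], "C": [3, 4, 1, 1],
                   "grp": [1, 1, 2, 1]}, index=idx)

value_cols = ["A", "B", "C"]  # columns to normalize (illustrative name)

# Group by sensor, year and month directly from the DatetimeIndex,
# so no helper columns need to be added to df.
grouper = [df["grp"], df.index.year, df.index.month]

# transform("mean") yields a frame with the same shape and index as
# df[value_cols], where each cell holds the mean of its group.
means = df.groupby(grouper)[value_cols].transform("mean")

# Element-wise division, then prefix the result columns with "norm".
normed = (df[value_cols] / means).add_prefix("norm")

df2 = pd.concat([df, normed], axis=1)
```

On this small example the normA, normB and normC columns match the values in the df2 table. How much faster it is on the real data would need to be measured.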