Normalize column values by monthly mean with an added group dimension

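Problem description

Initial note

I already have this running, but it takes a long time to execute. My DataFrame is about 500MB. I would like feedback on how to make this run as fast as possible.

Problem statement

I want to normalize the columns of a DataFrame by the mean of each column's values within each month. An added complication is that I have a column named group, which identifies the different sensors on which the parameters (columns) were measured. The analysis therefore needs to iterate over each group within each month.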

DF example

                     X  Y  Z  group 
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
                ...

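Code (functional, but slow)

This is the code I used. The code comments describe most of the lines. I recognize that the three for loops are causing the runtime problem, but I don't see a way around them. Does anyone know of one?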

    # Get mean monthly values for each group
    mean_per_month_unit = process_df.groupby('group').resample('M').mean()  # resample(how=...) was removed from pandas
    # Store the monthly dates created in last line into a list called month_dates
    month_dates = mean_per_month_unit.index.get_level_values(1)
    # Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
    mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)

    divide_df = pd.DataFrame().reindex_like(df)
    process_cols.remove('group')
    for grp in group_list:
        print(grp)
        # Iterate through month
        for mnth in month_dates:
            # Make mask where month and group
            mask = (df.index.month == mnth.month) & (df['group'] == grp)
            for col in process_cols:
                # Set values of divide_df 
                divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
    # Divide process_df with divide_df
    final_df = process_df / divide_df.values

Edit: sample data

Here is the data in CSV format

EDIT2: current code (based on the current answer)

def normalize_df(df):

    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col # norm

        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)

The code runs perfectly for one iteration, then fails with the error:

KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')

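As I said, it runs correctly once. However, it then iterates over the same row again and fails. According to the df.apply() documentation, I see that the first row is always run twice. I'm just not sure why it fails the second time.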
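One possible contributor to this KeyError, offered as a sketch rather than a confirmed diagnosis: in the EDIT2 code, normalize_cols is built from df.columns after month and year have been added, so it still contains 'month' and 'year'. Those two names are grouping keys, which means they end up in the index of agg and not in its columns, and looking either of them up as a column of agg raises a KeyError. The three-row frame below is a cut-down stand-in for the real data, and the name raised is illustrative.

```python
import pandas as pd

# Cut-down stand-in for the question's data (assumption: one value column is enough)
df = pd.DataFrame(
    {"X": [1, 2, 3], "group": ["grp1", "grp2", "grp1"]},
    index=pd.to_datetime(
        ["2019-02-01 09:30:07", "2019-02-01 09:30:23", "2019-02-01 09:30:38"]
    ),
)
df["month"] = df.index.month
df["year"] = df.index.year

# Same aggregation as in EDIT2: the grouping keys become index levels of agg
agg = df.groupby(by=["group", "month", "year"], as_index=True).mean()
print(list(agg.columns))

# Same column list as in EDIT2: 'month' and 'year' are still in it
normalize_cols = df.columns.tolist()
normalize_cols.remove("group")
print(normalize_cols)

# Looking up 'month' as a column of agg fails, because it is an index level there
raised = False
try:
    agg.loc[("grp1", 2, 2019), "month"]
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

If this is what is happening, removing 'month' and 'year' from normalize_cols as well as 'group' would avoid that lookup.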

Tags: python-3.x, pandas, pandas-groupby

Solution

Assuming the requirement is to normalize the columns by their mean within each group and month, here is another approach:

  1. Create new columns, month and year, from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex
    type(df.index) # df is the original dataframe
    #pandas.core.indexes.datetimes.DatetimeIndex

    df['month'] = df.index.month
    df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
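  2. Now group by (grp, month, year) and aggregate to find the mean of each column. (Year is added on the assumption that the grouping is per grp, per month, per year. If year is not to be considered, there is no need to add this column.)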
    agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
  3. Use a function to compute the normalized values and apply() it to the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize

    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean. 
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm

    return x

df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC 
df2:

                        A   B   C   grp month year  normA     normB     normC
2019-02-01 09:30:07     1   2   3   1   2   2019    0.666667    0.8     1.5
2019-03-02 09:30:07     2   3   4   1   3   2019    1.000000    1.0     1.0
2019-02-01 09:40:07     2   3   1   2   2   2019    1.000000    1.0     1.0
2019-02-01 09:38:07     2   3   1   1   2   2019    1.333333    1.2     0.5

Alternatively, for step 3, you can join the agg and df dataframes and compute the normalized values from the result. Hope this helps!
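A minimal sketch of that join-based variant, assuming the four-row frame shown in the df2 output with grp stored as strings. The "mean" prefix on the aggregated columns and the names keys, value_cols, and joined are illustrative choices, not part of the answer's code.

```python
import pandas as pd

# Sample frame rebuilt from the example output (assumption: grp holds strings)
df = pd.DataFrame(
    {
        "A": [1, 2, 2, 2],
        "B": [2, 3, 3, 3],
        "C": [3, 4, 1, 1],
        "grp": ["1", "1", "2", "1"],
    },
    index=pd.to_datetime(
        [
            "2019-02-01 09:30:07",
            "2019-03-02 09:30:07",
            "2019-02-01 09:40:07",
            "2019-02-01 09:38:07",
        ]
    ),
)

# Step 1: month and year columns from the DatetimeIndex
df["month"] = df.index.month
df["year"] = df.index.year

keys = ["grp", "month", "year"]
value_cols = ["A", "B", "C"]

# Step 2: one row of means per (grp, month, year)
agg = df.groupby(keys)[value_cols].mean()

# Step 3 via join: attach each row's group means, then divide whole columns at once
joined = df.join(agg.add_prefix("mean"), on=keys)
for col in value_cols:
    joined["norm" + col] = joined[col] / joined["mean" + col]

# Drop the helper mean columns so the result has the same layout as df2
df2 = joined.drop(columns=["mean" + col for col in value_cols])
print(df2)
```

Because the division runs on whole columns, this avoids calling a Python function once per row, which is likely to matter on a frame of several hundred megabytes.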

Here is what the full code looks like:


# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs 

# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()

# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize

    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean. 
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm

    return x

df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
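Since the question asks for speed on a frame of roughly 500MB, a vectorized variant may be worth trying. The sketch below swaps the row-wise apply() for groupby(...).transform("mean"), which is a different technique from the one in the answer above. The function name normalize_by_monthly_mean is hypothetical, the sample rows are taken from the question's DF example, and the column is assumed to be named group as in the question.

```python
import pandas as pd


def normalize_by_monthly_mean(df, group_col="group"):
    """Divide each value column by its mean within (group, year, month)."""
    value_cols = [col for col in df.columns if col != group_col]
    # Group on the sensor column plus year and month taken from the DatetimeIndex
    keys = [group_col, df.index.year, df.index.month]
    # transform("mean") returns a frame aligned with df[value_cols] that holds
    # each row's group mean, so the division is one vectorized operation
    means = df.groupby(keys)[value_cols].transform("mean")
    return df[value_cols] / means


# Rows from the question's DF example
sample = pd.DataFrame(
    {
        "X": [1, 2, 3],
        "Y": [2, 4, 6],
        "Z": [1, 3, 5],
        "group": ["grp1", "grp2", "grp1"],
    },
    index=pd.to_datetime(
        ["2019-02-01 09:30:07", "2019-02-01 09:30:23", "2019-02-01 09:30:38"]
    ),
)

normalized = normalize_by_monthly_mean(sample)
print(normalized)
```

The result contains only the normalized value columns. It shares the input's index, so it can be joined back onto the original frame if the group column is needed alongside.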
