Groupby and normalize selected columns in a pandas DataFrame

Problem description

I have a sample DataFrame that I want to normalize based on two conditions.

Create the sample DataFrame:

import numpy as np
import pandas as pd

# 10 rows of random integers in [1, 20) for columns A, B, C
sample_df = pd.DataFrame(np.random.randint(1, 20, size=(10, 3)), columns=list("ABC"))
sample_df["date"] = ["2020-02-01"] * 5 + ["2020-02-02"] * 5
sample_df["date"] = pd.to_datetime(sample_df["date"])
# use the date column as the index
sample_df = sample_df.set_index("date")
sample_df["A_cat"] = ["ind", "sa", "sa", "sa", "ind", "ind", "sa", "sa", "ind", "sa"]
sample_df["B_cat"] = ["sa", "ind", "ind", "sa", "sa", "sa", "ind", "sa", "ind", "sa"]
print(sample_df)

Output:

             A   B   C  A_cat  B_cat
date
2020-02-01  14  11   7    ind     sa
2020-02-01  19  17   3     sa    ind
2020-02-01  19   6   3     sa    ind
2020-02-01   3  16   5     sa     sa
2020-02-01  12   6  16    ind     sa
2020-02-02   1   8  12    ind     sa
2020-02-02  10  13  19     sa    ind
2020-02-02  17   2   7     sa     sa
2020-02-02   9  13  17    ind    ind
2020-02-02  17  16   3     sa     sa

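Normalization conditions:

1. Group by the index, and
2. Normalize the selected columns

For example, if the selected columns are ["A","B"], it should first group by the index value 2020-02-01 and normalize the selected columns across that group's 5 rows.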

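Other input:

selected_columns = ["A","B"]

I can do this with a for loop by iterating over the groups and concatenating the normalized values, so any suggestion for a more efficient, pandas-based approach would be great.

Code I tried with pandas: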

from sklearn.preprocessing import StandardScaler
dfg = StandardScaler()
sample_df.groupby([sample_df.index.get_level_values(0)])[selected_columns].transform(dfg.fit_transform)     

Error:

('Expected 2D array, got 1D array instead:\narray=[14. 19. 19.  3. 12.].\nReshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.', 'occurred at index A')

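Tags: python, pandas, scikit-learn, pandas-groupby, sklearn-pandas

Solution


I hope I understood the question correctly. Do you just want to group by the index, select values from A and B, and compute percentages?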

    sample_df.reset_index(inplace=True)
    sample_df['date']=pd.to_datetime(sample_df['date'])
    sample_df.set_index('date', inplace=True)
    df2=sample_df[(sample_df['A']>10)&(sample_df['B']>5)]
    df2.groupby(df2.index.month)['A_cat'].value_counts(normalize=True)
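
To make the last line concrete, here is a minimal sketch of what value_counts(normalize=True) returns per group. The frame `df` and its values are made up for illustration (they are not the question's data), and a second month is included so that grouping by month produces more than one group.

```python
import pandas as pd

# Hypothetical data: four rows in February, two in March.
dates = pd.to_datetime(["2020-02-01"] * 4 + ["2020-03-01"] * 2)
df = pd.DataFrame(
    {"A_cat": ["ind", "sa", "sa", "sa", "ind", "sa"]},
    index=pd.Index(dates, name="date"),
)

# For each month, the share of rows falling in each A_cat category.
# The result is a Series indexed by (month, category).
shares = df.groupby(df.index.month)["A_cat"].value_counts(normalize=True)
print(shares)
```

Note that in the question's sample data both dates fall in February, so grouping by df2.index.month pools them into a single group; grouping by the index itself would keep the two dates apart.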

If you want all the other columns besides A and B, try:

df2.groupby(df2.index.month).agg({i:'value_counts' for i in df2.columns[2:]}).groupby(level=0).transform(lambda x: x.div(x.sum()))

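Alternatively, after filtering on A and B into a DataFrame, drop columns A and B and apply pd.Series.value_counts:

# reassign instead of inplace=True, since df2 is a filtered subset of sample_df
df2 = df2.drop(columns=['A', 'B'])
df2.apply(pd.Series.value_counts).transform(lambda x: x.div(x.sum()))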
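
If the goal is instead the z-score standardization that the StandardScaler attempt in the question suggests, the error message there shows the scaler receiving a single column as a 1-D array, which it rejects. One way around this, sketched below under assumptions, is to skip the scaler and compute the z-score directly inside transform. The helper name `zscore` and the small frame `df` are illustrative, not from the original post; ddof=0 is chosen because StandardScaler uses the population standard deviation.

```python
import numpy as np
import pandas as pd

# Small deterministic frame shaped like sample_df (values are illustrative).
dates = pd.to_datetime(["2020-02-01"] * 3 + ["2020-02-02"] * 3)
df = pd.DataFrame(
    {
        "A": [14, 19, 3, 1, 10, 17],
        "B": [11, 17, 16, 8, 13, 2],
        "C": [7, 3, 5, 12, 19, 7],
    },
    index=pd.Index(dates, name="date"),
)
selected_columns = ["A", "B"]


def zscore(values):
    # Hypothetical helper: subtract the mean and divide by the
    # population standard deviation (ddof=0).
    return (values - values.mean()) / values.std(ddof=0)


# zscore is evaluated within each date group; the result has the same
# row order as df, so it can be written back positionally.
normalized = df.copy()
normalized[selected_columns] = (
    df.groupby(level=0)[selected_columns].transform(zscore).to_numpy()
)
print(normalized)
```

After this, A and B have mean 0 and unit population standard deviation within each date, while C is left untouched. A group in which a column is constant would produce NaN because of the division by zero, so that case needs separate handling.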
