首页 > 解决方案 > 如何将多个数据框分组到聚合表中

问题描述

我使用加载数据

X_train, X_test, y_train, y_test = load_data() 
# y_train here has a string class name inside

然后我使用 sklearn 的编码器将我的数据更改为分类

from sklearn.preprocessing import LabelEncoder

数据示例

y_train = ["tomato","strawberry", "strawberry",  "potato", "strawberry",  "potato", "lemon"]
y_test = ["strawberry", "lemon", "lemon", "lemon"]

encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
print(y_train)
>>>[3 2 2 1 2 1 0]
y_test = encoder.transform(y_test)
print(y_test)
>>>[2 0 0 0]

我可以打电话encoder.inverse_transform(y_test)取回字符串

如何构建汇总表以显示类似的表

Label     | y_train_count | y_test_count
----------+---------------+-------------
strawberry| 3             | 1
potato    | 2             | 0
...       | ...           | ...

标签: pythonpandasnumpydataframegroup-by

解决方案


只需将列表编译成 Pandas 数据框,然后与groupby(用于长报告)和pivot_table(用于广泛报告)聚合。

汇编

# COMPILE EACH LIST INTO COLUMN WITH INDICATORS
y_train = ["tomato","strawberry", "strawberry",  "potato", "strawberry",  "potato", "lemon"]
y_test = ["strawberry", "lemon", "lemon", "lemon"]

categ_df = pd.concat([pd.DataFrame({'source': 'y_train', 'category': y_train}),
                      pd.DataFrame({'source': 'y_test', 'category': y_test})])

categ_df
#      source    category
# 0   y_train      tomato
# 1   y_train  strawberry
# 2   y_train  strawberry
# 3   y_train      potato
# 4   y_train  strawberry
# 5   y_train      potato
# 6   y_train       lemon
# 7    y_test  strawberry
# 8    y_test       lemon
# 9    y_test       lemon
# 10   y_test       lemon

聚合

all_vals = pd.MultiIndex.from_product(
                [categ_df['category'].unique(), categ_df['source'].unique()]
           )
                               
agg_df = (categ_df.groupby(['category', 'source'])
                  .agg(count=('category', 'size'))
                  .reindex(all_vals, axis='index')
                  .fillna(0)
         )
agg_df
#                     count
# tomato     y_train    1.0
#            y_test     0.0
# strawberry y_train    3.0
#            y_test     1.0
# potato     y_train    2.0
#            y_test     0.0
# lemon      y_train    1.0
#            y_test     3.0

agg_df = (categ_df.assign(key=1)
                  .pivot_table(index='category', columns='source',
                               values='key', aggfunc='count')
                  .reindex(categ_df['source'].unique(), 
                           axis='columns')
                  .fillna(0)
         )
agg_df
# source      y_test  y_train
# category
# lemon          3.0      1.0
# potato         0.0      2.0
# strawberry     1.0      3.0
# tomato         0.0      1.0

推荐阅读