python - 如何将多个数据框分组到聚合表中
问题描述
我使用加载数据
X_train, X_test, y_train, y_test = load_data()
# y_train here has a string class name inside
然后我使用 sklearn 的编码器将我的数据更改为分类
from sklearn.preprocessing import LabelEncoder
数据示例
y_train = ["tomato","strawberry", "strawberry", "potato", "strawberry", "potato", "lemon"]
y_test = ["strawberry", "lemon", "lemon", "lemon"]
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
print(y_train)
>>>[3 2 2 1 2 1 0]
y_test = encoder.transform(y_test)
print(y_test)
>>>[2 0 0 0]
我可以打电话encoder.inverse_transform(y_test)
取回字符串
如何构建汇总表以显示类似的表
Label | y_train_count | y_test_count
----------+---------------+-------------
strawberry| 3 | 1
potato | 2 | 0
... | ... | ...
解决方案
只需将列表编译成 Pandas 数据框,然后与groupby
(用于长报告)和pivot_table
(用于广泛报告)聚合。
汇编
# COMPILE EACH LIST INTO COLUMN WITH INDICATORS
y_train = ["tomato","strawberry", "strawberry", "potato", "strawberry", "potato", "lemon"]
y_test = ["strawberry", "lemon", "lemon", "lemon"]
categ_df = pd.concat([pd.DataFrame({'source': 'y_train', 'category': y_train}),
pd.DataFrame({'source': 'y_test', 'category': y_test})])
categ_df
# source category
# 0 y_train tomato
# 1 y_train strawberry
# 2 y_train strawberry
# 3 y_train potato
# 4 y_train strawberry
# 5 y_train potato
# 6 y_train lemon
# 7 y_test strawberry
# 8 y_test lemon
# 9 y_test lemon
# 10 y_test lemon
聚合
all_vals = pd.MultiIndex.from_product(
[categ_df['category'].unique(), categ_df['source'].unique()]
)
agg_df = (categ_df.groupby(['category', 'source'])
.agg(count=('category', 'size'))
.reindex(all_vals, axis='index')
.fillna(0)
)
agg_df
# count
# tomato y_train 1.0
# y_test 0.0
# strawberry y_train 3.0
# y_test 1.0
# potato y_train 2.0
# y_test 0.0
# lemon y_train 1.0
# y_test 3.0
agg_df = (categ_df.assign(key=1)
.pivot_table(index='category', columns='source',
values='key', aggfunc='count')
.reindex(categ_df['source'].unique(),
axis='columns')
.fillna(0)
)
agg_df
# source y_test y_train
# category
# lemon 3.0 1.0
# potato 0.0 2.0
# strawberry 1.0 3.0
# tomato 0.0 1.0
推荐阅读
- automated-tests - 如何在 Windows 10 上的 Cygwin 中安装 Dieharder?
- c++ - Visual Studio Code 中的调试器显示神秘的错误窗口
- excel - 有没有办法将大文本流分成 550 个字符的部分?
- python-3.x - 在 Pyshark 中连续捕获数据包
- javascript - js导入找不到wasm文件
- github - 如果推送者不是所有者,GitLab 不会自动镜像
- json - 从 json 值解析 JSON 对象字符串
- angular - 为循环生成的按钮分配一个 id
- postgresql - 在两个不同的 cloudera 集群上安装 postgresql
- docker - 如何从 docker 容器连接到 localhost elasticsearch