pandas - How to add a count for each column/group level
Question
I need to reshape a DataFrame from the following format:
| country | county | city | street |
|-----------|----------|--------|-----------|
| country 1 | county 1 | city 1 | street 1 |
| country 1 | county 1 | city 1 | street 2 |
| country 1 | county 1 | city 2 | street 3 |
| country 2 | county 2 | city 3 | street 4 |
| country 2 | county 2 | city 3 | street 5 |
| country 3 | county 3 | city 4 | street 6 |
| country 3 | county 4 | city 5 | street 7 |
| country 3 | county 4 | city 6 | street 8 |
| country 3 | county 4 | city 6 | street 9 |
| country 3 | county 4 | city 6 | street 10 |
to
| country | county | city | street | count |
|-----------|----------|--------|-----------|-------|
| country 1 | | | | 3 |
| | county 1 | | | 3 |
| | | city 1 | | 2 |
| | | | street 1 | 1 |
| | | | street 2 | 1 |
| | | city 2 | | 1 |
| | | | street 3 | 1 |
| country 2 | | | | 2 |
| | county 2 | | | 2 |
| | | city 3 | | 2 |
| | | | street 4 | 1 |
| | | | street 5 | 1 |
| country 3 | | | | 5 |
| | county 3 | | | 1 |
| | | city 4 | | 1 |
| | | | street 6 | 1 |
| | county 4 | | | 4 |
| | | city 5 | | 1 |
| | | | street 7 | 1 |
| | | city 6 | | 3 |
| | | | street 8 | 1 |
| | | | street 9 | 1 |
| | | | street 10 | 1 |
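The number of columns may vary.
I have been getting the counts with multiple groupby calls and trying to format the result in plain Python, without success. Is there a way to do this with pandas only?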
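Solution
You can iterate over the columns themselves and rely on DataFrame.value_counts() to do the counting at each nesting level. You need to work with the index while doing this so that everything realigns correctly later, but in the end you can simply pd.concat the chunks together: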
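import pandas as pd

# test_df: the DataFrame to summarise, one column per nesting level
chunk_counts = []
for col in test_df.columns:
    # Count unique combinations of all columns up to and including `col`.
    counts = test_df.loc[:, :col].value_counts()

    # Pad the deeper (uncounted) levels with a single empty-string label
    # so that every chunk ends up with the same index levels.
    n_empty_levels = test_df.columns.size - test_df.columns.get_loc(col) - 1
    empty_levels = [[""]] * n_empty_levels
    new_levels = [*counts.index.levels, *empty_levels]
    new_index = pd.MultiIndex.from_product(new_levels, names=test_df.columns)
    chunk_counts.append(counts.reindex(new_index))

final_series = (
    pd.concat(chunk_counts)
    .sort_index()
    .dropna()  # drop combinations from the product that never occur
    .astype(int)
    .rename("count")
)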
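The loop relies on a `test_df` that the answer never shows. Here is a minimal sketch of a matching input, assuming the capitalised labels that appear in the printed result at the end of the answer. The names `rows` and `per_level` are introduced here purely for illustration. It also prints the counts for each column prefix, which is what each loop iteration starts from.

```python
import pandas as pd

# Assumed sample data, transcribed from the table in the question.
rows = [
    ("Country 1", "County 1", "City 1", "Street 1"),
    ("Country 1", "County 1", "City 1", "Street 2"),
    ("Country 1", "County 1", "City 2", "Street 3"),
    ("Country 2", "County 2", "City 3", "Street 4"),
    ("Country 2", "County 2", "City 3", "Street 5"),
    ("Country 3", "County 3", "City 4", "Street 6"),
    ("Country 3", "County 4", "City 5", "Street 7"),
    ("Country 3", "County 4", "City 6", "Street 8"),
    ("Country 3", "County 4", "City 6", "Street 9"),
    ("Country 3", "County 4", "City 6", "Street 10"),
]
test_df = pd.DataFrame(rows, columns=["Country", "County", "City", "Street"])

# Label-based slicing with `.loc[:, :col]` includes `col` itself, so each
# entry counts unique combinations of one more column than the previous one.
per_level = {col: test_df.loc[:, :col].value_counts() for col in test_df.columns}

for col, counts in per_level.items():
    print(f"--- counted up to {col!r}: {len(counts)} groups")
    print(counts)
```

Each of these per-level Series sums to the number of input rows, which is a quick sanity check before the chunks are concatenated.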
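If you print(final_series), the repr looks fine, but the MultiIndex does not actually have empty entries below each nesting level (that is just how a MultiIndex is displayed). This becomes obvious once we call reset_index. To turn the Series back into a DataFrame while keeping the format the OP asked for, we need a few more adjustments.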
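index_cols = list(final_series.index.names)
final_df = final_series.reset_index()
# Within each level column, keep only the first occurrence of a label;
# the repeats materialised by reset_index become NaN and are then blanked.
final_df[index_cols] = final_df[index_cols].where(
    ~final_df[index_cols].apply(pd.Series.duplicated)
)
final_df = final_df.fillna("")
print(final_df)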
      Country    County    City     Street  count
0   Country 1                                   3
1              County 1                         3
2                        City 1                 2
3                                 Street 1      1
4                                 Street 2      1
5                        City 2                 1
6                                 Street 3      1
7   Country 2                                   2
8              County 2                         2
9                        City 3                 2
10                                Street 4      1
11                                Street 5      1
12  Country 3                                   5
13             County 3                         1
14                       City 4                 1
15                                Street 6      1
16             County 4                         4
17                       City 5                 1
18                                Street 7      1
19                       City 6                 3
20                               Street 10      1
21                                Street 8      1
22                                Street 9      1
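One caveat about the `duplicated` mask: it works for the sample data because every label is unique within its column. If the same name can occur in two branches (say, a city name shared by two counties), the second occurrence would be blanked as well. Below is a possible alternative sketch, which is not part of the original answer. It wraps the whole procedure in a hypothetical helper, `nested_counts`, that accepts any number of columns and keeps a label only on the row where it is the deepest non-empty level. It assumes string labels, no missing values, no real label equal to `""`, and no input column already named `count`.

```python
import pandas as pd


def nested_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per nesting level with its row count."""
    cols = list(df.columns)
    chunks = []
    for depth in range(1, len(cols) + 1):
        used = cols[:depth]
        # Count rows per unique combination of the first `depth` columns.
        counts = df.groupby(used, sort=False).size().rename("count").reset_index()
        # Pad the deeper levels with "" so every chunk has the same columns.
        for col in cols[depth:]:
            counts[col] = ""
        chunks.append(counts[cols + ["count"]])

    out = pd.concat(chunks, ignore_index=True)
    # "" sorts before any non-empty label, so parents precede their children.
    out = out.sort_values(cols, ignore_index=True)
    # Blank a label whenever the next deeper level is filled in on that row.
    for this_col, next_col in zip(cols[:-1], cols[1:]):
        out.loc[out[next_col] != "", this_col] = ""
    return out


# Made-up data where one city name appears under two different counties.
demo = pd.DataFrame(
    {
        "country": ["A", "A", "A"],
        "county": ["X", "X", "Y"],
        "city": ["Springfield", "Springfield", "Springfield"],
    }
)
result = nested_counts(demo)
print(result)
```

Because the rows are sorted as plain strings, labels such as "Street 10" still come before "Street 8", just as in the output above.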