python - Cleaning a DataFrame
Problem description
I have an Excel spreadsheet that, once imported, looks something like this:
from datetime import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
2021-08-01 | 2021-09-01 | 2021-10-01 | 2021-11-01 | 2021-12-01 |
---|---|---|---|---|
120 | NaN | NaN | 80 | NaN |
NaN | NaN | 40 | NaN | 20 |
NaN | 50 | NaN | 50 | NaN |
NaN | NaN | 100 | NaN | NaN |
300 | NaN | NaN | NaN | NaN |
I would like to convert it (in Python) into something like this:
shouldbe = pd.DataFrame({
    "PayDate1":
        [datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
    "Amount1": [120, 40, 50, 100, 300],
    "PayDate2":
        [datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), '', ''],
    "Amount2": [80, 20, 50, np.nan, np.nan]})
PayDate1 | Amount1 | PayDate2 | Amount2 |
---|---|---|---|
2021-08-01 | 120 | 2021-11-01 | 80 |
2021-10-01 | 40 | 2021-12-01 | 20 |
2021-09-01 | 50 | 2021-11-01 | 50 |
2021-10-01 | 100 | NaT | NaN |
2021-08-01 | 300 | NaT | NaN |
I'm looking for some examples of how to do this transformation. Thanks in advance for your help.
Solution
You can use melt, groupby and pivot to get the expected DataFrame:
- Reshape your DataFrame with melt:
out = df.reset_index() \
.melt(id_vars='index', var_name='PayDate', value_name='Amount') \
.dropna()
print(out)
# Output
index PayDate Amount
0 0 2021-08-01 120.0 # <- index 0, 1st occurrence
4 4 2021-08-01 300.0 # <- index 4, 1st occurrence
7 2 2021-09-01 50.0 # <- index 2, 1st occurrence
11 1 2021-10-01 40.0 # <- index 1, 1st occurrence
13 3 2021-10-01 100.0 # <- index 3, 1st occurrence
15 0 2021-11-01 80.0 # <- index 0, 2nd occurrence
17 2 2021-11-01 50.0 # <- index 2, 2nd occurrence
21 1 2021-12-01 20.0 # <- index 1, 2nd occurrence
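- Group by index and apply cumcount to number each row's payments; these numbers become the suffix of the new column names ('1' and '2' are kept as strings so they can be concatenated later):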
out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)
# Output:
index PayDate Amount col
0 0 2021-08-01 120.0 1
4 4 2021-08-01 300.0 1
7 2 2021-09-01 50.0 1
11 1 2021-10-01 40.0 1
13 3 2021-10-01 100.0 1
15 0 2021-11-01 80.0 2
17 2 2021-11-01 50.0 2
21 1 2021-12-01 20.0 2
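To see the cumcount step in isolation, here is a minimal sketch on a hypothetical toy frame (the name `toy` and its `row` column are made up for illustration; `row` plays the role of `index` above). Each value gets the running count of how often its group key has appeared so far:

```python
import pandas as pd

# Hypothetical toy data: 'row' stands in for the 'index' column of the melted frame.
toy = pd.DataFrame({"row": [0, 4, 2, 0, 2], "Amount": [120, 300, 50, 80, 50]})

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of appearance;
# add(1) makes the numbering start at 1, astype(str) prepares it for concatenation.
toy["col"] = toy.groupby("row").cumcount().add(1).astype(str)

print(toy["col"].tolist())  # ['1', '1', '1', '2', '2']
```

The second time row 0 and row 2 show up they receive '2', which is what later turns into the PayDate2 / Amount2 columns.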
- Pivot the DataFrame:
out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)
# Output
PayDate Amount
col 1 2 1 2
index
0 2021-08-01 2021-11-01 120.0 80.0
1 2021-10-01 2021-12-01 40.0 20.0
2 2021-09-01 2021-11-01 50.0 50.0
3 2021-10-01 NaT 100.0 NaN
4 2021-08-01 NaT 300.0 NaN
- Get the final DataFrame:
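# A stable sort on the '1'/'2' level keeps PayDate ahead of Amount within each pair
cols = out.columns.get_level_values(1).argsort(kind='stable')
# Flatten ('PayDate', '1') into 'PayDate1'
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None
out = out[out.columns[cols]]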
print(out)
PayDate1 Amount1 PayDate2 Amount2
0 2021-08-01 120.0 2021-11-01 80.0
1 2021-10-01 40.0 2021-12-01 20.0
2 2021-09-01 50.0 2021-11-01 50.0
3 2021-10-01 100.0 NaT NaN
4 2021-08-01 300.0 NaT NaN
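If you want the steps above in one place, they can be sketched as a single function. This is only one way to put them together, under a few assumptions: the name `widen_payments` is made up for this example, the counter is kept as an integer rather than a string, and the column order is built explicitly instead of with argsort.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def widen_payments(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a sparse row-by-date grid into PayDateN / AmountN columns."""
    # Long format: one row per (original row, date) cell, empty cells dropped.
    long = (
        df.reset_index()
        .melt(id_vars="index", var_name="PayDate", value_name="Amount")
        .dropna(subset=["Amount"])
    )
    # Number each original row's payments 1, 2, ... in date-column order.
    long["col"] = long.groupby("index").cumcount().add(1)
    # Wide format: one PayDate/Amount pair per payment number.
    wide = long.pivot(index="index", columns="col", values=["PayDate", "Amount"])
    # Interleave the columns as PayDate1, Amount1, PayDate2, Amount2, ...
    n = int(wide.columns.get_level_values("col").max())
    order = [(name, i) for i in range(1, n + 1) for name in ("PayDate", "Amount")]
    wide = wide[order]
    wide.columns = [f"{name}{i}" for name, i in wide.columns]
    wide.index.name = None
    return wide


# Same sample data as in the question.
df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})

out = widen_payments(df)
print(out)
```

Because the column order is built from the largest payment number found, the same function should also cover rows with three or more payments, which would simply add PayDate3 / Amount3 and so on.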