python - Adding new columns to a pandas df after filtering
Question
I have a df that contains information about various places.
import pandas as pd
d = ({
'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],
'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],
'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})
df = pd.DataFrame(data=d)
I want to export this data by place; the places are labelled in Column D. I also want to add new columns based on the events in Column B.
df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()
incl = ['Home', 'Away']
for k, g in df[df.D.isin(incl)].groupby('D'):
    Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
    Start = g.loc[df['B'] == 'Start'].reset_index()['C']
    Res = g.loc[df['B'] == 'Res'].reset_index()['C']
    g['Start_diff'] = Start - Stop
    g['Res_diff'] = Start - Res
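The problem is that these events occur multiple times, and each occurrence is labelled in Column F. So if we look at the export for Home, only the first occurrence in Column F gets its differences filled in.
Output: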
    A      B      C     D     E  F  Start_diff  Res_diff
0   A   Stop  28800  Home  Num:  1        3000      2400
2   A    Res  29400  Home  Num:  1
6   A  Start  31800  Home  Num:  1
10  A   Stop  33300  Home  Num:  2
12  A    Res  33660  Home  Num:  2
16  A  Start  34800  Home  Num:  2
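The single filled row can be traced to index alignment. The sketch below rebuilds only the Home rows by hand (the hard-coded frame and the lowercase names `stop`, `start` and `diff` are illustrative, not part of the original code). After `reset_index()` the differences are labelled 0 and 1, and column assignment matches on index labels, so only the row labelled 0 in the group receives a value.

```python
import pandas as pd

# Hand-built copy of the 'Home' rows, keeping the original index labels.
g = pd.DataFrame(
    {
        "B": ["Stop", "Res", "Start", "Stop", "Res", "Start"],
        "C": [28800.0, 29400.0, 31800.0, 33300.0, 33660.0, 34800.0],
    },
    index=[0, 2, 6, 10, 12, 16],
)

# reset_index() relabels each selection as 0, 1, ...
stop = g.loc[g["B"] == "Stop"].reset_index()["C"]
start = g.loc[g["B"] == "Start"].reset_index()["C"]
diff = start - stop  # two values, labelled 0 and 1

# Assignment aligns on index labels: label 0 exists in g, label 1 does not,
# so the second difference is silently dropped.
g["Start_diff"] = diff

print(diff.tolist())
print(g["Start_diff"].notna().sum())
```

Both differences are computed, but only one survives the assignment, which is consistent with the output shown above.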
The output I expect is:
    A      B      C     D     E  F  Start_diff  Res_diff
0   A   Stop  28800  Home  Num:  1        3000      2400
2   A    Res  29400  Home  Num:  1
6   A  Start  31800  Home  Num:  1
10  A   Stop  33300  Home  Num:  2        1500      1140
12  A    Res  33660  Home  Num:  2
16  A  Start  34800  Home  Num:  2
I tried changing for k, g in df[df.D.isin(incl)].groupby('D'):
to for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():
but I get the error TypeError: 'numpy.int64' object is not iterable
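A small sketch of where that TypeError most likely comes from, using a cut-down frame invented for illustration: `groupby('D').F.nunique()` returns a Series of counts, and iterating a Series yields single integers, which cannot be unpacked into `k, g`. Grouping by a list of columns, by contrast, yields `(key, group)` pairs whose key is a tuple. The exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Cut-down, illustrative frame with only the two grouping columns.
df = pd.DataFrame({
    "D": ["Home", "Home", "Away", "Home"],
    "F": ["1", "1", "1", "2"],
})

# nunique() collapses each group to one integer: a Series, not a GroupBy.
counts = df.groupby("D").F.nunique()

try:
    for k, g in counts:  # each item is a single integer, so unpacking fails
        pass
    error = None
except TypeError as exc:
    error = exc

# Grouping by both columns keeps a GroupBy object that can be iterated.
keys = [k for k, g in df.groupby(["D", "F"])]

print(type(error).__name__)
print(keys)
```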
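Solution
I believe you need groupby on columns D and F with a custom function, and then mask to replace the repeated values within each group with NaN: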
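def f(g):
    # Look up the time of each event within this (D, F) group
    Stop = g.loc[g['B'] == 'Stop', 'C']
    Start = g.loc[g['B'] == 'Start', 'C']
    Res = g.loc[g['B'] == 'Res', 'C']
    # The scalar differences are broadcast to every row of the group
    g['Start_diff'] = Start.values[0] - Stop.values[0]
    g['Res_diff'] = Start.values[0] - Res.values[0]
    return g

# group_keys=False keeps the original row index in the result (needed on pandas 2.x)
df = df[df.D.isin(incl)].groupby(['D', 'F'], group_keys=False).apply(f)
# Keep the differences only on the first row of each (D, F) group
df[['Start_diff', 'Res_diff']] = df[['Start_diff', 'Res_diff']].mask(df.duplicated(['D', 'F']))
print(df)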
          C     D     E  F  A      B  Start_diff  Res_diff
0   28800.0  Home  Num:  1  A   Stop      3000.0    2400.0
2   29400.0  Home  Num:  1  A    Res         NaN       NaN
4   31302.0  Away  Num:  1  A   Stop      2898.0    1800.0
6   31800.0  Home  Num:  1  A  Start         NaN       NaN
8   32400.0  Away  Num:  1  A    Res         NaN       NaN
10  33300.0  Home  Num:  2  A   Stop      1500.0    1140.0
12  33660.0  Home  Num:  2  A    Res         NaN       NaN
14  34200.0  Away  Num:  1  A  Start         NaN       NaN
16  34800.0  Home  Num:  2  A  Start         NaN       NaN
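As a cross-check, the same numbers can be reached without `groupby.apply`, whose behaviour has changed across pandas versions. This is only a sketch of an alternative, not the method used above: it assumes the blank 'XX' rows have already been dropped (so the index here runs 0 to 8 rather than 0, 2, 4, ...) and that each (D, F) pair has exactly one Stop, Res and Start row. The names `times`, `diffs` and `out` are made up for this example.

```python
import pandas as pd

# The question's data with the blank 'XX' rows left out (an assumption).
df = pd.DataFrame({
    "C": ["08:00:00", "08:10:00", "08:41:42", "08:50:00", "09:00:00",
          "09:15:00", "09:21:00", "09:30:00", "09:40:00"],
    "D": ["Home", "Home", "Away", "Home", "Away", "Home", "Home", "Away", "Home"],
    "F": ["1", "1", "1", "1", "1", "2", "2", "1", "2"],
    "B": ["Stop", "Res", "Stop", "Start", "Res", "Stop", "Res", "Start", "Start"],
})
df["C"] = pd.to_timedelta(df["C"]).dt.total_seconds()

# One row per (D, F) pair, one column per event in B.
times = df.pivot(index=["D", "F"], columns="B", values="C")
diffs = pd.DataFrame({
    "Start_diff": times["Start"] - times["Stop"],
    "Res_diff": times["Start"] - times["Res"],
}).reset_index()

# Attach the differences to every row, then blank all but the first row per pair.
out = df.merge(diffs, on=["D", "F"], how="left")
cols = ["Start_diff", "Res_diff"]
out[cols] = out[cols].mask(out.duplicated(["D", "F"]))

print(out)
```

The pivot step fails loudly if a (D, F) pair contains the same event twice, which may be preferable to silently taking the first match with `.values[0]`.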