python - Select rows of a Pandas DataFrame based on a dict
Problem description
I have a pandas DataFrame with only two columns, like this:
Timestamp X
0 2017-01-01 00:00:00 18450
1 2017-01-01 00:10:00 13787
2 2017-01-01 00:20:00 3249
3 2017-01-01 00:30:00 44354
4 2017-01-01 00:40:00 50750
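The Timestamp column runs from the start of the month to the end of the month at roughly 10-minute intervals. The following code creates a sample:
import random
import pandas as pd
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 00:00:00'), end=pd.Timestamp('2017-01-20 00:00:00'), freq='10min')
l_data['X'] = random.sample(range(0, 100000), len(l_data))
I also have a dictionary like this: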
{Timestamp('2017-01-18 01:37:19.160000'): Timestamp('2017-01-18 01:37:29.520000'),
Timestamp('2017-01-18 01:41:04.880000'): Timestamp('2017-01-18 01:41:10.280000'),
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000'),
Timestamp('2017-01-18 21:40:02.120000'): Timestamp('2017-01-18 21:50:00.040000'),
Timestamp('2017-01-18 21:50:02.120000'): Timestamp('2017-01-18 22:00:00.040000'),
Timestamp('2017-01-18 22:00:02.120000'): Timestamp('2017-01-18 22:01:50.760000'),
Timestamp('2017-01-18 22:20:22.760000'): Timestamp('2017-01-18 22:25:20.760000'),
Timestamp('2017-01-18 22:35:52.800000'): Timestamp('2017-01-18 22:40:00.040000')}
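The keys of the dictionary are start times and the values are end times. I want to create a column named L in l_data based on this dict. If the time between a key and its value in the dict is greater than 5 minutes, the timestamps in l_data that fall within that range must be marked as 1.
How can I achieve this in pandas in a straightforward way, without using multiple loops?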
The expected output would look like this:
126 1/18/2017 21:00 43401 0
127 1/18/2017 21:10 290 0
128 1/18/2017 21:20 92509 0
129 1/18/2017 21:30 64545 0
130 1/18/2017 21:40 47780 1
131 1/18/2017 21:50 53293 1
132 1/18/2017 22:00 45634 0
133 1/18/2017 22:10 51462 0
134 1/18/2017 22:20 44736 0
135 1/18/2017 22:30 11697 1
136 1/18/2017 22:40 82587 1
137 1/18/2017 22:50 76250 0
138 1/18/2017 23:00 33307 0
139 1/18/2017 23:10 25851 0
140 1/18/2017 23:20 71131 0
141 1/18/2017 23:30 88015 0
142 1/18/2017 23:40 45577 0
143 1/18/2017 23:50 76761 0
144 1/19/2017 0:00 45363 0
Only the important rows are shown.
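The 5-minute rule described above can be checked directly on the start/end pairs. The following is a minimal sketch, assuming a hypothetical name `intervals` for a dict that holds three of the pairs from the question; the names `spans` and `long_spans` are also made up for illustration.

```python
import pandas as pd

# Three of the start -> end pairs from the question (hypothetical name `intervals`).
intervals = {
    pd.Timestamp('2017-01-18 21:33:52.800000'): pd.Timestamp('2017-01-18 21:40:00.040000'),
    pd.Timestamp('2017-01-18 22:00:02.120000'): pd.Timestamp('2017-01-18 22:01:50.760000'),
    pd.Timestamp('2017-01-18 22:20:22.760000'): pd.Timestamp('2017-01-18 22:25:20.760000'),
}

# One row per interval, with its duration as a Timedelta.
spans = pd.DataFrame({'start': list(intervals.keys()),
                      'end': list(intervals.values())})
spans['duration'] = spans['end'] - spans['start']

# Keep only the intervals strictly longer than 5 minutes.
long_spans = spans[spans['duration'] > pd.Timedelta('5min')]
print(long_spans)
```

Only the first pair (a little over 6 minutes) passes the filter; the third one is 4 minutes 58 seconds long and is dropped.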
Solution
I believe you need:
import pandas as pd

d = { pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')}
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
                                    end=pd.Timestamp('2017-01-19 04:00:00'), freq='10min')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame({'start': list(d.keys()),'end': list(d.values())})
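# keep only the intervals longer than 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')].copy()
# snap each end time to the 10-minute grid: round down if it is less than 1 minute past a grid point, otherwise round up
s = df['end'].dt.floor('10min')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print(df)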
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
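The end-time correction can be looked at in isolation. The sketch below applies the same floor-then-`where` rule to the first two end times from the table above; the names `ends`, `floored` and `snapped` are hypothetical, and the 1-minute tolerance is the one used in the answer's code.

```python
import pandas as pd

# Two end times taken from rows 0 and 1 of the printed frame.
ends = pd.Series(pd.to_datetime(['2017-01-18 21:50:29.040',
                                 '2017-01-18 22:52:00.040']))

# Round each end time down to the 10-minute grid.
floored = ends.dt.floor('10min')

# Keep the floored value when the end is less than 1 minute past the grid point,
# otherwise move up to the next grid point.
snapped = floored.where((ends - floored) < pd.Timedelta('1min'),
                        floored + pd.Timedelta('10min'))
print(snapped)
```

The first value is 29 seconds past 21:50 and stays at 21:50:00; the second is 2 minutes past 22:50 and moves up to 23:00:00, which agrees with the `end1` column shown above.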
# for each interval, resample to 10 minutes to add the missing datetimes between start and end1
v = (df.reset_index()[['start','end1','index']]
       .melt('index')
       .set_index('value')
       .groupby('index')
       .resample('10min')['index']
       .ffill()
       .dropna()
       .index
       .get_level_values(1)
       .unique()
     )
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
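As a possible alternative to the melt/resample chain, the same marking can be sketched with NumPy broadcasting: a timestamp gets a 1 when it lies between the start and the corrected end of at least one interval longer than 5 minutes. This is only a sketch under assumptions, not the method from the answer above: the function name `mark_intervals` and its parameters are hypothetical, and it compares every timestamp with every interval, so memory use grows with the product of the two sizes.

```python
import pandas as pd

def mark_intervals(timestamps, intervals, min_length='5min',
                   freq='10min', tolerance='1min'):
    """Return a 0/1 Series aligned with `timestamps` (hypothetical helper)."""
    df = pd.DataFrame({'start': list(intervals.keys()),
                       'end': list(intervals.values())})
    # Keep only the intervals strictly longer than min_length.
    df = df[(df['end'] - df['start']) > pd.Timedelta(min_length)]
    # Same end-time correction as in the answer: floor to the grid, then move up
    # one step unless the end is within `tolerance` of the floored value.
    floored = df['end'].dt.floor(freq)
    end1 = floored.where((df['end'] - floored) < pd.Timedelta(tolerance),
                         floored + pd.Timedelta(freq))
    # Broadcast an (n, 1) column of timestamps against the (m,) interval bounds.
    ts = timestamps.to_numpy()[:, None]
    inside = (ts >= df['start'].to_numpy()) & (ts <= end1.to_numpy())
    return pd.Series(inside.any(axis=1).astype(int), index=timestamps.index)

# Small usage example with two intervals; only the first is longer than 5 minutes.
demo_intervals = {
    pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
    pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
}
demo_ts = pd.Series(pd.date_range('2017-01-18 21:30:00', '2017-01-18 22:00:00',
                                  freq='10min'))
print(mark_intervals(demo_ts, demo_intervals).tolist())  # [0, 0, 1, 0]
```

With the data from the answer, `l_data['L'] = mark_intervals(l_data['Timestamp'], d)` should give the same column as the `expected` list, since both approaches select the grid points between each start and its corrected end.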