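python - Filtering with a dictionary in Pandas
Problem description
I work for a kiosk company, and we are trying to find out whether a UI update has made any difference. Each machine received the update at a different date/time. I have built a dictionary of machine_ids and the timestamps at which the new UI was installed. I now want to use it to filter the results, so that only rows are returned where the machine_id is in the dictionary and the deposit date is later than the corresponding date in the dictionary.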
uidict= {
14.0: Timestamp('2018-10-12 17:48:57'),
16.0: Timestamp('2018-10-12 13:38:00'),
19.0: Timestamp('2018-10-17 20:17:33'),
20.0: Timestamp('2018-10-15 12:15:34'),
27.0: Timestamp('2018-09-26 11:50:01'),
29.0: Timestamp('2018-10-03 13:38:17'),
31.0: Timestamp('2018-10-17 10:06:23'),
33.0: Timestamp('2018-09-21 15:17:14'),
34.0: Timestamp('2018-10-17 11:42:21'),
42.0: Timestamp('2018-10-16 12:36:32'),
45.0: Timestamp('2018-09-23 13:23:37'),
53.0: Timestamp('2018-09-27 12:18:39'),
60.0: Timestamp('2018-10-15 15:27:46'),
62.0: Timestamp('2018-08-30 17:26:27'),
63.0: Timestamp('2018-09-25 17:44:04'),
64.0: Timestamp('2018-09-23 14:19:57'),
65.0: Timestamp('2018-08-31 19:07:47'),
66.0: Timestamp('2018-09-08 14:12:20'),
67.0: Timestamp('2018-09-11 08:18:31'),
69.0: Timestamp('2018-09-20 17:12:37'),
70.0: Timestamp('2018-09-24 12:56:45'),
71.0: Timestamp('2018-08-27 09:37:17'),
72.0: Timestamp('2018-09-05 19:07:34'),
73.0: Timestamp('2018-09-10 14:42:52'),
74.0: Timestamp('2018-09-25 16:36:05'),
75.0: Timestamp('2018-08-27 10:09:02'),
76.0: Timestamp('2018-09-13 07:20:40'),
77.0: Timestamp('2018-09-02 14:10:22'),
78.0: Timestamp('2018-09-26 15:06:51'),
79.0: Timestamp('2018-08-31 15:52:49'),
81.0: Timestamp('2018-10-05 10:05:11')}
I tried this filter to get it working:
df[(df.machine_id.isin(uidict.keys()))&(df.deposited_at>uidict[df.machine_id])]
But this returns
TypeError: 'Series' objects are mutable, thus they cannot be hashed
So I figured I would forget the dictionary and just use the groupby Series I had made, but...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-90-10d8db20a295> in <module>()
----> 1 df[(df.machine_name.isin(newuidict.index))&(df.deposited_at>newuidict[df.machine_name])]
~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
816 if not self._indexed_same(other):
817 msg = 'Can only compare identically-labeled Series objects'
--> 818 raise ValueError(msg)
819 return self._constructor(na_op(self.values, other.values),
820 index=self.index, name=name)
ValueError: Can only compare identically-labeled Series objects
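Running it with a function and apply takes a very long time, and I have to run this code often. Is there a way to make this kind of filtering work?
A small sample of the data: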
machine_id deposited_at
12 2018-10-04 14:49:38
56 2018-09-20 14:41:59
24 2018-08-25 14:50:07
56 2018-08-04 15:33:09
12 2018-08-01 18:18:44
24 2018-09-24 12:34:35
35 2018-10-01 17:09:38
21 2018-09-27 11:32:02
21 2018-09-27 11:33:55
23 2018-08-30 10:03:01
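Solution
[Answer requires Python 3 and Pandas]
If changing uidict isn't too much trouble, you can convert it to a DataFrame and use a join. I'll walk through the process below:
First, recreate your uidict: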
import pandas as pd
from pandas import Timestamp
uidict= {
14.0: Timestamp('2018-10-12 17:48:57'),
16.0: Timestamp('2018-10-12 13:38:00'),
19.0: Timestamp('2018-10-17 20:17:33'),
20.0: Timestamp('2018-10-15 12:15:34'),
27.0: Timestamp('2018-09-26 11:50:01'),
29.0: Timestamp('2018-10-03 13:38:17'),
31.0: Timestamp('2018-10-17 10:06:23'),
33.0: Timestamp('2018-09-21 15:17:14'),
34.0: Timestamp('2018-10-17 11:42:21'),
42.0: Timestamp('2018-10-16 12:36:32'),
45.0: Timestamp('2018-09-23 13:23:37'),
53.0: Timestamp('2018-09-27 12:18:39'),
60.0: Timestamp('2018-10-15 15:27:46'),
62.0: Timestamp('2018-08-30 17:26:27'),
63.0: Timestamp('2018-09-25 17:44:04'),
64.0: Timestamp('2018-09-23 14:19:57'),
65.0: Timestamp('2018-08-31 19:07:47'),
66.0: Timestamp('2018-09-08 14:12:20'),
67.0: Timestamp('2018-09-11 08:18:31'),
69.0: Timestamp('2018-09-20 17:12:37'),
70.0: Timestamp('2018-09-24 12:56:45'),
71.0: Timestamp('2018-08-27 09:37:17'),
72.0: Timestamp('2018-09-05 19:07:34'),
73.0: Timestamp('2018-09-10 14:42:52'),
74.0: Timestamp('2018-09-25 16:36:05'),
75.0: Timestamp('2018-08-27 10:09:02'),
76.0: Timestamp('2018-09-13 07:20:40'),
77.0: Timestamp('2018-09-02 14:10:22'),
78.0: Timestamp('2018-09-26 15:06:51'),
79.0: Timestamp('2018-08-31 15:52:49'),
81.0: Timestamp('2018-10-05 10:05:11')
}
Then we can use this line to create a pandas DataFrame. For later convenience, I name the column holding the dictionary keys "machine_id".
uidf = pd.DataFrame(list(uidict.items()),columns=['machine_id','ui_date'])
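The result is (the row order shown reflects dict ordering on an older Python; on Python 3.7+ the rows follow the order in which the keys were written):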
machine_id ui_date
0 64.0 2018-09-23 14:19:57
1 65.0 2018-08-31 19:07:47
2 66.0 2018-09-08 14:12:20
3 67.0 2018-09-11 08:18:31
4 69.0 2018-09-20 17:12:37
5 70.0 2018-09-24 12:56:45
6 71.0 2018-08-27 09:37:17
7 72.0 2018-09-05 19:07:34
8 73.0 2018-09-10 14:42:52
9 74.0 2018-09-25 16:36:05
10 75.0 2018-08-27 10:09:02
11 76.0 2018-09-13 07:20:40
12 77.0 2018-09-02 14:10:22
13 14.0 2018-10-12 17:48:57
14 79.0 2018-08-31 15:52:49
15 16.0 2018-10-12 13:38:00
16 81.0 2018-10-05 10:05:11
17 19.0 2018-10-17 20:17:33
18 20.0 2018-10-15 12:15:34
19 78.0 2018-09-26 15:06:51
20 27.0 2018-09-26 11:50:01
21 29.0 2018-10-03 13:38:17
22 31.0 2018-10-17 10:06:23
23 33.0 2018-09-21 15:17:14
24 34.0 2018-10-17 11:42:21
25 42.0 2018-10-16 12:36:32
26 45.0 2018-09-23 13:23:37
27 53.0 2018-09-27 12:18:39
28 60.0 2018-10-15 15:27:46
29 62.0 2018-08-30 17:26:27
30 63.0 2018-09-25 17:44:04
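The keys of uidict are floats, while the machine_id values in the sample data are integers. pandas can still match the two in a merge, but it may be tidier to keep both sides on the same dtype. A minimal sketch, assuming every key is a whole number; `uidict_small` and `uidf_int` are hypothetical names used only for this illustration:

```python
import pandas as pd
from pandas import Timestamp

# Small stand-in for the full uidict (values copied from the question).
uidict_small = {
    14.0: Timestamp('2018-10-12 17:48:57'),
    81.0: Timestamp('2018-10-05 10:05:11'),
}

# Build the lookup table from a Series, then cast the float keys to int
# so they share a dtype with an integer machine_id column.
uidf_int = (
    pd.Series(uidict_small, name='ui_date')
    .rename_axis('machine_id')   # name the index so reset_index labels the column
    .reset_index()
    .astype({'machine_id': 'int64'})
)
```

This yields the same two columns as the `pd.DataFrame(list(uidict.items()), ...)` line above, only with integer IDs.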
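Then recreate your sample data, with two test rows added at the bottom, because the sample you provided does not appear to have any matches in uidict. Specifically, one row with machine_id = 81 dated before the date in uidict, and one dated after it.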
data_sample = pd.DataFrame(
[
{'machine_id': 12, 'deposited_at' : Timestamp('2018-10-04 14:49:38')},
{'machine_id': 56, 'deposited_at' : Timestamp('2018-09-20 14:41:59')},
{'machine_id': 24, 'deposited_at' : Timestamp('2018-08-25 14:50:07')},
{'machine_id': 56, 'deposited_at' : Timestamp('2018-08-04 15:33:09')},
{'machine_id': 12, 'deposited_at' : Timestamp('2018-08-01 18:18:44')},
{'machine_id': 24, 'deposited_at' : Timestamp('2018-09-24 12:34:35')},
{'machine_id': 35, 'deposited_at' : Timestamp('2018-10-01 17:09:38')},
{'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:32:02')},
{'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:33:55')},
{'machine_id': 23, 'deposited_at' : Timestamp('2018-08-30 10:03:01')},
{'machine_id': 81, 'deposited_at' : Timestamp('2018-09-01 10:03:01')},
{'machine_id': 81, 'deposited_at' : Timestamp('2018-10-06 10:03:01')}
]
)
deposited_at machine_id
0 2018-10-04 14:49:38 12
1 2018-09-20 14:41:59 56
2 2018-08-25 14:50:07 24
3 2018-08-04 15:33:09 56
4 2018-08-01 18:18:44 12
5 2018-09-24 12:34:35 24
6 2018-10-01 17:09:38 35
7 2018-09-27 11:32:02 21
8 2018-09-27 11:33:55 21
9 2018-08-30 10:03:01 23
10 2018-09-01 10:03:01 81
11 2018-10-06 10:03:01 81
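Then we inner-join the two DataFrames using "machine_id" as the key, and follow up with a simple filter condition on the dates. The last line simply tidies up the columns to resemble your original input.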
filtered_dataframe = data_sample.merge(uidf, on=['machine_id'], how='inner')
filtered_dataframe = filtered_dataframe[
filtered_dataframe['deposited_at'] > filtered_dataframe['ui_date']
]
filtered_dataframe = filtered_dataframe[['machine_id', 'deposited_at']]
This effectively ensures that 1) the machine ID in the data sample is in your UI table, and 2) the deposit date is later than the date in the UI table:
machine_id deposited_at
1 81 2018-10-06 10:03:01
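If you would rather keep the dictionary, one possible alternative is `Series.map`, which does the lookup element by element. The attempt in the question failed because `uidict[df.machine_id]` tries to use the whole Series as a single dictionary key. The sketch below is only an illustration under assumptions: `uidict_small`, `sample`, `cutoff` and `result` are hypothetical names, and the data is a trimmed copy of the example above.

```python
import pandas as pd
from pandas import Timestamp

# Trimmed-down stand-ins for uidict and the sample data above.
uidict_small = {81.0: Timestamp('2018-10-05 10:05:11')}
sample = pd.DataFrame({
    'machine_id': [12, 81, 81],
    'deposited_at': [
        Timestamp('2018-10-04 14:49:38'),
        Timestamp('2018-09-01 10:03:01'),
        Timestamp('2018-10-06 10:03:01'),
    ],
})

# Look up each row's UI install time; machines missing from the dict get NaT.
# The cast to float matches the float keys used in uidict.
cutoff = sample['machine_id'].astype('float64').map(pd.Series(uidict_small))

# Comparisons against NaT evaluate to False, so unknown machines drop out
# without a separate isin() check.
result = sample[sample['deposited_at'] > cutoff]
```

Because `cutoff` shares the index of `sample`, the comparison does not hit the "identically-labeled Series" error, and the original row labels and columns are preserved.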
Hope this is what you were looking for!