python - 根据键修剪有序字典?
问题描述
根据它们的键“修剪”字典的最快方法是什么?我的理解是字典现在保留了 Python 3.7 以来的顺序
我有一个包含键(日期时间类型)的字典:val(浮点类型)。字典按排序(按时间顺序)排列。
time_series_dict =
{"2019-02-27 14:00:00": 95,
"2019-02-27 15:00:00": 98,
"2019-02-27 16:25:00: 80,
.............
"2019-03-01 12:15:00": 85
}
我想修剪字典,删除start_date和end_date之外的所有内容。字典可以有 1000 个值。有没有比以下更快的方法:
for k in list(time_series_dict.keys()):
if not start_date <= k <= end_date:
del time_series_dict[k]
解决方案
如果您的字典有 1000 多个键,并且您要从有序时间戳序列的开头和结尾删除键,请考虑使用二进制搜索在键的列表副本中查找截止点。Python为此包含了bisect
模块:
from bisect import bisect_left, bisect_right
def trim_time_series_dict(tsd, start_date, end_date):
ts = list(tsd)
before = bisect_right(ts, start_date) # insertion point at > start_date
after = bisect_left(ts, end_date) # insertion point is < end_date
for i in range(before): # up to == start_date
del tsd[ts[i]]
for i in range(after + 1, len(ts)): # from >= end_date onwards
del tsd[ts[i]]
我已经进行了一些时间试验,看看这是否会对您的典型数据集产生影响;正如预期的那样,当删除的键数显着低于输入字典的长度时,它会得到回报。
计时设置(导入、构建测试数据字典以及开始和结束日期、定义测试功能)
>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
... date = datetime.now().replace(second=0, microsecond=0)
... while True:
... date += timedelta(minutes=random.randint(15, 360))
... yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None)) # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None)) # trim 25 from the end
>>> def iteration(t, start_date, end_date):
... time_series_dict = t.copy() # avoid mutating test data
... for k in list(time_series_dict.keys()):
... if not start_date <= k <= end_date:
... del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
... tsd = t.copy() # avoid mutating test data
... ts = list(tsd)
... before = bisect_right(ts, start_date) # insertion point at > start_date
... after = bisect_left(ts, end_date) # insertion point is < end_date
... for i in range(before): # up to == start_date
... del tsd[ts[i]]
... for i in range(after + 1, len(ts)): # from >= end_date onwards
... del tsd[ts[i]]
...
试验结果:
>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
... timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
... count, total = timer.autorange()
... print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
iteration: 671.33 microseconds
bisection: 80.92 microseconds
(测试先减去制作 dict 副本的基线成本)。
但是,对于此类操作,可能有更有效的数据结构。我检查了该sortedcontainers
项目,因为它包含一种直接支持对键进行二等分的SortedDict()
类型。不幸的是,虽然它比您的迭代方法表现更好,但我不能让它在这里表现得比在键列表的副本上二等分更好:
>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
... tsd = t.copy()
... # SortedDict supports slicing on the key view
... keys = tsd.keys()
... del keys[:tsd.bisect_right(start_date)]
... del keys[tsd.bisect_left(end_date) + 1:]
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds
但是,我可能错误地使用了该项目。从对象中删除键SortedDict
是 O(NlogN) 所以我怀疑这就是失败的地方。SortedDict()
从其他 9950 个键值对创建新对象仍然较慢(超过 2 毫秒,而不是您想与其他方法进行比较的时间)。
但是,如果您要使用该SortedDict.irange()
方法,您可以简单地忽略值,而不是删除它们,并遍历字典键的子集:
for ts in timeseries(start_date, end_date, inclusive=(False, False)):
# iterates over all start_date > timestamp > end_date keys, in order.
无需删除任何内容。该irange()
实现在引擎盖下使用二分法。
推荐阅读
- php - 如何比较不同定义键中相同值的多维数组
- python - 变量应该位于 if __name__ == "__main__": 块的内部还是外部?
- swift - 将 emoji 的 unicode 表示转换为实际的 emoji
- algorithm - 确定一个数是否为素数的平均情况(不是使用最优算法)
- laravel - 从 nginx /location 块(虚拟主机)服务 Laravel 应用程序
- google-cloud-platform - 谷歌计算引擎提高单线程性能
- python - 是否可以在 django 的同一虚拟环境下运行多个应用程序
- npm - 你如何更新 Babel-CLI 中的依赖项?
- javascript - fs.readdir 的范围或上下文问题
- autohotkey - AutoHotKey:MonitorCount 作为触发器?