首页 > 解决方案 > How to optimally compare if element is in a list of ranges?

问题描述

I have two 2D lists with one million+ entries. List1 is full of ranges (e.g. [100, 25] meaning the range starts at 100 and has a span of 25, or 100 to 125) and List2 is full of plot points. I have to count how many ranges each point fits into.

Since I'm dealing with such a large amount of data, I began with a binary search function to make searching for a starting point easier, rather than cycling through the whole list one by one. Except I've modified the search to return the closest index from List1 that is ABOVE my plot point, because the plot point is not guaranteed to be an element in List1.

I end up doing a comparison between each plot point in List2 and List1, starting at the index I got from the Binary Search, moving downwards in List1 until the plot point is no longer within a range. Although this works, this is very time consuming and inefficient.

def findRange(List1, List2):

    for plot in List2:
        count = 0
        startIndex = binarySearch(List1, plot)

        while plot <= List1[startIndex - 1][0] + 250:
            if List1[startIndex][0] < plot[0] < List1[startIndex][0] + List1[startIndex][1]:
                count += 1
            startIndex -= 1

        plot[1] = count
    return List2

I have the ... + 250 in the while loop because the range distance can be anywhere from 0 to 250, so I use that to determine whether or not further indices might contain the point.

I've looked at converting List1 into a dictionary where the key is the summed distance, and the value is the number of repeats found. I thought that by getting rid of repeating ranges, I could cut back on some of the overhead. I couldn't find a way to get the correct solution this way though.

Is there any suggestions you can direct me in to better optimize my algorithm? I can only use the standard library.

EDIT:

List1 = [
    [1233, 120],
    [1233, 80],
    [1490, 50],
    [1789, 220],
    [1800, 250]
]
 
List2 = [
    [1300, 0], 
    [1450, 0], 
    [1490, 0], 
    [2000, 0]
]

I am supposed to update the zeros in List2 with the count of how many ranges the plot point falls into. So an output would be something like this:

List2 = [
    [1300, 2], # this point can be found in List1[0] and List1[1] because 1300 is between 1233-1353 and 1233-1313
    [1450, 0], # this plot can be found in none of ranges
    [1490, 1], # this plot can be found in List1[2] 
    [2000, 2]  # this plot can be found in List1[2] and List1[3]
]

As you can see, the count index is updated in List2 from 0 to the number of ranges contains the point.

标签: pythonpython-3.xlistalgorithmperformance

解决方案


版本 1(二分查找)

您最好搜索List2元素List1而不是相反。假设两个列表都已排序,您总是会缩小搜索空间。此外,您将避免在假设范围大小上限时必须执行的潜在浪费操作。

附带说明一下,二进制搜索是在bisect模块中的 python 中实现的。我建议同时使用bisect_left(用于范围开始)和bisect_right范围结束。

from bisect import bisect_left, bisect_right

start = 0
for rnge in List1:
    start = bisect_left(List2, [rnge[0], 0], start)
    end = bisect_right(List2, [rnge[0] + rnge[1] - 1, 0], start)
    for plot_index in range(start, end):
        List2[plot_index][1] += 1

您的示例的结果是

>>> List2
[[1300, 2], [1450, 0], [1490, 1], [2000, 2]]

版本 2(一次性线性搜索)

话虽如此,一个更好的解决方案浮现在脑海中。您可以一次通过两个列表同时完成整个操作。实现这一点有两个想法。

首先,考虑如果您有两个可以查看的迭代器会发生什么。您始终可以检查哪个更大,哪个更小,并增加适当的一个,直到另一个的头部更大或更小。

其次,您可以使用 List1,将每个范围的开始和结束都转换为索引,并对所有这些索引进行排序。但是,您仍然必须跟踪哪些索引表示进入范围,哪些表示退出范围。事实上,您可以想象输入一个范围会增加应该分配给您在其中遇到的所有图的数字,而退出一个范围会减少该数字。

让我们从List1使用第二个想法进行转换开始。它将成为一个对列表,其中第一个元素是索引,第二个元素是 +1 或 -1,具体取决于索引是开始还是结束:

actions = []
for rnge in List1:
    actions.append([rnge[0], +1])
    actions.append([rnge[0] + rnge[1], -1])
actions.sort()

现在您可以沿着两个列表 (actionList2) 前进,跟踪从 的增量的运行总和action。如果元素 fromList2的键小于 from action,则将当前增量分配给 fromList2并继续。如果一个 fromaction有一个较小或相等的键,您可以通过更新增量来处理它(即,进入或退出一个范围):

end = (List2[-1] + 1, 0) # Key to ensure that increment stops changing once actions runs out
it = iter(actions)
key, inc = next(it, end)
increment = 0
for plot in List2: # Keep plot as a list, since it needs to be incremented
    while key <= plot[0]:
        increment += inc
        key, inc = next(it, end)
    plot[1] += increment

虽然这种方法有两个部分,但请注意它只进行两次线性传递List1和一次传递List2。没有实际的搜索,只有比较。该调用next(it, end)确保所有的List2都得到完全处理,无论如何不增加iterator一次actions已经用完。

和以前一样,结果是

>>> List2
[[1300, 2], [1450, 0], [1490, 1], [2000, 2]]

推荐阅读