首页 > 解决方案 > 具有相互依赖值的矩阵中的矢量化计算

问题描述

我正在以多个时间分辨率跟踪多个离散时间序列,从而得到一个 SxRxB 矩阵,其中 S 是时间序列的数量,R 是不同分辨率的数量,B 是缓冲区,即每个序列记住多少个值。每个系列都是离散的,并使用有限范围的自然数来表示其值。我将在这里称这些“符号”。

对于每个系列,我想计算在所有测量中任何先前测量的符号直接在当前测量的任何符号之前的频率。我已经通过如下所示的 for 循环解决了这个问题,但出于明显的原因想对其进行矢量化。

我不确定我构建数据的方式是否有效,所以我愿意接受建议。我认为特别是比率矩阵可以做不同的事情。

提前致谢!

def supports_loop(data, num_series, resolutions, buffer_size, vocab_size):
    # For small test matrices we can calculate the complete matrix without problems
    indices = []
    indices.append(xrange(num_series))
    indices.append(xrange(vocab_size))
    indices.append(xrange(num_series))
    indices.append(xrange(vocab_size))
    indices.append(xrange(resolutions))

    # This is huge! :/
    # dimensions:
    #   series and value for which we calculate,
    #   series and value which precedes that measurement,
    #   resolution
    ratios = np.full((num_series, vocab_size, num_series, vocab_size, resolutions), 0.0)

    for idx in itertools.product(*indices):
        s0, v0 = idx[0],idx[1]  # the series and symbol for which we calculate
        s1, v1 = idx[2],idx[3]  # the series and symbol which should precede the we're calculating for
        res = idx[4]

        # Find the positions where s0==v0
        found0 = np.where(data[s0, res, :] == v0)[0]
        if found0.size == 0:
            continue
        #print('found {}={} at {}'.format(s0, v0, found0))

        # Check how often s1==v1 right before s0==v0
        candidates = (s1, res, (found0 - 1 + buffer_size) % buffer_size)
        found01 = np.count_nonzero(data[candidates] == v1)
        if found01 == 0:
            continue

        print('found {}={} following {}={} at {}'.format(s0, v0, s1, v1, found01))
        # total01 = number of positions where either s0 or s1 is defined (i.e. >=0)
        total01 = len(np.argwhere((data[s0, res, :] >= 0) & (data[s1, res, :] >= 0)))
        ratio = (float(found01) / total01) if total01 > 0 else 0.0
        ratios[idx] = ratio

    return ratios


def stackoverflow_example(fnc):
    data = np.array([
        [[0, 0, 1],  # series 0, resolution 0
         [1, 3, 2]], # series 0, resolution 1

        [[2, 1, 2],  # series 1, resolution 0
         [3, 3, 3]], # series 1, resoltuion 1
    ])

    num_series = data.shape[0]
    resolutions = data.shape[1]
    buffer_size = data.shape[2]
    vocab_size = np.max(data)+1

    ratios = fnc(data, num_series, resolutions, buffer_size, vocab_size)
    coordinates = np.argwhere(ratios > 0.0)
    nz_values = ratios[ratios > 0.0]
    print(np.hstack((coordinates, nz_values[:,None])))
    print('0/0 precedes 0/0 in 1 out of 3 cases: {}'.format(np.isclose(ratios[0,0,0,0,0], 1.0/3.0)))
    print('1/2 precedes 0/0 in 2 out of 3 cases: {}'.format(np.isclose(ratios[0,0,1,2,0], 2.0/3.0)))

预期输出(21 对,5 列坐标,然后是找到的计数):

[[0 0 0 0 0 1]
 [0 0 0 1 0 1]
 [0 0 1 2 0 2]
 [0 1 0 0 0 1]
 [0 1 0 2 1 1]
 [0 1 1 1 0 1]
 [0 1 1 3 1 1]
 [0 2 0 3 1 1]
 [0 2 1 3 1 1]
 [0 3 0 1 1 1]
 [0 3 1 3 1 1]
 [1 1 0 0 0 1]
 [1 1 1 2 0 1]
 [1 2 0 0 0 1]
 [1 2 0 1 0 1]
 [1 2 1 1 0 1]
 [1 2 1 2 0 1]
 [1 3 0 1 1 1]
 [1 3 0 2 1 1]
 [1 3 0 3 1 1]
 [1 3 1 3 1 3]]

在上面的示例中,系列 0 中的 0 在三分之二的情况下跟随系列 1 中的 2(因为缓冲区是循环的),因此 [0, 0, 1, 2, 0] 处的比率约为 0.6666。同样是系列 0,值 0 在三分之一的情况下跟随其自身,因此 [0, 0, 0, 0, 0] 处的比率将为 ~0.3333。还有一些其他的> 0.0。


我正在两个数据集上测试每个答案:一个很小的数据集(如上所示)和一个更现实的数据集(100 个系列,5 个分辨率,每个系列 10 个值,50 个符号)。

结果

Answer        Time (tiny)     Time (huge)     All pairs found (tiny=21)
-----------------------------------------------------------------------
Baseline      ~1ms            ~675s (!)       Yes
Saedeas       ~0.13ms         ~1.4ms          No (!)
Saedeas2      ~0.20ms         ~4.0ms          Yes, +cross resolutions
Elliot_1      ~0.70ms         ~100s (!)       Yes
Elliot_2      ~1ms            ~21s (!)        Yes
Kuppern_1     ~0.39ms         ~2.4s (!)       Yes
Kuppern_2     ~0.18ms         ~28ms           Yes
Kuppern_3     ~0.19ms         ~24ms           Yes
David         ~0.21ms         ~27ms           Yes

Saedeas 2nd 方法是明显的赢家!非常感谢你们,大家:)

标签: pythonnumpyvectorizationmarket-basket-analysis

解决方案


如果我正确理解了您的问题,我认为这段代码将以相对快速的矢量化方式为您提供所需的符号对。

import numpy as np
import time
from collections import Counter

series = 2
resolutions = 2
buffer_len = 3
symbols = range(3)

#mat = np.random.choice(symbols, size=(series, resolutions, buffer_len)).astype('uint8')

mat = np.array([
        [[0, 0, 1],  # series 0, resolution 0
         [1, 3, 2]],  # series 0, resolution 1
        [[2, 1, 2],  # series 1, resolution 0
         [3, 3, 3]],  # series 1, resoltuion 1
    ])

start = time.time()
index_mat = np.indices(mat.shape)

right_shift_indices = np.roll(index_mat, -1, axis=3)
mat_shifted = mat[right_shift_indices[0], right_shift_indices[1], right_shift_indices[2]]

# These construct all the pairs directly
first_series = np.repeat(range(series), series*resolutions*buffer_len)
second_series = np.tile(np.repeat(range(series), resolutions*buffer_len), series)
res_loop = np.tile(np.repeat(range(resolutions), buffer_len), series*series)
mat_unroll = np.repeat(mat, series, axis=0)
shift_unroll = np.tile(mat_shifted, series)

# Constructs the pairs
pairs = zip(np.ravel(first_series),
            np.ravel(second_series),
            np.ravel(res_loop),
            np.ravel(mat_unroll),
            np.ravel(shift_unroll))

pair_time = time.time() - start
results = Counter(pairs)
end = time.time() - start

print("Mat: {}").format(mat)
print("Pairs: {}").format(results)
print("Number of Pairs: {}".format(len(pairs)))
print("Pair time is: {}".format(pair_time))
print("Count time is: {}".format(end-pair_time))
print("Total time is: {}".format(end))

基本思想是根据缓冲区的时间序列将每个缓冲区循环移动适当的数量(我认为这就是您当前的代码正在做的事情)。然后,我可以通过简单地将沿系列轴偏移 1 的列表压缩在一起来生成所有符号对。

示例输出:

Mat: [[[0 0 1]
  [1 3 2]]

 [[2 1 2]
  [3 3 3]]]
Pairs: Counter({(1, 1, 1, 3, 3): 3, (1, 0, 0, 2, 0): 2, (0, 0, 0, 0, 0): 1, (1, 1, 0, 2, 2): 1, (1, 1, 0, 2, 1): 1, (0, 1, 0, 0, 2): 1, (1, 0, 1, 3, 3): 1, (0, 0, 1, 1, 3): 1, (0, 0, 1, 3, 2): 1, (1, 0, 0, 1, 1): 1, (0, 1, 0, 0, 1): 1, (0, 1, 1, 2, 3): 1, (0, 1, 0, 1, 2): 1, (1, 1, 0, 1, 2): 1, (0, 1, 1, 3, 3): 1, (1, 0, 1, 3, 2): 1, (0, 0, 0, 0, 1): 1, (0, 1, 1, 1, 3): 1, (0, 0, 1, 2, 1): 1, (0, 0, 0, 1, 0): 1, (1, 0, 1, 3, 1): 1})
Number of Pairs: 24
Pair time is: 0.000135183334351
Count time is: 5.10215759277e-05
Total time is: 0.000186204910278

编辑:真正的最后尝试。完全矢量化。


推荐阅读