I wrote a function that builds sequences for an LSTM/GRU sequence model, grouped by group ID. It does not produce the output I expect.


import numpy as np

def windowGeneratorByID(data, target, id_col_index, lookback, offset, batch_size=16):
  min_index = 0
  max_index = data.shape[0]-offset
  i = min_index + lookback
  while 1:
    if i + batch_size >= max_index:
      i = min_index + lookback
    rows = np.arange(i, min(i + batch_size, max_index))
    i += len(rows)
    samples = np.zeros((len(rows), lookback, data.shape[-1]))
    targets = np.zeros((len(rows), target.shape[-1]))

    for j, row in enumerate(rows):
      indices = range(rows[j] - lookback, rows[j])
      if data[rows[j] + offset][id_col_index] in set(data[indices][:, id_col_index]):
        if len(set(data[indices][:, id_col_index])) == 1:
            samples[j] = data[indices]
            targets[j] = target[rows[j] + offset]

    yield  np.delete(samples,id_col_index,axis=2) , targets


df=np.array([[1,1,0.1,11],[1,2,0.2,12], [1,3,0.3,13], [1,4,0.4,14], [2,5,0.5,15], [2,6,0.6,16], [2,7,0.7,17],[3,8,0.8,18],[3,9,0.9,19],[3,10,0.7,20]])


lookback = 2
batch_size = 2
offset = 0
windows = windowGeneratorByID(data=df, target=df[:,2:4],id_col_index=0 , offset=offset, lookback=lookback,batch_size=batch_size)

#The total number of batches is equal to (number of training examples - lookback - offset)/batch_size
no_batches = int((df.shape[0]-lookback-offset)/batch_size)

#print the batches
for i in range(no_batches):
  #get the next batch from the windowGenerator
  input, output = next(windows)
  print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))


Expected output:

1st batch: 
input is:
[[[ 1.   0.1 11. ]
  [ 2.   0.2 12. ]]

 [[ 2.   0.2 12. ]
  [ 3.   0.3 13. ]]]
target is:
[[ 0.3 13. ]
 [ 0.4 14. ]]

2nd batch: 
input is:
[[[ 5.   0.5 15. ]
  [ 6.   0.6 16. ]]

 [[ 8.   0.8 18. ]
  [ 9.   0.9 19. ]]]
target is:
[[ 0.7 17. ]
 [ 0.7 20. ]]

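Here are two approaches that can help with the problem you are trying to solve. The first is a generator approach like yours, which fetches one batch at a time. The second is a vectorized NumPy approach, which operates on the full data at once to get all the batches (it can also be applied to chunks of df rather than the full array).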


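  1. A chunk, for a given offset and lookback, is basically a set of rows that maps X to y. So, if I want lookback 2 and offset 1, I need 4 rows of the df: the first 2 go to X and the last one goes to y. Similarly, if I need lookback 1 and offset 0, I only need 2 rows: the first goes to X and the last goes to y.
  2. With this understanding, I can calculate the maximum number of chunks I can get from each group with a rolling window and store it in c (this is the array named w in the code below).
  3. Once I have that, I just need a function that iterates over the rows of df in a rolling fashion, takes as many rows as there are chunks, and then skips a few, because windows starting at those rows would contain elements from different groups. So, if I have [0,1,2,3,4,5,6], c = [2,1,1], and skip (aka lookback+offset) = 1, then I have to take 2, skip 1, take 1, skip 1, take 1, skip 1. So [0,1,3,5] are the indices I iterate over. Starting from each of these indices, I slice out a chunk of the required size.
  4. The rest is simple. Just set up a generator that pulls out these chunks and, for a batch size = n, pulls n chunks and stacks them before returning.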
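
The take/skip bookkeeping in steps 2 and 3 can be checked in isolation. The snippet below is a minimal sketch, not part of the answer's code: `take_and_skip` is a hypothetical helper name, and the group-id vector is assumed to match the id column of the example df. It reproduces the [0,1,3,5] toy case and then derives the chunk start rows for lookback 2, offset 0.

```python
import numpy as np

def take_and_skip(xs, runs, skip_size):
    """Yield run_size items from xs, then drop skip_size items, for each run."""
    it = iter(xs)
    for run_size in runs:
        for _ in range(run_size):
            yield next(it)
        for _ in range(skip_size):
            next(it, None)  # default value: a short tail is simply ignored

# Toy case from step 3: take 2, skip 1, take 1, skip 1, take 1, skip 1
toy_starts = list(take_and_skip(range(7), [2, 1, 1], 1))
print(toy_starts)  # [0, 1, 3, 5]

# Assumed id column of the example df (groups of 4, 3 and 3 rows)
ids = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])
lookback, offset = 2, 0
rows_per_chunk = lookback + offset + 1            # rows spanned by one chunk
_, counts = np.unique(ids, return_counts=True)    # rows per group
chunks_per_group = counts - rows_per_chunk + 1    # rolling windows per group
starts = list(take_and_skip(range(len(ids)), chunks_per_group, lookback + offset))
print(chunks_per_group.tolist(), starts)  # [2, 1, 1] [0, 1, 4, 7]
```

Each start row k then yields X from rows k to k+lookback-1 and y from row k+lookback+offset, which stays inside one group by construction.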

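import numpy as np

def take(xs, runs, skip_size):
    #yield run_size items from xs, then discard skip_size items, for each run
    ixs = iter(xs)
    for run_size in runs:
        for _ in range(run_size):
            yield next(ixs)
        for _ in range(skip_size):
            next(ixs, None)  #default avoids StopIteration inside a generator

def get_batch(df, target, lookback, offset, batch_size):
    #assumes the group id is in column 0 and rows are sorted by group id
    _ , c = np.unique(df[:,0], return_counts=True)
    rows = (lookback+offset+1)  #rows spanned by one chunk
    w = c-rows+1                #number of chunks available in each group
    itr = take(range(len(df)), w, lookback+offset)
    while 1:
        X, Y = [],[]
        for _ in range(batch_size):
            k = next(itr, None)
            if k is None:       #out of chunks: drop the incomplete batch and stop
                return
            X.append(df[k:lookback+k, 1:])
            Y.append(df[rows+k-1:rows+k, target])
        yield np.stack(X), np.stack(Y)

lookback = 2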
offset = 0
batch_size = 2
target = slice(2,4) #set the target as a slice instead of a separate df view

windows = get_batch(df, target, lookback, offset, batch_size)

no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)

for i in range(no_batches):
    X_batch, y_batch = next(windows)  #get the next batch from the generator
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, X_batch, y_batch))
#Lookback = 2, offset = 0, batch_size = 2 

1th batch: 
input is:
[[[ 1.   0.1 11. ]
  [ 2.   0.2 12. ]]

 [[ 2.   0.2 12. ]
  [ 3.   0.3 13. ]]]
target is:
[[[ 0.3 13. ]]

 [[ 0.4 14. ]]]

2th batch: 
input is:
[[[ 5.   0.5 15. ]
  [ 6.   0.6 16. ]]

 [[ 8.   0.8 18. ]
  [ 9.   0.9 19. ]]]
target is:
[[[ 0.7 17. ]]

 [[ 0.7 20. ]]]

Another example:

lookback = 1
offset = 1
batch_size = 1
target = slice(2,4) #set the target as a slice instead of a separate df view

windows = get_batch(df, target, lookback, offset, batch_size)

no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)

for i in range(no_batches):
    X_batch, y_batch = next(windows)  #get the next batch from the generator
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, X_batch, y_batch))
#Lookback = 1, offset = 1, batch_size = 1

1th batch: 
input is:
[[[ 1.   0.1 11. ]]]
target is:
[[[ 0.3 13. ]]]

2th batch: 
input is:
[[[ 2.   0.2 12. ]]]
target is:
[[[ 0.4 14. ]]]

3th batch: 
input is:
[[[ 5.   0.5 15. ]]]
target is:
[[[ 0.7 17. ]]]

4th batch: 
input is:
[[[ 8.   0.8 18. ]]]
target is:
[[[ 0.7 20. ]]]
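
Note that the targets printed above carry an extra length-1 axis, shape (batch_size, 1, n_targets), whereas the expected output in the question is 2-D. If a 2-D target is what the model needs (an assumption about the downstream use), the middle axis can be dropped after each batch is fetched. A minimal sketch, using the first target batch from the lookback 2 example as literal data:

```python
import numpy as np

# First target batch from the lookback=2, offset=0, batch_size=2 example
y_batch = np.array([[[0.3, 13.0]],
                    [[0.4, 14.0]]])

# Remove the length-1 middle axis: (2, 1, 2) -> (2, 2)
y_2d = np.squeeze(y_batch, axis=1)
print(y_batch.shape, y_2d.shape)  # (2, 1, 2) (2, 2)
```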

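Vectorized NumPy approach

If, instead of the generator approach, you can use a vectorized NumPy computation over all the data at once, I have written the following as well. If df is large, you can simply pass chunks of df to the function and get a set of batches for each chunk.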

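  1. Break the array into unequal-length groups based on id_column
  2. Get rolling windows over axis=0 using stride tricks
  3. Stack all the windows into a block
  4. Calculate the number of batches possible
  5. Keep only the number of windows that can be stacked into equal-sized batches
  6. Split the block by the number of batches and get X
  7. Split the block by the number of batches and get y
  8. Return all X, y as batches in a single array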

offset = 1

def window_split_2d(g, window):
    shp = (g.shape[0] - window + 1, window, g.shape[-1])
    strd = (g.strides[0], g.strides[0], g.strides[1])
    return np.lib.stride_tricks.as_strided(g, shape=shp, strides=strd)

def get_batches_vectorized(df, target, id_col_index, lookback, offset, batch_size):

    #Break array into unequal length groups based on id_column
    groups = np.split(df, np.where(np.diff(df[:,id_col_index]))[0]+1)
    #Get rolling windows over axis=0 using stride tricks
    chunks = [window_split_2d(g,lookback+offset+1) for g in groups if len(g) >= lookback+offset+1]  #skip groups too short for one window
    #Stack all the windows into a block
    block = np.vstack(chunks)
    #Calculate number of batches possible
    n_batches = block.shape[0]//batch_size
    #Keep only the number of blocks that can successfully be stacked into equal sized batches
    keep = block.shape[0]-(block.shape[0]%batch_size)
    block = block[:keep]
    #Split block by num batches and get X
    X = np.split(block[:,:lookback,1:], n_batches)

    #Split block by num batches and get y
    y = np.split(block[:,-1,target], n_batches)
    return X, y
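
As a usage sketch of the steps above, the snippet below rebuilds the same block of windows with `np.lib.stride_tricks.sliding_window_view` instead of manual stride arithmetic. This is a swapped-in technique, not the answer's own code: it assumes NumPy 1.20 or newer, `group_windows` is a hypothetical helper name, and the target columns are hard-coded as 2:4 to match the example df.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

df = np.array([[1, 1, 0.1, 11], [1, 2, 0.2, 12], [1, 3, 0.3, 13], [1, 4, 0.4, 14],
               [2, 5, 0.5, 15], [2, 6, 0.6, 16], [2, 7, 0.7, 17],
               [3, 8, 0.8, 18], [3, 9, 0.9, 19], [3, 10, 0.7, 20]])

def group_windows(arr, id_col_index, window):
    """Stack all rolling windows of `window` rows that stay inside one group."""
    # Split wherever the id column changes value (rows assumed grouped by id)
    groups = np.split(arr, np.where(np.diff(arr[:, id_col_index]))[0] + 1)
    # sliding_window_view appends the window axis last, so move it to axis 1
    wins = [sliding_window_view(g, window, axis=0).transpose(0, 2, 1)
            for g in groups if len(g) >= window]
    return np.vstack(wins)

lookback, offset, batch_size = 2, 0, 2
block = group_windows(df, id_col_index=0, window=lookback + offset + 1)

# Keep only as many windows as fit into equal-sized batches
n_batches = block.shape[0] // batch_size
block = block[:n_batches * batch_size]

X = np.split(block[:, :lookback, 1:], n_batches)  # inputs without the id column
y = np.split(block[:, -1, 2:4], n_batches)        # last row of each window

print(block.shape)             # (4, 3, 4)
print(X[0].shape, y[0].shape)  # (2, 2, 3) (2, 2)
```

With these settings the second batch holds the windows starting at rows 4 and 7, so `y[1]` contains the targets [0.7, 17] and [0.7, 20], the same pairing as the generator output.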
