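python - Create sequences by group ID using NumPy arrays
Problem description
I wrote a function to build sequences for an LSTM/GRU sequence model based on group ID. I am not getting the expected output.
Python function: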
import numpy as np

def windowGeneratorByID(data, target, id_col_index, lookback, offset, batch_size=16):
    min_index = 0
    max_index = data.shape[0] - offset
    i = min_index + lookback
    while 1:
        if i + batch_size >= max_index:
            i = min_index + lookback
        rows = np.arange(i, min(i + batch_size, max_index))
        i += len(rows)
        samples = np.zeros((len(rows), lookback, data.shape[-1]))
        targets = np.zeros((len(rows), target.shape[-1]))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j])
            if data[rows[j] + offset][id_col_index] in set(data[indices][:, id_col_index]):
                if len(set(data[indices][:, id_col_index])) == 1:
                    samples[j] = data[indices]
                    targets[j] = target[rows[j] + offset]
        yield np.delete(samples, id_col_index, axis=2), targets
Input:
df=np.array([[1,1,0.1,11],[1,2,0.2,12], [1,3,0.3,13], [1,4,0.4,14], [2,5,0.5,15], [2,6,0.6,16], [2,7,0.7,17],[3,8,0.8,18],[3,9,0.9,19],[3,10,0.7,20]])
Code that produces the output:
lookback = 2
batch_size = 2
offset = 0
windows = windowGeneratorByID(data=df, target=df[:,2:4], id_col_index=0, offset=offset, lookback=lookback, batch_size=batch_size)
# The total number of batches is (training examples - lookback - offset) / batch_size
no_batches = int((df.shape[0] - lookback - offset) / batch_size)
# print the batches
for i in range(no_batches):
    # get the next batch from the window generator
    input, output = next(windows)
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))
Expected output:
1th batch:
input is:
[[[ 1. 0.1 11. ]
[ 2. 0.2 12. ]]
[[ 2. 0.2 12. ]
[ 3. 0.3 13. ]]]
and
target is:
[[ 0.3 13. ]
[ 0.4 14. ]]
2nd batch:
input is:
[[[ 5. 0.5 15. ]
[ 6. 0.6 16. ]]
[[ 8. 0.8 18. ]
[ 9. 0.9 19. ]]]
and
target is:
[[ 0.7 17. ]
[ 0.7 20. ]]
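Solution
Here are two approaches to the problem you are trying to solve. The first is a generator approach like yours, which returns one batch at a time. The second is a vectorized NumPy approach that works on the full data at once to produce all the batches (it can also be applied to chunks of df instead of the whole array).
Generator approach
- A chunk, for a given offset and lookback, is basically a set of X-to-y rows. So if I want lookback 2, offset 1, I need 4 rows of df: the first 2 go to X and the last one goes to y. Similarly, if I need lookback 1, offset 0, I only need 2 rows: the first goes to X and the last goes to y.
- With this understanding, I can calculate the maximum number of chunks I can get from each group with a rolling window and store it in c (the code below computes this per-group count as w).
- Once I have this, I just need a function that iterates over the rows of df in a rolling fashion, takes that many chunk start positions, and then skips a few, because chunks starting at the skipped rows would contain elements from different groups. So if I have [0,1,2,3,4,5,6], c = [2,1,1], and skip (aka lookback+offset) = 1, then I have to take 2, skip 1, take 1, skip 1, take 1, skip 1. So [0,1,3,5] is what I iterate over. Each chunk starts at one of these indexes and spans the chunk size computed above.
- The rest is simple. Set up a generator that pulls these chunks and, for batch size = n, pulls n chunks and stacks them before returning.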
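The take-and-skip bookkeeping described above can also be checked directly. The following is only a minimal sketch, not part of the original answer: window_starts is a hypothetical helper name, and it assumes that the rows of each group are contiguous and that the group IDs appear in ascending order.

```python
import numpy as np

def window_starts(ids, lookback, offset):
    """Start row of every chunk that stays inside a single group.

    Hypothetical helper for illustration; assumes each group's rows are
    contiguous and group IDs are in ascending order.
    """
    ids = np.asarray(ids)
    rows = lookback + offset + 1  # rows spanned by one chunk
    # first row index and row count of each group
    _, first, counts = np.unique(ids, return_index=True, return_counts=True)
    # a group with c rows yields c - rows + 1 chunks (none if it is too short)
    starts = [np.arange(f, f + c - rows + 1) for f, c in zip(first, counts)]
    return np.concatenate(starts)

# The worked example from the text: 7 rows, runs of [2, 1, 1], skip 1
print(window_starts([0, 0, 0, 1, 1, 2, 2], lookback=1, offset=0))  # [0 1 3 5]

# The ID column of df with lookback 2, offset 0
print(window_starts([1, 1, 1, 1, 2, 2, 2, 3, 3, 3], lookback=2, offset=0))  # [0 1 4 7]
```

The returned indexes are the values of k that the generator iterates over, so they are a quick way to see which rows start a valid chunk before building any batches.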
import numpy as np

df = np.array([[1,1,0.1,11],
               [1,2,0.2,12],
               [1,3,0.3,13],
               [1,4,0.4,14],
               [2,5,0.5,15],
               [2,6,0.6,16],
               [2,7,0.7,17],
               [3,8,0.8,18],
               [3,9,0.9,19],
               [3,10,0.7,20]])

def take(xs, runs, skip_size):
    'https://stackoverflow.com/questions/65163947/iterate-over-a-list-based-on-list-with-set-of-iteration-steps'
    ixs = iter(xs)
    for run_size in runs:
        # yield run_size consecutive items, then discard skip_size items
        for _ in range(run_size):
            yield next(ixs)
        for _ in range(skip_size):
            next(ixs)

def get_batch(df, target, lookback, offset, batch_size):
    # c: number of rows in each group (column 0 holds the group ID)
    _, c = np.unique(df[:,0], return_counts=True)
    rows = lookback + offset + 1   # rows spanned by one chunk
    w = c - rows + 1               # rolling windows available per group
    itr = take(range(len(df)), w, lookback + offset)
    while 1:
        X, Y = [], []
        for _ in range(batch_size):
            k = next(itr, None)
            if k is None:
                return             # out of batches: an incomplete last batch is dropped
            x = df[k:lookback+k, 1:]           # lookback rows, without the ID column
            y = df[rows+k-1:rows+k, target]    # the single target row of the chunk
            X.append(x)
            Y.append(y)
        yield np.stack(X), np.stack(Y)
lookback = 2
offset = 0
batch_size = 2
target = slice(2,4) # set the target as a slice instead of a separate df view
windows = get_batch(df, target, lookback, offset, batch_size)
no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)
for i in range(no_batches):
    input, output = next(windows)
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))
#Lookback = 2, offset = 0, batch_size = 2
1th batch:
input is:
[[[ 1. 0.1 11. ]
[ 2. 0.2 12. ]]
[[ 2. 0.2 12. ]
[ 3. 0.3 13. ]]]
and
target is:
[[[ 0.3 13. ]]
[[ 0.4 14. ]]]
2th batch:
input is:
[[[ 5. 0.5 15. ]
[ 6. 0.6 16. ]]
[[ 8. 0.8 18. ]
[ 9. 0.9 19. ]]]
and
target is:
[[[ 0.7 17. ]]
[[ 0.7 20. ]]]
Another example:
lookback = 1
offset = 1
batch_size = 1
target = slice(2,4) # set the target as a slice instead of a separate df view
windows = get_batch(df, target, lookback, offset, batch_size)
no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)
for i in range(no_batches):
    input, output = next(windows)
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))
#Lookback = 1, offset = 1, batch_size = 1
1th batch:
input is:
[[[ 1. 0.1 11. ]]]
and
target is:
[[[ 0.3 13. ]]]
2th batch:
input is:
[[[ 2. 0.2 12. ]]]
and
target is:
[[[ 0.4 14. ]]]
3th batch:
input is:
[[[ 5. 0.5 15. ]]]
and
target is:
[[[ 0.7 17. ]]]
4th batch:
input is:
[[[ 8. 0.8 18. ]]]
and
target is:
[[[ 0.7 20. ]]]
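Vectorized NumPy approach
However, if you can use vectorized NumPy computation on all the data at once instead of a generator, I wrote the following as well. If df is large, you can simply pass chunks of df to the function and get a set of batches for each chunk.
- Break the array into unequal-length groups based on the id column
- Get rolling windows over axis=0 using stride tricks
- Stack all the windows into a block
- Calculate the number of batches possible
- Keep only the number of windows that can be stacked into equal-sized batches
- Split the block by the number of batches and get X
- Split the block by the number of batches and get y
- Return all X, y as lists of batches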
import numpy as np

df = np.array([[1,1,0.1,11],
               [1,2,0.2,12],
               [1,3,0.3,13],
               [1,4,0.4,14],
               [2,5,0.5,15],
               [2,6,0.6,16],
               [2,7,0.7,17],
               [3,8,0.8,18],
               [3,9,0.9,19],
               [3,10,0.7,20]])
lookback = 1
batch_size = 2
offset = 1

def window_split_2d(g, window):
    # Rolling windows over axis=0 of a 2D array, returned as a strided view
    shp = (g.shape[0] - window + 1, window, g.shape[-1])
    strd = (g.strides[0], g.strides[0], g.strides[1])
    return np.lib.stride_tricks.as_strided(g, shape=shp, strides=strd)

def get_batches_vectorized(df, target, id_col_index, lookback, offset, batch_size):
    # Break array into unequal length groups based on id_column
    groups = np.split(df, np.where(np.diff(df[:,id_col_index]))[0]+1)
    # Get rolling windows over axis=0 using stride tricks
    chunks = [window_split_2d(i, lookback+offset+1) for i in groups]
    # Stack all the windows into a block
    block = np.vstack(chunks)
    # Calculate number of batches possible
    n_batches = block.shape[0]//batch_size
    # Keep only the number of windows that can be stacked into equal sized batches
    keep = block.shape[0]-(block.shape[0]%batch_size)
    block = block[:keep]
    # Split block by num batches and get X (column 0, the ID, is dropped)
    X = np.split(block[:,:lookback,1:], n_batches)
    # Split block by num batches and get y
    y = np.split(block[:,-1,target], n_batches)
    return X, y
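As a point of comparison, the rolling-window step can also be written with numpy.lib.stride_tricks.sliding_window_view, which is available in NumPy 1.20 and later and does the shape and stride arithmetic itself. This is a sketch of a swapped-in technique rather than the code from the answer: group_windows is a hypothetical name, and it assumes each group's rows are contiguous and that at least one group is long enough to hold a full window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def group_windows(data, id_col_index, window):
    """All row windows of length `window` that stay inside one group.

    Hypothetical helper; assumes contiguous groups and NumPy >= 1.20.
    """
    # split wherever the ID column changes value
    groups = np.split(data, np.where(np.diff(data[:, id_col_index]))[0] + 1)
    out = []
    for g in groups:
        if len(g) >= window:  # skip groups too short for a single window
            # sliding_window_view puts the window axis last: (n, cols, window)
            out.append(sliding_window_view(g, window, axis=0).transpose(0, 2, 1))
    return np.concatenate(out)

df = np.array([[1,1,0.1,11], [1,2,0.2,12], [1,3,0.3,13], [1,4,0.4,14],
               [2,5,0.5,15], [2,6,0.6,16], [2,7,0.7,17],
               [3,8,0.8,18], [3,9,0.9,19], [3,10,0.7,20]])
lookback, offset = 1, 1
block = group_windows(df, 0, lookback + offset + 1)
X = block[:, :lookback, 1:]   # lookback rows, ID column dropped
y = block[:, -1, 2:4]         # target columns of the last row of each window
print(block.shape, X.shape, y.shape)  # (4, 3, 4) (4, 1, 3) (4, 2)
```

From here X and y can be cut into equal-sized batches exactly as in get_batches_vectorized. Skipping short groups explicitly also avoids asking for a window that is longer than the group itself.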