Handling extremely long time-step sequences in an LSTM (NLP multi-label classification)

Problem Description

This is my first time asking a question on Stack Overflow, so I apologize if the format is wrong. Suppose I am working with some very long time-step sequence data (10,000,000 steps), with 2,701 samples and only one feature, so my input array is [2701, 10000000, 1]. My dataset looks like:

 [ 2.81143e-01  4.98219e-01 -8.08500e-03 ...  1.00000e+02  1.00000e+02
   1.00000e+02]
 [ 1.95077e-01  2.20920e-02 -1.68663e-01 ...  1.00000e+02  1.00000e+02
   1.00000e+02]
 ...
 [ 1.06033e-01  8.96650e-02 -3.20860e-01 ...  1.00000e+02  1.00000e+02
   1.00000e+02]
 [ 6.85510e-02 -3.83653e-01 -2.19265e-01 ...  1.00000e+02  1.00000e+02
   1.00000e+02]
 [ 2.51404e-01  8.02280e-02  2.84610e-01 ...  1.00000e+02  1.00000e+02
   1.00000e+02]]

However, from what I have read, LSTM networks usually perform best in the range of roughly 200–400 time steps. Even setting performance aside, I cannot successfully train on even a single sample of shape [1, 10000000, 1]. I believe the network itself is fine, because when I limited each sample to 1,500 steps ([2701, 1500, 1]) it finally stopped getting stuck on the first epoch. My code is below, in case it is needed:

import gc
import platform
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Masking
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session  # Keras on TF 1.x

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC' #A "Best-fit with coalescing" algorithm, simplified from a version of dlmalloc.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config)) 

stock_price=pd.read_csv("C:/Users/user/Desktop/Final-Year-Project-master/stock_classification_7Days_test.csv",sep=',', dtype={"ID":"string","Class":int})
print(stock_price)


print (platform.architecture()) 

y_data=[]
x_data=[]

y_data=pd.get_dummies(stock_price['Class'])


def embedded_reader(file_path):
    # Stream comma-separated floats from the file one value at a time.
    with open(file_path) as embedded_raw:
        for line in embedded_raw:
            for word in line.split(','):
                try:
                    yield float(word)
                except ValueError:
                    pass  # skip tokens that are not numbers
    # the "with" block already closes the file; no explicit close() is needed
    gc.collect()
        

for y in range(len(stock_price)):
    if pd.notna(stock_price.at[y, 'Class']):  # int(...) is never None; test for NaN instead
        i = stock_price.at[y, 'ID']
        print("Company code current: ", i)
        embedded_current = []

        try:
            gen = embedded_reader("C:/Users/user/Desktop/Final-Year-Project-master/json_test/{}.jsonl".format(i))
            # the generator ends on its own; no need to catch StopIteration
            for val in gen:
                embedded_current.append(val)
        except OSError:
            pass  # file for this ID is missing

                    
        # np.delete/np.concatenate copy a memmap into an ordinary in-memory
        # array anyway, so build the padded row directly instead
        # (100 is the pad value that the Masking layer will skip):
        fp = np.full(10000000, 100.0, dtype=np.float32)
        fp[:len(embedded_current)] = embedded_current
        print(fp)
        x_data.append(fp)
        print(np.shape(x_data))
        del fp

        print("embedded_data current: ",len(embedded_current))

        print("this is number {}".format(y))
        print("-----------------------------")
        gc.collect()
        

    gc.collect()        


                  
print(len(x_data))
print(np.shape(x_data))
print("-"*20)
print(np.shape(y_data))
print(np.size(y_data))

X_train, X_test, y_train, y_test = train_test_split(x_data,y_data,test_size=0.2,random_state=0)
print(np.shape(X_train))
print(np.shape(X_test))
X_train=np.array(X_train)
X_test=np.array(X_test)
print(np.shape(X_train))
print(np.shape(X_test))
print(X_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
print(np.shape(X_train))
print(np.shape(X_test))
y_train=np.array(y_train)
y_test=np.array(y_test)

print(len(X_test[0]))
print(np.shape(y_train))


model=Sequential()

model.add(Masking(mask_value=100, input_shape=(10000000,1)))
model.add(LSTM(units=1, return_sequences=True))  # input shape is inherited from the Masking layer
model.add(LSTM(units=1,return_sequences=False))
model.add(Dense(5,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])


model.summary()
model.fit(X_train,y_train,epochs=50,batch_size=4,verbose=1)
print(model.predict(X_test))
print("class label:", reverse_label(model.predict_classes(X_test)))  # reverse_label is a helper defined elsewhere
scores = model.evaluate(X_test, y_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))



model.save('my_model')

Some tutorials I read mentioned reshaping the array, so I tried reshaping mine into something like [2701*25000, 10000000/25000, 1], but then I ran into the problem that the number of x_data samples no longer matched the number of y_data samples. I also saw model.fit_generator mentioned, but that seems to address having a huge number of samples, whereas in my case the model cannot even handle a single sample (I am new to neural networks, so I may have misunderstood). I am completely stuck, so any help is much appreciated, thank you.
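For what it's worth, the x/y mismatch after that kind of reshape comes from splitting each series into windows without expanding the labels. A minimal numpy-only sketch (the window size and the helper `make_windows` are my own, not from the question): repeat each sample's label once per window so x and y stay aligned.

```python
import numpy as np

def make_windows(x, y, window=1000):
    """Split each (steps, features) series into fixed-size windows and
    repeat the per-sample label for every window it produces."""
    n_samples, n_steps, n_feat = x.shape
    n_windows = n_steps // window  # drop any trailing remainder
    x_w = x[:, :n_windows * window, :].reshape(n_samples * n_windows, window, n_feat)
    y_w = np.repeat(np.asarray(y), n_windows, axis=0)
    return x_w, y_w

# toy shapes standing in for [2701, 10000000, 1]
x = np.zeros((2, 10_000, 1), dtype=np.float32)
y = np.array([[1, 0], [0, 1]])
x_w, y_w = make_windows(x, y, window=1000)
print(x_w.shape, y_w.shape)  # (20, 1000, 1) (20, 2)
```

Note that every window then carries the whole-sample label, which is a modeling assumption in its own right.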

Edit: Just to state my question clearly: "Any suggestions on handling such long inputs with an LSTM?"

Tags: python, keras, deep-learning, nlp, lstm

Solution
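One common workaround for sequences this long (a sketch of my own, not an answer from the page): downsample each series to a tractable length before the LSTM ever sees it, for example by average-pooling fixed-size blocks. A 10,000,000-step series pooled in blocks of 10,000 becomes a 1,000-step series, well within the range where LSTMs train reliably. The helper name and block size below are assumptions for illustration.

```python
import numpy as np

def downsample(x, block=10_000):
    """Average-pool a (samples, steps, features) array over fixed-size
    blocks of time steps, shortening the sequence by a factor of `block`."""
    n_samples, n_steps, n_feat = x.shape
    n_blocks = n_steps // block  # drop any trailing remainder
    trimmed = x[:, :n_blocks * block, :]
    return trimmed.reshape(n_samples, n_blocks, block, n_feat).mean(axis=2)

x = np.random.rand(4, 100_000, 1).astype(np.float32)
print(downsample(x, block=10_000).shape)  # (4, 10, 1)
```

Pooling throws away fine-grained detail, so whether this is acceptable depends on whether the classification signal survives at the coarser resolution; alternatives in the same spirit are strided subsampling or summary statistics per block.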

