python - Dealing with extremely long time-step sequences in an LSTM (NLP multi-label classification)
Problem description
This is my first question on Stack Overflow, so I apologise if the format is wrong. Suppose I am working with sequence data that has extremely long time steps (10,000,000), with 2701 samples and only one feature, so my input array has shape [2701, 10000000, 1],
and my dataset looks like:
[[ 2.81143e-01 4.98219e-01 -8.08500e-03 ... 1.00000e+02 1.00000e+02
1.00000e+02]
[ 1.95077e-01 2.20920e-02 -1.68663e-01 ... 1.00000e+02 1.00000e+02
1.00000e+02]
...
[ 1.06033e-01 8.96650e-02 -3.20860e-01 ... 1.00000e+02 1.00000e+02
1.00000e+02]
[ 6.85510e-02 -3.83653e-01 -2.19265e-01 ... 1.00000e+02 1.00000e+02
1.00000e+02]
[ 2.51404e-01 8.02280e-02 2.84610e-01 ... 1.00000e+02 1.00000e+02
1.00000e+02]]
However, from what I have read, LSTM networks usually perform better in the range of roughly 200-400 time steps, and even setting performance aside, I cannot successfully train on even a single sample of shape [1, 10000000, 1].
I believe the network itself is fine, because when I truncate every sample to length 1500, giving shape [2701, 1500, 1],
it finally stops getting stuck on the first epoch. Here is my code, in case it is needed:
from keras.utils import Sequence
import numpy as np
from numpy.lib.format import open_memmap
import gc
import platform
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Masking
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# TF 1.x GPU memory settings
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # "Best-fit with coalescing", simplified from a version of dlmalloc
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

stock_price = pd.read_csv(
    "C:/Users/user/Desktop/Final-Year-Project-master/stock_classification_7Days_test.csv",
    sep=',', dtype={"ID": "string", "Class": int})
print(stock_price)
print(platform.architecture())

x_data = []
y_data = pd.get_dummies(stock_price['Class'])  # one-hot labels

def embedded_reader(file_path):
    """Yield every comma-separated float in the file, one value at a time."""
    with open(file_path) as embedded_raw:
        for line in embedded_raw:
            for word in line.split(','):
                try:
                    yield float(word)
                except ValueError:
                    pass  # skip tokens that are not numbers
    gc.collect()

for y in range(len(stock_price)):
    if not pd.isna(stock_price.at[y, 'Class']):  # original `int(...) is not None` was always True
        i = stock_price.at[y, 'ID']
        print("Company code current: ", i)
        embedded_current = []
        try:
            gen = embedded_reader(
                "C:/Users/user/Desktop/Final-Year-Project-master/json_test/{}.jsonl".format(i))
            for val in gen:
                embedded_current.append(val)
        except OSError:
            pass  # no embedding file for this company
        # Note: after np.delete/np.concatenate this is an ordinary in-memory
        # array, not a memmap, so the memmap gives no memory saving here.
        fp = np.memmap('embedded_array.mymemmap', dtype=np.uint8, mode='w+', shape=(1,))
        fp = np.delete(fp, 0)
        fp = np.concatenate((fp, embedded_current), axis=0)
        # right-pad every sample to 10,000,000 steps with the mask value 100
        fp = np.pad(fp, (0, 10000000 - len(embedded_current)),
                    'constant', constant_values=(100, 100))
        print(fp)
        x_data.append(fp)
        print(np.shape(x_data))
        del fp
        print("embedded_data current: ", len(embedded_current))
        print("this is number {}".format(y))
        print("-----------------------------")
        gc.collect()

gc.collect()
print(len(x_data))
print(np.shape(x_data))
print("-" * 20)
print(np.shape(y_data))
print(np.size(y_data))

X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0)
X_train = np.array(X_train)
X_test = np.array(X_test)
# add the trailing feature dimension: (samples, timesteps) -> (samples, timesteps, 1)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
y_train = np.array(y_train)
y_test = np.array(y_test)
print(np.shape(X_train))
print(np.shape(X_test))
print(np.shape(y_train))

model = Sequential()
model.add(Masking(mask_value=100, input_shape=(10000000, 1)))
model.add(LSTM(units=1, return_sequences=True))
model.add(LSTM(units=1, return_sequences=False))
model.add(Dense(5, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, epochs=50, batch_size=4, verbose=1)
print(model.predict(X_test))
print("class label:", reverse_label(model.predict_classes(X_test)))  # reverse_label: own helper, defined elsewhere
scores = model.evaluate(X_test, y_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
model.save('my_model')
Some tutorials I read reshape the array, so I tried reshaping mine into something like [2701*25000, 10000000/25000, 1],
but then I ran into the problem that the number of x_data samples no longer matches the number of y_data samples. I also saw model.fit_generator mentioned,
but that seems to address having a huge number of samples, whereas in my case the model cannot even handle a single sample (I am new to neural networks, so I am not sure I understand it correctly). I am completely stuck; any help is much appreciated, thank you.
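For reference, the split-into-windows reshape described above can be sketched as below. This is only a sketch of one way to resolve the x/y sample-count mismatch: each sample's label is repeated once per window, so the counts stay aligned. The function name and the window length are made up for illustration, not taken from any tutorial.

```python
import numpy as np

def split_into_windows(x, y, window_len):
    """Split (samples, timesteps, 1) into (samples * n_windows, window_len, 1),
    repeating each sample's label once per window so x and y stay aligned."""
    n_samples, n_steps, n_feat = x.shape
    n_windows = n_steps // window_len          # drop any trailing remainder
    x = x[:, :n_windows * window_len, :]
    x = x.reshape(n_samples * n_windows, window_len, n_feat)
    y = np.repeat(y, n_windows, axis=0)        # one label copy per window
    return x, y

# tiny demo with made-up shapes: 2 samples, 12 steps, windows of 4
x = np.arange(24, dtype=float).reshape(2, 12, 1)
y = np.array([[1, 0], [0, 1]])
xw, yw = split_into_windows(x, y, window_len=4)
print(xw.shape, yw.shape)  # (6, 4, 1) (6, 2)
```

Whether repeating the label per window is actually valid depends on the task: it assumes every window of a sequence carries the same class signal, which may not hold for this data.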
EDIT: just to state my question clearly: "Is there any advice on handling such long inputs with an LSTM?"
Solution