ResourceExhausted or OOM error when running an LSTM in TensorFlow

Problem Description

I am training my LSTM network in TensorFlow with the following code:

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from scipy import stats
import tensorflow as tf
import seaborn as sns
from pylab import rcParams
from sklearn import metrics
from sklearn.model_selection import train_test_split

%matplotlib inline

sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

RANDOM_SEED = 42

columns = ['user','activity','timestamp', 'x-axis', 'y-axis', 'z-axis']
df = pd.read_csv('data/WISDM_ar_v1.1_raw.txt', header = None, names = columns)
df = df.dropna()

df.head()

df.info()

##df['activity'].value_counts().plot(kind='bar', title='Training examples by activity type');
##df['user'].value_counts().plot(kind='bar', title='Training examples by user');

def plot_activity(activity, df):
    data = df[df['activity'] == activity][['x-axis', 'y-axis', 'z-axis']][:200]
    axis = data.plot(subplots=True, figsize=(16, 12), 
                     title=activity)
    for ax in axis:
        ax.legend(loc='lower left', bbox_to_anchor=(1.0, 0.5))


##plot_activity("Sitting", df)
##plot_activity("Standing", df)
##plot_activity("Walking", df)
##plot_activity("Jogging", df)


N_TIME_STEPS = 200
N_FEATURES = 3
step = 20
segments = []
labels = []
for i in range(0, len(df) - N_TIME_STEPS, step):
    xs = df['x-axis'].values[i: i + N_TIME_STEPS]
    ys = df['y-axis'].values[i: i + N_TIME_STEPS]
    zs = df['z-axis'].values[i: i + N_TIME_STEPS]
    label = stats.mode(df['activity'][i: i + N_TIME_STEPS])[0][0]
    segments.append([xs, ys, zs])
    labels.append(label)

np.array(segments).shape

reshaped_segments = np.asarray(segments, dtype= np.float32).reshape(-1, N_TIME_STEPS, N_FEATURES)
labels = np.asarray(pd.get_dummies(labels), dtype = np.float32)

reshaped_segments.shape
labels[0]

X_train, X_test, y_train, y_test = train_test_split(
        reshaped_segments, labels, test_size=0.2, random_state=RANDOM_SEED)

len(X_train)
len(X_test)

N_CLASSES = 6
N_HIDDEN_UNITS = 64


def create_LSTM_model(inputs):
    W = {
        'hidden': tf.Variable(tf.random_normal([N_FEATURES, N_HIDDEN_UNITS])),
        'output': tf.Variable(tf.random_normal([N_HIDDEN_UNITS, N_CLASSES]))
    }
    biases = {
        'hidden': tf.Variable(tf.random_normal([N_HIDDEN_UNITS], mean=1.0)),
        'output': tf.Variable(tf.random_normal([N_CLASSES]))
    }

    X = tf.transpose(inputs, [1, 0, 2])
    X = tf.reshape(X, [-1, N_FEATURES])
    hidden = tf.nn.relu(tf.matmul(X, W['hidden']) + biases['hidden'])
    hidden = tf.split(hidden, N_TIME_STEPS, 0)

    # Stack 2 LSTM layers
    lstm_layers = [tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS, forget_bias=1.0) for _ in range(2)]
    lstm_layers = tf.contrib.rnn.MultiRNNCell(lstm_layers)

    outputs, _ = tf.contrib.rnn.static_rnn(lstm_layers, hidden, dtype=tf.float32)

    # Get output for the last time step
    lstm_last_output = outputs[-1]

    return tf.matmul(lstm_last_output, W['output']) + biases['output']


tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_FEATURES], name="input")
Y = tf.placeholder(tf.float32, [None, N_CLASSES])


pred_Y = create_LSTM_model(X)

pred_softmax = tf.nn.softmax(pred_Y, name="y_")

loss = -tf.reduce_sum(Y * tf.log(pred_softmax))
LEARNING_RATE = 0.0025  # not defined anywhere in the posted snippet; assumed here so the code runs
optimizer = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss)

correct_prediction = tf.equal(tf.argmax(pred_softmax,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

cost_history = np.empty(shape=[1],dtype=float)
saver = tf.train.Saver()

session = tf.Session()
session.run(tf.global_variables_initializer())

batch_size = 10
total_batches = X_train.shape[0] // batch_size


for epoch in range(8):
        for b in range(total_batches):    
            offset = (b * batch_size) % (y_train.shape[0] - batch_size)
            batch_x = X_train[offset:(offset + batch_size), :]
            batch_y = y_train[offset:(offset + batch_size), :]
            _, c = session.run([optimizer, loss],feed_dict={X: batch_x, Y : batch_y})
            cost_history = np.append(cost_history,c)
        print("Epoch: ",epoch," Training Loss: ",c," Training Accuracy: ",\
              session.run(accuracy, feed_dict={X: X_train, Y: y_train}))

The dataset I am using comes from http://www.cis.fordham.edu/wisdm/dataset.php

WISDM_ar_v1.1_raw.txt

However, when I run it, I get a ResourceExhausted / OOM error:

Traceback (most recent call last):
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call
    return fn(*args)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn
    status, run_metadata)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8784000,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 9, in
    session.run(accuracy, feed_dict={X: X_train, Y: y_train}))
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run
    options, run_metadata)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8784000,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'MatMul', defined at:
  File "", line 1, in
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 130, in main
    ret = method(*args, **kwargs)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 357, in runcode
    exec(code, self.locals)
  File "", line 1, in
  File "", line 13, in create_LSTM_model
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2022, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2799, in _mat_mul
    name=name)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
    op_def=op_def)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[8784000,64] and type float by allocator GPU_0_bfc
    [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
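
For what it is worth, some rough arithmetic suggests the first dimension of the OOM tensor matches the entire training split being pushed through tf.reshape(X, [-1, N_FEATURES]) at once, and the 64 is just N_HIDDEN_UNITS (the window count of roughly 54,900 below is my own estimate from the preprocessing above, not a value printed by the script):

# Rough sanity check only; n_windows is an assumed figure for this dataset.
N_TIME_STEPS = 200
N_HIDDEN_UNITS = 64
n_windows = 54901                            # approx. (len(df) - N_TIME_STEPS) // step
n_train = int(0.8 * n_windows)               # 80/20 train_test_split -> 43920
rows_after_reshape = n_train * N_TIME_STEPS  # rows entering the hidden-layer matmul
print(rows_after_reshape, N_HIDDEN_UNITS)    # 8784000 64, i.e. shape[8784000,64]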

What could be causing this error?

Update: I ran my code on another machine and it did not give this error.

Tags: tensorflow, error-handling, deep-learning, recurrent-neural-network

Solution


There is one big problem in your code. You are running into this because you do not have a static graph, which means you keep adding new nodes to the graph as the for loop executes. If you trace how your loss value is evaluated,

session.run([loss]),

you will notice that you are running this part of your code

pred_Y = create_LSTM_model(X)

multiple times as you go through the for loop.

You do not want to do that. You should modify your code so that you can fetch the loss value from the graph without re-creating the graph each time.

Hope this helps.
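
To make that concrete, here is a minimal sketch of the structure I mean, reusing the definitions from your own code above (the LEARNING_RATE value and the slice used for the accuracy check are placeholders I picked for illustration): every op is created exactly once, and the loop only ever calls session.run on tensors that already exist in the graph.

# Build the graph exactly once, before any loop.
tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_FEATURES], name="input")
Y = tf.placeholder(tf.float32, [None, N_CLASSES])

pred_Y = create_LSTM_model(X)          # called a single time, never inside the loop
pred_softmax = tf.nn.softmax(pred_Y, name="y_")

loss = -tf.reduce_sum(Y * tf.log(pred_softmax))
LEARNING_RATE = 0.0025                 # placeholder value
optimizer = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss)

correct_prediction = tf.equal(tf.argmax(pred_softmax, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for epoch in range(8):
        for b in range(total_batches):
            offset = (b * batch_size) % (y_train.shape[0] - batch_size)
            batch_x = X_train[offset:offset + batch_size]
            batch_y = y_train[offset:offset + batch_size]
            # Only feed data into existing ops here; no new graph nodes are added.
            _, c = session.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y})
        # Checking accuracy on a small slice keeps the feed modest; feeding all of
        # X_train at once creates very large intermediate tensors on the GPU.
        train_acc = session.run(accuracy, feed_dict={X: X_train[:1000], Y: y_train[:1000]})
        print("Epoch:", epoch, "loss:", c, "accuracy:", train_acc)

The key point is that create_LSTM_model(X) and the optimizer are constructed exactly once; every iteration of the loop only feeds data into the already-built graph.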

