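tensorflow - steps_per_epoch is not honored in tf.keras.fit
Question
tf.keras.fit does not respect the batch size and keeps running out of memory (OOM) because it tries to allocate the entire tensor in GPU memory.
I am trying to fit a DNN model to the MNIST dataset: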
import numpy as np
import tensorflow as tf

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=35, kernel_size=(3,3), strides=(1,1), padding='same',
                           activation='relu', input_shape=(1, 28, 28), data_format="channels_first",
                           use_bias=True, bias_initializer=tf.keras.initializers.constant(0.01),
                           kernel_initializer='glorot_normal'),
    # tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool2D(pool_size=(2,2), padding='same', data_format='channels_first'),
    tf.keras.layers.Conv2D(filters=36, kernel_size=(3,3), strides=(1,1), padding='same',
                           activation='relu', data_format="channels_first", use_bias=True,
                           bias_initializer=tf.keras.initializers.constant(0.01), kernel_initializer='glorot_normal'),
    # tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool2D(pool_size=(2,2), padding='same', data_format='channels_first'),
    tf.keras.layers.Conv2D(filters=36, kernel_size=(3,3), strides=(1,1), padding='same',
                           activation='relu', data_format="channels_first", use_bias=True,
                           bias_initializer=tf.keras.initializers.constant(0.01), kernel_initializer='glorot_normal'),
    # tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool2D(pool_size=(2,2), padding='same', data_format='channels_first'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(576, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu')
])
(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis]/255, tf.float16),
     tf.cast(mnist_labels, tf.int8)))
dataset = dataset.shuffle(1000)
mnist_images = tf.convert_to_tensor(np.expand_dims(mnist_images, axis=1))
mnist_model.compile(optimizer=tf.keras.optimizers.Adam(), loss="categorical_crossentropy", metrics=['accuracy'])
mnist_model.fit(mnist_images, tf.one_hot(mnist_labels, depth=10), epochs=2, steps_per_epoch=100)
I expected a batch size of 60000 / 100 = 600, but Keras keeps allocating a tensor of shape [60000,35,28,28]. The steps_per_epoch parameter is not being honored. I get this error:
ResourceExhaustedError: OOM when allocating tensor with shape[60000,35,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node conv2d_19/Conv2D}} = Conv2D[T=DT_FLOAT, _class=["loc:@training_6/Adam/gradients/conv2d_19/Conv2D_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_identity_conv2d_19_input_0, conv2d_19/Conv2D/ReadVariableOp)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node ConstantFoldingCtrl/loss_6/dense_13_loss/broadcast_weights/assert_broadcastable/AssertGuard/Switch_0/_912}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_324_C...d/Switch_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
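To put the shape in the error message into perspective, here is a small back-of-the-envelope sketch. It assumes the "type float" in the message means 4-byte float32 values; `tensor_bytes` is a hypothetical helper, not part of TensorFlow.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for a dense tensor of the given shape.

    bytes_per_element=4 assumes float32, which is what "type float"
    in the OOM message is taken to mean here.
    """
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


# Shape reported in the OOM message: the whole training set at once.
full = tensor_bytes((60000, 35, 28, 28))
# The same layer output if only 600 samples were processed per step.
batched = tensor_bytes((600, 35, 28, 28))

print(f"full dataset: {full / 2**30:.2f} GiB")
print(f"batch of 600: {batched / 2**20:.2f} MiB")
```

The first convolution's output for all 60000 images needs roughly 6.1 GiB for that single tensor, while a batch of 600 needs about 63 MiB, which is why the effective batch size matters here.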
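Answer
You can pass batch_size to the fit function. The docstring below is from a fit method that accepts it; check its arguments. (Its input_fn, monitors, and max_steps arguments suggest it comes from the tf.contrib.learn estimator API rather than tf.keras.Model.fit, but tf.keras.Model.fit also takes a batch_size argument.)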
def fit(self, x=None, y=None, input_fn=None, steps=None, batch_size=None,
        monitors=None, max_steps=None):
    """Trains a model given training data `x` predictions and `y` targets.

    Args:
      x: Matrix of shape [n_samples, n_features...]. Can be iterator that
         returns arrays of features. The training input samples for fitting the
         model. If set, `input_fn` must be `None`.
      y: Vector or matrix [n_samples] or [n_samples, n_outputs]. Can be
         iterator that returns array of targets. The training target values
         (class labels in classification, real numbers in regression). If set,
         `input_fn` must be `None`.
      input_fn: Input function. If set, `x`, `y`, and `batch_size` must be
        `None`.
      steps: Number of steps for which to train model. If `None`, train forever.
        If set, `max_steps` must be `None`.
      batch_size: minibatch size to use on the input, defaults to first
        dimension of `x`. Must be `None` if `input_fn` is provided.
      monitors: List of `BaseMonitor` subclass instances. Used for callbacks
        inside the training loop.
      max_steps: Number of total steps for which to train model. If `None`,
        train forever. If set, `steps` must be `None`.
        Two calls to `fit(steps=100)` means 200 training
        iterations. On the other hand, two calls to `fit(max_steps=100)` means
        that the second call will not do any iteration since first call did
        all 100 steps.

    Returns:
      `self`, for chaining.

    Raises:
      ValueError: If `x` or `y` are not `None` while `input_fn` is not `None`.
      ValueError: If both `steps` and `max_steps` are not `None`.
    """
Start with batch_size=1 (a single-sample mini-batch) to check that the model trains and converges at all. Then increase batch_size as far as your GPU and CPU memory allow.
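As a minimal sketch of that advice, the helper below derives a batch size from the number of steps the question wanted and passes it to fit explicitly. `batch_size_for` and `fit_in_batches` are hypothetical names introduced for illustration, and the sketch assumes the inputs are NumPy arrays rather than tf.Tensor objects, so that Keras can slice them into mini-batches itself.

```python
import math


def batch_size_for(n_samples, steps_per_epoch):
    """Smallest batch size that covers n_samples in steps_per_epoch steps."""
    return math.ceil(n_samples / steps_per_epoch)


def fit_in_batches(model, images, labels, steps_per_epoch=100, epochs=2):
    """Train a compiled Keras model with an explicit batch size.

    Assumption: `images` and `labels` are NumPy arrays of equal length,
    not tensors, so Keras splits them into mini-batches of `batch_size`.
    """
    batch_size = batch_size_for(len(images), steps_per_epoch)
    # Pass batch_size instead of steps_per_epoch; Keras then runs
    # ceil(len(images) / batch_size) steps per epoch on its own.
    return model.fit(images, labels, batch_size=batch_size, epochs=epochs)
```

With the question's data this would be called on the NumPy versions of the arrays, for example the result of np.expand_dims(mnist_images, axis=1) and one-hot labels built in NumPy, giving 60000 / 100 = 600 samples per step.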
You can also limit how much GPU memory TensorFlow uses:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.85
session = tf.Session(config=config, ...)
Together with a smaller batch size, this can help you resolve the OOM (out-of-memory) error.
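For the session settings to affect a tf.keras model, the session also has to be handed to Keras. The following is a sketch under the assumption of a TensorFlow 1.x installation (tf.ConfigProto and tf.Session are TF 1.x names); `use_limited_gpu_session` is a hypothetical helper name, and enabling allow_growth is an optional extra that the answer above does not mention.

```python
def use_limited_gpu_session(memory_fraction=0.85):
    """Create a TF 1.x session with capped GPU memory and register it with Keras."""
    # Imported inside the function; the TF 1.x API is assumed here.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Cap the share of each GPU's memory that this process may reserve.
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    # Optional: allocate GPU memory on demand instead of all up front.
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    # Make tf.keras use this session instead of its default one.
    tf.keras.backend.set_session(session)
    return session
```

Call it once before building or fitting the model. A cap only limits how much memory TensorFlow may reserve; a batch that does not fit in that budget will still fail, so it complements a smaller batch_size rather than replacing it.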