Issue training ResNet50 with Tensorflow 1.5.0 on an RTX 3070

Problem description

I am working in a Docker container created as follows:

docker run --gpus=all -it -p "8888:8888" -v "/home/miguel/ml-resnet-50/:/notebooks/" --name ml-resnet-50 tensorflow/tensorflow:1.5.0-gpu-py3 jupyter notebook --ip 0.0.0.0 --no-browser --allow-root

On a Linux PC (Ubuntu 20.04) with an Nvidia RTX 3070 card, I try to run the following code:

model.fit(
    x=imgs_train,
    y=clss_train,
    batch_size=16,
    epochs=2,
    verbose=1,
    validation_data=(imgs_val, clss_val)
    )

and I get the following error:

InternalError: Blas SGEMM launch failed : m=48400, n=64, k=64
[[Node: res2a_branch2a/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](max_pooling2d/MaxPool, res2a_branch2a/kernel/read)]]
[[Node: loss/mul/_2859 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_15435_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'res2a_branch2a/Conv2D', defined at:
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 478, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2850, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 2, in <module>
    model = get_model()
  File "", line 4, in get_model
    model = ResNet50(include_top=False, input_shape=(pipeline['img_height'], pipeline['img_width'], 3))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/applications/resnet50.py", line 235, in ResNet50
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/applications/resnet50.py", line 122, in conv_block
    name=conv_name_base + '2a')(input_tensor)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 258, in __call__
    output = super(Layer, self).__call__(inputs, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 652, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas SGEMM launch failed : m=48400, n=64, k=64
[[Node: res2a_branch2a/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](max_pooling2d/MaxPool, res2a_branch2a/kernel/read)]]
[[Node: loss/mul/_2859 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_15435_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Any idea why this happens?

Tags: python, docker, tensorflow, keras, nvidia

Solution
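A note on reading the error message, not a fix: the numbers in "Blas SGEMM launch failed : m=48400, n=64, k=64" appear to be consistent with the first 1x1 convolution of ResNet50 (res2a_branch2a) being executed as a single matrix multiply. The sketch below is only an illustration under stated assumptions: that a 1x1, stride-1 NHWC convolution maps to a GEMM with m = batch * height * width, n = output channels, k = input channels, and that the feature map is square. The helper name `sgemm_dims_for_1x1_conv` is hypothetical, and the 55x55 size is inferred from the error message and batch_size=16, not stated in the question.

```python
import math


def sgemm_dims_for_1x1_conv(batch, height, width, in_channels, out_channels):
    """Return (m, n, k) for a 1x1, stride-1 NHWC convolution viewed as one GEMM.

    Assumption: every spatial position of every image becomes one row of
    the left-hand matrix, so m = batch * height * width.
    """
    m = batch * height * width
    n = out_channels
    k = in_channels
    return m, n, k


# Values taken from the question: batch_size=16 and m=48400 in the error.
batch_size = 16
m_reported = 48400

# Spatial positions per image implied by the error message.
positions_per_image = m_reported // batch_size

# Assumption: the feature map after max pooling is square.
side = math.isqrt(positions_per_image)

# res2a_branch2a uses 64 filters on a 64-channel input (n=64, k=64 in the error).
dims = sgemm_dims_for_1x1_conv(batch_size, side, side, 64, 64)
print(positions_per_image, side, dims)
```

If this reading is right, the failing call is an ordinary, modestly sized matrix multiply at the very first convolution that goes through cuBLAS, which suggests the problem lies with the GPU/BLAS setup in the container rather than with the model or the data.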

