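python - How to fix a "Segmentation fault (core dumped)" error when fitting a Keras model in Python (Anaconda) on Ubuntu 18.04
Problem description
I have a new PC with a 2080Ti GPU (running Ubuntu 18.04). I am trying to train a neural network in Python using Keras (in an Anaconda environment), but I get a "Segmentation fault (core dumped)" error when I try to fit the model.
The same code runs fine on my Windows PC (which has a 1080Ti GPU). The error seems to be related to GPU memory. When I run "nvidia-smi" before fitting the model, I see something odd: about 800 MB of the available 11 GB of GPU memory is in use, but as soon as I compile the model, nearly all of the available memory is taken. In the processes section I can see that this is tied to the Anaconda environment (i.e. ...ics-link/anaconda3/envs/py35/bin/python = 9677MiB).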
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:04:00.0  On |                  N/A |
| 28%   44C    P2    51W / 250W |  10491MiB / 10986MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1507      G   /usr/lib/xorg/Xorg                            30MiB |
|    0      1538      G   /usr/bin/gnome-shell                          57MiB |
|    0      1844      G   /usr/lib/xorg/Xorg                           309MiB |
|    0      1979      G   /usr/bin/gnome-shell                         177MiB |
|    0      3816      G   /usr/lib/firefox/firefox                       6MiB |
|    0      5451      G   ...-token=169F1B80118E535BC5002C22A81DD0FA    90MiB |
|    0      5896      G   ...-token=631C5DCD90ADCF80959770937CE797E7   128MiB |
|    0      6485      C   ...ics-link/anaconda3/envs/py35/bin/python  9677MiB |
+-----------------------------------------------------------------------------+
Here is the code, for reference:
from __future__ import print_function
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Activation, BatchNormalization
from keras.callbacks import ModelCheckpoint, CSVLogger
from keras import backend as K
import numpy as np
batch_size = 64
num_classes = 10
epochs = 10
# input image dimensions
img_rows, img_cols = 32, 32
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 3, img_rows, img_cols)
    # CIFAR-10 images have 3 colour channels, so the channel axis is 3, not 1
    input_shape = (3, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# normalise pixel values
mean = np.mean(x_train,axis=(0,1,2,3))
std = np.std(x_train,axis=(0,1,2,3))
x_train = (x_train-mean)/(std+1e-7)
x_test = (x_test-mean)/(std+1e-7)
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
#load weights from previous run
#model.load_weights('model07_weights_best.hdf5')
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=0.1,  # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,  # randomly flip images
    vertical_flip=False)  # randomly flip images
# Compute quantities required for feature-wise normalization
# (std, mean, and principal components if ZCA whitening is applied).
datagen.fit(x_train)
#save weights and log
checkpoint = ModelCheckpoint("model14_weights_best.hdf5", monitor='val_acc', verbose=1, save_best_only=True, mode='max')
csv_logger = CSVLogger('model14_loss_log.csv', append=True, separator=';')
callbacks_list = [checkpoint,csv_logger]
# Fit the model on the batches generated by datagen.flow().
model.fit_generator(datagen.flow(x_train, y_train,
                                 batch_size=batch_size),
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)
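I did not expect this to take up much space on the GPU, but the memory appears to be saturated. As I mentioned, the same code works on my Windows PC.
Any ideas about what could be causing this?
Solution
I don't believe this is related to memory size. I have been dealing with this problem recently. In my experience, the segmentation fault indicates that the training process failed while being parallelized on the GPU. If the process ran sequentially, you would not get this error no matter how large your dataset is. There is also no need to worry about your deep learning settings.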
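For context on the nvidia-smi numbers in the question: by default, TensorFlow reserves nearly all of the memory on every visible GPU as soon as a session is created, so the 9677MiB figure is expected and is not in itself a sign of a fault. If you would rather have memory allocated on demand, here is a minimal sketch. It assumes the TensorFlow 1.x API together with standalone Keras, which is my guess at the setup in the question; the function name enable_gpu_memory_growth is made up for this example, and it returns False when that combination is not available.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow 1.x to grow GPU memory on demand instead of
    reserving almost all of it up front.

    Returns True if the setting was applied, False if TensorFlow 1.x
    and standalone Keras are not both available (assumed setup).
    """
    try:
        import tensorflow as tf
        from keras import backend as K
    except ImportError:
        return False

    # tf.ConfigProto only exists at the top level in TensorFlow 1.x;
    # this sketch deliberately does nothing on other versions.
    if not hasattr(tf, "ConfigProto") or not hasattr(K, "set_session"):
        return False

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))
    return True
```

Call it once at the top of the training script, before the model is built. It changes what nvidia-smi reports, but it is not expected to fix a segmentation fault on its own.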
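Since you are setting up a new machine, I believe there are two possible causes for the segmentation fault in your case.
First, I would check that the GPU is installed correctly. Based on the details you provided, though, I think the problem is more likely the second cause: the module (Keras, in your case).
In that case, something may have gone wrong while installing the module or one of its dependencies. I suggest removing it, cleaning everything up, and reinstalling.
Are you sure tensorflow-gpu is installed (correctly)? What about CUDA and cuDNN?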
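A quick way to see what is actually present in the active environment is to look for the pieces without importing them, so that a broken native library cannot crash the check itself. The sketch below uses only the standard library; the function name check_gpu_stack is hypothetical. It only tests whether the Python packages can be located and whether the nvidia-smi and nvcc executables are on PATH. It does not verify cuDNN or that the versions match each other.

```python
import importlib.util
import shutil


def check_gpu_stack():
    """Report which parts of the GPU stack can be located.

    Packages are looked up without being imported; tools are looked
    up on PATH. Returns a dict mapping each name to True or False.
    """
    report = {}
    # The pip/conda package "tensorflow-gpu" is imported as "tensorflow".
    for package in ("tensorflow", "keras"):
        report[package] = importlib.util.find_spec(package) is not None
    # nvidia-smi ships with the driver, nvcc with the CUDA toolkit.
    for tool in ("nvidia-smi", "nvcc"):
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, found in sorted(check_gpu_stack().items()):
        print(name, "found" if found else "MISSING")
```

Note that nvcc can be missing even on a working setup if CUDA was installed as a conda package rather than system-wide, so treat a MISSING line as a hint, not a verdict.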
If you believe Keras is installed correctly, try the following test code:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
This prints the devices TensorFlow can see, which tells you whether it is running on the CPU or the GPU.
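The device list can be long, so it may help to filter it down to GPU entries. This is a small sketch built on the same device_lib call; the helper name list_gpu_devices is my own, and it returns None when TensorFlow cannot be imported at all.

```python
def list_gpu_devices():
    """Return the names of GPU devices TensorFlow can see.

    Returns None if TensorFlow cannot be imported, and an empty list
    if it imports but only CPU devices are visible.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [device.name
            for device in device_lib.list_local_devices()
            if device.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    if gpus is None:
        print("TensorFlow is not importable in this environment")
    elif not gpus:
        print("TensorFlow sees no GPU; it would fall back to the CPU")
    else:
        print("GPU devices:", gpus)
```

An empty list here usually means the CPU-only build is installed or the CUDA libraries could not be loaded.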
If all of the steps above go through without problems, I doubt you will see the segmentation fault again.
See this reference for testing TensorFlow on a GPU.
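If the segmentation fault still shows up after reinstalling, Python's built-in faulthandler module can at least tell you which Python line was executing when the process died. A minimal sketch, with a made-up helper name and log file name:

```python
import faulthandler


def enable_crash_trace(log_path="segfault_trace.log"):
    """Write a Python traceback to log_path if the process receives a
    fatal signal such as SIGSEGV.

    The returned file object must stay open for as long as the
    handler is enabled, so keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Call it at the very top of the training script, before importing keras, and keep the returned object in a variable. After a crash, the log shows the Python stack at the moment of the fault, which helps narrow down whether it happens at import time, when the model is compiled, or inside the first training step.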