How to fix the “Segmentation fault (core dumped)” error when trying to fit a Keras model in Python (Anaconda) on Ubuntu 18.04

Problem Description

I have a new PC with a 2080Ti GPU (running Ubuntu 18.04). I am trying to train a neural network in Python using Keras (in an Anaconda environment), but I get a “Segmentation fault (core dumped)” error when I try to fit the model.

The code I am using runs fine on my Windows PC (which has a 1080Ti GPU). The error seems to be related to GPU memory: when I run “nvidia-smi” before fitting the model, I see something strange. About 800 MiB of the 11 GiB of GPU memory is already in use, but as soon as I compile the model, all of the remaining free memory is taken. In the processes section I can see that this corresponds to the Anaconda environment (i.e. ...ics-link/anaconda3/envs/py35/bin/python = 9677MiB).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:04:00.0  On |                  N/A |
| 28%   44C    P2    51W / 250W |  10491MiB / 10986MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1507      G   /usr/lib/xorg/Xorg                            30MiB |
|    0      1538      G   /usr/bin/gnome-shell                          57MiB |
|    0      1844      G   /usr/lib/xorg/Xorg                           309MiB |
|    0      1979      G   /usr/bin/gnome-shell                         177MiB |
|    0      3816      G   /usr/lib/firefox/firefox                       6MiB |
|    0      5451      G   ...-token=169F1B80118E535BC5002C22A81DD0FA    90MiB |
|    0      5896      G   ...-token=631C5DCD90ADCF80959770937CE797E7   128MiB |
|    0      6485      C   ...ics-link/anaconda3/envs/py35/bin/python  9677MiB |
+-----------------------------------------------------------------------------+

Here is the code, for reference:

from __future__ import print_function
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Activation, BatchNormalization
from keras.callbacks import ModelCheckpoint, CSVLogger
from keras import backend as K
import numpy as np

batch_size = 64
num_classes = 10
epochs = 10

# input image dimensions
img_rows, img_cols = 32, 32

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 3, img_rows, img_cols)
    input_shape = (3, img_rows, img_cols)  # CIFAR-10 images have 3 channels
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# standardise pixel values using training-set mean and std
mean = np.mean(x_train, axis=(0, 1, 2, 3))
std = np.std(x_train, axis=(0, 1, 2, 3))
x_train = (x_train - mean) / (std + 1e-7)
x_test = (x_test - mean) / (std + 1e-7)

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))

model.add(Conv2D(64, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(256, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))

model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

#load weights from previous run
#model.load_weights('model07_weights_best.hdf5')

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0.1,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

# Compute quantities required for feature-wise normalization
# (std, mean, and principal components if ZCA whitening is applied).
datagen.fit(x_train)


#save weights and log
checkpoint = ModelCheckpoint("model14_weights_best.hdf5", monitor='val_acc', verbose=1, save_best_only=True, mode='max')
csv_logger = CSVLogger('model14_loss_log.csv', append=True, separator=';')
callbacks_list = [checkpoint,csv_logger]

# Fit the model on the batches generated by datagen.flow().
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)

I wasn't expecting to take up much space on the GPU, yet it appears to be saturated. As I mentioned, the same code works on my Windows PC.

Any ideas as to what might be causing this?

Tags: python, keras, segmentation-fault, anaconda, ubuntu-18.04

Solution


I don't believe this is related to memory size; I have been dealing with this issue recently. A segmentation fault here indicates that the parallelisation of your training process on the GPU is failing: if the process were running sequentially, you would not get this error no matter how large your dataset is. There is also no need to worry about your deep learning setup itself.
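As an aside, the near-full memory usage you saw in nvidia-smi is expected behaviour: by default, TensorFlow 1.x reserves almost all free GPU memory as soon as a session starts, regardless of how small the model is. If you would rather have it allocate memory on demand, a minimal sketch for the TF 1.x Keras backend looks like this (allow_growth is a standard TF 1.x option; this addresses the memory display, not the crash itself):

import tensorflow as tf
from keras import backend as K

# Grow the GPU memory pool on demand instead of reserving
# nearly all of it up front (TensorFlow 1.x default behaviour).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))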

Since you have just set up a new machine, I believe the segmentation fault in your context must come down to one of two causes.

First, I would check that the GPU itself is installed correctly; but given the details you provided, I think the problem is more likely with the module (Keras, in your case), which is the second cause:

  • Something strange may have happened while installing the module or one of its dependencies. I suggest removing it, cleaning everything out, and reinstalling.

  • Are you sure your tensorflow-gpu is installed (correctly)? What about CUDA and cuDNN? (See the sanity check after this list.)
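One quick way to verify the tensorflow-gpu / CUDA / cuDNN stack from inside the conda environment is a sketch like the following (assuming a TF 1.x install, as in your setup; tf.test.is_gpu_available was later removed in TF 2.x):

import tensorflow as tf

print(tf.__version__)                 # installed TensorFlow version
print(tf.test.is_built_with_cuda())   # True if this build was compiled against CUDA
print(tf.test.is_gpu_available())     # True if a GPU is actually usable right now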

If you believe Keras is installed correctly, try this test code:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

This prints the devices TensorFlow can see, which tells you whether it is running on the CPU or the GPU backend.
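If you just want a yes/no answer, the same device_lib call can be reduced to a single check (a small convenience sketch, not part of the original test):

from tensorflow.python.client import device_lib

# True only if at least one GPU device is visible to TensorFlow.
gpus = [d for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print('GPU visible to TensorFlow:', bool(gpus))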

If all of the above steps check out, I doubt you will see the segmentation fault again.

Check this reference for testing TensorFlow on the GPU.
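For example, a minimal TF 1.x-style placement test (the classic example from the TensorFlow docs) looks like this:

import tensorflow as tf

# Pin a small matmul to the GPU and log where each op actually runs.
with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # the session log shows each op's device placement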
