Can I hack keras.fit_generator I/O to stop validation / change the process state from Sl+ to Rl?

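Question

I trained my deep learning model for 1 epoch, which took 25 hours. After that, validation did not finish within another 25 hours. I would like to save the model somehow.

My processes look like this: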

aksel    14135 40.6 26.8 43304288 17717160 pts/19 Sl+ May18 1292:38 python generator_main.py v31 v29
aksel    14312  0.0  2.3 20212124 1561936 pts/19 S+ May18   0:06 python generator_main.py v31 v29
aksel    14313  0.0  0.9 19311000 638528 pts/19 S+  May18   0:18 python generator_main.py v31 v29
aksel    14315  0.0  0.9 19311000 638528 pts/19 S+  May18   0:24 python generator_main.py v31 v29
aksel    14316  0.0  1.0 19311000 681516 pts/19 S+  May18   0:17 python generator_main.py v31 v29
aksel    25467  0.7 12.8 34743884 8448060 pts/19 S+ May19  14:38 python generator_main.py v31 v29
aksel    25468  0.7 12.8 34743884 8450772 pts/19 S+ May19  14:47 python generator_main.py v31 v29
aksel    25469  0.7 12.8 34743884 8462988 pts/19 S+ May19  14:36 python generator_main.py v31 v29
aksel    25470  0.7 12.8 34743884 8485316 pts/19 S+ May19  14:33 python generator_main.py v31 v29

The line it is stuck on is:

hist = s2_model.model.fit_generator(generator=training_generator,
                validation_data=validation_generator,
                **fit_params,
)

The fit parameters:

fit_params = {
'workers':4,
'class_weight':class_weights,
'max_queue_size':8,
'epochs':1,
'steps_per_epoch':40000,
'use_multiprocessing':True,
'callbacks':[EarlyStopping(**early_stopping_params),stop_cb],
}

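Can I somehow send a signal that makes this line stop, so that execution moves on to the next line and saves the model?

Tags: linux, keras

Solution


After reading the Keras source code for a few hours, I gave up and opened an issue on the Keras GitHub: https://github.com/keras-team/keras/issues/12840

In the end I simply trained again, this time more carefully. The problem is that the checkpoint callback only works at epoch boundaries, and my epoch took 25 hours :)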
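
For reference, a checkpoint that fires on batches rather than epochs can be sketched as a small custom callback. This is only an illustration and not what the answer goes on to use: the class name `BatchCheckpoint`, the `every_n_batches` parameter and the file name are hypothetical, and the stand-in base class exists only so the sketch can be exercised on a machine without Keras.

```python
try:
    from keras.callbacks import Callback
except ImportError:
    # Stand-in base class so the sketch can be tried without Keras installed.
    class Callback(object):
        def __init__(self):
            self.model = None

        def set_model(self, model):
            self.model = model


class BatchCheckpoint(Callback):
    """Save the model every `every_n_batches` batches (hypothetical helper)."""

    def __init__(self, filepath, every_n_batches=1000):
        super(BatchCheckpoint, self).__init__()
        self.filepath = filepath
        self.every_n_batches = every_n_batches
        self.batches_seen = 0

    def on_batch_end(self, batch, logs=None):
        # Count batches across the whole run, not just within one epoch.
        self.batches_seen += 1
        if self.batches_seen % self.every_n_batches == 0:
            self.model.save(self.filepath)
```

With 40000 steps per epoch, adding something like `BatchCheckpoint('checkpoint.h5', every_n_batches=1000)` to the callbacks list would leave a recent copy of the model on disk even if the epoch or the validation pass never finishes.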

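The solution is to write a signal handler.

Here is my custom callback, which stops training when CTRL+Z is pressed so that the script can move on and save my model: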

import signal

import keras


class SignalStopping(keras.callbacks.Callback):
    '''Stop training when a signal (by default SIGTSTP, i.e. CTRL+Z) is received.

    # Arguments
        sig: the signal to listen to. Defaults to signal.SIGTSTP.
        doubleSignalExits: receiving the signal twice exits the Python
            process instead of waiting for training to stop
            (the code for this is currently commented out).
        verbose: verbosity mode.
    '''

    def __init__(self, sig=signal.SIGTSTP, doubleSignalExits=False, verbose=1):
        super(SignalStopping, self).__init__()
        self.signal_received = False
        self.verbose = verbose
        self.doubleSignalExits = doubleSignalExits
        self.stopped_epoch = 0

        def signal_handler(sig, frame):
            #if self.signal_received and self.doubleSignalExits:
            #    if self.verbose > 0:
            #        print('') #new line to not print on current status bar. Better solution?
            #        print('Received signal to stop ' + str(sig)+' twice. Exiting..')
            #    exit(sig)
            # Record that the signal arrived so the hooks below can report it.
            self.signal_received = True
            # Ask Keras to stop; the model is not attached until training starts.
            if self.model is not None:
                self.model.stop_training = True
            #if self.verbose > 0:
            #    print('') #new line to not print on current status bar. Better solution?
            #    print('Received signal to stop: ' + str(sig))

        # Register the handler for the requested signal.
        signal.signal(sig, signal_handler)

    def on_epoch_end(self, epoch, logs=None):
        if self.signal_received:
            self.stopped_epoch = epoch
            self.model.stop_training = True
            print("stop_training=true")

    def on_train_end(self, logs=None):
        print("on_train_end")
        # Test the flag rather than stopped_epoch > 0: epoch indices start at 0.
        if self.signal_received and self.verbose > 0:
            print('Epoch %05d: stopping due to signal' % self.stopped_epoch)

I create it like this:

stop_cb = SignalStopping()

Then I put it in the callbacks list that is passed to fit_generator.
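
The mechanism the callback relies on can be tried without Keras. The sketch below is an illustration under assumptions, not part of the original answer: `StopFlag`, `run_steps` and `demo` are hypothetical names, the loop stands in for the training loop, and `os.kill` on the current process stands in for pressing CTRL+Z. It assumes Linux (where `SIGTSTP` exists) and that it runs in the main thread, since Python only allows signal handlers to be installed there.

```python
import os
import signal


class StopFlag(object):
    """Holds a boolean that a signal handler flips to request a stop."""

    def __init__(self, sig=signal.SIGTSTP):
        self.stop_requested = False
        self.sig = sig
        # Remember the previous handler so it can be restored afterwards.
        self._previous = signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Do as little as possible in the handler: just record the request.
        self.stop_requested = True

    def restore(self):
        signal.signal(self.sig, self._previous)


def run_steps(total_steps, flag, step_fn):
    """Run step_fn up to total_steps times, stopping early once flagged."""
    done = 0
    for step in range(total_steps):
        step_fn(step)
        done += 1
        if flag.stop_requested:
            break
    return done


def demo():
    flag = StopFlag(signal.SIGTSTP)

    def step_fn(step):
        # Simulate CTRL+Z arriving during the third step.
        if step == 2:
            os.kill(os.getpid(), signal.SIGTSTP)

    try:
        return run_steps(10, flag, step_fn)
    finally:
        flag.restore()
```

Here the loop finishes the step during which the signal arrives and then exits, which mirrors how setting `stop_training` lets the current unit of work complete before control returns to the line after `fit_generator`. Sending the signal from another terminal with `kill -TSTP <pid>` should have the same effect as CTRL+Z, provided the handler is installed in that process.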

