linux - Can I hack keras.fit_generator I/O to stop validation / change the process state from Sl+ to Rl?
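Problem description
I trained my deep learning model for 1 epoch, which took 25 hours. After that, validation ran for another 25 hours without finishing. I would like to save the model somehow.
My processes look like this: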
aksel 14135 40.6 26.8 43304288 17717160 pts/19 Sl+ May18 1292:38 python generator_main.py v31 v29
aksel 14312 0.0 2.3 20212124 1561936 pts/19 S+ May18 0:06 python generator_main.py v31 v29
aksel 14313 0.0 0.9 19311000 638528 pts/19 S+ May18 0:18 python generator_main.py v31 v29
aksel 14315 0.0 0.9 19311000 638528 pts/19 S+ May18 0:24 python generator_main.py v31 v29
aksel 14316 0.0 1.0 19311000 681516 pts/19 S+ May18 0:17 python generator_main.py v31 v29
aksel 25467 0.7 12.8 34743884 8448060 pts/19 S+ May19 14:38 python generator_main.py v31 v29
aksel 25468 0.7 12.8 34743884 8450772 pts/19 S+ May19 14:47 python generator_main.py v31 v29
aksel 25469 0.7 12.8 34743884 8462988 pts/19 S+ May19 14:36 python generator_main.py v31 v29
aksel 25470 0.7 12.8 34743884 8485316 pts/19 S+ May19 14:33 python generator_main.py v31 v29
The line it is stuck on is:
hist = s2_model.model.fit_generator(generator=training_generator,
                                    validation_data=validation_generator,
                                    **fit_params,
                                    )
The fit parameters:
fit_params = {
    'workers': 4,
    'class_weight': class_weights,
    'max_queue_size': 8,
    'epochs': 1,
    'steps_per_epoch': 40000,
    'use_multiprocessing': True,
    'callbacks': [EarlyStopping(**early_stopping_params), stop_cb],
}
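Can I somehow send a signal that makes this line stop and lets execution move on to the next line, which saves the model?
Solution
After reading the keras source code for a few hours, I gave up and opened an issue on the keras GitHub: https://github.com/keras-team/keras/issues/12840
I simply trained again, smarter this time. The problem is that the checkpoint callback only runs at epoch boundaries, and my epoch took 25 hours :)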
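One way around the epoch-only checkpoint would be to save every N batches instead. The following is a minimal sketch of that idea, not something from the original answer: `EveryNBatches`, `save_fn`, and `period` are hypothetical names. It is written without a Keras dependency so that it runs on its own. To use it with Keras it would presumably need to subclass `keras.callbacks.Callback` and have `save_fn` call `self.model.save(...)`.

```python
class EveryNBatches:
    """Calls save_fn(batches_done) once every `period` batches (hypothetical helper)."""

    def __init__(self, save_fn, period=1000):
        if period < 1:
            raise ValueError("period must be >= 1")
        self.save_fn = save_fn
        self.period = period

    def on_batch_end(self, batch, logs=None):
        # `batch` is assumed to be a zero-based index, so batch + 1 batches are done.
        batches_done = batch + 1
        if batches_done % self.period == 0:
            self.save_fn(batches_done)
```

With 40000 steps per epoch, a period of a few thousand batches would leave a recent saved model even if validation never finishes.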
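The solution was to write a signal handler.
Here is my custom callback. Pressing CTRL+Z makes it stop training, so the script continues to the line that saves my model: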
import signal
import sys

import keras


class SignalStopping(keras.callbacks.Callback):
    '''Stop training when an interrupt signal (or other) was received.

    # Arguments
        sig: the signal to listen to. Defaults to signal.SIGTSTP (CTRL+Z).
        doubleSignalExits: Receiving the signal twice exits the python
            process instead of waiting for training to stop.
        verbose: verbosity mode.
    '''
    # Legacy note, SBW 2018.10.15: Since ctrl-c trapping isn't working, watch for
    # existence of file, e.g. .\path\_StopTraining.txt. (Not implemented here.)
    def __init__(self, sig=signal.SIGTSTP, doubleSignalExits=False, verbose=1):
        super(SignalStopping, self).__init__()
        self.signal_received = False
        self.verbose = verbose
        self.doubleSignalExits = doubleSignalExits
        self.stopped_epoch = 0

        def signal_handler(sig, frame):
            if self.signal_received and self.doubleSignalExits:
                if self.verbose > 0:
                    print('')  # new line to not print on current status bar
                    print('Received signal to stop ' + str(sig) + ' twice. Exiting..')
                sys.exit(sig)
            # Remember the request and ask Keras to stop training.
            self.signal_received = True
            if self.model is not None:
                self.model.stop_training = True
            if self.verbose > 0:
                print('')  # new line to not print on current status bar
                print('Received signal to stop: ' + str(sig))

        # Register the handler for the requested signal (not a hard-coded one).
        signal.signal(sig, signal_handler)

    def on_epoch_end(self, epoch, logs=None):
        if self.signal_received:
            self.stopped_epoch = epoch
            self.model.stop_training = True
            if self.verbose > 0:
                print("stop_training=true")

    def on_train_end(self, logs=None):
        # Check the flag rather than stopped_epoch > 0, so epoch 0 is reported too.
        if self.signal_received and self.verbose > 0:
            print('Epoch %05d: stopping due to signal' %
                  (self.stopped_epoch))
I instantiate it like this:
stop_cb = SignalStopping()
Then I put it in the callbacks list that is passed to fit_generator.
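The mechanism the callback relies on can be tried without Keras. The sketch below is only an illustration under stated assumptions: `StopFlag` and `train` are hypothetical names, and `train` is a stand-in loop rather than real training. The handler only sets a flag, the loop polls that flag between steps, and `os.kill` plays the role of pressing CTRL+Z or running `kill -TSTP <pid>` from another shell.

```python
import os
import signal


class StopFlag:
    """Records that a stop signal arrived; the loop polls it between steps."""

    def __init__(self, sig=signal.SIGTSTP):
        self.requested = False
        self.sig = sig
        # Keep the previous handler so it can be restored afterwards.
        self._previous = signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the actual stopping happens in the main loop.
        self.requested = True

    def restore(self):
        signal.signal(self.sig, self._previous)


def train(steps, flag, send_signal_at=None):
    """Hypothetical stand-in for a training loop; returns the steps completed."""
    done = 0
    for step in range(steps):
        if send_signal_at is not None and step == send_signal_at:
            # Simulates CTRL+Z by sending the signal to this very process.
            os.kill(os.getpid(), flag.sig)
        done += 1  # one unit of "training"
        if flag.requested:
            break  # leave the loop early so the caller can save its state
    return done
```

Because the handler does nothing beyond setting a flag, the loop always stops at a step boundary, which is the point where saving the model is safe.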