python - TensorFlow 2.1 error "Error occurred when finalizing GeneratorDataset iterator" - possibly a memory leak in my generator, but how do I narrow it down?
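Problem description
The problem
I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set grew, I had to start using a generator because I do not have enough RAM to hold all the pictures. I coded the generator based on this tutorial.
It seemed to work fine until my program was suddenly killed without an error message: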
Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed
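Watching the memory consumption grow steadily in Linux's top, I suspect a memory leak?
What I have tried
- It was suggested here that switching to the TF nightly build would help. It did not for me, and downgrading to TF 2.0.1 did not help either.
- There was also a discussion (which I can no longer find) suggesting that it is important for 'steps_per_epoch' and 'batch size' to match (whatever that means exactly). I experimented with this and saw no improvement.
- Here they claim that this is a TensorFlow bug that will be fixed, and that there is no need to worry for now because it is only a warning. But my job gets killed, so that does not help.
- Further down in the same discussion there seems to be agreement that this is a TensorFlow memory leak. But if that were the case, I would expect far more Google hits for my error message?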
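On the 'steps_per_epoch' and 'batch size' point: one plausible reading is that steps_per_epoch should equal the number of full batches the generator can supply per epoch, which is what `__len__` in the generator computes. A minimal sketch of that arithmetic follows; the function name and the sample counts are hypothetical and not taken from the actual data set.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of full batches in one epoch; a partial last batch is dropped."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.floor(n_samples / batch_size)


# Hypothetical numbers, chosen only for illustration
print(steps_per_epoch(1376, 32))  # 43
print(steps_per_epoch(1400, 32))  # 43, the remaining 24 samples are skipped
```

Because the generator already returns this value from `__len__`, passing `steps_per_epoch=len(training_generator)` keeps the two in agreement.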
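As a beginner, I assume there is a bug in my own code rather than claiming to have found a TensorFlow bug. I tried using print to check the sizes of all the lists in my generator, but none of them grew. I do not know what to do now.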
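One way to narrow this down beyond printing list sizes is to call the generator in a loop with no model attached and watch the Python-level allocations. The sketch below uses the standard-library `tracemalloc` module. `traced_growth` and the two toy batch functions are hypothetical helpers; in practice `get_batch` could be something like `lambda i: training_generator[i % len(training_generator)]`. Memory held by native allocators, such as TensorFlow's own, may not show up here, so a flat curve would only suggest that the generator's Python objects are not the source.

```python
import tracemalloc


def traced_growth(get_batch, n_batches):
    """Call get_batch(i) n_batches times; return traced bytes after each call."""
    tracemalloc.start()
    try:
        sizes = []
        for i in range(n_batches):
            get_batch(i)  # result is discarded, so a clean source stays flat
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
        return sizes
    finally:
        tracemalloc.stop()


# Toy stand-ins for a generator's __getitem__ (hypothetical)
_cache = []


def leaky_batch(i):
    data = bytearray(100_000)
    _cache.append(data)  # every batch stays referenced, so memory grows
    return data


def clean_batch(i):
    return bytearray(100_000)  # nothing is retained once the caller drops it


leaky = traced_growth(leaky_batch, 20)
clean = traced_growth(clean_batch, 20)
print("leaky growth in bytes:", leaky[-1] - leaky[0])
print("clean growth in bytes:", clean[-1] - clean[0])
```

If the real generator behaves like `clean_batch` in this test, the growth seen in top is more likely to come from the training loop than from the generator.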
Relevant code snippets
import configparser
import math
import os

import numpy as np
import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')
        self.dim = (int(config['Basics']['PicHeight']), int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y, [None]

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load(os.path.join(self.dir, ID))
            # Store class
            y[i] = self.labels[ID]
        return X, y
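For a sense of scale: `np.empty` called without a `dtype` allocates float64, so every element of `X` takes 8 bytes. The sketch below estimates the size of one batch. The helper name, the picture size and the batch size are hypothetical, since the values stored in `config.ini` are not shown.

```python
def batch_bytes(batch_size, height, width, n_channels=3, itemsize=8):
    """Bytes for one X array of shape (batch_size, height, width, n_channels).

    itemsize=8 corresponds to float64, the default dtype of np.empty.
    """
    return batch_size * height * width * n_channels * itemsize


# Hypothetical values; the real ones live in config.ini
as_float64 = batch_bytes(batch_size=32, height=224, width=224)
as_float32 = batch_bytes(batch_size=32, height=224, width=224, itemsize=4)
print(f"{as_float64 / 2**20:.2f} MiB per batch as float64")
print(f"{as_float32 / 2**20:.2f} MiB per batch as float32")
```

Several prefetched batches are held in memory at once during training, so the resident amount is a multiple of this figure. That is a fixed overhead rather than a leak, but it helps to know the baseline before looking for growth.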
It is called like this:
training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
self.model.fit(x=training_generator,
               use_multiprocessing=False,
               workers=6,
               epochs=self._Epochs,
               steps_per_epoch=len(training_generator),
               callbacks=[LoggingCallback(self.logger.debug)])
I ran exactly the same code under Windows 10, which gave me the following error:
Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
File "run.py", line 83, in <module>
main()
File "run.py", line 70, in main
accuracy, num_of_classes = train_Posture(unique_name)
File "run.py", line 31, in train_Posture
acc = neuro.train(picdb, train_ids, test_ids, "Posture")
File "A:\200307 3rd Try\neuro.py", line 161, in train
callbacks=[LoggingCallback(self.logger.debug)])
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
total_epochs=epochs)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
self.captured_inputs)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_764]
Function call stack:
distributed_function
2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
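The size of the tensor named in the OOM message can be worked out from its shape. A short sketch, assuming that `type float` in the message means 32-bit floats (4 bytes per element); `tensor_bytes` is a hypothetical helper:

```python
def tensor_bytes(shape, itemsize=4):
    """Bytes for a dense tensor of the given shape; itemsize=4 assumes float32."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * itemsize


size = tensor_bytes((1279200, 322))
print(size)                    # 1647609600
print(round(size / 2**30, 2))  # 1.53 (GiB)
```

So this single MatMul allocation alone asks for roughly 1.5 GiB of RAM.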
Any ideas are very welcome, I am stuck!
Solution