TensorFlow 2.1 error "when finalizing GeneratorDataset iterator" - possibly a memory leak in my generator, but how do I narrow it down?

Problem description

Question

I'm doing image classification with TensorFlow 2.1.0 under CentOS Linux. As my training image dataset grew, I had to switch to a generator, because I no longer have enough RAM to hold all the pictures at once. I wrote the generator following this tutorial.

It seems to work fine until my program suddenly gets killed without an error message:

Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed

Watching the steadily growing memory consumption in Linux's top, I suspect a memory leak?
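To put numbers on that suspicion, I could log the resident memory of the process once per epoch from a callback. A minimal sketch, assuming the third-party psutil package is installed (psutil and the MemoryLogger name are mine, not part of my training code):

    import psutil
    import tensorflow as tf

    class MemoryLogger(tf.keras.callbacks.Callback):
        'Logs the resident set size (RSS) of this process after every epoch'
        def on_epoch_end(self, epoch, logs=None):
            rss_mb = psutil.Process().memory_info().rss / (1024 ** 2)
            print(f'epoch {epoch}: process RSS = {rss_mb:.0f} MB')

If the RSS climbs by roughly the same amount every epoch, that would point to something being retained per epoch rather than a one-off allocation.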

What I tried

As a beginner, I assume there is some mistake in my code rather than claiming to have found a TensorFlow bug. I tried using print to watch the sizes of all the lists in my generator, but none of them grow. I don't know what to do next.
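Besides print, Python's built-in tracemalloc module can show which source lines own allocations that survive from one epoch to the next. A rough sketch of how I could wrap it around a few epochs of training (where exactly to take the snapshots is my choice, not from my current code):

    import tracemalloc

    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    # ... run a few training epochs here ...

    after = tracemalloc.take_snapshot()
    # the ten source lines whose allocations grew the most
    for stat in after.compare_to(before, 'lineno')[:10]:
        print(stat)

Note that tracemalloc only sees Python-level allocations; memory held inside TensorFlow's C++ runtime would not show up here.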

Relevant code snippet

import configparser
import math
import os

import numpy as np
import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')

        self.dim = (int(config['Basics']['PicHeight']),int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()        


    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        # Keras treats a third returned element as per-sample weights
        return X, y, [None]

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs)) 
        np.random.shuffle(self.indexes)        

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load(os.path.join(self.dir, ID))

            # Store class
            y[i] = self.labels[ID]

        return X, y
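Before blaming Keras, a plain loop over the generator (no model.fit at all) should show whether the generator alone leaks. A diagnostic sketch, where train_files, labels, dir and class_names are the same objects used in the training call below, and two passes stand in for two epochs:

    # if memory grows here, the leak is in the generator itself
    gen = DataGenerator(train_files, labels, dir, len(class_names))
    for epoch in range(2):
        for i in range(len(gen)):
            X, y, _ = gen[i]
        gen.on_epoch_end()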

In my training code it is called like this:

        training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6,
                    epochs=self._Epochs,
                    steps_per_epoch=len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])
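Since fit runs with workers=6, Keras also keeps a queue of prefetched batches alive, and I cannot yet rule that out as the growing consumer. Shrinking it is a cheap experiment (max_queue_size is a standard fit argument; the values here are only for the test):

        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=1,                 # single worker thread
                    max_queue_size=1,          # at most one prefetched batch
                    epochs=self._Epochs,
                    steps_per_epoch=len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])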

I ran the exact same code under Windows 10, where it gave me the following error:

    Epoch 9/30
    2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
    41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
    Epoch 10/30
    2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
     1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
    2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
             [[{{node MatMul_6}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
      File "run.py", line 83, in <module>
        main()
      File "run.py", line 70, in main
        accuracy, num_of_classes = train_Posture(unique_name)
      File "run.py", line 31, in train_Posture
        acc = neuro.train(picdb, train_ids, test_ids, "Posture")
      File "A:\200307 3rd Try\neuro.py", line 161, in train
        callbacks=[LoggingCallback(self.logger.debug)])
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
        use_multiprocessing=use_multiprocessing)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
        total_epochs=epochs)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
        batch_outs = execution_function(iterator)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
        distributed_function(input_fn))
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
        result = self._call(*args, **kwds)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
        return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
        return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
        self.captured_inputs)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
        ctx, args, cancellation_manager=cancellation_manager))
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
        ctx=ctx)
      File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
        six.raise_from(core._status_to_exception(e.code, message), None)
      File "<string>", line 3, in raise_from
    tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
             [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [Op:__inference_distributed_function_764]

    Function call stack:
    distributed_function

    2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
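The shape in the OOM message also lets me estimate what that single allocation asked for, assuming float means float32 at 4 bytes per element:

    # operand of MatMul_6 from the OOM message
    size_gib = 1279200 * 322 * 4 / 1024**3   # elements x 4 bytes, in GiB
    print(round(size_gib, 2))                # -> 1.53, for one tensor alone

So on Windows, one matrix of that size already needs about 1.5 GiB by itself.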

Any ideas are very welcome, I'm stuck!

Tags: python, tensorflow, keras, memory-leaks
