python - InvalidArgumentError: indices[] = is not in: Re-training not starting after INFO:tensorflow:global_step/sec: 0

Problem description

I am currently fine-tuning an ssd mobilenet v2 model to improve human detection, and I get the following error:

2018-09-07 14:05:33.501707: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-09-07 14:05:34.037588: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-07 14:05:34.040906: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958]      0
2018-09-07 14:05:34.043348: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   N
2018-09-07 14:05:34.045821: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4734 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from D:/Databases/Coco/cctv/tf/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt
INFO:tensorflow:Restoring parameters from D:/Databases/Coco/cctv/tf/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training/model2/model.ckpt
INFO:tensorflow:Saving checkpoint to path training/model2/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, indices[1] = 2 is not in [0, 1)
         [[Node: cond_2/RandomCropImage/PruneCompleteleyOutsideWindow/Gather/GatherV2_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/Switch_3:1, cond_2/RandomCropImage/PruneCompleteleyOutsideWindow/Reshape, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Gather/GatherV2/axis)]]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, indices[1] = 2 is not in [0, 1)
         [[Node: cond_2/RandomCropImage/PruneCompleteleyOutsideWindow/Gather/GatherV2_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/Switch_3:1, cond_2/RandomCropImage/PruneCompleteleyOutsideWindow/Reshape, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Gather/GatherV2/axis)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training. FIFOQueue '_3_prefetch_queue' is closed and has insufficient elements (requested 1, current size 0)
         [[Node: prefetch_queue_Dequeue = QueueDequeueV2[component_types=[DT_STRING, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, ..., DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue)]]

There seems to be a problem reading the tf record, because after INFO:tensorflow:global_step/sec: 0 I do not get the expected training summaries shown in tutorial 2.
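
To make the error message concrete: `indices[1] = 2 is not in [0, 1)` is what a gather operation reports when it is asked for element 2 of a tensor that holds only one element. The sketch below imitates that bounds check in plain Python. `gather` is a hypothetical stand-in written for illustration, not TensorFlow's implementation, and the example values (a single stored class label, with boxes 0 and 2 surviving the random crop) are an assumption about what may be happening here, not a confirmed diagnosis.

```python
def gather(params, indices):
    """Pick params[i] for each i in indices, with a GatherV2-style bounds check."""
    for pos, idx in enumerate(indices):
        if not 0 <= idx < len(params):
            raise IndexError(
                "indices[%d] = %d is not in [0, %d)" % (pos, idx, len(params))
            )
    return [params[i] for i in indices]


# Hypothetical values: one class label stored, but boxes 0 and 2 survive the crop.
labels = [1]
kept_boxes = [0, 2]

try:
    gather(labels, kept_boxes)
except IndexError as err:
    message = str(err)
    print(message)
```

If this reading is right, the failing tensor in `PruneCompleteleyOutsideWindow` has length 1 while the image has at least three boxes, which points at the per-object lists in the record rather than at the record path.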

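I found many questions related to this, and the usual explanation is that the path to the tf record is wrong or that the tf record is empty. In my case the path is correct and my tf record is 55000 KB.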
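
File size alone does not say how many examples the record holds. As a rough check that does not depend on the installed TensorFlow version, the sketch below walks the TFRecord framing (an 8-byte little-endian length, a 4-byte CRC, the payload, another 4-byte CRC) and counts the records. `count_tfrecord_entries` is a hypothetical helper name, and the CRCs are skipped rather than verified, so this only detects empty or truncated files, not corrupted payloads.

```python
import struct


def count_tfrecord_entries(path):
    """Count records by walking the TFRecord framing; CRCs are skipped, not verified."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)           # uint64 little-endian payload length
            if not header:
                break                    # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            f.seek(4, 1)                 # skip the CRC of the length field
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record payload")
            f.seek(4, 1)                 # skip the CRC of the payload
            count += 1
    return count
```

Calling it on the path given as `input_path` in the config should return the number of images that were converted; a result of 0 would confirm the "empty record" explanation, while a plausible count would rule it out.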

I am following tutorial 1 and referring to tutorial 2. My images are 300 x 300 and I have 1 class.

My code to generate the tf record is as follows:

import os
import io
import glob
import hashlib
import pandas as pd
import xml.etree.ElementTree as ET
import tensorflow as tf
import random
import sys
sys.path.append("D:\\Databases\\Coco\\cctv\\object_detection")
sys.path.append("D:\\Databases\\Coco\\cctv\\object_detection\\utils")
from PIL import Image
#import Image
import dataset_util #from object_detection.utils

'''
this script automatically divides the dataset into training and evaluation (10% for evaluation)
this script also shuffles the dataset before converting it into tfrecords
if you have a different dataset structure (rather than Pascal VOC) you need to change
the paths and names of the input directories (images and annotations) and the output tfrecord names.
(note: this script can be enhanced to use flags instead of changing parameters in the code).
default expected directories tree:
dataset- 
   -JPEGImages
   -Annotations
    dataset_to_tfrecord.py   
to run this script:
$ python dataset_to_tfrecord.py 
'''
def create_example(xml_file):
        #process the xml file
        xmlp = ET.XMLParser(encoding="utf-8")
        tree = ET.parse(xml_file,parser=xmlp)  
        #tree = ET.parse(xml_file)
        root = tree.getroot()       
        image_name =str(xml_file).rsplit('\\', 1)[-1].replace('.xml', '.jpg')[:-1]
        print("image_name",image_name)
        file_name = image_name.encode('utf8')
        size=root.find('size')
        width = int(size[0].text)
        height = int(size[1].text)

        xmin = []
        ymin = []
        xmax = []
        ymax = []
        classes = []
        classes_text = []
        for x_min in root.findall(".//xmin"):
            xmin.append(float(x_min.text) / float(width))            
        for y_min in root.findall(".//ymin"):
            ymin.append(float(y_min.text) / float(height))
        for y_max in root.findall(".//ymax"):
            ymax.append(float(y_max.text) / float(height))
        for x_max in root.findall(".//xmax"):
            xmax.append(float(x_max.text)/ float(width)) 

           #if you have more than one classes in dataset you can change the next line
           #to read the class from the xml file and change the class label into its 
           #corresponding integer number, u can use next function structure
        classes.append(1)   # i wrote 1 because i have only one class(person)
        classes_text.append('pedestrian'.encode('utf8'))

        #read corresponding image
        full_path = 'C:/Users/SDy/Desktop/testfinalALL/'+image_name  #provide the path of images directory     
        with tf.gfile.GFile(full_path, 'rb') as fid:
            encoded_jpg = fid.read()          
        encoded_jpg_io = io.BytesIO(encoded_jpg)
        image = Image.open(encoded_jpg_io)
        if image.format != 'JPEG':
           raise ValueError('Image format not JPEG')
        key = hashlib.sha256(encoded_jpg).hexdigest()

        #create TFRecord Example
        example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': dataset_util.int64_feature(height),
            'image/width': dataset_util.int64_feature(width),
            'image/filename': dataset_util.bytes_feature(file_name),
            'image/source_id': dataset_util.bytes_feature(file_name),
            'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
            'image/encoded': dataset_util.bytes_feature(encoded_jpg),
            'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
            'image/object/bbox/xmin': dataset_util.float_list_feature(xmin),
            'image/object/bbox/xmax': dataset_util.float_list_feature(xmax),
            'image/object/bbox/ymin': dataset_util.float_list_feature(ymin),
            'image/object/bbox/ymax': dataset_util.float_list_feature(ymax),
            'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
            'image/object/class/label': dataset_util.int64_list_feature(classes),

        })) 

        print(example)
        return example  

def main(_):
    writer_train = tf.python_io.TFRecordWriter('C:/Users/SD/Desktop/tfrecordfinalALL/test2.record')     
    #writer_test = tf.python_io.TFRecordWriter('test.record')
    #provide the path to annotation xml files directory
    filename_list=tf.train.match_filenames_once('C:/Users/SD/Desktop/testxmlfinalALL/*.xml')
    init = (tf.global_variables_initializer(), tf.local_variables_initializer())
    sess=tf.Session()
    sess.run(init)
    list=sess.run(filename_list)
    random.shuffle(list)   #shuffle files list
    i=1 
    trn=0   #to count number of images for training
    for xml_file in list:
      print(xml_file)
      print("                   jjj")
      print(i)
      example = create_example(xml_file)
      writer_train.write(example.SerializeToString())
      trn=trn+1
      i=i+1

    #writer_test.close()
    writer_train.close()
    print('Successfully converted dataset to TFRecord.')
    print('training dataset: # ')
    print(trn)

if __name__ == '__main__':
    tf.app.run()
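
One thing worth checking in the script above: the four box lists get one entry per `<xmin>`, `<ymin>`, `<xmax>` and `<ymax>` element in the annotation, while `classes` and `classes_text` always get exactly one entry. For an image with several annotated people the label lists are then shorter than the box lists. The sketch below is a hedged alternative that loops over `<object>` elements so every list grows together. It assumes the standard Pascal VOC layout (`size/width`, `size/height`, `object/name`, `object/bndbox`); `parse_voc_objects` and the `label_ids` argument are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET


def parse_voc_objects(xml_text, label_ids):
    """Return aligned box and label lists, one entry per <object> element."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = float(size.find("width").text)
    height = float(size.find("height").text)

    boxes = {"xmin": [], "ymin": [], "xmax": [], "ymax": []}
    classes, classes_text = [], []
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        # Normalize pixel coordinates to [0, 1], as the original script does.
        boxes["xmin"].append(float(box.find("xmin").text) / width)
        boxes["ymin"].append(float(box.find("ymin").text) / height)
        boxes["xmax"].append(float(box.find("xmax").text) / width)
        boxes["ymax"].append(float(box.find("ymax").text) / height)
        classes.append(label_ids[name])           # one label id per box
        classes_text.append(name.encode("utf8"))  # one label name per box
    return boxes, classes, classes_text


# Made-up annotation with two people in a 300 x 300 image.
sample = """<annotation>
  <size><width>300</width><height>300</height><depth>3</depth></size>
  <object><name>pedestrian</name>
    <bndbox><xmin>30</xmin><ymin>60</ymin><xmax>90</xmax><ymax>150</ymax></bndbox>
  </object>
  <object><name>pedestrian</name>
    <bndbox><xmin>150</xmin><ymin>0</ymin><xmax>300</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""

boxes, classes, classes_text = parse_voc_objects(sample, {"pedestrian": 1})
print(len(boxes["xmin"]), len(classes), len(classes_text))
```

The returned lists can be passed to the same `dataset_util` feature helpers used in `create_example`. Whether the length mismatch is the actual cause here is an assumption, but it would be consistent with a gather over a length-1 tensor failing on index 2.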

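After this finishes, I move the tf record to the training folder, whose path is defined in the config file.

My ssd_mobilenet model is shown in the figure.

The training paths in my config file are shown in the figure.

My ssd_mobilenet_v2_coco_config is: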

# SSD with Mobilenet v2 configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      use_depthwise: true
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "D:/Databases/Coco/cctv/tf/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
  fine_tune_checkpoint_type:  "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "D:/Code/Image/cctvmodel/tfrecordfinalALL/train2.record"
  }
  label_map_path: "D:/Code/Image/cctvmodel/label_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "D:/Code/Image/cctvmodel/tfrecordfinalALL/test2.record"
  }
  label_map_path: "D:/Code/Image/cctvmodel/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
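
Since the usual advice for this symptom is to double-check paths, a small script can list every path the config refers to and report the ones that are missing on disk. This is a minimal sketch using a regular expression rather than the Object Detection API's own config parser; `referenced_paths` and `report_missing` are hypothetical helper names, and looking for a `.index` file next to the checkpoint prefix assumes the checkpoint was saved in the usual `model.ckpt.index` / `model.ckpt.data-*` layout.

```python
import os
import re


def referenced_paths(config_text):
    """Collect the quoted values of path-like fields from a pipeline config."""
    pattern = r'^\s*(input_path|label_map_path|fine_tune_checkpoint)\s*:\s*"([^"]*)"'
    return re.findall(pattern, config_text, flags=re.MULTILINE)


def report_missing(config_text):
    """Return the (field, path) pairs whose file is not on disk."""
    missing = []
    for field, path in referenced_paths(config_text):
        # A checkpoint prefix such as model.ckpt has no file of its own;
        # look for the accompanying .index file instead (assumed layout).
        target = path + ".index" if field == "fine_tune_checkpoint" else path
        if not os.path.exists(target):
            missing.append((field, path))
    return missing
```

Reading the config file into a string and passing it to `report_missing` should return an empty list when everything is in place. Note also that the conversion script above writes `test2.record`, while `train_input_reader` points at `train2.record`, so it may be worth confirming that both files exist and that each was generated from the intended set of images.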

My label_map.pbtxt contains the following:

item {
  id: 1
  name: 'pedestrian'
}
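
The ids in the label map, the `classes` values written to the record, and `num_classes` in the config all have to agree. The sketch below reads a simple label map like the one above with regular expressions and checks that every id lies between 1 and `num_classes`. `parse_label_map` and `ids_within_range` are hypothetical helpers, and the regex approach is an assumption that only holds for flat `item { id / name }` entries, not for every valid label map.

```python
import re


def parse_label_map(text):
    """Map each item name to its id, for simple label maps like the one above."""
    items = re.findall(r"item\s*\{(.*?)\}", text, flags=re.DOTALL)
    mapping = {}
    for body in items:
        item_id = int(re.search(r"\bid\s*:\s*(\d+)", body).group(1))
        name = re.search(r"\bname\s*:\s*['\"]([^'\"]+)['\"]", body).group(1)
        mapping[name] = item_id
    return mapping


def ids_within_range(mapping, num_classes):
    """True when every id is in 1..num_classes (id 0 is left for background)."""
    return all(1 <= item_id <= num_classes for item_id in mapping.values())


label_map = parse_label_map("""
item {
  id: 1
  name: 'pedestrian'
}
""")
print(label_map, ids_within_range(label_map, num_classes=1))
```

For the files in this question the check passes, which suggests the label map itself is not the problem.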

Finally, to run the model from the cmd prompt, I use the following command:

(tensorflow) c:\models-master\research>python object_detection/legacy/train.py --logtostderr --train_dir=training/model2/ --pipeline_config_path=training/ssd_mobilenet_v2_2.config

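Based on the information provided, what is causing the problem? Either training is not running or the tf record is not being opened. Apart from that, a model was still generated, and when I test it I get the following result.

Tags: python, tensorflow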
