python - Why does the model stop training after epoch 1, without any warning, even though I specified epoch 100?
Problem description
I am trying to run a RetinaNet model on Google Colab with GPU support, but after starting epoch 1 it races through the 1000 steps without actually training and then stops, with no warning at all.
Here is the terminal output I get after running the train command:
!keras_retinanet/bin/train.py --tensorboard-dir /content/TrainingOutput --snapshot-path /content/TrainingOutput/snapshots --random-transform --steps 1000 pascal /content/PlumsVOC
Creating model, this may take a second...
2021-08-19 03:38:20.717241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.725782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.726450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.727359: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-19 03:38:20.727598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728167: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.263376: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.265247: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-19 03:38:21.265304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13839 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:356: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
"The `lr` argument is deprecated, use `learning_rate` instead.")
Model: "retinanet"
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2021-08-19 03:38:24.467332: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:24.467379: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-08-19 03:38:24.467435: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-08-19 03:38:24.588819: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:24.589029: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1972: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
warnings.warn('`Model.fit_generator` is deprecated and '
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)
2021-08-19 03:38:25.187697: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/50
2021-08-19 03:38:32.881842: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8004
1/1000 [..............................] - ETA: 3:31:30 - loss: 3.8681 - regression_loss: 2.7375 - classification_loss: 1.13062021-08-19 03:38:38.104179: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:38.104232: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2/1000 [..............................] - ETA: 17:05 - loss: 3.8988 - regression_loss: 2.7693 - classification_loss: 1.1295 2021-08-19 03:38:38.938537: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-08-19 03:38:38.940902: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2021-08-19 03:38:39.134281: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673] GpuTracer has collected 3251 callback api events and 3247 activity events.
2021-08-19 03:38:39.192167: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:39.289977: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
2021-08-19 03:38:39.355897: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.trace.json.gz
2021-08-19 03:38:39.455150: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
2021-08-19 03:38:39.462678: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.memory_profile.json.gz
2021-08-19 03:38:39.466401: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
Dumped tool data for xplane.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.xplane.pb
Dumped tool data for overview_page.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.overview_page.pb
Dumped tool data for input_pipeline.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.kernel_stats.pb
11/1000 [..............................] - ETA: 6:57 - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 50000 batches). You may need to use the repeat() function when building your dataset.
1000/1000 [==============================] - 17s 4ms/step - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267
Running network: 100% (4 of 4) |##########| Elapsed Time: 0:00:02 Time: 0:00:02
Parsing annotations: 100% (4 of 4) |######| Elapsed Time: 0:00:00 Time: 0:00:00
32 instances of class redPlum with average precision: 0.0000
0 instances of class greenPlum with average precision: 0.0000
mAP: 0.0000
Epoch 00001: saving model to /content/TrainingOutput/snapshots/resnet50_pascal_01.h5
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)
It saves the model weights, but the model does not detect any objects in the test images. What is going on? How can I fix this and train the model for the full number of epochs? Any help would be much appreciated, thanks.
Solution