Adanet: ERROR:tensorflow: Model diverged with loss = NaN

Problem description

While training an AdaNet network following the tutorial notebook (adanet_objective) and switching to my own dataset, I ran into a loss-diverged error. The full details of the error are here:

 INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function Estimator._create_model_fn.<locals>._adanet_model_fn at 0x7f3d42ba6c20>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into ./logs/uniform_average_ensemble_baseline/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
ERROR:tensorflow:Model diverged with loss = NaN.

I clear the logs directory every time I try something, since I know leftover checkpoints can also cause this problem.

My dataset is similar to the example one, and nothing in the code is hard-coded to a specific size.

Sample from the reference dataset:

x_train[0], y_train[0]
array([  1.23247,   0.     ,   8.14   ,   0.     ,   0.538  ,   6.142  ,
         91.7    ,   3.9769 ,   4.     , 307.     ,  21.     , 396.9    ,
         18.72   ]),
 15.2
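
The reference sample above looks like the Boston housing data that the adanet_objective notebook loads; assuming tf.keras is available, the same split can be reproduced with:

import tensorflow as tf

# Load the Boston housing regression data (13 features, house-price label).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data()
print(x_train[0], y_train[0])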

Sample from my dataset:

x_train[0], y_train[0]
array([1977.        ,    0.        ,    1.        ,    0.225     ,
       0.11111111,    0.10169492,    0.22072072,    0.48296441,
       0.00934278,    0.25761378,    0.29399057,    0.04283397,
       0.3241088 ,    0.20679821,    0.09214841,    2.31192802,
      48.97102657,    0.14316477,    0.17729479,    0.22970639,
       0.33924853]),
 0.09940174401130854

So obviously the values change, since it is my dataset, but the types are the same, and I do not understand why training works on the reference dataset while mine gives this loss error. My dataset is normalized here, but I also tried it without normalization and got the same error.
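
A minimal sketch of such a standardization step, plus a quick NaN/inf sanity check (it assumes x_train, y_train and x_test are numpy arrays like the samples printed above):

import numpy as np

# Make sure no feature or label is NaN/inf before feeding the estimator.
print("non-finite features:", np.sum(~np.isfinite(x_train)))
print("non-finite labels:  ", np.sum(~np.isfinite(y_train)))

# Standardize each column to zero mean / unit variance; the first column
# (values around 1977) is on a very different scale from the rest.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
std[std == 0] = 1.0              # guard against constant columns
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std   # reuse the training statistics for the test set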

I tried changing learning_rate, setting it as low as 0.00001, but it did not affect anything.
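
In the adanet_objective notebook the learning rate is passed to the optimizer used by the subnetwork builder, so lowering it looks roughly like the sketch below (TF 1.x style; RMSProp, the exact value and the gradient-clipping wrapper are illustrative assumptions, not code from the notebook):

import tensorflow as tf  # TF 1.x, as used by the adanet_objective notebook

LEARNING_RATE = 1e-5  # the lowest value tried here

# Optimizer handed to the subnetwork builder / adanet.Estimator.
optimizer = tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE)

# Optionally clip gradients so one badly scaled feature cannot blow up an
# update and drive the loss to NaN (clip_gradients_by_norm is TF 1.x contrib).
optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, clip_norm=1.0)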

In case anyone has a useful clue: I have already looked around, but I found that things like clearing the logs or changing the learning_rate do not work.

As I said, the model works fine on the first dataset, and I do not see anything specific in the adanet_objective code that would make it work only on that dataset.

Tags: python, tensorflow, neural-network, loss, adanet

Solution

