python - Adanet: ERROR:tensorflow: Model diverged with loss = NaN
Problem description
While training an Adanet network following the tutorial notebook (Adanet_objective) and switching to my own dataset, I ran into a loss = NaN error. The full log is here:
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.
WARNING:tensorflow:Estimator's model_fn (<function Estimator._create_model_fn.<locals>._adanet_model_fn at 0x7f3d42ba6c20>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into ./logs/uniform_average_ensemble_baseline/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
ERROR:tensorflow:Model diverged with loss = NaN.
I clear the logs every time I try something, since I know a stale checkpoint can also cause this problem.
My dataset is similar to the example one, and the code does not hard-code any specific sizes.
Sample from the reference dataset:
x_train[0], y_train[0]
array([ 1.23247, 0. , 8.14 , 0. , 0.538 , 6.142 ,
91.7 , 3.9769 , 4. , 307. , 21. , 396.9 ,
18.72 ]),
15.2
Sample from my dataset:
x_train[0], y_train[0]
array([1977. , 0. , 1. , 0.225 ,
0.11111111, 0.10169492, 0.22072072, 0.48296441,
0.00934278, 0.25761378, 0.29399057, 0.04283397,
0.3241088 , 0.20679821, 0.09214841, 2.31192802,
48.97102657, 0.14316477, 0.17729479, 0.22970639,
0.33924853]),
0.09940174401130854
So obviously the values differ, since it is my own dataset, but the types are the same, and I do not understand why training works on the reference dataset while mine produces a loss error. My dataset is normalized here, but I have also tried without normalizing it and got the same error.
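One way to rule out bad input values before blaming the model is to scan the arrays for NaN/inf entries and compare per-column scales. A minimal sketch (assuming `x_train`/`y_train` are NumPy arrays like the samples shown above; the helper name is hypothetical, not from the notebook):

```python
import numpy as np

def sanity_check(x, y):
    """Count NaN/inf entries and report per-column magnitudes.

    A single NaN/inf in the inputs, or one column on a wildly
    different scale, is enough to drive the loss to NaN.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return {
        "x_nan": int(np.isnan(x).sum()),
        "x_inf": int(np.isinf(x).sum()),
        "y_nan": int(np.isnan(y).sum()),
        "y_inf": int(np.isinf(y).sum()),
        "col_abs_max": np.abs(x).max(axis=0),  # spot out-of-scale columns
    }

# Hypothetical toy data, not the real dataset: one NaN, one large-scale column
x = np.array([[1977.0, 0.22], [np.nan, 0.48]])
y = np.array([0.099, 0.15])
print(sanity_check(x, y))
```

Running this on the real `x_train`/`y_train` before calling the estimator shows immediately whether the divergence comes from the data itself.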
I tried changing the learning_rate, setting it as low as 0.00001, but it did not change anything.
If anyone has a useful clue: I have already searched, but the suggestions I found, such as clearing the logs or changing the learning_rate, do not work.
As I said, the model works fine on the first dataset, and I do not see anything in the code that is specific to the Adanet_objective dataset.
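For reference, the per-column normalization mentioned above can be sketched in plain NumPy (a hypothetical helper, not taken from the notebook); it brings every feature into [0, 1], including a year-like column such as the 1977 value in the sample:

```python
import numpy as np

def minmax_scale_columns(x):
    """Scale each column to [0, 1] independently; constant columns become 0."""
    x = np.asarray(x, dtype=np.float64)
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid 0/0 on constant columns
    return (x - lo) / span

# Hypothetical rows: first column on a year scale, second already in [0, 1]
x = np.array([[1977.0, 0.225],
              [1980.0, 0.110],
              [1995.0, 0.520]])
scaled = minmax_scale_columns(x)
print(scaled.min(axis=0))  # [0. 0.]
print(scaled.max(axis=0))  # [1. 1.]
```

The key point is that the min/max must be computed per column, so one large-scale feature cannot dominate the gradients of the others.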
Solution