tensorflow - Kaggle TPU 不可用:无法连接到所有地址
问题描述
我是 ML 的新手。在尝试使用 TPU 方法完成数字识别时,遇到的问题真的让我很困扰。
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
Model = Sequential([
InputLayer((28, 28, 1)),
Dropout(0.1),
Conv2D(128, 3, use_bias=False),
LeakyReLU(0.05),
BatchNormalization(),
MaxPooling2D(2, 2),
Conv2D(64, 3, use_bias=False),
LeakyReLU(0.05),
BatchNormalization(),
MaxPooling2D(2, 2),
Flatten(),
Dense(128, use_bias=False),
LeakyReLU(0.05),
BatchNormalization(),
Dense(10, activation='softmax')
])
with strategy.scope():
Model.compile(optimizer='adam',
loss='categorical_crossentropy', metrics='accuracy')
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
然后我再次运行它。得到如下错误
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
一定是在某处失踪strategy.scopy():
我尝试了很多次,并且在许多其他笔记本上都成功了,但它们都是tf.data.Dataset
虽然,我仍然无法弄清楚这个简单的数字识别哪里错了。我一次又一次地搜索并在这里停留了 2 天,真的很生气。
完整代码在 https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6
是 TPU 版本。并且只能从Version 5
上面的代码修改。请帮我!
解决方案
修复了将它们更改为 tf.data.Dataset.( 不带GCS
)的问题
只使用本地tf.data.Dataset.
调用fit()
是可以的。但是Unavailable: failed to connect to all addresses
一旦ImageDataGenerator()
使用它就会失败。
# Fixed with changing to tf.data.Dataset.
ds1=tf.data.Dataset.from_tensor_slices((DS1,L1)).batch(128).prefetch(-1)
ds2=tf.data.Dataset.from_tensor_slices((DS2,L2)).batch(128).prefetch(-1)
...
...
History = Model.fit(ds1, epochs=Epochs,validation_data=ds2,
callbacks=[ReduceLR, Stop], verbose=1)
# one epoch time is not stable, sometimes faster, sometimes slower,
# but most time it's approximately same as GPU costs
使用一次失败ImageDataGenerator()
。
# Fail again with ImageDataGenerator() used
ds1=tf.data.Dataset.from_generator(lambda:ImageModifier.flow(DS1,L1),output_signature=(
tf.TensorSpec(shape=(28,28,1), dtype=tf.float32),
tf.TensorSpec(shape=(10), dtype=tf.float32))
).batch(128).prefetch(-1)
History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
-> 1102 context.async_wait()
1103 logs = tmp_logs # No error, now safe to assign to logs.
1104 end_step = step + data_handler.step_increment
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
2328 an error state.
2329 """
-> 2330 context().sync_executors()
2331
2332
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
643 """
644 if self._context_handle:
--> 645 pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
646 else:
647 raise ValueError("Context is not initialized.")
UnavailableError: 4 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Pad_2/paddings/_130]]
(1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_36/_238]]
(2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[IteratorGetNextAsOptional_3/_35]]
(3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
推荐阅读
- node.js - 安装 NodeJs 版本 16 后如何运行 Angular 项目?
- elasticsearch - elasticSearch 是否具有“JOIN”MYSQL 操作的类似物?
- google-cloud-platform - Google Cloud Run 错误:无效命令 \"/bin/sh\":从 Docker 映像部署时找不到文件
- redis - 从 Lua 进行 Redis 调用时出现语法错误
- android - 有没有办法删除这个“无法创建抽象类的实例”?
- r - 如何有效地将 NCDF 时间转换为适当的单位
- python - 如何在 Tkinter 窗口中绘制 Matplotlib 图?
- node.js - React-Admin 显示来自 mongoDB 的数据,但我无法编辑
- magento2 - 付款后拦截器 - Magento 2
- matlab - 在 for 循环 Matlab 中将 eeglab 保存到 mat 文件