Kaggle TPU: failed to connect to all addresses

Problem description

While trying to fit my model on a TPU on Kaggle, I ran into a problem.

The TPU is initialized:

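import tensorflow as tf

# Detect the TPU; the resolver raises ValueError when none is found.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f'Running on TPU {tpu.master()}')
except ValueError:
    tpu = None
if tpu:
    # Connect to the TPU cluster and initialize it before building the strategy.
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # No TPU: fall back to the default (single-device) strategy.
    strategy = tf.distribute.get_strategy()

AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')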

But when I try to fit my model, I get this error:

{{function_node __inference_train_function_64094}} failed to connect to all addresses
GRPC error information:{"created":"@1609444822.190891136","description":"Failed to pick
subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc",
file_line":3959,"referenced_errors": [{"created":"@1609444822.190889693"
,"description":"failed to connect to all addresses", […] 
[[{{node MultiDeviceIteratorGetNextFromShard}}]] [[RemoteCall][[IteratorGetNextAsOptional]]


Solution


You must create the model and the optimizer inside the strategy scope:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])
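
For illustration, here is a minimal sketch of the same pattern as a self-contained script. The create_model() body, its layer sizes, and the build_compiled_model() helper are hypothetical stand-ins, since the original post does not show them. The default strategy is used only so the sketch runs without a TPU.

```python
import tensorflow as tf


def create_model(num_classes: int = 10) -> tf.keras.Model:
    """Hypothetical stand-in for the asker's create_model()."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        # Raw logits, which is why the loss below uses from_logits=True.
        tf.keras.layers.Dense(num_classes),
    ])


def build_compiled_model(strategy: tf.distribute.Strategy) -> tf.keras.Model:
    """Hypothetical helper: build and compile the model under the strategy."""
    # Model and optimizer are both created inside the scope, so the
    # strategy controls how their variables are created.
    with strategy.scope():
        model = create_model()
        model.compile(
            optimizer='adam',
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['sparse_categorical_accuracy'],
        )
    return model


# Assumption: the default strategy stands in for the TPU strategy here.
# On Kaggle, pass the strategy object built in the question instead.
strategy = tf.distribute.get_strategy()
model = build_compiled_model(strategy)
```

The call to model.fit() itself can stay outside the scope. What matters in this sketch is that the model and optimizer are constructed inside it.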
