Kaggle TPU Unavailable: failed to connect to all addresses

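Problem description

I am new to ML. While trying to do digit recognition with the TPU approach, I ran into a problem that has really been troubling me.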

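import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (InputLayer, Dropout, Conv2D, LeakyReLU,
                                     BatchNormalization, MaxPooling2D,
                                     Flatten, Dense)

# Connect to the Kaggle TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')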
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    Model = Sequential([

        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')

    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics='accuracy') 

Training fails with the following error:

CancelledError: 4 root error(s) found.
  (0) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (1) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (2) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (3) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]

Function call stack:
train_function -> train_function -> train_function -> train_function

Then I ran it again and got the following error:

UnavailableError: 9 root error(s) found.
  (0) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_11/switch_pred/_107/_78]]
  (1) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
  (2) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
  (3) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]

Function call stack:
train_function -> train_function -> train_function -> train_function

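I must be missing a strategy.scope() somewhere.

I tried many times, and it worked in many other notebooks, but those all use tf.data.Dataset.

Still, I can't figure out what is wrong with this simple digit recognition. I have searched again and again and have been stuck here for 2 days, which is really frustrating.

The full code is at https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286

Version 6 is the TPU version. It only modifies the code above relative to Version 5. Please help me!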

Tags: tensorflow, tpu

Solution


Fixed by changing them to tf.data.Dataset (without GCS).

Calling fit() with only a local tf.data.Dataset works. But it fails with Unavailable: failed to connect to all addresses as soon as ImageDataGenerator() is used.

# Fixed with changing to tf.data.Dataset.

ds1=tf.data.Dataset.from_tensor_slices((DS1,L1)).batch(128).prefetch(-1)
ds2=tf.data.Dataset.from_tensor_slices((DS2,L2)).batch(128).prefetch(-1)

...
...


History = Model.fit(ds1, epochs=Epochs,validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)

# Time per epoch is not stable: sometimes faster, sometimes slower,
# but most of the time it is roughly the same as on GPU.
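
For from_tensor_slices to work as above, the training data has to sit in memory as arrays whose shapes match the model: (28, 28, 1) images and, because the model is compiled with categorical_crossentropy, one-hot labels of length 10. The following is only a sketch of that preparation step. The function name prepare_arrays, the flattened 784-pixel input layout, and the scaling to [0, 1] are assumptions for illustration; the original notebook may prepare DS1 and L1 differently.

```python
import numpy as np


def prepare_arrays(pixels, digits, num_classes=10):
    """Return (images, one_hot) as float32 arrays ready for from_tensor_slices.

    Hypothetical helper: assumes `pixels` holds flattened 28x28 images with
    values in 0..255 and `digits` holds integer class labels.
    """
    pixels = np.asarray(pixels, dtype=np.float32)
    # Scale to [0, 1] (an assumption) and add the trailing channel axis
    # expected by InputLayer((28, 28, 1)).
    images = (pixels / 255.0).reshape(-1, 28, 28, 1)
    # One-hot encode the labels to match categorical_crossentropy.
    labels = np.asarray(digits, dtype=np.int64)
    one_hot = np.eye(num_classes, dtype=np.float32)[labels]
    return images, one_hot


# Stand-in data: 4 random flattened images with labels 0..3.
rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(4, 784))
fake_digits = np.array([0, 1, 2, 3])

images, one_hot = prepare_arrays(fake_pixels, fake_digits)
print(images.shape, one_hot.shape)
```

The two returned arrays play the role of DS1 and L1 in the tf.data.Dataset.from_tensor_slices((DS1, L1)) call shown above.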

It fails again once ImageDataGenerator() is used:

# Fail again with ImageDataGenerator() used

ds1=tf.data.Dataset.from_generator(lambda:ImageModifier.flow(DS1,L1),output_signature=(
    tf.TensorSpec(shape=(28,28,1), dtype=tf.float32),
    tf.TensorSpec(shape=(10), dtype=tf.float32))
).batch(128).prefetch(-1)

History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError                          Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
      1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)

/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
-> 1102                 context.async_wait()
   1103               logs = tmp_logs  # No error, now safe to assign to logs.
   1104               end_step = step + data_handler.step_increment

/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
   2328   an error state.
   2329   """
-> 2330   context().sync_executors()
   2331 
   2332 

/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
    643     """
    644     if self._context_handle:
--> 645       pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
    646     else:
    647       raise ValueError("Context is not initialized.")

UnavailableError: 4 root error(s) found.
  (0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[Pad_2/paddings/_130]]
  (1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_36/_238]]
  (2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[IteratorGetNextAsOptional_3/_35]]
  (3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
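
A plausible reason for this failure, which the source does not confirm, is that from_generator depends on Python code running in the notebook process, which a remote TPU worker cannot reach. If that is the cause, one way to keep augmentation is to express it with TensorFlow ops inside Dataset.map so that the pipeline contains no Python generator. The sketch below is an illustration under those assumptions and not the original author's fix: the pad-and-crop shift merely stands in for whatever ImageModifier does, and augment, make_augmented_dataset, and the random placeholder arrays are hypothetical. tf.data.AUTOTUNE needs TensorFlow 2.4 or later and has the same value as the -1 passed to prefetch() above.

```python
import numpy as np
import tensorflow as tf


def augment(image, label):
    # Random shift of up to 2 px: zero-pad to 32x32, then crop back to 28x28.
    image = tf.image.resize_with_crop_or_pad(image, 32, 32)
    image = tf.image.random_crop(image, size=(28, 28, 1))
    return image, label


def make_augmented_dataset(images, labels, batch_size=128):
    # Start from in-memory arrays, as in the working from_tensor_slices example.
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(len(images))
    # The augmentation is traced into the tf.data graph; no Python generator.
    ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder=True keeps every batch the same static shape.
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)


# Placeholder data standing in for DS1 and L1 (hypothetical values).
images = np.random.rand(256, 28, 28, 1).astype("float32")
labels = tf.keras.utils.to_categorical(np.random.randint(0, 10, 256), 10)

ds = make_augmented_dataset(images, labels)
```

The resulting ds can be passed to Model.fit(ds, epochs=Epochs, verbose=1) in the same way as the from_tensor_slices dataset that trained successfully.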
