InvalidArgumentError when initializing TPU

Problem description

I ran into this problem with the tf-2.3.0 stable build while initializing the TPU on Kaggle with the following code:

try:
    tpu_name = os.getenv('TPU_NAME')
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)
    print("running on tpu: ", tpu.master())
except ValueError:
    tpu = None
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

Output:

running on tpu:  grpc://10.0.0.2:8470

followed by this error:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-9-5394dc1b79b0> in <module>
      7 if tpu:
      8     tf.config.experimental_connect_to_cluster(tpu)
----> 9     tf.tpu.experimental.initialize_tpu_system(tpu)
     10     strategy = tf.distribute.experimental.TPUStrategy(tpu)
     11 else:

/opt/conda/lib/python3.7/site-packages/tensorflow/python/tpu/tpu_strategy_util.py in initialize_tpu_system(cluster_resolver)
    109     context.context()._clear_caches()  # pylint: disable=protected-access
    110 
--> 111     serialized_topology = output.numpy()
    112 
    113     # TODO(b/134094971): Remove this when lazy tensor copy in multi-device

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
   1061     """
   1062     # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1063     maybe_arr = self._numpy()  # pylint: disable=protected-access
   1064     return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
   1065 

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
   1029       return self._numpy_internal()
   1030     except core._NotOkStatusException as e:  # pylint: disable=protected-access
-> 1031       six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
   1032 
   1033   @property

/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
----------

If anyone has run into this problem or knows a workaround, please share. Thanks!

Tags: tensorflow, distributed-computing, kaggle, tpu

Solution


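If you manually updated TensorFlow in a Kaggle (or Colab) notebook, then despite the pip install, your TPU machine may still be running a different version of TensorFlow. Try the following code to make the TPU use the notebook's current TensorFlow version: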

!pip install cloud-tpu-client

import tensorflow as tf
from cloud_tpu_client import Client
print(tf.__version__)

Client().configure_tpu_version(tf.__version__, restart_type='ifNeeded')
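
One possible way to combine this version sync with the initialization code from the question is to run the sync first and only then connect to and initialize the TPU. The sketch below is an illustration, not a confirmed fix for every setup: the helper name `make_tpu_strategy` and the choice to pass in `tf` and `client` as arguments are assumptions made here so that the ordering of the steps is explicit and can be checked without TPU hardware.

```python
def make_tpu_strategy(tf, client, tpu_name=None):
    """Sync the TPU runtime version, then build a TPUStrategy.

    `tf` is the imported tensorflow module and `client` is a
    cloud_tpu_client.Client() instance. Passing them in is a
    hypothetical design choice made for this sketch.
    """
    # Step 1: ask the TPU worker to run the notebook's TensorFlow version.
    client.configure_tpu_version(tf.__version__, restart_type='ifNeeded')

    # Step 2: the same initialization sequence as in the question.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.experimental.TPUStrategy(resolver)
```

In a notebook this would be called as `strategy = make_tpu_strategy(tf, Client(), os.getenv('TPU_NAME'))`, after `import os`, `import tensorflow as tf` and `from cloud_tpu_client import Client`.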
