TPU error: ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task

Problem description

I am a GCP user and have created a VM in a US region. I cloned https://github.com/kingoflolz/mesh-transformer-jax and ran train.py.

Internally, it creates a TPU and offloads the heavy computation to it. It also starts a Redis server between the two. When the code reaches line 39 of mesh-transformer-jax/mesh_transformer/TPU_cluster.py:

self.param_count = ray.get(params)[0]

it produces the following error:

TPU cluster setup start (pid=6093, ip=10.128.0.47) 2021-07-07 10:52:52.676541: F external/org_tensorflow/tensorflow/core/tpu/tpu_executor_init_fns.inc:110] TpuTransferManager_ReadDynamicShapes not available in this library.

2021-07-07 11:04:20,579 WARNING worker.py:1107 -- A worker died or was killed while executing task ffffffffffffffffcd0102814274b3f83c2a1f1c01000000.
Traceback (most recent call last):
  File "train.py", line 75, in <module>
    t = build_model(params, tpu_name, region, preemptible, version=args.version)
  File "/home/param_jeet/content-intelligence/Mesh-Transformer/mesh_transformer/build_model.py", line 64, in build_model
    t = TPUCluster((tpu_size // cores_per_replica, cores_per_replica), len(conns), model_fn)
  File "/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/dafunc.py", line 185, in <lambda>
    return wraps(func)(lambda *args, **kwargs : func_timeout(defaultTimeout, func, args=args, kwargs=kwargs))
  File "/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/home/param_jeet/content-intelligence/Mesh-Transformer/mesh_transformer/TPU_cluster.py", line 51, in __init__
    self.param_count = ray.get(params)[0]
  File "/home/param_jeet/.local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/param_jeet/.local/lib/python3.8/site-packages/ray/worker.py", line 1458, in get
    raise value

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.

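I have not been able to identify the cause. Can anyone here help me work out what might be causing this and how to get rid of the error? I also have the log files and can share them if that helps.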

Tags: google-compute-engine, google-cloud-tpu

Solution
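
The RayActorError message points at the python-core-worker-*.log files, so a reasonable first step is to search those files for the fatal line. Below is a minimal sketch of that search, not a fix for the underlying problem. It makes several assumptions: the logs sit in Ray's default location, /tmp/ray/session_latest/logs, on the machine where the actor ran (ip=10.128.0.47 in the output above), the helper name scan_worker_logs is hypothetical, and the marker strings are simply copied from the error output in the question.

```python
import glob
import os

# Assumption: Ray's default log directory on the host that ran the actor.
DEFAULT_LOG_DIR = "/tmp/ray/session_latest/logs"

# Assumption: marker strings copied from the output in the question.
MARKERS = (
    "not available in this library",
    "Traceback (most recent call last)",
)


def scan_worker_logs(log_dir=DEFAULT_LOG_DIR, markers=MARKERS):
    """Return (path, line_number, text) for each worker-log line containing a marker."""
    hits = []
    pattern = os.path.join(log_dir, "python-core-worker-*.log")
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    for path, number, text in scan_worker_logs():
        print(f"{path}:{number}: {text}")
```

If the directory does not exist or contains no matching files, the function returns an empty list, which usually means the script was run on a different machine from the one that hosted the dead actor.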

