首页 > 解决方案 > Dask 工作人员在缺少 dep 密钥后挂起

问题描述

在分布式 GKE dask 集群中,我有一个图表在下面的回溯中停止。工作人员仪表板继续为 cpu 报告相同的恒定高值,而 GKE 仪表板显示 pod 的 CPU 接近于零。工作人员仪表板的“最后一次看到”值变为许多分钟。15 分钟后,我杀死了 GKE pod,但 dask 调度程序仍然指示工作人员存在并保持分配给它的任务。调度程序似乎被困在这个任务上——没有取得任何进展,也没有清理或重新启动失败的工作单元。

我正在使用 dask/distributed 2020.12.0、dask-gateway 0.9.0 和 xarray 0.16.2。

什么会导致钥匙丢失?

如何调试或解决这里的潜在问题?

编辑:

    image_chunks = image.to_delayed().ravel()
    labels_chunks = labels.to_delayed().ravel()

    results = []
    for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
        offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
        result = dask.delayed(stats)(image_chunk, labels_chunk, offsets, ...)
        results.append(result)
    ...
    dask_df = dd.from_delayed(results, meta=df_meta)
    dask_df = dask_df.groupby(['label', 'kind']).sum()

示例回溯 #1

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2627, in execute
    self.ensure_communicating()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1880, in ensure_communicating
    to_gather, total_nbytes = self.select_keys_for_gather(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1985, in select_keys_for_gather
    total_bytes = self.tasks[dep].get_nbytes()
KeyError: 'xarray-image-bc4e1224600f3930ab9b691d1009ed0c'
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fac3d04f4f0>>, <Task finished name='Task-143' coro=<Worker.execute() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/worker.py:2524> exception=KeyError('xarray-image-bc4e1224600f3930ab9b691d1009ed0c')>)

示例回溯 #2

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2146, in gather_dep
    await self.query_who_has(dep.key)
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2218, in query_who_has
    self.update_who_has(response)
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2227, in update_who_has
    self.tasks[dep].who_has.update(workers)
KeyError: "('rechunk-merge-607c9ba97d3abca4de3981b3de246bf3', 0, 0, 4, 4)"

标签: daskdask-distributed

解决方案


推荐阅读