KilledWorker exception

Problem description

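I am using Coiled to launch a cluster and Dask to run some operations on CSV files read from an S3 bucket. However, at some point my workers get killed. When I check the logs, these are the tasks that are killing them: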

distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 65, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 70, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 71, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 86, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 1, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 8, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 45, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 39, 0) marked as failed because 3 workers died while trying to run it

So I then moved the CSV files from the S3 bucket to my local repo and ran the code there, but reading the CSV still fails.

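Another point: the CSV read works fine for the earlier data operations, but the workers are killed during some dummy encoding, .compute() calls, and date operations.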

Any idea what could be happening?

Tags: python, dask, dask-distributed, dask-ml, coiled

Solution


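There are at least two possibilities:

  1. The workers do not have enough resources to run their tasks; a common cause is running out of memory.

  2. The task itself is faulty. For example (one of many possible causes), the data types do not match, so a function that expects integers cannot run its computation on NaN values.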
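The dtype mismatch described in the second possibility can be reproduced in plain pandas. The following is only a minimal sketch: the series `s` and its values are hypothetical and are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical column: integer-like values with one missing entry.
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # the NaN forces the column to a float dtype

# Casting to a plain integer dtype fails because NaN has no integer form.
cast_error = None
try:
    s.astype("int64")
except (ValueError, TypeError) as exc:
    cast_error = exc
    print(f"cast failed: {exc}")

# Two possible ways around it (which one is right depends on the data):
nullable = s.astype("Int64")          # nullable integer dtype keeps the gap as <NA>
filled = s.fillna(0).astype("int64")  # or fill the gap with a chosen default
print(nullable.dtype, filled.tolist())
```

With Dask this kind of mismatch can appear only in some partitions, because `dask.dataframe.read_csv` infers dtypes from a sample at the start of the file. Passing explicit dtypes through its `dtype=` argument is one way to make every partition agree.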

To reduce the risk of tasks failing because of the second possibility, it is best to test the code on a pandas DataFrame first.
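One way to do such a test is to load a small sample with pandas and run the same kinds of steps on it before handing the work to the cluster. This is a rough sketch under stated assumptions: the inline CSV text and the column names (`id`, `category`, `created_at`, `amount`) are made up for illustration and stand in for one of the real files.

```python
import io

import pandas as pd

# Hypothetical stand-in for one CSV file; replace with a real path in practice.
csv_text = """id,category,created_at,amount
1,a,2021-01-01,10
2,b,2021-01-02,
3,a,2021-01-03,30
"""

# Read only the first rows so the test stays cheap.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)

# Run the same kinds of steps mentioned in the question:
# date parsing and dummy encoding.
sample["created_at"] = pd.to_datetime(sample["created_at"])
encoded = pd.get_dummies(sample, columns=["category"])

# Inspect dtypes and missing values; surprises here often explain failures.
print(encoded.dtypes)
print(sample.isna().sum())

# In-memory size of the sample, as a rough hint of how much memory a
# partition may need (relevant to the first possibility).
mem_bytes = sample.memory_usage(deep=True).sum()
print(f"sample uses {mem_bytes} bytes in memory")
```

If the steps pass on the sample but the workers still die on the cluster, that points more toward a resource problem such as memory than toward a bug in the task.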

