dask jobqueue worker failure at startup 'Resource temporarily unavailable'

Problem description

I'm running dask over slurm via jobqueue and I have been getting 3 errors pretty consistently...

Basically my question is: what could be causing these failures? At first glance it looks like either too many workers are writing to disk at once, or my workers are forking into many other processes, but that is pretty difficult to track. I can ssh into the node but I'm not seeing an abnormal number of processes, and each node has a 500 GB SSD, so I shouldn't be writing excessively.

Everything below this is just information about my configurations and such
My setup is as follows:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q', name=args.name,
                       env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)

client = await Client(cluster, processes=False, asynchronous=True)

I suppose I'm not even sure whether processes=False should be set.

I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to need only 1), and 1 task (-n). This kicks off all of my jobs via the SLURMCluster config from above. I dumped my Slurm submission scripts to files and they look reasonable.

Each job is not complex: it is a subprocess.call() of a compiled executable that takes 1 core and 2-4 GB of memory (a rough sketch of what one task looks like is shown after the ulimit output below). I need the client call and further calls to be asynchronous because I have a lot of conditional computations. So each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell. Imposed by the scheduler we have:

>> ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       512
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  64
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 1031203
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

And each node has 64 cores, so I don't really think I'm hitting any limits.
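For context, here is a minimal sketch of roughly what one task looks like; run_binary, the executable name, and its argument are placeholders rather than my actual code:

import subprocess

def run_binary(arg):
    # Placeholder: call the compiled executable with a single argument
    return subprocess.call(["./my_executable", str(arg)])

# Submitted through the asynchronous client created above
future = client.submit(run_binary, 42)
result = await future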

I'm using a jobqueue.yaml file that looks like this:

slurm:
  name: dask-worker
  cores: 1                 # Total number of cores per job
  memory: 2                # Total amount of memory per job
  processes: 1                # Number of Python processes per job
  local-directory: /scratch       # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs

I would appreciate any advice at all! Full logs are below.

FORK BLOCKING IO ERROR


distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
    pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker

Aborted!

CANT START NEW THREAD ERROR

https://pastebin.com/ibYUNcqD

BLOCKING IO ERROR

https://pastebin.com/FGfxqZEk

EDIT: Another piece of the puzzle: it looks like dask_worker is spawning multiple multiprocessing.forkserver processes? Does that sound reasonable? (A quick way to count them is sketched after the link below.)

https://pastebin.com/r2pTQUS4
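A quick way to count the processes and threads owned by my user on a node (a rough sketch; it assumes psutil is installed, which is not part of my setup above, and relies on the fact that ulimit -u counts threads as well as processes on Linux):

import getpass
import psutil  # assumption: third-party library, not part of the setup above

me = getpass.getuser()
mine = [p.info for p in psutil.process_iter(attrs=["username", "num_threads"])
        if p.info["username"] == me]
total_threads = sum(p["num_threads"] or 0 for p in mine)
print(f"{len(mine)} processes, {total_threads} threads owned by {me}")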

Tags: dask, slurm, python-3.7, dask-distributed

Solution


This problem was caused by ulimit -u being too low.

It turns out that each worker has a few processes associated with it, and the Python ones have multiple threads. In the end you wind up with roughly 14 threads per worker that count against your ulimit -u. Mine was set to 512, and on a 64-core system I was probably hitting around 896. It looks like the maximum number of threads per process I could have had was 8.
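A minimal sketch of that arithmetic, assuming roughly 14 threads per worker and one single-core worker per core on a 64-core node:

import resource

# ulimit -u corresponds to RLIMIT_NPROC, which on Linux counts threads, not just processes
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)   # 512 in my case

threads_per_worker = 14   # rough count per worker: nanny, worker, forkserver, executable, shell, ...
workers_per_node = 64     # one single-core worker per core

print(f"limit: {soft}, estimated usage: {threads_per_worker * workers_per_node}")   # 512 vs ~896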

Solution: in my .zshrc (or .bashrc) I added this line:

ulimit -u unlimited

I haven't had any problems since.
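A quick way to double-check that the new limit actually reaches the workers (a sketch using the asynchronous client from the question; nproc_limit is just an illustrative name):

import resource

def nproc_limit():
    # Report the worker's ulimit -u (RLIMIT_NPROC) as a (soft, hard) pair
    return resource.getrlimit(resource.RLIMIT_NPROC)

# client.run executes the function on every connected worker;
# with an asynchronous client the call has to be awaited
limits = await client.run(nproc_limit)
print(limits)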

