dask - dask jobqueue worker failure at startup 'Resource temporarily unavailable'
Problem description
I'm running dask over slurm via jobqueue and I have been getting 3 errors pretty consistently...
Basically my question is: what could be causing these failures? At first glance the problem is that too many workers are writing to disk at once, or my workers are forking into many other processes, but it's pretty difficult to track that down. I can ssh into the node but I'm not seeing an abnormal number of processes, and each node has a 500 GB SSD, so I shouldn't be writing excessively.
Everything below this is just information about my configurations and such
My setup is as follows:
cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q', name=args.name,
                       env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)
client = await Client(cluster, processes=False, asynchronous=True)
I'm honestly not even sure whether processes=False should be set.
I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to only need 1), and 1 task (-n). It sets off all of my jobs via the SLURMCluster config from above. I dumped my slurm submission scripts to files and they look reasonable.
Each job is not complex; it is a subprocess.call() to a compiled executable that takes 1 core and 2-4 GB of memory. I need the Client call and further calls to be asynchronous because I have a lot of conditional computations. So each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell.
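For reference, the per-task pattern described above is essentially the following (a minimal sketch; the executable path is a hypothetical stand-in for the real compiled binary):

```python
import subprocess

def run_job(exe, *args):
    # Each dask task wraps one call to the compiled executable;
    # subprocess.call blocks until the child exits and returns its exit code.
    return subprocess.call([exe, *map(str, args)])

# /bin/true stands in for the real 1-core, 2-4 GB executable.
rc = run_job("/bin/true")
```

Note that subprocess.call forks a child and usually also a shell or exec'd binary, so each task contributes more than one entry toward the per-user process limit.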
Imposed by the scheduler we have
>> ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 512
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 1031203
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
And each node has 64 cores, so I don't really think I'm hitting any limits.
I'm using a jobqueue.yaml file that looks like:
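The limit that matters here can be read directly from Python with the standard resource module (Linux-only sketch; RLIMIT_NPROC is the kernel limit behind `ulimit -u`, and it counts threads, not just processes, per user — the 14-threads-per-worker figure below is an assumption you should measure on your own workers):

```python
import resource

# Soft/hard caps on the number of processes + threads this user may own.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("ulimit -u soft:", soft, "hard:", hard)

# Rough capacity check: how many ~14-thread workers fit under the soft limit?
THREADS_PER_WORKER = 14  # assumed figure; measure on your own workers
if soft != resource.RLIM_INFINITY:
    print("approx. workers per user:", soft // THREADS_PER_WORKER)
```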
slurm:
  name: dask-worker
  cores: 1                   # Total number of cores per job
  memory: 2                  # Total amount of memory per job
  processes: 1               # Number of Python processes per job
  local-directory: /scratch  # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs
I would appreciate any advice at all! Full log is below.
FORK BLOCKING IO ERROR
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker
Aborted!
CANT START NEW THREAD ERROR
BLOCKING IO ERROR
EDIT:
Another piece of the puzzle:
It looks like dask_worker is running multiple multiprocessing.forkserver calls? Does that sound reasonable?
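One way to check whether forkserver children and worker threads are adding up is to count a process's threads directly via /proc (Linux-only sketch, counting the current process here as an example):

```python
import os

def thread_count(pid="self"):
    # Each entry under /proc/<pid>/task is one thread (LWP) of that process;
    # these are exactly the things RLIMIT_NPROC (`ulimit -u`) counts.
    return len(os.listdir(f"/proc/{pid}/task"))

n = thread_count()  # threads in the current interpreter
```

Summing this over every PID a worker owns (nanny, worker, forkserver, shell, executable) gives the real footprint to compare against `ulimit -u`.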
Solution
This problem was caused by ulimit -u being too low.
As it turns out, each worker has a few processes associated with it, and the Python ones have multiple threads. You end up with roughly 14 threads that all count against your ulimit -u. Mine was set to 512, and on a 64-core system I was probably hitting ~896. It looks like the maximum threads-per-process I could have had was 8.
Solution: in .zshrc (.bashrc) I added the line
ulimit -u unlimited
and I haven't had any problems since.
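If editing .zshrc isn't convenient, the same fix can be applied per-process at startup with resource.setrlimit (a sketch, not part of the original answer; an unprivileged process may raise its soft limit up to the hard limit, so this mirrors `ulimit -u <hard>` for the worker and everything it forks):

```python
import resource

# Raise the soft `ulimit -u` to the hard limit before workers start forking.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
```

Putting this at the top of the worker startup script keeps the fix scoped to dask jobs instead of every login shell.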