Running multiple files with Slurm without overloading the queue

Problem description

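I need to run a script in parallel over many (70,000) samples, and I don't want to submit all of them to the queue at once. How can I schedule 100 at a time, so that each time one finishes another can be queued?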

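Because my script runs another piece of software, a lot of files get written. I also need to extract the result from each of those files into a single results file.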

Here is what I came up with:

# set maximum number of processes to run in SLURM
MAX_QUEUE=200

Protein_sequence='MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDRTDASTTSSTAIEDIINPSLDPQSAASPVPSSSFFHDSRKPSTSTHLVRRGTPLGIYQTNLYGHNSRENTNPNSTLLSSKLLAHPPVPYGQNPDLLQHAVYRAQPSSGTTNAQPRQTTRRYQSHKSRPAFVNKLWSMLNDDSNTKLIQWAEDGKSFIVTNREEFVHQILPKYFKHSNFASFVRQLNMYGWHKVQDVKSGSIQSSSDDKWQFENENFIRGREDLLEKIIRQKGSSNNHNSPSGNGNPANGSNIPLDNAAGSNNSNNNISSSNSFFNNGHLLQGKTLRLMNEANLGDKNDVTAILGELEQIKYNQIAISKDLLRINKDNELLWQENMMARERHRTQQQALEKMFRFLTSIVPHLDPKMIMDGLGDPKVNNEKLNSANNIGLNRDNTGTIDELKSNDSFINDDRNSFTNATTNARNNMSPNNDDNSIDTASTNTTNRKKNIDENIKNNNDIINDIIFNTNLANNLSNYNSNNNAGSPIRPYKQRYLLKNRANSSTSSENPSLTPFDIESNNDRKISEIPFDDEEEEETDFRPFTSRDPNNQTSENTFDPNRFTMLSDDDLKKDSHTNDNKHNESDLFWDNVHRNIDEQDARLQNLENMVHILSPGYPNKSFNNKTSSTNTNSNMESAVNVNSPGFNLQDYLTGESNSPNSVHSVPSNGSGSTPLPMPNDNDTEHASTSVNQGENGSGLTPFLTVDDHTLNDNNTSEGSTRVSPDIKFSATENTKVSDNLPSFNDHSYSTQADTAPENAKKRFVEEIPEPAIVEIQDPTEYNDHRLPKRAKK'

# 5' primer to add at "N" terminal (left of the sequence)
p5=${Protein_sequence:463:30} # 30 residues starting at 0-based offset 463; name must match the variable defined above

header=true # file has header and I have to skip it

# open file containing the sequence fused at the right of p5
for insert in $(awk 'BEGIN{FS=","}{print $2}' "$1")
do
    # if header, then continue with next iteration and flag header as false
    if [ $header = true ]
    then
        header=false
    else
        # write fasta file (this is the input of psipred); data is passed as arguments, not as the format string
        printf '>%s\n%s%s\n' "$insert" "$p5" "$insert" > "${insert}.fasta"

        # check how many processes are in the queue (-h suppresses the squeue header line)
        queue=$(squeue -h -u aerijman | wc -l)

        # while the queue is full, harvest finished results and wait for a free slot
        while [ "$queue" -ge "$MAX_QUEUE" ]
        do
            # take the chance to find *horiz files which contain the result
            for prefix in *horiz
            do
                [ -e "$prefix" ] || continue # no result files yet
                # extract the predicted secondary-structure string and append it to a single file with all results
                horiz=$(while read -r line; do if [ "${line:0:4}" == Pred ]; then printf '%s' "${line:6}"; fi; done < "$prefix")
                # label the record with the insert taken from the file name (assumes <insert>.horiz), not the insert being submitted now
                printf '>%s%s\n%s\n' "$p5" "${prefix%.horiz}" "$horiz" >> horiz.results
                # rm all side files (from psipred-blast)
                rm "${prefix%horiz}"*
            done

            # poll periodically instead of busy-waiting, so squeue is not hammered
            sleep 30
            queue=$(squeue -h -u aerijman | wc -l)
        done

        # a slot is free: submit the current insert (previously it was skipped whenever the queue was full)
        sbatch -p campus -c 1 --job-name="${insert}" --wrap="runpsipred ${insert}.fasta"
    fi
done

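I apologize for including so much irrelevant information in this script, but I believe my amateurish approach could be improved with a smarter way of having the loop watch for openings in the queue.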

Any help would be greatly appreciated!

Tags: linux, bash, schedule, slurm

Solution
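One option that may fit the "100 at a time" requirement is a Slurm job array with a throttle: `sbatch --array=1-N%100` asks the scheduler to keep at most 100 tasks of the array running at once, so no monitoring loop is needed. The sketch below is only an illustration under assumptions: the file name `run_one.sh`, the helper `insert_for_task`, and the input name `inserts.csv` are hypothetical, and the CSV is assumed to have one header row with the insert in column 2, as in the question.

```shell
#!/bin/bash
# Hypothetical array worker; the file name run_one.sh and the helper name are assumptions.
# Possible submission (the "%100" suffix limits the array to 100 simultaneously running tasks):
#   sbatch -p campus -c 1 --array=1-1000%100 run_one.sh inserts.csv "$p5"

# Print column 2 of the CSV row for a 1-based task ID (row 1 of the file is the header).
insert_for_task() {
    local csv=$1 task_id=$2
    awk -F',' -v row="$((task_id + 1))" 'NR == row { print $2; exit }' "$csv"
}

# Do the work only when Slurm has set an array task ID, so the file can also be sourced.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ "$#" -ge 2 ]; then
    insert=$(insert_for_task "$1" "$SLURM_ARRAY_TASK_ID")
    # $2 is the 30-residue p5 prefix computed as in the question
    printf '>%s\n%s%s\n' "$insert" "$2" "$insert" > "${insert}.fasta"
    runpsipred "${insert}.fasta"
fi
```

Clusters usually cap the array size through the `MaxArraySize` setting, so 70,000 samples may have to be split over several array submissions (adding an offset to the task ID) unless the administrators raise that limit.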

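If job arrays are not an option, the polling idea from the question can be kept but isolated in a small function that sleeps between checks. This is a minimal sketch, assuming `squeue -h -u "$USER"` lists one line per queued or running job of the current user; the function name `wait_for_slot` and the 30-second default interval are assumptions.

```shell
# Block until the current user has fewer than $1 jobs queued or running.
# $2 is the polling interval in seconds (assumed default: 30).
wait_for_slot() {
    local max=$1 interval=${2:-30} queued
    while :; do
        # -h drops the header line, so the line count equals the job count
        queued=$(squeue -h -u "$USER" | wc -l)
        [ "$((queued))" -lt "$max" ] && return 0
        sleep "$interval"
    done
}
```

Inside the submission loop this would be called right before each `sbatch`, for example `wait_for_slot "$MAX_QUEUE"`, so that every insert is submitted exactly once.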

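Collecting the results can be done as a separate pass once jobs have finished, rather than in the middle of the submission loop. The sketch below assumes each result file is named `<insert>.horiz` and contains lines of the form `Pred: CCCHHH...`, as the question's script implies; the function names `horiz_to_string` and `collect_results` are hypothetical.

```shell
# Concatenate the strings on all "Pred:" lines of one .horiz file into a single line.
horiz_to_string() {
    awk '/^Pred:/ { printf "%s", $2 } END { print "" }' "$1"
}

# Append one FASTA-style record per .horiz file to the output file given as $1.
# The record header is the file name without its .horiz suffix.
collect_results() {
    local out=$1 f
    shift
    for f in "$@"; do
        printf '>%s\n%s\n' "$(basename "$f" .horiz)" "$(horiz_to_string "$f")" >> "$out"
    done
}
```

A possible call is `collect_results horiz.results *.horiz`; deleting the side files afterwards is left out here so that nothing is removed before the combined file has been checked.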