Snakemake slows down exponentially due to the large number of files being processed

Problem description

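I am currently writing a pipeline that generates positive RNA sequences, shuffles them, and then analyses both the positive sequences and the shuffled (negative) sequences. For example, I want to generate 100 positive sequences and shuffle each of them 1000 times with three different algorithms. For this I use two wildcards (sample_index and pred_index), which range from 0 to 99 and from 0 to 999, respectively. As a final step, all files are analysed by three further tools.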

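Now my problem: building the DAG literally takes hours, and the actual execution of the pipeline is even slower. Once it starts, it runs a batch of 32 jobs (because I allocated 32 cores to Snakemake) and then takes 10 to 15 minutes before the next batch is executed (I guess because of some file checks). A complete run of the pipeline would take roughly 2 months.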

Below is a simplified example of my Snakefile. Is there any way to optimize it so that Snakemake and its overhead are no longer the bottleneck?

ITER_POS = 100
ITER_PRED = 1000

SAMPLE_INDEX = range(0, ITER_POS)
PRED_INDEX = range(0, ITER_PRED)

SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]

rule all:
    input:
        # Expand for negative sample analysis
        expand("predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt",
            pred_tool = PRED_TOOLS,
            shuffle_tool = SHUFFLE_TOOLS,
            sample_index = SAMPLE_INDEX,
            pred_index = PRED_INDEX),

        # Expand for positive sample analysis
        expand("predictions_{pred_tool}/pos_sample_{sample_index}.txt",
            pred_tool = PRED_TOOLS,
            sample_index = SAMPLE_INDEX)


# GENERATION
rule generatePosSample:
    output: "samples/pos_sample_{sample_index}.clu"
    shell:  "sequence_generation.py > {output}"


# SHUFFLING
rule shufflePosSamples1:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples2:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_2_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples3:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_3_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"


# ANALYSIS
rule analysePosSamplesA:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analysePosSamplesB:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_B/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analysePosSamplesC:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_C/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

rule analyseNegSamplesA:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_A/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analyseNegSamplesB:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_B/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analyseNegSamplesC:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_C/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"
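
To put a rough number on the size of the DAG this Snakefile implies, the sketch below counts one job per output file for each group of rules. The helper `count_jobs` is hypothetical and only mirrors the constants above; it does not call Snakemake, and it leaves out the `all` target rule.

```python
ITER_POS = 100
ITER_PRED = 1000
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]


def count_jobs(n_pos, n_pred, n_shuffle_tools, n_pred_tools):
    """Count one job per output file, grouped by pipeline stage."""
    generate = n_pos                            # samples/pos_sample_*.clu
    shuffle = n_shuffle_tools * n_pos * n_pred  # samples/neg_sample_*.clu
    analyse_pos = n_pred_tools * n_pos          # predictions_*/pos_sample_*.txt
    analyse_neg = n_pred_tools * shuffle        # predictions_*/neg_sample_*.txt
    return {
        "generate": generate,
        "shuffle": shuffle,
        "analyse_pos": analyse_pos,
        "analyse_neg": analyse_neg,
    }


counts = count_jobs(ITER_POS, ITER_PRED, len(SHUFFLE_TOOLS), len(PRED_TOOLS))
total = sum(counts.values())
print(counts)
print("total jobs:", total)
```

With these settings the negative branch dominates: 300,000 shuffle jobs and 900,000 analysis jobs, for about 1.2 million jobs and files in total.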

Tags: performance, optimization, snakemake

Solution
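
One direction that might reduce the overhead, sketched here under assumptions and not taken from the post, is to let each job handle a chunk of `pred_index` values, so that far fewer files enter the DAG. This assumes the shuffling and analysis scripts could be wrapped to process several shuffles per call. The `chunk{chunk}` file naming and both helper functions are hypothetical.

```python
ITER_POS = 100
ITER_PRED = 1000
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]


def chunk_indices(n_items, chunk_size):
    """Split range(n_items) into consecutive ranges of at most chunk_size items."""
    return [range(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


def negative_targets(chunk_size):
    """List hypothetical per-chunk prediction files for the negative samples."""
    n_chunks = len(chunk_indices(ITER_PRED, chunk_size))
    return [
        f"predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_chunk{chunk}.txt"
        for pred_tool in PRED_TOOLS
        for shuffle_tool in SHUFFLE_TOOLS
        for sample_index in range(ITER_POS)
        for chunk in range(n_chunks)
    ]


# Compare the number of negative-sample targets for a few chunk sizes.
for size in (1, 100, 1000):
    print(size, len(negative_targets(size)))
```

A chunk size of 1 reproduces the 900,000 negative prediction targets of the original `rule all`. A chunk size of 100 would leave 9,000, at the cost of coarser restarts when a single chunk fails.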

