"Input files updated by another job" but it isn't true

Problem description

I have an issue with my pipeline and I don't know whether it comes from bad usage on my part or from a bug. I am using Snakemake 5.26.1. Until a few days ago I had no problems with the same Snakemake version, and I don't understand what has changed.

The relevant part of the pipeline comes right after a checkpoint rule and an aggregation step, which produce an output file, say foo.fasta.

The rules downstream of foo.fasta that use it as an input file are run again with the reason Input files updated by another job (as reported by --reason), but that is not the case: their outputs are more recent than foo.fasta.

Moreover, with the --summary option, the file foo.fasta is flagged as "no update" and the files have the correct timestamp order, so the rules should not be executed again.
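For reference, this is roughly how these checks were run (a minimal sketch; it assumes the Snakefile is in the working directory and the default target from rule all below):

# dry run that prints, for each scheduled job, the reason it is triggered
snakemake -n --reason

# per-output-file table with creation date, rule, status and planned action (e.g. "no update")
snakemake --summary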

I cannot find why the downstream rules are run again. Could the checkpoint be causing the problem, making Snakemake believe that foo.fasta has been updated when it has not?

Here is a dry-run output; the rule agouti_scaffolding produces tros_v5.pseudohap.fasta.gz (my foo.fasta):

Building DAG of jobs...                                                                 
Updating job 7 (aggregate_gff3).                                                        
Replace aggregate_gff3 with dynamic branch aggregate_gff3                               
updating depending job agouti_scaffolding                                               
Updating job 3 (agouti_scaffolding).                                                    
Replace agouti_scaffolding with dynamic branch agouti_scaffolding                       
updating depending job unzip_fasta                                                      
updating depending job btk_prepare_workdir                                              
updating depending job v6_clean_rename                                                  
updating depending job kat_comp                                                         
updating depending job assembly_stats                                                   
Job counts:                                                                             
        count   jobs                                                                    
        1       kat_comp                                                                
        1       kat_plot_spectra                                                        
        2                                                                               
                                                                                        
[Tue Oct 27 12:08:41 2020]                                                              
rule kat_comp:                                                                          
    input: results/preprocessing/tros/tros_dedup_proc_fastp_R1_001.fastq.gz, results/preprocessing/tros/tros_dedup_proc_fastp_R2_001.fastq.gz, results/fasta/tros_v5.pseudohap.fasta.gz
    output: results/kat/tros_v5/tros_v5_comp-main.mx                                    
    log: logs/kat_comp.tros_v5.log                                                      
    jobid: 1                                                                            
    reason: Input files updated by another job: results/fasta/tros_v5.pseudohap.fasta.gz
    wildcards: sample=tros, version=v5                                                  
                                                                                        
                                                                                        
[Tue Oct 27 12:08:41 2020]                                                              
rule kat_plot_spectra:                                                                  
    input: results/kat/tros_v5/tros_v5_comp-main.mx                                     
    output: results/kat/tros_v5/tros_v5_spectra.pdf                                     
    log: logs/kat_plot.tros_v5.log                                                      
    jobid: 0                                                                            
    reason: Input files updated by another job: results/kat/tros_v5/tros_v5_comp-main.mx
    wildcards: sample=tros, version=v5                                                  
                                                                                        
Job counts:                                                                             
        count   jobs                                                                    
        1       kat_comp                                                                
        1       kat_plot_spectra                                                        
        2                                                                               
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

It seems that updating depending job kat_comp is the problem. Even the ancient flag on the kat_comp input does not change anything.
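For clarity, ancient() is applied directly to that input, excerpted here from the kat_comp rule (the full rule appears in the edit below):

rule kat_comp:
    input:
        # read-file inputs omitted here; see the full rule in the edit below
        ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")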

EDIT

Here is the relevant part of the pipeline. As suggested by @TroyComi, the latest development version (5.26.1+26.gc2e2b501.dirty) solves the issue with the ancient directive. With the ancient directive removed, the rules are still executed.

# needed by the input functions below (may already be imported in the main Snakefile)
import glob
import os
import re

rule all:
    input:
        "results/kat/tros_v5/tros_v5_spectra.pdf"


checkpoint split_fa_augustus:
    input:
        "results/fasta/{sample}_v4.pseudohap.fasta.gz"
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        directory("results/agouti/{sample}/split")
    params:
        split_size = 50000000
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        zcat {input} > {output[0]}
        mkdir {output[1]}
        splitMfasta.pl {output[0]} \
        --outputpath={output[1]} --minsize={params.split_size}
        """

rule augustus:
    input:
        "results/agouti/{sample}/split/{sample}_v4.pseudohap.split.{i}.fa"
    output:
        "results/agouti/{sample}/split/pred_{i}.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        augustus --gff3=on --species=caenorhabditis {input} > {output}
        """

def aggregate_input_gff3(wildcards):
    checkpoint_output = checkpoints.split_fa_augustus.get(**wildcards).output[1]
    return expand("results/agouti/{sample}/split/pred_{i}.gff3",
           sample=wildcards.sample,
           i=glob_wildcards(os.path.join(checkpoint_output, f"{wildcards.sample}_v4.pseudohap.split." + "{i}.fa")).i)

rule aggregate_gff3:
    input:
        aggregate_input_gff3
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        "cat {input} | join_aug_pred.pl > {output}"

#===============================
# Preprocess RNAseq data

# The RNAseq reads need to be in the folder resources/RNAseq_raw/{sample}
# Files must be named {run}_R1.fastq.gz and {run}_R2.fastq.gz for globbing to work
# globbing is done in the rule merge_RNA_bams

rule rna_rcorrector:
    input:
        expand("resources/RNAseq_raw/{{sample}}/{{run}}_{R}.fastq.gz",
            R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
            R=['R1', 'R2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0])
    log:
        "logs/rcorrector_{sample}_{run}.log" 
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
        -t {threads} \
        -od {params.outdir} \
        > {log} 2>&1
        """

rule rna_trimgalore:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
            R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
            i=['1', '2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0]),
        basename = lambda w: f'{w.run}_trimgal'
    log:
        "logs/trimgalore_{sample}_{run}.log" 
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell: 
        """
        trim_galore --cores {threads} \
        --phred33 \
        --quality 20 \
        --stringency 1 \
        -e 0.1 \
        --length 70 \
        --output_dir {params.outdir} \
        --basename {params.basename} \
        --dont_gzip \
        --paired \
        {input} \
        > {log} 2>&1
        """

#===============================
# Map the RNAseq reads
rule index_ref:
    input: 
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa"
    output:
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
            ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    conda:
        "../envs/mapping.yaml"
    shell:
        "bwa-mem2 index {input}"

rule map_RNAseq:
    input: 
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
            i=['1', '2']),
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
            ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    output:
        "results/agouti/{sample}/mapping/{run}.bam"
    log:
        "logs/bwa_rna_{sample}_{run}.log"
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
        | samtools view -b -@ {threads} -o {output}
        """

def get_sample_rna_runs(w):
    list_R1_files = glob.glob(f"resources/RNAseq_raw/{w.sample}/*_R1.fastq.gz")
    list_runs = [re.sub(r'_R1\.fastq\.gz$', '', os.path.basename(f)) for f in list_R1_files]
    return [f'results/agouti/{w.sample}/mapping/{run}.bam' for run in list_runs]

rule merge_RNA_bams:
    input:
        get_sample_rna_runs
    output:
        "results/agouti/{sample}/RNAseq_mapped_merged.bam"
    params:
        tmp_merge = lambda w: f'results/agouti/{w.sample}/tmp_merge.bam'
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        samtools merge -@ {threads} {params.tmp_merge} {input}               
        samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
        rm {params.tmp_merge}
        """

#===============================
# Run agouti on all that
rule agouti_scaffolding:
    input: 
        fa = "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        bam = "results/agouti/{sample}/RNAseq_mapped_merged.bam",
        gff = "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    output: 
        protected("results/fasta/{sample}_v5.pseudohap.fasta.gz")
    params:
        outdir = lambda w: f'results/agouti/{w.sample}/agouti_out',
        minMQ = 20,
        maxFracMM = 0.05
    log: 
        "logs/agouti_{sample}.log"
    conda: 
        "../envs/agouti.yaml"
    shell:
        """
        python /opt/agouti/agouti.py scaffold \
        -assembly {input.fa} \
        -bam {input.bam} \
        -gff {input.gff} \
        -outdir {params.outdir} \
        -minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
        > {log} 2>&1

        gzip -c {params.outdir}/agouti.agouti.fasta > {output}
        """

#===============================================================
# Now do something on output {sample}_{version}.pseudohap.fasta.gz

rule kat_comp:
    input:
        expand("results/preprocessing/{{sample}}/{{sample}}_dedup_proc_fastp_{R}_001.fastq.gz",
            R=["R1", "R2"]),
        ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    params:
        outprefix = lambda w: f'results/kat/{w.sample}_{w.version}/{w.sample}_{w.version}_comp'
    log:
        "logs/kat_comp.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    threads:
        16
    shell:
        """
        kat comp -t {threads} \
        -o {params.outprefix} \
        '{input[0]} {input[1]}' \
        {input[2]} \
        > {log} 2>&1
        """

rule kat_plot_spectra:
    input:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_spectra.pdf"
    params:
        title = lambda w: f'{w.sample}_{w.version}'
    log:
        "logs/kat_plot.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    shell:
        """
        kat plot spectra-cn \
        -o {output} \
        -t {params.title} \
        {input} \
        > {log} 2>&1
        """

Tags: snakemake
