python - 如何收集与输入函数匹配通配符的Snakemake输入文件?
问题描述
我有一组使用 BWA-MEM 生成并使用 GATK IndelRealigner 等进一步处理的 BAM 文件。我正在以较小的块预处理我的 BAM 文件以加快处理速度。但是,我必须在变体调用之前将这些单独的文件合并到一个 BAM 文件中,这对我的 Snakemake 管道来说是一个主要问题。
我的输入文件遵循这种命名约定
# Sample 1 BAM files
OVCA-1-FRESH-1_S16_L001_realigned.bam
OVCA-1-FRESH-1_S16_L002_realigned.bam
OVCA-1-FRESH-1_S16_L003_realigned.bam
OVCA-1-FRESH-1_S16_L004_realigned.bam
# Sample 2 BAM files
OVCA-2-FRESH-1_S16_L001_realigned.bam
OVCA-2-FRESH-1_S16_L002_realigned.bam
OVCA-2-FRESH-1_S16_L003_realigned.bam
OVCA-2-FRESH-1_S16_L004_realigned.bam
有问题的管道是这样的:
# Map start input files
RUN_ID, LINE = glob_wildcards('{run_id}_L{line}_realigned.bam')
rule all:
input:
expand('{run_id}_realigned.bam', run_id=RUN_ID)
# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
files = expand('{run_id}_L{line}_realigned.bam', run_id=RUN_ID, line=LINES)
return files
# Perform BAM merging.
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realigned.bam
shell:
'samtools merge -h {input} {output}'
我试图构建一个输入函数来收集与当前处理的通配符匹配的所有可用输入文件。当我为我的管道执行试运行时,我可以看到该功能samtools_merge_inputs
无法正常工作,因为它收集了所有可用的 BAM 文件并多次重复它们:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16
它应该如下所示:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16
我应该如何编辑 samtools_merge_inputs 函数来收集所需的输入文件?我确实意识到我可以简单地忘记输入函数,只需使用通配符将四个输入文件输入到 samtools_merge,但我真的很想学习如何在这种情况下使用输入函数,因为我在其他管道中遇到了类似的问题也是。我试图从其他帖子中寻求帮助,但到目前为止,我还没有找到适合我目的的答案。
谢谢您的帮助!
解决方案
您的函数在这里不使用通配符。它应该是这样的:
def samtools_merge_inputs(wildcards):
files = expand(wildcards.run_id+'_L{line}_realigned.bam', line=LINES)
return files
当然,如果您在所有车道上都有所有样本。调用函数时,所有通配符都作为对象传递wildcards
给函数的参数。
你也可以这样做:
files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES)
你有很多东西在你的蛇文件中不起作用。
首先,您的 samtools 合并规则中缺少“'”:
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realigned.bam'<-----
shell:
'samtools merge -h {input} {output}'
并注意变量名(LINE vs LINES)
其次,该函数glob_wildcards()
将返回找到的所有值的列表,这意味着您的两个变量将如下所示:
RUN_ID, LINES = glob_wildcards('{run_id}_L{line}_realigned.bam')
print(RUN_ID)
['OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16']
print(LINES)
['002', '001', '001', '002', '004', '003', '003', '004']
我敢肯定这不是你想要的。解决方案是使用正确的结构来描述您的样本。例如(再次,如果所有样本都在所有车道上):
RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["1","2","3","4"]
最后一件事:您的输入和输出无法通过通配符区分,这意味着您最终会遇到错误Cyclic dependency on rule samtools_merge
或RecursionError: maximum recursion depth exceeded in comparison
. 我建议您为输出选择不同的名称。全部放在一起:
# Map start input files
RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["001","002","003","004"]
rule all:
input:
expand('{run_id}_realignedFinal.bam', run_id=RUN_ID)
# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES)
return files
# Perform BAM merging.
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realignedFinal.bam'
shell:
'samtools merge -h {input} {output}'
没有检查你的 shell 命令,但我的文档说:
Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]
推荐阅读
- javascript - JavaScript 测验问题,继续点击
- tensorflow - keras:model.fit_generator 抛出错误
- python-3.x - discord.ext.commands.errors.MissingPermissions:您缺少运行此命令的静音成员权限
- flutter - 颤动父子复选框
- asp.net - 有没有办法在不更改控制器文件名的情况下更改控制器名称?
- javascript - Material-UI 的 Dialog 如何允许在对话框后面进行交互?
- git - git-multimail 排除 git diff 和后续电子邮件
- matlab - 使用 Matlab 求解使用 Euler 方法的 ODE 系统
- google-cloud-platform - 从 gcp 中的 stackdriver-agent 获取 vm 正常运行时间数据?
- plot - 如何在 Gnuplot 中拟合 f(x)=a*sin^2(b*x^2 + c)