snakemake - How to parallelize a rule (with multiple outputs) over its inputs in a Snakefile
Problem description
I'm quite confused about how Snakemake parallelizes jobs within a rule. I'd like to use one core per input (handling each input separately, rather than splitting cores between them), with multiple outputs produced per input.
Here is a simplified example of my code:
# Globals ---------------------------------------------------------------------

datasets = ["dataset_S1", "dataset_S2"]
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]

# Rules -----------------------------------------------------------------------

rule all:
    input:
        # Binary files
        expand("{dataset}/results_bin/persnp/{method}.feather", dataset=datasets, method=methods),
        expand("{dataset}/results_bin/pergene/{method}.feather", dataset=datasets, method=methods)

rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        "{dataset}/results_bin/persnp/{method}.feather",
        "{dataset}/results_bin/pergene/{method}.feather"
    threads:
        2
    shell:
        "Rscript {input}"
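As written, the outputs of rule bin contain both the {dataset} and {method} wildcards, so Snakemake creates one bin job per (dataset, method) combination, and every one of those jobs runs the same Rscript. To see the job matrix this produces, here is a minimal, purely illustrative Python re-implementation of expand() (the real one lives in snakemake.io):

```python
from itertools import product

def expand(pattern, **wildcards):
    # Minimal sketch of Snakemake's expand() helper: fill the pattern
    # with every combination of the given wildcard values.
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]

datasets = ["dataset_S1", "dataset_S2"]
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]

targets = expand("{dataset}/results_bin/persnp/{method}.feather",
                 dataset=datasets, method=methods)
print(len(targets))  # 8 files -> 8 (dataset, method) pairs, i.e. 8 bin jobs
print(targets[0])    # dataset_S1/results_bin/persnp/pbs.feather
```

Each persnp target corresponds to one distinct wildcard combination, which is why the same script is scheduled once per method.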
If I run the code above with snakemake -j2, each script ends up being re-run once per output method, which is not what I want. If I instead use the expand() function for both the input and output of rule bin, I end up with:
    shell:
        """
        Rscript {input[0]}
        Rscript {input[1]}
        """
which, I believe, cannot be parallelized.
What should I do to handle each input separately, so that I can use one core per input?
Any help would be greatly appreciated. Thanks!
Edit
Trying to better explain what my scripts do and what behavior I expect from Snakemake. See my example folder structure:
.
├── dataset_S1
│   ├── data
│   │   └── data.vcf
│   ├── results_bin
│   │   └── convert2feather.R
│   ├── task2
│   │   └── script.py
│   └── task3
│       └── script.sh
└── dataset_S2
    ├── data
    │   └── data.vcf
    ├── results_bin
    │   └── convert2feather.R
    ├── task2
    │   └── script.py
    └── task3
        └── script.sh
As you can see, for each dataset I have folders with the same structure and identically named scripts (although the contents of the scripts may differ). In this example, the script reads the data.vcf file, operates on it, and creates new folders and files inside the corresponding dataset folder. The whole task is repeated for both datasets. I'd like to set this up so that I can do the same for the scripts in the folders task2, task3, and so on...
For example, the output of my pipeline in this case would be:
.
├── dataset_S1
│   ├── data
│   │   └── data.vcf
│   └── results_bin
│       ├── convert2feather.R
│       ├── pergene
│       │   ├── ihs.feather
│       │   ├── pbs.feather
│       │   ├── pbs_windowed.feather
│       │   └── xpehh.feather
│       └── persnp
│           ├── ihs.feather
│           ├── pbs.feather
│           ├── pbs_windowed.feather
│           └── xpehh.feather
└── dataset_S2
    ├── data
    │   └── data.vcf
    └── results_bin
        ├── convert2feather.R
        ├── pergene
        │   ├── ihs.feather
        │   ├── pbs.feather
        │   ├── pbs_windowed.feather
        │   └── xpehh.feather
        └── persnp
            ├── ihs.feather
            ├── pbs.feather
            ├── pbs_windowed.feather
            └── xpehh.feather
Edit 2
File and command used:
(snakemake) cmcouto-silva@datascience-IB:~/cmcouto.silva@usp.br/lab_files/phd_data$ snakemake -j2 -p
# Globals ---------------------------------------------------------------------

datasets = ["dataset_S1", "dataset_S2"]
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]

# Rules -----------------------------------------------------------------------

rule all:
    input:
        # Binary files
        expand("{dataset}/results_bin/persnp/{method}.feather", dataset=datasets, method=methods),
        expand("{dataset}/results_bin/pergene/{method}.feather", dataset=datasets, method=methods)

rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
    threads:
        2
    shell:
        "Rscript {input}"
Output log:
(snakemake) cmcouto-silva@datascience-IB:~/cmcouto.silva@usp.br/lab_files/phd_data$ snakemake -j2 -p
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
2 bin
3
[Wed Sep 30 23:47:55 2020]
rule bin:
input: dataset_S1/results_bin/convert2feather.R
output: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather
jobid: 1
wildcards: dataset=dataset_S1
threads: 2
Rscript dataset_S1/results_bin/convert2feather.R
Package "data.table" successfully loaded!
Package "magrittr" successfully loaded!
Package "snpsel" successfully loaded!
[Wed Sep 30 23:48:43 2020]
Finished job 1.
1 of 3 steps (33%) done
[Wed Sep 30 23:48:43 2020]
rule bin:
input: dataset_S2/results_bin/convert2feather.R
output: dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 2
wildcards: dataset=dataset_S2
threads: 2
Rscript dataset_S2/results_bin/convert2feather.R
Package "data.table" successfully loaded!
Package "magrittr" successfully loaded!
Package "snpsel" successfully loaded!
[Wed Sep 30 23:49:41 2020]
Finished job 2.
2 of 3 steps (67%) done
[Wed Sep 30 23:49:41 2020]
localrule all:
input: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 0
[Wed Sep 30 23:49:41 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /home/cmcouto-silva/cmcouto.silva@usp.br/lab_files/phd_data/.snakemake/log/2020-09-30T234755.741940.snakemake.log
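One detail the timestamps above reveal: the two bin jobs ran back-to-back rather than in parallel (job 2 started only when job 1 finished at 23:48:43). With -j2 Snakemake has two cores to allocate, and each bin job claims threads: 2, so only one job fits at a time. Assuming the R script itself is single-threaded, declaring threads: 1 (or raising -j) would let both datasets run concurrently; a sketch of the adjusted rule:

```python
rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
    threads:
        1  # one core per dataset; with -j2 both bin jobs can run at once
    shell:
        "Rscript {input}"
```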
Solution
I'm not sure I understood correctly, but it sounds like each "dataset" infile should produce one output file per "method". If so, this should work:
rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
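The key is the double-brace escaping: expand() uses Python's str.format under the hood, so {{dataset}} survives as the literal wildcard {dataset} while {method} is filled in. The output patterns then contain only the {dataset} wildcard, so Snakemake creates one bin job per dataset (each producing all eight .feather files) instead of one per dataset/method pair, and with -j2 the two dataset jobs can run on separate cores. A small illustration of what the escaped pattern expands to:

```python
# str.format leaves doubled braces as literal single braces, so
# {{dataset}} stays a wildcard while {method} is substituted.
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]
outputs = ["{{dataset}}/results_bin/persnp/{method}.feather".format(method=m)
           for m in methods]
print(outputs[0])  # {dataset}/results_bin/persnp/pbs.feather
print(len(outputs))  # 4 per-method outputs, all sharing the dataset wildcard
```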