Looping over files: parse files and group them by identifier

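Problem description

I want to:

  1. Read the list of *.bed files from a directory.
  2. For every .bed file in my folder, use the information in id=NAME, which appears on every line as part of the fifth column of all *.bed files (e.g. HOX.bed and zinc.bed below).
  3. Determine which Family (e.g. cram-2) a given file belongs to, using a separate lookup table that links id values to Family values (e.g. the lookup table below).
  4. Merge/concatenate all files that belong to the same Family (e.g. HOX.bed and zinc.bed) into a single .bed file.
  5. Save the concatenated file under the name given in the Family column (e.g. cram-2.bed).

Example: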

Lines of HOX.bed:

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

Lines of zinc.bed:

ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
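
As an illustration of step 2, here is a minimal base R sketch of pulling the id out of the fifth column. It assumes the files are tab-separated (as the read.table call in the question suggests); `extract_id` is a hypothetical helper name, not something from the original post.

```r
# Two sample lines in the format shown above, written with explicit tabs.
bed_lines <- c(
  "ma\treg\tout\tfim\tid=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05",
  "se\treg\tout\tfim\tid=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05"
)

# Hypothetical helper: return the id= value found in the fifth column.
extract_id <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  attrs <- vapply(fields, function(f) f[5], character(1))
  # Keep only the text between "id=" and the next ";".
  sub("^id=([^;]+).*$", "\\1", attrs)
}

ids <- extract_id(bed_lines)
print(ids)
```

Matching `[^;]+` rather than `.+` stops at the first semicolon, so the result does not depend on which attribute follows the id.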

Lookup table:

Name                        Family
HOX                         cram-2
zinc                        cram-2
fire                        sf.xr
fire                        ra.XS-2
...continues...
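
One thing worth checking in a table like this is whether a Name maps to more than one Family, as "fire" does in the rows shown. The sketch below is only an illustration in base R, using just the four visible rows; the variable names are assumptions.

```r
# The four visible rows of the lookup table (the real table continues).
lookup <- data.frame(
  Name = c("HOX", "zinc", "fire", "fire"),
  Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"),
  stringsAsFactors = FALSE
)

# Group the Family values by Name.
families_by_name <- split(lookup$Family, lookup$Name)

# Names linked to more than one Family need a decision before merging:
# either the file goes into every matching Family, or only into one.
ambiguous <- names(families_by_name)[lengths(families_by_name) > 1]
print(ambiguous)
```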

The output I am looking for:

File name = cram-2.bed

HOX.bed and zinc.bed are concatenated because both belong to Family cram-2!

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

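I have started to sketch a script, but I am struggling with how to make all files that share the same Family end up in the same output file (presumably a .bed file):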

myFiles <- list.files(pattern = "\\.bed$") 
for(i in myFiles){
  name <- read.table((i), header = FALSE, sep="\t", stringsAsFactors=FALSE, quote="")
  name <- name %>% top_n(1, "id")
  Family_filtering <-
    table %>% filter(
      Family %in% name)
  save(...????????...)
}

Thank you very much for your help!

Tags: r, loops, for-loop, parsing, tibble

Solution

Turn each step into a function, then combine the functions. Simple, isn't it?

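library(fs)
library(tidyverse)

# Lookup table linking each Name (the id= value) to its Family
dfNameFamily = tibble(
  Name = c("HOX", "zinc", "fire", "fire2"),
  Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"))

dir = "bedfile"  # folder that holds the .bed files

# List all .bed files in a folder
BedFile = function(dir) dir_ls(dir, regexp = "\\.bed$")

# Read all lines of one file (empty vector if the file does not exist)
readTxt = function(FileName){
  lines = character()
  if(file_exists(FileName)){
    con = file(FileName, open = "r")
    lines = readLines(con)
    close(con)
  }
  lines
}

# Take the id= value from the first line of a file
GetName = function(l) str_match(l, "id=([^;]+)")[1,2]

# Write one Family group to <dir>/<Family>.bed.
# group_walk passes the group rows as l and a one-row tibble of keys as name.
# Note: the output also matches "\\.bed$", so a rerun would read it as input.
SaveFile = function(l, name, dir){
  con = file(file.path(dir, paste0(name$Family, ".bed")))
  writeLines(unlist(l$lines), con)
  close(con)
}

tibble(FileName = BedFile(dir)) %>%  # Read all bed file names
  mutate(
    lines = map(FileName, readTxt),  # Read all lines from each bed file
    Name = map_chr(lines, GetName)) %>%  # Get the Name for each bed file
  left_join(dfNameFamily, by="Name") %>%  # Join Family (NA if Name is not in the table)
  group_by(Family) %>%
  group_walk(SaveFile, dir)  # Save one file per Family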
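
For comparison, the same idea can be sketched in base R and checked end to end in a temporary folder. This is an illustration under stated assumptions, not the answer's method: `merge_by_family` and `make_lines` are hypothetical helpers, output goes to a separate folder so a rerun does not read the merged files back in, files whose id is missing from the lookup table are skipped, and when a Name appears more than once only its first Family is used.

```r
# Hypothetical helper: concatenate .bed files that share a Family.
merge_by_family <- function(in_dir, out_dir, lookup) {
  files <- list.files(in_dir, pattern = "\\.bed$", full.names = TRUE)
  contents <- lapply(files, readLines)
  # Take the id from the first line of each file.
  ids <- vapply(contents,
                function(x) sub("^.*id=([^;]+).*$", "\\1", x[1]),
                character(1))
  # match() returns the first lookup row per id, or NA for unknown ids.
  family <- lookup$Family[match(ids, lookup$Name)]
  keep <- !is.na(family)
  groups <- split(contents[keep], family[keep])
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (fam in names(groups)) {
    writeLines(unlist(groups[[fam]]), file.path(out_dir, paste0(fam, ".bed")))
  }
  invisible(names(groups))
}

# Small demo with two generated files in a temporary folder.
in_dir <- file.path(tempdir(), "bed_in")
out_dir <- file.path(tempdir(), "bed_out")
dir.create(in_dir, showWarnings = FALSE)
make_lines <- function(id) {
  paste0(c("ma", "se"), "\treg\tout\tfim\tid=", id,
         ";seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05")
}
writeLines(make_lines("HOX"), file.path(in_dir, "HOX.bed"))
writeLines(make_lines("zinc"), file.path(in_dir, "zinc.bed"))

lookup <- data.frame(Name = c("HOX", "zinc"),
                     Family = c("cram-2", "cram-2"),
                     stringsAsFactors = FALSE)

written <- merge_by_family(in_dir, out_dir, lookup)
merged <- readLines(file.path(out_dir, "cram-2.bed"))
```

Here `merged` holds the lines of HOX.bed followed by those of zinc.bed, which is the layout the question asks for.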
