r - Looping over files: parse files and group them by identifier
Problem description
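I want to:
- read the list of *.bed files from a directory
- for every .bed file in my folder, use the id=NAME information contained in each row; it is part of the fifth column of all *.bed files (e.g. HOX.bed and zinc.bed below)
- determine which family (e.g. cram-2) a given file belongs to, using a separate lookup table (e.g. the lookup table below) that links id values to Family values
- merge/concatenate all files that share the same family (e.g. HOX.bed and zinc.bed) into a single .bed file
- save the concatenated file under the name taken from the Family column (e.g. cram-2.bed).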
Example:
Lines of HOX.bed:
ma reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
Lines of zinc.bed:
ma reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
Lookup table:
Name Family
HOX cram-2
zinc cram-2
fire sf.xr
fire ra.XS-2
...continues...
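The id-to-Family lookup described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name `get_id` and the named vector `lookup` are hypothetical, and only the two HOX/zinc rows of the lookup table are used.

```r
# Hypothetical lookup: names are id values, elements are Family values.
lookup <- c(HOX = "cram-2", zinc = "cram-2")

# Extract the value of the id= key from one .bed line.
# Note: sub() returns the line unchanged if it contains no "id=...;" part.
get_id <- function(line) {
  sub(".*id=([^;]+);.*", "\\1", line)
}

# One sample row, copied from the HOX.bed example.
line <- "ma reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05"

id <- get_id(line)            # id taken from the line
family <- unname(lookup[id])  # Family that this id maps to
```

An id that is missing from `lookup` yields `NA`, which makes it easy to detect files that cannot be assigned to a family.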
The output I am looking for:
File name = cram-2.bed
HOX.bed and zinc.bed concatenated, because both belong to Family cram-2!
ma reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
ma reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
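I started preparing a script structure, but I am struggling to work out how to make all files with the same Family end up in the same output file (possibly a .bed file).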
myFiles <- list.files(pattern = "\\.bed$")
for(i in myFiles){
  name <- read.table((i), header = FALSE, sep="\t", stringsAsFactors=FALSE, quote="")
  name <- name %>% top_n(1, "id")
  Family_filtering <-
    table %>% filter(
      Family %in% name)
  save(...????????...)
}
Thank you very much for your help!
Solution
Turn each step into a function, then compose the functions. Simple, isn't it?
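library(fs)
library(tidyverse)

# Lookup table: id (Name) -> Family
dfNameFamily = tibble(
  Name = c("HOX", "zinc", "fire", "fire2"),
  Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"))

dir = "bedfile" # Folder that holds the .bed files

# List all .bed files in a folder
BedFile = function(dir) dir_ls(dir, regexp = "\\.bed$")

# Read all lines of one file (empty vector if the file is missing)
readTxt = function(FileName){
  lines = character()
  if(file_exists(FileName)){
    lines = readLines(FileName)
  }
  lines
}

# Extract the id from the first line of a file (assumes one id per file)
GetName = function(l) str_match(l, "id=([^;]+)")[1,2]

# Write all lines of one Family group to <dir>/<Family>.bed
# l: rows of the group, name: one-row tibble holding the Family key
# Note: the output lands in the input folder, so a second run would also read it
SaveFile = function(l, name, dir){
  writeLines(unlist(l$lines), file.path(dir, paste0(name$Family, ".bed")))
}

tibble(FileName = BedFile(dir)) %>% # Read all .bed file names
  mutate(
    lines = map(FileName, readTxt), # Read all lines from every .bed file
    Name = map_chr(lines, GetName)) %>% # Get the Name for every .bed file
  left_join(dfNameFamily, by="Name") %>% # Join Family
  filter(!is.na(Family)) %>% # Skip files whose id is not in the lookup table
  group_by(Family) %>%
  group_walk(SaveFile, dir) # Save one file per Family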
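For comparison, the same idea can be tried without any packages. The following is a minimal base-R sketch, not the answer's method: the function name `merge_by_family`, its arguments, and the temporary demo directories are hypothetical. It assumes that each file carries a single id (read from its first line) and writes the merged files to a separate output folder so that a second run does not pick them up as input.

```r
# Base-R sketch; merge_by_family() and its arguments are hypothetical names.
merge_by_family <- function(in_dir, out_dir, lookup) {
  files <- list.files(in_dir, pattern = "\\.bed$", full.names = TRUE)
  # Read every file once; each list element is a character vector of lines.
  contents <- lapply(files, readLines)
  # Take the id from the first line of each file (assumes one id per file).
  ids <- vapply(contents,
                function(x) sub(".*id=([^;]+);.*", "\\1", x[1]),
                character(1))
  families <- unname(lookup[ids])
  keep <- !is.na(families)  # skip files whose id is not in the lookup
  groups <- split(contents[keep], families[keep])
  # Write one concatenated file per family, named after the family.
  for (fam in names(groups)) {
    writeLines(unlist(groups[[fam]]), file.path(out_dir, paste0(fam, ".bed")))
  }
  invisible(names(groups))
}

# Demo in temporary directories, using rows shaped like the example above.
in_dir <- tempfile("bed_in")
dir.create(in_dir)
out_dir <- tempfile("bed_out")
dir.create(out_dir)
row <- "ma reg out fim id=%s;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05"
writeLines(rep(sprintf(row, "HOX"), 2), file.path(in_dir, "HOX.bed"))
writeLines(rep(sprintf(row, "zinc"), 2), file.path(in_dir, "zinc.bed"))

written <- merge_by_family(in_dir, out_dir, c(HOX = "cram-2", zinc = "cram-2"))
merged <- readLines(file.path(out_dir, "cram-2.bed"))
```

Here `merged` holds the lines of HOX.bed followed by those of zinc.bed, because `list.files()` returns the paths in alphabetical order.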