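r - How do I loop over multiple files, perform some operations, and write all the variables from the loop to a file?
Problem description
I am new to R, so please forgive me if this is basic.
I am reading in a table: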
library(data.table)
require(magrittr); require(tidyr)
x=fread("merged_plot_SG", header=TRUE)
> head(x)
          gene_id chr  min_POS  max_POS      rs_id pvalue_G pvalue_E     metaP
1 ENSG00000020922  11 94212567 95223359 rs11605546   0.1367   0.9353 0.2670442
2 ENSG00000020922  11 94212567 95223359   rs566917   0.2740   0.2275 0.9363864
3 ENSG00000020922  11 94212567 95223359 rs12286498   0.8961   0.3347 0.5552598
4 ENSG00000020922  11 94212567 95223359  rs7934178   0.9043   0.3353 0.5510581
5 ENSG00000020922  11 94212567 95223359 rs16924610   0.9047   0.3353 0.5507136
6 ENSG00000020922  11 94212567 95223359  rs2508783   0.8685   0.1382 0.3517432
...
From this table I want to extract all the unique values of x$chr, which in this case are:
> unique(x$chr)
[1] 11 3 6 7 20 17 2 12 1 10 4 19 9 22
Then, for each unique number, I want to load a file. For example, the first number here is 11, so I would do:
b=fread("/mydir/bed_chr_11.bed")
and then the next ones:
b=fread("/mydir/bed_chr_3.bed")
b=fread("/mydir/bed_chr_6.bed")
...
Next I would perform these two operations on each one:
x00=x %>%
  inner_join(b, by = c("rs_id" = "V4")) %>%
  select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3)
x11=x00 %<>%
  unite(snp, chr, V3, remove = FALSE)
So in the end I would have all of these data frames:
x11,x3,x6,x7,x20,x17,x2,x12,x1,x10,x4,x19,x9,x22
Then I would combine them all into one data frame and write it to a file:
x.n <- c('x11','x3','x6','x7','x20','x17','x2','x12','x1','x10','x4','x19','x9','x22')
x.list <- lapply(x.n, get)
xx=do.call(rbind, x.list)
colnames(xx)[6] <- "pvalue"
write.table(xx, "ready_plot_SG", quote=F, col.names=TRUE,row.names = F)
Could you help me do all of this in a single script, using a loop?
Thanks!
Edit: following the suggestion below, I got this far:
require(dplyr)
library(data.table)
require(magrittr); require(tidyr)
x=fread("merged_plot_RGL", header=TRUE)
num=unique(x$chr)
files=list.files(path = "/anika/bed/", pattern = "\\.bed$", full.names = FALSE)
data_dir <- "/anika/bed"
#loop over the initial files
for(i in num){
  file <- paste0(data_dir,"/", "bed_chr_",num[i],".bed") # loaded .bed file
  xx <- lapply(file, function(z){
    b <- fread(z, header = TRUE)
    data.table(
      x %>%
        inner_join(b, by = c("rs_id" = "V4")) %>%
        select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3) %>%
        unite(snp, chr, V3, remove = FALSE)
    )
  })
  #We can combine them using data.tables 'rbindlist'
  x_final <- rbindlist(xx)
  #now we can use data.tables 'fwrite' to output the table to a file
  names(x_final)[6] <- "pvalue"
  fwrite(x_final, "test_rgl.txt", quote = "F", col.names = TRUE, row.names = FALSE)
}
But I get this error:
Error: `by` can't contain join column `V4` which is missing from RHS
Execution halted
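Solution
Based on your description of the problem, I will do my best to provide an answer. It seems there are multiple files at each step. I suggest an outer for loop with an apply function inside it, which you can use to read the files and perform the desired transformations.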
library(data.table); library(dplyr); library(tidyr)
initial_dir <- "directory to the folder with the initial data"
data_dir <- "directory to the folder containing the secondary data"
file_names <- c("lots of file names") #1: Insert any number of files here from which to read the numbers for the bed_chr_[nr].bed files
#loop over the initial files
for(i in file_names){
  file <- paste0(initial_dir,"/", i)
  x <- fread(file, header = TRUE)
  #combine secondary path and name
  secondary_files <- paste0(data_dir,"/bed_chr_", unique(x[["chr"]]), ".bed") #2: replace "chr" with the name of the column in the initial table that holds the numbers for the bed_chr_[nr].bed files (note I added unique)
  #Apply the desired transformation
  xx <- lapply(secondary_files, function(z){ #lapply will apply the 'function' to each element in 'secondary_files'.
    b <- fread(z)
    data.table( #apply the transformation and return a table to the list
      x %>%
        inner_join(b, by = c("rs_id" = "V4")) %>%
        select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3) %>%
        unite(snp, chr, V3, remove = FALSE)
    )
  }) #after the lapply has run, xx contains all the tables from the bed_chr_[numbers].bed files. All will have been read.
  #xx is now a list that contains all the tables after applying the transformation
  #We can combine them using data.table's 'rbindlist'
  x_final <- rbindlist(xx)
  #now we can use data.table's 'fwrite' to output the table to a file
  names(x_final)[6] <- "pvalue"
  fwrite(x_final, "ready_plot_sg.txt", quote = FALSE, col.names = TRUE, row.names = FALSE) #here all the bed_chr_[numbers].bed files are written to a single combined file
}
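Note: I used paste0 rather than paste. Both concatenate strings in R; the difference is that paste0 uses sep = "" by default, whereas paste inserts a space by default.
lapply applies a function over a list, vector, or similar object and returns the results in a list. vapply or sapply could also be used, but they do not necessarily return a list, and a list is what we want here so that rbindlist can be used without any additional arguments. For the purposes of my example, I assume that your transformation of b is free of errors.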
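The difference between paste and paste0, and between lapply and sapply, can be seen in a small sketch. The values in chr_values are made up for illustration and stand in for unique(x$chr).

```r
# paste() separates its arguments with a space by default; paste0() uses sep = "".
spaced  <- paste("bed_chr_", 11, ".bed")    # contains unwanted spaces
compact <- paste0("bed_chr_", 11, ".bed")   # "bed_chr_11.bed"

# Both functions are vectorised, so one call builds every file name at once.
chr_values <- c(11, 3, 6)                   # hypothetical stand-in for unique(x$chr)
bed_paths <- paste0("bed_chr_", chr_values, ".bed")

# lapply() always returns a list, which is what rbindlist() expects.
as_list <- lapply(chr_values, function(v) v * 2)

# sapply() simplifies when it can; here it returns a plain numeric vector.
as_vector <- sapply(chr_values, function(v) v * 2)

is.list(as_list)     # TRUE
is.list(as_vector)   # FALSE
```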
:::Edit::: Based on the new snippet the asker added, I will rewrite the code I posted for this specific case, to show how the two fit together:
require(dplyr);library(data.table);require(magrittr); require(tidyr)
initial_dir <- "" #The base directory in 'edit' was none ("merged_plot_RGL" is in working directory)
data_dir <- "anika/bed"
file_names <- c("merged_plot_RGL") #1: I inserted the merged_plot_RGL, as this contains the labels for the 'bed' files.
#Note: This loop is now redundant as we only have 1 file. But for illustration it was kept (i will only take on 'merged_plot_RGL' as a value)
for(i in file_names){
  file <- i #2: Sets current file to merged_plot_RGL. paste0 removed as initial_dir was empty ("")
  x <- fread(file, header = TRUE) #3: loads in merged_plot_RGL as a data.table
  #4: combine secondary path and name (Note: I extract the number from x's chr column)
  secondary_files <- paste0(data_dir,"/bed_chr_", unique(x[, chr]), ".bed")
  #5: Apply the desired transformation via lapply (Note: xx becomes a list of transformed data from the bed files)
  xx <- lapply(secondary_files, function(z){
    b <- fread(z) #5.1: First read in the .bed file (done 1 by 1)
    data.table( #5.2: apply the transformation and return a table to the list
      x %>%
        inner_join(b, by = c("rs_id" = "V4")) %>% #Note: the asker's error was raised at this join
        select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3) %>%
        unite(snp, chr, V3, remove = FALSE)
    )
  })
  #Note: After the lapply has run, and read in each .bed file, xx is now a list of .bed data tables that have been transformed by inner join, select and unite.
  #6: We can combine them using data.table's 'rbindlist'
  x_final <- rbindlist(xx)
  #7: now we can use data.table's 'fwrite' to output the table to a file
  names(x_final)[6] <- "pvalue"
  fwrite(x_final, "ready_plot_sg.txt", quote = FALSE, col.names = TRUE, row.names = FALSE)
}
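One possible explanation for the "V4 is missing from RHS" error in the question, offered as a guess rather than a confirmed diagnosis: the asker's loop calls fread with header = TRUE, while the join relies on the default names V1..V4 that fread only assigns when a file has no header row. The sketch below assumes the .bed files are headerless and uses two made-up lines in place of a real file. It also shows that for(i in num) already yields the values of num, so num[i] indexes by value instead of by position.

```r
library(data.table)

# Two made-up lines standing in for a headerless, tab-separated .bed file.
bed_lines <- c("11\t100\t101\trs1", "11\t200\t201\trs2")

# Without a header row, fread() assigns the default names V1, V2, V3, V4.
b_ok <- fread(text = bed_lines, header = FALSE)
names(b_ok)

# With header = TRUE the first data row is consumed as column names,
# so no column called V4 exists for inner_join() to match on.
b_bad <- fread(text = bed_lines, header = TRUE)
names(b_bad)
"V4" %in% names(b_bad)

# Looping with for (i in num) gives the values themselves, so num[i]
# looks up position 11, 3, 6 and returns the wrong chromosome or NA.
num <- c(11, 3, 6)
picked <- vapply(num, function(i) num[i], numeric(1))
picked
```

If this is the cause, reading the file with the default header detection (or header = FALSE) and using i directly in the file name would be the things to try first.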
::: Edit 2 ::: A shorter version, in which errors are easier to trace.
require(dplyr);library(data.table);require(magrittr); require(tidyr)
data_dir <- "anika/bed"
file_names <- c("merged_plot_RGL")
x <- fread("merged_plot_RGL", header = TRUE)
secondary_files <- paste0(data_dir,"/bed_chr_", unique(x[, chr]), ".bed")
bed_files <- lapply(secondary_files, fread)
xx <- rbindlist(bed_files)
xx2 <- x %>%
  inner_join(xx, by = c("rs_id" = "V4"))
xx3 <- xx2 %>%
  select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3)
xx4 <- xx3 %>%
  unite(snp, chr, V3, remove = FALSE)
names(xx4)[6] <- "pvalue"
fwrite(xx4, "ready_plot_sg.txt", quote = FALSE, col.names = TRUE, row.names = FALSE)
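To try the shorter version without the real data, here is a minimal self-contained sketch. Everything in it is made up for illustration: the temporary directory, the toy table x, the two tiny .bed files, and the output name ready_plot_demo.txt. It assumes the .bed files are tab-separated with no header row and that their fourth column holds the rs_id.

```r
library(data.table)
library(dplyr)
library(tidyr)

# Hypothetical inputs written to a temporary directory.
tmp_dir <- tempfile("bed_demo_")
dir.create(tmp_dir)

x <- data.table(
  gene_id  = c("ENSG_A", "ENSG_A", "ENSG_B"),
  chr      = c(11L, 11L, 3L),
  rs_id    = c("rs1", "rs2", "rs3"),
  pvalue_G = c(0.10, 0.20, 0.30),
  pvalue_E = c(0.90, 0.80, 0.70)
)

# Headerless BED-like files: chromosome, start, end, rs_id.
fwrite(data.table(11L, c(100L, 200L), c(101L, 201L), c("rs1", "rs2")),
       file.path(tmp_dir, "bed_chr_11.bed"), sep = "\t", col.names = FALSE)
fwrite(data.table(3L, 300L, 301L, "rs3"),
       file.path(tmp_dir, "bed_chr_3.bed"), sep = "\t", col.names = FALSE)

# One path per chromosome present in x; drop paths that do not exist.
bed_paths <- file.path(tmp_dir, paste0("bed_chr_", unique(x$chr), ".bed"))
bed_paths <- bed_paths[file.exists(bed_paths)]

# Read every file (columns become V1..V4) and stack them into one table.
bed_all <- rbindlist(lapply(bed_paths, fread, header = FALSE))

# Join once against the stacked table, keep the wanted columns, build snp.
result <- x %>%
  inner_join(bed_all, by = c("rs_id" = "V4")) %>%
  select(gene_id, chr, rs_id, pvalue_G, pvalue_E, V2, V3) %>%
  unite(snp, chr, V3, remove = FALSE)

# Write the combined result to a single file.
out_file <- file.path(tmp_dir, "ready_plot_demo.txt")
fwrite(result, out_file, sep = "\t", quote = FALSE)
```

Checking file.exists before reading is a design choice of this sketch, not part of the original answer; it lets the script skip a chromosome whose .bed file is absent instead of stopping.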