r - Loop through files in R and select rows by string
Problem description
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an Excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace # 2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
Solution
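If the number of relevant rows is the same in every file, you can do the following. Option 1 shows a solution using a loop, and option 2 shows a solution using sapply.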
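In the first step, I generate three csv files to make the code reproducible. In each file, the first relevant row is marked by "start" and the last relevant row by "end". I then get a list of the names of these files with dir().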
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
  #random number of filler rows before "start" and after "end"
  df <- data.frame(x = c(rep(0, sample(1:10, 1)), "start",
                         paste0("dat", i),
                         "end", rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names (every csv file in the working directory)
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
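For the loop, you can first initialise the result data frame ("outDF"), with the number of columns set to the number of csv files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().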
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
                           dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
  idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
  outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
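To see what the indexing inside the loop does, here is a minimal sketch on an in-memory vector rather than a file. The vector x is made up for illustration, and the sketch assumes that "start" and "end" each occur exactly once, as they do in the generated files.

```r
# Hypothetical column contents, mimicking one generated file
x <- c("0", "0", "start", "dat1", "end", "0")

# which() returns the positions where the condition is TRUE
first <- which(x == "start")
last  <- which(x == "end")

# Guard against input where a marker is missing or repeated
stopifnot(length(first) == 1, length(last) == 1, first < last)

# Everything from the "start" row through the "end" row
target <- x[first:last]
print(target)
```

If a marker could be missing or repeated in real files, a check like the stopifnot() call above fails early instead of silently selecting the wrong rows.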
Option 2 - sapply
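Instead of a loop, you can use sapply with a custom function to extract the relevant rows from each file. This returns a matrix, which you can then convert to a data frame.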
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
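The conversion works because sapply() simplifies equal-length results into a matrix with one column per input. A small sketch with made-up vectors standing in for the rows extracted from files (the names fileA and fileB are illustrative only):

```r
# Hypothetical stand-ins for the rows extracted from two files
extracted <- list(
  fileA = c("start", "datA", "end"),
  fileB = c("start", "datB", "end")
)

# identity() returns each element unchanged; sapply() then simplifies
# the equal-length character vectors into a 3 x 2 matrix
m <- sapply(extracted, identity)

# One column per file, named after the list elements
df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

When the elements differ in length, sapply() returns a list instead of a matrix, which is why this approach needs the same number of relevant rows in every file.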
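The options above will not work if the number of rows between "start" and "end" differs between files. In that case, you can build the data frame by first using lapply() (similar to option 2) to generate a list of results (with list elements of different lengths), then padding the shorter list elements with NA, and finally converting the result to a data frame.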
#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(x = c(rep(0, sample(1:10, 1)), "start",
                         rep(paste0("dat", i), sample(1:10, 1)),
                         "end", rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
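#lapply
out <- lapply(allFiles, FUN = function(x) {
  idat <- read.csv(x, stringsAsFactors = FALSE)
  return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
#pad shorter list elements with NA up to the longest length
out <- lapply(out, `length<-`, max(lengths(out)))
#bind the columns and convert the resulting matrix to a data frame
outDF <- as.data.frame(do.call(cbind, out))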
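The padding step can be checked in isolation. The sketch below uses two made-up vectors of unequal length in place of data read from files; assigning a longer length to a vector with `length<-` fills the new positions with NA.

```r
# Hypothetical extracted vectors of unequal length
out <- list(
  c("start", "dat1", "end"),
  c("start", "dat2", "dat2", "dat2", "end")
)

# The longest element determines the number of rows
n <- max(lengths(out))

# `length<-` extends each vector to n elements, filling with NA
padded <- lapply(out, `length<-`, n)

# Bind column-wise and convert the resulting matrix to a data frame
outDF <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)
print(outDF)
```

Because the list in this sketch is unnamed, the columns get default names; setting names on the list beforehand (for example, the file names) would carry them through to the data frame.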