Loop through files in R and select rows by string

Question

I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.

I have been copying and pasting the code below and changing the relevant details (e.g., the file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an Excel file. I need a more efficient method to accomplish this task.

How can I:

  1. incorporate a loop that imports a large number of CSV files (to replace #1 below)

  2. select relevant rows based on a string instead of entering specific row numbers (to replace # 2 below)

  3. combine all of the relevant data into a single data frame with each file's data in one column

library(tidyr)

# 1 - import raw data files 

file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")

# 2 - select relevant rows

file1 <- as.data.frame(file1[c(41:155),])

colnames(file1) <- c("file1")

#separate components of each line from raw csv file / isolate data

temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")

temp1 <- temp1$Data

temp1 <- as.data.frame(temp1)

Tags: r, loops, read.csv

Answer


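If the number of relevant rows is the same in every file, you can proceed as follows. Option 1 shows a solution using a loop; option 2 shows one using sapply().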

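As a first step, I generate three CSV files to make the code reproducible. In each file, the first relevant row is marked by "start" and the last relevant row by "end". I then collect the names of these files with dir().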

#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(x = c(rep(0, sample(1:10,1)), "start", 
                         paste0("dat", i), 
                         "end",rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}

#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))

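Option 1 - loop

For the loop, you can first initialise the result data frame ("outDF"), with the number of columns set to the number of CSV files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then iterate over the files and fill the data frame column by column. The start and end rows can be located with which().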

#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
                         dimnames = list(NULL, allFiles)))

#loop over csv files
for (iFile in allFiles) {
  idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
  outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
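
The row selection used inside the loop can also be wrapped in a small helper. The following is only a sketch: the function name extract_between() and its arguments are hypothetical (they do not appear in the answer above), and it assumes each file has exactly one "start" row and one "end" row in a column named x. It stops with an error instead of silently returning the wrong rows when a marker is missing or duplicated.

```r
# Hypothetical helper: return the values of column `col` from the row equal to
# `from` through the row equal to `to` (both marker rows included).
extract_between <- function(dat, col = "x", from = "start", to = "end") {
  first <- which(dat[[col]] == from)
  last  <- which(dat[[col]] == to)
  # Fail early if a marker is missing, duplicated, or in the wrong order
  if (length(first) != 1L || length(last) != 1L || first > last) {
    stop("expected exactly one '", from, "' row followed by one '", to, "' row")
  }
  dat[[col]][first:last]
}

# Small in-memory example mimicking one of the generated files
demo <- data.frame(x = c("0", "0", "start", "dat1", "end", "0"),
                   stringsAsFactors = FALSE)
selected <- extract_between(demo)
```

Inside the loop, the right-hand side of the assignment could then be written as extract_between(idat).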

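Option 2 - sapply

Instead of a loop, you can use sapply() with a custom function that extracts the relevant rows from each file. This returns a matrix, which you can then convert to a data frame.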

out <- sapply(allFiles, FUN = function(x) {
  idat <- read.csv(x, stringsAsFactors = FALSE)
  return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})

outDF <- as.data.frame(out)
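
sapply() only simplifies its result to a matrix when every file yields the same number of rows. As a possible variation (not part of the answer above), vapply() makes that expectation explicit and raises an error if a file returns anything other than a character vector of the stated length. The sketch below assumes the same single-column layout with "start" and "end" markers; the temporary directory and the file names a.csv and b.csv are made up for illustration.

```r
# Write two small demo files into a temporary directory (assumed layout:
# a single column named x, with "start" and "end" marker rows)
tmp <- tempfile("csvdemo")
dir.create(tmp)
writeLines(c("x", "0", "start", "dat1", "end", "0"), file.path(tmp, "a.csv"))
writeLines(c("x", "start", "dat2", "end", "0", "0"), file.path(tmp, "b.csv"))

# full.names = TRUE returns paths that can be read from any working directory
demoFiles <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# vapply() checks that every file yields a character vector of length 3
res <- vapply(demoFiles, function(f) {
  idat <- read.csv(f, stringsAsFactors = FALSE)
  idat$x[which(idat$x == "start"):which(idat$x == "end")]
}, FUN.VALUE = character(3))

# Use the bare file names as column names, then convert to a data frame
colnames(res) <- basename(demoFiles)
resDF <- as.data.frame(res, stringsAsFactors = FALSE)
```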

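The options above will not work if the number of rows between "start" and "end" differs between files. In that case, you can build the data frame by first using lapply() (similar to option 2) to generate a list of results (with list elements of different lengths), then padding the shorter elements with NA, and finally converting the result to a data frame.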

#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(x = c(rep(0, sample(1:10,1)), "start", 
                         rep(paste0("dat", i), sample(1:10,1)), 
                         "end",rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}

#lapply
out <- lapply(allFiles, FUN = function(x) {
  idat <- read.csv(x, stringsAsFactors = FALSE)
  return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})

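#pad shorter vectors with NA so that all have the same length
out <- lapply(out, `length<-`, max(lengths(out)))
#name the columns after the files and convert the result to a data frame
names(out) <- allFiles
outDF <- as.data.frame(do.call(cbind, out), stringsAsFactors = FALSE)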
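
To see the padding step in isolation, here is a minimal sketch that works on an in-memory list, without reading any files. The list pieces and its contents are invented for illustration; they stand in for the vectors that lapply() would return for two files with different numbers of relevant rows.

```r
# Hypothetical extracted vectors of different lengths, one per file
pieces <- list(
  file1.csv = c("start", "dat1", "end"),
  file2.csv = c("start", "dat2", "dat2", "dat2", "end")
)

# The longest vector decides the number of rows in the result
n <- max(lengths(pieces))

# `length<-` extends each shorter vector to n elements, filling with NA
padded <- lapply(pieces, `length<-`, n)

# All elements now have equal length, so the list converts to a data frame
paddedDF <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Because the list elements are named, the column names of the data frame are taken from the file names.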
