
Problem description

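I am a complete beginner in R and in programming. I am currently trying to process several hundred comma-separated data files with R. For time series analysis I need to concatenate the datasets in order. Unfortunately, the data files have no dedicated timestamp column, and they start with a number of header lines. So I parse the file creation time from the second line of each data file and add time steps based on the sampling frequency given in the third line. In addition, the sampling frequency varies from file to file; the files can be identified by a regular expression pattern in the file name. The first three header lines look like this: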

SPU1 Monitor Data File
SPU Data Filename = 06Aug2021 ,07 -08 -28,s1c1h17.txt
Sample Frequency = 1

or

SPU1 Traffic Data File
SPU Data Filename = 05Aug2021 ,02 -48 -14,s1c1p2311.txt
Sample Frequency = 20

I tried both a for loop and lapply. With the for loop, the script runs only once. With lapply, I get the message below. What am I doing wrong?

Error in file(file, "rt") : invalid 'description' argument
In addition: Warning messages:
1: In n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1,  :
  file doesn't exist
2: In n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE,  :
  file doesn't exist
Called from: file(file, "rt")

Here is the code I am trying:

setwd("C:/Users/rottweiller/Desktop/Practicing R")

filenames <- list.files(path="C:/Users/rottweiller/Desktop/Practicing R", pattern="c1h|c1p", full.names=FALSE)

library(reader)
library(readr)
library(tidyverse)

AddTS <- function(filenames){
  #frq1 <- parse_number(n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2))
  frq1 <- as.integer(gsub("\\D", "", n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2)))
  TL1 <- n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE, n = 1, skip = 1)
  SUTC1 <- lubridate::parse_date_time(gsub("\\s-|\\s", "",
    stringr::str_extract(TL1, "[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}")), orders = "dmYHMS")
  C1 <- as.data.frame(read.delim(filenames[i], header = FALSE, sep = ",", skip = 79))
  C1[] <- lapply(C1, function(j) if(is.numeric(j)) ifelse(is.infinite(j), 0, j) else j)
  TS1 <- SUTC1 + (1/frq1)*seq_len(nrow(C1))
  Card1 <- cbind(TS1, C1)
}

combined <- dplyr::bind_rows(lapply(filenames, AddTS))

or

for(i in 1:length(filenames)){
    frq1 <- parse_number(n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2), trim_ws = TRUE)
    TL1 <- n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE, n = 1, skip = 1)
    SUTC1 <- lubridate::parse_date_time(gsub("\\s-|\\s", "",
                                             stringr::str_extract(TL1, "[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}")),
                                        orders = "dmYHMS")
    C1 <- as.data.frame(read.delim(filenames[i], header = FALSE, sep = ",", skip = 79))
    C1[] <- lapply(C1, function(j) if(is.numeric(j)) ifelse(is.infinite(j), 0, j) else j)
    TS1 <- SUTC1 + (1/frq1)*seq_len(nrow(C1))
    Card1 <- cbind(TS1, C1)
}

Tags: r, for-loop, lapply

Solution


It is a good start that you already know regular expressions and the recent R libraries.
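
Judging from the code in the question, a likely cause of the `lapply` error is that `AddTS` indexes its argument with `filenames[i]`, although `lapply` already passes one file name per call and `i` is not an argument of the function. If `i` is left over from an earlier `for` loop, `filenames[i]` evaluates to `NA`, which would explain the `invalid 'description' argument` error. The `for` loop, in turn, overwrites `Card1` on every iteration, so only the last file survives, which may be why it looks as if it ran once. A minimal sketch of both points, with hypothetical file names and a placeholder for the real per-file work:

```r
# Hypothetical file names, used only to illustrate the indexing problem
filenames <- c("a.txt", "b.txt", "c.txt")

# Pattern from the question: the argument is indexed with a free variable `i`
broken <- function(filenames) filenames[i]

i <- 3  # e.g. left over from an earlier for loop
res_broken <- unlist(lapply(filenames, broken))
# Each call receives a single name, so filenames[3] is NA every time

# Fix for lapply: use the argument directly
fixed <- function(f) f
res_fixed <- unlist(lapply(filenames, fixed))

# Fix for the for loop: store each result instead of overwriting one variable
results <- vector("list", length(filenames))
for (k in seq_along(filenames)) {
  results[[k]] <- data.frame(file = filenames[k])  # placeholder for the per-file work
}
combined <- do.call(rbind, results)
```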

You can do it like this:

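library(tidyverse)  # purrr, stringr, dplyr and the %>% pipe

purrr::map_dfr(filenames, function(f) {
  # Read the whole file once; passing the path avoids leaving a connection object behind
  lines <- readLines(f)

  # Sampling frequency: trailing digits of the third header line
  frq <- lines[3] %>%
    str_replace(".*?(\\d*)$", "\\1") %>%
    as.integer()

  # Start time: date and time embedded in the second header line
  SUTC <- lines[2] %>%
    stringr::str_extract("[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}") %>%
    lubridate::parse_date_time(orders = "dmYHMS")

  # Data rows start two lines after the "end of text" marker
  C <- lines[(which(lines == "end of text") + 2):length(lines)] %>%
    textConnection() %>%
    read.delim(header = FALSE, sep = ",") %>%
    # Replace infinite values with 0 in numeric columns only
    mutate(across(where(is.numeric), ~ if_else(is.infinite(.x), 0, as.numeric(.x))))

  # One time step is 1/frq seconds
  TS <- SUTC + seq_len(nrow(C)) / frq

  bind_cols(file = f, TS = TS, C)
})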
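
The header parsing above can be checked in isolation on the sample header from the question. The following is a base R sketch under stated assumptions, not the answer's method: `parse_header` is a hypothetical helper name, the time zone is assumed to be UTC, and the month is looked up in `month.abb` so the result does not depend on the locale.

```r
# Hypothetical helper: extract sampling frequency and start time from the header lines
parse_header <- function(lines) {
  # Line 3, e.g. "Sample Frequency = 20" -> 20
  frq <- as.integer(sub(".*=\\s*(\\d+)\\s*$", "\\1", lines[3]))

  # Line 2, e.g. "SPU Data Filename = 05Aug2021 ,02 -48 -14,s1c1p2311.txt"
  pattern <- "(\\d{2})([A-Za-z]{3})(\\d{4})\\s*,(\\d{2})\\s*-(\\d{2})\\s*-(\\d{2})"
  m <- regmatches(lines[2], regexec(pattern, lines[2]))[[1]]

  # m[1] is the full match; m[2..7] are day, month, year, hour, minute, second
  start <- ISOdatetime(
    year = as.integer(m[4]), month = match(m[3], month.abb), day = as.integer(m[2]),
    hour = as.integer(m[5]), min = as.integer(m[6]), sec = as.integer(m[7]),
    tz = "UTC"  # assumption: the source does not state the time zone
  )
  list(frq = frq, start = start)
}

# Sample header copied from the question
header <- c(
  "SPU1 Traffic Data File",
  "SPU Data Filename = 05Aug2021 ,02 -48 -14,s1c1p2311.txt",
  "Sample Frequency = 20"
)
h <- parse_header(header)

# Time stamps of the first three samples: one step is 1/frq seconds
ts <- h$start + seq_len(3) / h$frq
```

With a sampling frequency of 20, consecutive time stamps are 0.05 seconds apart, which is the same `SUTC + seq_len(nrow(C)) / frq` arithmetic used in the answer.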
