fread: applying select and colClasses after check.names?

Problem description

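I am trying to read a large amount of data (up to 100 files, each no larger than 1.5 GB). The format is a bit awkward and differs slightly from file to file. For speed I would like to use data.table::fread, but I have run into a number of problems.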

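My plan of attack is to read in all the columns (header only), use a regular expression to find the ones I need, and then pass those to select in fread. But now I am stuck on assigning colClasses, because the classes are assigned before the columns are selected and before the names are checked, so even a named list does not work. Is there a way to apply select/colClasses after check.names without losing my leading zeros?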

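I have tried the named-column technique for colClasses in fread, and I have also looked at "Using colClasses and select arguments of fread simultaneously", but neither copes with the differences between my files.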

Reproducible example:

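library(data.table)

# Build a small test file with duplicated column names (HH, MM) and an ID column with leading zeros
dt <- data.frame(ID = c("01","02","03"), HH = 1:3, MM = rep(0,3), HH = 2:4, MM = rep(0,3), Precipx = rnorm(3),
             other1 = rep(0,3), other2 = rep(1,3), check.names = FALSE)
write.csv(dt, "test.csv", row.names = FALSE, quote = FALSE)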

# Read only the header; check.names = TRUE renames the duplicates to HH.1 and MM.1
Colnames <- names(fread("test.csv", nrows = 0, check.names = TRUE))
ColNos <- grep("ID|HH.1|MM.1|^Precip", Colnames)
# This import works, but I lose the leading zeros
dat <- fread("test.csv", check.names = TRUE, select = ColNos)

# This tells me I have the wrong number of `colClasses`, but I cannot set classes
# for all columns because the columns vary from file to file
dat <- fread("test.csv", check.names = TRUE, select = ColNos, colClasses = c("character","character","character","numeric"))

# This doesn't recognise that I want the second HH column. Using just `"HH"` also has this problem,
# and "Precipx" will sometimes be "Precipy", "Precipz"... in the file
dat <- fread("test.csv", check.names = TRUE, select = ColNos,
  colClasses = c("ID" = "character","HH.1" = "character","MM.1" = "character","Precipx" = "numeric"))

Tags: r, data.table, fread

Solution
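One possible approach, offered as a minimal sketch rather than a confirmed answer: fread's colClasses also accepts a named list of the form type = columns, where the columns may be given as numbers. As far as I understand, those numbers refer to positions in the file, so they do not depend on check.names or on select. The names keep_pos and char_pos and the use of a temporary file are assumptions made for illustration; the patterns are anchored versions of the ones in the question.

```r
library(data.table)

# Toy file from the question, written to a temporary path (illustrative only)
csv <- tempfile(fileext = ".csv")
dt <- data.frame(ID = c("01", "02", "03"), HH = 1:3, MM = rep(0, 3),
                 HH = 2:4, MM = rep(0, 3), Precipx = rnorm(3),
                 other1 = rep(0, 3), other2 = rep(1, 3), check.names = FALSE)
write.csv(dt, csv, row.names = FALSE, quote = FALSE)

# Read only the header; check.names = TRUE de-duplicates HH/MM to HH.1/MM.1
col_names <- names(fread(csv, nrows = 0, check.names = TRUE))

# Positions (in the file) of the columns to keep ...
keep_pos <- grep("^ID$|^HH\\.1$|^MM\\.1$|^Precip", col_names)
# ... and of the subset that should be read as character
char_pos <- grep("^ID$|^HH\\.1$|^MM\\.1$", col_names)

# Give the types by position, so duplicated or varying names do not matter
dat <- fread(csv, check.names = TRUE, select = keep_pos,
             colClasses = list(character = char_pos))

# Restore the de-duplicated names explicitly, since the selected subset
# may no longer contain duplicates for check.names to rename
setnames(dat, col_names[keep_pos])
str(dat)
```

If preserving the leading zeros is the only concern, fread's keepLeadingZeros = TRUE argument may be enough on its own; the positional list above is aimed at the case where some columns without leading zeros (here HH.1 and MM.1) must also be character.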

