fread hangs on certain types of files

Problem description

Since updating R to v3.5.1 and data.table to the latest version (v1.11.18), fread() hangs when called on some files but not on others.

> test_1<-fread("Dec_1_10.csv", verbose=TRUE)

omp_get_max_threads() = 4
omp_get_thread_limit() = 2147483647
DTthreads = 0
Input contains no \n. Taking this to be a filename to open

[01] Check arguments
  Using 4 threads (omp_get_max_threads()=4, nth=4)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer

[02] Opening the file
  Opening file Dec_1_10.csv
  File opened, size = 334.9MB (351129569 bytes).
  Memory mapped ok

[03] Detect and skip BOM

[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
'. Final end-of-line is missing. Using cow page to write 0 to the last byte.

[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,NAME,GENDE>>

[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 1 lines of 26029650 fields using quote rule 0
  sep=','  with 9 lines of 31 fields using quote rule 2
  Detected 31 columns on line 2. This line is either column names or first data row. Line starts as: <<0126_V3","DSRI",>>
  Quote rule picked = 2
  fill=false and the most number of columns found is 31

[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (102965936 bytes from row 1 to eof) / (2 * 91563770 jump0size) == 0
  A line with too-many fields (31/31) was found on line 9 of sample jump 0. 
  Type codes (jump 000)    : AAAA2AAA52AAAAAAAA2AA22AAAAAA2A  Quote rule 2
Types in 1st data row match types in 2nd data row but previous row has 18402118 fields. Taking previous row as column names.  All rows were sampled since file is small so we know nrow=8 exactly

[08] Assign column names

[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AAAA2AAA52AAAAAAAA2AA22AAAAAA2A2222222222222222222222222222222222222222222222222...2222222222

[10] Allocate memory for the datatable
  Allocating 18402118 column slots (18402118 - 0 dropped) with 8 rows

[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=102965936
  Too few rows allocated. Allocating additional 1024 rows (now nrows=1032) and continue reading from jump 0

... and then it hangs here until I force-quit R.

Calling fread() on other .csv files seems to work fine, but every file I have with this particular structure/size fails to parse.
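
One way to see what these files have in common is to look at their line endings directly, since the verbose output above reports a single line of 26029650 fields and a garbled message at step [04]. The following is a minimal base-R sketch, not a confirmed diagnosis. `count_line_endings` is a hypothetical helper, and the 1 MB sample size is an arbitrary assumption.

```r
# Hypothetical helper: count line-ending bytes in the first `n` bytes
# of a file, using base R only.
count_line_endings <- function(path, n = 1e6) {
  bytes <- readBin(path, what = "raw", n = n)
  is_cr <- bytes == as.raw(0x0d)  # carriage return, \r
  is_lf <- bytes == as.raw(0x0a)  # line feed, \n
  # Shifted copies so each byte can be compared with its neighbour.
  next_is_lf <- c(is_lf[-1], FALSE)
  prev_is_cr <- c(FALSE, is_cr[-length(is_cr)])
  c(crlf    = sum(is_cr & next_is_lf),   # \r\n pairs
    lone_cr = sum(is_cr & !next_is_lf),  # \r not followed by \n
    lone_lf = sum(is_lf & !prev_is_cr))  # \n not preceded by \r
}
```

Calling `count_line_endings("Dec_1_10.csv")` on a file that parses and on one that hangs, then comparing the counts, would show whether the two groups differ in how their lines end.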

Edit: I let the R session run for several hours instead of force-quitting after a few minutes, and it eventually stopped with:

Error: vector memory exhausted (limit reached?)
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In FUN(X[[i]], ...) :
  Detected 5471442 column names but the data has 31 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
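
The two warnings above name `quote=""` and `fill=TRUE` as options to try. A minimal sketch of trying both on a small sample is given here. `read_sample` is a hypothetical helper, the 100-row limit is arbitrary, and it is untested whether this avoids the hang on the original file.

```r
library(data.table)

# Hypothetical helper: read only the first `n` rows using the options
# that the fread warnings suggest.
read_sample <- function(path, n = 100) {
  fread(path,
        quote = "",      # treat quote characters as ordinary text
        fill = TRUE,     # pad short rows instead of stopping
        nrows = n,       # limit how many rows are read
        verbose = FALSE)
}
```

If `read_sample("Dec_1_10.csv")` returns quickly, checking `ncol()` of the result against the expected 31 columns would indicate whether the quoting rule is what leads fread to see millions of fields on one line.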

I have tried skipping the first row of data and specifying the column names. Neither seems to get past the problem.

Tags: r, data.table, fread

Solution

