Parsing a stupid data frame with read.table?

Problem description

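I ran a fairly large experiment and saved all of the data to a csv file, but the data appears to be in a... stupid format. The simulation takes days to run and I cannot rerun it, so I am curious whether there is anything I can do in R to extract the data from the file in its current form.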

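I can't seem to attach the file to this question, so I will explain the predicament as best I can. The csv file is a single column of data: everything is contained in one column. Opened in Excel, the first entry, A1, contains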

[run number],"agent-preference","infection-rate","length-of-patch","incubation-rate","recovery-rate","init-infected","prop-move","total-pop","neighborhood-radius","move-distance","infective-radius","time-to-disease-spread","init-vaccinated","[step]","precision ((count turtles with [disease-status = 4]) / total-pop) 4"

All entries below this, in cells A2-A1000, contain data in the same packed format, for example

2,"none","0.75","100","0.1","0.12","0.002","0","2000","10","1","2","0","0","222","0.9815"

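That is, each cell contains all of its data in one long comma-separated string (the stupid format mentioned above). To get around this, I thought I could use read.table, define my own column names (to bypass the mess in A1), and let the comma denote the separator, as follows: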

my.df <- read.table("run_1.csv", header = FALSE,
                    col.names = c("run_number", "agent_preference", "infection_rate", "length_of_patch",
                                  "incubation_rate", "recovery_rate", "init_infected", "prop_move", "total_pop",
                                  "neighborhood_radius", "move_distance", "infective_radius", "time_to_disease_spread",
                                  "init_vaccinated", "step", "outbreak_prop"),
                    sep = ",",    # define the separator between columns
                    colClasses = c("character", "character", "factor", "integer", "factor", "factor",
                                   "factor", "factor", "integer", "factor", "factor", "factor", "factor",
                                   "factor", "factor", "factor"),
                    fill = TRUE)  # add blank fields if rows have unequal length

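Note that I bypass the funky formatting of A1 by specifying my own column names, and I try to predefine the column classes to help things along. Unfortunately, this does not work, and I end up with the following (using a single row of the data frame as an example):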

> my.df[1,]
                                     run_number
1 2,"none","0.75","100","0.1","0.12","0.002","0","2000","10","1","2","0","0","222","0.9815"
  agent_preference infection_rate
1                                
  length_of_patch incubation_rate
1              NA                
  recovery_rate init_infected prop_move
1                                      
  total_pop neighborhood_radius
1        NA                    
  move_distance infective_radius
1                               
  time_to_disease_spread init_vaccinated
1                                       
  step outbreak_prop
1    

If I look at the first entry in this row, I get

> my.df[1,1]
[1] "2,\"none\",\"0.75\",\"100\",\"0.1\",\"0.12\",\"0.002\",\"0\",\"2000\",\"10\",\"1\",\"2\",\"0\",\"0\",\"222\",\"0.9815\""

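This is wrong because (1) I want the individual values to be spread across the whole row rather than all sitting in the first entry, and (2) I am not sure where the backslashes are being introduced...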

Any help would be greatly appreciated.

Tags: r, dataframe, csv

Solution
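One possible approach is a minimal sketch under an assumption the question does not confirm: that each line of the file is wrapped in one pair of outer quotes, with the inner quotes doubled (for example `"2,""none"",""0.75"",..."`). That layout would be consistent with the output shown, because read.table then sees one quoted field per line, puts the whole string in the first column, and fills the rest with blanks or NA. It is worth checking with `readLines("run_1.csv", n = 2)` first. If the assumption holds, the file can be parsed in two passes: the first pass strips the outer quoting, the second splits the packed strings on commas. The temporary file and the names `inner`, `path`, and `packed` are hypothetical stand-ins; the two sample lines and the column names are copied from the question.

```r
# Column names copied from the read.table call in the question.
cols <- c("run_number", "agent_preference", "infection_rate", "length_of_patch",
          "incubation_rate", "recovery_rate", "init_infected", "prop_move",
          "total_pop", "neighborhood_radius", "move_distance", "infective_radius",
          "time_to_disease_spread", "init_vaccinated", "step", "outbreak_prop")

# The header and data row as they appear in cells A1 and A2 of the question.
inner <- c(
  '[run number],"agent-preference","infection-rate","length-of-patch","incubation-rate","recovery-rate","init-infected","prop-move","total-pop","neighborhood-radius","move-distance","infective-radius","time-to-disease-spread","init-vaccinated","[step]","precision ((count turtles with [disease-status = 4]) / total-pop) 4"',
  '2,"none","0.75","100","0.1","0.12","0.002","0","2000","10","1","2","0","0","222","0.9815"'
)

# ASSUMPTION: on disk each line is one quoted field with doubled inner quotes.
# Build a temporary file in that layout as a stand-in for run_1.csv.
path <- tempfile(fileext = ".csv")
writeLines(paste0('"', gsub('"', '""', inner, fixed = TRUE), '"'), path)

# Pass 1: read the file as a single character column. The CSV reader removes
# the outer quotes and collapses the doubled inner quotes.
packed <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)[[1]]

# Pass 2: parse the packed strings as CSV text, dropping the messy header
# line and supplying clean column names. Column types are left to R to infer.
my.df <- read.csv(text = packed[-1], header = FALSE, col.names = cols,
                  stringsAsFactors = FALSE)

str(my.df)
```

For the real data, `path` would be replaced by `"run_1.csv"`. Leaving out `colClasses` is deliberate: the numeric values are quoted in the file, and R converts quoted fields to numbers only when it infers the column types itself. If `readLines` shows plain, unwrapped lines instead, the first pass is unnecessary and a single `read.csv("run_1.csv", header = FALSE, skip = 1, col.names = cols)` should be enough.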

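On the second point in the question, the backslashes are most likely not in the data at all. When R prints a character string it wraps the string in double quotes and escapes any double quotes inside it, so `"` is displayed as `\"`. The short check below uses a shortened, hypothetical string `x` to show the difference between the printed form and the actual contents.

```r
# A shortened stand-in for the value of my.df[1,1] in the question.
x <- '2,"none","0.75"'

print(x)       # displays the escaped form: [1] "2,\"none\",\"0.75\""
cat(x, "\n")   # displays the raw contents: 2,"none","0.75"

# The string contains no backslash characters.
has_backslash <- grepl("\\", x, fixed = TRUE)
has_backslash  # FALSE
nchar(x)       # 15, counting each quote once and no backslashes
```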
