首页 > 解决方案 > R有效地将具有字符列表的数据帧转换为具有正确R类型的列表

问题描述

我使用 crateDB 将表作为数据帧加载到 R 中。问题是,crateDB 将数组作为逗号分隔的字符串发送。因此,我想将所有数组转换为正确的 R 类型。我还想将数据框转换为列表,因为它可以使用 crateDB 中的对象,这不适用于数据框。目前这种转换太慢了,所以我尝试了几种方法来提高性能。

如果我有以下数据框:

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  position = c(
    "{\"82.81864\",\"82.586235\",\"82.35383\"}",
    "{\"83.81864\",\"83.586235\",\"83.35383\"}",
    "{\"84.81864\",\"84.586235\",\"84.35383\"}",
    "{\"85.81864\",\"85.586235\",\"85.35383\"}",
    "{\"86.81864\",\"86.586235\",\"86.35383\"}",
    "{\"87.81864\",\"87.586235\",\"87.35383\"}",
    "{\"88.81864\",\"88.586235\",\"88.35383\"}",
    "{\"89.81864\",\"89.586235\",\"89.35383\"}",
    "{\"90.81864\",\"90.586235\",\"90.35383\"}",
    "{\"91.81864\",\"91.586235\",\"91.35383\"}"
  ),
  vcontrol = c(
    "{\"t\",\"t\",\"t\",\"t\"}","{\"f\",\"f\",\"f\",\"t\"}",
    "{\"f\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
    "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"f\",\"f\",\"t\"}",
    "{\"t\",\"f\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
    "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"f\"}"
  )
)

转换后的结果列表应如下所示: 转换列表

我从两个 for 循环开始,这对于大数据集来说真的很慢。然后我尝试了应用功能:

convertDF = function(dataFrame, dataTypes){
  dimension <- dim(x = dataFrame)
  names <- names(x = dataFrame)
  
  asList <- lapply(dataFrame, as.list)
  
  for(row in seq_len(length(asList))){
    asList[[row]] <- lapply(asList[[x]], convertToRType, type = dataTypes[row])
  }
  
  data <- list()
  for(datarow in seq_len(dimension[1])){
    tempData <- list()
    for(datacol in seq_len(dimension[2])){
      tempData[[names[datacol]]] <- asList[[datacol]][[datarow]]
    }
    data[[datarow]] <- tempData
  }
  return(data)
}

convertToRType 函数使用数据库中使用的类型作为参数,因此它可以判断是将其转换为整数、双精度还是逻辑。我使用 if 相同的检查来执行此操作。对于数组,我首先删除所有不需要的字符,在 处拆分,然后在整个向量上使用 as.double 。

  data <- str_replace_all(
    string = rawData,
    pattern = c("\\{" = "", "\\}" = "", "\"" = "")
  )
  data <- str_split(string = data, pattern = ",")[[1]]

我这样做是因为我想使用 lapply 的多线程功能,但结果在 Windows 上它甚至更慢。但是有了这个功能,转换只需要以前的一半时间。不过,我仍然不喜欢这个解决方案。代码不干净,似乎也不是很好的表现。

谁能告诉我如何尽可能有效地进行这种转换?我的想法不多了。

标签: rperformancedataframelapply

解决方案


这是一个解决方案dplyr,用于将数据框内的正确类型转换purrrtranspose列表和simplify. 注意像“t”这样的“陷阱”<>TRUE

library(dplyr)
library(stringr)
library(purrr)
x <- 
   df %>% 
   mutate(position = str_replace_all(
      string = .$position,
      pattern = c("\\{" = "", "\\}" = "", "\"" = "")
   ) %>% str_split(string = ., pattern = ",")
   ) %>%
   mutate(vcontrol = str_replace_all(
      string = .$vcontrol,
      pattern = c("\\{" = "", "\\}" = "", "\"" = "")
   ) %>%
      str_replace_all(string = ., c("t" = "TRUE", 
                                             "f" = "FALSE")) %>%
      str_split(string = ., pattern = ",")) %>%
   rowwise() %>%
   mutate(position = list(as.numeric(unlist(position)))) %>%
   mutate(vcontrol = list(as.logical(unlist(vcontrol))))

converted_df <- transpose(x) %>% simplify_all()
str(converted_df)
#> List of 10
#>  $ :List of 3
#>   ..$ id      : num 1
#>   ..$ position: num [1:3] 82.8 82.6 82.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE TRUE TRUE
#>  $ :List of 3
#>   ..$ id      : num 2
#>   ..$ position: num [1:3] 83.8 83.6 83.4
#>   ..$ vcontrol: logi [1:4] FALSE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 3
#>   ..$ position: num [1:3] 84.8 84.6 84.4
#>   ..$ vcontrol: logi [1:4] FALSE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 4
#>   ..$ position: num [1:3] 85.8 85.6 85.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 5
#>   ..$ position: num [1:3] 86.8 86.6 86.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 6
#>   ..$ position: num [1:3] 87.8 87.6 87.4
#>   ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 7
#>   ..$ position: num [1:3] 88.8 88.6 88.4
#>   ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 8
#>   ..$ position: num [1:3] 89.8 89.6 89.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 9
#>   ..$ position: num [1:3] 90.8 90.6 90.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 10
#>   ..$ position: num [1:3] 91.8 91.6 91.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE FALSE

您的数据

df <- data.frame(
   id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
   position = c(
      "{\"82.81864\",\"82.586235\",\"82.35383\"}",
      "{\"83.81864\",\"83.586235\",\"83.35383\"}",
      "{\"84.81864\",\"84.586235\",\"84.35383\"}",
      "{\"85.81864\",\"85.586235\",\"85.35383\"}",
      "{\"86.81864\",\"86.586235\",\"86.35383\"}",
      "{\"87.81864\",\"87.586235\",\"87.35383\"}",
      "{\"88.81864\",\"88.586235\",\"88.35383\"}",
      "{\"89.81864\",\"89.586235\",\"89.35383\"}",
      "{\"90.81864\",\"90.586235\",\"90.35383\"}",
      "{\"91.81864\",\"91.586235\",\"91.35383\"}"
   ),
   vcontrol = c(
      "{\"t\",\"t\",\"t\",\"t\"}","{\"f\",\"f\",\"f\",\"t\"}",
      "{\"f\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
      "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"f\",\"f\",\"t\"}",
      "{\"t\",\"f\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
      "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"f\"}"
   )
)


推荐阅读