Reverse summarize to expand comma-separated strings in a data frame

Question

I have the following data frame:

group = c("cat", "dog", "horse")
value = c("1", "2", "3")
list = c("siamese,burmese,balinese","corgi,sheltie,collie","arabian,friesian,andalusian" )
df = data.frame(group, value, list)

df
  group value                        list
1   cat     1    siamese,burmese,balinese
2   dog     2        corgi,sheltie,collie
3 horse     3 arabian,friesian,andalusian

and I am trying to get this:

  group value       list
1   cat     1    siamese
2   cat     1    burmese
3   cat     1   balinese
4   dog     2      corgi
5   dog     2    sheltie
6   dog     2     collie
7 horse     3    arabian
8 horse     3   friesian
9 horse     3 andalusian

I know how to summarize a data frame, but I now realize that I don't know how to "un-summarize" one that holds comma-separated strings.

Tags: r

Answer


data.frame(
  group = c("cat", "dog", "horse"),
  value = c("1", "2", "3"),
  list = c("siamese,burmese,balinese","corgi,sheltie,collie","arabian,friesian,andalusian"),
  stringsAsFactors = FALSE
) -> xdf

tidyverse:

tidyr::separate_rows(xdf, list, sep=",")
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian
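
If the real data has a space after each comma (an assumption; the sample above does not), a plain "," separator would leave leading spaces in the pieces. Because separate_rows() treats sep as a regular expression, a pattern such as ",\\s*" can absorb them. A minimal sketch, where `spaced` and `trimmed` are hypothetical names made up for illustration:

```r
# Hypothetical input: same shape as xdf, but with a space after each comma.
spaced <- data.frame(
  group = c("cat", "dog"),
  value = c("1", "2"),
  list = c("siamese, burmese", "corgi, sheltie"),
  stringsAsFactors = FALSE
)

# sep is a regular expression, so ",\\s*" matches the comma together with
# any whitespace that follows it.
trimmed <- tidyr::separate_rows(spaced, list, sep = ",\\s*")

trimmed$list
```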

Base R:

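do.call(
  rbind.data.frame,
  # Build one small data frame per input row, then stack them with rbind.
  lapply(1:nrow(xdf), function(idx) {

    data.frame(
      # group and value are single values, recycled once per piece
      group = xdf[idx, "group"],
      value = xdf[idx, "value"],
      # strsplit() returns a list; [[1]] extracts the character vector
      list = strsplit(xdf[idx, "list"], ",")[[1]],
      stringsAsFactors = FALSE
    )

  })
)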
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian
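
The row-by-row loop above can also be written without lapply() by splitting every string once and repeating each row index once per piece. This is an alternative sketch, not part of the original answer and not included in the timings on this page; `parts` and `expanded` are names made up for illustration.

```r
xdf <- data.frame(
  group = c("cat", "dog", "horse"),
  value = c("1", "2", "3"),
  list = c("siamese,burmese,balinese", "corgi,sheltie,collie",
           "arabian,friesian,andalusian"),
  stringsAsFactors = FALSE
)

# Split every string in one call; fixed = TRUE treats "," as a literal
# character rather than a regular expression.
parts <- strsplit(xdf$list, ",", fixed = TRUE)

# Repeat each row index once per piece, then overwrite the list column
# with the flattened pieces.
expanded <- xdf[rep(seq_len(nrow(xdf)), lengths(parts)), ]
expanded$list <- unlist(parts)
rownames(expanded) <- NULL  # drop row names such as "1.1" left by the repetition

expanded
```

Indexing with repeated row numbers copies every other column automatically, so the sketch does not need to name group and value explicitly.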

Benchmark:

library(magrittr)  # provides the %>% pipe used in the unnest expression

microbenchmark::microbenchmark(

  unnest = transform(xdf, list = strsplit(list, ",")) %>%
    tidyr::unnest(list),

  separate_rows = tidyr::separate_rows(xdf, list, sep=","),

  base = do.call(
    rbind.data.frame,
    lapply(1:nrow(xdf), function(idx) {

      data.frame(
        group = xdf[idx, "group"],
        value = xdf[idx, "value"],
        list = strsplit(xdf[idx, "list"], ",")[[1]],
        stringsAsFactors = FALSE
      )

    })
  )
)
## Unit: microseconds
##           expr      min        lq     mean   median        uq       max neval
##         unnest 3689.890 4280.7045 6326.231 4881.160  6428.508 16670.715   100
##  separate_rows 5093.618 5602.2510 8479.712 6289.193 10352.847 24447.528   100
##           base  872.343  975.1615 1589.915 1099.391  1660.324  6663.132   100

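I'm continually surprised by how poorly tidyr operations perform: on this three-row example, the base R version has a median time roughly four to six times lower than either tidyr approach.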

