Split delimited strings into new rows with type conversion set to TRUE

Question

data.table::tstrsplit has a useful type.convert argument. However, it fails when the split values in different rows convert to different classes. See the example:

library(data.table)

x <- fread("CHROM POS REF ALT TYPE AF
chr1 1 A T MISSENSE 0.23
chr2 1 A T,G MISSENSE 0.17,0.09")

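In the ALT column we have "T" and "T,G", so the first row is converted to logical TRUE, while the second row is split and converted to character "T" and "G". As a result, the following call fails: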

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE, type.convert = TRUE))),
  by = .(CHROM, POS, REF, TYPE)]

# Error in `[.data.table`(x, , lapply(.SD, function(x) unlist(tstrsplit(x,  : 
#   Column 1 of result for group 2 is type 'character' but expecting type
#   'logical'. Column types must be consistent for each group.
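
The type mismatch behind this error can be reproduced in isolation. The following is a minimal sketch using base R's utils::type.convert with as.is = TRUE, on the assumption that tstrsplit applies the same conversion to each split piece; the variable names one and two are introduced here for illustration.

```r
# A lone "T" is one of the strings type.convert() reads as a logical value
one <- type.convert("T", as.is = TRUE)
print(class(one))

# With "G" present the vector cannot be logical, so it stays character
two <- type.convert(c("T", "G"), as.is = TRUE)
print(class(two))
```

Since each group is converted separately, group 1 yields a logical ALT and group 2 a character ALT, which is the inconsistency the error message reports.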

We can skip the automatic conversion and convert manually instead, and everything works:

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE))),
  by = .(CHROM, POS, REF, TYPE)][, .(CHROM, POS, REF, ALT, TYPE, AF = as.numeric(AF))]
#    CHROM POS REF ALT     TYPE   AF
# 1:  chr1   1   A   T MISSENSE 0.23
# 2:  chr2   1   A   T MISSENSE 0.17
# 3:  chr2   1   A   G MISSENSE 0.09

But tidyr::separate_rows does not have this problem:

tidyr::separate_rows(x, ALT, AF, convert = TRUE)
# # A tibble: 3 x 6
#   CHROM   POS REF   ALT   TYPE        AF
#   <chr> <int> <chr> <chr> <chr>    <dbl>
# 1 chr1      1 A     T     MISSENSE  0.23
# 2 chr2      1 A     T     MISSENSE  0.17
# 3 chr2      1 A     G     MISSENSE  0.09

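Question: is there a better data.table way to do this? I need type conversion because the AF column has to be numeric. I want to split the delimited columns at the same time. In the real data, more than 2 columns may contain the delimiter.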


Answer


This can be done more easily with cSplit:

library(splitstackshape)
cSplit(x, c("ALT", "AF"), ",", "long")
#   CHROM POS REF ALT     TYPE   AF
#1:  chr1   1   A   T MISSENSE 0.23
#2:  chr2   1   A   T MISSENSE 0.17
#3:  chr2   1   A   G MISSENSE 0.09

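As for a data.table option, another approach is to prepend a space to each T or F so that type.convert no longer reads a lone "T" as logical TRUE, and then strip the space again with trimws: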

x[, lapply(.SD, function(x) 
 trimws(unlist(tstrsplit(gsub("([TF])+", " \\1", x), ",", 
    fixed = TRUE, type.convert = TRUE)))),
    by = .(CHROM, POS, REF, TYPE)]
#   CHROM POS REF     TYPE ALT   AF
#1:  chr1   1   A MISSENSE   T 0.23
#2:  chr2   1   A MISSENSE   T 0.17
#3:  chr2   1   A MISSENSE   G 0.09
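
One further variation, offered as a sketch rather than as part of the original answer: split inside the groups without any conversion, then run type.convert once over each complete column. Because trimws() returns a character vector, the approach above may leave AF as character, whereas converting afterwards should give a numeric AF. The name split_cols is introduced here for illustration, the data is rebuilt by hand so the block is self-contained, and the sketch assumes every delimited column in a row splits into the same number of pieces.

```r
library(data.table)

# Same data as in the question, built directly instead of via fread()
x <- data.table(
  CHROM = c("chr1", "chr2"),
  POS   = c(1L, 1L),
  REF   = c("A", "A"),
  ALT   = c("T", "T,G"),
  TYPE  = c("MISSENSE", "MISSENSE"),
  AF    = c("0.23", "0.17,0.09")
)

# Hypothetical helper name: the delimited columns to expand
split_cols <- c("ALT", "AF")

# Step 1: split per group, keeping every piece as character
res <- x[, lapply(.SD, function(v) unlist(strsplit(as.character(v), ",", fixed = TRUE))),
         by = .(CHROM, POS, REF, TYPE), .SDcols = split_cols]

# Step 2: convert each complete column once, so all rows share one type
res[, (split_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = split_cols]

print(res)
print(sapply(res, class))
```

A column whose values are all "T" or "F" would still be converted to logical in step 2, so this only helps when the column as a whole has an unambiguous type.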
