Extracting characters and numbers from a string in R

Problem description

This is part of my data frame.

> df
    Group Direction cytoband  q value residual q value      wide peak boundaries
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554

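I want to extract the characters or digits that follow "chr" in the 'wide peak boundaries' column. I tried the code below, but the second row comes back as NA.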

library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'), 
              '(\\d+)+:(\\d+)+-(\\d+)', remove = F, convert = T)
df
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  NA        NA        NA
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554

Data

structure(list(Group = c("All", "All", "All", "All", "All"), 
    Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25", 
    "Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43", 
    "3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39", 
    "1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622", 
    "chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503", 
    "chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L), 
    start = c(130906630L, NA, 87745632L, 33050952L, 3230287L), 
    end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29", 
"V30", "V31", "V32", "V33"))

Tags: r

Solution


You only need to change \\d in the first capture group to \\w (\\d matches only digits, whereas \\w matches letters, digits, and the underscore).
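As a quick check of that difference, here is a minimal base R sketch. The vector name labels and its values are illustrative only, taken from the chromosome names in the question's data; the anchors ^ and $ are added so that each pattern has to cover the whole label.

```r
# Chromosome labels as they appear after the "chr" prefix (illustrative values)
labels <- c("11", "X", "10", "22")

# \\d+ accepts digits only, so the non-numeric label "X" does not match
digits_only <- grepl("^\\d+$", labels, perl = TRUE)

# \\w+ accepts letters, digits and the underscore, so every label matches
word_chars <- grepl("^\\w+$", labels, perl = TRUE)

print(data.frame(labels, digits_only, word_chars))
```

This is why the row for chrX is the only one that ends up as NA with the original pattern.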

Edit: (?<=chr) is a positive lookbehind; it ensures that \\w+ starts matching only immediately after the string chr:

df %>% 
  extract(col = 'wide peak boundaries', 
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)', 
          remove = FALSE, convert = TRUE)
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
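
For comparison, the same pattern can be tried without tidyr. The following is only a sketch in base R, not part of the original answer: the names x, pattern, parts and res are made up for illustration, and perl = TRUE is passed because the lookbehind needs a PCRE-style engine.

```r
# Two sample values from the 'wide peak boundaries' column
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# Same regex as in the answer: lookbehind for "chr", then three capture groups
pattern <- "((?<=chr)\\w+):(\\d+)-(\\d+)"

# regexec() returns the full match plus each capture group for every string
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match (first element) and keep only the three groups
parts <- do.call(rbind, lapply(m, function(p) p[-1]))

res <- data.frame(
  chr   = parts[, 1],              # stays character because of "X"
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
print(res)
```

Unlike extract() with convert = TRUE, the conversion to integer is done explicitly here, and chr is deliberately left as character.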
