r - Extracting characters and digits from a string using R
Problem description
This is part of my data frame.
> df
Group Direction cytoband q value residual q value wide peak boundaries
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554
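I want to extract the characters or digits that follow "chr" in the "wide peak boundaries" column. I tried the code below, but the second row comes back as NA.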
library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'),
'(\\d+)+:(\\d+)+-(\\d+)', remove = F, convert = T)
df
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 NA NA NA
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
Data
structure(list(Group = c("All", "All", "All", "All", "All"),
Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25",
"Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43",
"3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39",
"1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622",
"chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503",
"chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L),
start = c(130906630L, NA, 87745632L, 33050952L, 3230287L),
end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29",
"V30", "V31", "V32", "V33"))
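Solution
You only need to change \\d to \\w in the first capture group (\\d matches only digits, whereas \\w matches letters, digits, and the underscore).
Edit: (?<=chr) is a positive lookbehind; it ensures that \\w starts matching only right after the string chr: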
df %>%
extract(col = 'wide peak boundaries',
into = c('chr', 'start', 'end'),
regex = '((?<=chr)\\w+):(\\d+)-(\\d+)',
remove = FALSE, convert = TRUE)
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
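For comparison, here is a minimal base R sketch of the same idea, not the answer's own method. It first shows why the digit-only pattern misses the chrX row, then pulls out the three pieces with regexec() and regmatches(). The names x, pattern, parts, and peaks are made up for illustration, and the sketch assumes every value has the form chr<name>:<start>-<end>; a value that does not match would need separate handling.

```r
# Two values copied from the "wide peak boundaries" column above.
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# A digit-only first group cannot match "X", hence the NA row in the question.
digit_only <- grepl("(\\d+):(\\d+)-(\\d+)", x)

# Assumption: every value looks like chr<name>:<start>-<end>.
# Putting the literal "chr" outside the group plays the role of the lookbehind.
pattern <- "^chr(\\w+):(\\d+)-(\\d+)$"

# regexec() locates the full match and each capture group;
# regmatches() turns those positions into character vectors.
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (element 1) and keep the three groups per row.
parts <- do.call(rbind, lapply(m, function(v) v[-1]))

peaks <- data.frame(
  chr   = parts[, 1],              # stays character because of "X"
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
```

Keeping chr as a character column is deliberate here: a column that mixes values such as 11 and X cannot be stored as integers without losing the X row.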