Extracting text between variable delimiters

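Problem description

I have a large amount of text containing special characters, and I want to extract certain substrings from it: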

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

I want to extract everything between the "tags", for example the substrings <rep> ...</rep>, <dir> ...</dir>, or <icu> ...</icu>.

This regex gets me most of the way there:

library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

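Only [[2]] is not as expected: it still contains unwanted material (namely <#> potentially more stuff), and the two occurrences of the <rep> ...</rep> substring are not separated by ", ". My hunch is that my regex fails here because the two tags are identical rather than different.

How can the regex be improved to produce the expected result?

Expected result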

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

Edit

In the meantime, I have found a working solution:

lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
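
The change that matters is the quantifier. A greedy .* runs to the last closing tag that satisfies the backreference, so two identical tag pairs in one string are swallowed as a single match, whereas the lazy .*? stops at the first one. The lookahead (?!<\\1>) in the first attempt only checks the position immediately after the opening tag, so it does not prevent this. Below is a minimal sketch of the difference. It assumes base R (regmatches and gregexpr with perl = TRUE) instead of stringr so that it runs without extra packages; the input x and the helper name extract_all are made up for illustration.

```r
# Shortened, made-up input containing two identical tag pairs
x <- "a <rep> one </rep> b <rep> two </rep> c"

# Hypothetical helper: return all matches of a PCRE pattern in one string
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Greedy: .* extends to the LAST </rep>, giving a single long match
greedy <- extract_all("<([a-z]{3})>.*</\\1>", x)

# Lazy: .*? stops at the FIRST </rep>, giving one match per tag pair
lazy <- extract_all("<([a-z]{3})>.*?</\\1>", x)

print(greedy)
print(lazy)
```

Here greedy holds one element spanning both pairs, while lazy holds the two pairs separately, which is the behaviour wanted for [[2]].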

Tags: r, regex

Solution

How about this?

unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))

# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"                         
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>"    "<rep> I 1lIved in Lisburn </rep>"                          
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>"    "<icu> Yeah </icu>"     

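Basically, all we do here is put the body of the (opening) tag, plus its trailing angle bracket, into a capture group and use that capture group to define the closing tag. We then capture everything between those two instances of the capture group. Like so: <(tag>)whatever</\\1, where \1 is tag>.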
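
To see what the capture group actually holds, the groups of the first match can be inspected directly. This is only a sketch: it uses base R (regexec with perl = TRUE) rather than stringr, a shortened made-up input x, and the same pattern with the unnecessary backslash escapes dropped.

```r
# Shortened, made-up input with two different tag pairs
x <- "some stuff <rep> I know </rep> more <dir> where? </dir>"

# Group 1 captures the tag name plus the trailing ">";
# group 2 is "</" followed by a backreference to group 1
pattern <- "<([A-Za-z0-9_]+>).*?(</\\1)"

# regexec returns the full match followed by each capture group (first match only)
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

print(m[1])  # full match
print(m[2])  # group 1: "rep>"
print(m[3])  # group 2: "</rep>"
```

Because group 1 already includes the closing angle bracket, the backreference in group 2 reproduces the complete closing tag without a separate > in the pattern.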

Edit:

I think this should do it:

lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")

# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
