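r - Extract text between variable delimiters
Problem description
I have text with lots of special characters, from which I want to extract certain substrings: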
y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
"some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
"<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")
I'd like to extract anything that sits between "tags", for example the substrings <rep> ...</rep>, <dir> ...</dir>, or <icu> ...</icu>, and so on.
With this regex, I've had partial success:
library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
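Only [[2]] is not as expected: it still contains unwanted material (namely <#> potentially more stuff), and the two occurrences of the <rep> ...</rep> substring are not separated by ", ". My hunch is that my regex fails here because the two tags are identical rather than different.
How can the regex be improved to get the expected result?
Expected result: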
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
EDIT:
In the meantime, I've found a solution that works:
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
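The difference between the two patterns comes down to greedy versus lazy matching. The following is a minimal sketch on a made-up toy string (the variable names x, greedy, and lazy are illustrative and not part of the original data), showing why .* swallows both <rep> elements while .*? keeps them apart:

```r
library(stringr)

# Hypothetical toy input: two <rep> elements with unrelated text in between
x <- "a <rep> one </rep> b <rep> two </rep> c"

# Greedy: .* runs as far as possible, so the match ends at the LAST </rep>
greedy <- str_extract_all(x, "<([a-z]{3})>.*</\\1>")[[1]]

# Lazy: .*? stops at the FIRST </rep> that closes the opening tag
lazy <- str_extract_all(x, "<([a-z]{3})>.*?</\\1>")[[1]]

length(greedy)  # one merged match
length(lazy)    # two separate matches
```

With the greedy quantifier, the text between the two elements ends up inside a single match, which is the behaviour seen in [[2]] above.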
Solution
How about this?
unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>" "<rep> I 1lIved in Lisburn </rep>"
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>" "<icu> Yeah </icu>"
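Basically, all we do here is put the body of the (opening) tag, plus its trailing angle bracket, into a capture group and reuse that capture group to define the closing tag. We then lazily match everything between those two instances of the capture group, so the match stops at the first matching closing tag. Schematically: <(tag>)whatever</\\1, where \1 is tag>.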
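To see what the capture group actually holds, one option is str_match_all(), which returns the full match and each group side by side. This is only a sketch on a made-up string; the pattern is the one above with the optional backslashes before <, > and / left out, on the assumption that these characters are literal in stringr's ICU regex engine:

```r
library(stringr)

# Hypothetical toy input with two different tags
x <- "pre <rep> one </rep> mid <dir> two</dir> post"

# Group 1 captures the tag name together with its closing ">"
m <- str_match_all(x, "<([A-Za-z0-9_]+>).*?</\\1")[[1]]

m[, 1]  # full matches, one per element
m[, 2]  # contents of capture group 1 for each match
```

Each row of m is one match; the second column shows the text that \\1 must repeat after the "</" of the closing tag.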
EDIT:
I think this should do it:
lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")
# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
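If stringr is not available, roughly the same result can probably be had in base R with gregexpr() and regmatches(). The sketch below is an alternative, not the approach used above; it assumes perl = TRUE so that the lazy quantifier and the backreference are handled by PCRE, and it uses a made-up input vector x:

```r
# Hypothetical toy input; the names x, hits, and res are illustrative only
x <- c("a <rep> one </rep> b <rep> two </rep>", "<icu> Yeah </icu> tail")

# gregexpr() locates every match, regmatches() pulls out the matched text
hits <- regmatches(x, gregexpr("<([A-Za-z0-9]+)>.*?</\\1>", x, perl = TRUE))

# Collapse the matches of each input string into one comma-separated string
res <- lapply(hits, paste, collapse = ", ")
res
```

As in the stringr version, the result is a list with one collapsed string per element of the input vector.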