How to make lookbehind and lookahead optional in R

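Problem description

I want to extract the text between de and en, and also the full text of strings that contain no de or en at all. I am not very good with regular expressions, but after reading about lookahead and lookbehind I managed to get part of what I want. Now I need to make them optional, and nothing I have tried works. Any help would be greatly appreciated!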

library(stringr)
(sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}',     'extract this one',     '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one"))
#> [1] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [2] "extract this one"                                  
#> [3] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [4] "p (340) extract this one"

str_extract_all(string = sstring, pattern = "(?<=.de\":\").*(?=.,\"en\":)")
#> [[1]]
#> [1] "extract this one"
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "extract this one"
#> 
#> [[4]]
#> character(0)

Desired output:

#> [1] "extract this one"         "extract this one"        
#> [3] "extract this one"         "p (340) extract this one"

Created on 2020-05-08 by the reprex package (v0.3.0)

Tags: r, regex, stringr

Solution


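I suggest a pattern that matches either the whole string, when it does not contain the {"de":" substring, or the run of 1+ characters other than " that follows {"de":" :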

(?<=\{"de":")[^"]+|^(?!.*\{"de":").+

See the regex demo.

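Details

  • (?<=\{"de":") - a positive lookbehind that matches a position immediately preceded by {"de":"
  • [^"]+ - then matches 1+ characters other than "
  • | - or
  • ^ - at the start of the string
  • (?!.*\{"de":") - a negative lookahead that makes sure {"de":" does not occur anywhere in the string, and
  • .+ - matches 1+ characters other than line break characters, as many as possible.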
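To see how the two alternatives divide the work, each branch can be run on its own. This is only an illustrative sketch: the names branch_de and branch_plain are made up here, and the sample vector is a shortened copy of the one in the question.

```r
library(stringr)

# Shortened copy of the sample input from the question
sstring <- c('{"de":"extract this one","en":"some text"}',
             'extract this one',
             'p (340) extract this one')

# Branch 1 (hypothetical name): text right after {"de":" up to the next double quote
branch_de <- '(?<=\\{"de":")[^"]+'

# Branch 2 (hypothetical name): the whole string, but only if {"de":" occurs nowhere in it
branch_plain <- '^(?!.*\\{"de":").+'

res_de    <- str_extract(sstring, branch_de)
res_plain <- str_extract(sstring, branch_plain)

print(res_de)
print(res_plain)
```

Each input is matched by exactly one branch and yields NA for the other, which is why joining the two with | returns one result per string.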

See the R demo online:

library(stringr)
sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}',     'extract this one',     '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one")
str_extract( sstring, '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+')
# => [1] "extract this one"         "extract this one"        
#    [3] "extract this one"         "p (340) extract this one"
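If stringr is not available, the same pattern should also work in base R, since perl = TRUE enables lookarounds. The following is a minimal sketch rather than part of the original answer; the variable names pattern, m and result are chosen here for illustration.

```r
sstring <- c('{"de":"extract this one","en":"some text"}',
             'extract this one',
             '{"de":"extract this one","en":"some text"}',
             'p (340) extract this one')

pattern <- '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+'

# regexpr() locates the first match in each string; -1 means no match
m <- regexpr(pattern, sstring, perl = TRUE)

# Pre-fill with NA so strings without a match keep their position
result <- rep(NA_character_, length(sstring))
result[m != -1] <- regmatches(sstring, m)

print(result)
```

regmatches() drops elements without a match, so the NA pre-fill keeps the result aligned with the input, mirroring what str_extract() does.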
