How to cut a substring on a pattern with multiple occurrences?

Problem description

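After a thorough search on Google and SO, I could not find this specific problem among the many regex questions.

I have a string that I want to parse in order to replace some substrings.

However, my case is a bit more complex than a simple str_replace, so I need a structured version of the string.

For example, let's take value="There is __obj1__ and also __obj2__ in the house." and the pattern __.*?__

I would like to get something like c("There is ", "obj1", "and also", "obj2", "in the house") so that I can act on all the even indices.

Here is where I am so far. I am struggling with regex greediness: it matches either too much or not enough. The matrix return type is not really a problem, as I can unlist(x[[1]][-1])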

library(tidyverse)
value="There is __obj1__ and also __obj2__ in the house."
str_match_all(value, "(.*?)__(.*?)__(.*?)") # trailing (.*?) is lazy, so it matches nothing
#> [[1]]
#>      [,1]                 [,2]         [,3]   [,4]
#> [1,] "There is __obj1__"  "There is "  "obj1" ""  
#> [2,] " and also __obj2__" " and also " "obj2" ""
str_match_all(value, "(.*)__(.*?)__(.*?)") # leading (.*) is greedy and swallows the first placeholder
#> [[1]]
#>      [,1]                                  [,2]                          [,3]  
#> [1,] "There is __obj1__ and also __obj2__" "There is __obj1__ and also " "obj2"
#>      [,4]
#> [1,] ""
str_match_all(value, "(.*?)__(.*)__(.*?)") # middle (.*) is greedy and spans both placeholders
#> [[1]]
#>      [,1]                                  [,2]        [,3]                    
#> [1,] "There is __obj1__ and also __obj2__" "There is " "obj1__ and also __obj2"
#>      [,4]
#> [1,] ""
str_match_all(value, "(.*?)__(.*?)__(.*)") # trailing (.*) is greedy and swallows the second placeholder
#> [[1]]
#>      [,1]                                                [,2]        [,3]  
#> [1,] "There is __obj1__ and also __obj2__ in the house." "There is " "obj1"
#>      [,4]                              
#> [1,] " and also __obj2__ in the house."

Created on 2021-01-19 by the reprex package (v0.3.0)

Tags: r, regex, stringr

Solution


You can use

value <- "There is __obj1__ and also __obj2__ in the house."
library(stringr)
result <- stringr::str_match_all(value, "\\s*(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))")
result <- lapply(result, function(x) x[,-1])
result

Output:

[[1]]
     [,1]        [,2]   [,3]            
[1,] "There is " "obj1" " and also"     
[2,] ""          "obj2" " in the house."

The pattern is

\s*(.*?)__(.*?)__(.*?)(?=\s*(?:__|$))

Note that you can even use a possessive quantifier, \s*+ instead of \s*, to speed up matching.
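One way to check that the possessive form changes nothing about the result is to run both patterns on the sample string and compare the matches. The sketch below is only an illustration: it uses base R's PCRE engine (perl = TRUE) rather than stringr's ICU engine, on the assumption that both treat this pattern the same way, and the helper name all_matches is hypothetical.

```r
value <- "There is __obj1__ and also __obj2__ in the house."

# The pattern from the answer, and the same pattern with \s* made possessive.
plain      <- "\\s*(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))"
possessive <- "\\s*+(.*?)__(.*?)__(.*?)(?=\\s*+(?:__|$))"

# Hypothetical helper: return every full match of `pattern` in `x`
# using base R with PCRE (an assumption; the answer itself uses stringr).
all_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

m_plain      <- all_matches(plain, value)
m_possessive <- all_matches(possessive, value)

print(m_plain)
print(identical(m_plain, m_possessive))
```

Each full match stops just before the whitespace that precedes the next __, which is the effect of the lookahead at the end of the pattern.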

Details

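  • \s* - zero or more whitespace characters
  • (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
  • __ - a literal __ substring
  • (.*?) - Group 2: any zero or more characters other than line break characters, as few as possible
  • __ - a literal __ substring
  • (.*?) - Group 3: any zero or more characters other than line break characters, as few as possible
  • (?=\s*(?:__|$)) - a positive lookahead that requires zero or more whitespace characters followed by either __ or the end of the string immediately to the right of the current location.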
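The question asked for a flat vector in which every even index is a placeholder name. That shape can also be reached without capture groups, by splitting on the literal delimiter. This is a different technique from the regex above, shown only as a minimal sketch. It assumes the __ delimiters are always balanced and that placeholder names never contain __, and toupper is just a stand-in for whatever action is actually needed.

```r
value <- "There is __obj1__ and also __obj2__ in the house."

# Split on the literal delimiter; fixed = TRUE turns off regex interpretation.
parts <- strsplit(value, "__", fixed = TRUE)[[1]]

# Assuming balanced delimiters, even positions hold the placeholder names.
is_obj <- seq_along(parts) %% 2 == 0

# Stand-in action: upper-case each placeholder and leave the other text alone.
parts[is_obj] <- toupper(parts[is_obj])

# Nothing else was consumed, so the pieces paste back into a full string.
rebuilt <- paste(parts, collapse = "")

print(parts)
print(rebuilt)
```

Unlike the matrix returned by the regex, this keeps the whitespace around each placeholder, so the pieces can be pasted back together without losing spaces.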
