Regular expression in R to match strings in square brackets

Problem description

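I have transcripts of storytelling that contain many instances of overlapping speech, marked by square brackets around the overlapping speech. I want to extract these overlap instances. Here is a mock example: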

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")

This code works fine:

pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value = TRUE)
matches <- gregexpr(pattern, ovl) 
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap); overlap_clean
[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

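But on a larger file, a data frame, it does not work. Is this due to an error in the pattern or to the structure of the data frame? The first six rows of df look like this: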

> head(df)
                                                             Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2                                             June:\tI know he is.
3                                 Kar:\tblack welding glasses on, 
4                        \tand he turned round and he made me jump
5                                                 \t“O:h, Colin”, 
6                                  \tand then (                  )

Tags: r, regex

Solution


Although it may work in some cases, your pattern looks off to me. I think it should be this:

pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean

[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"


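This matches and captures the bracketed terms, using the Perl-style lazy dot (.*?) to ensure that the match stops at the first closing bracket.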
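To illustrate why the lazy quantifier matters, and how the pattern might be applied to a data frame, here is a minimal sketch. The rows below are made up for illustration; only the column name Story comes from the question. The question does not show how df was passed to gregexpr(), so treating that as the cause is an assumption: gregexpr() and regmatches() expect a character vector, so the sketch passes the column df$Story rather than the whole data frame.

```r
# Hypothetical data: made-up rows in a character column named Story
df <- data.frame(
  Story = c("well [yes right]", "oh [we::ll] i [do] know", "(3.2)"),
  stringsAsFactors = FALSE
)

lazy   <- "\\[.*?\\]"  # stops at the first closing bracket
greedy <- "\\[.*\\]"   # runs on to the last closing bracket in the line

# Pass the column (a character vector), not the data frame itself
lazy_hits   <- unlist(regmatches(df$Story, gregexpr(lazy, df$Story)))
greedy_hits <- unlist(regmatches(df$Story, gregexpr(greedy, df$Story)))

lazy_hits    # one element per bracketed overlap
greedy_hits  # two overlaps on the same line are merged into one match
```

With two overlaps on the same line, the lazy pattern returns "[we::ll]" and "[do]" separately, whereas the greedy pattern returns the single string "[we::ll] i [do]".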

