Regex and str_remove_all in R - only remove words if multiple conditions are met

Question

I am trying to remove all instances of a country name based on the following conditions:

  1. Country name not at beginning of string

  2. Country name does not follow 'of '

So if I take a fictional string: "Australia National Australia Bank of Australia"

I only want to remove the second instance of Australia (the one in the middle). The first is at the beginning of the string and the third follows 'of '.

I am using str_remove_all to apply a collapsed regex of country names to a vector of company names.

library(dplyr)
library(stringr)

country <- data.frame(name = c("Australia", "Singapore", "Malaysia")) %>%
  mutate(name_regex = paste0("((?<!^)\\b", name, "\\b", "|(?<!of\\s)\\b", name, "\\b)"))

country_remove <- str_c(country$name_regex, collapse = "|")

# x is the vector of company names
str_remove_all(x, regex(country_remove, ignore_case = TRUE))

The two patterns on their own:

(?<!^)\bAustralia\b     # select all instances not at beginning
(?<!of\s)\bAustralia\b  # select all instances not following 'of '

When I try and combine these together, it ends up just removing everything.

Thanks in advance!

Tags: r, regex

Solution


You should build the regex like this:

library(stringr)

country <- data.frame(name = c("Australia", "Singapore", "Malaysia"))
# Both lookbehinds are chained, so a match must satisfy both conditions
name_regex <- paste0("\\b(?<!of\\s)(?<!^)(?:", paste(country$name, collapse = "|"), ")\\b")
s <- "Australia National Australia Bank of Australia"
str_remove_all(s, regex(name_regex, ignore_case = TRUE))
## => [1] "Australia National  Bank of Australia"
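Note that the removal leaves a double space behind. The following is a minimal sketch, not part of the original answer, of applying the same pattern to a whole vector and tidying the whitespace with stringr's str_squish. The company names and the variable names `companies` and `cleaned` are made up for illustration.

```r
library(stringr)

countries <- c("Australia", "Singapore", "Malaysia")
pattern <- paste0("\\b(?<!of\\s)(?<!^)(?:", paste(countries, collapse = "|"), ")\\b")

# Hypothetical company names, for illustration only
companies <- c(
  "Australia National Australia Bank of Australia",
  "Bank of Singapore",
  "United Malaysia Holdings"
)

# Remove the qualifying country names, then collapse leftover runs of spaces
cleaned <- str_squish(str_remove_all(companies, regex(pattern, ignore_case = TRUE)))
cleaned
## => [1] "Australia National Bank of Australia" "Bank of Singapore"
## => [3] "United Holdings"
```

str_remove_all is vectorised over the input strings, so no loop is needed here.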

The generated pattern looks like this:

\b(?<!of\s)(?<!^)(?:Australia|Singapore|Malaysia)\b


Details

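  • \b - a word boundary
  • (?<!of\s) - a negative lookbehind: no "of" + whitespace is allowed immediately to the left of the current position
  • (?<!^) - a negative lookbehind: the current position must not be the start of the string
  • (?:Australia|Singapore|Malaysia) - a non-capturing group matching any one of the alternatives
  • \b - a word boundary.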
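
Why the attempt in the question removes everything: joining the two patterns with | means a match only has to satisfy one of the lookbehinds, whereas placing the lookbehinds side by side requires both. The sketch below illustrates the difference on the example string, using a single country for brevity. The variable names are illustrative and not from the original post.

```r
library(stringr)

s <- "Australia National Australia Bank of Australia"

# Alternation (OR): a match needs to pass only ONE of the two lookbehinds
either <- "(?<!^)\\bAustralia\\b|(?<!of\\s)\\bAustralia\\b"
# Chained lookbehinds (AND): a match must pass BOTH
both <- "\\b(?<!of\\s)(?<!^)Australia\\b"

removed_either <- str_remove_all(s, regex(either, ignore_case = TRUE))
removed_both <- str_remove_all(s, regex(both, ignore_case = TRUE))

removed_either
## => [1] " National  Bank of "
removed_both
## => [1] "Australia National  Bank of Australia"
```

With alternation, the first "Australia" is not preceded by "of " and the last one is not at the start of the string, so each passes one branch and all three are removed.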
