r - Regex to un-merge row entries

Problem description

I have a sample dataset:

df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
  capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
  stringsAsFactors = FALSE
)

                                country capital                                 url
1                         GermanyBerlin #Berlin                 /country/germany/01
2                    England (UK)London      NA              /country/england-uk/02
3                           SpainMadrid #Madrid                   /country/spain/03
4 United States of AmericaWashington DC      NA country/united-states-of-america/04
5                   HaitiPort-au-Prince      NA                    country/haiti/05
6                         country66city      NA                 country/country6/06

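The goal is to tidy this up so that the columns contain what their names suggest:

The first should contain only the country name.
The second should contain the capital city (without the # symbol).
The third should stay unchanged.

So my desired output is: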

                  country        capital                                 url
1                  Germany         Berlin                 /country/germany/01
2             England (UK)         London              /country/england-uk/02
3                    Spain         Madrid                   /country/spain/03
4 United States of America  Washington DC country/united-states-of-america/04
5                    Haiti Port-au-Prince                    country/haiti/05
6                 country6          6city                 country/country6/06

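For rows where the capital column has a non-NA entry, I have a piece of code that achieves this (see the bottom of the post).

I am therefore looking for a solution that recognizes that the pattern in the url column can be used to split the capital out of the country column.

This needs to take the following into account:

The URL text is all lowercase, while the country names in the country column are mixed case.
The URL replaces spaces with hyphens.
The URL drops special characters (e.g. the parentheses around UK).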
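The three observations above can be illustrated with a small sketch that turns a country name into the form used in the URL. The helper name `slugify` is hypothetical and not part of the original post; it simply applies the three rules in order, assuming names contain only ASCII letters, digits, spaces, and punctuation.

```r
# Hypothetical helper: convert a country name to the slug form seen in the URL.
slugify <- function(x) {
  x <- tolower(x)                  # URL text is all lowercase
  x <- gsub("[^a-z0-9 ]", "", x)   # special characters such as "(" and ")" are dropped
  gsub(" +", "-", trimws(x))       # spaces become hyphens
}

slugify("England (UK)")              # "england-uk"
slugify("United States of America")  # "united-states-of-america"
```

Going the other way (from slug back to the original name) is lossy, which is why the split needs pattern matching rather than a plain string comparison.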

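I would love to see how this can be achieved, presumably with regular expressions (although I am open to any option).

Partial solution for when the capital column is non-NA

If the capital column has a non-NA entry, the following code achieves what I want: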

library(dplyr)
library(stringr)

df %>% mutate( capital = str_replace(capital, "#", ""),      # drop the leading "#"
               country = str_replace(country, capital, "")   # remove the capital from country
              )

                                country capital                                 url
1                               Germany  Berlin                 /country/germany/01
2                    England (UK)London      NA              /country/england-uk/02
3                                 Spain  Madrid                   /country/spain/03
4 United States of AmericaWashington DC      NA country/united-states-of-america/04
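One thing to keep in mind with the partial solution is that `str_replace` treats the capital as a regular expression, so a capital containing regex metacharacters could match unexpectedly. A minimal base R sketch that treats the capital as a literal suffix instead is shown below. It assumes real `NA` values in `capital` and uses a shortened, illustrative subset of the data.

```r
# Illustrative subset of the data, with real NA values in capital (an assumption).
country <- c("GermanyBerlin", "England (UK)London", "SpainMadrid")
capital <- c("#Berlin", NA, "#Madrid")

# Drop the leading "#" as a literal character.
capital_clean <- sub("#", "", capital, fixed = TRUE)

# Only touch rows where a capital is known and really is a suffix of country.
has_capital <- !is.na(capital_clean) & endsWith(country, capital_clean)

country_clean <- country
country_clean[has_capital] <- substring(
  country[has_capital], 1,
  nchar(country[has_capital]) - nchar(capital_clean[has_capital])
)

country_clean
```

Rows without a known capital are left untouched, which matches the behaviour of the partial solution above.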

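Tags: r, regex, split, pattern-matching

You could start with something like this and keep refining it until you get (100%) correct results, then see whether any steps can be skipped or merged.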

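library(magrittr)

# Rebuild a country name from the URL: hyphens to spaces, keep the path
# segment after "country/", then upper-case the first letter of each word.
df$country2 <- df$url %>%
  gsub("-", " ", .) %>%
  gsub(".+try/(.+)/.+", "\\1", .) %>%
  gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)

# Strip parentheses and repeated spaces from the merged column, then remove
# any of the rebuilt country names (case-insensitively) to leave the capital.
df$capital <- df$country %>%
  gsub("[()]", " ", .) %>%
  gsub(" +", " ", .) %>%
  gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)

# Replace the merged column with the rebuilt names and drop the helper column.
df$country <- df$country2
df$country2 <- NULL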

df
                   country        capital                                 url
1                  Germany         Berlin                 /country/germany/01
2               England Uk         London              /country/england-uk/02
3                    Spain         Madrid                   /country/spain/03
4 United States Of America  Washington DC country/united-states-of-america/04
5                    Haiti Port-au-Prince                    country/haiti/05
6                 Country6          6city                 country/country6/0
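The output above rebuilds the country from the URL, so it loses the original casing and the parentheses ("England Uk", "United States Of America"). One possible refinement, offered only as a sketch, is to turn each URL slug into a row-specific pattern and use it to locate the country at the start of the merged string, so the original spelling is kept. The variable names `slug` and `pattern` are introduced here for illustration, and the approach assumes every URL has the form `country/<slug>/<id>`.

```r
df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid",
              "United States of AmericaWashington DC", "HaitiPort-au-Prince",
              "country66city"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03",
          "country/united-states-of-america/04", "country/haiti/05",
          "country/country6/06"),
  stringsAsFactors = FALSE
)

# Extract the slug, i.e. the path segment right after "country/".
slug <- sub("^/?country/([^/]+)/.*$", "\\1", df$url)

# Each hyphen may correspond to any run of non-alphanumeric characters
# (spaces, brackets) in the original name; allow the same after the last word.
pattern <- paste0("^", gsub("-", "[^[:alnum:]]*", slug, fixed = TRUE),
                  "[^[:alnum:]]*")

# regexpr() is not vectorised over patterns, so match row by row.
country <- mapply(function(p, x) {
  m <- regexpr(p, x, ignore.case = TRUE)
  if (m == -1) NA_character_ else regmatches(x, m)
}, pattern, df$country, USE.NAMES = FALSE)

# Whatever follows the matched country is taken to be the capital.
df$capital <- trimws(substring(df$country, nchar(country) + 1))
df$country <- trimws(country)

df[, c("country", "capital", "url")]
```

On the sample data this keeps "England (UK)" and "United States of America" as written in the original column. Rows whose slug cannot be matched end up as `NA` rather than being split incorrectly.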
