Regex to unmerge row entries

Question

I have a sample dataset:

df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
  capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
  stringsAsFactors = FALSE
)

                                country capital                                 url
1                         GermanyBerlin #Berlin                 /country/germany/01
2                    England (UK)London      NA              /country/england-uk/02
3                           SpainMadrid #Madrid                   /country/spain/03
4 United States of AmericaWashington DC      NA country/united-states-of-america/04
5                   HaitiPort-au-Prince      NA                    country/haiti/05
6                         country66city      NA                 country/country6/06

The goal is to tidy this up so that each column contains what its name suggests.

So my desired output is:

                  country        capital                                 url
1                  Germany         Berlin                 /country/germany/01
2             England (UK)         London              /country/england-uk/02
3                    Spain         Madrid                   /country/spain/03
4 United States of America  Washington DC country/united-states-of-america/04
5                    Haiti Port-au-Prince                    country/haiti/05
6                 country6          6city                 country/country6/06

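For rows where the capital column has a non-NA entry, I already have code that achieves this (see the bottom of the post).

So I am looking for a solution that recognizes that the pattern in the url column can be used to split the capital out of the country column.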

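This needs to take into account the fact that the url does not match the country text exactly: in the sample data it is lower-case, uses hyphens in place of spaces, and drops the parentheses.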

I would love to see how this could be achieved, presumably with a regular expression (though I am open to any option).


Partial solution when the capital column is non-NA

Where the capital column has a non-NA entry, the following code achieves my goal:

library(dplyr)
library(stringr)

df %>% mutate(capital = str_replace(capital, "#", ""),
              country = str_replace(country, capital, ""))

                                country capital                                 url
1                               Germany  Berlin                 /country/germany/01
2                    England (UK)London      NA              /country/england-uk/02
3                                 Spain  Madrid                   /country/spain/03
4 United States of AmericaWashington DC      NA country/united-states-of-america/04

Tags: r, regex, split, pattern-matching

Answer

You could start with something like this and keep refining it until the result is 100% correct, then see whether any steps can be skipped or merged.

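library(magrittr)

# Rebuild a clean country name from the url
df$country2 <- df$url %>%
  gsub("-", " ", .) %>%                          # hyphens -> spaces
  gsub(".+try/(.+)/.+", "\\1", .) %>%            # keep the part between "country/" and the last "/"
  gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)   # capitalise the first letter of each word

# Whatever remains after removing the country names from the merged column is the capital
df$capital <- df$country %>%
  gsub("[()]", " ", .) %>%                       # parentheses -> spaces
  gsub(" +", " ", .) %>%                         # collapse repeated spaces
  gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE) %>%
  trimws()                                       # drop the space left in front of "London"

df$country <- df$country2
df$country2 <- NULL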

df
                   country        capital                                 url
1                  Germany         Berlin                 /country/germany/01
2               England Uk         London              /country/england-uk/02
3                    Spain         Madrid                   /country/spain/03
4 United States Of America  Washington DC country/united-states-of-america/04
5                    Haiti Port-au-Prince                    country/haiti/05
6                 Country6          6city                 country/country6/06
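
The output above rebuilds the country name from the url, so the original spelling is lost ("England Uk" rather than "England (UK)", "Country6" rather than "country6"). One possible refinement, offered as a sketch rather than a definitive fix, is to use the url slug only to measure how much of the merged string is the country and then cut the original text at that point. The helper name split_by_slug is hypothetical, and the sketch assumes every url has the form country/<slug>/<id> and that slugs contain only letters, digits and hyphens.

```r
# Sample data from the question (only the two columns needed here)
df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid",
              "United States of AmericaWashington DC", "HaitiPort-au-Prince",
              "country66city"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03",
          "country/united-states-of-america/04", "country/haiti/05",
          "country/country6/06"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the merged "country" text using the url slug.
# Assumes urls look like "country/<slug>/<id>" with an optional leading "/".
split_by_slug <- function(country, url) {
  # "england-uk" from "/country/england-uk/02"
  slug <- sub("^/?country/([^/]+)/.*$", "\\1", url)

  # Each hyphen may stand for any run of non-alphanumeric characters,
  # e.g. "england-uk" must also match "England (UK)".
  pattern <- paste0("^", gsub("-", "[^[:alnum:]]+", slug, fixed = TRUE),
                    "[^[:alnum:]]*")

  # Length of the country prefix in each row (NA when the slug does not match)
  len <- mapply(function(p, x) {
    m <- regexpr(p, x, ignore.case = TRUE)
    if (m == -1) NA_integer_ else attr(m, "match.length")
  }, pattern, country, USE.NAMES = FALSE)

  data.frame(
    country = trimws(substr(country, 1, len)),
    capital = trimws(substr(country, len + 1, nchar(country))),
    stringsAsFactors = FALSE
  )
}

res <- split_by_slug(df$country, df$url)
cbind(res, url = df$url)
```

On the sample data this keeps the original spelling of each country ("England (UK)", "United States of America", "country6") and leaves the remainder as the capital. Rows whose slug cannot be found at the start of the country text come back as NA, which makes them easy to spot and handle separately.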
