Handling spaces when subsetting by excluding a list of strings

Problem description

I have a data frame that looks like this:

Author ID     Country Year
A      12345  US      2011
B      13254  Germany 2018
C      54952  Belgium 2005
D      58774  UK      2009
E      88569  Lebanon 2015
...

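I want to exclude all countries that belong to the EU, plus the US. However, I am running into problems with countries whose names contain a space, such as Czech Republic and United Kingdom.

So far I have tried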

non_other_countries <- c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States")
other_post_2011 <- other_post_2011_with_id[, setdiff(names(other_post_2011_with_id), non_other_countries)]

other_post_2011 <- subset(other_post_2011_with_id, ! Country %in% c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States", "USA"))

However, neither attempt excludes the countries whose names contain a space.

For now I have come up with a (imo) very ugly workaround: replacing every "Czech Republic" with "Czechia" and every "United Kingdom" with "UK".

other_post_2011_with_id$Country[other_post_2011_with_id$Country == "Czech Republic"] <- "Czechia"
other_post_2011_with_id$Country[other_post_2011_with_id$Country == "United Kingdom"] <- "UK"

But I keep wondering whether there is a more elegant, more general solution. Many thanks!

Tags: r, subset

Solution


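Because the data you provided is incomplete, it is not possible to tell exactly what went wrong with your code, but try the following approach. A space inside a string is not a problem for %in%, which compares whole values.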

head(dat)
#   a id        country year
# 1 a  1 United Kingdom 2006
# 2 b  5  Bouvet Island 2010
# 3 c  8        Hungary 2010
# 4 d 10 Czech Republic 2004
# 5 e 12  Bouvet Island 2001
# 6 f 19 United Kingdom 2004

excl <- c("Czech Republic", "Hungary", "United Kingdom", "Cyprus", 
          "United States")

dat[!dat$country %in% excl, ]
#    a id       country year
# 2  b  5 Bouvet Island 2010
# 5  e 12 Bouvet Island 2001
# 7  g 20      Dominica 2004
# 9  i 32       Namibia 2000
# 10 j 34 Bouvet Island 2011
# 11 k 35 Bouvet Island 2001
# 12 l 52 Bouvet Island 2010
# 13 m 54      Dominica 2005
# 14 n 56       Namibia 2000
# 17 q 77 Bouvet Island 2001
# 18 r 79         Qatar 2011
# 19 s 82 Bouvet Island 2002
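
If rows still slip through, one possible cause (a guess based on the sample in the question, which shows "UK" and "US" rather than "United Kingdom" and "United States") is that the values in the column are spelled differently from the entries in the exclusion list, or carry stray leading or trailing whitespace. The following is a minimal sketch under that assumption. The data frame df, the alias table aliases and the padded " Czech Republic " value are made up for illustration and do not come from the original data.

```r
# Hypothetical sample mirroring the question: note "UK"/"US" and a padded value.
df <- data.frame(
  Author  = c("A", "B", "C", "D", "E", "F"),
  Country = c("US", "Germany", "Belgium", "UK", "Lebanon", " Czech Republic "),
  stringsAsFactors = FALSE
)

excl <- c("Belgium", "Germany", "Czech Republic",
          "United Kingdom", "United States")

# Hypothetical alias table: alternative spellings -> names used in excl.
aliases <- c(UK = "United Kingdom", US = "United States", USA = "United States")

# Normalise: strip leading/trailing whitespace, then translate known aliases.
country <- trimws(as.character(df$Country))
is_alias <- country %in% names(aliases)
country[is_alias] <- unname(aliases[country[is_alias]])

# Diagnostic: values that are not covered by the exclusion list.
unmatched <- setdiff(unique(country), excl)
unmatched

# Keep only the rows whose normalised country is not excluded.
kept <- df[!country %in% excl, ]
kept
```

Inspecting unmatched before subsetting shows which spellings in the data are not covered by excl, so the alias table can be extended instead of overwriting values in the data frame one by one.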

Data

dat <- structure(list(a = structure(1:20, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", 
"q", "r", "s", "t"), class = "factor"), id = c(1L, 5L, 8L, 10L, 
12L, 19L, 20L, 31L, 32L, 34L, 35L, 52L, 54L, 56L, 61L, 67L, 77L, 
79L, 82L, 90L), country = structure(c(8L, 1L, 5L, 3L, 1L, 8L, 
4L, 2L, 6L, 1L, 1L, 1L, 4L, 6L, 5L, 2L, 1L, 7L, 1L, 3L), .Label = c("Bouvet Island", 
"Cyprus", "Czech Republic", "Dominica", "Hungary", "Namibia", 
"Qatar", "United Kingdom"), class = "factor"), year = c(2006L, 
2010L, 2010L, 2004L, 2001L, 2004L, 2004L, 2009L, 2000L, 2011L, 
2001L, 2010L, 2005L, 2000L, 2001L, 2006L, 2001L, 2011L, 2002L, 
2003L)), class = "data.frame", row.names = c(NA, -20L))
