R: Group character values and conditionally keep only one value from the vector

Problem description

For example, I have the following dataset (my real dataset has more than 100000 rows and 70 variables):

Country   Year   Flag
Norway    2018   drop: reason1
Norway    2018   drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2
France    2011   drop: reason3
France    2011   drop: reason4

First, I want to group the Flag values by the variables Country and Year, so that I get a table like this:

Country   Year   Flag
Norway    2018   drop: reason1, drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2, drop: reason3, drop: reason4

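Second, if the Flag column holds more than one value, I want to keep only one, using the following logic: if drop: reason1 is present, keep it and discard the rest. If there is no drop: reason1 but there are a drop: reason2 and a drop: reason3, keep only drop: reason2.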

In the end, my dataset should look like this:

Country   Year   Flag
Norway    2018   drop: reason1
Sweden    2016   drop: reason3
France    2011   drop: reason2

I would like to do this with data.table or base R.

I would appreciate any help, at least with the first part of the question!

Tags: r, data.table, character, aggregate, grouping

Solution


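We can order the data by Country and Flag, then select the first Flag value for each Country and Year. This works because the flag labels sort alphabetically in the desired priority order (reason1 before reason2 before reason3).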

This can be done in base R:

aggregate(Flag~Country+Year, df[with(df, order(Country, Flag)), ], head, 1)

#  Country Year         Flag
#1  France 2011 drop:reason2
#2  Sweden 2016 drop:reason3
#3  Norway 2018 drop:reason1
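
For the first part of the question (one row per Country and Year with all flags joined together), the same aggregate call can be reused with paste instead of head. This is only a minimal sketch: the data frame below is rebuilt by hand from the question's example with Flag as a character column, and the name collapsed is just illustrative.

```r
# Example data rebuilt from the question (assumption: Flag stored as character)
df <- data.frame(
  Country = c("Norway", "Norway", "Sweden", "France", "France", "France"),
  Year    = c(2018L, 2018L, 2016L, 2011L, 2011L, 2011L),
  Flag    = c("drop:reason1", "drop:reason2", "drop:reason3",
              "drop:reason2", "drop:reason3", "drop:reason4"),
  stringsAsFactors = FALSE
)

# Join all Flag values within each Country/Year group into a single string;
# `collapse` is forwarded by aggregate() to paste()
collapsed <- aggregate(Flag ~ Country + Year, df, paste, collapse = ", ")
collapsed
```

Within each group the flags appear in their original row order, so sort df first if a particular order is needed.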

Or with dplyr:

library(dplyr)

df %>%
  arrange(Country, Flag) %>%
  group_by(Country, Year) %>%
  summarise(Flag = first(Flag))

Or with data.table:

library(data.table)
setDT(df)
df[order(Country, Flag), .(Flag = first(Flag)), .(Country, Year)]
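
All three variants above rely on the flag labels sorting alphabetically in priority order. If the real labels do not, one option is to spell the priority out explicitly. The following base R sketch assumes a hypothetical priority vector (the question only fixes the order of reason1, reason2 and reason3; the position of reason4 is a guess) and rebuilds the example data by hand.

```r
# Example data rebuilt from the question (assumption: Flag stored as character)
df <- data.frame(
  Country = c("Norway", "Norway", "Sweden", "France", "France", "France"),
  Year    = c(2018L, 2018L, 2016L, 2011L, 2011L, 2011L),
  Flag    = c("drop:reason1", "drop:reason2", "drop:reason3",
              "drop:reason2", "drop:reason3", "drop:reason4"),
  stringsAsFactors = FALSE
)

# Hypothetical priority order: earlier entries win over later ones
priority <- c("drop:reason1", "drop:reason2", "drop:reason3", "drop:reason4")

# Position of each flag in `priority`; flags not listed get NA,
# which order() places last by default
flag_rank <- match(df$Flag, priority)

# Sort so the highest-priority flag comes first within each Country/Year group
sorted <- df[order(df$Country, df$Year, flag_rank), ]

# Keep only the first row of every Country/Year group
result <- sorted[!duplicated(sorted[c("Country", "Year")]), ]
result
```

Because whole rows are kept rather than aggregated, any additional columns of the real dataset stay attached to the selected flag.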

Data

df <- structure(list(Country = structure(c(2L, 2L, 3L, 1L, 1L, 1L),
.Label = c("France","Norway", "Sweden"), class = "factor"), Year = c(2018L, 2018L, 
2016L, 2011L, 2011L, 2011L), Flag = structure(c(1L, 2L, 3L, 2L, 
3L, 4L), .Label = c("drop:reason1", "drop:reason2", "drop:reason3", 
"drop:reason4"), class = "factor")), class = "data.frame", row.names = c(NA, -6L))
