Conditionally change values in some rows using existing factor levels, possibly in dplyr

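Problem description

I am building a dataset.

I want to start it off with one column as a factor variable that already contains the full set of levels the variable can take. I then want to edit this column step by step as I apply the rules that build the dataset.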

A toy example dataset:

library(dplyr)

mammals <- tibble(animal_name = c("Inapplicable",
                                  "Don't know",
                                  "Cat",
                                  "Dog",
                                  "Shark",
                                  "Wolf",
                                  "Pig"),
                  match_status = factor("not matched yet",
                                        levels = c("matched",
                                                   "not matched yet",
                                                   "unmatchable")),
                  match_reason = NA_character_)

table(mammals$match_status)

Now I try to start applying some conditions to change the value of the match_status variable. This does not work:

mammals <- mammals %>%
  mutate(
    match_status = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "unmatchable",
      animal_name == "Shark"                           ~ "unmatchable",
      animal_name %in% c("Dog", "Wolf")                ~ "matched",
      TRUE                                             ~ match_status
    ),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark"                           ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf")                ~ "In list of canines",
      TRUE                                             ~ match_reason
    )
  )

I tried wrapping as.factor() around the case_when(), but that did not run either.

If I comment out the first half of the mutate() (the match_status = part) and leave only the match_reason = part, it works.

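I suppose I could convert the factor to character values, apply the conditional changes I want, and then convert the variable back to a factor as a separate stage, but I have avoided this because it seems more brittle. The reason I set the factor levels in advance is to restrict the variable's legal values.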
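A minimal sketch of what presetting the levels protects against, assuming a standalone vector `status` and a deliberate typo `"mathced"` (both hypothetical, introduced only for illustration). Direct assignment to a factor warns about an undeclared value, while a character round trip turns the same typo into NA without any warning:

```r
status_levels <- c("matched", "not matched yet", "unmatchable")
status <- factor(rep("not matched yet", 3), levels = status_levels)

# Assigning a declared level works and keeps the vector a factor.
status[1] <- "matched"

# Assigning an undeclared value stores NA and warns
# ("invalid factor level, NA generated").
status[2] <- "mathced"

# A character round trip does not warn: the typo silently becomes NA
# when the vector is converted back to a factor.
as_text <- as.character(status)
as_text[3] <- "mathced"
round_trip <- factor(as_text, levels = status_levels)
```

In both cases the bad value is kept out of the column, but only the direct assignment draws attention to it.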

This works, using the base function replace(), but it needs more repeated code and does not feel like a very natural way of doing it in dplyr:

mammals <- mammals %>%
  mutate(
    match_status = replace(match_status, 
                           animal_name %in% c("Inapplicable", "Don't know"),
                           "unmatchable"),
    match_status = replace(match_status, 
                           animal_name == "Shark",
                           "unmatchable"),
    match_status = replace(match_status, 
                           animal_name %in% c("Dog", "Wolf"),
                           "matched"),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark"                           ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf")                ~ "In list of canines",
      TRUE                                             ~ match_reason
    )
  )

Bonus points for an approach that lets me update both variables at once from the same criterion (for example, in pseudocode: if animal_name == "Shark" then set match_status = "unmatchable" and set match_reason = "Not a mammal").

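I have been trying to find an idiomatic way to do this in dplyr, but I am also open to a cleaner approach in base R. I would probably prefer something that works in a magrittr pipe, but even that is not a deal-breaker.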

Tags: r, dplyr

Solution


For your original code

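It works if you wrap "unmatchable" (and each of the other replacement values) in factor() and specify the appropriate levels, so that every right-hand side of case_when() is a factor with the same levels as match_status.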

mammals <- mammals %>%
  mutate(
    match_status = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ factor("unmatchable", levels = c("matched",
                                                                                          "not matched yet",
                                                                                          "unmatchable")),
      animal_name == "Shark"                           ~ factor("unmatchable", levels = c("matched",
                                                                                          "not matched yet",
                                                                                          "unmatchable")),
      animal_name %in% c("Dog", "Wolf")                ~ factor("matched", levels = c("matched",
                                                                                          "not matched yet",
                                                                                          "unmatchable")),
      TRUE                                             ~ match_status
    ),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark"                           ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf")                ~ "In list of canines",
      TRUE                                             ~ match_reason
    )
  )
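The repeated levels vector above can be factored out. The following is a sketch rather than part of the original answer; `status_levels` and `as_status()` are hypothetical names introduced here, and the toy data is rebuilt so the block runs on its own:

```r
library(dplyr)

# Declare the legal values once.
status_levels <- c("matched", "not matched yet", "unmatchable")

# Hypothetical helper: build a factor that always carries the full level set.
as_status <- function(x) factor(x, levels = status_levels)

mammals <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Cat", "Dog",
                   "Shark", "Wolf", "Pig"),
  match_status = as_status("not matched yet"),
  match_reason = NA_character_
)

mammals <- mammals %>%
  mutate(
    match_status = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ as_status("unmatchable"),
      animal_name == "Shark"                           ~ as_status("unmatchable"),
      animal_name %in% c("Dog", "Wolf")                ~ as_status("matched"),
      TRUE                                             ~ match_status
    )
  )
```

Because every right-hand side goes through `as_status()`, a misspelled status would show up as NA in the result instead of adding a new level.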

Alternative

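Encode the match status and the match reason in a single string, then split them apart afterwards. Note that rows without a match end up with an empty string, not NA, in match_reason.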

library(tidyr)  # provides separate()

mammals <- mammals %>%
  # Encode status and reason together as "status/reason"
  mutate(match_info = case_when(animal_name == "Shark" ~ "unmatchable/Not a mammal",
                                animal_name %in% c("Inapplicable", "Don't know") ~ "unmatchable/No animal specified",
                                animal_name %in% c("Dog", "Wolf") ~ "matched/In list of canines",
                                TRUE ~ "not matched yet/")) %>%
  # Split the combined string back into the two columns
  separate(match_info, into = c("match_status", "match_reason"), sep = "/") %>%
  # Restore the factor with its full set of levels
  mutate(match_status = factor(match_status, levels = c("matched",
                                                        "not matched yet",
                                                        "unmatchable")))

# A tibble: 7 x 3
  animal_name  match_status    match_reason         
  <chr>        <fct>           <chr>                
1 Inapplicable unmatchable     "No animal specified"
2 Don't know   unmatchable     "No animal specified"
3 Cat          not matched yet ""                   
4 Dog          matched         "In list of canines" 
5 Shark        unmatchable     "Not a mammal"       
6 Wolf         matched         "In list of canines" 
7 Pig          not matched yet ""     
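Another possible approach, offered as a sketch and not part of the original answer, is a small pipe-friendly helper that sets both columns for the rows matching one condition. `set_match()` and `status_levels` are hypothetical names; the helper assumes the columns are called match_status and match_reason and evaluates the condition with rlang, which is installed alongside dplyr:

```r
library(dplyr)

status_levels <- c("matched", "not matched yet", "unmatchable")

mammals <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Cat", "Dog",
                   "Shark", "Wolf", "Pig"),
  match_status = factor("not matched yet", levels = status_levels),
  match_reason = NA_character_
)

# Hypothetical helper: set status and reason together for matching rows.
set_match <- function(data, condition, status, reason) {
  # Fail early if the status is not one of the declared levels.
  stopifnot(status %in% levels(data$match_status))
  # Evaluate the condition against the columns of `data`.
  hit <- rlang::eval_tidy(rlang::enquo(condition), data)
  hit <- !is.na(hit) & hit  # treat NA conditions as "no match"
  data$match_status[hit] <- status
  data$match_reason[hit] <- reason
  data
}

mammals <- mammals %>%
  set_match(animal_name %in% c("Inapplicable", "Don't know"),
            "unmatchable", "No animal specified") %>%
  set_match(animal_name == "Shark", "unmatchable", "Not a mammal") %>%
  set_match(animal_name %in% c("Dog", "Wolf"), "matched", "In list of canines")
```

Unlike case_when(), where the first matching condition wins, rules applied this way run in sequence, so a later rule overwrites an earlier one when both match the same row. Unmatched rows keep NA in match_reason.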
