r - Conditionally change values in some rows using existing factor levels, possibly in dplyr
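Question
I am building up a dataset.
I want to start it off with one column as a factor variable that already contains the set of levels the variable is allowed to take. I then want to edit this column step by step as I apply the rules that build the dataset.
A toy example dataset: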
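library(dplyr)  # provides tibble(), mutate(), case_when() and %>%

mammals <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Cat", "Dog",
                   "Shark", "Wolf", "Pig"),
  # A single value is recycled to every row; the levels fix the legal values
  match_status = factor("not matched yet",
                        levels = c("matched", "not matched yet", "unmatchable")),
  match_reason = NA_character_
)
table(mammals$match_status)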
Now I try to start applying some conditions to change the value of the match_status variable. This does not work:
mammals <- mammals %>%
  mutate(
    match_status = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "unmatchable",
      animal_name == "Shark" ~ "unmatchable",
      animal_name %in% c("Dog", "Wolf") ~ "matched",
      TRUE ~ match_status
    ),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark" ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf") ~ "In list of canines",
      TRUE ~ match_reason
    )
  )
I tried wrapping as.factor() around the case_when(), but that did not run either.
If I comment out the first half of the mutate() (the match_status = part), leaving only the match_reason = part, then it works.
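I suppose I could convert the factor to character values, apply the conditional changes I want to that variable, and then convert it back to a factor as a separate stage, but I have shied away from that because it seems more fragile. The reason I set up the factor levels in advance is to restrict the legal values of the variable.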
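For reference, the character round trip described above might look like the sketch below. The names `status_levels`, `status_chr` and `updated` are made up for illustration, and the `stopifnot()` guard is one assumed way to address the fragility concern: `factor()` silently turns any value outside its levels into `NA`, so the check runs before the conversion back.

```r
library(dplyr)

# Hypothetical name: the allowed levels, defined once
status_levels <- c("matched", "not matched yet", "unmatchable")

mammals <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Cat", "Dog",
                   "Shark", "Wolf", "Pig"),
  match_status = factor("not matched yet", levels = status_levels)
)

# Step 1: do the conditional edits on a plain character column
updated <- mammals %>%
  mutate(
    status_chr = case_when(
      animal_name %in% c("Inapplicable", "Don't know", "Shark") ~ "unmatchable",
      animal_name %in% c("Dog", "Wolf") ~ "matched",
      TRUE ~ as.character(match_status)
    )
  )

# Step 2: fail loudly if a rule produced a value outside the allowed levels
stopifnot(all(updated$status_chr %in% status_levels))

# Step 3: convert back to a factor with the original levels
updated <- updated %>%
  mutate(match_status = factor(status_chr, levels = status_levels)) %>%
  select(-status_chr)
```

The cost is the extra temporary column and the separate validation step, which is the overhead the question is trying to avoid.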
The following works, using the base function replace(), but it needs more repeated code and does not feel like a very natural way of doing things in dplyr:
mammals <- mammals %>%
  mutate(
    match_status = replace(match_status,
                           animal_name %in% c("Inapplicable", "Don't know"),
                           "unmatchable"),
    match_status = replace(match_status,
                           animal_name == "Shark",
                           "unmatchable"),
    match_status = replace(match_status,
                           animal_name %in% c("Dog", "Wolf"),
                           "matched"),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark" ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf") ~ "In list of canines",
      TRUE ~ match_reason
    )
  )
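Bonus marks for an approach that lets me update both variables at once based on the same criteria (e.g., in pseudocode: if animal_name == "Shark" then set match_status = "unmatchable" and set match_reason = "Not a mammal").
I have been trying to find an idiomatic way of doing this in dplyr, but I am also open to cleaner approaches using base R. I would probably prefer something that works within a magrittr pipe, but even that is not a deal breaker.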
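Answer
For your original code
It works if you wrap "unmatchable" (and the other replacement values) in factor() with the appropriate levels, so that every branch of case_when() returns a factor of the same type as match_status rather than a mix of character and factor values.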
mammals <- mammals %>%
  mutate(
    match_status = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~
        factor("unmatchable", levels = c("matched", "not matched yet", "unmatchable")),
      animal_name == "Shark" ~
        factor("unmatchable", levels = c("matched", "not matched yet", "unmatchable")),
      animal_name %in% c("Dog", "Wolf") ~
        factor("matched", levels = c("matched", "not matched yet", "unmatchable")),
      TRUE ~ match_status
    ),
    match_reason = case_when(
      animal_name %in% c("Inapplicable", "Don't know") ~ "No animal specified",
      animal_name == "Shark" ~ "Not a mammal",
      animal_name %in% c("Dog", "Wolf") ~ "In list of canines",
      TRUE ~ match_reason
    )
  )
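The repeated levels vector in the code above could be factored out. The sketch below is one possible way to do that, not part of the original answer: `status_levels` and `as_status()` are hypothetical names introduced here, and the sketch covers only the match_status column.

```r
library(dplyr)

# Hypothetical names: the allowed levels and a small constructor around factor()
status_levels <- c("matched", "not matched yet", "unmatchable")
as_status <- function(x) factor(x, levels = status_levels)

mammals <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Cat", "Dog",
                   "Shark", "Wolf", "Pig"),
  match_status = as_status("not matched yet")
)

mammals <- mammals %>%
  mutate(
    match_status = case_when(
      # Every branch returns a factor with the same levels as match_status
      animal_name %in% c("Inapplicable", "Don't know") ~ as_status("unmatchable"),
      animal_name == "Shark" ~ as_status("unmatchable"),
      animal_name %in% c("Dog", "Wolf") ~ as_status("matched"),
      TRUE ~ match_status
    )
  )
```

Because the levels live in one place, a misspelled status passed to `as_status()` becomes `NA` instead of silently adding a new level, which keeps the restriction on legal values that the question asks for.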
Alternative
Encode the match status and the match reason in a single string, then split them apart afterwards.
library(tidyr)  # for separate()

mammals <- mammals %>%
  mutate(match_info = case_when(
    animal_name == "Shark" ~ "unmatchable/Not a mammal",
    animal_name %in% c("Inapplicable", "Don't know") ~ "unmatchable/No animal specified",
    animal_name %in% c("Dog", "Wolf") ~ "matched/In list of canines",
    TRUE ~ "not matched yet/"
  )) %>%
  # Split "status/reason" into the two target columns
  separate(match_info, into = c("match_status", "match_reason"), sep = "/") %>%
  # Restore the restricted set of levels
  mutate(match_status = factor(match_status,
                               levels = c("matched", "not matched yet", "unmatchable")))
# A tibble: 7 x 3
  animal_name  match_status    match_reason
  <chr>        <fct>           <chr>
1 Inapplicable unmatchable     "No animal specified"
2 Don't know   unmatchable     "No animal specified"
3 Cat          not matched yet ""
4 Dog          matched         "In list of canines"
5 Shark        unmatchable     "Not a mammal"
6 Wolf         matched         "In list of canines"
7 Pig          not matched yet ""
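Another way to set both columns from the same criteria, offered here only as a sketch and not as part of the answer above, is to keep the rules in a small lookup table and join it on. The table name `rules` and the vector `status_levels` are hypothetical, and the sketch assumes the data starts with just the animal_name column and that every rule is an exact match on animal_name.

```r
library(dplyr)

status_levels <- c("matched", "not matched yet", "unmatchable")

mammals <- tibble(
  animal_name = c("Inapplicable", "Don't know", "Cat", "Dog",
                  "Shark", "Wolf", "Pig")
)

# Hypothetical lookup table: one row per rule, carrying both outcomes together
rules <- tibble(
  animal_name  = c("Inapplicable", "Don't know", "Shark", "Dog", "Wolf"),
  match_status = c("unmatchable", "unmatchable", "unmatchable",
                   "matched", "matched"),
  match_reason = c("No animal specified", "No animal specified", "Not a mammal",
                   "In list of canines", "In list of canines")
)

result <- mammals %>%
  left_join(rules, by = "animal_name") %>%
  mutate(
    # Rows with no matching rule come back as NA; give them the default status
    match_status = factor(coalesce(match_status, "not matched yet"),
                          levels = status_levels)
  )
```

Unlike the string-splitting approach, unmatched rows end up with `NA` in match_reason rather than an empty string, which matches the `NA_character_` the question starts from.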