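r - Using %in% in an ifelse statement?
Question
I know there are already plenty of questions about ifelse statements on the forum, but I can't seem to find an answer to my specific query.
I want to use ifelse to generate a new column in a data frame based on either of two conditions. Basically, if "hypertension" is in the heart_conditions column, or if bp_med = 1, I want hypertension to be 1. As you can see below, the hypertension column is currently set to 1 for every row. Is there a problem with using the %in% operator inside an ifelse statement, or am I going wrong somewhere else?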
   heart_conditions               high_chol_tabs      bp_med              hypertension
1  hypertension high_cholesterol  2 [no]              2 [no]              1
2  none                           4 [not applicable]  4 [not applicable]  1
3  hypertension high_cholesterol  1 [yes]             1 [yes]             1
4  heart_attack angina            4 [not applicable]  4 [not applicable]  1
5  high_cholesterol               2 [no]              4 [not applicable]  1
6  hypertension high_cholesterol  1 [yes]             1 [yes]             1
7  none                           4 [not applicable]  4 [not applicable]  1
8  none                           4 [not applicable]  4 [not applicable]  1
9  high_cholesterol               2 [no]              4 [not applicable]  1
10 hypertension high_cholesterol  1 [yes]             1 [yes]             1
hypertension.df$hypertension <- ifelse(("hypertension" %in% heart_conditions)|(bp_med == 1), 1, 2)
Solution
You have it the wrong way around. What you want is heart_conditions %in% "hypertension" (or heart_conditions == "hypertension")!
Or, as a full answer:
hypertension.df$hypertension <- ifelse(heart_conditions == "hypertension" | bp_med == 1, 1, 2)
# or using %in%
selection <- "hypertension"
hypertension.df$hypertension <- ifelse(heart_conditions %in% selection | bp_med == 1, 1, 2)
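Longer explanation
%in% checks, for each element of the left-hand side, whether it is present in the right-hand side, and returns a logical vector with the length of the left-hand side.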
names <- c("Alice", "Bob", "Charlie")
names %in% c("Alice", "Charlie")
#> [1] TRUE FALSE TRUE
"Alice" %in% names
#> [1] TRUE
Created on 2020-08-06 by the reprex package (v0.3.0)
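To make the effect of the operand order concrete, here is a minimal sketch on a made-up three-row example. The vectors below are hypothetical stand-ins, not the asker's data. With the scalar on the left, %in% returns a single value, and `|` recycles it across every row. That is one way every row can end up as 1.

```r
# Hypothetical toy data (not the asker's data frame)
heart_conditions <- c("hypertension", "none", "angina")
bp_med <- c(2, 4, 1)

# Reversed direction: a length-1 left-hand side gives a length-1 result
reversed <- "hypertension" %in% heart_conditions
# `|` recycles that single value against the per-row bp_med test
flag_reversed <- ifelse(reversed | bp_med == 1, 1, 2)

# Intended direction: one logical value per row
per_row <- heart_conditions %in% "hypertension"
flag_per_row <- ifelse(per_row | bp_med == 1, 1, 2)

flag_reversed
flag_per_row
```

In this toy case flag_reversed is 1 for all three rows, while flag_per_row is 1 only for the rows that actually satisfy one of the two conditions.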
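Partial matching
As noted in the comments, %in% compares whole elements exactly. To check whether one string is contained in another, we can do the following:
String comparison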
library(tibble) # data.frames
df <- tribble(
~heart_conditions, ~high_chol_tabs, ~bp_med, ~hypertension,
"hypertension high_cholesterol", 2, 2, 1,
"none", 4, 4, 1,
"hypertension high_cholesterol", 1, 1, 1,
"heart_attack angina", 4, 4, 1,
"high_cholesterol", 2, 4, 1,
"hypertension high_cholesterol", 1, 1, 1,
"none", 4, 4, 1,
"none", 4, 4, 1,
"high_cholesterol", 2, 4, 1,
"hypertension high_cholesterol", 1, 1, 1
)
df$hypertension1 <- ifelse(grepl("hypertension", df$heart_conditions) | df$bp_med == 1, 1, 2)
library(stringr)
# imho more user friendly than grepl, but slightly slower
df$hypertension2 <- ifelse(str_detect(df$heart_conditions, "hypertension") | df$bp_med == 1, 1, 2)
df
#> # A tibble: 10 x 6
#> heart_conditions high_chol_tabs bp_med hypertension hypertension1
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 hypertension hi… 2 2 1 1
#> 2 none 4 4 1 2
#> 3 hypertension hi… 1 1 1 1
#> 4 heart_attack an… 4 4 1 2
#> 5 high_cholesterol 2 4 1 2
#> 6 hypertension hi… 1 1 1 1
#> 7 none 4 4 1 2
#> 8 none 4 4 1 2
#> 9 high_cholesterol 2 4 1 2
#> 10 hypertension hi… 1 1 1 1
#> # … with 1 more variable: hypertension2 <dbl>
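The difference between exact and partial matching can be seen side by side in a small sketch. The vector x is a hypothetical example in the style of the heart_conditions column. The fixed = TRUE argument of grepl is used here so that the pattern is treated as a literal string rather than a regular expression.

```r
# Hypothetical values in the style of the heart_conditions column
x <- c("hypertension high_cholesterol", "hypertension", "none")

# Whole-element comparison: only an element equal to "hypertension" matches
exact <- x %in% "hypertension"

# Literal substring search: matches anywhere inside each element
contains <- grepl("hypertension", x, fixed = TRUE)

exact
contains
```

Only the second element matches exactly, whereas the first two both contain the word, which is why the combined condition strings need grepl or str_detect rather than %in%.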
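Split and compare
A slightly slower solution that does not rely on substring matching is to split the conditions on spaces and check whether any of them is hypertension. You can do that like this: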
# split the heart-conditions
conds <- strsplit(df$heart_conditions, " ")
conds
#> [[1]]
#> [1] "hypertension" "high_cholesterol"
#>
#> [[2]]
#> [1] "none"
#>
#> [[3]]
#> [1] "hypertension" "high_cholesterol"
#>
#> [[4]]
#> [1] "heart_attack" "angina"
#>
#> [[5]]
#> [1] "high_cholesterol"
#>
#> [[6]]
#> [1] "hypertension" "high_cholesterol"
#>
#> [[7]]
#> [1] "none"
#>
#> [[8]]
#> [1] "none"
#>
#> [[9]]
#> [1] "high_cholesterol"
#>
#> [[10]]
#> [1] "hypertension" "high_cholesterol"
# for each row of the data, check if any value is hypertension
has_hypertension <- sapply(conds, function(cc) any(cc == "hypertension"))
has_hypertension
#> [1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
df$hypertension3 <- ifelse(has_hypertension | df$bp_med == 1, 1, 2)
df
#> # A tibble: 10 x 7
#> heart_conditions high_chol_tabs bp_med hypertension hypertension1
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 hypertension hi… 2 2 1 1
#> 2 none 4 4 1 2
#> 3 hypertension hi… 1 1 1 1
#> 4 heart_attack an… 4 4 1 2
#> 5 high_cholesterol 2 4 1 2
#> 6 hypertension hi… 1 1 1 1
#> 7 none 4 4 1 2
#> 8 none 4 4 1 2
#> 9 high_cholesterol 2 4 1 2
#> 10 hypertension hi… 1 1 1 1
#> # … with 2 more variables: hypertension2 <dbl>, hypertension3 <dbl>
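As a variation on the split-and-compare idea, the sketch below uses vapply, which fixes the return type, together with %in% in the scalar-on-the-left direction, which is appropriate here because each call handles a single row. The label "pulmonary_hypertension" is invented purely to illustrate one possible reason to prefer token matching: a substring search would also flag it.

```r
# Hypothetical values; "pulmonary_hypertension" is a made-up label
x <- c("hypertension high_cholesterol", "pulmonary_hypertension", "none")

# Split each row into its space-separated condition tokens
tokens <- strsplit(x, " ", fixed = TRUE)

# One TRUE/FALSE per row: is "hypertension" one of that row's tokens?
has_token <- vapply(tokens, function(tk) "hypertension" %in% tk, logical(1))

# Substring search for comparison
has_substring <- grepl("hypertension", x, fixed = TRUE)

has_token
has_substring
```

The two results differ only for the second element, where the word appears inside a longer token.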
Benchmark
Curious about my earlier comment, I ran a quick benchmark comparing the different solutions, and also added a solution using stringi:
# splitter function
has_hypertension <- function(x) sapply(strsplit(x, " "), function(cc) any(cc == "hypertension"))
# create a larger dataset (%>%, slice() and n() come from dplyr)
library(dplyr)
df_large <- df %>% slice(rep(1:n(), 10000))
# benchmark the code:
bench::mark(
grepl = grepl("hypertension", df_large$heart_conditions),
stringi = stringi::stri_detect(df_large$heart_conditions, fixed = "hypertension"),
stringr = str_detect(df_large$heart_conditions, "hypertension"),
splitter = has_hypertension(df_large$heart_conditions)
)
#> # A tibble: 4 x 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#> 1 grepl 16.67ms 16.91ms 59.0 390.67KB 2.11 28 1 474ms <lgl [100,00… <Rprofmem[,3] [1 × … <bch:tm [2… <tibble [29 ×…
#> 2 stringi 2.68ms 2.93ms 344. 390.67KB 6.22 166 3 482ms <lgl [100,00… <Rprofmem[,3] [1 × … <bch:tm [1… <tibble [169 …
#> 3 stringr 17.74ms 17.96ms 55.1 390.67KB 0 28 0 508ms <lgl [100,00… <Rprofmem[,3] [1 × … <bch:tm [2… <tibble [28 ×…
#> 4 splitter 153.39ms 153.39ms 6.52 3.67MB 19.6 1 3 153ms <lgl [100,00… <Rprofmem[,3] [551 … <bch:tm [4… <tibble [4 × …
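This clearly shows that stringi::stri_detect(txt, fixed = "hypertension") is by far the fastest of the four, with a median of 2.93ms against roughly 17ms for grepl and str_detect and 153ms for the splitter!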
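One caveat when reading these numbers: the stringi call uses fixed (literal) matching, while the grepl and str_detect calls treat the pattern as a regular expression. Base R's grepl also accepts fixed = TRUE, so a like-for-like base R candidate could be added to the benchmark. The sketch below only checks that both modes give the same answer for this pattern. It uses a hypothetical toy vector in place of df_large$heart_conditions and makes no claim about timing, which was not measured here.

```r
# Hypothetical toy vector standing in for df_large$heart_conditions
conditions <- rep(
  c("hypertension high_cholesterol", "none", "heart_attack angina"),
  1000
)

# Regular-expression matching (the default) versus literal matching
regex_match <- grepl("hypertension", conditions)
fixed_match <- grepl("hypertension", conditions, fixed = TRUE)

# The pattern has no regex metacharacters, so both should agree
identical(regex_match, fixed_match)
sum(fixed_match)
```

To time it, one could add an entry such as grepl_fixed = grepl("hypertension", df_large$heart_conditions, fixed = TRUE) to the bench::mark() call above.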