R tidymodels recipes: near-zero-variance filter for numeric attributes

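Question

I am having trouble using step_nzv from the R tidymodels recipes package to filter out numeric attributes that have small variance but continuous values. It seems to me that the step only works for nominal values, because it counts the number of unique values and computes the ratio of the most common value to the second most common value. But I have an attribute that is close to zero almost everywhere and never exactly zero. Do I have to bin it first (and discretizing with equal-width bins would change everything)? The code below is a minimal example. I would expect both columns, low_variance_num and low_variance_nom, to be filtered out, which does not happen: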

library(tidymodels)

data <- tibble(num = seq(1000),rand = runif(1000)) %>% 
  mutate(low_variance_num = ifelse(num == 1, 1, rand/10000),
         low_variance_nom = ifelse(num == 1, 1, 0))

data
var(data$low_variance_num)
var(data$low_variance_nom)

recipe <- recipe(formula = num ~., data = data) %>% 
  update_role("num", new_role = "label") %>%
  step_nzv(all_predictors(), freq_cut = 995/5, unique_cut = 10) %>%
  prep()
summary(recipe)

PS: Is there a way to use recipes without providing a formula? In this case, the formula is nonsense.

Tags: r, tidymodels, r-recipes

Solution


For starters, yes, there is a way to use recipes without providing a formula. To do that you call recipe() with only the data as an argument and then manually update the roles via update_role(). This is the recommended approach when the number of variables is very high, as the formula method is memory-inefficient with many variables.

Next, I want to clarify what we mean in tidymodels by "nominal":

Nominal variables include both character and factor.

A numeric variable of all 1s and 0s would not be a nominal variable in tidymodels (would not be selected by all_nominal(), etc).
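One way to check how recipes classifies each column is to look at the type column returned by summary() on a recipe. This is a minimal sketch on made-up toy data (the columns flag and grp are hypothetical); the exact labels shown in the type column may differ between recipes versions:

```r
library(recipes)

# Toy data (assumption, for illustration only):
# flag is a numeric 0/1 column, grp is a factor.
toy <- data.frame(
  num  = seq(10),
  flag = ifelse(seq(10) == 1, 1, 0),
  grp  = factor(rep(c("a", "b"), 5))
)

rec_toy <- recipe(num ~ ., data = toy)

# One row per variable; the type column shows that flag is treated
# as numeric, while grp is treated as nominal.
info <- summary(rec_toy)
info
```

If a 0/1 column should be treated as nominal, converting it to a factor before building the recipe is one option.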

Next, I want to point out that I don't think step_nzv() is going to do what you are hoping here because you are using the term "variance" in a different sense. If you check out the docs, it describes what we mean here by near-zero-variance:

For example, an example of near-zero variance predictor is one that, for 1000 samples, has two distinct values and 999 of them are a single value.

To be flagged, first, the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio") must be above freq_cut. Secondly, the "percent of unique values," the number of unique values divided by the total number of samples (times 100), must also be below unique_cut.

The example low_variance_num variable you made is not near-zero-variance by the definition used in this step: nearly all of its values are unique, so its percent of unique values is far above unique_cut.
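The two statistics quoted above can be computed by hand to see why the columns behave differently. The following is a minimal base-R sketch, not the recipes implementation; the helper name nzv_stats is hypothetical, and the comparison assumes the step's default cutoffs of freq_cut = 95/5 and unique_cut = 10:

```r
# Hypothetical helper: frequency ratio and percent of unique values.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Most frequent count divided by second most frequent count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values relative to the sample size, times 100.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

set.seed(1)
num   <- seq(1000)
rand  <- runif(1000)
pred1 <- ifelse(num == 1, 1, rand / 10000)  # continuous, tiny values
pred2 <- ifelse(num == 1, 1, 0)             # 999 zeros and a single 1

nzv_stats(pred1)
nzv_stats(pred2)
```

For pred2 the frequency ratio is 999 and the percent of unique values is 0.2, so both conditions are met. For pred1 almost every value is distinct, so the percent of unique values is far above 10 and the column is not flagged, however small its variance.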

For reference, here is a demo of how to build a recipe without the formula:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

df <- tibble(num = seq(1000), rand = runif(1000)) %>% 
  mutate(pred1 = ifelse(num == 1, 1, rand/10000),
         pred2 = ifelse(num == 1, 1, 0))

rec <- recipe(df) %>% 
  update_role(num, new_role = "label") %>%
  update_role(rand, pred1, pred2, new_role = "predictor") %>%
  step_nzv(all_predictors())

rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 1,000 x 3
#>      num  rand     pred1
#>    <int> <dbl>     <dbl>
#>  1     1 0.842 1        
#>  2     2 0.942 0.0000942
#>  3     3 0.977 0.0000977
#>  4     4 0.595 0.0000595
#>  5     5 0.259 0.0000259
#>  6     6 0.454 0.0000454
#>  7     7 0.550 0.0000550
#>  8     8 0.388 0.0000388
#>  9     9 0.702 0.0000702
#> 10    10 0.481 0.0000481
#> # … with 990 more rows

Created on 2021-01-07 by the reprex package (v0.3.0)

The predictor pred2 was removed because it has so few unique values and they are almost all 0. The predictor pred1 was not removed because it has many unique values. I think if I wanted to do the kind of filtering you are describing, I would do it in data cleaning/preparation, not within a feature engineering recipe in a model pipeline.
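A filter of that kind in the data preparation stage could look like the following dplyr sketch. It drops numeric columns whose sample variance is at or below a cutoff; the name var_cutoff and its value 1e-2 are assumptions chosen only for this toy data. Raw variance depends on each column's scale, so a real cutoff would need to be chosen per data set:

```r
library(dplyr)

set.seed(1)
df <- tibble(num = seq(1000), rand = runif(1000)) %>%
  mutate(pred1 = ifelse(num == 1, 1, rand / 10000),
         pred2 = ifelse(num == 1, 1, 0))

# Assumption: arbitrary threshold that suits this toy data only.
var_cutoff <- 1e-2

# Keep non-numeric columns, plus numeric columns whose variance
# exceeds the cutoff.
df_filtered <- df %>%
  select(where(~ !is.numeric(.x) || var(.x) > var_cutoff))

names(df_filtered)
```

Here pred1 and pred2 both have a variance of roughly 0.001 and are dropped, while num and rand are kept.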

