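r - step_num2factor() usage in tidymodels (recipes package)
Question
OK, to be honest, I have read the function reference for step_num2factor and still have not figured out how to use it properly.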
temp_names <- as.character(unique(sort(all_raw$MSSubClass)))
price_recipe <-
  recipe(SalePrice ~ . , data = train_raw) %>%
  step_num2factor(MSSubClass, levels = temp_names)
temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data
class(all_raw$MSSubClass)
# > col_double()
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
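After the step, the output temp_data$MSSubClass is all NA. The observations are stored as 20, 30, 40, ..., 190, and I want to convert them to names (or even to the same numbers, but as an unordered factor).
If you know of blog posts or example code that cover step_num2factor in more depth, I would be glad to see those too.
The full dataset is provided by Kaggle: kaggle data
Thanks in advance,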
Answer
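I don't think step_num2factor() is the best fit for this variable. Take another look at the help page and notice that you need to give a transform argument, which can be used to modify the numeric values before the levels are determined. That would work fine if these values were all multiples of 10, but you have some values such as 75 and 85, so I don't think you want that. This recipe step is best for numeric/integer-ish variables that a simple function can easily convert to a set of integers.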
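To make the all-NA result less mysterious, here is a minimal base R sketch of the lookup as I understand it from the help page: the (transformed) value is treated as an integer position in levels. This is a simplified model and not the package's actual source; the names codes, lvls, naive, pos and fixed are made up for illustration.

```r
# Simplified model of the lookup (an assumption about the step's behaviour,
# not code taken from recipes): each value is used as a position in `lvls`.
codes <- c(20, 30, 40, 45, 50, 60, 70, 75, 80, 85, 90, 120, 150, 160, 180, 190)
lvls  <- as.character(codes)

x <- c(20, 75, 190)   # toy stand-in for a few MSSubClass values

# Using the raw codes as positions: `lvls` has only 16 entries, so
# positions 20, 75 and 190 fall outside it and come back as NA.
naive <- factor(lvls[as.integer(x)], levels = lvls)

# Mapping each code to its position first gives usable indices.
pos   <- match(x, codes)
fixed <- factor(lvls[pos], levels = lvls)

print(naive)
print(fixed)
```

If the step really does work this way, a lookup such as function(x) match(x, codes) is the kind of function transform would need, which is more machinery than this variable deserves.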
Instead, I think you should consider a step_mutate() with a simple coercion to factor type:
library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step
train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   Id = col_double(),
#>   MSSubClass = col_double(),
#>   LotFrontage = col_double(),
#>   LotArea = col_double(),
#>   OverallQual = col_double(),
#>   OverallCond = col_double(),
#>   YearBuilt = col_double(),
#>   YearRemodAdd = col_double(),
#>   MasVnrArea = col_double(),
#>   BsmtFinSF1 = col_double(),
#>   BsmtFinSF2 = col_double(),
#>   BsmtUnfSF = col_double(),
#>   TotalBsmtSF = col_double(),
#>   `1stFlrSF` = col_double(),
#>   `2ndFlrSF` = col_double(),
#>   LowQualFinSF = col_double(),
#>   GrLivArea = col_double(),
#>   BsmtFullBath = col_double(),
#>   BsmtHalfBath = col_double(),
#>   FullBath = col_double()
#>   # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.
price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass))
juiced_price <- prep(price_recipe) %>%
  juice()
levels(juiced_price$MSSubClass)
#> [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90" "120"
#> [13] "160" "180" "190"
juiced_price %>%
  count(MSSubClass)
#> # A tibble: 15 x 2
#>    MSSubClass     n
#>    <fct>      <int>
#>  1 20           536
#>  2 30            69
#>  3 40             4
#>  4 45            12
#>  5 50           144
#>  6 60           299
#>  7 70            60
#>  8 75            16
#>  9 80            58
#> 10 85            20
#> 11 90            52
#> 12 120           87
#> 13 160           63
#> 14 180           10
#> 15 190           30
Created on 2020-05-03 by the reprex package (v0.3.0)
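It looks to me like this gets you the factor levels you want. If you'd like to use the strings from the .txt file (such as "1-STORY 1945 & OLDER"), save them as a new_levels vector and pass them as labels alongside the matching numeric codes, for example factor(MSSubClass, levels = codes, labels = new_levels), where codes is a vector of the numeric codes in the same order. Passing the strings as levels on their own would match none of the numeric values and give NA again.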
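One detail worth spelling out when attaching the descriptions: base R's factor() matches the data against levels and only uses labels for renaming. Below is a small sketch of that, using the first four entries of the data dictionary quoted in the question; codes, new_levels, x, labelled and unmatched are illustrative names, and x is toy data rather than the real column.

```r
# First four codes and descriptions from the data dictionary in the question;
# extend both vectors in the same order to cover the rest.
codes <- c(20, 30, 40, 45)
new_levels <- c(
  "1-STORY 1946 & NEWER ALL STYLES",
  "1-STORY 1945 & OLDER",
  "1-STORY W/FINISHED ATTIC ALL AGES",
  "1-1/2 STORY - UNFINISHED ALL AGES"
)

x <- c(30, 20, 45, 30)   # toy stand-in for train_raw$MSSubClass

# `levels` lists the values that occur in the data; `labels` renames them.
labelled <- factor(x, levels = codes, labels = new_levels)

# Passing the descriptions as `levels` alone matches nothing in x,
# so every element becomes NA.
unmatched <- factor(x, levels = new_levels)

print(table(labelled))
print(unmatched)
```

Inside the recipe I would expect the same call to fit in step_mutate(), along the lines of step_mutate(MSSubClass = factor(MSSubClass, levels = codes, labels = new_levels)). Spelling out the full set of codes also seems safer than a bare factor(MSSubClass): the dictionary lists code 150, which does not show up among the 15 training levels above, and as far as I know step_mutate() re-evaluates its expression on whatever data is baked, so levels inferred from the data could differ between training and new data.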