Error when creating a workflow from a recipe with a linear model in R

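Problem description

I am training a linear regression model on StackOverflow data to predict salary from company size (company_size_number) and country (country).

What I am doing:

  1. Read the data. Split it into a training set (75%) and a test set (25%).
  2. Create a recipe that converts company_size_number to a factor variable, then converts both predictors to dummy variables.
  3. Create a model specification.
  4. Create a workflow object, add the recipe and the model specification to it, then fit the model on the training set.
  5. Compute R² on the test set.

Here is my code: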

library(tidyverse)
library(tidymodels)

so <- read_rds("stackoverflow.rds") 

set.seed(123)
init_split <- initial_split(so)
so_training <- training(init_split)
so_testing <- testing(init_split)

rec <- recipe(salary ~ ., data = so_training %>% select(salary, company_size_number, country)) %>%
  step_num2factor(company_size_number = factor(company_size_number)) %>%
  step_dummy(country, company_size_number)

model_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

fit <- workflow() %>%
  add_model(model_spec) %>%
  add_recipe(rec) %>%
  fit(data = so_training)

predict(fit, new_data = so_testing) %>%
  mutate(truth = so_testing$salary) %>%
  rmse(estimate = .pred, truth = truth)

But I cannot proceed because of this error:

Error: Please provide a character vector of appropriate length for `levels`.

I think I am messing up the step_*() calls here:

rec <- recipe(salary ~ ., data = so_training %>% select(salary, company_size_number, country)) %>%
  step_novel(company_size_number = factor(company_size_number)) %>%
  step_dummy(country, company_size_number)

But I am not sure whether this is correct. Any input would be helpful.

> dput(head(so))
structure(list(country = structure(c(5L, 5L, 4L, 4L, 5L, 5L), .Label = c("Canada", 
"Germany", "India", "United Kingdom", "United States"), class = "factor"), 
    salary = c(63750, 93000, 40625, 45000, 1e+05, 170000), years_coded_job = c(4L, 
    9L, 8L, 3L, 8L, 12L), open_source = c(0, 1, 1, 1, 0, 1), 
    hobby = c(1, 1, 1, 0, 1, 1), company_size_number = c(20, 
    1000, 10000, 1, 10, 100), remote = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("Remote", "Not remote"), class = "factor"), 
    career_satisfaction = c(8L, 8L, 5L, 10L, 8L, 10L), data_scientist = c(0, 
    0, 1, 0, 0, 0), database_administrator = c(1, 0, 1, 0, 0, 
    0), desktop_applications_developer = c(1, 0, 1, 0, 0, 0), 
    developer_with_stats_math_background = c(0, 0, 0, 0, 0, 0
    ), dev_ops = c(0, 0, 0, 0, 0, 1), embedded_developer = c(0, 
    0, 0, 0, 0, 0), graphic_designer = c(0, 0, 0, 0, 0, 0), graphics_programming = c(0, 
    0, 0, 0, 0, 0), machine_learning_specialist = c(0, 0, 0, 
    0, 0, 0), mobile_developer = c(0, 1, 0, 0, 1, 0), quality_assurance_engineer = c(0, 
    0, 0, 0, 0, 0), systems_administrator = c(1, 0, 1, 0, 0, 
    1), web_developer = c(0, 0, 0, 1, 1, 1)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Tags: r, tidyverse, tidymodels

Solution


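I have a few suggestions for adjusting what you are doing.

  • The first is to select the variables before splitting, so that when you use a formula like salary ~ ., neither you nor the functions get confused about what is in there.
  • The second is not to use step_num2factor() the way you have it; it takes a lot to get it working correctly, and I think you are better off converting the column to a factor before splitting. Check the documentation for this step to see a more appropriate use of it, and note that you must supply levels. That is why you are seeing the error, but honestly I would not try to work out the right levels and pass them in there; I would do the conversion before the split.
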
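For reference, if the conversion did have to stay inside the recipe, a minimal sketch of what step_num2factor() expects might look like the following. It uses a small toy data frame rather than the real dataset, and the `size_values` lookup is a hypothetical helper introduced only for illustration. The step treats the numbers as integer positions into `levels`, so raw sizes such as 1, 10 and 100 have to be mapped to 1, 2, 3 through `transform` first.

```r
library(tidymodels)

# Toy data (made up for illustration): company size takes the values 1, 10, 100.
toy <- data.frame(
  salary = c(50000, 60000, 70000, 80000, 65000, 90000),
  company_size_number = c(1, 10, 100, 10, 1, 100)
)

# Hypothetical lookup: the distinct company sizes, in increasing order.
size_values <- c(1, 10, 100)

rec_sketch <- recipe(salary ~ ., data = toy) %>%
  step_num2factor(
    company_size_number,
    # Map each raw value to its position in size_values (1, 2, 3),
    # because step_num2factor() uses the numbers as indices into `levels`.
    transform = function(x) match(x, size_values),
    # `levels` is required; omitting it triggers the error in the question.
    levels = as.character(size_values)
  ) %>%
  step_dummy(company_size_number)

# Estimate the recipe on the toy data and return the processed training set.
baked <- bake(prep(rec_sketch), new_data = NULL)
baked
```

A size that is missing from `size_values` would come out as NA after the conversion, which is part of why this route is fiddly. Converting before the split, as in the full example that follows, avoids having to maintain that lookup.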
library(tidyverse)
library(tidymodels)

data("stackoverflow", package = "modeldata")
so <- janitor::clean_names(stackoverflow)

set.seed(123)
init_split <- so %>%
   select(salary, company_size_number, country) %>%
   mutate(company_size_number = factor(company_size_number)) %>%
   initial_split()
so_training <- training(init_split)
so_testing <- testing(init_split)

rec <- recipe(salary ~ ., data = so_training) %>%
   step_dummy(country, company_size_number)

model_spec <- linear_reg() %>%
   set_engine("lm") %>%
   set_mode("regression")

fit <- workflow() %>%
   add_model(model_spec) %>%
   add_recipe(rec) %>%
   fit(data = so_training)

predict(fit, new_data = so_testing) %>%
   mutate(truth = so_testing$salary) %>%
   rmse(estimate = .pred, truth = truth)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      27822.

Created on 2021-05-25 by the reprex package (v2.0.0)
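The question lists R² as its last step, while the code shown computes RMSE. If R² is what is wanted, yardstick's rsq() takes the same truth and estimate arguments as rmse(), and metric_set() can report both at once. The sketch below uses made-up predictions in a toy tibble named `results`, not output from the fitted workflow.

```r
library(tidymodels)

# Toy truth/prediction pairs (made-up numbers, for illustration only).
results <- tibble(
  truth = c(63750, 93000, 40625, 45000, 100000),
  .pred = c(60000, 90000, 50000, 47000, 110000)
)

# Bundle RMSE and R-squared into a single metric function.
reg_metrics <- metric_set(rmse, rsq)

# Returns one row per metric, in the order given to metric_set().
out <- reg_metrics(results, truth = truth, estimate = .pred)
out
```

With the fitted workflow, the same call could replace the rmse() line, applied to the tibble produced by predict() plus the truth column.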

