Is using "group_by" enough to run a grouped linear regression?

Question

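My dataset contains about 7,000 families, and for each family I have the parents' income and their children's income. I now want to run a simple linear regression of the children's income on the parents' income. However, I need to make sure this regression is run for each family.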

Example dataset:

income_parents <- c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200)
income_children <- c(1200, 7500, 2500, 8000, 5500,  7500, 3250, 7500, 850)
family_name <- c("Miller", "Smith", "Clark", "Powell", "Brown", "Jone", "Garcia", "Williams", "Lopez")

df <- data.frame(income_parents, income_children, family_name)

After grouping by family_name, I run the following regression:

library(dplyr)  # provides %>% and group_by()

df_AR <- df %>% group_by(family_name)
AR_1 <- lm(income_children ~ income_parents, data = df_AR)
summary(AR_1)

Now I wonder whether the lm() function takes the nested data structure into account. If not, how can I change my code so that it does?

Tags: r, group-by, dplyr, linear-regression

Answer


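That won't do what you want. lm() is quite old, so it knows nothing about newer libraries such as dplyr. It ignores the grouping and fits a single regression over all rows.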
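A quick way to check this is to fit the same formula on the plain and the grouped data frame and compare the coefficients. The sketch below reuses the sample data from the question; the names fit_plain and fit_grouped are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  income_parents  = c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200),
  income_children = c(1200, 7500, 2500, 8000, 5500, 7500, 3250, 7500, 850),
  family_name     = c("Miller", "Smith", "Clark", "Powell", "Brown",
                      "Jone", "Garcia", "Williams", "Lopez")
)

# Same formula, once on the plain data frame and once on the grouped one.
fit_plain   <- lm(income_children ~ income_parents, data = df)
fit_grouped <- lm(income_children ~ income_parents, data = group_by(df, family_name))

# Each fit has exactly one intercept and one slope; the grouping changes nothing.
coef(fit_plain)
coef(fit_grouped)
all.equal(coef(fit_plain), coef(fit_grouped))
```

Both fits have two coefficients in total, not two per family, which is why group_by() alone is not enough.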

I believe you can get what you want by adding the family name to the model as an indicator variable. Something like:

model <- lm(income_children ~ family_name + family_name:income_parents, data = df)

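This effectively fits a mini model for each family. The family_name term gives each family its own intercept, and the interaction term gives each family its own slope for income_parents.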
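To see that the interaction model really reproduces the per-family fits, here is a minimal sketch on made-up data (the toy data frame is hypothetical, with four observations per family, and is not from the question). It uses a slight variant of the formula above: `0 +` drops the global intercept, so each family_name coefficient is that family's intercept directly rather than an offset from a reference family.

```r
# Hypothetical data: three families, four observations each.
toy <- data.frame(
  family_name     = rep(c("Brown", "Clark", "Smith"), each = 4),
  income_parents  = c(1000, 2000, 3000, 4000,
                      1500, 2500, 3500, 4500,
                      2000, 4000, 6000, 8000),
  income_children = c(1100, 1900, 3200, 3900,
                      1000, 1400, 2100, 2300,
                      2500, 4100, 6400, 7900)
)

# One model with a separate intercept and slope for every family.
pooled <- lm(income_children ~ 0 + family_name + family_name:income_parents,
             data = toy)

# A stand-alone regression for a single family, for comparison.
clark_only <- lm(income_children ~ income_parents,
                 data = subset(toy, family_name == "Clark"))

# Intercept and slope for Clark from both fits.
coef(pooled)[c("family_nameClark", "family_nameClark:income_parents")]
coef(clark_only)
```

The point estimates agree, but the single model assumes one common residual variance across all families, so standard errors and p-values will generally differ from those of separate regressions.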

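If you want to stick with the many-models approach, you can use nest:

library(dplyr)
library(tidyr)  # nest(), unnest()
library(purrr)  # map()

# Fit the same regression to one family's rows.
model_one <- function(data) {
  lm(income_children ~ income_parents, data = data)
}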

models <- df %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(model = map(data, model_one))

models
# # A tibble: 9 x 3
# # Groups:   family_name [9]
#   family_name data             model 
#   <fct>       <list>           <list>
# 1 Miller      <tibble [1 × 2]> <lm>  
# 2 Smith       <tibble [1 × 2]> <lm>  
# 3 Clark       <tibble [1 × 2]> <lm>  
# 4 Powell      <tibble [1 × 2]> <lm>  
# 5 Brown       <tibble [1 × 2]> <lm>  
# 6 Jone        <tibble [1 × 2]> <lm>  
# 7 Garcia      <tibble [1 × 2]> <lm>  
# 8 Williams    <tibble [1 × 2]> <lm>  
# 9 Lopez       <tibble [1 × 2]> <lm>  

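You'll notice this output isn't very useful yet. It can be summarized concisely with broom::glance and then unnested. The result isn't very interesting here, because each model has only one data point.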

summarized <- models %>%
  mutate(summary = map(model, broom::glance)) %>%
  unnest(summary)

# Drop the still-nested columns for display.
summarized %>% select(-data, -model)
#   family_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
#   <fct>           <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
# 1 Miller              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 2 Smith               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 3 Clark               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 4 Powell              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 5 Brown               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 6 Jone                0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 7 Garcia              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 8 Williams            0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 9 Lopez               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
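With a single row per family, as in the sample data, a slope cannot be estimated at all. When each family has several observations, the same nest-and-map pattern can return the per-family coefficients via broom::tidy instead of broom::glance. The following is only a sketch on hypothetical data (the toy data frame and the coefs name are made up for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical data: three families, four observations each.
toy <- data.frame(
  family_name     = rep(c("Brown", "Clark", "Smith"), each = 4),
  income_parents  = c(1000, 2000, 3000, 4000,
                      1500, 2500, 3500, 4500,
                      2000, 4000, 6000, 8000),
  income_children = c(1100, 1900, 3200, 3900,
                      1000, 1400, 2100, 2300,
                      2500, 4100, 6400, 7900)
)

coefs <- toy %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(
    model  = map(data, ~ lm(income_children ~ income_parents, data = .x)),
    tidied = map(model, tidy)   # one row per coefficient
  ) %>%
  select(family_name, tidied) %>%
  unnest(tidied) %>%
  ungroup()

# Two rows per family: "(Intercept)" and "income_parents".
coefs
```

The estimate column then holds each family's intercept and slope, alongside its standard error, test statistic, and p-value.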

For more details, I recommend the Many Models chapter of R for Data Science.

