How do I run a regression analysis on panel data in R?

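Problem description

I'm an R newbie. It has been more than a year since I last used R, and I seem to have forgotten a lot... :(

I have a panel dataset of different countries observed in 2005, 2010 and 2015, as shown below: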

   Location Year Health_Spending Total NCD Deaths_male
1      CAN 2005        3282.454                 101.4
2      CAN 2010        4225.189                 105.5
3      CAN 2015        4632.837                 109.2
4      ESP 2005        2126.553                 179.9
5      ESP 2010        2882.912                 180.6
6      ESP 2015        3175.457                 183.1
  Total NCD Deaths_female
1                   102.7
2                   107.3
3                   110.2
4                   170.4
5                   170.6
6                   180.8

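I am trying to run a regression with Health_Spending as Y and Total NCD Deaths_male and Total NCD Deaths_female as X1 and X2.

I have been searching around, and the plm package seems to be widely used for analysing panel data in R, but I can't figure out how to use it.

Could a kind soul help me out and point me to what I need to do?

(Here is the dput version of my data, just in case.)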

    structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP", 
"ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN", 
"JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD", 
"NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR", 
"TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454, 
4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114, 
4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349, 
1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707, 
4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424, 
1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9", 
"180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8", 
"18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3", 
"194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5", 
"17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8", 
"1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7", 
"107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0", 
"264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9", 
"93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7", 
"18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6", 
"181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location", 
"Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
), class = "data.frame", row.names = c(NA, -36L))

Tags: r, dataframe, regression, analysis, panel-data

Solution


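I assume you want to use a standard multiple regression approach. You can do that easily with lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df). Just make sure that the variables Total NCD Deaths_male and Total NCD Deaths_female are of type numeric and that Location is of type categorical (a factor). Note that wrapping the data in data.frame() replaces the spaces in those column names with dots, which is why the formula refers to Total.NCD.Deaths_male and Total.NCD.Deaths_female.

The following snippet shows how to change the data types, build the model, and report the model results.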


# Data
df <- data.frame(structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP", 
                "ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN", 
                "JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD", 
                "NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR", 
                "TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454, 
                4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114, 
                4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349, 
                1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707, 
                4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424, 
                1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
                ), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9", 
                "180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8", 
                "18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3", 
                "194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5", 
                "17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8", 
                "1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7", 
                "107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0", 
                "264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9", 
                "93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7", 
                "18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6", 
                "181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location", 
                "Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
                ), class = "data.frame", row.names = c(NA, -36L)))


# Data transformation
df$Health_Spending <- as.numeric(df$Health_Spending)
df$Location <- as.factor(df$Location)
df$Total.NCD.Deaths_male <- as.numeric(df$Total.NCD.Deaths_male)
df$Total.NCD.Deaths_female <- as.numeric(df$Total.NCD.Deaths_female)

# Model and model summary
m <- lm(Health_Spending~Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df)
summary(m)

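In the summary, every location except CAN appears with its own coefficient. That is because CAN, the first factor level, was automatically chosen as the reference level against which all other locations are compared. For Total.NCD.Deaths_male and Total.NCD.Deaths_female, check the significance codes: a "." marks significance at the 10% level only, so with Location in the model neither variable shows up as a strong predictor.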
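To make the reference-level behaviour concrete, here is a minimal sketch. It uses only the first four countries from the question's data to stay short, and the switch to ESP as the baseline is an arbitrary choice for illustration, not something the answer above does. It is meant to show that lm() leaves the first factor level (CAN) out of the coefficient table, and that relevel() changes which level is left out without changing the fitted values.

```r
# Subset of the question's data: first four countries only (assumption: kept
# small for readability; column names are written with dots directly)
df <- data.frame(
  Location = rep(c("CAN", "ESP", "GBR", "ISR"), each = 3),
  Year = rep(c(2005L, 2010L, 2015L), times = 4),
  Health_Spending = c(3282.454, 4225.189, 4632.837, 2126.553, 2882.912, 3175.457,
                      2331.136, 3040.114, 4071.806, 1768.952, 2032.725, 2646.915),
  Total.NCD.Deaths_male = c(101.4, 105.5, 109.2, 179.9, 180.6, 183.1,
                            245.8, 242.0, 249.0, 16.7, 16.8, 18.0),
  Total.NCD.Deaths_female = c(102.7, 107.3, 110.2, 170.4, 170.6, 180.8,
                              268.2, 259.0, 264.1, 17.5, 17.4, 18.7)
)

# Factor levels are sorted alphabetically, so CAN becomes the reference level
df$Location <- as.factor(df$Location)
levels(df$Location)

m <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location,
        data = df)
names(coef(m))   # LocationESP, LocationGBR, LocationISR, but no LocationCAN

# Hypothetical alternative: use ESP as the baseline instead
df$Location <- relevel(df$Location, ref = "ESP")
m2 <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location,
         data = df)
names(coef(m2))  # now LocationCAN appears and LocationESP is absent
```

Changing the reference level only changes how the location coefficients are expressed (each one is a difference from the baseline country); the slopes for the two death-rate variables and the overall fit stay the same.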

A few words of caution


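Before building a model, you should always look at the structure of your data. If you decide to drop the Location variable from the model, you will get very different results, and you might even conclude that both Total.NCD.Deaths_male and Total.NCD.Deaths_female are highly significant: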

Call:
lm(formula = Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2319.8 -1176.5    56.6   943.0  3535.2 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2615.17     362.78   7.209 2.89e-08 ***
Total.NCD.Deaths_male     -36.39      11.83  -3.077  0.00418 ** 
Total.NCD.Deaths_female    39.08      11.34   3.447  0.00156 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1509 on 33 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.4841 
F-statistic: 17.42 on 2 and 33 DF,  p-value: 6.849e-06

However, given the structure of the dataset, that would be a highly misleading conclusion.


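In the data, every location occurs multiple times, and the same goes for Year. Without subsetting the data, the simpler model m <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female, data = df) does not take that into account. Using Location as a factor variable makes up for this to some extent, but you should also consider including Year as an explanatory variable, of type numeric or categorical, or accounting for the time element in some other way, perhaps by splitting the dataset into separate periods.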
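One way to bring Year into the model, as suggested above, is sketched below. This is only an illustration under stated assumptions: it again uses the first four countries from the question's data, it treats Year as categorical via factor(Year), and the object names m_loc and m_year are made up for this example. The final step is optional and assumes the plm package mentioned in the question is installed; it fits a two-way fixed-effects ("within") model, which is one possible way to handle both the country and the time dimension, not the approach taken in this answer.

```r
# Subset of the question's data: first four countries only (assumption)
df <- data.frame(
  Location = rep(c("CAN", "ESP", "GBR", "ISR"), each = 3),
  Year = rep(c(2005L, 2010L, 2015L), times = 4),
  Health_Spending = c(3282.454, 4225.189, 4632.837, 2126.553, 2882.912, 3175.457,
                      2331.136, 3040.114, 4071.806, 1768.952, 2032.725, 2646.915),
  Total.NCD.Deaths_male = c(101.4, 105.5, 109.2, 179.9, 180.6, 183.1,
                            245.8, 242.0, 249.0, 16.7, 16.8, 18.0),
  Total.NCD.Deaths_female = c(102.7, 107.3, 110.2, 170.4, 170.6, 180.8,
                              268.2, 259.0, 264.1, 17.5, 17.4, 18.7)
)
df$Location <- factor(df$Location)

# Model with country dummies only (same structure as the model in the answer)
m_loc <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female +
              Location, data = df)

# Same model plus Year as a categorical variable (2005 is the reference year)
m_year <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female +
               Location + factor(Year), data = df)
summary(m_year)

# F-test for whether the year dummies improve on the country-only model
anova(m_loc, m_year)

# Optional: two-way fixed effects with plm, only if the package is available
if (requireNamespace("plm", quietly = TRUE)) {
  fe <- plm::plm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
                 data = df, index = c("Location", "Year"),
                 model = "within", effect = "twoways")
  print(summary(fe))
}
```

With only three time points per country, the year dummies use up a noticeable share of the available degrees of freedom, so treating Year as numeric instead (a single linear trend) is a reasonable alternative to try.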

I hope this is what you were looking for. If not, feel free to let me know.

