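r - How do I run a regression analysis on panel data in R?
Problem description
I'm an R newbie, and it has been more than a year since I last used R, so I seem to have forgotten a lot... :(
I have a panel data set of different countries observed in 2005, 2010 and 2015, which looks like this: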
  Location Year Health_Spending Total NCD Deaths_male Total NCD Deaths_female
1      CAN 2005        3282.454                 101.4                   102.7
2      CAN 2010        4225.189                 105.5                   107.3
3      CAN 2015        4632.837                 109.2                   110.2
4      ESP 2005        2126.553                 179.9                   170.4
5      ESP 2010        2882.912                 180.6                   170.6
6      ESP 2015        3175.457                 183.1                   180.8
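I'm trying to run a regression with Health_Spending as Y and with Total NCD Deaths_male and Total NCD Deaths_female as X1 and X2.
I've been looking around, and the plm package seems to be widely used for analyzing panel data in R, but I can't figure out how to use it.
Could a kind soul help me out and point me to what I need to do?
(Here is the dput version of my data, just in case.)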
structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP",
"ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN",
"JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD",
"NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR",
"TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454,
4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114,
4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349,
1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707,
4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424,
1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9",
"180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8",
"18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3",
"194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5",
"17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8",
"1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7",
"107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0",
"264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9",
"93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7",
"18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6",
"181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location",
"Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
), class = "data.frame", row.names = c(NA, -36L))
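Solution
I assume you want a standard multiple regression. You can do that easily with lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df). Just make sure that the variables Total NCD Deaths_male and Total NCD Deaths_female are of type numeric and that Location is a factor (categorical). The formula uses dotted names such as Total.NCD.Deaths_male because wrapping the data in data.frame() replaces the spaces in the column names with dots.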
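As a quick way to check both requirements, here is a minimal sketch on a toy two-row stand-in for the data (the names raw and fixed are illustrative, not from the question). It also shows where the dotted names come from: data.frame() with its default check.names = TRUE rewrites names that contain spaces.

```r
# Toy stand-in: two rows copied from the question, columns still character
raw <- data.frame(
  Location = c("CAN", "ESP"),
  `Total NCD Deaths_male` = c("101.4", "179.9"),
  check.names = FALSE,       # keep the name with spaces, as in the dput
  stringsAsFactors = FALSE   # keep text columns as character
)
sapply(raw, class)           # both columns are character here

# Passing the data frame through data.frame() makes the names syntactic
fixed <- data.frame(raw)
names(fixed)                 # "Location" "Total.NCD.Deaths_male"

# Convert the types; as.character() first is a guard in case a column is a factor
fixed$Location <- as.factor(fixed$Location)
fixed$Total.NCD.Deaths_male <- as.numeric(as.character(fixed$Total.NCD.Deaths_male))
sapply(fixed, class)         # factor and numeric
```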
The following snippet shows how to change the data types, build the model and report the model results.
# Data
df <- data.frame(structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP",
"ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN",
"JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD",
"NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR",
"TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L,
2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454,
4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114,
4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349,
1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707,
4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424,
1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9",
"180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8",
"18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3",
"194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5",
"17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8",
"1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7",
"107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0",
"264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9",
"93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7",
"18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6",
"181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location",
"Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
), class = "data.frame", row.names = c(NA, -36L)))
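# Data transformation
# (data.frame() above has already turned the spaces in the column names into dots)
df$Health_Spending <- as.numeric(df$Health_Spending)  # already numeric; harmless
df$Location <- as.factor(df$Location)                 # categorical predictor
# as.character() first guards against factor columns, where as.numeric() alone returns level codes
df$Total.NCD.Deaths_male <- as.numeric(as.character(df$Total.NCD.Deaths_male))
df$Total.NCD.Deaths_female <- as.numeric(as.character(df$Total.NCD.Deaths_female))
# Model and model summary
m <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df)
summary(m)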
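In the summary, you will find every location except CAN (Canada) listed as an explanatory factor variable. That is because CAN was automatically chosen as the reference level against which all other locations are compared. For Total.NCD.Deaths_female and Total.NCD.Deaths_male, check the significance codes in the same summary: a coefficient flagged with "." is significant only at the 10% level, and one with no flag is insignificant at that level.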
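The reference level is simply the first level of the factor, which by default is the first in alphabetical order. If a different baseline would be easier to interpret, relevel() changes it. A small sketch on a toy factor (loc and loc_usa are illustrative names, and USA as the baseline is only an example):

```r
# Toy factor with three of the country codes from the data
loc <- factor(c("CAN", "ESP", "USA"))
levels(loc)                        # "CAN" "ESP" "USA"
colnames(model.matrix(~ loc))      # "(Intercept)" "locESP" "locUSA"

# Move USA to the front so it becomes the reference level
loc_usa <- relevel(loc, ref = "USA")
levels(loc_usa)                    # "USA" "CAN" "ESP"
colnames(model.matrix(~ loc_usa))  # "(Intercept)" "loc_usaCAN" "loc_usaESP"
```

Applied to the data here, this would be df$Location <- relevel(df$Location, ref = "USA") before calling lm(). The fitted values stay the same; only the baseline that the Location coefficients are measured against changes.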
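A word of caution
Before building a model, you should always pay attention to the structure of your data. If you drop the Location variable from the model, you get very different results and might even conclude that both Total.NCD.Deaths_male and Total.NCD.Deaths_female are highly significant: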
Call:
lm(formula = Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-2319.8 -1176.5    56.6   943.0  3535.2

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              2615.17     362.78   7.209 2.89e-08 ***
Total.NCD.Deaths_male     -36.39      11.83  -3.077  0.00418 **
Total.NCD.Deaths_female    39.08      11.34   3.447  0.00156 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1509 on 33 degrees of freedom
Multiple R-squared:  0.5136,  Adjusted R-squared:  0.4841
F-statistic: 17.42 on 2 and 33 DF,  p-value: 6.849e-06
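However, because of the structure of the data set, that would be a highly misleading conclusion. Every location occurs multiple times in the data, and the same goes for Year. Without subsetting the data, the simpler model m <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female, data = df) does not take this into account. Using Location as a factor variable makes up for this to some extent, but you should also consider including Year as an explanatory variable of type numeric or categorical, or accounting for the time element in some other way, perhaps by splitting the data set into different periods.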
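One way to act on this, as a sketch rather than a definitive specification: add Year as a factor next to Location, or fit a single period on its own. Since the question asks about plm, the last lines show what I believe is the equivalent two-way fixed-effects ("within") model, assuming the plm package is installed. The data frame below is only a compact rebuild of the question's dput with dotted column names; m_year, m_2005 and fe are illustrative names.

```r
# Compact rebuild of the question's data (12 countries x 3 years)
df <- data.frame(
  Location = factor(rep(c("CAN", "ESP", "GBR", "ISR", "JPN", "KOR",
                          "MEX", "NLD", "NOR", "POL", "TUR", "USA"), each = 3)),
  Year = rep(c(2005L, 2010L, 2015L), times = 12),
  Health_Spending = c(
    3282.454, 4225.189, 4632.837, 2126.553, 2882.912, 3175.457,
    2331.136, 3040.114, 4071.806, 1768.952, 2032.725, 2646.915,
    2463.725, 3205.216, 4428.349, 1183.438, 1895.699, 2481.587,
    730.816, 911.351, 1037.424, 3454.707, 4633.738, 5148.399,
    3980.768, 5162.669, 6239.435, 806.974, 1352.424, 1687.009,
    582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4),
  Total.NCD.Deaths_male = c(
    101.4, 105.5, 109.2, 179.9, 180.6, 183.1, 245.8, 242.0, 249.0,
    16.7, 16.8, 18.0, 460.3, 503.7, 543.2, 105.7, 110.2, 118.3,
    194.7, 230.7, 257.5, 58.9, 58.6, 63.2, 17.4, 17.5, 17.1,
    172.7, 175.1, 175.9, 185.3, 197.4, 211.8, 1024.9, 1061.6, 1159.5),
  Total.NCD.Deaths_female = c(
    102.7, 107.3, 110.2, 170.4, 170.6, 180.8, 268.2, 259.0, 264.1,
    17.5, 17.4, 18.7, 405.0, 458.9, 528.4, 92.9, 93.3, 102.2,
    181.4, 214.2, 235.5, 62.1, 62.6, 67.7, 18.4, 18.8, 18.2,
    163.1, 168.6, 174.6, 150.3, 162.6, 181.0, 1111.6, 1115.5, 1183.4)
)

# Option 1: Year as a categorical variable next to Location
# (use plain `+ Year` instead to model a linear time trend)
m_year <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female +
               Location + factor(Year), data = df)
summary(m_year)

# Option 2: one period at a time, here the 12 countries in 2005
m_2005 <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
             data = subset(df, Year == 2005))
summary(m_2005)

# Option 3 (assumes plm is installed): two-way fixed effects.
# The slopes should match those of m_year, since both remove Location and Year effects.
if (requireNamespace("plm", quietly = TRUE)) {
  fe <- plm::plm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
                 data = df, index = c("Location", "Year"),
                 model = "within", effect = "twoways")
  print(summary(fe))
}
```

With only three periods per country, the choice between a numeric and a categorical Year costs one degree of freedom, so it is worth fitting both and comparing the summaries.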
I hope this is what you were looking for. If not, feel free to let me know.