Linear regression: calculate confidence and prediction intervals with the standard errors of the fitted parameters and the correlation coefficient

Problem description

In many fields of the natural sciences it is common practice to report the results of a linear regression analysis as y = (a1 +- u(a1)) + (a2 +- u(a2)) * x, together with R2 and p, but without the original data. Here u(a1) and u(a2) are the uncertainties (standard errors) of a1 and a2. How could I use this information to calculate the confidence and prediction intervals, or at least obtain a "reasonable" estimate of them?

Let me clarify with an example. This is a dummy data set: a line of slope 1 with Gaussian noise of standard deviation 10:

set.seed(1961)
npoints <- 1e2
(x <- 1:npoints)                                   # predictor: 1, 2, ..., 100
(y <- 1:npoints + rnorm(npoints, 0, npoints/10))   # slope 1 plus Gaussian noise, sd = 10

Now I perform the linear regression:

par(mar = c(4, 4, 1, 1))
xy.model <- lm(y ~ x)
plot(x, y, pch = 16)
abline(xy.model, col = "orange", lwd = 2)
(xy.sum   <- summary(xy.model))
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -1.28106    1.94918  -0.657    0.513    
# x            1.00484    0.03351  29.987   <2e-16 ***
# Residual standard error: 9.673 on 98 degrees of freedom
# Multiple R-squared:  0.9017,  Adjusted R-squared:  0.9007 
# F-statistic: 899.2 on 1 and 98 DF,  p-value: < 2.2e-16
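
For reference, a minimal sketch of where those reported quantities live in the summary object:

coef(xy.sum)[, c("Estimate", "Std. Error")]   # a1 +- u(a1) and a2 +- u(a2)
xy.sum$r.squared                              # R2
xy.sum$sigma                                  # residual standard error
length(x)                                     # sample size n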

I calculate the confidence and prediction intervals:

x.new   <- data.frame(x = 1:npoints)
xy.conf <- predict.lm(xy.model, se.fit = TRUE, interval = "confidence", newdata = x.new)
xy.pred <- predict.lm(xy.model, se.fit = TRUE, interval = "prediction", newdata = x.new)

For example, the confidence and prediction intervals of the first point are:

xy.conf$fit[1, ] 
#        fit        lwr        upr 
# -0.2762127 -4.0867009  3.5342755
xy.pred$fit[1, ]
#       fit         lwr         upr 
# -0.2762127 -19.8462821  19.2938568
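
These numbers can also be reproduced by hand from the textbook interval formulas for simple regression; a minimal sketch, using only quantities taken from the fit above (x0 = 1 is the first new point):

n     <- length(x)
s     <- xy.sum$sigma                  # residual standard error
x.bar <- mean(x)
Sxx   <- sum((x - x.bar)^2)
x0    <- 1
fit0  <- coef(xy.model)[1] + coef(xy.model)[2] * x0
tq    <- qt(0.975, df = n - 2)
fit0 + c(-1, 1) * tq * s * sqrt(1/n + (x0 - x.bar)^2/Sxx)       # confidence interval
fit0 + c(-1, 1) * tq * s * sqrt(1 + 1/n + (x0 - x.bar)^2/Sxx)   # prediction interval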

If the regression equation is reported as y = (-1.28106 +- 1.94918) + (1.00484 +- 0.03351) * x, R2 = 0.9017, p < 0.05, but the original data are not provided, how can I reproduce (at least approximately) the values of the confidence and prediction intervals?

Tags: r, linear-regression

Solution


Without the original data you need one more piece of information: the means of the two variables. The statistics you quote are enough to reconstruct the regression line, but the confidence and prediction intervals are narrowest at (mean(x), mean(y)), so without those means you cannot compute them.
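
To see why, the usual interval formulas depend explicitly on mean(x). As a minimal sketch (assuming that, besides the reported coefficients and their standard errors, the residual standard error s, the sample size n and mean(x) are also known), the intervals of the dummy example above can be reconstructed approximately from the reported numbers alone:

a1 <- -1.28106; u.a1 <- 1.94918        # reported intercept and its standard error
a2 <-  1.00484; u.a2 <- 0.03351        # reported slope and its standard error
s  <-  9.673                           # reported residual standard error
n  <-  100                             # sample size
x.bar <- 50.5                          # mean(x): the extra piece of information needed
Sxx   <- (s/u.a2)^2                    # since u(a2) = s/sqrt(Sxx)
x0    <- 1
fit0  <- a1 + a2 * x0
tq    <- qt(0.975, df = n - 2)
fit0 + c(-1, 1) * tq * s * sqrt(1/n + (x0 - x.bar)^2/Sxx)       # ~ (-4.09, 3.53), confidence
fit0 + c(-1, 1) * tq * s * sqrt(1 + 1/n + (x0 - x.bar)^2/Sxx)   # ~ (-19.85, 19.29), prediction

With those reported numbers plus mean(x), this reproduces the intervals of the first point above to within rounding.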

A simple example may make this clearer. Start with some data:

z <- structure(list(x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6, 
5.4, 5.2, 5, 5, 5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1, 
5, 4.6, 4.8), y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1, 
3.7, 4.1, 3.4, 3.6, 4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5, 
3.5, 3.4, 3.4)), row.names = c(NA, -25L), class = "data.frame")

Compute the regression line and plot it together with the data:

z.lm <- lm(y~x, z)
z.lm
# 
# Call:
# lm(formula = y ~ x, data = z)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762  
# 
plot(y~x, z, xlim=c(0, 20), ylim=c(0, 20))
abline(z.lm)

Now create a new data set from the original data and compute its regression:

x2 <- z$x + 10                     # shift x by 10
y2 <- z$y + 10 * coef(z.lm)[2]     # shift y by the corresponding change along the fitted line
z2 <- data.frame(x = x2, y = y2)
points(y~x, z2, col="red")
z2.lm <- lm(y~x, z2)
z2.lm
# 
# Call:
# lm(formula = y ~ x, data = z2)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762  

Note that the regression coefficients are identical to those for the original data. In fact, replacing 10 with any other value produces yet another data set with the same regression results.
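
The intervals, however, are not the same, because mean(x) differs between the two data sets; a minimal sketch, evaluating both models at an arbitrary point x = 5:

x.new <- data.frame(x = 5)
predict(z.lm,  newdata = x.new, interval = "confidence")   # narrow: x = 5 is near mean(z$x)
predict(z2.lm, newdata = x.new, interval = "confidence")   # much wider: x = 5 is far from mean(z2$x)

Both calls return the same fitted value, but the interval from z2.lm is far wider, because x = 5 lies inside the original data but far outside the shifted data.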

[Figure: the original (black) and shifted (red) data, both lying on the same regression line]

