首页 > 解决方案 > 优化样条回归中的自由度

问题描述

我有两个基因表达时间过程数据集:

首先,在 4 个组的 14 个时间点测量基因表达:

df1 <- structure(list(val = c(-0.1, -0.13, -0.4, -0.3, -0.3, -0.2, -0.24, 
                            0.1, 0.2, 0.13, 0, 0.63, 0.83, 0.85, -0.07, -0.07, -0.27, -0.2, 
                            -0.2, -0.1, 0.2, 0.1, 0.07, 0.17, 0.6, 0.75, 1.1, 1.1, -0.13, 
                            -0.15, -0.26, -0.25, -0.14, 0.04, 0.2, 0.24, 0.23, 0.2, 0.1, 
                            0.73, 1, 1.3, 0, 0.06, -0.24, -0.17, -0.17, -0.04, 0.16, 0.1, 
                            0.14, 0.27, 0.34, 0.9, 0.97, 1.04), 
                    time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39), 
                    group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                        2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 
                                        3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,3L, 3L, 3L, 
                                        4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,4L), 
                                      .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("val","time", "group"), 
               row.names = c(NA, -56L), class = "data.frame")


df1$group <- factor(df1$group,levels=c("a","b","c","d"))

看起来像这样(添加loess平滑的趋势线):

library(ggplot2)
ggplot(df1,aes(x=time,y=val,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

在此处输入图像描述

其次,在相似的 14 个时间点测量基因表达,但现在来自 2 个不同的组,每个组由两种性别代表:

df2 <- structure(list(val = c(-0.23, -0.01, -0.14, -0.01, -0.21, -0.16, 
                       -0.24, -0.11, 0.02, -0.11, -0.01, -0.25, -0.47, -1.25, 0.02, 
                       -0.3, -0.02, 0.14, 0.25, -0.05, 0.15, 0.11, -0.24, -0.18, -0.39, 
                       -0.49, -0.5, -0.65, -0.06, 0.09, 0.1, 0.15, 0.08, 0.15, 0.4, 
                       0.24, 0.07, 0.08, -0.18, -0.35, -0.19, -0.81, -0.16, 0.29, -0.05, 
                       0.14, 0.14, 0.48, 0.34, 0.11, -0.07, -0.13, -0.41, -0.22, -0.54, 
                       -0.76, 0.35, 0.34, -0.06, 0.21, 0.14, 0.14, 0.25, 0.22, 0.25, 
                       0.16, 0.3, 0.44, 0.08, 0.48, 0.1, 0.16, -0.03, -0.22, 0.2, 0.01, 
                       -0.09, -0.02, -0.01, 0.06, -0.13, 0.19, 0.11, -0.04, -0.39, 0.03, 
                       -0.01, 0.09, 0.1, -0.14, -0.12, -0.1, 0.36, 0.08, 0.09, 0.09, 
                       0.42, 0.37, -0.14, 0.12, 0.09, 0.03, 0.06, -0.25, 0.2, -0.06, 
                       -0.44, 0.23, 0.03, 0.16, 0.81, 0.83),
               time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0,1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58,5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17,4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39), 
               sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
                               .Label = c("F", "M"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                                                                            2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
                                                                                          .Label = c("a", "b"), class = "factor")), .Names = c("val", "time", "sex", "group"), row.names = c(NA, -112L), class = "data.frame")
df2$sex <- ordered(df2$sex,levels=c("M","F"))

df2$group <- ordered(df2$group,levels=c("a","b"))

df2$col <- factor(paste0(df2$group,":",df2$sex))

看起来像这样(添加黄土平滑趋势线):

ggplot(df2,aes(x=time,y=val,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

在此处输入图像描述

对于df1,我想估计 对 的影响timeval调整为group

对于df2,我想估计 对 的影响time:groupval调整为sex

查看我认为使用spline regressions 的数据是合适的,所以我使用了包中的gam函数,据我所知,它优化了s 拟合数据mgcv的自由度。spline

这就是我适合的df1

mgcv1.fit <- mgcv::gam(val ~ group+s(time),data=df1)

这使:

Family: gaussian 
Link function: identity 

Formula:
val ~ group + s(time)

Estimated degrees of freedom:
7.18  total = 11.18 

GCV score: 0.01258176     

但是对于这些数据来说,7.18 的自由度似乎太大了。

对于df2

mgcv2.fit <- mgcv::gam(val ~ sex+s(time,by=group),data=df2)

这使:

Family: gaussian 
Link function: identity 

Formula:
val ~ sex + s(time, by = group)

Estimated degrees of freedom:
1.72  total = 3.72 

GCV score: 0.08522094     

我想在这种情况下,我会想象自由度会略高一些。

还有一点。绘制这两个数据集的拟合值:

df1$mgcv <- mgcv1.fit$fitted.values
ggplot(df1,aes(x=time,y=mgcv,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

在此处输入图像描述

看起来不错。

但对于df2

df2$mgcv <- mgcv2.fit$fitted.values
ggplot(df2,aes(x=time,y=mgcv,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

在此处输入图像描述

看起来它翻转了组标签。

所以我的问题是:

  1. mgcv::gam是否正确使用优化我的问题的样条自由度?
  2. 是否mgcv重新排序其中的样本fitted.values

标签: rregressionsplinegammgcv

解决方案


首先,mgcv在因子水平上做正确的事。如果你检查str(df2$sex),你会看到“M”(男性)是第一级,“F”(女性)是第二级。但似乎str(df2$col)“F”是第一个,所以在制作情节时你会得到一些错误的标签。

其次,您的第二个模型没有正确指定。

  1. s(time)当没有“by”变量或“by”是一个因素时,样条曲线处于居中约束。因此,您可以group在模型公式中将“按”变量作为单独的术语提供,以捕捉其边际效应;
  2. 由于“by”变量group是有序变量,mgcv因此对其应用对比,在构造s(time, by = group). 所以你需要提供一个单独s(time)的平滑作为基线。

您当前mgcv2.fit的模型是一个相当差的模型(不足为奇),给出了 9% 的解释偏差。但是,如果您执行以下操作,您将获得 64%。

gam(val ~ sex + s(time) + group + s(time, by = group), data = df2, method = "REML")

现在ggplot看起来是正确的(我没有改变df2$col,所以颜色仍然可能是颠倒的)。

gam默认使用“GCV.Cp”作为平滑参数选择方法。但建议使用“REML”,因为它不太容易过度拟合。


备注 1

如果“by”变量group是(无序的)因子,则它不受对比的影响。所以模型公式应该是:

val ~ sex + group + s(time, by = group)

以下内容引用自'by' 变量部分?gam.models

 If a ‘by’ variable is a ‘factor’ then it generates an indicator
 vector for each level of the factor, unless it is an ‘ordered’
 factor. In the non-ordered case, the model matrix for the smooth
 term is then replicated for each factor level, and each copy has
 its rows multiplied by the corresponding rows of its indicator
 variable. The smoothness penalties are also duplicated for each
 factor level.  In short a different smooth is generated for each
 factor level (the ‘id’ argument to ‘s’ and ‘te’ can be used to
 force all such smooths to have the same smoothing parameter).
 ‘ordered’ ‘by’ variables are handled in the same way, except that
 no smooth is generated for the first level of the ordered factor
 (see ‘b3’ example below).  This is useful for setting up
 identifiable models when the same smooth occurs more than once in
 a model, with different factor ‘by’ variables.

备注 2

我不是要评判你的模型,但“F”和“M”之间似乎存在明显的组内差异。从您的数据中,我们看到“F”和“M”在“b”组中的差异比在“a”组中更大。目前,sex两组的效果是相同的,只是垂直移动。ggplot您可以在此答案的上述内容中观察到这一点。最终由您决定模型,但如果您想对这种sex-group交互进行建模,您可以这样做

df2$sex_group <- with(df2, interaction(sex, group))  ## the new variable is unordered
test <- gam(val ~ sex + group + s(time, by = sex_group), data = df2, method = "REML")

请注意我如何向by. 创建一个辅助变量sex_group


推荐阅读