Normalization/transformation prior to PCA with Box-Cox

Problem description

Prior to calculating a PCA, I need to normalize my data. I have a matrix where the row names represent the disease group (0 is control, 1 is ulcerative colitis and 2 is Crohn's disease). The rest of the data are gene expression values.

I have tried a log transformation, which did not normalize the data (as confirmed by plotting histograms for some of the columns and by the Anderson-Darling test).

Update: I am trying the Box-Cox transformation. I am not sure how to convert my matrix of values into a linear model object before using the call below, which was suggested in the comments (the lm would be built from my data). I understand that the lm formula has to be of the form response ~ terms, where terms specifies a linear predictor for the response.

    bc <- boxcox(Gene1 ~ 1, lambda = seq(-2, 2))

I am not sure whether I need to change the terms to disease (once a disease column has been added to the data).

    bc <- boxcox(Gene1 ~ disease, lambda = seq(-2, 2))

    best.lam <- bc$x[which(bc$y == max(bc$y))]

There are 27 rows and 13 columns in the sample. How can I easily apply the transformation to each column in the data set?

Firstly, I am unsure how I would linearise each column quickly. The ?lm help page states that if the response variable is a matrix, you can use model.matrix to fit a linear model to the individual columns before calculating boxcox. However, I cannot find any examples of this online or in the R help.

Secondly, I am unsure how I would then transform the y values of each column with the corresponding lambda quickly (potentially with a for loop or one of the apply functions).

Please find my new data below. The real data set contains over 600 genes and 190 rows. Any further help would be appreciated.

     structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
     7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
     1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
     0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
     8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
     0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
     0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
     0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
     0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
     0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
     0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
     0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
     0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
     0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
     0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
     0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
     0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
     0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
     0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
     0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
     0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
     0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
     0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
     0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
     0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
     0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
     0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
     0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
     0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
     0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
     0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
     0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
     4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
     4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
     3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
     7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
     0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
     0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
     0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
     0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
     0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
     0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
     0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
     0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
     0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
     0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
     0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
     6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
     2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
     0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
     0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
     7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
     0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
     5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
     0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
     4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
     4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
     0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
     0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
     0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
     0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
     0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
     0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
     0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
     0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
     0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
     0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
     ), .Dim = c(27L, 13L), .Dimnames = list(c("2", "0", "0", "0", 
    "1", "0", "0", "1", "1", "1", "2", "0", "0", "1", "2", "2", "1", 
    "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"), c("Gene1", 
    "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", 
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))

Tags: r, pca

Solution


The caret package might make this a lot easier.

First, create your data structure:

data <- structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
            7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
            1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
            0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
            8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
            0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
            0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
            0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
            0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
            0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
            0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
            0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
            0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
            0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
            0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
            0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
            0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
            0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
            0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
            0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
            0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
            0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
            0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
            0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
            0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
            0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
            0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
            0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
            0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
            0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
            0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
            0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
            4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
            4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
            3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
            7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
            0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
            0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
            0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
            0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
            0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
            0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
            0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
            0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
            0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
            0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
            0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
            6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
            2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
            0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
            0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
            7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
            0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
            5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
            0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
            4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
            4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
            0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
            0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
            0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
            0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
            0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
            0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
            0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
            0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
            0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
            0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
), .Dim = c(27L, 13L), .Dimnames = list(
  c("2", "0", "0", "0", "1", "0", "0", "1", "1", "1", "2", "0", "0", "1",
    "2", "2", "1", "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"),
  c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8",
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))

Then transform your data:

library(caret)

# Estimate a Box-Cox transformation for each column
preProcessValues <- preProcess(data, method = "BoxCox")

# Apply the estimated transformations to the data
dataBC <- predict(preProcessValues, data)
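
If you would rather stay close to the boxcox() call from the question, the per-column loop can be sketched with MASS alone. This is only a minimal sketch and not part of the original answer: the helper names estimate_lambda and apply_boxcox, the toy matrix and the grid step of 0.1 are all assumptions made for illustration. Box-Cox requires strictly positive values, and at least one column of the sample (Gene7, by my count) contains exact zeros, so the sketch leaves such columns untransformed rather than failing. As far as I know, caret's BoxCoxTrans likewise does not estimate lambda for columns with non-positive values, but check its help page before relying on that.

```r
library(MASS)  # provides boxcox()

# Hypothetical helper: estimate lambda for one column by maximising the
# profile log-likelihood over a grid. Returns NA if the column has
# missing or non-positive values, because Box-Cox needs y > 0.
estimate_lambda <- function(y, grid = seq(-2, 2, by = 0.1)) {
  if (anyNA(y) || any(y <= 0)) return(NA_real_)
  bc <- boxcox(y ~ 1, lambda = grid, plotit = FALSE)
  bc$x[which.max(bc$y)]
}

# Hypothetical helper: apply the Box-Cox formula for a given lambda.
# Columns without a usable lambda are returned unchanged.
apply_boxcox <- function(y, lambda) {
  if (is.na(lambda)) return(y)
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Toy matrix standing in for the expression data (illustration only)
set.seed(1)
toy <- cbind(GeneA = rlnorm(30, meanlog = -6),
             GeneB = rgamma(30, shape = 2),
             GeneC = c(0, rlnorm(29)))   # contains a zero

# One lambda per column, then transform each column with its own lambda
lambdas <- apply(toy, 2, estimate_lambda)
transformed <- sapply(colnames(toy), function(g) {
  apply_boxcox(toy[, g], lambdas[[g]])
})

# PCA on the transformed values, centred and scaled to unit variance
pca <- prcomp(transformed, center = TRUE, scale. = TRUE)
```

With the real matrix, replace toy with data. Inspecting lambdas first shows which genes could not be transformed (NA), which is worth knowing before interpreting the PCA. A small offset is sometimes added to columns with zeros, but that changes the data and is a separate decision.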
