首页 > 解决方案 > 为什么 ggplot 制作的箱线图与 Prsim 等其他软件制作的箱线图不同?

问题描述

我使用以下代码用 ggplot 制作了箱线图。

ggplot(data = df.08.long,
      aes(x = TMT_signals, y = as.numeric(TMT_Intensities), fill = `probe.Mod.or.not(Y/N)`)) +
  geom_boxplot() +
  ylim(0, 2.5e3) +
  theme_classic() +
  theme(axis.title=element_text(size=8),
        axis.text=element_text(size=10),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

这是我的数据框。

 structure(list(Scan.number = c(10017, 10017, 10017, 10017, 10017, 
    10017, 10017, 10017, 10017, 13240, 13240, 13240, 13240, 13240, 
    13240, 13240, 13240, 13240, 27592, 27592), Sequence = c("AAAYSAQVQPVDGATR", 
    "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", 
    "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", 
    "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", 
    "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", 
    "AAAYSAQVQPVDGATR", "AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK", 
    "AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK"), Length = c(16L, 16L, 
    16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
    16L, 16L, 16L, 37L, 37L), Missed.cleavages = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L
    ), Modified.sequence = c("_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", 
    "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", 
    "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", 
    "_AAAYSAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", 
    "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", 
    "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", 
    "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", 
    "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK_", 
    "_AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK_"), probe_TMT6.Probabilities = c("", 
    "", "", "", "", "", "", "", "", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", 
    "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", 
    "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", 
    "AAAY(1)SAQVQPVDGATR", "", ""), `Uniprot ID` = c("Q9H7E9", "Q9H7E9", 
    "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", 
    "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", 
    "Q9H7E9", "Q9H7E9", "Q00610", "Q00610"), `probe.Mod.or.not(Y/N)` = c("N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", 
    "Y", "Y", "Y", "Y", "N", "N"), `kinase.or.not(Y/N)` = c("N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N"), Gene.Names = c("C8orf33", "C8orf33", 
    "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", 
    "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", 
    "C8orf33", "C8orf33", "C8orf33", "C8orf33", "CLTC", "CLTC"), 
        Charge = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 
        4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), m.z = c(802.90499, 802.90499, 
        802.90499, 802.90499, 802.90499, 802.90499, 802.90499, 802.90499, 
        802.90499, 647.57262, 647.57262, 647.57262, 647.57262, 647.57262, 
        647.57262, 647.57262, 647.57262, 647.57262, 1107.5251, 1107.5251
        ), Score = c(86.313, 86.313, 86.313, 86.313, 86.313, 86.313, 
        86.313, 86.313, 86.313, 41.695, 41.695, 41.695, 41.695, 41.695, 
        41.695, 41.695, 41.695, 41.695, 28.532, 28.532), Retention.time = c(27.774, 
        27.774, 27.774, 27.774, 27.774, 27.774, 27.774, 27.774, 27.774, 
        35.978, 35.978, 35.978, 35.978, 35.978, 35.978, 35.978, 35.978, 
        35.978, 72.556, 72.556), Precursor.Intensity = c(460631.45703125, 
        460631.45703125, 460631.45703125, 460631.45703125, 460631.45703125, 
        460631.45703125, 460631.45703125, 460631.45703125, 460631.45703125, 
        472201.625, 472201.625, 472201.625, 472201.625, 472201.625, 
        472201.625, 472201.625, 472201.625, 472201.625, 388790.9296875, 
        388790.9296875), Localization.prob = c(NaN, NaN, NaN, NaN, 
        NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 1, 1, 1, 1, 1, 1, NaN, 
        NaN), probe_TMT6.site.IDs = c("", "", "", "", "", "", "", 
        "", "", "308", "308", "308", "308", "308", "308", "308", 
        "308", "308", "", ""), TMT_signals = c("TMT126", "TMT127N", 
        "TMT128N", "TMT128C", "TMT129N", "TMT129C", "TMT130N", "TMT130C", 
        "TMT131", "TMT126", "TMT127N", "TMT128N", "TMT128C", "TMT129N", 
        "TMT129C", "TMT130N", "TMT130C", "TMT131", "TMT126", "TMT127N"
        ), TMT_Intensities = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1824.667, 
        3470.869, 1691.413, 2367.219, 1895.059, 1712.427, 1529.349, 
        1617.825, 1677.578, 0, 0)), row.names = c(NA, -20L), class = c("tbl_df", 
    "tbl", "data.frame"))

这是我从 ggplot 得到的情节。

在此处输入图像描述

这是我从 Prism 得到的箱线图。

在此处输入图像描述

看起来 ggplot 和 Prism 计算的中值不同。例如,126 Yggplot 中的中值约为 1500,而126 YPrism 中的中值约为 5000。

我检查了我的数据集,这些列中没有 NA 值。我自己计算的中值与 Prism 中的值相匹配。

有谁知道发生了什么?

谢谢!

标签: rggplot2boxplot

解决方案


这里的问题在于ylim,与 一起scale_y_continuous(limits = ...),它的行为让一些用户感到意外。如帮助中所述?ylim

这是为各个比例提供限制参数的捷径。默认情况下,任何超出指定限制的值都将替换为 NA。请注意,这将删除超出限制的数据,这可能会产生意想不到的结果。要更改 x 或 y 轴限制而不丢弃数据观察,请参阅 coord_cartesian()。

这在汇总操作期间会产生特别令人困惑的结果,例如您的 for geom_boxplot(),因为它不会出错,它只会产生不同的结果和您可能会错过或忽略的警告。

例如,在下面的图表中,我们希望箱线图的范围从 0 到 100,但只能得到零值。这是因为在执行任何汇总计算之前,使用ylimorscale_y_continuous(limits = ...)都会过滤掉范围之外的数据。

df1 <- data.frame(x = 1, y = c(0, 100))
ggplot(df1, aes(x, y)) +
   geom_boxplot() +
   ylim(c(0,90))

在此处输入图像描述

有一个警告,但很容易错过或没有意识到它的含义:

警告消息:删除了 1 行包含非有限值 (stat_boxplot)。

据说其中一个结果是 NA(在上述步骤之后)并且在统计计算中被忽略。

要获得预期的结果,您通常应该使用coord_cartesian(ylim = ...)来控制查看窗口。

ggplot(df1, aes(x, y)) +
   geom_boxplot() +
   coord_cartesian(ylim = c(0,90))

在此处输入图像描述


推荐阅读