r - 为什么 ggplot 制作的箱线图与 Prsim 等其他软件制作的箱线图不同?
问题描述
我使用以下代码用 ggplot 制作了箱线图。
ggplot(data = df.08.long,
aes(x = TMT_signals, y = as.numeric(TMT_Intensities), fill = `probe.Mod.or.not(Y/N)`)) +
geom_boxplot() +
ylim(0, 2.5e3) +
theme_classic() +
theme(axis.title=element_text(size=8),
axis.text=element_text(size=10),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
这是我的数据框。
structure(list(Scan.number = c(10017, 10017, 10017, 10017, 10017,
10017, 10017, 10017, 10017, 13240, 13240, 13240, 13240, 13240,
13240, 13240, 13240, 13240, 27592, 27592), Sequence = c("AAAYSAQVQPVDGATR",
"AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR",
"AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR",
"AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR",
"AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR", "AAAYSAQVQPVDGATR",
"AAAYSAQVQPVDGATR", "AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK",
"AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK"), Length = c(16L, 16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L,
16L, 16L, 16L, 37L, 37L), Missed.cleavages = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L
), Modified.sequence = c("_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_",
"_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_",
"_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_", "_AAAYSAQVQPVDGATR_",
"_AAAYSAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_",
"_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_",
"_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_",
"_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAAY(XO44_TMT6)SAQVQPVDGATR_",
"_AAAY(XO44_TMT6)SAQVQPVDGATR_", "_AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK_",
"_AAEQAHLWAELVFLYDKYEEYDNAIITMMNHPTDAWK_"), probe_TMT6.Probabilities = c("",
"", "", "", "", "", "", "", "", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR",
"AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR",
"AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR", "AAAY(1)SAQVQPVDGATR",
"AAAY(1)SAQVQPVDGATR", "", ""), `Uniprot ID` = c("Q9H7E9", "Q9H7E9",
"Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9",
"Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9", "Q9H7E9",
"Q9H7E9", "Q9H7E9", "Q00610", "Q00610"), `probe.Mod.or.not(Y/N)` = c("N",
"N", "N", "N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y",
"Y", "Y", "Y", "Y", "N", "N"), `kinase.or.not(Y/N)` = c("N",
"N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",
"N", "N", "N", "N", "N", "N"), Gene.Names = c("C8orf33", "C8orf33",
"C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33",
"C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33", "C8orf33",
"C8orf33", "C8orf33", "C8orf33", "C8orf33", "CLTC", "CLTC"),
Charge = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), m.z = c(802.90499, 802.90499,
802.90499, 802.90499, 802.90499, 802.90499, 802.90499, 802.90499,
802.90499, 647.57262, 647.57262, 647.57262, 647.57262, 647.57262,
647.57262, 647.57262, 647.57262, 647.57262, 1107.5251, 1107.5251
), Score = c(86.313, 86.313, 86.313, 86.313, 86.313, 86.313,
86.313, 86.313, 86.313, 41.695, 41.695, 41.695, 41.695, 41.695,
41.695, 41.695, 41.695, 41.695, 28.532, 28.532), Retention.time = c(27.774,
27.774, 27.774, 27.774, 27.774, 27.774, 27.774, 27.774, 27.774,
35.978, 35.978, 35.978, 35.978, 35.978, 35.978, 35.978, 35.978,
35.978, 72.556, 72.556), Precursor.Intensity = c(460631.45703125,
460631.45703125, 460631.45703125, 460631.45703125, 460631.45703125,
460631.45703125, 460631.45703125, 460631.45703125, 460631.45703125,
472201.625, 472201.625, 472201.625, 472201.625, 472201.625,
472201.625, 472201.625, 472201.625, 472201.625, 388790.9296875,
388790.9296875), Localization.prob = c(NaN, NaN, NaN, NaN,
NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 1, 1, 1, 1, 1, 1, NaN,
NaN), probe_TMT6.site.IDs = c("", "", "", "", "", "", "",
"", "", "308", "308", "308", "308", "308", "308", "308",
"308", "308", "", ""), TMT_signals = c("TMT126", "TMT127N",
"TMT128N", "TMT128C", "TMT129N", "TMT129C", "TMT130N", "TMT130C",
"TMT131", "TMT126", "TMT127N", "TMT128N", "TMT128C", "TMT129N",
"TMT129C", "TMT130N", "TMT130C", "TMT131", "TMT126", "TMT127N"
), TMT_Intensities = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1824.667,
3470.869, 1691.413, 2367.219, 1895.059, 1712.427, 1529.349,
1617.825, 1677.578, 0, 0)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
这是我从 ggplot 得到的情节。
这是我从 Prism 得到的箱线图。
看起来 ggplot 和 Prism 计算的中值不同。例如,126 Y
ggplot 中的中值约为 1500,而126 Y
Prism 中的中值约为 5000。
我检查了我的数据集,这些列中没有 NA 值。我自己计算的中值与 Prism 中的值相匹配。
有谁知道发生了什么?
谢谢!
解决方案
这里的问题在于ylim
,与 一起scale_y_continuous(limits = ...)
,它的行为让一些用户感到意外。如帮助中所述?ylim
,
这是为各个比例提供限制参数的捷径。默认情况下,任何超出指定限制的值都将替换为 NA。请注意,这将删除超出限制的数据,这可能会产生意想不到的结果。要更改 x 或 y 轴限制而不丢弃数据观察,请参阅 coord_cartesian()。
这在汇总操作期间会产生特别令人困惑的结果,例如您的 for geom_boxplot()
,因为它不会出错,它只会产生不同的结果和您可能会错过或忽略的警告。
例如,在下面的图表中,我们希望箱线图的范围从 0 到 100,但只能得到零值。这是因为在执行任何汇总计算之前,使用ylim
orscale_y_continuous(limits = ...)
都会过滤掉范围之外的数据。
df1 <- data.frame(x = 1, y = c(0, 100))
ggplot(df1, aes(x, y)) +
geom_boxplot() +
ylim(c(0,90))
有一个警告,但很容易错过或没有意识到它的含义:
警告消息:删除了 1 行包含非有限值 (stat_boxplot)。
据说其中一个结果是 NA(在上述步骤之后)并且在统计计算中被忽略。
要获得预期的结果,您通常应该使用coord_cartesian(ylim = ...)
来控制查看窗口。
ggplot(df1, aes(x, y)) +
geom_boxplot() +
coord_cartesian(ylim = c(0,90))
推荐阅读
- android - WebResourceResponse 不显示视频
- java - 在句中打印事件严重性:log4j2
- nix - Nix Flake:如何接受 Android SDK 许可证
- reactjs - 当我转到我的域名时,它会转到我的应用程序,但只在地址栏中显示 IP 地址
- angular - Angular 7 i18n 支持翻译打字稿文件
- xcode - 如何获取 xcode-organizer 性能指标的原始数据?
- c# - .NET Core 控制台应用程序无法识别 Linux 上的输入参数
- javascript - 在回显的 php 文件中调用 onClick()
- django - 图像未传递给视图(cropperjs、django、ajax)
- spring - Spring Webflux - 延迟初始化的内存缓存