
Question

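R's lowess function seems to produce some strange results when the data frame contains duplicate points. In the data frame below, which holds the high jump results from the women's heptathlon at the 2016 Olympics, two athletes have the same score.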

# Here is the data frame I'm working with
hept <- structure(list(rank = 1:29, lastname = c("Thiam", "Ennis-Hill", 
"Eaton", "Ikauniece-Admidina", "Schafer", "Johnson-Thompson", 
"Rodriguez", "Zsivoczky-Farkas", "Oeser", "Vetter", "Ida", "Nwaba", 
"Broersen", "Rath", "Aguilar", "Krizsan", "Williams", "Miller-Koch", 
"Visser", "Jones", "Dadic", "Klucinova", "Chefer", "Cachova", 
"Kasyanova", "Felix", "Yfantidou", "Fodorova", "Osazuwa"), hj = c(1.98, 
1.89, 1.86, 1.77, 1.83, 1.98, 1.86, 1.86, 1.86, 1.77, 1.77, 1.83, 
1.77, 1.74, 1.74, 1.77, 1.83, 1.8, 1.68, 1.89, 1.77, 1.8, 1.68, 
1.77, 1.77, 1.68, 1.65, 1.8, 1.77), pts_hj = c(1211L, 1093L, 
1054L, 941L, 1016L, 1211L, 1054L, 1054L, 1054L, 941L, 941L, 1016L, 
941L, 903L, 903L, 941L, 1016L, 978L, 830L, 1093L, 941L, 978L, 
830L, 941L, 941L, 830L, 795L, 978L, 941L), dvvb_hj = c(2.26375883781343, 
1.13834730130046, 0.763210122462812, -0.36220141405015, 0.388072943625158, 
2.26375883781343, 0.763210122462812, 0.763210122462812, 0.763210122462812, 
-0.36220141405015, -0.36220141405015, 0.388072943625158, -0.36220141405015, 
-0.737338592887804, -0.737338592887804, -0.36220141405015, 0.388072943625158, 
0.0129357647875044, -1.48761295056311, 1.13834730130046, -0.36220141405015, 
0.0129357647875044, -1.48761295056311, -0.36220141405015, -0.36220141405015, 
-1.48761295056311, -1.86275012940077, 0.0129357647875044, -0.36220141405015
)), class = "data.frame", row.names = c(NA, -29L))

hept$hj and hept$pts_hj lie almost exactly on a straight line. Plotting a lowess curve nevertheless gives a curve with a sharp bend.

plot(hept$hj, hept$pts_hj)
lines(lowess(hept$hj, hept$pts_hj))

[Figure: lowess curve with an unexpected sharp bend]

Changing the "smoother span" gives the expected figure:

plot(hept$hj, hept$pts_hj)
lines(lowess(hept$hj, hept$pts_hj, f = 1/3))

[Figure: nearly linear lowess curve]

Presumably this has something to do with the duplicate points, because neither

lines(lowess(jitter(hept$hj), jitter(hept$pts_hj)))

nor

lines(lowess(hept$hj[hept$rank != 1] ~ hept$pts_hj[hept$rank != 1]))

produces the bend. Am I doing something wrong?

Tags: r

Answer


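The documentation for lowess says "Normally a local linear polynomial fit is used, but under some circumstances (see the file) a local constant fit can be used." The file it refers to is https://github.com/wch/r-source/blob/trunk/src/library/stats/src/lowess.doc. My guess is that, because your data set is almost perfectly linear, some instability in the algorithm makes it switch to a local constant fit for the points on the right.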
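One way to probe that guess is to count how many observations share each x value and to see where the lowess output departs from an ordinary straight-line fit. The following is only a diagnostic sketch: it rebuilds hept with just the hj and pts_hj columns from the question so that it runs on its own, and the names ties, lw, lin and gap are illustrative, not part of any API.

```r
# Only the two columns needed here, copied from the question's data frame.
hept <- data.frame(
  hj = c(1.98, 1.89, 1.86, 1.77, 1.83, 1.98, 1.86, 1.86, 1.86, 1.77,
         1.77, 1.83, 1.77, 1.74, 1.74, 1.77, 1.83, 1.8, 1.68, 1.89,
         1.77, 1.8, 1.68, 1.77, 1.77, 1.68, 1.65, 1.8, 1.77),
  pts_hj = c(1211L, 1093L, 1054L, 941L, 1016L, 1211L, 1054L, 1054L,
             1054L, 941L, 941L, 1016L, 941L, 903L, 903L, 941L, 1016L,
             978L, 830L, 1093L, 941L, 978L, 830L, 941L, 941L, 830L,
             795L, 978L, 941L)
)

# How many athletes cleared each height (tied x values).
ties <- table(hept$hj)
print(ties)

# lowess() returns the x values sorted, with the smoothed y alongside.
lw <- lowess(hept$hj, hept$pts_hj)

# Reference: an ordinary least-squares line through the same data.
lin <- lm(pts_hj ~ hj, data = hept)

# Difference between the smooth and the line at each sorted x value;
# large values show where the lowess curve leaves the linear trend.
gap <- lw$y - predict(lin, newdata = data.frame(hj = lw$x))
print(data.frame(hj = lw$x, lowess = lw$y, gap = gap))
```

If the guess is right, the largest entries of gap should sit at the upper end of hj, but this only locates the bend; it does not show which branch of the algorithm was taken.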

Using the more modern loess() code avoids the problem:

loess(pts_hj ~ hj, data=hept)
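The call above only fits the model; unlike lowess(), loess() does not return a list of x and y values that can be passed straight to lines(). A minimal sketch of getting a drawable curve out of it, assuming the default loess() settings are acceptable and rebuilding hept with only the two columns needed (xs and smoothed are illustrative names):

```r
# Only the two columns needed here, copied from the question's data frame.
hept <- data.frame(
  hj = c(1.98, 1.89, 1.86, 1.77, 1.83, 1.98, 1.86, 1.86, 1.86, 1.77,
         1.77, 1.83, 1.77, 1.74, 1.74, 1.77, 1.83, 1.8, 1.68, 1.89,
         1.77, 1.8, 1.68, 1.77, 1.77, 1.68, 1.65, 1.8, 1.77),
  pts_hj = c(1211L, 1093L, 1054L, 941L, 1016L, 1211L, 1054L, 1054L,
             1054L, 941L, 941L, 1016L, 941L, 903L, 903L, 941L, 1016L,
             978L, 830L, 1093L, 941L, 978L, 830L, 941L, 941L, 830L,
             795L, 978L, 941L)
)

# Fit with loess() defaults, as in the answer.
fit <- loess(pts_hj ~ hj, data = hept)

# Evaluate the fit once per distinct height, in increasing order,
# so the result can be drawn as a single line.
xs <- sort(unique(hept$hj))
smoothed <- predict(fit, newdata = data.frame(hj = xs))
```

After plot(hept$hj, hept$pts_hj), calling lines(xs, smoothed) adds the loess curve to the scatter plot in the same way the lowess calls in the question do.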
