r - spread() with non-unique values placed into new columns
Question
I have some data that looks like this (code to create it is at the end):
#> artist album year source id
#> 1 Beatles Sgt. Pepper's 1967 amazon B0025KVLTM
#> 2 Beatles Sgt. Pepper's 1967 spotify 6QaVfG1pHYl1z15ZxkvVDW
#> 3 Beatles Sgt. Pepper's 1967 amazon B06WGVMLJY
#> 4 Rolling Stones Sticky Fingers 1971 spotify 29m6DinzdaD0OPqWKGyMdz
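I would like to fix the "id" column, which includes ids from multiple sources (as indicated in the "source" column).
This should be a straightforward spread(), but the complication is that we sometimes get duplicate ids from the exact same source: see rows 1 and 3 above.
Is there a simple way to spread() and put the duplicate ids into new columns?
My desired result would be: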
#> artist album year source amazon_id amazon_id_2
#> 1 Beatles Sgt. Pepper's 1967 amazon B0025KVLTM B06WGVMLJY
#> 2 Rolling Stones Sticky Fingers 1971 spotify <NA> <NA>
#> spotify
#> 1 6QaVfG1pHYl1z15ZxkvVDW
#> 2 29m6DinzdaD0OPqWKGyMdz
The code below creates the sample data:
df <- data.frame(stringsAsFactors = FALSE,
  artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
  album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's",
            "Sticky Fingers"),
  year = c(1967, 1967, 1967, 1971),
  source = c("amazon", "spotify", "amazon", "spotify"),
  id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
         "29m6DinzdaD0OPqWKGyMdz")
)
df
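Answer
This can be done with dcast() from data.table in one (looong) line of code, which I nevertheless find quite elegant. rowid(artist, source) numbers the repeated occurrences of each artist/source pair, and pasting that number onto source gives every duplicate id its own column.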
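library(data.table)
# dcast() from data.table expects a data.table, so convert the data.frame first
dcast(as.data.table(df), artist + album + year ~ paste(source, rowid(artist, source), sep = "_"),
      value.var = "id")  # name the value column explicitly instead of letting dcast() guess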
# artist album year amazon_1 amazon_2 spotify_1
#1 Beatles Sgt. Pepper's 1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
#2 Rolling Stones Sticky Fingers 1971 <NA> <NA> 29m6DinzdaD0OPqWKGyMdz
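Since the question asks about spread(), here is a rough tidyverse sketch of the same idea. It is not part of the answer above. It assumes dplyr and tidyr (version 1.0.0 or later, for pivot_wider()) are installed. The helper column n is a made-up name used only for illustration; it does the job that rowid() does in the data.table version.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(stringsAsFactors = FALSE,
  artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
  album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's",
            "Sticky Fingers"),
  year = c(1967, 1967, 1967, 1971),
  source = c("amazon", "spotify", "amazon", "spotify"),
  id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
         "29m6DinzdaD0OPqWKGyMdz")
)

wide <- df %>%
  # Number repeated ids within each album/source combination
  # (n is a hypothetical helper column, playing the role of rowid())
  group_by(artist, album, year, source) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  # Column names are built as <source>_<n>, e.g. amazon_1, amazon_2
  pivot_wider(names_from = c(source, n), values_from = id)

wide
```

The result should have one row per album, with columns amazon_1, spotify_1 and amazon_2 (in order of first appearance) and NA where a source has no id for that album.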