Creating one-hot encoded columns while keeping the other features

Question

I have the following data:

dataset <- structure(list(id = structure(c(2L, 3L, 1L, 3L, 1L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "B", "A", "A", 
"A"), cat2 = c("item 1", "item 1", "item 2", "item 5", "item 3", 
"item 28"), cat3 = c("theme 2", "theme 2", "theme 1", "theme 4", 
"theme 10", "theme 40")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

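I want to create a kind of model matrix containing one-hot encoded columns derived from the columns cat2 and cat3. My output would then look like this: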

structure(list(id = structure(c(1L, 1L, 2L, 3L, 3L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "A", "A", "B", 
"A"), `item 1` = c(0, 0, 1, 0, 1, 0), `item 2` = c(0, 1, 0, 0, 
0, 0), `item 28` = c(0, 0, 0, 0, 0, 1), `item 3` = c(1, 0, 0, 
0, 0, 0), `item 5` = c(0, 0, 0, 1, 0, 0), `theme 1` = c(0, 1, 
0, 0, 0, 0), `theme 10` = c(1, 0, 0, 0, 0, 0), `theme 2` = c(0, 
0, 1, 0, 1, 0), `theme 4` = c(0, 0, 0, 1, 0, 0), `theme 40` = c(0, 
0, 0, 0, 0, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

However, this dataset does not contain my independent variable, and I want to keep the id and cat1 columns. How can I do this?

Tags: r, one-hot-encoding

Solution


You can call dcast twice, once for cat2 and once for cat3, and merge the two results on id and cat1.

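library(reshape2)
# Spread cat2 and cat3 separately into count columns, keeping id and cat1 as keys,
# then join the two wide tables on those keys.
merge(dcast(dataset, id + cat1 ~ cat2, fun.aggregate = length),
      dcast(dataset, id + cat1 ~ cat3, fun.aggregate = length),
      by = c("id", "cat1"))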
#      id cat1 item 1 item 2 item 28 item 3 item 5 theme 1 theme 10 theme 2 theme 4 theme 40
#1 215101    A      0      0       0      1      0       0        1       0       0        0
#2 215101    B      0      1       0      0      0       1        0       0       0        0
#3 215559    A      1      0       0      0      0       0        0       1       0        0
#4 216566    A      0      0       0      0      1       0        0       0       1        0
#5 216566    B      1      0       0      0      0       0        0       1       0        0
#6 225678    A      0      0       1      0      0       0        0       0       0        1
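
Note that fun.aggregate = length counts rows, so a combination of id and cat1 that occurs more than once with the same item yields a count above 1 rather than a strict 0/1 flag. The following is a minimal sketch of clamping such counts to indicators; the data frame toy and its values are made up for illustration and are not part of the question's data.

```r
library(reshape2)

# Made-up data: the pair (1, "A") occurs twice with "item 1"
toy <- data.frame(
  id   = c(1, 1, 2),
  cat1 = c("A", "A", "B"),
  cat2 = c("item 1", "item 1", "item 2"),
  stringsAsFactors = FALSE
)

# Counts per (id, cat1) and item; value.var is set explicitly to avoid the guessing message
counts <- dcast(toy, id + cat1 ~ cat2, fun.aggregate = length, value.var = "cat2")

# Turn every count column into a 0/1 indicator, leaving the key columns alone
flags <- counts
item_cols <- setdiff(names(flags), c("id", "cat1"))
flags[item_cols] <- lapply(flags[item_cols], function(x) as.integer(x > 0))
```

In the question's data every combination of id and cat1 is unique, so this extra step would change nothing there.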

If you have more than two variables to spread, you may want to melt the data first. That saves some typing.

# melt stacks cat2 and cat3 into one value column; dcast then counts each value per id and cat1
dcast(melt(dataset, id.vars = c("id", "cat1")), id + cat1 ~ value, fun.aggregate = length)
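
If reshape2 is not available, a similar result can be sketched in base R. This is only an illustration under stated assumptions: the helper one_hot and the small data frame df are hypothetical and not part of the original answer. Unlike dcast, this version keeps one output row per input row and does not aggregate by id and cat1.

```r
# Made-up stand-in data with the same column layout as the question's dataset
df <- data.frame(
  id   = c("a", "b", "a"),
  cat1 = c("A", "B", "B"),
  cat2 = c("item 1", "item 1", "item 2"),
  cat3 = c("theme 2", "theme 2", "theme 1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per distinct value of a character vector
one_hot <- function(x) {
  lv <- sort(unique(x))
  m <- outer(x, lv, "==") * 1L  # logical matrix converted to integer 0/1
  colnames(m) <- lv
  m
}

# Encode every listed column, then bind the indicators next to the kept columns
cols <- c("cat2", "cat3")
encoded <- cbind(df[c("id", "cat1")],
                 do.call(cbind, unname(lapply(df[cols], one_hot))))
```

Adding another categorical column only requires extending cols, which mirrors the typing saved by the melt approach.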
