How do I create binary variables for multiple individuals listed in a single cell?

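Problem description

I'm working with a somewhat awkward data frame. The dataset is an ordinary tabular .csv file and is mostly standard, except that in one column each observation is a list of individuals. Here is what it looks like: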

Layer       Grade              Players
Top           A         NY 08; NY 27; NY 80
Bottom        D         MA 27; MA 45; MA 65
Middle        B         NY 09; MA 48; NY 66
...

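As you can see, the data frame is standard except for the Players column. How can I add one column per player that gives a binary indicator of whether that player was in the game? I would like the data frame above to become something like this: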

Layer       Grade       Players                       NYAL 08     NYAL 27     NYAL 80    MAAC 27
Top           A         NYAL 08; NYAL 27; NYAL 80       1           1           1          0
Bottom        D         MAAC 27; MAAC 45; MAAC 65       0           0           0          1
Middle        B         NYAL 08; MAAC 48; NYAL 66       1           0           0          0
...

And so on.

Thanks!

Tags: r, dataframe

Solution

We can use cSplit_e from splitstackshape. It produces the output compactly in a single line:

library(splitstackshape)
out <- cSplit_e(df1, 'Players', sep=";", type = "character", fill = 0)
out

-output

#Layer Grade             Players Players_MA 27 Players_MA 45 Players_MA 48 Players_MA 65 Players_NY 08 Players_NY 09 Players_NY 27
#1    Top     A NY 08; NY 27; NY 80             0             0             0             0             1             0             1
#2 Bottom     D MA 27; MA 45; MA 65             1             1             0             1             0             0             0
#3 Middle     B NY 09; MA 48; NY 66             0             0             1             0             0             1             0
#  Players_NY 66 Players_NY 80
#1             0             1
#2             0             0
#3             1             0

If we want to remove the prefix from the column names:

names(out)[-(1:3)] <- sub('Players_', '', names(out)[-(1:3)])

Another option is mtabulate from the qdapTools package:

library(qdapTools)
cbind(df1, mtabulate(strsplit(df1$Players, ";\\s+")))

-output

#   Layer Grade             Players MA 27 MA 45 MA 48 MA 65 NY 08 NY 09 NY 27 NY 66 NY 80
#1    Top     A NY 08; NY 27; NY 80     0     0     0     0     1     0     1     0     1
#2 Bottom     D MA 27; MA 45; MA 65     1     1     0     1     0     0     0     0     0
#3 Middle     B NY 09; MA 48; NY 66     0     0     1     0     0     1     0     1     0

data

df1 <- structure(list(Layer = c("Top", "Bottom", "Middle"), Grade = c("A", 
"D", "B"), Players = c("NY 08; NY 27; NY 80", "MA 27; MA 45; MA 65", 
"NY 09; MA 48; NY 66")), class = "data.frame", row.names = c(NA, 
-3L))
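
For readers who would rather not install a package, the same indicator columns can be built in base R. This is only a minimal sketch, not part of the original answer: the names player_list, all_players, indicators and out_base are made up for illustration, and it assumes every cell uses ";" as the separator.

```r
# Same example data as df1 above, repeated so the sketch runs on its own
df1 <- data.frame(
  Layer = c("Top", "Bottom", "Middle"),
  Grade = c("A", "D", "B"),
  Players = c("NY 08; NY 27; NY 80", "MA 27; MA 45; MA 65",
              "NY 09; MA 48; NY 66"),
  stringsAsFactors = FALSE
)

# Split each cell on ";" and strip the surrounding whitespace
player_list <- lapply(strsplit(df1$Players, ";", fixed = TRUE), trimws)

# Every distinct player, sorted, becomes one indicator column
all_players <- sort(unique(unlist(player_list)))

# One row per observation: 1 if the player appears in that row, else 0
indicators <- do.call(rbind, lapply(player_list, function(p) {
  as.integer(all_players %in% p)
}))
colnames(indicators) <- all_players

# cbind on a data frame keeps column names such as "NY 08" unchanged
out_base <- cbind(df1, indicators)
out_base
```

The result should have the same layout as the mtabulate output, with one 0/1 column per distinct player appended after Players.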
