Binary encoding (similar to one-hot encoding), but allowing multiple values in a single column and row

Problem description

Imagine I have data in a DataFrame with an ID and three possible labels, like this:

+-------------------+-------+
|        ID         | TYPE  |
+-------------------+-------+
| Lord of the Rings | Movie |
| Lord of the Rings | Book  |
| Lord of the Rings | Game  |
| Alien             | Movie |
| Alien             | Game  |
| Fight Club        | Book  |
| Fight Club        | Movie |
| Scar Face         | Movie |
| God of War        | Game  |
| Tomb Raider       | Movie |
| Tomb Raider       | Game  |
| Borderlands       | Game  |
| Ulysses           | Book  |
+-------------------+-------+

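What I want to do is essentially one-hot encode this data, adding three columns, Movie, Book, and Game, that are binary encoded to show whether each type is true or false for each ID. With this data, however, one-hot encoding does not account for duplicate IDs. For example, if I use pd.get_dummies, I end up with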

+-------------------+-------+-------+------+------+
|        ID         | TYPE  | Movie | Game | Book |
+-------------------+-------+-------+------+------+
| Lord of the Rings | Movie |     1 |    0 |    0 |
| Lord of the Rings | Book  |     0 |    0 |    1 |
| Lord of the Rings | Game  |     0 |    1 |    0 |
| Alien             | Movie |     1 |    0 |    0 |
| Alien             | Game  |     0 |    1 |    0 |
| Fight Club        | Book  |     0 |    0 |    1 |
| Fight Club        | Movie |     1 |    0 |    0 |
| Scar Face         | Movie |     1 |    0 |    0 |
| God of War        | Game  |     0 |    1 |    0 |
| Tomb Raider       | Movie |     1 |    0 |    0 |
| Tomb Raider       | Game  |     0 |    1 |    0 |
| Borderlands       | Game  |     0 |    1 |    0 |
| Ulysses           | Book  |     0 |    0 |    1 |
+-------------------+-------+-------+------+------+
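The behaviour shown in the table above can be reproduced with a short sketch. The DataFrame is rebuilt from the sample data in the question; the variable names `dummies` and `encoded` are only illustrative, and `dtype=int` is an assumption added so that the indicators print as 0/1 (newer pandas versions default to booleans).

```python
import pandas as pd

# Sample data from the question: one row per (ID, TYPE) pair.
df = pd.DataFrame({
    "ID": [
        "Lord of the Rings", "Lord of the Rings", "Lord of the Rings",
        "Alien", "Alien", "Fight Club", "Fight Club", "Scar Face",
        "God of War", "Tomb Raider", "Tomb Raider", "Borderlands", "Ulysses",
    ],
    "TYPE": [
        "Movie", "Book", "Game", "Movie", "Game", "Book", "Movie",
        "Movie", "Game", "Movie", "Game", "Game", "Book",
    ],
})

# get_dummies encodes every row independently, so the number of rows
# stays the same and repeated IDs are not merged.
dummies = pd.get_dummies(df["TYPE"], dtype=int)
encoded = pd.concat([df, dummies], axis=1)

print(encoded)
```

Note that `pd.get_dummies` orders the new columns alphabetically (Book, Game, Movie), so the column order differs from the table above even though the values are the same.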

As expected, it gives one row per record. So my question is: can I turn this data into

+-------------------+-------------------+-------+------+------+
|        ID         |       TYPE        | Movie | Game | Book |
+-------------------+-------------------+-------+------+------+
| Lord of the Rings | [Movie,Game,Book] |     1 |    1 |    1 |
| Alien             | [Movie,Game]      |     1 |    1 |    0 |
| Fight Club        | [Movie,Book]      |     1 |    0 |    1 |
| Scar Face         | [Movie]           |     1 |    0 |    0 |
| God of War        | [Game]            |     0 |    1 |    0 |
| Tomb Raider       | [Movie,Game]      |     1 |    1 |    0 |
| Borderlands       | [Game]            |     0 |    1 |    0 |
| Ulysses           | [Book]            |     0 |    0 |    1 |
+-------------------+-------------------+-------+------+------+

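without completely transforming my data? Basically, I want to find all duplicate entries in ID and join them together, so that all the types for a given unique ID are in one place (ideally in a list within a single record), and then one-hot encode that, so that I can see all the true or false values for TYPE in one row, aligned with the (now) unique ID.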

Tags: python, pandas, dataframe

Solution


You can one-hot encode TYPE, attach the indicator columns to the original frame, and then group by ID:

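import pandas as pd

# df is the DataFrame from the question, with columns 'ID' and 'TYPE'.
# Collapse to one row per ID: collect TYPE into a list, sum the indicators.
(pd.concat((pd.get_dummies(df['TYPE']), df), axis=1, sort=False)
   .groupby('ID', as_index=False, sort=False)
   .agg({'TYPE': list, 'Movie': 'sum', 'Game': 'sum', 'Book': 'sum'})
)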

Output:

                  ID                 TYPE  Movie  Game  Book
0  Lord of the Rings  [Movie, Book, Game]      1     1     1
1              Alien        [Movie, Game]      1     1     0
2         Fight Club        [Book, Movie]      1     0     1
3          Scar Face              [Movie]      1     0     0
4         God of War               [Game]      0     1     0
5        Tomb Raider        [Movie, Game]      1     1     0
6        Borderlands               [Game]      0     1     0
7            Ulysses               [Book]      0     0     1
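For reference, here is a self-contained sketch of the same idea that can be run as-is. It rebuilds the sample data from the question and makes one change that is an assumption on my part, not part of the answer: it aggregates the indicators with `'max'` instead of `'sum'`, so a flag would stay at 1 even if the same (ID, TYPE) pair appeared more than once. The name `result` is only illustrative.

```python
import pandas as pd

# Sample data from the question: one row per (ID, TYPE) pair.
df = pd.DataFrame({
    "ID": [
        "Lord of the Rings", "Lord of the Rings", "Lord of the Rings",
        "Alien", "Alien", "Fight Club", "Fight Club", "Scar Face",
        "God of War", "Tomb Raider", "Tomb Raider", "Borderlands", "Ulysses",
    ],
    "TYPE": [
        "Movie", "Book", "Game", "Movie", "Game", "Book", "Movie",
        "Movie", "Game", "Movie", "Game", "Game", "Book",
    ],
})

# Row-wise indicators; dtype=int keeps them as 0/1 rather than booleans.
dummies = pd.get_dummies(df["TYPE"], dtype=int)

# One row per ID: gather the types into a list and take the maximum of
# each indicator (assumption: 'max' rather than 'sum', so repeated
# (ID, TYPE) pairs cannot push a flag above 1).
result = (
    pd.concat([df, dummies], axis=1)
    .groupby("ID", as_index=False, sort=False)
    .agg({"TYPE": list, "Movie": "max", "Game": "max", "Book": "max"})
)

print(result)
```

With the sample data, where no (ID, TYPE) pair repeats, `'max'` and `'sum'` give the same flags; the difference only matters for data with repeated pairs.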
