How do I create multiple dummy variables (an interaction between two columns)?

Problem description

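I need to create a dummy variable for every combination of choice and city. The choice set is a list of integers: [10, 20, 30, 40, 50], and the city set is a list of strings: ['XX', 'YY', 'ZZ'].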

Here is the DataFrame:

 choice city
     10   XX
     20   YY
     20   YY
     30   XX
     10   XX
     20   YY
     40   ZZ
     40   ZZ
     50   YY
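
For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is only a sketch: the name `df` is an assumption chosen to match the answer's code, and the values are copied from the table.

```python
import pandas as pd

# Rebuild the sample frame from the question; `df` is the name the answer uses.
df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})
print(df)
```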

Expected result:

 choice city  10_XX  10_YY  10_ZZ  20_XX  20_YY  20_ZZ  30_XX  30_YY  30_ZZ  40_XX  40_YY  40_ZZ  50_XX  50_YY  50_ZZ
     10   XX      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     30   XX      0      0      0      0      0      0      1      0      0      0      0      0      0      0      0
     10   XX      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     40   ZZ      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
     40   ZZ      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
     50   YY      0      0      0      0      0      0      0      0      0      0      0      0      0      1      0

Tags: pandas, numpy

Solution


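You can use an outer comparison, which compares every row with every other row. Entry (i, j) of the result is 1 when rows i and j hold the same (choice, city) pair.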


a = df.to_numpy()  # plain ndarray, so the ufunc's outer method applies directly
u = np.equal.outer(a, a).any(1).all(-1).view('i1')

array([[1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int8)
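
To see why the one-liner yields this matrix, it may help to split it into steps. The sketch below assumes the sample frame from the question; the intermediate names `eq`, `hit` and `same_row` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})

a = df.to_numpy()                # object array, shape (9, 2)

# eq[i, j, k, l] is True when a[i, j] == a[k, l]
eq = np.equal.outer(a, a)        # shape (9, 2, 9, 2)

# hit[i, k, l]: does row i contain the value a[k, l] in any of its columns?
hit = eq.any(axis=1)             # shape (9, 9, 2)

# same_row[i, k]: is every value of row k found somewhere in row i?
same_row = hit.all(axis=-1)      # shape (9, 9)

u = same_row.astype("i1")        # 0/1 matrix, dtype int8
print(u)
```

Note that this counts a match in any column position, so it only amounts to row equality here because the two columns can never share a value (integers versus strings).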

Now convert this back to the desired DataFrame:

index = pd.MultiIndex.from_frame(df)
columns = index.map("{0[0]}_{0[1]}".format)

# every choice/city combination, as a sorted list so the column order is deterministic
allc = [
  f'{i}_{j}' for i in sorted(df['choice'].unique()) for j in sorted(df['city'].unique())]

# u has one column per row of df; transpose to drop the repeated columns
res = pd.DataFrame(u, index, columns).T.drop_duplicates().T

res.reindex(allc, axis=1, fill_value=0)

             10_XX  10_YY  10_ZZ  20_XX  20_YY  20_ZZ  30_XX  30_YY  30_ZZ  40_XX  40_YY  40_ZZ  50_XX  50_YY  50_ZZ
choice city
10     XX        1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
20     YY        0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
       YY        0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
30     XX        0      0      0      0      0      0      1      0      0      0      0      0      0      0      0
10     XX        1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
20     YY        0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
40     ZZ        0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
       ZZ        0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
50     YY        0      0      0      0      0      0      0      0      0      0      0      0      0      1      0
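
If the outer comparison feels heavy, a different route (not the approach above, just a sketch of an alternative) is to build one combined label per row and pass it to `pd.get_dummies`, then reindex to the full set of combinations. The names `choices`, `cities`, `labels`, `all_cols` and `out` are assumptions for illustration; the two lists are the sets given in the question.

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})

choices = [10, 20, 30, 40, 50]   # choice set from the question
cities = ["XX", "YY", "ZZ"]      # city set from the question

# One label per row, e.g. "10_XX"
labels = df["choice"].astype(str) + "_" + df["city"]

# All 15 combinations, in the order shown in the expected result
all_cols = [f"{c}_{s}" for c, s in itertools.product(choices, cities)]

dummies = (
    pd.get_dummies(labels)                       # only combinations present in the data
    .reindex(columns=all_cols, fill_value=0)     # add the missing ones as all-zero columns
    .astype(int)
)

out = pd.concat([df, dummies], axis=1)
print(out)
```

This keeps `choice` and `city` as ordinary columns, as in the expected result, and avoids building the 9 x 9 comparison matrix.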
