Splitting strings with one-hot encoding and converting a df from long to wide format

Problem description

The following script builds a simplified version of the df in question:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3],
    'feature': ['colour', 'interior_features', 'colour', 'interior_features', 'colour', 'interior_features'],
    'feature_value': ['blue', 'cd_player<->sat_nav<->usb_port', 'red', 'cd_player<->usb_port', 'red', 'cd_player<->sat_nav<->sub_woofer'],
})
df

   id   feature             feature_value
0   1   colour              blue
1   1   interior_features   cd_player<->sat_nav<->usb_port
2   2   colour              red
3   2   interior_features   cd_player<->usb_port
4   3   colour              red
5   3   interior_features   cd_player<->sat_nav<->sub_woofer

First, I want to convert the strings in the 'interior_features' rows into lists, with '<->' as the delimiter, as shown below:

    id  feature             feature_value
0   1   colour              blue
1   1   interior_features   [cd_player, sat_nav, usb_port]
2   2   colour              red
3   2   interior_features   [cd_player, usb_port]
4   3   colour              red
5   3   interior_features   [cd_player, sat_nav, sub_woofer]
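
For reference, this intermediate step can be sketched as below. It is only an illustration, not the asker's code: the name split_df is hypothetical, and a plain list comprehension is used so that the 'colour' rows keep their original string value while only the 'interior_features' rows are split.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3],
    'feature': ['colour', 'interior_features'] * 3,
    'feature_value': [
        'blue', 'cd_player<->sat_nav<->usb_port',
        'red', 'cd_player<->usb_port',
        'red', 'cd_player<->sat_nav<->sub_woofer',
    ],
})

# split_df is a hypothetical name for the intermediate frame.
split_df = df.copy()

# Split only the 'interior_features' rows on '<->'; leave other rows as strings.
split_df['feature_value'] = [
    value.split('<->') if feature == 'interior_features' else value
    for feature, value in zip(df['feature'], df['feature_value'])
]

print(split_df)
```

The resulting column mixes strings and lists, so it has object dtype; that is fine for inspection but is one reason the later steps handle the two kinds of rows separately.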

Then I want to unnest these lists and use one-hot encoding, so that each interior feature gets its own row with a binary value in the 'feature_value' column.

Expected df:

    id  feature     feature_value
0   1   colour      blue
1   1   cd_player   1
2   1   sat_nav     1
3   1   usb_port    1
4   1   sub_woofer  0
5   2   colour      red
6   2   cd_player   1
7   2   sat_nav     0
8   2   usb_port    1
9   2   sub_woofer  0
10  3   colour      red
11  3   cd_player   1
12  3   sat_nav     1
13  3   usb_port    0
14  3   sub_woofer  1

Any help would be greatly appreciated.

Tags: python, pandas

Solution


You can try split, then explode, then crosstab, which fills in the rows that are missing for each id:

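# Set aside the 'colour' rows, which do not need to be unnested
df1 = df.loc[df['feature'] == 'colour']
df2 = df.drop(df1.index)
# Split on '<->' and give each interior feature its own row
df2['feature'] = df2['feature_value'].str.split('<->')
s = df2.explode('feature')
# crosstab counts every (id, feature) pair, so absent features become 0
s = pd.crosstab(s['id'], s['feature']).stack().reset_index(name='feature_value')
# A stable sort keeps each 'colour' row ahead of the features for the same id
out = pd.concat([df1, s]).sort_values('id', kind='stable')
out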
Out[356]: 
    id     feature feature_value
0    1      colour          blue
0    1   cd_player             1
1    1     sat_nav             1
2    1  sub_woofer             0
3    1    usb_port             1
2    2      colour           red
4    2   cd_player             1
5    2     sat_nav             0
6    2  sub_woofer             0
7    2    usb_port             1
4    3      colour           red
8    3   cd_player             1
9    3     sat_nav             1
10   3  sub_woofer             1
11   3    usb_port             0
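
As a possible alternative, not part of the original answer, the same long-format result can probably be reached with Series.str.get_dummies, which splits on a separator and builds the indicator columns in one call. The sketch below assumes the df from the question; the names is_features, colour, dummies and long_features are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3],
    'feature': ['colour', 'interior_features'] * 3,
    'feature_value': [
        'blue', 'cd_player<->sat_nav<->usb_port',
        'red', 'cd_player<->usb_port',
        'red', 'cd_player<->sat_nav<->sub_woofer',
    ],
})

is_features = df['feature'] == 'interior_features'
colour = df.loc[~is_features]

# One 0/1 column per distinct interior feature, indexed by id.
dummies = (
    df.loc[is_features]
      .set_index('id')['feature_value']
      .str.get_dummies(sep='<->')
)

# Wide to long: one row per (id, feature) pair.
long_features = dummies.reset_index().melt(
    id_vars='id', var_name='feature', value_name='feature_value'
)

# Stable sort so each 'colour' row stays ahead of the features for its id.
out = (
    pd.concat([colour, long_features], ignore_index=True)
      .sort_values('id', kind='stable', ignore_index=True)
)
print(out)
```

As with crosstab, the feature rows come out in alphabetical order (sub_woofer before usb_port), which differs from the order in the expected df; reorder the columns of dummies before melting if a specific order matters.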
