How to merge data with CountVectorizer features

Problem description

This is my dataset:

        body                                            customer_id   name
14828   Thank you to apply to us.                       5458          Sender A
23117   Congratulation your application is accepted.    5136          Sender B
23125   Your OTP will expire in 10 minutes.             5136          Sender A

This is my code:

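import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# a is the DataFrame shown above
b = a['body']
vect = CountVectorizer()
vect.fit(b)
X_vect = vect.transform(b)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names_out())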

The output is:

    10  application apply ... your  
0   0   0           1         0
1   0   1           0         1
2   1   0           0         1 

What I need is:

        body                                            customer_id   name        10  application apply ... your
14828   Thank you to apply to us.                       5458          Sender A    0   0           1         0
23117   Congratulation your application is accepted.    5136          Sender B    0   1           0         1
23125   Your OTP will expire in 10 minutes.             5136          Sender A    1   0           0         1

How can I do this? I would still like to use CountVectorizer so that I can modify the features in the future.

Tags: python, pandas, dataframe, scikit-learn, countvectorizer

Solution


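You can pass index=a.index to the DataFrame constructor so the count columns carry the same row labels as the original data, then join the result to the original DataFrame. join performs a left join on the index by default: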

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
b = pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names_out(), index=a.index)
a = a.join(b)
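To see why the index argument matters, here is a minimal pandas-only sketch. The small `counts` list is a hand-written stand-in for `X_vect.toarray()` with just two of the columns, and the names `misaligned` and `aligned` are illustrative only. Without `index=a.index`, the new frame gets the default labels 0, 1, 2, which match none of 14828, 23117, 23125, so the left join fills the count columns with NaN.

```python
import pandas as pd

# Original data keeps its non-default row labels.
a = pd.DataFrame(
    {"customer_id": [5458, 5136, 5136]},
    index=[14828, 23117, 23125],
)

# Hand-written stand-in for X_vect.toarray(), limited to two columns.
counts = [[0, 1], [1, 0], [0, 0]]
cols = ["application", "apply"]

# No index given: labels are 0, 1, 2, so nothing lines up with a.index.
misaligned = a.join(pd.DataFrame(counts, columns=cols))

# Same data, but labelled with a.index: rows line up one to one.
aligned = a.join(pd.DataFrame(counts, columns=cols, index=a.index))

print(misaligned)
print(aligned)
```

The first frame shows NaN in both count columns, while the second carries the counts on the original rows.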

Or use merge, which needs more parameters because it defaults to an inner join on columns rather than a left join on the index:

a = a.merge(b, left_index=True, right_index=True, how='left')
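For reference, the whole flow can be sketched end to end as below. The DataFrame is rebuilt by hand from the sample in the question, and the `names`, `joined` and `merged` variables are illustrative. The `hasattr` check is only an assumption about which scikit-learn version is installed (older releases expose `get_feature_names()` instead).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data rebuilt from the question, with its original row labels.
a = pd.DataFrame(
    {
        "body": [
            "Thank you to apply to us.",
            "Congratulation your application is accepted.",
            "Your OTP will expire in 10 minutes.",
        ],
        "customer_id": [5458, 5136, 5136],
        "name": ["Sender A", "Sender B", "Sender A"],
    },
    index=[14828, 23117, 23125],
)

vect = CountVectorizer()
X_vect = vect.fit_transform(a["body"])  # fit and transform in one step

# Pick whichever feature-name accessor this scikit-learn version provides.
if hasattr(vect, "get_feature_names_out"):
    names = list(vect.get_feature_names_out())
else:
    names = list(vect.get_feature_names())

# Reuse a.index so the count rows align with the original rows.
b = pd.DataFrame(X_vect.toarray(), columns=names, index=a.index)

joined = a.join(b)  # left join on the index by default
merged = a.merge(b, left_index=True, right_index=True, how="left")

print(joined.head())
```

One caveat: if a token in the vocabulary has the same name as an existing column (for example "name" or "body"), join raises an error about overlapping columns unless a suffix is given, and merge appends suffixes to both. Renaming the count columns first, for instance with b.add_prefix("tok_"), avoids that.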
