python - Gridsearch for NLP - How to combine CountVec and other features?

Problem description

I am working on a basic NLP project on sentiment analysis, and I would like to use GridSearchCV to optimize my model.

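The code below shows a sample of the dataframe I am working with. 'content' is the column to be passed to CountVectorizer, 'label' is the y column to be predicted, and feature_1 and feature_2 are columns I would also like to include in my model.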

import pandas as pd

df = pd.DataFrame([
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
     'feature_1': '1',
     'feature_2': '34',
     'label': 1},
    {'content': 'UP today Why doe head hurt badly',
     'feature_1': '5',
     'feature_2': '142',
     'label': 1},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
     'feature_1': '7',
     'feature_2': '123',
     'label': 0},
])

I am following this Stack Overflow answer: Perform feature selection using pipeline and gridsearch

import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin, BaseEstimator

class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    # The two flags switch each extra feature column on or off.
    def __init__(self, feature_1=True, feature_2=True):
        self.feature_1 = feature_1
        self.feature_2 = feature_2

    def extractor(self, tweet):
        # Collects the enabled columns (read from the global df, not from tweet).
        features = []

        if self.feature_2:
            features.append(df['feature_2'])

        if self.feature_1:
            features.append(df['feature_1'])

        return np.array(features)

    def fit(self, raw_docs, y=None):
        # Nothing to learn, so fit only returns the transformer itself.
        return self

    def transform(self, raw_docs):
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))

Below is the grid search I am trying to fit to my dataframe:

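from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

# Pipeline
pipe = Pipeline([
    ('features', FeatureUnion([
        ("vectorizer", CountVectorizer(df['content'])),
        ("extractor", CustomFeatureExtractor()),
    ])),
    ('classifier', lr()),
])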
But this raises: TypeError: 'LogisticRegression' object is not callable

Is there a simpler way to do this?

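I have already consulted the following threads, to no avail: How to combine TFIDF features with other features; Perform feature selection using pipeline and gridsearch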

Tags: python, nlp, pipeline, modeling

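You can't write lr(). lr is already a LogisticRegression instance, and an instance is indeed not callable; it only exposes methods (such as fit and predict) on the lr object.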
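The difference between the class and the instance can be checked directly. This is a minimal sketch, assuming scikit-learn is installed; the variable name message is only for illustration.

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()  # lr is an estimator instance

# The class is callable (calling it builds an instance); the instance is not.
print(callable(LogisticRegression))
print(callable(lr))

try:
    lr()  # same mistake as ('classifier', lr()) in the pipeline
except TypeError as exc:
    message = str(exc)
    print(message)
```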

Try this instead (lr without parentheses):

lr = LogisticRegression()
pipe = Pipeline([
    ('features', FeatureUnion([
        ("vectorizer", CountVectorizer(df['content'])),
        ("extractor", CustomFeatureExtractor()),
    ])),
    ('classifier', lr),
])

Your error message should go away.
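As for the "simpler way" asked about in the question, one possible route is scikit-learn's ColumnTransformer, which sends each dataframe column to its own transformer, so no custom extractor is needed. The sketch below is not from the original thread and rests on several assumptions: the three sample rows are repeated so that 2-fold cross-validation has both classes in every fold, feature_1 and feature_2 are stored as numbers rather than strings, and the StandardScaler step and the parameter grid values are arbitrary illustrative choices. Note also that CountVectorizer's constructor takes configuration options rather than documents; the text reaches it when the pipeline is fitted.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the question's three rows, with numeric features,
# repeated four times so each CV fold contains both labels.
rows = [
    {"content": "Got flat way today Pot hole Another thing tick crap thing happen week list",
     "feature_1": 1, "feature_2": 34, "label": 1},
    {"content": "UP today Why doe head hurt badly",
     "feature_1": 5, "feature_2": 142, "label": 1},
    {"content": "spray tan fail leg foot Ive scrubbing foot look better",
     "feature_1": 7, "feature_2": 123, "label": 0},
]
df = pd.DataFrame(rows * 4)

features = ColumnTransformer([
    # A single column name (a string, not a list) hands the vectorizer
    # a 1-D sequence of documents, which is what it expects.
    ("vectorizer", CountVectorizer(), "content"),
    # A list of names hands the scaler a 2-D block of numeric columns.
    ("numeric", StandardScaler(), ["feature_1", "feature_2"]),
])

pipe = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression()),  # an instance, not lr()
])

# Nested parameters are addressed as <step>__<transformer>__<parameter>.
param_grid = {
    "features__vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0],
}

X = df[["content", "feature_1", "feature_2"]]
y = df["label"]

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

The whole dataframe is passed to fit, and the column names given to ColumnTransformer decide which transformer sees which column.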
