python - How do I convert an sklearn pipeline into plain code?
Problem description
I got this sklearn code from a tutorial:
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])
I want to convert it into plain code, like this:
X_train = predictors.fit_transform(X_train)
X_train = bow_vector.fit_transform(X_train)
classifier.fit(X_train)
But I keep getting errors. A quick read of the documentation didn't help.
UPD
My exact code is:
import string
import pandas as pd
import spacy
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from spacy.lang.en import English

df = pd.read_excel('data.xlsx')
X = df['X']
ylabels = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3, random_state=42)
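# List of punctuation marks
punctuations = string.punctuation
# Natural language processing engine
nlp = spacy.load('en')
# List of stop words
stop_words = spacy.lang.en.stop_words.STOP_WORDS
# Load the English tokenizer, tagger, parser, NER and word vectors
parser = English()
# Tokenizer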
def spacy_tokenizer(sentence):
    # Create a token object
    mytokens = parser(sentence)
    # Lemmatize each token and convert it to lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    # Return the preprocessed list of tokens
    return mytokens
# First element of the pipeline
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Clean each text
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}
# Basic function to clean the text
def clean_text(text):
    # Strip surrounding whitespace and convert the text to lowercase
    return text.strip().lower()
Solution
I solved my problem.
from sklearn.linear_model import LogisticRegression

# Fit the vectorizer and the classifier on the training data
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
cleaner = predictors()
X_train_cleaned = cleaner.transform(X_train)
X_train_transformed = tfidf_vector.fit_transform(X_train_cleaned)
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(X_train_transformed, y_train)
# Reuse the fitted vectorizer on the test data: transform only, no refitting
cleaner = predictors()
X_test_cleaned = cleaner.transform(X_test)
X_test_transformed = tfidf_vector.transform(X_test_cleaned)
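To check that the manual steps really match what the pipeline does, here is a minimal sketch that fits both versions on the same data and compares their predictions. It makes several assumptions: the toy sentences and labels are made up, `TextCleaner` is a hypothetical stand-in for `predictors`, and `TfidfVectorizer` uses its default tokenizer instead of `spacy_tokenizer` so that the example runs without spaCy.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for `predictors`: strips and lowercases text."""

    def fit(self, X, y=None):
        # Nothing to learn; the cleaner is stateless.
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]


# Toy data, made up for illustration only.
X_train = ["  Good movie ", "Great film", "Bad movie", " Terrible film  "]
y_train = [1, 1, 0, 0]
X_test = ["good film", "terrible movie"]

# Pipeline version: fit() runs fit_transform on every step except the last,
# then calls fit(X_transformed, y) on the final estimator.
pipe = Pipeline([
    ("cleaner", TextCleaner()),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(solver="lbfgs")),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: the same steps, written out one by one.
cleaner = TextCleaner()
vectorizer = TfidfVectorizer()
classifier = LogisticRegression(solver="lbfgs")

X_train_vec = vectorizer.fit_transform(cleaner.fit_transform(X_train))
classifier.fit(X_train_vec, y_train)

# On test data, only transform() is called; nothing is refitted.
X_test_vec = vectorizer.transform(cleaner.transform(X_test))
manual_pred = classifier.predict(X_test_vec)

# True when both versions predict the same labels.
same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

The comparison also shows where the attempt in the question goes wrong: `predictors.fit_transform(X_train)` is called on the class rather than on an instance such as `predictors()`, and `classifier.fit(X_train)` is missing the labels `y_train`.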