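python - Problem with predicted values after training a model
Problem description
I use the following code to compute tf-idf over my text data, which has 1,100,000 samples: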
# Calculating tf-idf using a Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

transformer = FeatureUnion([
    ('Source1_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['Text1'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())])),
    ('Source2_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['Text2'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())]))])
transformer.fit(Fulldf31)
# Now the two vocabularies are merged
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
Source1_vocab = list(transformer.transformer_list[0][1].steps[1][1].get_feature_names_out())
Source2_vocab = list(transformer.transformer_list[1][1].steps[1][1].get_feature_names_out())
vocab = Source1_vocab + Source2_vocab
# vocab
tfidf_vectorizer_vectors31 = transformer.transform(Fulldf31)
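After training the model, I compute tf-idf on 100,000 texts, and at prediction time I get this error:
ValueError: X has a different shape than during fitting.
Solution
Instead of fitting two TfidfVectorizers and then trying to combine them, join the text data row by row and pass the result to a single TfidfVectorizer.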
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

fruit = ['apple', 'banana', 'pear', 'kiwi']
vegetables = ['tomatoes', 'peppers', 'broccoli', 'carrots']
df = pd.DataFrame(
    {'Fruit': fruit, 'Vegetables': vegetables, 'Integers': np.arange(1, 5)})

# Select the text columns and join them along each row
def prepare_text_data(data):
    # Use the DataFrame passed in (not the global df) to find the text columns
    text_cols = [col for col in data.columns if data[col].dtype == 'object']
    text_data = data[text_cols].apply(lambda x: ' '.join(x), axis=1)
    return text_data

pipeline = Pipeline([
    ('text_selector', FunctionTransformer(prepare_text_data,
                                          validate=False)),
    ('vectorizer', TfidfVectorizer())])
pipeline = pipeline.fit(df)
tfidf = pipeline.transform(df)

# Check the vocabulary to verify it contains all tokens from df
pipeline['vectorizer'].vocabulary_
Out[39]:
{'apple': 0,
'tomatoes': 7,
'banana': 1,
'peppers': 6,
'pear': 5,
'broccoli': 2,
'kiwi': 4,
'carrots': 3}
# Here is the resulting tf-idf matrix with 4 rows and 8 columns, corresponding to
# the number of rows in df and the number of tokens in the tf-idf vocabulary
# (toarray() is the portable spelling of the older .A shorthand)
tfidf.toarray()
Out[40]:
array([[0.70710678, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678],
       [0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        ],
       [0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.70710678, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.70710678, 0.70710678,
        0.        , 0.        , 0.        ]])
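Whichever vectorizer layout is used, a shape error at prediction time typically means the feature matrix has a different number of columns than the one seen during training. One plausible cause, assuming the tf-idf step was fitted again on the 100,000 new texts, is that the second fit learned a different vocabulary. The sketch below illustrates that assumption and is not the asker's actual setup: the toy data, the `join_text_columns` helper, and the `LogisticRegression` classifier are all hypothetical stand-ins. It fits the whole pipeline once and then only calls `transform` and `predict` on the new data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def join_text_columns(data):
    """Join the Text1 and Text2 columns row by row (hypothetical helper)."""
    return data[['Text1', 'Text2']].apply(lambda row: ' '.join(row), axis=1)


# Made-up stand-ins for the training set and the texts to predict on
train = pd.DataFrame({
    'Text1': ['red apple', 'green pear', 'ripe banana', 'sour kiwi'],
    'Text2': ['fresh tomatoes', 'hot peppers', 'raw broccoli', 'sweet carrots'],
})
y_train = [0, 1, 0, 1]
new = pd.DataFrame({
    'Text1': ['green apple', 'unknown fruit'],
    'Text2': ['hot carrots', 'purple cabbage'],
})

# Vectorizer and classifier live in one pipeline, so they are fitted together
model = Pipeline([
    ('text_selector', FunctionTransformer(join_text_columns, validate=False)),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])
model.fit(train, y_train)

# transform() and predict() reuse the vocabulary learned in fit(),
# so the new matrix has the same number of columns as the training one
X_train = model[:-1].transform(train)
X_new = model[:-1].transform(new)
predictions = model.predict(new)
print(X_train.shape, X_new.shape)

# Fitting a fresh vectorizer on the new texts learns a different vocabulary,
# which gives a different column count and a shape mismatch downstream
refit = TfidfVectorizer().fit_transform(join_text_columns(new))
print(refit.shape)
```

Tokens that appear only in the new texts (here `unknown`, `fruit`, `purple`, `cabbage`) are simply ignored by `transform`, which keeps the column count fixed.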