Could not convert string to float in logistic regression

Question

I wrote the following code:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
print(X_train)

Spam_model.fit(X_train, Y_train)
pred = Spam_model.predict(X_test)
accuracy_score(Y_test,pred)

It throws a "could not convert string to float" error. What could be the cause?

Tags: python, logistic-regression

Answer


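If your data is text, you need to extract numeric features from it before applying a classifier, because LogisticRegression expects numeric input. Using an older example from sklearn: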

from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
X_train = newsgroups_train.data
Y_train = newsgroups_train.target

newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
X_test = newsgroups_test.data
Y_test = newsgroups_test.target
 

The data looks like this:

Y_train
array([0, 1, 1, ..., 1, 1, 1])

X_train[0][:50]
'From: bil@okcforum.osrhe.edu (Bill Conner)\nSubject'
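
To see why raw strings like the one above cannot be passed to the classifier directly, here is a minimal sketch that reproduces the kind of failure described in the question. The toy texts and labels are hypothetical and only stand in for the asker's X_train and Y_train, which are not shown; the exact wording of the message may vary between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: raw text stored in a 2-D object array,
# one row per sample (an assumption, not the asker's actual data).
X_raw = np.array([["free money now"], ["meeting at noon"],
                  ["win a free prize"], ["lunch tomorrow"]], dtype=object)
y = np.array([1, 0, 1, 0])

clf = LogisticRegression(solver='liblinear')
try:
    clf.fit(X_raw, y)
    error_message = None
except ValueError as exc:
    # scikit-learn tries to cast the input to float and fails on the text.
    error_message = str(exc)

print(error_message)
```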

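Apply a vectorizer to convert the text into basic numeric features, then train the model. Note that the vectorizer is fitted on the training data only and then reused to transform the test data: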

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(solver='liblinear', penalty='l1')

model.fit(X_train_vec, Y_train)
pred = model.predict(X_test_vec)
accuracy_score(Y_test,pred)

0.906030855539972
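
As a possible variation on the steps above, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, so that raw text goes in and predictions come out. This is only a sketch under stated assumptions: the toy corpus, the labels, and the name spam_pipeline are hypothetical, and the classifier uses the default penalty rather than the 'l1' setting shown above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only (1 = spam, 0 = not spam).
texts = [
    "win a free prize now",
    "free money, click here",
    "claim your free reward today",
    "meeting moved to noon",
    "see you at lunch tomorrow",
    "please review the attached report",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together, and
# applies the same fitted vocabulary whenever predict() is called.
spam_pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver='liblinear'),
)
spam_pipeline.fit(texts, labels)

new_texts = ["free prize inside", "see you at the meeting"]
pred = spam_pipeline.predict(new_texts)
print(pred)
```

Keeping both steps in one object makes it harder to accidentally call fit_transform on the test data or to pass unvectorized strings to the model.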
