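python - How do I use this machine learning model on a new file that has no output column?
Problem description
I used some data from a csv file with two columns: the first column is the review and the second is the result. I get an output, but I want to test this model on a file that has no output column. How do I do that?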
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import accuracy_score

# review.csv contains two columns
# first column is the review content (quoted)
# second column is the assigned sentiment (positive or negative)
def load_file():
    with open('review.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        # skip the header row
        next(reader)
        data = []
        target = []
        for row in reader:
            # skip missing data
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])
        return data, target

# preprocess creates the term frequency matrix for the review data set
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary=True)
    data = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    return tfidf_data

def learn_model(data, target):
    # preparing data for split validation. 60% training, 40% test
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.4, random_state=43)
    classifier = BernoulliNB().fit(data_train, target_train)
    predicted = classifier.predict(data_test)
    evaluate_model(target_test, predicted)

# read more about model evaluation metrics here
# http://scikit-learn.org/stable/modules/model_evaluation.html
def evaluate_model(target_true, target_predicted):
    print(classification_report(target_true, target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true, target_predicted)))

def main():
    data, target = load_file()
    tf_idf = preprocess()
    learn_model(tf_idf, target)

main()
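I got a result of 65%. Now how do I test this model on a new file that has no output column and write the output to a new file?
Solution
A simple approach is to use sklearn's Pipeline.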
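To see what a pipeline saves you from, here is a minimal sketch of the manual equivalent. In the question, the fitted vectorizer is created inside preprocess() and then discarded, so there is nothing left to convert new reviews into the same feature columns. The toy reviews and labels below are made up for illustration, and the TfidfTransformer step is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data standing in for review.csv and the new file
train_reviews = [
    "great product, loved it",
    "terrible, broke in a day",
    "loved the quality",
    "terrible support",
]
train_labels = ["positive", "negative", "positive", "negative"]
new_reviews = ["loved it", "terrible product"]

vectorizer = CountVectorizer(binary=True)
# fit_transform learns the vocabulary from the training reviews
X_train = vectorizer.fit_transform(train_reviews)
classifier = BernoulliNB().fit(X_train, train_labels)

# transform (not fit_transform) reuses that vocabulary for the new reviews,
# so the new matrix has the same columns the classifier was trained on
X_new = vectorizer.transform(new_reviews)
predictions = classifier.predict(X_new)
print(list(predictions))
```

A Pipeline bundles these fit and transform calls so that the same fitted steps are applied to training data and new data alike.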
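Suppose you read your training data with the following:
import pandas as pd

def read_training(filename):
    # Read from a csv file with two columns. Skip bad lines.
    # on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False.
    # If the file starts with a header row (the question skips one), also pass header=0.
    df = pd.read_csv(
        filename,
        on_bad_lines='skip',
        names=['data', 'target']
    )
    return df.data, df.target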
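You can do something similar for the new data. Make sure the file contains a single column.
def read_test(filename):
    # Read from a csv file with a single column. Skip bad lines.
    # on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False.
    df = pd.read_csv(
        filename,
        on_bad_lines='skip',
        names=['data']
    )
    return df.data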
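One thing that can break the single-column requirement is a review that contains commas: if it is not quoted, a csv parser splits it into several fields. The sketch below, which uses made-up reviews and an in-memory buffer instead of a real file, writes each review as one quoted field and checks that it reads back as one column. Both csv.reader and pd.read_csv treat the double quote as the quote character by default.

```python
import csv
import io

# Hypothetical reviews; the first one contains commas
reviews = ["Good value, fast delivery, would buy again", "Not what I expected"]

# Write one review per row; QUOTE_ALL wraps every field in double quotes
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
for review in reviews:
    writer.writerow([review])

# Read the text back and count the fields in each row
buffer.seek(0)
rows = list(csv.reader(buffer))
field_counts = [len(row) for row in rows]
print(field_counts)
```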
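Pipeline
Next, use a pipeline to make your functions more flexible. The code below is easy to read. It does not include the scoring step shown in your question.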
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
import numpy as np

def main():
    # Read training file
    train_data, train_target = read_training('review.csv')
    # Prepare all sklearn functions in a single pipeline
    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer(binary=True)),
        ('tf_idf_transformer', TfidfTransformer(use_idf=False)),
        ('bernoulli_nb', BernoulliNB())
    ])
    # This trains the entire pipeline on your training data
    pipeline.fit(train_data, train_target)
    # Your pipeline is now ready to apply to new data!
    test_data = read_test('test.csv')
    prediction = pipeline.predict(test_data)
    # Write prediction to file, one predicted label per line
    np.savetxt("prediction.csv", prediction, delimiter=",", fmt="%s")

main()
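If you still want the scoring step from the question, one possible way is to split the raw text before fitting and let the pipeline vectorize each part itself. This is only a sketch under assumptions: the reviews and labels are made-up stand-ins for the two columns of review.csv, and the split settings are copied from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the two columns of review.csv
reviews = [
    "loved it", "great quality", "works well", "very happy", "excellent value",
    "terrible product", "broke quickly", "waste of money", "very disappointed", "poor quality",
]
labels = ["positive"] * 5 + ["negative"] * 5

# Split the raw text: 60% training, 40% test, as in the question
text_train, text_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.4, random_state=43)

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(binary=True)),
    ('tf_idf_transformer', TfidfTransformer(use_idf=False)),
    ('bernoulli_nb', BernoulliNB()),
])
# Fit on the training part only, then score on the held-out part
pipeline.fit(text_train, y_train)
predicted = pipeline.predict(text_test)
accuracy = accuracy_score(y_test, predicted)

# zero_division=0 avoids warnings when a class is never predicted
print(classification_report(y_test, predicted, zero_division=0))
print("The accuracy score is {:.2%}".format(accuracy))
```

Once you are happy with the score, you can refit the same pipeline on all labeled data before predicting on the unlabeled file.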