How do I use this machine learning model on a new file that has no output column?

Problem description

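I trained on data from a csv file with two columns: the first is the review and the second is the result (the label). The model produces output, but I would like to test it on a file that has no output column. How can I do that?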

import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# review.csv contains two columns
# first column is the review content (quoted)
# second column is the assigned sentiment (positive or negative)
def load_file():
    with open('review.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        next(reader)  # skip the header row
        data = []
        target = []
        for row in reader:
            # skip missing data
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])

        return data, target

# preprocess creates the term frequency matrix for the review data set
def preprocess(data):
    count_vectorizer = CountVectorizer(binary=True)
    counts = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(counts)

    return tfidf_data

def learn_model(data, target):
    # preparing data for split validation. 60% training, 40% test
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.4, random_state=43)
    classifier = BernoulliNB().fit(data_train, target_train)
    predicted = classifier.predict(data_test)
    evaluate_model(target_test, predicted)

# read more about model evaluation metrics here
# http://scikit-learn.org/stable/modules/model_evaluation.html
def evaluate_model(target_true, target_predicted):
    print(classification_report(target_true, target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true, target_predicted)))

def main():
    data, target = load_file()
    tf_idf = preprocess(data)
    learn_model(tf_idf, target)


main()

I get an accuracy of about 65%. Now, how do I run this model on a new file that has no output column and write the predictions to a new file?

Tags: python, machine-learning, scikit-learn

Solution


A simple approach is to use scikit-learn's Pipeline.
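The reason a pipeline helps: the vectorizer has to be fitted once on the training text, and new text must then be transformed with that same fitted vocabulary rather than fitted again. A pipeline bundles these steps, so `predict` can take raw strings directly. Here is a minimal sketch of the idea on a tiny made-up data set (the reviews and labels below are hypothetical and are not taken from review.csv):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
train_reviews = [
    "great product, works well",
    "really great, love it",
    "terrible, broke after a day",
    "awful quality, terrible support",
]
train_labels = ["positive", "positive", "negative", "negative"]

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(binary=True)),
    ("bernoulli_nb", BernoulliNB()),
])

# fit() learns the vocabulary and the classifier from the labelled data
pipeline.fit(train_reviews, train_labels)

# predict() reuses the fitted vocabulary on unlabelled text
new_reviews = ["great, love it", "terrible quality"]
predictions = pipeline.predict(new_reviews)
print(list(zip(new_reviews, predictions)))
```

No label column is needed for `new_reviews`; only `fit` requires one.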

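Suppose you read the training data with something like the following:

import pandas as pd

def read_training(filename):
    # Read from a csv file with two columns. Skip bad lines.
    # header=0 replaces the file's header row (the one the question's
    # code skips) with the names below.
    # on_bad_lines='skip' needs pandas 1.3+; older versions used
    # error_bad_lines=False instead.
    df = pd.read_csv(
        filename,
        header=0,
        on_bad_lines='skip',
        names=['data', 'target']
    ).dropna()  # drop rows with a missing review or label
    return df.data, df.target

You can do something similar for the new data. Make sure that file has a single column.

def read_test(filename):
    # Read from a csv file with a single column. Skip bad lines.
    # Add header=0 if the file starts with a header row.
    df = pd.read_csv(
        filename,
        on_bad_lines='skip',
        names=['data']
    ).dropna()  # drop empty rows
    return df.data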

Pipeline

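Next, use a pipeline so the same fitted steps can be applied to any new data. The code below should be easy to read. It leaves out the scoring step from your question.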

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
import numpy as np

def main():
    # Read training file
    train_data, train_target = read_training('review.csv')

    # Prepare all sklearn functions in a single pipeline
    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer(binary=True)),
        ('tf_idf_transformer', TfidfTransformer(use_idf=False)),
        ('bernoulli_nb', BernoulliNB())
    ])

    # This trains the entire pipeline on your training data
    pipeline.fit(train_data, train_target)

    # Your pipeline is now ready to apply to new data!
    test_data = read_test('test.csv')
    prediction = pipeline.predict(test_data)

    # Write prediction to file, one predicted label per line
    np.savetxt("prediction.csv", prediction, delimiter=",", fmt="%s")

main()
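
The scoring step from the question can be kept by holding out part of the labelled data before fitting the pipeline. The following is a rough sketch, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`). The helper name `evaluate_pipeline`, the `stratify` argument, and the toy data are additions for illustration and are not part of the question's code:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


def evaluate_pipeline(data, target, test_size=0.4, random_state=43):
    """Fit the pipeline on a training split and score it on the held-out split."""
    # stratify keeps the class balance similar in both splits
    # (an addition; the question's code does not stratify)
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=test_size, random_state=random_state,
        stratify=target,
    )
    pipeline = Pipeline([
        ("count_vectorizer", CountVectorizer(binary=True)),
        ("tf_idf_transformer", TfidfTransformer(use_idf=False)),
        ("bernoulli_nb", BernoulliNB()),
    ])
    pipeline.fit(data_train, target_train)

    predicted = pipeline.predict(data_test)
    print(classification_report(target_test, predicted))
    accuracy = accuracy_score(target_test, predicted)
    print("The accuracy score is {:.2%}".format(accuracy))
    return pipeline, accuracy


# Hypothetical toy data; in practice pass the output of read_training()
reviews = [
    "great product", "love it", "works really well", "excellent value",
    "very happy with it", "terrible product", "hate it", "broke after a day",
    "awful quality", "very disappointed",
]
labels = ["positive"] * 5 + ["negative"] * 5

fitted_pipeline, accuracy = evaluate_pipeline(reviews, labels)
```

Once the score looks acceptable, one option is to refit the pipeline on all of the labelled data before predicting on the unlabelled file.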
