How can we provide explicit test and training data to an SVM instead of using the train_test_split function?

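Question

I want to supply the test and training datasets to the algorithm explicitly, instead of using train_test_split to split the data randomly into test and training sets.

I would also like to keep the review and label data in the same file when training and testing the model.

Can anyone suggest how to do this?

My code: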

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix

with open("/Users/xyz/Desktop/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/xyz/Desktop/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]


onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:" , accuracy_score)

test.txt:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

train.txt:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

Tags: python, scikit-learn, svm

Answer

To do what you want, the solution is very simple:

X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]

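Do the same for y_train and y_test.

Of course, you need to know which rows in the file are for training and which are for testing.

If you want to keep the features and labels in the same file, that is no problem. You will need one extra step to separate the labels from the features. It would be much easier with pandas.

Edit

Given the files you provided, you can get what you want like this: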
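The slicing above can be sketched end to end as follows. This is only an illustration: the four rows are the sample reviews from the question placed in one list, and the assumption that the first two rows are the training data is made up for the example.

```python
# Sample rows from the question's train.txt and test.txt, concatenated.
# Assumption for this sketch: the first 2 rows are training data.
reviews_tokens = [
    "The picture is clear and beautiful".split(),
    "Picture is not clear".split(),
    "Colors & clarity is superb".split(),
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung".split(),
]
labels = ["positive", "negative", "positive", "negative"]

number_of_rows_of_train_data = 2  # must be known in advance

# Slice features and labels at the same index so they stay aligned.
X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]
y_train = labels[:number_of_rows_of_train_data]
y_test = labels[number_of_rows_of_train_data:]

print(len(X_train), len(X_test))
```

Using one shared boundary index for both lists is what keeps each review paired with its own label.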
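As a rough sketch of that extra step, the snippet below separates labels from features using only the standard-library csv module rather than pandas. The helper name split_features_and_labels is hypothetical, and the in-memory text stands in for a file laid out like the question's train.txt.

```python
import csv
import io

# In-memory stand-in for a "review,label" file such as train.txt.
sample = io.StringIO(
    "review,label\n"
    "The picture is clear and beautiful,positive\n"
    "Picture is not clear,negative\n"
)

def split_features_and_labels(handle):
    """Hypothetical helper: return (token lists, labels) from a review,label file."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    X, y = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # The last field is the label; everything before it is the review,
        # re-joined in case the review itself contained commas.
        *review_parts, label = row
        X.append(",".join(review_parts).split())
        y.append(label.strip())
    return X, y

X, y = split_features_and_labels(sample)
print(y)
```

With a real file, the same function could be given the handle from open(filename, newline="").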

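def load_data(filename):
    """Read a 'review,label' file and return (token lists, labels)."""
    X = list()
    y = list()
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # split on the last comma only, so commas inside a review are kept
            review, label = line.rsplit(',', 1)
            y.append(label)
            X.append(review.split())

    return X, y

X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')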
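Once the two sets are loaded, the rest of the question's pipeline can run without train_test_split. The following is a minimal sketch, assuming scikit-learn is installed. The data is inlined from the question's sample files so the snippet runs on its own, and, as in the question's code, the encoder is fitted on all tokens.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Inlined copies of the question's sample rows; with real files these
# would come from load_data('train.txt') and load_data('test.txt').
X_train = [
    "The picture is clear and beautiful".split(),
    "Picture is not clear".split(),
]
y_train = ["positive", "negative"]
X_test = [
    "Colors & clarity is superb".split(),
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung".split(),
]
y_test = ["positive", "negative"]

# Build the token vocabulary from every review, mirroring the question's code.
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train + X_test)

# Train on the explicit training set, then score on the explicit test set.
lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:", accuracy)
```

With only two training rows the score itself is not meaningful; the point is only that the explicit sets slot into the same fit and score calls.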
