train_test_split set to 50-50 returns high accuracy, but accuracy is low when the data is split into 2 files

Problem description

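I have one dataset (called train_plus_test.csv) with 1275 rows, containing feature columns and a label column, used to classify two activities: walking and lying. It is a balanced dataset, with the same number of samples in each class.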

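I run a random forest in 2 scenarios:

Scenario 1: train on train_plus_test.csv with a 0.75 - 0.25 train/test split. Accuracy reaches 91.8%.

Scenario 2: split the same file, train_plus_test.csv, into 2 files, a training file (train.csv) and a testing file (test.csv), at 75% - 25%. I then train the model on train.csv and predict on test.csv, but accuracy is only 52%. Where am I going wrong?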

Thanks for reading!

The Python code is included below, and the 3 CSV files mentioned above are here:

[GoogleDrive] https://drive.google.com/drive/folders/1AAOOFhR1QpoPPtSNTofBnouBaYHfFbir?usp=sharing&fbclid=IwAR10SjHCu-6Sszd-okes-IneAA8pWzals9-NNtAsmrw0ql28mk3geZfmnQI

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Scenario 1 ==================>
dataset = pd.read_csv('train_plus_test.csv')
feature_cols = list(dataset.columns.values)
feature_cols.remove('label')
X = dataset[feature_cols] # Features
y = dataset['label'] # Target

clf_RF = RandomForestClassifier(n_estimators=100, random_state=0, max_features=8, min_samples_leaf=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
clf_RF.fit(X_train, y_train)

y_pred_RF = clf_RF.predict(X_test)
print('Accuracy of training')
print(metrics.accuracy_score(y_test, y_pred_RF))

# Scenario 2 ======= comment out Scenario 1 before running Scenario 2 ===========>

train_dataset = pd.read_csv('train.csv')
test_dataset = pd.read_csv('test.csv')
feature_cols = list(train_dataset.columns.values)
feature_cols.remove('label')
clf_RF = RandomForestClassifier(n_estimators=100, random_state=0, max_features=8, min_samples_leaf=3 )
X = train_dataset[feature_cols] # Features
y = train_dataset['label'] # Target
clf_RF.fit(X, y)

X_test_data = test_dataset[feature_cols]
y_test_data = test_dataset['label']
y_test_pred = clf_RF.predict(X_test_data)
print('Accuracy of testing')
print(metrics.accuracy_score(y_test_data, y_test_pred))
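
A sanity check that could be run alongside the script above is to compare the two files directly: same columns in the same order, similar class shares, and similar feature means. The sketch below is only an illustration. The helper name `compare_split` and the toy data are assumptions standing in for train.csv and test.csv, and the toy split is deliberately sequential (unshuffled) to show what a skewed test file looks like.

```python
import pandas as pd


def compare_split(train_df, test_df, label_col="label"):
    """Summarise properties of a train/test pair that should roughly match."""
    feature_cols = [c for c in train_df.columns if c != label_col]
    mean_gap = (train_df[feature_cols].mean() - test_df[feature_cols].mean()).abs()
    return {
        # Column names and order must agree for the model to see the same features.
        "same_columns": list(train_df.columns) == list(test_df.columns),
        # Share of each class in each file.
        "train_class_share": train_df[label_col].value_counts(normalize=True).to_dict(),
        "test_class_share": test_df[label_col].value_counts(normalize=True).to_dict(),
        # Largest absolute difference between per-feature means of the two files.
        "max_feature_mean_gap": float(mean_gap.max()),
    }


# Toy data (hypothetical): rows are ordered by class, as a recorded file might be.
toy = pd.DataFrame(
    {
        "x": [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4],
        "label": ["walking"] * 4 + ["lying"] * 4,
    }
)

# Sequential 75% / 25% cut without shuffling: the test part holds a single class.
report = compare_split(toy.iloc[:6], toy.iloc[6:])
print(report)
```

On the real files, the same helper could be called as `compare_split(pd.read_csv('train.csv'), pd.read_csv('test.csv'))`. A test file dominated by one class, or a large gap in feature means, would suggest the files were not produced by a shuffled split.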

Tags: python-3.x, random-forest, train-test-split

Solution
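
One thing worth ruling out is that the two files were produced differently from the split in Scenario 1, for example cut sequentially rather than shuffled. The sketch below is not a confirmed diagnosis. It uses synthetic data from `make_classification` as a stand-in for train_plus_test.csv, and the helper `fit_and_score` and the temporary file paths are assumptions. It writes the exact rows returned by `train_test_split` to train.csv and test.csv, reads them back, and compares the accuracy with the in-memory result.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_plus_test.csv (assumption: not the original data).
X_arr, y_arr = make_classification(n_samples=400, n_features=10, random_state=0)
dataset = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
dataset["label"] = y_arr


def fit_and_score(train_df, test_df):
    """Train the random forest from the question and return test accuracy."""
    feature_cols = [c for c in train_df.columns if c != "label"]
    clf = RandomForestClassifier(
        n_estimators=100, random_state=0, max_features=8, min_samples_leaf=3
    )
    clf.fit(train_df[feature_cols], train_df["label"])
    return accuracy_score(test_df["label"], clf.predict(test_df[feature_cols]))


# Scenario 1: shuffled, stratified split held in memory.
train_df, test_df = train_test_split(
    dataset, test_size=0.25, random_state=42, stratify=dataset["label"]
)
acc_memory = fit_and_score(train_df, test_df)

# Scenario 2: save the very same rows to two files, then load them again.
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.csv")
    test_path = os.path.join(tmp, "test.csv")
    train_df.to_csv(train_path, index=False)
    test_df.to_csv(test_path, index=False)
    # float_precision="round_trip" keeps the parsed floats identical to those written.
    acc_files = fit_and_score(
        pd.read_csv(train_path, float_precision="round_trip"),
        pd.read_csv(test_path, float_precision="round_trip"),
    )

print("in-memory split:", acc_memory)
print("file-based split:", acc_files)
```

Under these assumptions the two numbers should agree, because the model sees the same rows either way. If regenerating train.csv and test.csv like this from the real train_plus_test.csv also closes the gap, the original files were probably split in some other way, such as taking the first 75% of rows without shuffling.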

