SciKitLearn estimator selection

Question

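Good afternoon,

As a humble beginner working on a machine learning project, I am trying the most basic estimator (linear regression), although I am fairly sure it is the wrong choice for my data. My data has 38 columns: one datetime column, two string columns, my three targets (two int columns and one single-character string column), and the remaining columns are floats. Using linear regression (after dropping the datetime column and converting every string column to a numeric type), the maximum accuracy of my model after 10000 iterations of a for loop is 44% (0.44).

Here is my code.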

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pickle

# Import the xls files
data_18_19 = pd.read_excel(r'c:\Users\unkno\Desktop\xxxxx_x_.xls')
data_19_20 = pd.read_excel(r'c:\Users\unkno\Desktop\yyyyy_y_.xls')

# Merge the dataframes
merge_data = [data_18_19, data_19_20]
data = pd.concat(merge_data, sort=False)


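# Drop the Div column (all l1) and the date/time columns, which are problematic
# (axis is passed by keyword; the positional form was removed in pandas 2.0)
data = data.drop(['Div'], axis=1)
data = data.drop(['Time'], axis=1)
data = data.drop(['Date'], axis=1)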

# droplist: list comprehension over the names (as str) of the columns from index 37 onward
droplist = [str(x) for x in data.iloc[0:0,37:]]
data = data.drop(droplist, axis=1)

# Replace H, D, A with 1, 0, 2 in HTR and FTR
data['FTR'] = data['FTR'].replace(['H','D','A'], [1,0,2])
data['HTR'] = data['HTR'].replace(['H','D','A'], [1,0,2])

# Convert the strings to numbers in alphabetical order
dt = {'At':1,'Bo':2,'Br':3,'Ca':4,'Ch':23,'Em':22,'Fr':21,'Fi':5,'Ge':6,'In':7,'Ju':8,'La':9,'Le':10,'Mi':11,'Na':12,'Pa':13,'Ro':14,'Sa':15,'Sas':16,'Sp':17,'To':18,'Ud':19,'Ve':20}
data['HT'] = data['HT'].replace([i for i in dt.keys()], [j for j in dt.values()])
data['AT'] = data['AT'].replace([i for i in dt.keys()], [j for j in dt.values()])

# Define the target column to predict
predict = 'FTR'

# Build the features (X) and the target (y)
X = np.array(data.drop([predict],axis=1))
y = np.array(data[predict])

best = 0
for i in range(10000):
    # Split the data for validation
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1)
    
    # Define and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Score on the test split (for LinearRegression, score() returns R^2, not classification accuracy)
    acc = model.score(X_test, y_test)
    
    # Predictions
    predicts = model.predict(X_test)
    hr_predicts = np.around(predicts)
    
    if acc > best:
        best = acc
        with open(r"c:\Users\unkno\Desktop\dump.pickle", "wb") as doc:
            pickle.dump(model, doc)
            
    print("Precisione: ", acc)

I am wondering how to improve the accuracy and which estimator to choose to get better results. Thanks in advance.

Tags: python-3.x, numpy, scikit-learn

Answer


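There are several approaches. Here are a few simple steps that do not require changing much of what you have already done.

  1. Check the model's performance after scaling the data. If the predictors are on very different scales, the model may not converge properly. I am assuming that the y you want to predict is also a continuous variable. Check whether it follows a normal distribution. If it does not, applying a log transform helps to normalize it.

  2. Second, random_state is not specified in train_test_split. Is that intentional? Check the performance after setting it to a fixed integer value, so that the split is the same on every run.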
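As a minimal sketch of the second point, assuming synthetic stand-in data (the arrays and the helper split_and_score below are made up for illustration and are not the asker's dataset or code), fixing random_state makes the split, and therefore the score, reproducible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the asker's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)


def split_and_score(seed):
    """Split with a fixed seed, fit, and return the test-set R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)


# The same seed gives the same split, so the two scores should match
score_a = split_and_score(42)
score_b = split_and_score(42)
print(score_a, score_b)
```

With a fixed seed, a change in the score can be attributed to a change in the features or the model rather than to a luckier split.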

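from sklearn.preprocessing import StandardScaler

# Use separate scalers: refitting one scaler on y would overwrite the statistics learned from X
x_scaler = StandardScaler()
y_scaler = StandardScaler()

# Fit the scalers on the training data only
X_train = x_scaler.fit_transform(X_train)
# StandardScaler expects 2D input, so reshape y into a column and flatten it back
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# only transform the test data, else it leads to data leakage
X_test = x_scaler.transform(X_test)
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()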
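The distribution check and log transform mentioned in the first point might look like the sketch below. It applies only if the target really is continuous, as this answer assumes. The target here is synthetic, right-skewed and non-negative, and skewness is used as a rough indicator of asymmetry rather than a formal normality test:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (hypothetical stand-in for a continuous y)
rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Skewness near 0 suggests a roughly symmetric distribution
skew_raw = y.skew()

# log1p computes log(1 + y), which is defined for y >= 0
y_log = np.log1p(y)
skew_log = y_log.skew()

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If the model is trained on the transformed target, its predictions are on the log scale and can be mapped back with np.expm1.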
