Linear regression with a greedy feature selection algorithm in Python

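Problem description

This is a homework problem for a machine learning course I am taking. I will describe the approach I took as fully as I can, including what worked and what did not.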


We have four dataset files: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We first load them as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("./test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)

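The task is to implement a "greedy feature selection" algorithm that runs until the best 100 of the 126 features have been selected. Essentially, we train a model on each single feature, pick the best one and store it; then we train 125 models, each pairing one remaining feature with the selected one, pick the best of those and store it; and so on until we reach 100 features.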
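The procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the course's reference solution: the helper name `greedy_forward_selection` and the synthetic data are assumptions made for the example, and scikit-learn's `cross_val_score` is swapped in for the hand-written fold loop used in the assignment template.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score


def greedy_forward_selection(X, y, n_select, cv):
    """Return column indices picked by greedy forward selection (hypothetical helper)."""
    selected = []
    for _ in range(n_select):
        best_feature, best_error = None, np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue  # never pick the same column twice
            candidate = selected + [i]  # already-chosen columns plus one new column
            # cross_val_score returns negated MSE, so flip the sign back
            scores = cross_val_score(LinearRegression(), X[:, candidate], y,
                                     cv=cv, scoring="neg_mean_squared_error")
            error = -scores.mean()
            if error < best_error:
                best_feature, best_error = i, error
        selected.append(best_feature)
    return selected


# Synthetic stand-in data (assumption): only columns 2 and 5 carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5] + 0.1 * rng.normal(size=200)

cv_demo = ShuffleSplit(n_splits=5, test_size=1/5, random_state=0)
picked = greedy_forward_selection(X_demo, y_demo, n_select=3, cv=cv_demo)
print(picked)
```

On this toy data the two informative columns should be chosen first, with the stronger one (column 2) leading. The key point is that the comparison between candidates happens inside the loop over `i`, and the winner is appended only after every candidate has been scored.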

Here is the code:

# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code

# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)

selected_train_error = []
selected_valid_error = []

# For greedy selection
for sel in range(sel_num) :
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0

    for i in range(X_dev.shape[1]) :
        train_error_ith = []
        valid_error_ith = []

        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature

        # Your code here
        X_dev_fs = X_dev[:, i]
        if (i in selected_feature):
            continue
        else:
            pass
        # End your code


        # For cross validation
        for train_index, test_index in cv.split(X_dev) : # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]

            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()

            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))

            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))

            # End your code

    # Select best performance feature set on each features
    # You should choose the feature which has minimum mean cross validation error

    # Your code here

    min_train_error = train_error_ith[np.argmin(train_error_ith)]
    min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
    min_feature = np.argmin(valid_error_ith)

    # End your code

    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)


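The idea I had when filling in the "# Your code here" sections was that X_dev_fs would hold the feature of the current iteration together with the previously selected features. We would then use cross-validation to derive the training and CV errors.

The output I currently get when I run this program is: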
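One way this idea might look in code is sketched below. The helper `mean_cv_errors`, the toy data, and the names `selected_so_far` and `X_fs` are assumptions introduced for illustration. The sketch fits a single model per fold on the training rows, then measures the training error on those same rows and the validation error on the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit


def mean_cv_errors(X_sub, y, cv):
    """Mean training and validation MSE over the folds (hypothetical helper)."""
    train_errors, valid_errors = [], []
    for train_index, valid_index in cv.split(X_sub):
        # Fit once per fold, on the training rows only
        model = LinearRegression().fit(X_sub[train_index], y[train_index])
        train_errors.append(
            mean_squared_error(y[train_index], model.predict(X_sub[train_index])))
        valid_errors.append(
            mean_squared_error(y[valid_index], model.predict(X_sub[valid_index])))
    return float(np.mean(train_errors)), float(np.mean(valid_errors))


# Toy data (assumption): columns 0 and 3 drive the target
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 6))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 3] + 0.1 * rng.normal(size=100)

selected_so_far = [0]   # pretend column 0 was chosen in an earlier round
i = 3                   # candidate column being evaluated now
X_fs = X_toy[:, selected_so_far + [i]]   # 2-D slice, so no reshape(-1, 1) is needed
print(X_fs.shape)  # (100, 2)

cv_toy = ShuffleSplit(n_splits=5, test_size=1/5, random_state=0)
train_mse, valid_mse = mean_cv_errors(X_fs, y_toy, cv_toy)
print(train_mse, valid_mse)
```

Indexing with a list of columns keeps the array two-dimensional, which is the shape `LinearRegression.fit` expects, whereas `X_dev[:, i]` with a single integer yields a one-dimensional array holding only the current column.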

==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================

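And so on, with "# of selected feature(s)" continuing up to 100.

The problem is that "Selected feature of this iteration :" should not print the same number more than once. I also cannot figure out how to store the best feature and use it in subsequent iterations.

My questions are:

  1. Why does my selected_feature list contain the same feature repeatedly, and how can I prevent that?

  2. How do I store the best feature in selected_feature and then pair it with each remaining feature in subsequent iterations?


Any feedback is appreciated. Thank you.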


Edit

Here are links to the files I load into the variables, in case anyone needs them.

dev_sample.npy

dev_label.npy

test_sample.npy

test_label.npy

Tags: python, machine-learning, linear-regression, feature-selection

Solution
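A possible reading of the repeated index, offered as a sketch and assuming the indentation shown in the question is what actually ran: train_error_ith and valid_error_ith are reset for every candidate i, so by the time np.argmin is called they hold only the per-fold errors of the last candidate. The result of np.argmin is then a position among the 5 folds, not a feature index, which would explain why the same small number keeps appearing. The numbers below are made up for illustration, and `mean_valid_error` is a hypothetical name.

```python
import numpy as np

# Made-up numbers for illustration: mean validation error of three candidate columns
mean_valid_error = {4: 9.8, 17: 9.1, 42: 9.5}   # feature index -> mean CV error

# A per-fold list for ONE candidate has n_splits entries,
# so argmin over it is a fold position in 0..4, not a feature index
valid_error_last_candidate = [9.9, 9.6, 9.7, 9.8, 10.1]
fold_position = int(np.argmin(valid_error_last_candidate))

# Picking the key with the smallest mean error yields a feature index instead
best_feature = min(mean_valid_error, key=mean_valid_error.get)
print(fold_position, best_feature)  # 1 17
```

Under that reading, a fix would be to average the fold errors for each candidate inside the loop over i, remember the candidate with the lowest mean validation error, and append that candidate's index to selected_feature once the loop over i has finished.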

