IndexError: index 1967 is out of bounds for axis 0 with size 1967

Problem description

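I am reducing the number of features in a large sparse file by computing p-values, but I get the error below. I have looked at similar posts, and this code works for non-sparse input. Can you help? (I can upload the input file if needed.)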

import statsmodels.formula.api as sm

def backwardElimination(x, Y, sl, columns):
    numVars = len(x[0])
    pvalue_removal_counter = 0

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)

        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    pvalue_removal_counter += 1
                    columns = np.delete(columns, j)

    regressor_OLS.summary()
    return x, columns

Output:

0 of 1970
1 of 1970
2 of 1970
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    selected_columns)
  File "main.py", line 101, in backwardElimination
    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
IndexError: index 1967 is out of bounds for axis 0 with size 1967

Tags: python, numpy, statsmodels, p-value, index-error

Solution


Here is a fixed version.

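I made a few changes:

  1. Import OLS from statsmodels.api, which is the correct module here.
  2. Generate columns inside the function.
  3. Use np.argmax to find the position of the largest p-value.
  4. Use boolean indexing to select columns. In pseudocode, x[:, [True, False, True]] keeps columns 0 and 2.
  5. Stop when there is nothing left to drop.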
import numpy as np
# The formula interface is not used, so import statsmodels.api rather than statsmodels.formula.api
import statsmodels.api as sm

def backwardElimination(x, Y, sl):
    numVars = x.shape[1]  # variables in columns
    columns = np.arange(numVars)

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        # Largest p-value among the remaining regressors
        maxVar = np.max(regressor_OLS.pvalues)

        if maxVar > sl:
            # Use boolean selection
            retain = np.ones(x.shape[1], bool)
            drop = np.argmax(regressor_OLS.pvalues)
            # Drop the highest pvalue(s)
            retain[drop] = False
            # Keep the columns of x we wish to retain
            x = x[:, retain]
            # Also keep their column indices
            columns = columns[retain]
        else:
            # Exit early once every remaining p-value is at or below sl
            break

    # Show the final summary
    print(regressor_OLS.summary())
    return x, columns
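
Changes 3 and 4 from the list above can be seen in isolation on a small array. This is only a sketch: the matrix values and the p-values are made up for illustration, whereas in the function the p-values come from the OLS fit.

```python
import numpy as np

# Toy design matrix: 4 observations, 3 columns (values are made up).
x = np.arange(12.0).reshape(4, 3)
# Hypothetical p-values, one per column.
pvalues = np.array([0.02, 0.40, 0.07])

retain = np.ones(x.shape[1], dtype=bool)  # start by keeping every column
drop = np.argmax(pvalues)                 # position of the largest p-value
retain[drop] = False                      # mark exactly one column for removal

x_reduced = x[:, retain]                  # boolean mask keeps columns 0 and 2
columns = np.arange(x.shape[1])[retain]   # original indices of the kept columns

print(drop, retain, columns, x_reduced.shape)
```

Because np.argmax returns a single position, each pass removes one column at most, even when several p-values share the maximum.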

You can test backwardElimination with random data:

x = np.random.standard_normal((1000,100))
y = np.random.standard_normal(1000)
backwardElimination(x,y,0.1)
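
As for why the original loop raised the IndexError: the answer does not spell this out, so the following is a reading of the question's code rather than a confirmed diagnosis. The inner loop deletes a column for every j whose p-value equals the maximum, so tied p-values (which duplicate columns in sparse data could produce) remove more than one column in a single pass. The sketch below mimics that loop with hypothetical p-values and placeholder data.

```python
import numpy as np

# Hypothetical p-values with a tie for the maximum.
pvalues = np.array([0.9, 0.1, 0.9, 0.3])
x = np.zeros((5, 4))  # placeholder data, only the shape matters here
max_p = pvalues.max()

# Mimics the inner loop of the original function: delete on every match.
x_loop = x
for j in range(len(pvalues)):
    if pvalues[j] == max_p:
        x_loop = np.delete(x_loop, j, 1)

# The argmax approach removes exactly one column per pass.
x_argmax = np.delete(x, np.argmax(pvalues), 1)

print(x_loop.shape, x_argmax.shape)
```

After such a pass, x has fewer columns than numVars - i, so on a later pass j runs past the end of pvalues. That matches the traceback, where the loop expected 1968 entries at i = 2 but only 1967 remained.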
