Scikit-learn decision tree is not deterministic

Problem description

I am performing recursive feature elimination with cross-validation (RFECV) to find the optimal number of features. Since I will compare different hyperparameters and different methods for handling imbalanced data at a later stage, I want the selected features to be deterministic, so I used a decision tree as the estimator. However, every time I run the code below I get a different optimal number of features. I use a constant random state throughout, so I cannot understand why the results differ between runs.
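
Before the full function below, a quick sanity check isolates the tree itself. This is a minimal sketch on synthetic data from make_classification (not my actual dataset): it fits the same seeded DecisionTreeClassifier several times and compares the feature importances, which should be identical on every run if the tree is deterministic.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real dataset is not reproduced here
X, y = make_classification(n_samples=500, n_features=20, random_state=123)

importances = []
for _ in range(3):
    tree = DecisionTreeClassifier(random_state=123).fit(X, y)
    importances.append(tree.feature_importances_)

# A deterministic tree yields identical importances across repeated fits
print(all(np.array_equal(importances[0], imp) for imp in importances))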

from sklearn import metrics, model_selection
from sklearn.feature_selection import RFECV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

RANDOM_ST = 123

def featureSelection(train, train_labels, test, test_labels):

    # Use kNN to illustrate effectiveness of feature selection. 
    clf = KNeighborsClassifier()

    # train the classifier
    clf = clf.fit(train, train_labels['gname_code'])

    # predict the class for unseen examples
    preds = clf.predict(test)

    # initial accuracy
    score = metrics.accuracy_score(test_labels['gname_code'], preds)
    print('Initial Result', score)

    # Decision tree for feature selection
    # RF is probably a better way to do feature selection, but I want it to be
    # deterministic for comparing unbalanced-data methods later, so use a decision tree instead.
    #estimator = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=RANDOM_ST)
    estimator = DecisionTreeClassifier(random_state=RANDOM_ST)

    # Custom cv so I can seed with random state => results are comparable between different options later
    rskv = model_selection.RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=RANDOM_ST)

    # Greedy Feature Selection
    rfecv = RFECV(estimator, cv=rskv, n_jobs=-1)
    rfecv.fit(train, train_labels['gname_code'])

    # optimal number of features
    print('Optimal no. of features is: ', rfecv.n_features_)

    # drop the un-informative features
    train = train.iloc[:, rfecv.support_]
    test = test.iloc[:, rfecv.support_]

    # test again now
    clf = KNeighborsClassifier()
    clf = clf.fit(train, train_labels['gname_code'])
    preds = clf.predict(test)
    score = metrics.accuracy_score(test_labels['gname_code'], preds)
    print('Result after feature selection: ', score)

    return train, train_labels, test, test_labels
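
To narrow down where the variation enters, the RFECV step can also be run twice in isolation and its outputs compared directly. The sketch below (again on synthetic make_classification data, an assumption for illustration) mirrors the setup above but sets n_jobs=1 to rule out the parallel backend as a variable:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

RANDOM_ST = 123

# Synthetic stand-in for the original train / train_labels
X, y = make_classification(n_samples=500, n_features=20, random_state=RANDOM_ST)

def run_rfecv():
    estimator = DecisionTreeClassifier(random_state=RANDOM_ST)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=RANDOM_ST)
    # n_jobs=1 removes parallel execution as a possible source of variation
    return RFECV(estimator, cv=cv, n_jobs=1).fit(X, y)

a, b = run_rfecv(), run_rfecv()
print('Same optimal feature count:', a.n_features_ == b.n_features_)
print('Same feature mask:', np.array_equal(a.support_, b.support_))

If both checks print True, the selection itself is reproducible, and any differences must come from elsewhere (for example, data that changes between runs).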

Tags: python, python-3.x, machine-learning, scikit-learn, decision-tree

Solution

