SparkML - Create a df(feature, feature_importance) of a RandomForestRegressionModel

Problem description

I am training a random forest model as follows:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

//Indexer: one StringIndexer per categorical column, producing "<col>Idx"
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}

//HotEncoder: one-hot encode each indexed column, producing "<col>Enc"
//(featuresEnconding is the list of categorical column names to encode)
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}

//Adding features into a feature vector column   
val assembler = new VectorAssembler()
              .setInputCols(featureColumns)
              .setOutputCol("features")


val rf = new RandomForestRegressor()
              .setLabelCol("label")
              .setFeaturesCol("features")

val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)

val pipelineRF = new Pipeline()
                 .setStages(stepsRF)


val paramGridRF = new ParamGridBuilder()
                  .addGrid(rf.maxBins, Array(800))
                  .addGrid(rf.featureSubsetStrategy, Array("all"))
                  .addGrid(rf.minInfoGain, Array(0.05))
                  .addGrid(rf.minInstancesPerNode, Array(1))
                  .addGrid(rf.maxDepth, Array(28,29,30))
                  .addGrid(rf.numTrees, Array(20))
                  .build()


//Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")

//Using cross validation to train the model
//Started with a train/validation split - cross-validation has been taking very long so far
val cvRF = new CrossValidator()
.setEstimator(pipelineRF)
.setEvaluator(evaluatorRF)
.setEstimatorParamMaps(paramGridRF)
.setNumFolds(10)
.setParallelism(3)

//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)

What I want now is to get the importance of each feature in the model after training.

I can get the importance of each feature as an Array[Double] like this:

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

val size = bestModel.stages.size-1

val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray

But I only get the importance values with a numeric index, and I don't know which feature in my model each importance value corresponds to.

I should also mention that, since I am using the one-hot encoder, the final number of features is much larger than the original featureColumns array.
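
To make the mismatch concrete, at this point the importances can only be paired with their slot index; a minimal sketch reusing the ftrImp and featureColumns values defined above:

// Sanity check: featureImportances has one entry per assembled vector slot,
// which is larger than featureColumns because of the one-hot expansion.
println(s"original columns:           ${featureColumns.length}")
println(s"feature importance entries: ${ftrImp.length}")

// Without names, the only pairing available so far is importance -> slot index.
ftrImp.zipWithIndex
  .sortBy { case (imp, _) => -imp }
  .take(10)
  .foreach { case (imp, idx) => println(f"slot $idx%4d -> importance $imp%.4f") }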

How can I extract the names of the features that were used during model training?

Tags: scala, apache-spark-ml

Solution


I found this possible solution:

import org.apache.spark.ml.attribute._
import org.apache.spark.sql.functions.desc
import spark.implicits._   // needed for .toDF on the RDD below

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]

// "predictions" is the DataFrame obtained by transforming a dataset with the fitted
// model (e.g. cvRFModel.transform(...)); its features column carries the ML metadata.
val schema = predictions.schema

// The assembled features column stores one ML attribute per vector slot, including
// the expanded one-hot slots, so the slot names can be read back from the metadata.
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)

// Pair each slot name with its importance (ftrImp from above) and sort descending.
val mdf = sc.parallelize(mfeatures zip ftrImp)
  .toDF("featureName", "Importance")
  .orderBy(desc("Importance"))

display(mdf)   // Databricks notebooks; use mdf.show(false) elsewhere
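
The resulting mdf has one row per slot of the assembled feature vector. Since the one-hot encoding expands each categorical column into several such slots, a natural follow-up is to roll the slot importances back up to the original columns. A minimal sketch, assuming each slot name starts with the name of the column it came from (which appears to match how the metadata names are built above); perColumn is just an illustrative name:

// Hypothetical roll-up: sum the importances of all slots that belong to the same
// original column from featureColumns. Prefix matching is used here, so results can
// be ambiguous if one column name is a prefix of another.
val perColumn = (mfeatures zip ftrImp)
  .groupBy { case (slotName, _) =>
    featureColumns.find(c => slotName.startsWith(c)).getOrElse(slotName)
  }
  .mapValues(_.map(_._2).sum)
  .toSeq
  .sortBy { case (_, imp) => -imp }

perColumn.foreach { case (col, imp) => println(f"$col%-30s $imp%.4f") }

This keeps the same total importance mass, just aggregated per original input column rather than per one-hot slot.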
