Pyspark k-fold cross-validation average RMSE

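Problem description

I am using Pyspark to run a linear regression with k-fold cross-validation on a dataset. At the moment I can only determine the RMSE of the best model, but I want the average RMSE of all the models evaluated during cross-validation. How can I get that average?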

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.functions import col  # needed for col("y") below

file_name = '/tmp/user/userfile/LS.csv'
# `spark` is the active SparkSession (predefined in the pyspark shell)
data = spark.read.options(header='true', inferSchema='true',
                          delimiter=',').csv(file_name)
data.cache()
features = ["x"]
lr_data = data.select(col("y").alias("label"), *features)
(training, test) = lr_data.randomSplit([.7, .3])

vectorAssembler = VectorAssembler(inputCols=features, outputCol="features")
training_ds = vectorAssembler.transform(training)
test_ds = vectorAssembler.transform(test)

lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here

modelEvaluator = RegressionEvaluator()  # metricName defaults to "rmse"

# Parentheses let the chained calls span several lines
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0, 1])
             .build())

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=2)

cvModel = crossval.fit(training_ds)

prediction = cvModel.transform(test_ds)

evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")

rms = evaluator.evaluate(prediction)
print("Root Mean Squared Error (RMSE) on test data = %g" % rms)

Tags: machine-learning, pyspark

Solution


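You just need to extract the other models from the cross-validator; see:

Spark CrossValidatorModel access other models than the bestModel?

Then run a RegressionEvaluator on each of them and compute the average manually.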
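
The "evaluate each model and average" step could look roughly like the sketch below. The helper names `average_metric` and `flatten_sub_models` are made up for illustration and are not part of Pyspark. The sketch assumes the CrossValidator was created with `collectSubModels=True` (available from Spark 2.3), so that `cvModel.subModels` holds one fitted model per fold and parameter map. If only the cross-validated averages are needed, `cvModel.avgMetrics` should already contain one averaged metric per parameter map, in the same order as `estimatorParamMaps`.

```python
from statistics import mean


def average_metric(models, evaluator, dataset):
    """Evaluate every model on `dataset`; return (scores, mean of scores).

    `models` is any iterable of fitted models exposing `transform(dataset)`.
    `evaluator` is anything exposing `evaluate(predictions)`, for example a
    RegressionEvaluator created with metricName="rmse".
    """
    scores = [evaluator.evaluate(m.transform(dataset)) for m in models]
    if not scores:
        raise ValueError("no models to evaluate")
    return scores, mean(scores)


def flatten_sub_models(sub_models):
    """Flatten a fold-by-parameter-map grid of models into one flat list."""
    return [m for fold_models in sub_models for m in fold_models]


# Hypothetical usage with the objects from the question (not executed here):
#   crossval = CrossValidator(..., collectSubModels=True)
#   cvModel = crossval.fit(training_ds)
#   models = flatten_sub_models(cvModel.subModels)
#   rmses, avg_rmse = average_metric(models, evaluator, test_ds)
```

Because the helper only relies on `transform` and `evaluate`, it works with any list of fitted models, however they were collected.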

