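pyspark - How do I calculate the Gini index of a pyspark classification model using Spark ML?
Problem description
I am trying to calculate the Gini index of a classification model built with GBTClassifier from pyspark ml. I cannot find a metric that gives the ROC AUC the way roc_auc_score does in Python's sklearn.
Below is the code I have used so far on Databricks. I am currently using one of the datasets that ships with Databricks: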
%fs ls databricks-datasets/adult/adult.data
from pyspark.sql.functions import *
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
dataset = spark.table("adult")
# splitting into train and test data frames
splits = dataset.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
def churn_predictions(train_df,
                      target_col,
                      # algorithm,
                      # model_parameters = conf['model_parameters']
                      ):
    """
    Function arguments
    train_df - training DataFrame
    target_col - target variable in the model
    algorithm - algorithm used (currently commented out)
    model_parameters - model parameters used to fine-tune the model (currently commented out)
    """
    # one-hot encoding and assembling
    encoding_var = [i[0] for i in train_df.dtypes if (i[1]=='string') & (i[0]!=target_col)]
    num_var = [i[0] for i in train_df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!=target_col)]
    string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
    onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
    label_indexes = StringIndexer(inputCol = target_col, outputCol = 'label', handleInvalid = 'keep')
    assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
    gbt = GBTClassifier(featuresCol = 'features', labelCol = 'label',
                        maxDepth = 5,
                        maxBins = 45,
                        maxIter = 20)
    # index and encode the features, assemble them, index the label, then fit the GBT model
    pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, gbt])
    model = pipe.fit(train_df)
    return model
gbt_model = churn_predictions(train_df = train_df,
                              target_col = 'income')
#### prediction in test sample ####
gbt_predictions = gbt_model.transform(test_df)
# display(gbt_predictions)
gbt_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = gbt_evaluator.evaluate(gbt_predictions) * 100
print("Accuracy on test data = %g" % accuracy)
gini_train = 2 * metrics.roc_auc_score(Y, pred_prob) - 1
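As you can see in the last line of code, that is how I would compute the Gini coefficient with sklearn, but pyspark apparently has no metric called roc_auc_score.
Any help with this would be much appreciated.
Solution
Gini is typically used to evaluate binary classification models, where it is derived from the area under the ROC curve: Gini = 2 * AUC - 1.
You can calculate it in pyspark as follows: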
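from pyspark.ml.evaluation import BinaryClassificationEvaluator
# the defaults read the 'rawPrediction' and 'label' columns of the predictions DataFrame
evaluator = BinaryClassificationEvaluator()
# areaUnderROC is the pyspark counterpart of sklearn's roc_auc_score
auc = evaluator.evaluate(gbt_predictions, {evaluator.metricName: "areaUnderROC"})
gini = 2 * auc - 1.0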
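To see what the formula above does without a Spark cluster, here is a minimal pure-Python sketch of the same relationship. It is only an illustration under stated assumptions: the helper names `roc_auc` and `gini_from_auc` and the four-row toy data are made up for this example, and the AUC is computed exactly by pairwise comparison, so the value Spark's evaluator reports on a real dataset may differ slightly.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Hypothetical helper for illustration; ties between scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient of a binary classifier, derived from its ROC AUC."""
    return 2.0 * auc - 1.0


# Toy data (assumed for the example): true labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
gini = gini_from_auc(auc)
print("AUC = %g, Gini = %g" % (auc, gini))
```

A model that ranks no better than chance has an AUC of 0.5 and therefore a Gini of 0, while a perfect ranking gives an AUC of 1 and a Gini of 1.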