AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'areaUnderROC'

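Problem description

I want to compute the area under the ROC curve for my machine learning model, but an AttributeError is raised. My full code, including the error details, is below. I already have a string indexer, a one-hot encoder, and a vector assembler in place. Please refer to the full code: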

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()
    
df=spark.read.csv("2018-2010_import.csv",inferSchema=True,header=True)
    
train, test = df.randomSplit([0.7, 0.3], seed=7)
    
print(f"Train set length: {train.count()} records")
print(f"Test set length: {test.count()} records")

train.dtypes

catCols = [x for (x, dataType) in train.dtypes if dataType == "string"]
numCols = [
    x for (x, dataType) in train.dtypes if dataType == "double" and x != "HSCode"
]

print(numCols)
print(catCols)

train.agg(F.countDistinct("Commodity","Country")).show()

train.groupBy("Commodity","Country").count().show()

from pyspark.ml.feature import (
    OneHotEncoder,
    StringIndexer,
)

string_indexer = [
    StringIndexer(inputCol=x, outputCol=x + "_StringIndexer", handleInvalid="skip")
    for x in catCols
]

one_hot_encoder = [
    OneHotEncoder(
        inputCols=[f"{x}_StringIndexer" for x in catCols],
        outputCols=[f"{x}_OneHotEncoder" for x in catCols],
    )
]

from pyspark.ml.feature import VectorAssembler

assemblerInput = [x for x in numCols]
assemblerInput += [f"{x}_OneHotEncoder" for x in catCols]

vector_assembler = VectorAssembler(
    inputCols=assemblerInput, outputCol="VectorAssembler_features", handleInvalid="skip"
)

stages = []
stages += string_indexer
stages += one_hot_encoder
stages += [vector_assembler]

from pyspark.ml import Pipeline

pipeline = Pipeline().setStages(stages)
model = pipeline.fit(train)

pp_df = model.transform(test)

pp_df.select(
    "HSCode", "Commodity", "value", "Country", "VectorAssembler_features",
).show(truncate=False)
from pyspark.ml.classification import LogisticRegression

data = pp_df.select(
    F.col("VectorAssembler_features").alias("features"),
    F.col("HSCode").alias("label"),
)

model = LogisticRegression().fit(data)

model_summary = model.summary
model_summary.areaUnderROC

AttributeError                            Traceback (most recent call last)
C:\Users\AZMANM~1\AppData\Local\Temp/ipykernel_4856/3039136250.py in <module>
----> 1 model_summary.areaUnderROC

AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'areaUnderROC'

model.summary.pr.show()

AttributeError                            Traceback (most recent call last)
C:\Users\AZMANM~1\AppData\Local\Temp/ipykernel_4856/3388404637.py in <module>
----> 1 model.summary.pr.show()

AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'pr'

Tags: python-3.x, machine-learning, pyspark, logistic-regression, apache-spark-ml

Solution


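You will need to use BinaryClassificationEvaluator. After the train/test split, I named the training set train_set and the test set test_set. Here input_columns is the list of all columns except the label column.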
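
As background on why the attribute is missing: the question uses HSCode as the label, which presumably takes more than two distinct values. In recent PySpark releases, `model.summary` appears to return a binary summary (the one exposing `areaUnderROC`, `roc`, and `pr`) only when the model has at most two classes, and a general `LogisticRegressionTrainingSummary` otherwise. The sketch below shows one way to probe for the attribute instead of assuming it. The helper name `area_under_roc_or_none` and the stub objects are hypothetical stand-ins, not Spark objects.

```python
from types import SimpleNamespace


def area_under_roc_or_none(model):
    """Return summary.areaUnderROC if the training summary exposes it, else None."""
    summary = model.summary
    # Binary training summaries expose areaUnderROC; the general
    # (multiclass) summary does not, so probe instead of assuming.
    return getattr(summary, "areaUnderROC", None)


# Hypothetical stubs that mimic the shape of a fitted model's summary.
binary_like = SimpleNamespace(summary=SimpleNamespace(areaUnderROC=0.91))
multiclass_like = SimpleNamespace(summary=SimpleNamespace(accuracy=0.64))

print(area_under_roc_or_none(binary_like))      # 0.91
print(area_under_roc_or_none(multiclass_like))  # None
```

With a real fitted model, a `None` result would suggest the label is not binary. In that case ROC-based metrics, including BinaryClassificationEvaluator, do not directly apply until the label is reduced to two classes.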

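from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=input_columns, outputCol="features")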

Then call the vector assembler to transform your DataFrame:

    # Assemble the feature vector, then split into training and test sets.
    final_data = assembler.transform(your_dataframe)
    print("Train test split...")
    train_set, test_set = final_data.randomSplit([0.7, 0.3], seed=4000)

    # Fit the model; the label column should hold a binary (0/1) label for ROC to apply.
    lr = LogisticRegression(
        labelCol="label", featuresCol="features", maxIter=10, threshold=0.5
    )
    lr_model = lr.fit(train_set)
    predict_train = lr_model.transform(train_set)
    predict_test = lr_model.transform(test_set)

    # Area under ROC on the held-out test set.
    evaluator = BinaryClassificationEvaluator()
    print("Test Area Under ROC: " + str(evaluator.evaluate(predict_test, {evaluator.metricName: "areaUnderROC"})))
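
To help interpret the number the evaluator prints: area under ROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way on a tiny made-up dataset. It is an illustration only, not how Spark computes the metric (Spark builds the curve from score thresholds, so results on real data may differ slightly), and the function name `area_under_roc` is hypothetical.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. Note that this definition only makes sense when the label has exactly two classes.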
