AttributeError: 'PipelinedRDD' object has no attribute '_jdf'

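Problem description

I am very new to PySpark. I get an AttributeError when I try to run a logistic regression. I am trying to run the logistic regression on MinMaxScaler output vectors to get probability values for possible matchups between data points.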

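# df2 (a DataFrame) and sc (the SparkContext) are assumed to be defined earlier in the session
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.classification import LogisticRegression

number_games = df2.filter(df2.GAME_ID > 22000000).filter(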
    df2.GAME_ID < 40000000).groupby("TEAM_ABBREVIATION").agg(
    (F.sum("FGM") / F.countDistinct("GAME_ID")).alias('Points_Per_Game'))

vectorassembler = VectorAssembler(inputCols=["Points_Per_Game"],
                                  outputCol="Performance")
scaler = MinMaxScaler(inputCol="Performance", outputCol="Output")

vectors = vectorassembler.transform(number_games)
scaler_model = scaler.fit(vectors)
scaler_data = scaler_model.transform(vectors)
statistics_teams = scaler_data.select('TEAM_ABBREVIATION',
                                      'Output')  # teams match up against one another
statistics_teams

RDD2 = sc.parallelize(statistics_teams.collect())
# RDD4 = RDD2.map( lambda x: x.split()) even as a pipelineRDD I get the same attribute error

lr = LogisticRegression(maxIter=20, regParam=0.001)
logistic_model = lr.fit(RDD2)

logistic_model.show()

Error returned

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-46-3c0eb05824a8> in <module>
      1 lr = LogisticRegression(maxIter=20, regParam=0.001)
----> 2 logistic_model = lr.fit(RDD4)
      3 
      4 logistic_model.show()

c:\users\user\appdata\local\programs\python\python39\lib\site-packages\pyspark\ml\base.py in fit(self, dataset, params)
    159                 return self.copy(params)._fit(dataset)
    160             else:
--> 161                 return self._fit(dataset)
    162         else:
    163             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

c:\users\user\appdata\local\programs\python\python39\lib\site-packages\pyspark\ml\wrapper.py in _fit(self, dataset)
    333 
    334     def _fit(self, dataset):
--> 335         java_model = self._fit_java(dataset)
    336         model = self._create_model(java_model)
    337         return self._copyValues(model)

c:\users\user\appdata\local\programs\python\python39\lib\site-packages\pyspark\ml\wrapper.py in _fit_java(self, dataset)
    330         """
    331         self._transfer_params_to_java()
--> 332         return self._java_obj.fit(dataset._jdf)
    333 
    334     def _fit(self, dataset):

AttributeError: 'PipelinedRDD' object has no attribute '_jdf'

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution


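In this case, can you try calling .fit() on the actual DataFrame, statistics_teams? I believe LogisticRegression works on DataFrames, not RDDs. The traceback points the same way: _fit_java passes dataset._jdf to the JVM, and only a DataFrame carries that attribute, so anything built with sc.parallelize() or .map() will fail at that line. Note also that LogisticRegression looks for columns named features and label by default, so you would need to set featuresCol="Output" and provide a label column, which statistics_teams does not have yet.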

