python - How to handle StringIndexer and OneHotEncoder in PySpark pipeline stages
Problem description
I get the error below when running this code:
stage_string = [StringIndexer(inputCol=c, outputCol=c + "_string_encoded") for c in categorical_columns]
stage_one_hot = [OneHotEncoder(inputCol=c + "_string_encoded", outputCol=c + "_one_hot") for c in categorical_columns]
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
rf = RandomForestClassifier(labelCol="output", featuresCol="features")
pipeline = Pipeline(stages=[stage_string,stage_one_hot,assembler, rf])
pipeline.fit(df)
Cannot recognize a pipeline stage of type <class 'list'>.
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 97, in _fit
"Cannot recognize a pipeline stage of type %s." % type(stage))
TypeError: Cannot recognize a pipeline stage of type <class 'list'>.
Solution
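The problem is in the statement pipeline = Pipeline(stages=[stage_string,stage_one_hot,assembler, rf]). stage_string and stage_one_hot are each a list of PipelineStage objects, while assembler and rf are individual pipeline stages. Pipeline expects stages to be one flat list of stages, so passing a list as an element raises the TypeError shown above.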
Change your statement as follows:
stages = stage_string + stage_one_hot + [assembler, rf]
pipeline = Pipeline(stages=stages)
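The same flattening can be wrapped in a small helper when the stage groups are built in several places. The sketch below is only an illustration: flatten_stages is a hypothetical name that does not exist in PySpark, and plain strings stand in for the real stage objects so the snippet runs without a Spark installation.

```python
from typing import Any, List


def flatten_stages(*groups: Any) -> List[Any]:
    """Return one flat list from a mix of single stages and lists of stages.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    flat: List[Any] = []
    for group in groups:
        if isinstance(group, (list, tuple)):
            flat.extend(group)  # a list of stages: add each element
        else:
            flat.append(group)  # a single stage: add it as-is
    return flat


# Placeholder strings stand in for the StringIndexer, OneHotEncoder,
# VectorAssembler and RandomForestClassifier objects from the question.
stage_string = ["indexer_a", "indexer_b"]
stage_one_hot = ["encoder_a", "encoder_b"]

stages = flatten_stages(stage_string, stage_one_hot, "assembler", "rf")
print(stages)
```

With the real objects from the question, the result of flatten_stages(stage_string, stage_one_hot, assembler, rf) would be passed as Pipeline(stages=...), which is equivalent to the list concatenation shown above.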