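python - Fitting training data with a DecisionTreeRegressor crashes
Question
I'm trying to run a decision tree regression on some training data, but I get an error when I call fit().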
(trainingData, testData) = data.randomSplit([0.7, 0.3])
vecAssembler = VectorAssembler(inputCols=["_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_10"], outputCol="features")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="_11")
dt_model = dt.fit(trainingData)
This produces the following error:
File "spark.py", line 100, in <module>
main()
File "spark.py", line 87, in main
dt_model = dt.fit(trainingData)
File "/opt/spark/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/opt/spark/python/pyspark/ml/wrapper.py", line 295, in _fit
java_model = self._fit_java(dataset)
File "/opt/spark/python/pyspark/ml/wrapper.py", line 292, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
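But the two data structures are exactly the same.
Answer
You are missing two steps: 1. the transform step, and 2. selecting the features and the label from the transformed data. I am assuming the data contains only numeric columns, i.e. no categorical data. Here is a general workflow for training a model with pyspark.ml that should help.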
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
# data processing part
# list all ten feature columns in inputCols; the "..." is a placeholder
vecAssembler = VectorAssembler(inputCols=['col_1','col_2',...,'col_10'], outputCol='features')
# you missed these two steps
trans_data = vecAssembler.transform(data)
final_data = trans_data.select('features','col_11')  # your label column name is col_11
train_data, test_data = final_data.randomSplit([0.7,0.3])
# ml part (regression, as in the question)
dt = DecisionTreeRegressor(featuresCol='features', labelCol='col_11')
dt_model = dt.fit(train_data)
dt_predictions = dt_model.transform(test_data)
# proceed with the model evaluation part after this
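For the evaluation part, here is a minimal sketch of the whole workflow wrapped in one function. It uses the default column names `_1` to `_11` from the question rather than `col_1` to `col_11`. The helper names `feature_columns` and `train_and_evaluate`, the fixed `seed`, and the choice of RMSE as the metric are all assumptions made for illustration, not part of the original answer.

```python
def feature_columns(n_features=10):
    """Return the column names '_1' .. '_n' (hypothetical helper)."""
    return [f"_{i}" for i in range(1, n_features + 1)]


def train_and_evaluate(data, n_features=10, seed=42):
    """Assemble features, fit a DecisionTreeRegressor, return (model, rmse).

    Assumes `data` is a DataFrame with numeric columns _1 .. _{n_features + 1},
    where the last column is the label.
    """
    # Imported here so feature_columns() can be used without a Spark installation.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    label_col = f"_{n_features + 1}"
    assembler = VectorAssembler(
        inputCols=feature_columns(n_features), outputCol="features"
    )

    # Step 1: transform. Step 2: keep only the features and the label.
    final_data = assembler.transform(data).select("features", label_col)
    train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=seed)

    dt = DecisionTreeRegressor(featuresCol="features", labelCol=label_col)
    model = dt.fit(train_data)
    predictions = model.transform(test_data)

    # Root mean squared error on the held-out split (metric choice is an assumption).
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    return model, evaluator.evaluate(predictions)
```

Calling `model, rmse = train_and_evaluate(data)` with the DataFrame from the question should return the fitted model together with its test RMSE. Passing `seed` to `randomSplit` only makes the split reproducible between runs.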