Decision tree: handling string values takes a long time, but float values work fine. How to make sense of this?

Problem description

I am trying to build a decision tree classifier with the code below:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

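My data looks like this:

age  type_income  loan_purpose  loan_amount  offer
18   Student      study         500          yes
18   Student      study         600          yes
18   Student      study         700          yes
18   Student      study         800          yes
...

With this data, the decision tree raises an error saying it cannot convert "Student" to a float.

What can I do to solve this? I don't want to convert the data to floats manually through preprocessing; I would like the algorithm itself to handle it. Is there a parameter that takes care of this automatically?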

Tags: python, machine-learning, decision-tree

Solution


sklearn expects all inputs to be numeric, which is why there is no option to convert categorical variables to floats automatically. You will have to do some kind of preprocessing yourself.

However, there is a rather convenient option: use one-hot encoding for your categorical data (assuming there are not too many different levels for those factors; in your example, type_income and loan_purpose). Just converting the strings to floats (e.g. Student -> 0, Employee -> 1) is not advisable, because sklearn will then assume an ordering Student < Employee.
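As a rough illustration of the one-hot approach, here is a minimal sketch that wraps the encoding and the tree in a single pipeline, so the strings never have to be converted by hand. The first four rows come from the question; the two "Employee" rows, the "car" and "house" purposes, the new applicant, and the variable names are hypothetical, added only so that both classes appear. It assumes pandas and a reasonably recent scikit-learn are installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# First four rows are from the question; the "Employee" rows are
# hypothetical and exist only to give the tree a second class.
data = pd.DataFrame({
    "age": [18, 18, 18, 18, 35, 40],
    "type_income": ["Student", "Student", "Student", "Student",
                    "Employee", "Employee"],
    "loan_purpose": ["study", "study", "study", "study", "car", "house"],
    "loan_amount": [500, 600, 700, 800, 5000, 90000],
    "offer": ["yes", "yes", "yes", "yes", "no", "no"],
})

X = data.drop(columns="offer")
y = data["offer"]

# One-hot encode the string columns; pass the numeric columns through as-is.
categorical = ["type_income", "loan_purpose"]
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# The pipeline applies the encoder before the tree on both fit and predict.
model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# A hypothetical new applicant, still described with raw strings.
new_applicant = pd.DataFrame({
    "age": [18],
    "type_income": ["Student"],
    "loan_purpose": ["study"],
    "loan_amount": [650],
})
prediction = model.predict(new_applicant)
print(prediction[0])
```

Because the encoder lives inside the pipeline, the same transformation is applied at prediction time, and handle_unknown="ignore" keeps an unseen category from raising an error (it is encoded as all zeros instead).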

I suggest you take a look at section 4.3.5 of this documentation page

