r - problem with decision tree applied to dataset
Problem description
I was experimenting with building a decision tree in R and decided to use the car dataset from UCI, available here.
According to the authors, it has 7 attributes, which are:
CAR car acceptability
. PRICE overall price
. . buying buying price
. . maint price of the maintenance
. TECH technical characteristics
. . COMFORT comfort
. . . doors number of doors
. . . persons capacity in terms of persons to carry
. . . lug_boot the size of luggage boot
. . safety estimated safety of the car
So I want to use a decision tree (DT) as a classifier to predict the car acceptability from the buying price, maint, comfort, doors, persons, lug_boot and safety.
First I extracted the first column as the dependent variable, and then I noticed that the data was arranged in order of the value of the first column (very high, high, medium, low). For this reason, I decided to shuffle the data. My code is the following:
car_data<-read.csv("car.data")
library(C50)
set.seed(12345)
car_data_rand<-car_data[order(runif(1727)),]
car_data<-car_data_rand
car_data_train<-car_data[1:1500,]
car_data_test<-car_data[1501:1727,]
answer<-data_train$vhigh
answer_test<-data_test$vhigh
#deleting the dependent variable or y from the data
car_data_train$vhigh<-NULL
car_data_test$vhigh<-NULL
car_model<-C5.0(car_data_train,answer)
summary(car_model)
Here I get an awful error rate:
Evaluation on training data (1500 cases):
Decision Tree
----------------
Size Errors
7 967(64.5%) <<
What am I doing wrong?
Solution
In the middle of your code you have data_train and data_test instead of car_data_train and car_data_test. Apart from that, although the error is high, nothing is wrong. Note that
1 - table(answer) / length(answer)
# answer
# high low med vhigh
# 0.7466667 0.7566667 0.7426667 0.7540000
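This means that if you naively always guessed "low", your error would be 75.6%. So there is an improvement of about 11.1 percentage points. The fact that the improvement is rather small means that the predictors are not very good.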
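The baseline comparison above can be sketched in base R. The helper name baseline_error and the toy labels are made up for illustration and are not taken from the car data; the sketch uses the class with the smallest always-guess error as the reference point.

```r
# Hypothetical helper: error rate you would get by always predicting
# one fixed class, computed for every class at once.
baseline_error <- function(labels) {
  1 - table(labels) / length(labels)
}

# Toy labels, invented for illustration (not the car data).
labels <- factor(c("low", "low", "med", "high", "vhigh", "low", "med", "low"))

errs <- baseline_error(labels)
best <- names(which.min(errs))  # class with the lowest always-guess error
best_err <- min(errs)           # error of that naive strategy

# A model is only useful if its error is clearly below best_err.
print(errs)
```

Subtracting the model's error from best_err gives the improvement over naive guessing, which is the comparison made in the text.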
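Finally, there is an inconsistency: you say you want to model the car acceptability, while your code is about the buying variable. Fixing that results in an error of only 1.1%. In this case, however, your sample is very unbalanced: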
1 - table(answer) / length(answer)
# answer
# acc good unacc vgood
# 0.7773333 0.9600000 0.3020000 0.9606667
That is, by always guessing unacc you would already get an error of 30.2%. Still, the improvement of 29.1 percentage points is clearly larger.
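A minimal sketch of the corrected workflow, under some assumptions: car.data sits in the working directory and has no header row (so it is read with header = FALSE, giving 1728 rows instead of 1727), the column names are assigned by hand, and "acceptability" is a name chosen here for the last column. The function name fit_car_tree is hypothetical as well. It also reports the error on the held-out rows, which the original code never used.

```r
# Column names assigned by hand; "acceptability" is an assumed name
# for the last column (the target).
cols <- c("buying", "maint", "doors", "persons", "lug_boot", "safety",
          "acceptability")

# Hypothetical helper: shuffle, split, fit a C5.0 tree on acceptability,
# and return the model with its error on the held-out rows.
fit_car_tree <- function(path = "car.data", n_train = 1500, seed = 12345) {
  car_data <- read.csv(path, header = FALSE, col.names = cols,
                       stringsAsFactors = TRUE)
  set.seed(seed)
  car_data <- car_data[sample(nrow(car_data)), ]  # shuffle the rows

  train <- car_data[seq_len(n_train), ]
  test  <- car_data[(n_train + 1):nrow(car_data), ]

  # Predictors are the first six columns; the target is the last one.
  model <- C50::C5.0(x = train[, cols[1:6]], y = train$acceptability)
  pred  <- predict(model, test[, cols[1:6]])

  list(model = model, test_error = mean(pred != test$acceptability))
}

# Run only if the data file and the C50 package are available.
if (file.exists("car.data") && requireNamespace("C50", quietly = TRUE)) {
  result <- fit_car_tree()
  print(summary(result$model))
  print(result$test_error)
}
```

Reading the file with the default header = TRUE is what turned the first data row into column names, which is why the buying column ended up being called vhigh in the question.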