When should data binning be used in data processing?

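Problem description

In data preprocessing, data binning is a technique for converting the continuous values of a feature into categorical ones. For example, the values of an age feature in a dataset are sometimes replaced by one of several intervals, such as: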

[10,20),
[20,30),
[30,40].

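When is the best time to use data binning? Does it (always) produce better results in a predictive system, or is it a matter of trial and error?

Tags: data-science, data-mining, data-processing, binning, feature-engineering

Solution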


Mostly trial and error. Binning a continuous variable always throws away some information. Many algorithms prefer continuous input for making predictions, and many others bin continuous input themselves. Binning is worth applying when your continuous variable is noisy, meaning its values were not recorded very accurately; grouping nearby values can then reduce that noise. Common strategies include equal-width binning and equal-frequency binning. I would recommend avoiding equal-width binning when your continuous variable is unevenly distributed, because most observations then fall into a few bins while the rest stay nearly empty.
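To make the difference between the two strategies concrete, here is a minimal sketch using pandas. The data is synthetic (a hypothetical right-skewed feature drawn from a log-normal distribution), and the choice of four bins is arbitrary; neither comes from the question. pd.cut gives equal-width bins and pd.qcut gives equal-frequency bins.

```python
import numpy as np
import pandas as pd

# Assumption: a synthetic, right-skewed feature standing in for real data.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Equal-width binning: 4 intervals that each span the same range of values.
equal_width = pd.cut(values, bins=4)

# Equal-frequency binning: 4 intervals that each hold about the same
# number of observations (edges are placed at the quartiles).
equal_freq = pd.qcut(values, q=4)

# Count observations per bin, keeping the bins in interval order.
width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print("Equal-width bin counts:")
print(width_counts)
print("Equal-frequency bin counts:")
print(freq_counts)
```

On skewed data like this, the equal-width counts should be heavily concentrated in the first bin, while the equal-frequency counts stay roughly balanced. That is the reason for avoiding equal-width bins on unevenly distributed variables.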

