python - How to measure the difference between features in a dataframe?
Problem description
I have a dataframe with around 20000 rows and 98 features (all numerical), plus a binary target with values 0 and 1. Basically, there are two populations: the first with target value 1 (50%) and the second with target value 0 (50%), so the data is balanced. I treated this as a classification problem and tried to predict the target value from the features. I implemented a supervised learning algorithm (e.g., SVM) and obtained a very good result, with around 0.95 accuracy. This suggests that the features differ considerably between the two populations. As a next step, I need to know which features are the important ones behind this difference, and what the best way is to quantify the difference in those features between the two groups. Any ideas?
Solution
To rank your features by importance, you can use Weka, which has a powerful toolkit for feature selection. See this blogpost for more info and examples. By the way, Weka also has an SVM implementation. Once you have identified the important features, you can visualize how they differ between the two classes, e.g. by plotting their distributions for each class. Matplotlib has tools like hist or boxplot for this.
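As a rough sketch of the plotting step, the snippet below draws a histogram and a boxplot of one feature, split by class. The dataframe is synthetic and the column names (feat_a, feat_b, target) and the helper plot_feature_by_class are made up for illustration; with real data you would pass your own dataframe and feature name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): one feature whose mean
# depends on the class, and one feature that is identical in both classes.
rng = np.random.default_rng(0)
n = 1000
target = np.repeat([0, 1], n // 2)
df = pd.DataFrame({
    "feat_a": rng.normal(loc=target * 2.0, scale=1.0),  # shifted for class 1
    "feat_b": rng.normal(loc=0.0, scale=1.0, size=n),   # no class difference
    "target": target,
})


def plot_feature_by_class(df, feature, target_col="target"):
    """Plot a histogram and a boxplot of `feature`, one series per class."""
    groups = [df.loc[df[target_col] == c, feature] for c in (0, 1)]
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3))

    # Side-by-side histograms of the two classes on shared bins.
    ax_hist.hist(groups, bins=30, label=["class 0", "class 1"])
    ax_hist.set_xlabel(feature)
    ax_hist.set_ylabel("count")
    ax_hist.legend()

    # One box per class, drawn at x positions 1 and 2.
    ax_box.boxplot(groups)
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels(["class 0", "class 1"])
    ax_box.set_ylabel(feature)

    fig.tight_layout()
    return fig


fig = plot_feature_by_class(df, "feat_a")
fig.savefig("feat_a_by_class.png")
```

Calling the same helper with "feat_b" should give two nearly overlapping distributions, which is what an uninformative feature looks like in this kind of plot.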
If you have an SVM with a linear kernel, you can use its coefficients as direct decision weights for the input features.
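The original code for this step is not available here, so what follows is only a minimal sketch of the idea using scikit-learn (an assumption; the answer does not name a library for this part). It fits SVC with kernel="linear" on synthetic data and ranks the features by the absolute value of the learned coefficients in coef_. The feature names are hypothetical, and standardizing the features first is my own addition so that the weights are on a comparable scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): the first feature
# separates the classes, the other two are pure noise.
rng = np.random.default_rng(0)
n = 400
y = np.repeat([0, 1], n // 2)
X = np.column_stack([
    rng.normal(loc=y * 3.0, scale=1.0),  # informative feature
    rng.normal(size=n),                  # noise
    rng.normal(size=n),                  # noise
])
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

# Scale the features so coefficient magnitudes can be compared, then fit
# a linear-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X, y)

# For a binary problem, coef_ has shape (1, n_features).
weights = model.named_steps["svc"].coef_.ravel()

# Rank features by the absolute size of their weight.
ranking = sorted(zip(feature_names, weights), key=lambda fw: abs(fw[1]), reverse=True)
for name, w in ranking:
    print(f"{name}: {w:+.3f}")
```

Note that coef_ is only available for the linear kernel; with a non-linear kernel such as RBF there is no per-feature weight vector to read off in this way.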