How to measure the difference between features in a dataframe?

Problem description

I have a dataframe with around 20,000 rows and 98 features (all numerical), plus a binary target feature with values 0 and 1. Essentially there are two populations: the first with target value 1 (50%) and the second with target value 0 (50%), so the data is balanced. Treating this as a classification problem, I tried to predict the target value from the data. I implemented a supervised learning algorithm (e.g., an SVM) and obtained a very good result, with about 0.95 accuracy. This suggests there is a considerable difference between the two groups in terms of the features. As a next step, I need to find out which features are responsible for this difference, and what the best way is to quantify the difference in feature values between these two groups of the population. Any ideas?

Tags: python, machine-learning, statistics, data-mining, feature-selection

Solution


To rank your features by importance, you can use Weka, which has a powerful toolkit for feature selection. See this blog post for more information and examples. (Incidentally, Weka also has an SVM implementation.) Once you have identified the important features, you can visualize how different they are between the two classes, e.g. by plotting their per-class distributions. Matplotlib has tools such as hist and boxplot for this; see the sketch below.
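As a concrete illustration of the plotting suggestion, here is a minimal sketch (not part of the original answer) that overlays per-class histograms and draws side-by-side boxplots with matplotlib. The DataFrame df, the target column name "target", and the feature name "feature_1" are placeholders for your own data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_feature_by_class(df: pd.DataFrame, feature: str, target: str = "target"):
    """Overlay per-class histograms and draw side-by-side boxplots for one feature."""
    groups = list(df.groupby(target))  # [(class_label, sub-DataFrame), ...]
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Overlaid histograms: a visual check of how the two populations differ.
    for label, group in groups:
        ax_hist.hist(group[feature], bins=30, alpha=0.5, label=f"{target}={label}")
    ax_hist.set_xlabel(feature)
    ax_hist.set_ylabel("count")
    ax_hist.legend()

    # Boxplots: quick comparison of medians and spread per class.
    ax_box.boxplot([g[feature] for _, g in groups],
                   labels=[str(label) for label, _ in groups])
    ax_box.set_xlabel(target)
    ax_box.set_ylabel(feature)

    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    # Tiny synthetic demo frame; replace with your own dataframe.
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({
        "feature_1": np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)]),
        "target": [0] * 500 + [1] * 500,
    })
    plot_feature_by_class(demo, "feature_1")

Features whose two per-class distributions barely overlap are the ones most likely driving the 0.95 accuracy you observed.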

If you trained an SVM with a linear kernel, you can use its coefficients directly as decision weights for the input features.
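A minimal sketch of this, assuming scikit-learn's SVC (no specific library is named above) and synthetic placeholder data generated with make_classification in place of your real 20,000 x 98 dataframe:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for your own feature matrix and binary target.
X_arr, y = make_classification(n_samples=2000, n_features=98,
                               n_informative=10, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(98)])

# Standardize first so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

clf = SVC(kernel="linear")
clf.fit(X_scaled, y)

# For a binary problem, coef_ has shape (1, n_features); each entry is the
# weight the linear decision function assigns to the corresponding feature.
weights = pd.Series(clf.coef_.ravel(), index=X.columns)

# Features with the largest absolute weight contribute most to separating
# the two classes.
print(weights.abs().sort_values(ascending=False).head(10))

Note that coef_ is only defined for the linear kernel; with a non-linear kernel this shortcut does not apply.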

