python - Best model for variable selection with big data?
问题描述
I posted a question earlier about some code but now I realize I should be more broad with the general idea. Basically, I'm trying to build a statistical model with about 1000 observations and 2000 variables. I would like to determine which variables are most influential in effecting my dependent variable with high significance. I don't plan to use the model for prediction, just for variable selection. My independent variables are binary and dependent variable is continuous. I've tried multiple linear regression and fixed models with tools such as statsmodels and scikit-learn. However, I have encountered issues such as having more variables than observations. I would prefer to solve the problem in python since I have basic knowledge in it. However, stats is very new to me so I don't know the best direction. Any help is appreciated.
Tree method
import pandas as pd
from sklearn import tree
from sklearn import preprocessing
data=pd.read_excel('data_file.xlsx')
y=data.iloc[:, -1]
X=data.iloc[:, :-1]
le=preprocessing.LabelEncoder()
y=le.fit_transform(y)
clf=tree.DecisionTreeClassifier()
clf=clf.fit(X,y)
tree.export_graphviz(clf, out_file='tree.dot')
Or if I output to text file, the first few lines are:
digraph Tree {
node [shape=box] ;
0 [label="X[685] <= 0.5\ngini = 0.995\nsamples = 1097\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 1, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 11, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 3, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n11, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n17, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 14, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 1, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 1\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 1, 2, 13, 2\n1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 7, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 1, 3\n1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 4, 1, 5, 1\n4, 1, 2, 3, 3]"] ;
1 [label="X[990] <= 0.5\ngini = 0.995\nsamples = 1040\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 0, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 0, 1, 1\n1, 1, 1, 1, 1, 9, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 1, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n1, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 0, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n16, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 0, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 0, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 0\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 0, 1, 3, 2\n1, 1, 1, 0, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 6, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 0, 3\n1, 0, 1, 1, 2, 0, 1, 2, 1, 1, 0, 1, 5, 1\n4, 1, 0, 3, 3]"] ;
解决方案
1.Variante A:特征选择 Scikit
对于未来的选择scikit
提供了很多不同的方法:
https ://scikit-learn.org/stable/modules/feature_selection.html
这里总结了上面的评论。
2.Variante B:线性回归的特征选择
如果您对其运行线性回归,您还可以读取您的特征重要性。https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html。该函数reg.coef_
会给你未来的系数,绝对数越高,你的特征越重要,所以例如 0.8 是一个非常重要的未来,而 0.00001 并不重要。
3.Variante C:PCA(不适用于二进制情况)
为什么你想杀死你的变量?我建议您使用:PCA - Principal ocmponent analysis https://en.wikipedia.org/wiki/Principal_component_analysis。
基本概念是将您的 2000 个特征转换为更小的空间(可能是 1000 个或其他),同时在数学上仍然有用。
Scikik-learn
有一个很好的包:https ://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
推荐阅读
- python - 将 2 个 numpy 数组合并到字典中
- reactjs - 如何在下一个带有firebase的js中正确地进行注册流程?
- swagger - Swagger OpenAPI 数组文档作为响应
- javascript - 使用 Express 流式传输结果时获取连接超时
- visual-studio-code - 使用我的大学 (.edu.in) 邮件创建 Azure 帐户以发布我团队的 VS Code 扩展,但无法将个人电子邮件添加为成员
- antlr4 - Antlr4 相邻令牌优先级
- c++ - avcodec_send_frame 返回“无效参数”
- javascript - 从对象中删除项目后 setState 不会触发渲染
- php - 如何使用 Symfony Doctrine 持久化枚举(实体字段类型:“枚举”)
- javascript - 通过 JavaScript 从 D365 下载注释附件