python - What does the high VIF for the constant term (intercept) indicate?
Problem description
I am building a linear regression model on a car dataset using the RFE technique and the statsmodels library. My final model has p-values well within 5% and a high F-statistic. The VIF values for the predictor variables are well below 5, but for the constant term (intercept) the VIF is 8.18. I used the add_constant method to add a constant to the model. These are my questions:
- What does a high VIF for the constant indicate?
- Should I ignore the constant term when calculating VIF?
These are my results:
I am new to machine learning and am posting a question on this site for the first time. Please let me know if any more information is needed to answer my question.
Solution
Statistical questions are better asked on stats.stackexchange. However, I just went through this for statsmodels, e.g. https://github.com/statsmodels/statsmodels/issues/2376
First, there is no multicollinearity problem in your model and data. The p-values are low and the confidence intervals are fairly narrow, so the parameters in the model should be good estimates. A VIF of 8 is not large.
A large VIF for the constant indicates that the (slope) explanatory variables also have a large constant component. An example would be a variable that has a large mean but only a small variance. An example of perfect collinearity with the constant, and of rank deficiency of the design matrix, is the dummy variable trap: if we do not remove one of the levels of a categorical variable in dummy encoding, the dummies sum to 1 and therefore replicate a constant.
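Both situations can be reproduced on synthetic data. The following is a minimal NumPy sketch, not the statsmodels implementation: `vif_uncentered` is a hypothetical helper that regresses one column on the others without adding an intercept and returns 1 / (1 - R²) using the uncentered R², and the data (a variable with mean 100 and standard deviation 1, and a three-level categorical variable) are made up for illustration.

```python
import numpy as np

def vif_uncentered(X, idx):
    # Hypothetical helper (an assumption, not a library function):
    # regress column idx on the remaining columns, without adding an
    # intercept, and return 1 / (1 - uncentered R^2).
    y = X[:, idx]
    others = np.delete(X, idx, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1.0 - (resid @ resid) / (y @ y)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 500
const = np.ones(n)

# Large mean, small variance: the variable is almost a multiple of the constant.
x_big_mean = 100.0 + rng.normal(size=n)
# Same spread, but mean close to zero: almost orthogonal to the constant.
x_zero_mean = rng.normal(size=n)

vif_const_big_mean = vif_uncentered(np.column_stack([const, x_big_mean]), 0)
vif_const_zero_mean = vif_uncentered(np.column_stack([const, x_zero_mean]), 0)
print("VIF of constant, large-mean variable:", vif_const_big_mean)
print("VIF of constant, zero-mean variable: ", vif_const_zero_mean)

# Dummy variable trap: keep all three levels AND the constant.
levels = rng.integers(0, 3, size=n)
dummies = np.eye(3)[levels]            # one-hot encoding, rows sum to 1
X_trap = np.column_stack([const, dummies])
rank_trap = np.linalg.matrix_rank(X_trap)
print("columns:", X_trap.shape[1], "rank:", rank_trap)
```

In the first case the VIF of the constant is very large only because of the variable's mean, and in the trap case the design matrix has four columns but rank three, so the VIF of the constant is not even finite.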
The purpose of including the constant in the VIF computation is to discover this kind of problem with the design matrix exog provided by the user. It would not show up if we computed VIF on demeaned or standardized explanatory variables.
There has been a long-standing debate in statistics and econometrics about whether multicollinearity measures should include a constant or work only with demeaned explanatory variables.
I am currently preparing an extension to statsmodels that gives users the option to compute both versions, with and without the constant. In some cases reparameterization, such as demeaning and scaling, can improve numerical precision and prediction. So we want to have measures that check the actual design matrix provided by users, but that also check a standardized version of the data to see whether demeaning and scaling might improve numerical precision.
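One way to see the numerical side of this is to compare the condition number of the design matrix before and after standardizing. This is only a sketch on made-up data, and using the condition number as a proxy for numerical precision is my assumption here, not necessarily the measure the statsmodels extension uses.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
const = np.ones(n)
# Synthetic variable with a large mean and a small variance (assumed data).
x = 100.0 + rng.normal(size=n)

# Design matrix as provided by the user.
X_raw = np.column_stack([const, x])

# Reparameterized version: demean and scale the explanatory variable.
x_std = (x - x.mean()) / x.std()
X_std = np.column_stack([const, x_std])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
print("condition number, raw design:         ", cond_raw)
print("condition number, standardized design:", cond_std)
```

The raw design is far worse conditioned than the standardized one, even though both describe the same regression up to a reparameterization of the intercept and slope.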