How does Stata treat multiple factor variables in regression?

Problem description

I have a city-year level dataset and run the following regression with city fixed effects:

reg y x i.city 

I think this is equivalent to generating a dummy variable for each of the 300 cities in the data and running (with city 1 as the base level):

reg y x city2 ... city300

However, I need to include year dummies as well. I get the estimates using:

reg y x i.city i.year

Does anyone know what is going on behind this regression in matrix form? Is it the same as generating one dummy for each year and running the following?

reg y x city2 ... city300 year2 ... year20

The reason I ask is that I am trying to code the command from scratch using matrix operations, (X'X)^{-1}(X'y), where X includes the city dummies and the year dummies.

Tags: regression, stata, categorical-data

Solution


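What you are using is called corner-point coding of dummy (0,1) variables, in which each factor (categorical variable) with k levels is represented by k-1 binary (0,1) variables. If you specify that no constant term should be used: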

reg y x i.city i.year, nocon

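then sum-to-zero constraint coding will be used to construct the binary variables, and the X matrix will also contain a binary variable for city1 and for year1. Note that the full set of city dummies and the full set of year dummies each sum to a column of ones, so both full sets cannot enter X together without perfect collinearity; at least one level still has to be dropped.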
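To tie this back to the goal of computing (X'X)^{-1}(X'y) by hand, here is a minimal Stata/Mata sketch for the default case with a constant. It uses the variable names y, x, city and year from the question; the toy data, the helper variables cid and yid, and the matrix names Dc, Dy, X, b and b_hand are made up for illustration. X is assembled as [x, city dummies 2..K, year dummies 2..T, constant], which is the layout the question describes, and the result can be compared with what regress reports.

```stata
* Toy city-year panel (made-up data, only so the sketch runs on its own)
clear
set seed 12345
set obs 60
generate city = ceil(_n/6)             // 10 cities, 6 rows each
generate year = 2000 + mod(_n-1, 6)    // years 2000..2005 within each city
generate x = rnormal()
generate y = 1 + 2*x + 0.3*city + 0.1*(year-2000) + rnormal()

* Reference fit using factor-variable notation
regress y x i.city i.year

* Consecutive integer codes 1..k, which designmatrix() expects
egen cid = group(city)
egen yid = group(year)

mata:
y  = st_data(., "y")
x  = st_data(., "x")
Dc = designmatrix(st_data(., "cid"))    // n x 10, one column per city
Dy = designmatrix(st_data(., "yid"))    // n x 6, one column per year
// drop the first column of each factor (base level), put the constant last
X  = x, Dc[., 2..cols(Dc)], Dy[., 2..cols(Dy)], J(rows(x), 1, 1)
b  = invsym(cross(X, X)) * cross(X, y)  // (X'X)^{-1} X'y
st_matrix("b_hand", b')                 // hand the row vector back to Stata
end

matrix list b_hand
display "difference in coefficient on x: " _b[x] - b_hand[1,1]
```

The column order in b_hand is x, city 2 to 10, year 2 to 6, then the constant. e(b) from regress additionally carries zero entries for the base levels, so compare individual coefficients (for example _b[x] or _b[2.city]) rather than the whole vectors.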

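As you can see (figure below), when dietary retinol concentration (retdiet) is regressed on the male dummy variable, the coefficient on the constant (y-intercept) is the mean of y among females (815), and the coefficient on male is the difference (delta) in y between males and females. However, when two dummy indicators, fem and male, are used and nocon is specified (after the comma), the regression coefficients on fem and male are the means of y (retdiet) in each group.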

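[Figure: regression output for retdiet on the male dummy, and on the fem and male dummies with nocon; image not included]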
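The pattern described for retdiet follows directly from the normal equations. The short derivation below is a sketch in notation that is not in the original answer: F and M denote the female and male groups, n_F and n_M their sizes, and \bar y_F and \bar y_M their group means of y.

```latex
\begin{align*}
\text{With a constant:}\quad
  & y_i = \beta_0 + \beta_1\,\mathrm{male}_i + \varepsilon_i
    \;\Rightarrow\;
    \hat\beta_0 = \bar y_F,\qquad \hat\beta_1 = \bar y_M - \bar y_F \\[4pt]
\text{Without a constant:}\quad
  & y_i = \gamma_F\,\mathrm{fem}_i + \gamma_M\,\mathrm{male}_i + \varepsilon_i \\
  & X'X = \begin{pmatrix} n_F & 0 \\ 0 & n_M \end{pmatrix},\qquad
    X'y = \begin{pmatrix} \sum_{i \in F} y_i \\ \sum_{i \in M} y_i \end{pmatrix} \\
  & \hat\gamma = (X'X)^{-1}X'y
    = \begin{pmatrix} \bar y_F \\ \bar y_M \end{pmatrix}
\end{align*}
```

Because fem and male never equal one for the same observation, X'X is diagonal, and each coefficient reduces to a group sum divided by the group size.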

