首页 > 解决方案 > 回归循环中的国家-年份固定效应(只有一个水平)

问题描述

我使用以下代码运行回归:

res <- lm (c241 ~ x + matchcode , data = df)
summary(res)

wherematchcode是一个变量,它是 Iso3c 代码和年份的组合。例如,对于俄罗斯 2005,它是RUS 2005(参见标题的第一个变量)。这个想法是使用这个匹配码作为一个固定的效果,就像lm上面一样。应用lm上述作品完美

因为我有庞大的数据集(总共超过 4000 个变量):

# A tibble: 450 x 546
   matchcode idstd year  country wt    region income industry sector ownership exporter c201  c202  c203a c203b c203c c203d c2041 c2042 c205a  c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7 c205b8 c205b9 c205b10 c205c c205d c206a c206b c2071
   <chr+lbl> <dbl> <dbl> <chr+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 "BGD 200~  2128 2002  Bangla~ NA    6      1       8       1      2         2        1988  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1    NA    2    
 2 "BGD 200~  2926 2002  Bangla~ NA    6      1       1       1      2         2        2000  1     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    1    
 3 "BGD 200~  2931 2002  Bangla~ NA    6      1       1       1      2         1        1993  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    2    
 4 "BRA 200~ 15303 2003  Brazil~ NA    4      2       9       1      2         2        1946  2     100     0   0      0     2    NA     18.72  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       2     2     1     2    5    
 5 "BRA 200~ 14917 2003  Brazil~ NA    4      2       8       1      2         2        1984  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    3    
 6 "BRA 200~ 14212 2003  Brazil~ NA    4      2      11       1      2         2        1998  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    2    
 7 "KHM 200~ 16067 2003  Cambod~ NA    2      1      23       2      1         1        1993  4      50    50   0      0     2    NA    100.00  1     NA      1      1     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
 8 "KHM 200~ 16233 2003  Cambod~ NA    2      1      10       4      2         2        1989  4     100     0   0      0     2    NA    100.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     2    3    
 9 "KHM 200~ 16002 2003  Cambod~ NA    2      1       3       1      1         1        1990  5       0   100   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
10 "CHN 200~ 17987 2002  China2~ NA    2      2       8       1      1         2        1993  6      55    45   0     NA    NA    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA

我想循环变量如下LINK

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(y~ x, na.action=na.omit)))), .SDcols = -1]

但是,这会在 data.table 中为matchcode变量(以及任何其他字符变量)创建一个 data.table NULL。

空值示例

当我尝试添加matchcode回归时:

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]

或者我使用 lapply matchcode

df <- lapply( df[,-1], function(x) summary(lm(df$c241~ x + df$matchcode)) )

它给出了以下“著名”错误:

 Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels 

虽然我已经读过这个错误可能意味着任何东西,但我的因子确实只有一个级别,但这似乎在单一回归中工作正常(也可以添加其他不是字符的变量)。当使用 data.table 循环或 lapply 时,它不会。我的问题有两个:

1) 为什么变量 variablematchcode在第一种情况 ( res <- lm (c241 ~ x + matchcode , data = df) 中起作用,而在第二种情况下不起作用dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]

2)我能做些什么来规避这个?因为变量对模型至关重要。

也许我可以转换字符变量,或者我可以以某种方式重新编码?

更新:我已使用此链接中的代码:LINK将字符转换为具有一级的因子,最终导致相同的错误。

 ES1sample <- dput(head(ES1sample[, ],10))
structure(list(matchcode = structure(c("BGD 2002 ", "BRA 2003 ", 
"KHM 2003 ", "CHN 2002 ", "CHN 2003 ", "ECU 2003 ", "ERI 2002 ", 
"ETH 2002 ", "GTM 2003 ", "HND 2003 "), label = "", class = c("labelled", 
"character")), idstd = structure(c(2760, 14273, 16104, 17039, 
19095, 22207, 23063, 24046, 25420, 26212), label = "WEB STD FIRMID", format.stata = "%5.0f", class = c("labelled", 
"numeric")), year = structure(c(2002, 2003, 2003, 2002, 2003, 
2003, 2002, 2002, 2003, 2003), format.stata = "%9.0g", label = "", class = c("labelled", 
"numeric")), country = structure(c("Bangladesh2002", "Brazil2003", 
"Cambodia2003", "China2002", "China2003", "Ecuador2003", "Eritrea2002", 
"Ethiopia2002", "Guatemala2003", "Honduras2003"), label = "Country", format.stata = "%21s", class = c("labelled", 
"character")), wt = structure(c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "locations and sectors weights", format.stata = "%9.0g", class = c("labelled", 
"numeric")), region = structure(c(6, 4, 2, 2, 2, 4, 1, 1, 4, 
4), label = "", class = c("labelled", "numeric")), income = structure(c(1, 
2, 1, 2, 2, 2, 1, 1, 2, 2), label = "income grouping for survey year", class = c("labelled", 
"numeric")), industry = structure(c(1, 1, 20, 3, 20, 12, 7, 3, 
NA, 3), label = "Industry", class = c("labelled", "numeric")), 
    sector = structure(c(1, 1, 2, 1, 2, 1, 1, 1, 2, 1), label = "Sector", class = c("labelled", 
    "numeric")), ownership = structure(c(2, 2, 2, 2, 2, 2, 2, 
    2, 1, 1), label = "Ownership", class = c("labelled", "numeric"
    )), exporter = structure(c(2, 2, 2, 2, 2, 1, 2, 2, 2, 1), label = "Export", class = c("labelled", 
    "numeric")), c201 = structure(c(1991, 1993, 1999, 1979, 1998, 
    1997, 1996, 1998, 1990, 1998), label = "Year firm began operations in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c202 = structure(c(2, 2, 4, 6, 6, 2, 6, NA, 
    NA, NA), label = "Current legal status of firm", class = c("labelled", 
    "numeric")), c203a = structure(c(100, 100, 100, 100, 0, 100, 
    0, 100, 0, 0), label = "Percentage of firm owned by domestic private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203b = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 
    100, 100), label = "Percentage of firm owned by foreign private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203c = structure(c(0, 0, 0, 0, 100, 0, 0, 0, 
    0, 0), label = "Percentage of firm owned by government/state", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203d = structure(c(NA, 0, 0, NA, NA, 0, 100, 
    0, 0, 0), label = "Percentage of firm owned by other types of owner", format.stata = "%8.2f", class = c("labelled", 
    "numeric")), c2041 = structure(c(2, 2, 2, NA, NA, 2, 2, 2, 
    2, 2), label = "Firm previously owned by government?", class = c("labelled", 
    "numeric")), c2042 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Year of privatization", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c205a = structure(c(NA, 25, 100, NA, 100, 100, 
    NA, NA, 100, 100), label = "Percentage owned by largest shareholder", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c205b1 = structure(c(NA, NA, NA, NA, NA, NA, 
    NA, NA, 1, 1), label = "Largest shareholder: individual", class = c("labelled", 
    "numeric")), c205b2 = structure(c(NA, 1, 1, NA, NA, 1, NA, 
    NA, NA, NA), label = "Largest shareholder: family", class = c("labelled", 
    "numeric")), c205b3 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: domestic company", class = c("labelled", 
    "numeric")), c205b4 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: foreign company", class = c("labelled", 
    "numeric")), c205b5 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: bank", class = c("labelled", 
    "numeric")), c205b6 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: investment fund", class = c("labelled", 
    "numeric")), c205b7 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm managers", class = c("labelled", 
    "numeric")), c205b8 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm employees", class = c("labelled", 
    "numeric")), c205b9 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: government", class = c("labelled", 
    "numeric")), c205b10 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: other", class = c("labelled", 
    "numeric")), c205c = structure(c(1, 1, 1, NA, 2, 1, 1, 1, 
    1, 1), label = "Is principal shareholder also the manager/director?", class = c("labelled", 
    "numeric")), c205d = structure(c(1, 1, 1, NA, NA, 2, 1, 1, 
    1, 1), label = "Is the principal owner male?", class = c("labelled", 
    "numeric")), c206a = structure(c(1, 1, 1, NA, NA, 1, 0, 1, 
    2, 3), label = "Number of separate operating facilities in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c206b = structure(c(NA, 2, 2, NA, NA, 2, 2, 
    2, 2, 1), label = "Operations in other countries?", class = c("labelled", 
    "numeric")), c2071 = structure(c(1, 5, 1, 1, 2, 1, 5, 1, 
    1, 3), label = "Location of establishment", class = c("labelled", 
    "numeric")), c2072 = structure(c(NA, NA, NA, NA, NA, 1, 5, 
    NA, 1, 3), label = "Location of headquarters", class = c("labelled", 
    "numeric")), c208 = structure(c(NA, 1723, NA, NA, NA, NA, 
    NA, NA, NA, 1810), label = "Main product line", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c209a = structure(c(NA, 1, 2, NA, NA, NA, NA, 
    NA, NA, NA), label = "Other income generating activities", class = c("labelled", 
    "numeric")), c209ba = structure(c(NA, 0, NA, NA, NA, 100, 
    NA, NA, NA, NA), label = "Percent workers' time: manufacturing", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bb = structure(c(NA, 0, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: services", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bc = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: commerce (retail/wholesale trade)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bd = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: construction", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209be = structure(c(NA, 1, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: other", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210a = structure(c(NA, 50, 1, NA, NA, NA, NA, 
    NA, 65, 0), label = "Main product line: firm's share of local market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210b = structure(c(12, 25, 1, 6, 44.6399993896484, 
    50, NA, 5, 45, 0), label = "Main product line: firm's share of national market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c211a1 = structure(c(100, 100, 100, 99, 100, 
    90, 100, 100, 95, 0), label = "Percent of sales sold domestically", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a2 = structure(c(0, 0, 0, 0, 0, 10, 0, 0, 
    5, 100), label = "Percent of sales exported directly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a3 = structure(c(0, 0, 0, 1, 0, 0, NA, 0, 
    0, 0), label = "Percent of sales exported indirectly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211b1 = structure(c(NA, 0, 0, 0, 0, 0, NA, 
    NA, NA, NA), label = "Percentage of domestic sales to government", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c282a2y = structure(c(451645, NA, NA, 49609, 
    1061449, 21, 38966.6171875, 43.0904998779297, NA, NA), label = "Total liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b2y = structure(c(NA, NA, NA, 393, 81012, 
    1, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282c2y = structure(c(NA, NA, NA, 55193, 980437, 
    20, 7687.90576171875, NA, NA, NA), label = "Short-term liabilities (<1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d2y = structure(c(NA, NA, 5500, 55193, 19194, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e2y = structure(c(10000, NA, NA, 5000, 20329, 
    12, 29826.701171875, 40, NA, NA), label = "Equity–share capital 2 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f2y = structure(c(305436, NA, 5500, 916, 
    NA, NA, 1452.00903320312, 6.09049987792969, NA, NA), label = "Retained earnings 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282a3y = structure(c(456618, NA, NA, NA, 1063217, 
    16, 39676.3984375, 43.7501029968262, NA, NA), label = "Total liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b3y = structure(c(NA, NA, NA, NA, 152156, 
    5, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282c3y = structure(c(NA, NA, NA, NA, 911061, 
    11, 6299.255859375, NA, NA, NA), label = "Short-term liabilities (<1 year) 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d3y = structure(c(NA, NA, NA, NA, 20964, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e3y = structure(c(10000, NA, NA, NA, 124531, 
    9, 32368.564453125, 40, NA, NA), label = "Equity–share capital 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f3y = structure(c(282840, NA, NA, NA, NA, 
    NA, 1008.58001708984, 6.75010013580322, NA, NA), label = "Retained earnings 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), gni = structure(c(370, 2760, 300, 970, 1100, 
    1830, 150, 100, 1910, 960), label = "Gross National Income per capita, Atlas Method (current ), World Development Ind", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), pop = structure(c(135683664, 176596256, 13403644, 
    1280400000, 1288400000, 13007942, 4296700, 67217840, 12307091, 
    6968512), label = "Population, Total, in 2005 (World Development Indicators)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), country_proper = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), format.stata = "%22s", label = "", class = c("labelled", 
    "character")), c_abbr = structure(c("BGD", "BRA", "KHM", 
    "CHN", "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), format.stata = "%9s", label = "", class = c("labelled", 
    "character")), countryyear = structure(c("Bangladesh2002", 
    "Brazil2003", "Cambodia2003", "China2002", "China2003", "Ecuador2003", 
    "Eritrea2002", "Ethiopia2002", "Guatemala2003", "Honduras2003"
    ), label = "Country", format.stata = "%21s", class = c("labelled", 
    "character")), iso3c = structure(c("BGD", "BRA", "KHM", "CHN", 
    "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), label = "", class = c("labelled", 
    "character")), cname = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), label = "", class = c("labelled", 
    "character")), cyear = structure(c("2002", "2003", "2003", 
    "2002", "2003", "2003", "2002", "2002", "2003", "2003"), label = "", class = c("labelled", 
    "character"))), .Names = c("matchcode", "idstd", "year", 
"country", "wt", "region", "income", "industry", "sector", "ownership", 
"exporter", "c201", "c202", "c203a", "c203b", "c203c", "c203d", 
"c2041", "c2042", "c205a", "c205b1", "c205b2", "c205b3", "c205b4", 
"c205b5", "c205b6", "c205b7", "c205b8", "c205b9", "c205b10", 
"c205c", "c205d", "c206a", "c206b", "c2071", "c2072", "c208", 
"c209a", "c209ba", "c209bb", "c209bc", "c209bd", "c209be", "c210a", 
"c210b", "c211a1", "c211a2", "c211a3", "c211b1", "c282a2y", "c282b2y", 
"c282c2y", "c282d2y", "c282e2y", "c282f2y", "c282a3y", "c282b3y", 
"c282c3y", "c282d3y", "c282e3y", "c282f3y", "gni", "pop", "country_proper", 
"c_abbr", "countryyear", "iso3c", "cname", "cyear"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x0000000002570788>)

标签: rdatatablelapplylm

解决方案


据我了解,您已尝试进行特征选择,但您选择了一种复杂的方法,并且它产生了您的输出NULL。我同意,你有这么多的特征,你需要在回归之前进行特征选择。

有非常突出的特征选择方法,例如随机森林。随机森林可帮助您检测最佳预测因子。

鉴于我有兴趣预测植物的种类,但我不知道哪个特征可以最好地预测它(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)。所以下面的代码指定了最好的预测器:

library(party)
colnames(iris)
cf1 <- cforest(Species ~ . , data= iris, control=cforest_unbiased(mtry=2,ntree=50))

varimp(cf1)

结果varimp()给你:

“平均降低准确性”重要性分数的向量。换句话说,分数越高,它可能是更好的预测器。

在示例中:

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
 0.047636364  0.002909091  0.354181818  0.227636364

推荐阅读