Problem subsetting a data frame by a factor variable in R

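Problem description

The factor has nine levels: the first three denote metro areas and the last six denote non-metro areas.

- First I tried to subset by the metro levels...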

metro <- subset(train, area__rucc == c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

This seemed to work as expected and returned a df with 387 observations.

- Next I tried to subset by the non-metro levels like this...

not_metro <- subset(train, area__rucc != c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

This returned 2811 observations, but on closer inspection the df contains both Metro and non-Metro levels. Clearly it did not work as I expected.

- My third attempt...

non_metro <- subset(train, area__rucc == c("Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area", 
                "Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area"))

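Here I explicitly listed the non-metro levels (4:9). This returned a df with 354 observations, all of them non-Metro.

387 (metro) + 354 (non-metro) != 3189. There are no missing values in train$area__rucc, so the two dfs I am trying to create from train should together contain the same number of observations as the original df, right?

I have a feeling I am making a silly mistake that I cannot see right now (probably from inexperience), or that I fundamentally misunderstand what I am trying to do here, but this is starting to frustrate me. Any insight would be greatly appreciated.

Tags: r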

Solution


I'm not sure exactly what end result you want, but I think a tidyverse approach like this should work:

    library(dplyr)  # provides %>%, mutate() and group_by()
    train %>%
        # metro = 1 for the three Metro levels, 0 for every other level
        mutate(metro = ifelse(area__rucc == "Metro - Counties in metro areas of 1 million population or more" | area__rucc == "Metro - Counties in metro areas of 250,000 to 1 million population" |
                              area__rucc == "Metro - Counties in metro areas of fewer than 250,000 population", 1, 0)) %>%
        group_by(metro)
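
A likely reason the counts in the question do not add up is R's vector recycling. `area__rucc == c(...)` does not ask "is this row one of these levels?". It compares row 1 with the first string, row 2 with the second, row 3 with the third, then starts over, so most matching rows are missed, and `!=` keeps rows that merely differ from whichever string they happened to be paired with. When the column length is an exact multiple of the vector length, R gives no warning. The usual set-membership test is `%in%`. Below is a minimal sketch on a toy data frame; `train_toy` and the shortened level names are made up for illustration and are not from the original data.

```r
# Hypothetical stand-in for `train`, with shortened level names
train_toy <- data.frame(
  id = 1:6,
  area__rucc = factor(c("Metro B", "Metro A", "Nonmetro C",
                        "Metro B", "Metro A", "Nonmetro D"))
)
metro_levels <- c("Metro A", "Metro B")

# `==` recycles metro_levels along the column (A, B, A, B, A, B),
# so a row matches only if it lines up with the "right" element
by_equals <- subset(train_toy, area__rucc == metro_levels)

# `%in%` tests each row against the whole set of levels
metro     <- subset(train_toy, area__rucc %in% metro_levels)
not_metro <- subset(train_toy, !(area__rucc %in% metro_levels))

nrow(by_equals)                                   # 2: two Metro rows were missed
nrow(metro)                                       # 4
nrow(metro) + nrow(not_metro) == nrow(train_toy)  # TRUE: the two pieces partition the data
```

Applied to the question, the same pattern would be `subset(train, area__rucc %in% c(...))` for the metro rows and `subset(train, !(area__rucc %in% c(...)))` for the rest, after which the two row counts should sum to `nrow(train)`.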
