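r - Conditionally changing the values of a categorical survey response column in a data frame
Question
I am trying to create an object that merges some of the categories of a variable: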
background <- NULL
data$y11[data$y11 == "English/Welsh/Scottish/Northern Irish/British"] <-"White"
data$y11[data$y11 == "Gypsy or Irish Traveller"] <-"White"
data$y11[data$y11 == "Any other White background, please describe"] <-"White"
data$y11[data$y11 == "Irish"] <-"White"
data$y11[data$y11 == "Any other Mixed/Multiple ethnic background, please describe"] <-"Mixed"
data$y11[data$y11 == "White and Asian "] <-"Mixed"
data$y11[data$y11 == "White and Black African "] <-"Mixed"
data$y11[data$y11 == "White and Black Caribbean"] <-"Mixed"
data$y11[data$y11 == "Any other Asian background, please describe"] <-"Asian"
data$y11[data$y11 == "Bangladeshi"] <-"Asian"
data$y11[data$y11 == "Chinese"] <-"Asian"
data$y11[data$y11 == "Indian"] <-"Asian"
data$y11[data$y11 == "Pakistani"] <-"Asian"
data$y11[data$y11 == "Arab"] <-"Arab & Other"
data$y11[data$y11 == "Any other ethnic group, please describ"] <-"Arab & Other"
data$y11[data$y11 == "African"] <-"Black"
data$y11[data$y11 == "Any other Black/African/Caribbean background, please describe"] <-"Black"
data$y11[data$y11 == "Caribbean"] <-"Black"
But I keep getting warning messages about "invalid factor level, NA generated".
Please help!
Solution
Your main problem is that you did not use stringsAsFactors = FALSE when reading in your data (probably with read.csv). You should add it to your read.csv call.
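To see where the warning comes from, here is a minimal sketch on a made-up three-value column (the names y and z are illustrative and not from the question). Assigning a string that is not one of a factor's existing levels stores NA and raises the warning, while the same assignment on a character vector works. It assumes that converting the existing column with as.character() is acceptable when re-reading the file is inconvenient. Since R 4.0.0, read.csv defaults to stringsAsFactors = FALSE, so the warning mostly shows up on older versions or on columns that were made factors explicitly.

```r
# Toy stand-in for a column that was read in as a factor (hypothetical data)
y <- factor(c("Irish", "Chinese", "Irish"))

# "White" is not an existing level, so R warns
# "invalid factor level, NA generated" and stores NA instead
y[y == "Irish"] <- "White"
print(y)

# Converting to character first makes the same assignment work
z <- as.character(factor(c("Irish", "Chinese", "Irish")))
z[z == "Irish"] <- "White"
print(z)
```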
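There is also a better way to do what you are doing. One approach is to create a "lookup" or "translation" table that maps each original category to its replacement, then use merge from base R or left_join from the "tidyverse" to do the substitution for you, without all those conditional assignments.
We'll make the translation table: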
data.frame(
  answer = c(
    "African", "Any other Asian background, please describe",
    "Any other Black/African/Caribbean background, please describe",
    "Any other ethnic group, please describ",
    "Any other Mixed/Multiple ethnic background, please describe",
    "Any other White background, please describe", "Arab", "Bangladeshi",
    "Caribbean", "Chinese", "English/Welsh/Scottish/Northern Irish/British",
    "Gypsy or Irish Traveller", "Indian", "Irish", "Pakistani", "White and Asian ",
    "White and Black African ", "White and Black Caribbean"
  ),
  subst = c(
    "Black", "Asian", "Black", "Arab & Other", "Mixed", "White",
    "Arab & Other", "Asian", "Black", "Asian", "White", "White", "Asian",
    "White", "Asian", "Mixed", "Mixed", "Mixed"
  ),
  stringsAsFactors = FALSE
) -> trans_tbl
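Now we'll simulate some data (I'm using dat rather than data as the variable name, because data is the name of an R function and using it as a variable name will cause you pain some day):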
set.seed(2018-11-30)
data.frame(
  y11 = sample(trans_tbl$answer, 100, replace = TRUE),
  stringsAsFactors = FALSE
) -> dat
str(dat)
## 'data.frame': 100 obs. of 1 variable:
## $ y11: chr "Caribbean" "Chinese" "Indian" "Any other Black/African/Caribbean background, please describe" ...
Your data frame has more than one column, but you didn't show us the others, so I'm just using y11. Now we simply call merge:
dat <- merge(dat, trans_tbl, by.x="y11", by.y="answer", all.x=TRUE)
str(dat)
## 'data.frame': 100 obs. of 2 variables:
## $ y11 : chr "African" "African" "African" "African" ...
## $ subst: chr "Black" "Black" "Black" "Black" ...
Then do some basic housekeeping to turn the subst column into y11, as your code did:
dat$y11 <- dat$subst
dat$subst <- NULL
str(dat)
## 'data.frame': 100 obs. of 1 variable:
## $ y11: chr "Black" "Black" "Black" "Black" ...
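Note that merge() sorts the result by the key column by default, which is why the output above starts with a run of "African" rows. If the original row order matters, a lookup with match() is one alternative. This is a minimal sketch on made-up data. The lookup and responses data frames and the id column are hypothetical, not the asker's real columns.

```r
# Hypothetical three-row translation table
lookup <- data.frame(
  answer = c("Irish", "Chinese", "African"),
  subst = c("White", "Asian", "Black"),
  stringsAsFactors = FALSE
)

# Hypothetical survey data with an extra id column to show order is kept
responses <- data.frame(
  id = 1:4,
  y11 = c("Chinese", "Irish", "African", "Irish"),
  stringsAsFactors = FALSE
)

# match() returns, for each response, the row of the lookup table it matches;
# indexing subst by that position translates the column in place
responses$y11 <- lookup$subst[match(responses$y11, lookup$answer)]
print(responses)
```

Responses with no entry in the lookup table come out as NA, the same as with merge(..., all.x = TRUE).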
We can also use dplyr from the "tidyverse":
library(tidyverse)
set.seed(2018-11-30)
tibble( # tibble() from the tidyverse, NOT data.frame() from base R; data_frame() is its deprecated older name
  y11 = sample(trans_tbl$answer, 100, replace = TRUE)
) -> dat
left_join(dat, trans_tbl, by = c("y11" = "answer")) %>%
  select(y11 = subst)
## # A tibble: 100 x 1
## y11
## <chr>
## 1 Black
## 2 Asian
## 3 Asian
## 4 Black
## 5 Asian
## 6 Mixed
## 7 Arab & Other
## 8 Asian
## 9 Arab & Other
## 10 Asian
## # ... with 90 more rows
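Both merge(..., all.x = TRUE) and left_join() leave NA in subst for any response that has no exact match in the translation table, so near-misses such as stray trailing whitespace matter (several of the answers above end in a space, and the table deliberately reproduces them). A small base R sketch for spotting unmapped values before joining follows. The lookup table and raw vector are made-up examples, not the asker's data.

```r
# Hypothetical lookup table whose keys have no trailing whitespace
lookup <- data.frame(
  answer = c("White and Asian", "Irish"),
  subst = c("Mixed", "White"),
  stringsAsFactors = FALSE
)

# Hypothetical raw responses: one with a trailing space, one unknown category
raw <- c("White and Asian ", "Irish", "Cornish")

# Values that would become NA after a join on the raw strings
unmatched_raw <- setdiff(raw, lookup$answer)
print(unmatched_raw)

# After trimming whitespace only the genuinely unknown category remains
unmatched_trimmed <- setdiff(trimws(raw), lookup$answer)
print(unmatched_trimmed)
```

If trimming is applied, it has to be applied to the key column on both sides of the join.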
Another approach is to use factor operations.
We'll use the same code to make the simulated data frame:
possible_answers <- c(
  "African", "Any other Asian background, please describe",
  "Any other Black/African/Caribbean background, please describe",
  "Any other ethnic group, please describ",
  "Any other Mixed/Multiple ethnic background, please describe",
  "Any other White background, please describe", "Arab", "Bangladeshi",
  "Caribbean", "Chinese", "English/Welsh/Scottish/Northern Irish/British",
  "Gypsy or Irish Traveller", "Indian", "Irish", "Pakistani", "White and Asian ",
  "White and Black African ", "White and Black Caribbean"
)
what_they_should_be <- c(
  "Black", "Asian", "Black", "Arab & Other", "Mixed", "White",
  "Arab & Other", "Asian", "Black", "Asian", "White", "White", "Asian",
  "White", "Asian", "Mixed", "Mixed", "Mixed"
)
set.seed(2018-11-30)
data.frame(
  y11 = sample(possible_answers, 100, replace = TRUE)
) -> dat
Note that I did not use stringsAsFactors = FALSE this time, which makes the data frame more like what you already have in your R session.
Now we can do this:
dat$y11 <- as.character(factor(
  x = dat$y11,
  levels = possible_answers,
  labels = what_they_should_be
))
str(dat)
## 'data.frame': 100 obs. of 1 variable:
## $ y11: chr "Black" "Asian" "Asian" "Black" ...
This gives us the translated values as a character vector rather than as a factor.
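If you would rather keep the column as a factor, base R also lets you collapse levels by assigning a named list to levels(). The sketch below uses a made-up four-value factor f (hypothetical, not the survey data). Each list name is a new level and each list element holds the old levels it absorbs.

```r
# Hypothetical factor with three original levels
f <- factor(c("Irish", "Chinese", "Indian", "Irish"))

# Each name is a merged level; each element lists the old levels it replaces
levels(f) <- list(
  White = "Irish",
  Asian = c("Chinese", "Indian")
)
print(f)
```

Any old level that is not mentioned in the list becomes NA, so the list needs to cover every category.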