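r - Create a new column in R using a function and mutate
Problem description
I have a school project, and I have spent more than three hours trying to solve this. The first variable of my dataset ("df") is "AREA". I have successfully filtered it so that the only values are the names of U.S. states.
I'm looking to create a new column/variable called "Region". It takes the state listed in "AREA" and returns one of the four U.S. Census region designations. Apparently there is already something for this in R (state.region?), but I couldn't get it to work, and I'd rather code it the long way.
Here is what I have after cleaning the data and installing the "dplyr", "tidyr", and "stringr" libraries: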
#Create U.S. Census regions
regionconvert<-function(x)
{
if(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"))
{return("South")}
if(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvannia"))
{return("Northeast")}
if(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"))
{return("Midwest")}
if(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"))
{return("West")}
}
dfRegion=mutate(df,"Region"=regionconvert(df$AREA))
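I get the following warning, and every row of my new dataset has "South":
Warning message: In if (x %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", : the condition has length > 1 and only the first element will be used
Any help you can give me to solve this would be greatly appreciated.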
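Solution
Up front: don't use df$ in your call to mutate. One of the draws (and the point) of most of the dplyr verb functions is that they work without having to be told the dataset object every time. So your call should look like this (though it still needs work):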
mutate(df, Region = regionconvert(AREA))
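But it goes a step further: if/when you use grouping in a pipeline, the bare variable (as I've shown here) holds the data for the current group only, not for the whole dataset. For example, if we want to rank the cars' mpg, but within each cylinder group: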
mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mpg))
# # A tibble: 32 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb rnk
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 5.5
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5.5
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 3.5
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 13
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 5
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 3.5
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 3
# # ... with 22 more rows
then rank is called 3 times: the first time with 11 values (cyl == 4), the second time with 7 values (cyl == 6), and the third time with 14 values (cyl == 8). If instead we tried to call:
mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mtcars$mpg))
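then rank would receive all 32 values in each call. (This would fail, because mutate needs each function call to return either 1 value or as many values as there are rows in the current group.)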
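To make the difference concrete, here is a minimal sketch (the column names n_bare and n_dollar are made up for illustration) that counts how many values each form hands to a function inside a grouped pipeline:

```r
library(dplyr)

sizes <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_bare   = length(mpg),         # values in the current group only
    n_dollar = length(mtcars$mpg)   # always the full, ungrouped column
  )
sizes
# n_bare is 11, 7, 14 for cyl 4, 6, 8; n_dollar is 32 in every row
```

summarize is used here rather than mutate only because it accepts the single value that length returns for each group.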
But if you are doing something like
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mpg))
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mtcars$mpg))
then the first will give the per-cyl averages, while the second will report the same global average for all three groups.
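Okay, now to your question:
One problem is that your function expects x to be a single value (a scalar; technically, in R it is a vector of length 1). Unfortunately, when the function is called from mutate, it is passed a whole vector of values. There are several ways to deal with this, from least preferred to most preferred: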
First, the quickest way to vectorize it is to use ifelse to return the specific region for each value. I recommend using dplyr::if_else here, though, since it enforces some type guarantees (which base::ifelse does not).
regionconvert2 <- function(x) {
  if_else(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
    "South",
  if_else(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
    "Northeast",
  if_else(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
    "Midwest",
  if_else(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"),
    "West",
    NA_character_))))  # anything unmatched becomes NA
}
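As an aside, deeply nested if_else calls can be hard to read. A flatter alternative, not part of the techniques in this answer, is dplyr::case_when. The sketch below uses deliberately abbreviated, hypothetical state lists (south, northeast) and a made-up function name (regionconvert_cw) just to show the shape:

```r
library(dplyr)

# Abbreviated lists for illustration only; a real version needs all 50 states
south     <- c("Texas", "Florida", "Georgia")
northeast <- c("Maine", "New York", "New Jersey")

regionconvert_cw <- function(x) {
  case_when(
    x %in% south     ~ "South",
    x %in% northeast ~ "Northeast",
    TRUE             ~ NA_character_  # fallback for anything unmatched
  )
}

res_cw <- regionconvert_cw(c("Texas", "Maine", "Guam"))
res_cw
# "South" "Northeast" NA
```

Conditions are checked top to bottom and the first match wins, so the behavior mirrors the nested version.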
Second, pre-populate an all-NA output, and then replace individual values as we determine them:
regionconvert3 <- function(x) {
  out <- x[NA]  # all-NA vector with the same length and type as x
  ind <- x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware")
  out[ind] <- "South"
  ind <- x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania")
  out[ind] <- "Northeast"
  ind <- x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas")
  out[ind] <- "Midwest"
  ind <- x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
  out[ind] <- "West"
  return(out)
}
Frankly, I'm not fond of this one, since it has a lot of hard-coding (and duplicated code), so an improved version looks like this:
regionlist <- list(
  South = c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
  Northeast = c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
  Midwest = c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
  West = c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
)
regionconvert4 <- function(x, lookup) {
  out <- x[NA]  # all-NA vector with the same length and type as x
  for (nm in names(lookup)) {
    ind <- x %in% lookup[[nm]]  # which elements of x belong to this entry
    out[ind] <- nm
  }
  return(out)
}
The second argument, lookup, is a named list: any value of x found in an entry's vector of possible values is replaced with the name of that entry.
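Both regionconvert3 and regionconvert4 start with out <- x[NA], which may look odd at first. A small sketch of what that line does, using a made-up three-element input:

```r
x <- c("Colorado", "Florida", "Guam")

out <- x[NA]   # the logical NA index is recycled across every position of x
length(out)    # 3
typeof(out)    # "character"

# Fill in only the positions we can classify; the rest stay NA
out[x %in% c("Florida")] <- "South"
out
# NA "South" NA
```

Because the result keeps the type of x, the later assignments of character region names do not trigger any type coercion.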
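Third, a slight reversal of the previous technique is to provide a lookup of sorts. I'll rework the regionlist from above so that instead of the names being regions, the names are the states. (This could easily be created by other means.)
# regiondf is the two-column (region, AREA) frame built in the merge/join technique below
statelist <- setNames(names(tibble::deframe(regiondf)), tibble::deframe(regiondf))
statelist[1:5]
#       Texas    Oklahoma    Arkansas   Louisiana Mississippi
#     "South"     "South"     "South"     "South"     "South"
statelist[ c("Colorado","New Jersey") ]
#    Colorado  New Jersey
#      "West" "Northeast"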
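This removes the need for a function, a la statelist[AREA].
Fourth, merge/join. This is a bit more advanced, but I think it is more maintainable in the long run (for example, you can maintain the state/region list in a simple CSV or spreadsheet, which might make it easier to edit, change, or extend). I'll create this new frame from the regionlist object, but it could just as easily be created directly or by some more familiar means:
regiondf <- tibble::enframe(regionlist, name="region", value="AREA") %>%
  tidyr::unnest(cols = AREA)  # tidyr >= 1.0 expects the list-column to be named via cols
regiondf
# # A tibble: 50 x 2
#    region AREA
#    <chr>  <chr>
#  1 South  Texas
#  2 South  Oklahoma
#  3 South  Arkansas
#  4 South  Louisiana
#  5 South  Mississippi
#  6 South  Alabama
#  7 South  Georgia
#  8 South  Florida
#  9 South  Tennessee
# 10 South  Kentucky
# # ... with 40 more rows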
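The CSV idea mentioned above could look roughly like the sketch below. The CSV content is inlined as text so the example is self-contained; in practice you would presumably point read.csv at a file of your own. The names csv_text, regions_from_csv, and areas are made up for illustration, and only three states are listed:

```r
library(dplyr)

# Stand-in for a CSV file maintained outside of R (abbreviated on purpose)
csv_text <- "AREA,region
Colorado,West
New Jersey,Northeast
Florida,South"

regions_from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

areas <- data.frame(AREA = c("Colorado", "Florida", "Guam"),
                    stringsAsFactors = FALSE)

joined <- left_join(areas, regions_from_csv, by = "AREA")
joined
# Colorado -> West, Florida -> South, Guam -> NA (no match in the lookup)
```

With a left join, rows without a match in the lookup are kept and simply get NA, which makes missing or misspelled states easy to spot.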
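Now, I'll use some simple sample data to demonstrate all of these. (Side note: if something does not work for you, it is likely because we don't have your sample data and/or are unaware of nuances only you know about. In the future, please provide some sample data for testing along with your expected output.)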
# tibble() replaces data_frame(), which is deprecated in current versions of tibble/dplyr
sampledata <- tibble::tibble(AREA = c("Colorado", "California", "New Jersey", "Florida", "Guam"))
sampledata %>%
  mutate(
    r2 = regionconvert2(AREA),
    r3 = regionconvert3(AREA),
    r4 = regionconvert4(AREA, regionlist),
    r5 = statelist[AREA]
  ) %>%
  left_join(regiondf, by = "AREA")
# # A tibble: 5 x 6
# AREA r2 r3 r4 r5 region
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Colorado West West West West West
# 2 California West West West West West
# 3 New Jersey Northeast Northeast Northeast Northeast Northeast
# 4 Florida South South South South South
# 5 Guam <NA> <NA> <NA> <NA> <NA>
(If you go with the fourth technique, merge/join, then mutate is not necessary at all.)
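Since the question mentions state.region: that is a built-in dataset (a factor), not a function, and it lines up position by position with the built-in state.name vector. A minimal sketch of a lookup based on it, with the caveat that its labels are the ones shipped with R, so the Midwest appears as "North Central" (the names x and builtin are made up for illustration):

```r
x <- c("Colorado", "California", "New Jersey", "Florida", "Guam")

# match() finds each state's position in state.name (NA when absent);
# that position then indexes the parallel state.region factor
builtin <- as.character(state.region[match(x, state.name)])
builtin
# "West" "West" "Northeast" "South" NA
```

Inside a pipeline this would be used as mutate(df, Region = as.character(state.region[match(AREA, state.name)])).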