Creating a new column in R using a function and mutate

Problem description

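I have a school project and I've spent more than three hours trying to solve this. The first variable of my dataset ("df") is "AREA". I have successfully filtered it so that the only values are the names of U.S. states.

I'm looking to create a new column/variable called "Region". It takes the state listed in "AREA" and returns one of the four U.S. Census region designations. Apparently there is already an existing function in R (state.region?), but I couldn't get it to work and I would rather code it the long way.

Here is what I have after cleaning the data and installing the "dplyr", "tidyr", and "stringr" libraries: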

#Create U.S. Census regions
regionconvert<-function(x)
{
  if(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"))
    {return("South")}
  if(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvannia"))
    {return("Northeast")}
  if(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"))
    {return("Midwest")}
  if(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"))
    {return("West")}
}
dfRegion=mutate(df,"Region"=regionconvert(df$AREA))

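I get the following error, and every row of my new dataset has "South":

Warning message: In if (x %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", : the condition has length > 1 and only the first element will be used

Any help you can give me with this would be greatly appreciated.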

Tags: r, function, dplyr

Solution


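Up front: don't use df$ inside your call to mutate. One of the draws (and points) of most dplyr verb functions is that they work without being told the dataset object every time. So your call should look like this (though it still needs work):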

mutate(df, Region = regionconvert(AREA))

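But it goes a step further: if/when you use grouping in your pipe, the variable by itself (as I've shown here) is the data for the current group only, not for the whole dataset. For example, if we want to rank the cars' mpg, but within each cylinder group: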

mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mpg))
# # A tibble: 32 x 12
# # Groups:   cyl [3]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   rnk
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   5.5
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   5.5
#  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   3.5
#  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   7  
#  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2  13  
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   2  
#  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   4  
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   5  
#  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   3.5
# 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   3  
# # ... with 22 more rows

then rank is called 3 times: the first time with 11 values (cyl == 4), the second with 7 values (cyl == 6), and the third with 14 values (cyl == 8). If instead we tried to call:

mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mtcars$mpg))

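then each of those calls to rank would receive all 32 values. (This will fail, because mutate needs each function call to return either 1 value or as many values as there are rows in the current group.)

But if you are doing something like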

mtcars %>% group_by(cyl) %>% summarize(avg = mean(mpg))
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mtcars$mpg))

then the first will give per-cyl averages, and the second will report the same global average for all three groups.
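
The difference between the bare column name and the mtcars$ form can be checked directly. This is only a minimal sketch, assuming dplyr is installed; the object names (group_sizes, bad, by_group, global) are made up for illustration and are not part of the code above.

```r
library(dplyr)

# Group sizes: with a bare column name, rank(mpg) sees 11, 7, and 14 values.
group_sizes <- table(mtcars$cyl)

# Grouped mutate with mtcars$mpg: every call gets all 32 values, which
# matches no group size, so dplyr raises an error. tryCatch() captures it.
bad <- tryCatch(
  mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mtcars$mpg)),
  error = function(e) e
)
inherits(bad, "error")

# summarize(): per-group means versus the global mean repeated per group.
by_group <- mtcars %>% group_by(cyl) %>% summarize(avg = mean(mpg))
global   <- mtcars %>% group_by(cyl) %>% summarize(avg = mean(mtcars$mpg))
```

Comparing by_group$avg with global$avg should show three distinct values in the first and a single repeated value in the second.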


OK, now on to your question:

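One problem is that your function expects x to be a single value (a scalar; technically, in R, a vector of length 1). Unfortunately, when mutate calls it, it is passed a vector of values. There are several ways to deal with this, from least to most preferred:

  1. The quickest way to vectorize it is to use ifelse, which returns a region for each value. I suggest dplyr::if_else here, though, since it enforces some type guarantees that base::ifelse does not.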

    regionconvert2 <- function(x) {
      if_else(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
              "South",
              if_else(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
                      "Northeast",
                      if_else(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
                              "Midwest",
                              if_else(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"),
                                      "West",
                                      NA_character_))))
    }
    
  2. Pre-populate an all-NA output, then replace individual values as we determine them:

    regionconvert3 <- function(x) {
      out <- x[NA]
      ind <- x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware")
      out[ind] <- "South"
      ind <- x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania")
      out[ind] <- "Northeast"
      ind <- x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas")
      out[ind] <- "Midwest"
      ind <- x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
      out[ind] <- "West"
      return(out)
    }
    

    Frankly, I'm not fond of this one, since it hard-codes a lot (and repeats code), so an improved version looks like this:

    regionlist <- list(
      South = c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
      Northeast = c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
      Midwest = c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
      West = c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
    )
    regionconvert4 <- function(x, lookup) {
      out <- x[NA]
      for (nm in names(lookup)) {
        ind <- x %in% lookup[[nm]]
        out[ind] <- nm
      }
      return(out)
    }
    

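    The intent of this second version is to replace each value with the name of the list entry whose vector of possible values contains it.

  3. A slight reversal of the previous technique is to provide a lookup of sorts. I'll rework the regionlist above so that the names are states instead of regions. (The code below uses the regiondf frame that is built in item 4; the same lookup can easily be created by other means.)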

    statelist <- setNames(names(tibble::deframe(regiondf)),
                          tibble::deframe(regiondf))
    statelist[1:5]
    #       Texas    Oklahoma    Arkansas   Louisiana Mississippi 
    #     "South"     "South"     "South"     "South"     "South" 
    statelist[ c("Colorado","New Jersey") ]
    #    Colorado  New Jersey 
    #      "West" "Northeast" 
    

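    This removes the need for a function, a la statelist[AREA].

  4. Merge/join. This is a little more advanced, but I think it is more maintainable in the long run (e.g., you can keep the state/region list in a simple CSV or spreadsheet, which may make it easier to edit, change, or extend). I'll create this new frame from the regionlist object, but it could just as easily be created directly or by more familiar means:

    regiondf <- tibble::enframe(regionlist, name="region", value="AREA") %>% tidyr::unnest(cols = AREA)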
    regiondf
    # # A tibble: 50 x 2
    #    region AREA       
    #    <chr>  <chr>      
    #  1 South  Texas      
    #  2 South  Oklahoma   
    #  3 South  Arkansas   
    #  4 South  Louisiana  
    #  5 South  Mississippi
    #  6 South  Alabama    
    #  7 South  Georgia    
    #  8 South  Florida    
    #  9 South  Tennessee  
    # 10 South  Kentucky   
    # # ... with 40 more rows
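
All four approaches above work around the same root cause, which can be seen in isolation. The following is a rough sketch using a shortened, hypothetical state list (south) and made-up names (cond, region_one, regions). The vapply wrapper at the end is an additional option for keeping scalar logic; it is not one of the four techniques listed.

```r
# Shortened, hypothetical state list (the full lists appear above).
south <- c("Texas", "Florida")
x <- c("Texas", "Maine", "Florida")

# %in% is vectorized: it returns one logical per element of x.
cond <- x %in% south
cond            # TRUE FALSE TRUE
length(cond)    # 3

# `if` needs exactly one TRUE/FALSE. Older R versions silently used only
# cond[1] and warned, which is how every row can end up as "South";
# R 4.2.0 and later raise an error instead.

# One way to keep scalar logic: apply it element by element.
region_one <- function(s) {
  if (s %in% south) "South" else NA_character_
}
regions <- vapply(x, region_one, character(1), USE.NAMES = FALSE)
regions         # "South" NA "South"
```

This element-by-element form is usually slower than the vectorized versions, so treat it as a fallback rather than a recommendation.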
    

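Now I'll use some simple sample data to demonstrate all of these functions. (Side note: if things don't work for you, it is likely because we don't have your sample data and/or any nuances that only you know about. In the future, please provide some sample data to test with, along with your expected output.)

sampledata <- tibble::tibble(AREA = c("Colorado", "California", "New Jersey", "Florida", "Guam"))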

sampledata %>%
  mutate(
    r2 = regionconvert2(AREA),
    r3 = regionconvert3(AREA),
    r4 = regionconvert4(AREA, regionlist),
    r5 = statelist[AREA]
  ) %>%
  left_join(regiondf, by = "AREA")
# # A tibble: 5 x 6
#   AREA       r2        r3        r4        r5        region   
#   <chr>      <chr>     <chr>     <chr>     <chr>     <chr>    
# 1 Colorado   West      West      West      West      West     
# 2 California West      West      West      West      West     
# 3 New Jersey Northeast Northeast Northeast Northeast Northeast
# 4 Florida    South     South     South     South     South    
# 5 Guam       <NA>      <NA>      <NA>      <NA>      <NA>     

(If you want to use the fourth "merge/join" technique, mutate is not necessary.)
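
As a sketch of that join-only route, the lookup frame could also be built from the state.name and state.region vectors that ship with R (the ones the question alludes to), rather than typed by hand. This assumes dplyr and tibble are installed; the names lookup, areas, and joined are made up here. Note that state.region calls the Midwest "North Central", so the sketch recodes it to match the labels used above.

```r
library(dplyr)

# Lookup table built from R's bundled state.name / state.region vectors.
lookup <- tibble::tibble(
  AREA   = state.name,
  region = as.character(state.region)
)
# state.region uses "North Central"; rename it to "Midwest" for consistency.
lookup$region[lookup$region == "North Central"] <- "Midwest"

# Hypothetical input, mirroring the sample data used above.
areas <- tibble::tibble(
  AREA = c("Colorado", "California", "New Jersey", "Florida", "Guam")
)

# A plain left join adds the region column; no mutate() call is involved.
joined <- left_join(areas, lookup, by = "AREA")
joined$region   # "West" "West" "Northeast" "South" NA
```

Values with no match in the lookup, such as "Guam", come back as NA, the same as with the other techniques.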

