Managing data on a subset of a dataset using tidyverse

Problem description

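Tidyverse makes data management much easier, and I am grateful to the developers behind it. I am familiar with the basic dplyr functions such as group_by, filter, select, and mutate. However, sometimes I want to manage data on only a subset of the dataset.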

Using the dplyr starwars dataset as an example, below are two tasks that I know how to handle with base R. I would like to know what the tidyverse equivalents are.

library(dplyr, warn.conflicts = FALSE)
data(starwars)
dim(starwars)
#> [1] 87 14

In the starwars dataset, suppose the data entry staff made errors for two homeworlds.

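Task 1: On the Tatooine homeworld, every character with a height of at least 160 cm should have the hair color "blond". This is easy to do in base R, but what is the tidyverse equivalent?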

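# which() drops rows where the condition is NA (homeworld and height contain NAs);
# >= matches "at least 160 cm"
starwars[which(starwars$homeworld == "Tatooine" & starwars$height >= 160),
         "hair_color"] <- "blond"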

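Task 2: This one is a bit more complicated. Suppose there was an error on Naboo (homeworld): missing mass values should be 80 kg, and I want to exclude the Gungan species on that homeworld from the dataset. I also want to count the number of rows for the Naboo homeworld only. This is simple enough to do step by step, but because of the amount of data management I need to assign an intermediate object.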

# Make a subset for those from Naboo homeworld and delete from main dataset 
starwars.Naboo <- starwars %>% filter(homeworld == "Naboo")
starwars <- starwars %>% filter(homeworld != "Naboo")

# Manage the subset of starwars.Naboo 
starwars.Naboo <- starwars.Naboo %>% 
    filter(species != "Gungan") %>%
    mutate(mass = coalesce(mass, 80)) %>%
    mutate(num = n())

# Re-add starwars.Naboo. num should be missing for all other homeworlds. 
starwars2 <- bind_rows(starwars, starwars.Naboo)


# Check to ensure transformation works
starwars2 %>% 
  tail(n = 15) %>%
  print(width = Inf)
#> # A tibble: 15 x 15
#>    name            height  mass hair_color skin_color       eye_color    
#>    <chr>            <int> <dbl> <chr>      <chr>            <chr>        
#>  1 Ratts Tyerell       79    15 none       grey, blue       unknown      
#>  2 Wat Tambor         193    48 none       green, grey      unknown      
#>  3 San Hill           191    NA none       grey             gold         
#>  4 Shaak Ti           178    57 none       red, blue, white black        
#>  5 Grievous           216   159 none       brown, white     green, yellow
#>  6 Tarfful            234   136 brown      brown            blue         
#>  7 Raymus Antilles    188    79 brown      light            brown        
#>  8 Sly Moore          178    48 none       pale             white        
#>  9 Tion Medon         206    80 none       grey             black        
#> 10 R2-D2               96    32 <NA>       white, blue      red          
#> 11 Palpatine          170    75 grey       pale             yellow       
#> 12 Gregar Typho       185    85 black      dark             brown        
#> 13 Cordé              157    80 brown      light            brown        
#> 14 Dormé              165    80 brown      light            brown        
#> 15 Padmé Amidala      165    45 brown      light            brown        
#>    birth_year sex    gender    homeworld   species films     vehicles  starships
#>         <dbl> <chr>  <chr>     <chr>       <chr>   <list>    <list>    <list>   
#>  1         NA male   masculine Aleen Minor Aleena  <chr [1]> <chr [0]> <chr [0]>
#>  2         NA male   masculine Skako       Skakoan <chr [1]> <chr [0]> <chr [0]>
#>  3         NA male   masculine Muunilinst  Muun    <chr [1]> <chr [0]> <chr [0]>
#>  4         NA female feminine  Shili       Togruta <chr [2]> <chr [0]> <chr [0]>
#>  5         NA male   masculine Kalee       Kaleesh <chr [1]> <chr [1]> <chr [1]>
#>  6         NA male   masculine Kashyyyk    Wookiee <chr [1]> <chr [0]> <chr [0]>
#>  7         NA male   masculine Alderaan    Human   <chr [2]> <chr [0]> <chr [0]>
#>  8         NA <NA>   <NA>      Umbara      <NA>    <chr [2]> <chr [0]> <chr [0]>
#>  9         NA male   masculine Utapau      Pau'an  <chr [1]> <chr [0]> <chr [0]>
#> 10         33 none   masculine Naboo       Droid   <chr [7]> <chr [0]> <chr [0]>
#> 11         82 male   masculine Naboo       Human   <chr [5]> <chr [0]> <chr [0]>
#> 12         NA male   masculine Naboo       Human   <chr [1]> <chr [0]> <chr [1]>
#> 13         NA female feminine  Naboo       Human   <chr [1]> <chr [0]> <chr [0]>
#> 14         NA female feminine  Naboo       Human   <chr [1]> <chr [0]> <chr [0]>
#> 15         46 female feminine  Naboo       Human   <chr [3]> <chr [0]> <chr [3]>
#>      num
#>    <int>
#>  1    NA
#>  2    NA
#>  3    NA
#>  4    NA
#>  5    NA
#>  6    NA
#>  7    NA
#>  8    NA
#>  9    NA
#> 10     6
#> 11     6
#> 12     6
#> 13     6
#> 14     6
#> 15     6

Session info

xfun::session_info("dplyr")
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#> 
#> Locale:
#>   LC_COLLATE=English_United States.1252 
#>   LC_CTYPE=English_United States.1252   
#>   LC_MONETARY=English_United States.1252
#>   LC_NUMERIC=C                          
#>   LC_TIME=English_United States.1252    
#> 
#> Package version:
#>   cli_2.5.0        crayon_1.4.1     dplyr_1.0.5      ellipsis_0.3.2  
#>   fansi_0.5.0      generics_0.1.0   glue_1.4.2       graphics_4.0.4  
#>   grDevices_4.0.4  lifecycle_1.0.0  magrittr_2.0.1   methods_4.0.4   
#>   pillar_1.6.1     pkgconfig_2.0.3  purrr_0.3.4      R6_2.5.0        
#>   rlang_0.4.11     stats_4.0.4      tibble_3.1.2     tidyselect_1.1.0
#>   utf8_1.2.1       utils_4.0.4      vctrs_0.3.8

Edit 1: For task 2, can I avoid assigning an intermediate object in the tidyverse? Is there a more elegant way to do what the code above does?

Tags: r, dplyr

Solution
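
For the first task, one possible tidyverse equivalent is a conditional mutate. The following is a minimal sketch rather than a definitive solution; starwars_blond is only an illustrative name. It relies on case_when() treating a missing (NA) condition as not matched, so characters with an unknown homeworld or height keep their original hair color. The comparison uses >= to match the "at least 160 cm" wording.

```r
library(dplyr, warn.conflicts = FALSE)

starwars_blond <- starwars %>%
  mutate(
    hair_color = case_when(
      # rows from Tatooine that are at least 160 cm tall become "blond"
      homeworld == "Tatooine" & height >= 160 ~ "blond",
      # every other row (including rows where the condition is NA) is unchanged
      TRUE ~ hair_color
    )
  )
```

Because the change happens inside mutate(), the row order and all other columns stay as they were.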


Your second task can be done in a single pipeline as follows:

starwars %>% 
  filter(homeworld == "Naboo", 
         species != "Gungan") %>%
  mutate(mass = coalesce(mass, 80),
         num = n()) %>% 
  bind_rows(starwars %>% filter(homeworld != "Naboo"))
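
One thing to keep in mind with the pipeline above is that filter() drops rows where the condition evaluates to NA, so characters with a missing homeworld or a missing species are silently removed, and the Naboo rows end up at the top of the result. If that is not wanted, a possible alternative is to edit the rows in place. This is a sketch under that assumption; starwars_fixed and the helper column is_naboo are illustrative names, not part of the original code.

```r
library(dplyr, warn.conflicts = FALSE)

# %in% returns FALSE (never NA) for missing values, so the conditions below
# stay well defined for rows with an unknown homeworld or species.
starwars_fixed <- starwars %>%
  # drop only Gungans on Naboo; every other row is kept
  filter(!(homeworld %in% "Naboo" & species %in% "Gungan")) %>%
  mutate(
    is_naboo = homeworld %in% "Naboo",
    # fill missing mass with 80 kg on Naboo only
    mass = if_else(is_naboo & is.na(mass), 80, mass),
    # number of remaining Naboo rows; NA for all other homeworlds
    num = if_else(is_naboo, sum(is_naboo), NA_integer_)
  ) %>%
  select(-is_naboo)
```

Since this version keeps Naboo rows whose species is NA, the value of num can differ from the one produced by the filter(species != "Gungan") approach.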
