首页 > 解决方案 > R GSRUB 函数

问题描述

使用公共数据集,其中变量分类存储描述与 LicenseNo 关联的许可证类型的代码。任何许可证都可以有 1 到 19 个不同的并发许可证类型与不同的 licenseNo 相关联。一个函数似乎是将分类拆分为 [1:19] 新列 Classification1:Classification19 的正确策略。不知道从哪里开始。我还需要将代码转换为描述,鉴于我从网站上读到的内容,我认为它可以作为 rda 文件拉入,因此创建了一个表格来支持这篇文章。不知道从哪里开始。

在此处输入图像描述

LicenseNo <- c("1000002","1000003","1000012","1000015","1000026")
Classifications <- c("C57","C-6","B","C60| C51", "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31")
data <- data.frame(LicenseNo,Classifications)
View(data)

Descriptions <- c("Cabinet, Millwork and Finish Carpentry Contractor","General Building Contractor",
                  "Well Drilling Contractor", "Structural Steel Contractor","Welding Contractor",
                  "Hazardous Substance Removal Certification","Plumbing Contractor","Electrical Contractor",
                  "General Engineering Contractor", "Insulation and Acoustical Contractor")
Classifications <- c("C-6","B","C57","C51","C60","HAZ","C36","C10","A","C-2")
class_type <- data.frame(Descriptions,Classifications)
View(class_type)

最终,希望创建以下输出,...仅列出 4 个分类用于观察 1000026 以简化。 在此处输入图像描述

标签: rfunctiongsub

解决方案


tidyverse

library(dplyr)
# library(tidyr) # unnest, pivot_*
out <- data %>%
  mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
  tidyr::unnest(Classifications) %>%
  mutate(Classifications = trimws(Classifications)) %>%
  left_join(class_type, by = "Classifications") %>%
  mutate(Classifications = coalesce(Descriptions, Classifications)) %>%
  select(-Descriptions)

out
# # A tibble: 24 x 2
#    LicenseNo Classifications                                  
#    <chr>     <chr>                                            
#  1 1000002   Well Drilling Contractor                         
#  2 1000003   Cabinet, Millwork and Finish Carpentry Contractor
#  3 1000012   General Building Contractor                      
#  4 1000015   Welding Contractor                               
#  5 1000015   Structural Steel Contractor                      
#  6 1000026   Hazardous Substance Removal Certification        
#  7 1000026   Plumbing Contractor                              
#  8 1000026   Electrical Contractor                            
#  9 1000026   General Engineering Contractor                   
# 10 1000026   General Building Contractor                      
# # ... with 14 more rows

注意coalesce由于缺少组件,我没有描述和原始分类。例如,没有coalesce我们将看到:

out <- data %>%
  mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
  tidyr::unnest(Classifications) %>%
  mutate(Classifications = trimws(Classifications)) %>%
  left_join(class_type, by = "Classifications")
print(out,n=99)
# # A tibble: 24 x 3
#    LicenseNo Classifications Descriptions                                     
#    <chr>     <chr>           <chr>                                            
#  1 1000002   C57             Well Drilling Contractor                         
#  2 1000003   C-6             Cabinet, Millwork and Finish Carpentry Contractor
#  3 1000012   B               General Building Contractor                      
#  4 1000015   C60             Welding Contractor                               
#  5 1000015   C51             Structural Steel Contractor                      
#  6 1000026   HAZ             Hazardous Substance Removal Certification        
#  7 1000026   C36             Plumbing Contractor                              
#  8 1000026   C10             Electrical Contractor                            
#  9 1000026   A               General Engineering Contractor                   
# 10 1000026   B               General Building Contractor                      
# 11 1000026   C57             Well Drilling Contractor                         
# 12 1000026   C-2             Insulation and Acoustical Contractor             
# 13 1000026   C-8             <NA>                                             
# 14 1000026   C12             <NA>                                             
# 15 1000026   C21             <NA>                                             
# 16 1000026   C27             <NA>                                             
# 17 1000026   C29             <NA>                                             
# 18 1000026   C35             <NA>                                             
# 19 1000026   C42             <NA>                                             
# 20 1000026   C45             <NA>                                             
# 21 1000026   C39             <NA>                                             
# 22 1000026   C50             <NA>                                             
# 23 1000026   C51             Structural Steel Contractor                      
# 24 1000026   C31             <NA>                                             

我的猜测是您更愿意保留“某事” Vice ,因此当缺少描述时NA,我默认将 s 替换为分类。NA如果您的数据没有此类问题,那么您可以跳过该步骤(只需将描述重命名为分类)。

长格式对很多事情都有好处(尤其是ggplot2类似的“整洁”操作),但如果你想要宽格式,那么

out %>%
  group_by(LicenseNo) %>%
  mutate(rn = paste0("Classification", row_number())) %>%
  ungroup() %>%
  tidyr::pivot_wider(LicenseNo, names_from = rn, values_from = Classifications)
# # A tibble: 5 x 20
#   LicenseNo Classification1 Classification2 Classification3 Classification4 Classification5 Classification6 Classification7
#   <chr>     <chr>           <chr>           <chr>           <chr>           <chr>           <chr>           <chr>          
# 1 1000002   Well Drilling ~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 2 1000003   Cabinet, Millw~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 3 1000012   General Buildi~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 4 1000015   Welding Contra~ Structural Ste~ <NA>            <NA>            <NA>            <NA>            <NA>           
# 5 1000026   Hazardous Subs~ Plumbing Contr~ Electrical Con~ General Engine~ General Buildi~ Well Drilling ~ Insulation and~
# # ... with 12 more variables: Classification8 <chr>, Classification9 <chr>, Classification10 <chr>, Classification11 <chr>,
# #   Classification12 <chr>, Classification13 <chr>, Classification14 <chr>, Classification15 <chr>, Classification16 <chr>,
# #   Classification17 <chr>, Classification18 <chr>, Classification19 <chr>

推荐阅读