r - R GSRUB 函数
问题描述
使用公共数据集,其中变量分类存储描述与 LicenseNo 关联的许可证类型的代码。任何许可证都可以有 1 到 19 个不同的并发许可证类型与不同的 licenseNo 相关联。一个函数似乎是将分类拆分为 [1:19] 新列 Classification1:Classification19 的正确策略。不知道从哪里开始。我还需要将代码转换为描述,鉴于我从网站上读到的内容,我认为它可以作为 rda 文件拉入,因此创建了一个表格来支持这篇文章。不知道从哪里开始。
LicenseNo <- c("1000002","1000003","1000012","1000015","1000026")
Classifications <- c("C57","C-6","B","C60| C51", "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31")
data <- data.frame(LicenseNo,Classifications)
View(data)
Descriptions <- c("Cabinet, Millwork and Finish Carpentry Contractor","General Building Contractor",
"Well Drilling Contractor", "Structural Steel Contractor","Welding Contractor",
"Hazardous Substance Removal Certification","Plumbing Contractor","Electrical Contractor",
"General Engineering Contractor", "Insulation and Acoustical Contractor")
Classifications <- c("C-6","B","C57","C51","C60","HAZ","C36","C10","A","C-2")
class_type <- data.frame(Descriptions,Classifications)
View(class_type)
解决方案
tidyverse
library(dplyr)
# library(tidyr) # unnest, pivot_*
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications") %>%
mutate(Classifications = coalesce(Descriptions, Classifications)) %>%
select(-Descriptions)
out
# # A tibble: 24 x 2
# LicenseNo Classifications
# <chr> <chr>
# 1 1000002 Well Drilling Contractor
# 2 1000003 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 General Building Contractor
# 4 1000015 Welding Contractor
# 5 1000015 Structural Steel Contractor
# 6 1000026 Hazardous Substance Removal Certification
# 7 1000026 Plumbing Contractor
# 8 1000026 Electrical Contractor
# 9 1000026 General Engineering Contractor
# 10 1000026 General Building Contractor
# # ... with 14 more rows
注意:coalesce
由于缺少组件,我没有描述和原始分类。例如,没有coalesce
我们将看到:
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications")
print(out,n=99)
# # A tibble: 24 x 3
# LicenseNo Classifications Descriptions
# <chr> <chr> <chr>
# 1 1000002 C57 Well Drilling Contractor
# 2 1000003 C-6 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 B General Building Contractor
# 4 1000015 C60 Welding Contractor
# 5 1000015 C51 Structural Steel Contractor
# 6 1000026 HAZ Hazardous Substance Removal Certification
# 7 1000026 C36 Plumbing Contractor
# 8 1000026 C10 Electrical Contractor
# 9 1000026 A General Engineering Contractor
# 10 1000026 B General Building Contractor
# 11 1000026 C57 Well Drilling Contractor
# 12 1000026 C-2 Insulation and Acoustical Contractor
# 13 1000026 C-8 <NA>
# 14 1000026 C12 <NA>
# 15 1000026 C21 <NA>
# 16 1000026 C27 <NA>
# 17 1000026 C29 <NA>
# 18 1000026 C35 <NA>
# 19 1000026 C42 <NA>
# 20 1000026 C45 <NA>
# 21 1000026 C39 <NA>
# 22 1000026 C50 <NA>
# 23 1000026 C51 Structural Steel Contractor
# 24 1000026 C31 <NA>
我的猜测是您更愿意保留“某事” Vice ,因此当缺少描述时NA
,我默认将 s 替换为分类。NA
如果您的数据没有此类问题,那么您可以跳过该步骤(只需将描述重命名为分类)。
长格式对很多事情都有好处(尤其是ggplot2
类似的“整洁”操作),但如果你想要宽格式,那么
out %>%
group_by(LicenseNo) %>%
mutate(rn = paste0("Classification", row_number())) %>%
ungroup() %>%
tidyr::pivot_wider(LicenseNo, names_from = rn, values_from = Classifications)
# # A tibble: 5 x 20
# LicenseNo Classification1 Classification2 Classification3 Classification4 Classification5 Classification6 Classification7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1000002 Well Drilling ~ <NA> <NA> <NA> <NA> <NA> <NA>
# 2 1000003 Cabinet, Millw~ <NA> <NA> <NA> <NA> <NA> <NA>
# 3 1000012 General Buildi~ <NA> <NA> <NA> <NA> <NA> <NA>
# 4 1000015 Welding Contra~ Structural Ste~ <NA> <NA> <NA> <NA> <NA>
# 5 1000026 Hazardous Subs~ Plumbing Contr~ Electrical Con~ General Engine~ General Buildi~ Well Drilling ~ Insulation and~
# # ... with 12 more variables: Classification8 <chr>, Classification9 <chr>, Classification10 <chr>, Classification11 <chr>,
# # Classification12 <chr>, Classification13 <chr>, Classification14 <chr>, Classification15 <chr>, Classification16 <chr>,
# # Classification17 <chr>, Classification18 <chr>, Classification19 <chr>
推荐阅读
- python - 执行 Python 脚本以从 Web 服务器控制 GPIO
- java - 我想使用 spring boot 创建消息应用程序
- javascript - 当 fetch 返回 404 时替换 img url
- r - 使用 dplyr 为不同的组选择不同数量的 top_n
- validation - Keras 回调 on_epoch_end 抛出错误(Nonetype 没有 len())
- javascript - 在 Cors 中发送原始 cookie 不适用于 VideoJS
- blogger - 如何将 googleusercontent.com 更改为外部图片博主
- database - Laravel:通过用户 ID 向数据库内的电子邮件地址发送电子邮件
- python - 使用 BeautifulSoup 获取“NoneType”对象没有属性“文本”错误
- python - 如何处理 DBSCAN 中的字符串变量?