r - (Extract/Separate/Match) groups in any order
Problem description
# Sample data frame
df <- data.frame(
  Column_A = c("1011 Red Cat",
               "Mouse 2011 is in the House 3001",
               "Yellow on Blue Dog walked around Park"))
I have a column of manually entered data that I am trying to clean up.
  Column_A
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|
I want to separate each feature into its own column, while still keeping Column_A so that I can extract other features from it later.
  Colour         |Code      |Column_A
1|Red            |1011      |Cat
2|NA             |2011 3001 |Mouse is in the House
3|Yellow on Blue |NA        |Dog walked around Park
So far I have been using gsub with capture groups to reorder them, and then tidyr::extract to separate them.
library(dplyr)
library(tidyr)
library(stringr)

df1 <- df %>%
  # Move the first colour found to the front of the string
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  # Remove surplus whitespace
  mutate(Column_A = str_squish(Column_A)) %>%
  # Extract the leading colour into its own column
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%
  # Repeat the preceding steps for the codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A = str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))
The result is as follows:
Colour  |Code |Column_A
|Red    |1011 |Cat
|Yellow |NA   |on Blue Dog walked around Park
|NA     |2011 |Mouse is in the House 3001
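This works for the first row, but not for the rows where the values are separated by spaces and other words; for those I have been extracting the leftovers and merging them afterwards. What is a more elegant way to do this?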
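To make the failure on the second row concrete, here is a minimal base-R sketch (not part of the original post) that applies the same reordering pattern to that row. The variable names `x`, `reordered`, and `all_codes` are illustrative, and the `trimws(gsub(...))` line is only a stand-in for `str_squish()`.

```r
x <- "Mouse 2011 is in the House 3001"

# The reordering step from the pipeline above: a single match consumes the
# whole string, so only the first code is moved to the front.
reordered <- gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3", x, perl = TRUE)
reordered <- trimws(gsub("\\s+", " ", reordered))  # base-R stand-in for str_squish()
print(reordered)

# Collecting every match instead does not depend on position or count.
all_codes <- regmatches(x, gregexpr("\\b\\d+\\b", x, perl = TRUE))[[1]]
print(all_codes)
```

The second code stays behind in the remainder after reordering, which is why a later extract step that captures one leading group cannot pick it up.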
Solution
Here is a solution that combines stringr and gsub, using the list of colour names that ships with R:
library(dplyr)
library(stringr)

# List of colour names known to R, from colors()
cols <- as.character(colors())

apply(df,
      1,
      function(x)
        tibble(
          # Extract the colours as a comma-separated string
          Color = cols[cols %in% str_split(tolower(x), " ", simplify = TRUE)] %>%
            paste0(collapse = ","),
          # Extract every run of digits as a comma-separated string
          Code = str_extract_all(x, regex("\\d+"), simplify = TRUE) %>%
            paste0(collapse = ","),
          # Remove colours and digits from Column_A, then collapse leftover whitespace
          Column_A = gsub(paste0("(\\d+|",
                                 paste0(cols, collapse = "|"),
                                 ")"), "", x, ignore.case = TRUE) %>% str_squish())) %>%
  bind_rows()
# A tibble: 3 x 3
Color Code Column_A
<chr> <chr> <chr>
1 red 1011 Cat
2 "" 2011,3001 Mouse is in the House
3 blue,yellow "" on Dog walked around Park
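As a possible variation on the answer above, the following base-R sketch works on the whole column at once instead of row by row. It rests on several assumptions that are not in the original answer: the helper name `split_features` is hypothetical, the palette is restricted to the three colours named in the question, word boundaries are added so that a colour name inside a longer word is left alone, and rows without a match get NA rather than an empty string.

```r
# Hypothetical helper: split colours and codes out of free text, in any order.
split_features <- function(x, colours = c("Red", "Yellow", "Blue")) {
  x <- as.character(x)  # guard against factor input
  colour_pat <- paste0("\\b(", paste(colours, collapse = "|"), ")\\b")
  code_pat   <- "\\b\\d+\\b"

  # Collect every match per element, in order of appearance; NA when none.
  collect <- function(pattern) {
    hits <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE))
    vapply(hits,
           function(h) if (length(h)) paste(h, collapse = ",") else NA_character_,
           character(1), USE.NAMES = FALSE)
  }

  # Remove both kinds of match, then collapse the whitespace left behind.
  rest <- gsub(paste(colour_pat, code_pat, sep = "|"), "", x,
               ignore.case = TRUE, perl = TRUE)
  rest <- trimws(gsub("\\s+", " ", rest))

  data.frame(Colour = collect(colour_pat),
             Code = collect(code_pat),
             Column_A = rest,
             stringsAsFactors = FALSE)
}

out <- split_features(c("1011 Red Cat",
                        "Mouse 2011 is in the House 3001",
                        "Yellow on Blue Dog walked around Park"))
print(out)
```

Like the answer above, this treats "Yellow" and "Blue" as two separate colours and leaves the connecting word "on" in Column_A, so it does not reproduce the "Yellow on Blue" value from the desired output; that would need an extra rule for multi-word colour phrases.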