r - Create a new column in a data frame based on partial string matching, without repeats
Question
I have a data frame with 2 columns, GL and GLDESC, and I want to add a third column called KIND based on some of the data inside the GLDESC column.
DF:
GL GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300 Bulk Gas
4 551000 Supply AB
5 551000 Supply XPTO
6 551100 Supply AB
7 551300 Intern
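For each row of the data frame:
If GLDESC contains the word Payroll anywhere in the string, then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string, then I want KIND to be Supply.
In all other cases, I want KIND to be Other.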
Then I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
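But with this, every row that matches Supply, for example, gets classified. As in rows 4 and 5 of DF, the same GL then has Supply twice, which I do not need. In fact, if the matched string repeats within the same GL, I only need one row per type of GLDESC to be matched.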
Edit: I cannot delete any rows. I want this as output:
GL GLDESC KIND
A Supply1 Supply
A Supply2 N/A
A Supply3 N/A
A Supply4 N/A
A Supply5 N/A
A Supply6 N/A
A Payroll1 Payroll
B Supply2 Supply
B Payroll Payroll
Answer
If we need the repeated elements to be NA, use duplicated on 'GLDESC' to get a logical vector and assign NA to those elements of the 'KIND' column created with ifelse:
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
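To see what that logical vector looks like, here is a minimal sketch on the sample data from the Data section (rebuilt inline so the snippet runs on its own; the name dup_flags is only illustrative). Note that duplicated() applied to the whole column ignores GL, so it only flags a GLDESC value that has already appeared somewhere earlier in the column.

```r
# Sample data, same values as DF1 in the Data section
DF1 <- data.frame(
  GL = c("A", "A", "A", "A", "A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4", "Supply5",
             "Supply6", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)

# TRUE for every element that is a repeat of an earlier element
dup_flags <- duplicated(DF1$GLDESC)

# Only row 8 ("Supply2", already seen in row 2) is flagged
which(dup_flags)
# [1] 8
```

On this data the ungrouped version therefore blanks only row 8, which is why the grouped variants below are needed to get the output requested in the question.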
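If we need to change the values within each group of a grouping variable:
library(dplyr)
# assumes KIND has already been created, e.g. with the ifelse() call from the question
DF %>%
  group_by(GL) %>%
  # within each GL, keep the first "Supply" and set later repeats to NA
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))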
# A tibble: 9 x 3
# Groups: GL [2]
# GL GLDESC KIND
# <chr> <chr> <chr>
#1 A Supply1 Supply
#2 A Supply2 <NA>
#3 A Supply3 <NA>
#4 A Supply4 <NA>
#5 A Supply5 <NA>
#6 A Supply6 <NA>
#7 A Payroll1 Payroll
#8 B Supply2 Supply
#9 B Payroll Payroll
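Or do all of the changes in one pipeline:
library(dplyr)
library(stringr)  # provides str_remove()
DF1 %>%
  # strip the digits so that "Supply1" becomes "Supply", then label everything else "Other"
  mutate(KIND = str_remove(GLDESC, "\\d+"),
         KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
  group_by(GL) %>%
  # within each GL, keep the first "Supply" and set later repeats to NA
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))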
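For comparison, the same result can probably be reached in base R without dplyr or stringr. The following is only a sketch under assumptions, not the answer's method: it reuses the grepl() classification from the question, and the helper names kind and repeated are made up for illustration. Calling duplicated() on a two-column data frame flags rows whose (GL, kind) pair has already appeared, which plays the role of group_by(GL).

```r
# Sample data, same values as DF1 in the Data section
DF1 <- data.frame(
  GL = c("A", "A", "A", "A", "A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4", "Supply5",
             "Supply6", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)

# Classify each row by a case-insensitive partial match on GLDESC
kind <- ifelse(grepl("supply", DF1$GLDESC, ignore.case = TRUE), "Supply",
        ifelse(grepl("payroll", DF1$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

# TRUE where the same (GL, kind) pair already occurred in an earlier row
repeated <- duplicated(data.frame(GL = DF1$GL, kind = kind))

# Keep the first "Supply" per GL and set the later ones to NA; no rows are dropped
DF1$KIND <- ifelse(repeated & kind == "Supply", NA_character_, kind)
DF1
```

Unlike the str_remove() approach, the grepl() classification does not depend on the description being exactly the keyword plus digits, so it would also handle values such as "Supply AB" from the original DF.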
Data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")