Creating a new column in a data frame based on partial string matching, without repeats

Problem description

I have a data frame with 2 columns, GL and GLDESC, and I want to add a third column called KIND based on some of the data inside the GLDESC column.

DF:

      GL                             GLDESC
1 515100                        Payroll-ISL
2 515900                        Payroll-ICA
3 532300                           Bulk Gas
4 551000                          Supply AB
5 551000                        Supply XPTO
6 551100                          Supply AB
7 551300                             Intern

For each row of the data frame, KIND should be derived from the text in GLDESC. I came up with this:

DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

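With this, every row that matches Supply, for example, gets classified. However, as in rows 4 and 5 of DF, the same GL ends up with two Supply entries, which I do not need. When the matched string repeats within the same GL, I only need one row of that kind to be labeled.

Edit: I cannot delete any rows. I want this as the output: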

GL  GLDESC   KIND

A   Supply1  Supply
A   Supply2  N/A
A   Supply3  N/A
A   Supply4  N/A
A   Supply5  N/A
A   Supply6  N/A
A   Payroll1 Payroll
B   Supply2  Supply
B   Payroll  Payroll

Tags: r

Solution


If we need the repeated elements as NA, use duplicated on 'GLDESC' to get a logical vector, and assign NA to those elements of the 'KIND' column created with ifelse:

DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
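Note that duplicated(DF$GLDESC) compares exact GLDESC values across the whole data frame and ignores GL. As a minimal base R sketch of a per-GL variant (the toy data below is an assumption modeled on the expected output in the question, not the asker's real data), duplicated can be applied to the (GL, KIND) pair instead. Unlike the grouped dplyr approach, this blanks every repeated KIND within a GL, not only "Supply".

```r
# Toy data (assumed for illustration), with KIND built as in the question
DF <- data.frame(
  GL = c("A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

# duplicated() on a data frame flags rows whose (GL, KIND) pair
# has already appeared in an earlier row
dup <- duplicated(DF[c("GL", "KIND")])

# Keep the first occurrence per GL and blank the repeats; no rows are dropped
DF$KIND[dup] <- NA_character_
DF
```

Here only the second row ("Supply2" under GL "A") is flagged, so "Supply2" under GL "B" keeps its label.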

If we need to change the values by a grouping variable:

library(dplyr)
DF  %>%
    group_by(GL) %>%
    mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))

# A tibble: 9 x 3
# Groups:   GL [2]
#  GL    GLDESC   KIND   
#  <chr> <chr>    <chr>  
#1 A     Supply1  Supply 
#2 A     Supply2  <NA>   
#3 A     Supply3  <NA>   
#4 A     Supply4  <NA>   
#5 A     Supply5  <NA>   
#6 A     Supply6  <NA>   
#7 A     Payroll1 Payroll
#8 B     Supply2  Supply 
#9 B     Payroll  Payroll

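Or do all of the changes in one pipeline:

library(stringr)
DF1 %>%
    # strip the trailing digits so "Supply1" becomes "Supply"
    mutate(KIND = str_remove(GLDESC, "\\d+"),
           KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
    group_by(GL) %>%
    # within each GL, keep the first "Supply" and set the repeats to NA
    mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))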

Data

DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B", 
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4", 
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA, 
-9L), class = "data.frame")
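
To tie this back to the sample data in the question, the following is a hedged sketch that combines the question's grepl classification with the grouped replacement. The data frame is retyped from the printed DF, and the use of case_when is a choice made here for readability, not part of the original answer.

```r
library(dplyr)

# Sample data retyped from the question's printed DF
DF <- data.frame(
  GL = c(515100, 515900, 532300, 551000, 551000, 551100, 551300),
  GLDESC = c("Payroll-ISL", "Payroll-ICA", "Bulk Gas", "Supply AB",
             "Supply XPTO", "Supply AB", "Intern"),
  stringsAsFactors = FALSE
)

res <- DF %>%
  # classify by case-insensitive partial match, as in the question
  mutate(KIND = case_when(
    grepl("supply", GLDESC, ignore.case = TRUE) ~ "Supply",
    grepl("payroll", GLDESC, ignore.case = TRUE) ~ "Payroll",
    TRUE ~ "Other"
  )) %>%
  group_by(GL) %>%
  # within each GL, keep the first "Supply" and set the repeats to NA
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_)) %>%
  ungroup()

res
```

Under these assumptions, only "Supply XPTO" (the second Supply row for GL 551000) becomes NA, and all seven rows are kept.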
