Trace back - find if a string value occurs before another specific string value - dplyr/R

Question

I have a general problem in R. Is there a way to identify whether a specific string value occurs after another specific string value within a group? The dataset is a time series, and each group consists of 10 years.

I want something like the code below, but instead of using lag() I want to look at every year before "stringvalue1" within the group.

mutate(new_var = if_else(stringvar == "stringvalue1" & lag(stringvar) %in% c("stringvalue2", "stringvalue3"), "Match", "Not match"))

Help would be much appreciated!

library(dplyr)

match_if_precedes <- function(column, this_string, preceded_by)
{
  # Positions of this_string; if there are none, nothing can precede it
  matches    <- which(column == this_string)
  if (length(matches) == 0) return(rep("No Match", length(column)))
  # Index of the element just before the last occurrence of this_string
  last_match <- matches[length(matches)] - 1
  if (last_match == 0) return(rep("No Match", length(column)))
  # Is any member of preceded_by present before that last occurrence?
  any_matches <- any(preceded_by %in% column[1:last_match])
  if (any_matches) return(rep("Match", length(column)))
  return(rep("No Match", length(column)))
}

df1 <- structure(list(group = c("A", "A", "A", "A", "A",
                               "B", "B", "B", "B", "B",
                               "C", "C", "C", "C", "C"),
                     stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", "stringvalue1", "stringvalue1",
                                   "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue4",
                                   "stringvalue4", "stringvalue2", "stringvalue3", "stringvalue3", "stringvalue4")),
                                   row.names = c(NA, -15L), class = "data.frame")
df1 %>%
  group_by(group) %>%
  mutate(newvar = match_if_precedes(stringvar, "stringvalue1",
                                    c("stringvalue2", "stringvalue3")))

   group stringvar    newvar  
   <chr> <chr>        <chr>   
 1 A     stringvalue4 Match   
 2 A     stringvalue2 Match   
 3 A     stringvalue1 Match   
 4 A     stringvalue1 Match   
 5 A     stringvalue1 Match   
 6 B     stringvalue1 No Match
 7 B     stringvalue1 No Match
 8 B     stringvalue1 No Match
 9 B     stringvalue1 No Match
10 B     stringvalue4 No Match
11 C     stringvalue4 No Match
12 C     stringvalue2 No Match
13 C     stringvalue3 No Match
14 C     stringvalue3 No Match
15 C     stringvalue4 No Match

Tags: r, dplyr, stringr

Solution


You can define a function that will return a vector of "Match" if the criteria are met, and a vector of "No Match" if the criteria are not met. These will be the same length as the input column.

I have added extensive comments to show how the function works:

# This function takes a vector of strings called `column`. It looks for any instances of the
# single string `this_string` and any of the vector of strings `preceded_by` within 
# `column`. If it finds any member of `preceded_by` in the vector before the last instance
# of `this_string` it returns a vector of the string "Match" with the same length as 
# the original `column` vector. In all other cases it returns a vector of "No Match"

match_if_precedes <- function(column, this_string, preceded_by)
{
  # Find instances of this_string. If there are no instances of this_string then
  # we want to return a vector of "No Match"
  matches    <- which(column == this_string)
  if (length(matches) == 0) return(rep("No Match", length(column)))

  # If there is more than one instance of this_string, we use the last one.
  # last_match is the index of the element just before that last instance
  last_match <- matches[length(matches)] - 1

  # If the only instance of this_string is at position 1, there can't be any
  # instances of preceded_by before it, so return a vector of "No Match"
  if (last_match == 0) return(rep("No Match", length(column)))

  # Now check which members of preceded_by occur in the part of the column before
  # the last instance of this_string. %in% never returns NA, so dropping NA
  # values here is purely defensive
  any_matches <- preceded_by %in% column[1:last_match]
  any_matches <- any_matches[!is.na(any_matches)]

  # If nothing is left to check (e.g. preceded_by is empty), return "No Match"
  if(length(any_matches) == 0) return(rep("No Match", length(column)))

  # If any of our matches are TRUE, we return a vector of "Match"
  if(any(any_matches)) return(rep("Match", length(column)))

  # The only remaining possibility is that we had no matches, so return "No Match"
  return(rep("No Match", length(column)))
}
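For comparison, the same rule can be written more compactly. The following is only a sketch in base R, not a replacement for the commented version: the name `match_if_precedes_compact` is made up for illustration, and it assumes, as above, that a match means "any member of `preceded_by` appears strictly before the last occurrence of `this_string`".

```r
# Hypothetical compact variant (the name is an assumption, not from the original answer)
match_if_precedes_compact <- function(column, this_string, preceded_by) {
  # Positions where this_string occurs (NA entries are dropped by which())
  hits <- which(column == this_string)

  # Everything strictly before the last occurrence of this_string;
  # empty if this_string is absent or only occurs at position 1
  before_last <- if (length(hits) > 0) head(column, max(hits) - 1) else character(0)

  # any() over an empty logical vector is FALSE, so the edge cases fall out naturally
  found <- any(before_last %in% preceded_by)
  rep(if (found) "Match" else "No Match", length(column))
}

# "stringvalue2" comes before the last "stringvalue1", so every element is "Match"
res_a <- match_if_precedes_compact(
  c("stringvalue4", "stringvalue2", "stringvalue1"),
  "stringvalue1", c("stringvalue2", "stringvalue3")
)

# "stringvalue1" is first, so nothing can precede it: every element is "No Match"
res_b <- match_if_precedes_compact(
  c("stringvalue1", "stringvalue4"),
  "stringvalue1", c("stringvalue2", "stringvalue3")
)
```

Calling the function on plain vectors like this is also a quick way to check the edge cases before using it inside group_by() and mutate().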

We can test this using the data from your question as modified by your comments:

df <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", 
    "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1", 
    "stringvalue1", "stringvalue1", "stringvalue4", "stringvalue4", 
    "stringvalue2", "stringvalue3", "stringvalue3", "stringvalue1"
    )), row.names = c(NA, -15L), class = "data.frame")

find_these  <- c("stringvalue2", "stringvalue3")
before_this <- "stringvalue1"

Now I can use group_by and mutate to apply this function to each of the groups in the data frame:

library(dplyr)

df %>%
  group_by(group) %>%
  mutate(newvar = match_if_precedes(stringvar, before_this, find_these)) %>%
  as.data.frame()

Result:

#>    group    stringvar   newvar
#> 1      A stringvalue4    Match
#> 2      A stringvalue2    Match
#> 3      A stringvalue1    Match
#> 4      A stringvalue1    Match
#> 5      A stringvalue1    Match
#> 6      B stringvalue1 No Match
#> 7      B stringvalue1 No Match
#> 8      B stringvalue1 No Match
#> 9      B stringvalue1 No Match
#> 10     B stringvalue4 No Match
#> 11     C stringvalue4    Match
#> 12     C stringvalue2    Match
#> 13     C stringvalue3    Match
#> 14     C stringvalue3    Match
#> 15     C stringvalue1    Match
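The function above labels whole groups. If you instead want a row-by-row label, closer to the lag() attempt in the question, one possible approach is to combine dplyr's cumany() with lag(). This is a sketch under assumptions: the names `df_demo`, `seen_before` and `row_level` are illustrative only, and a row is taken to be a "Match" when it holds "stringvalue1" and any earlier row in the same group holds "stringvalue2" or "stringvalue3".

```r
library(dplyr)

# Same values as the test data above, rebuilt here so the sketch is self-contained
df_demo <- data.frame(
  group     = rep(c("A", "B", "C"), each = 5),
  stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", "stringvalue1", "stringvalue1",
                "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue4",
                "stringvalue4", "stringvalue2", "stringvalue3", "stringvalue3", "stringvalue1")
)

row_level <- df_demo %>%
  group_by(group) %>%
  mutate(
    # cumany() becomes TRUE from the first row holding a target value onwards;
    # lag() shifts it down one row so that only strictly earlier rows count
    seen_before = lag(cumany(stringvar %in% c("stringvalue2", "stringvalue3")),
                      default = FALSE),
    new_var = if_else(stringvar == "stringvalue1" & seen_before, "Match", "Not match")
  ) %>%
  ungroup()
```

With this data, rows 3 to 5 of group A and the last row of group C are labelled "Match". Every row in group B is "Not match", because nothing precedes its first "stringvalue1".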
