I have a general problem in R. I wonder if there's a way to identify whether a specific string value occurs after another specific string value within a group. The dataset is a time series. Each group consists of 10 years.

I want something like the below, but instead of lag I wish to look at every year before "stringvalue1" within the group.

mutate(new_var = if_else(stringvar == "stringvalue1" & lag(stringvar) %in% c("stringvalue2", "stringvalue3"), "Match", "Not match"))

Help would be much appreciated!


You can define a function that will return a vector of "Match" if the criteria are met, and a vector of "No Match" if the criteria are not met. These will be the same length as the input column.

I have added extensive comments to show how the function works:

# This function takes a vector of strings called `column`. It looks for any instances of the
# single string `this_string` and any of the vector of strings `preceded_by` within 
# `column`. If it finds any member of `preceded_by` in the vector before the last instance
# of `this_string` it returns a vector of the string "Match" with the same length as 
# the original `column` vector. In all other cases it returns a vector of "No Match"

match_if_precedes <- function(column, this_string, preceded_by) {
  # Find instances of this_string. If there are no instances of this_string then
  # we want to return a vector of "No Match"
  matches <- which(column == this_string)
  if (length(matches) == 0) return(rep("No Match", length(column)))

  # If there is more than one instance of this_string, we want to use the last one.
  # last_match is the index of the element just before that last instance
  last_match <- matches[length(matches)] - 1

  # If the only instance of this_string is at position 1, there can't be any
  # instances of preceded_by before it, so return a vector of "No Match"
  if (last_match == 0) return(rep("No Match", length(column)))

  # Check whether each member of preceded_by occurs in the part of the column
  # before the last instance of this_string. %in% never returns NA, so no
  # NA handling is needed here
  any_matches <- preceded_by %in% column[1:last_match]

  # If any of our matches are TRUE, we return a vector of "Match"
  if (any(any_matches)) return(rep("Match", length(column)))

  # The only remaining possibility is that we had no matches, so return "No Match"
  return(rep("No Match", length(column)))
}
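The edge cases described in the comments (no occurrence at all, an occurrence only at position 1, nothing relevant before the last occurrence) can be checked on plain vectors before touching the data frame. As a rough sketch, here is a compact function that I believe behaves the same way; the name `precedes_flag` is hypothetical and only used for illustration:

```r
# Hypothetical compact variant; precedes_flag is an illustrative name only.
precedes_flag <- function(column, this_string, preceded_by) {
  hits <- which(column == this_string)
  # Look only at elements strictly before the last occurrence of this_string.
  # seq_len(0) gives an empty index when that occurrence is at position 1.
  found <- length(hits) > 0 &&
    any(column[seq_len(max(hits) - 1)] %in% preceded_by)
  rep(if (found) "Match" else "No Match", length(column))
}

precedes_flag(c("x", "b", "a"), "a", c("b", "c"))  # all "Match"
precedes_flag(c("a", "b", "x"), "a", c("b", "c"))  # all "No Match": "b" comes after "a"
precedes_flag(c("x", "b", "x"), "a", c("b", "c"))  # all "No Match": no "a" at all
precedes_flag(character(0), "a", "b")              # character(0)
```

Using `seq_len()` rather than `1:n` avoids the `1:0` pitfall, which is why this version needs no separate check for position 1.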

We can test this using the data from your question as modified by your comments:

df <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", 
    "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1", 
    "stringvalue1", "stringvalue1", "stringvalue4", "stringvalue4", 
    "stringvalue2", "stringvalue3", "stringvalue3", "stringvalue1"
    )), row.names = c(NA, -15L), class = "data.frame")

find_these  <- c("stringvalue2", "stringvalue3")
before_this <- "stringvalue1"

Now I can use group_by and mutate to apply this function to each of the groups in the data frame:


library(dplyr)

df %>%
  group_by(group) %>%
  mutate(newvar = match_if_precedes(stringvar, before_this, find_these)) %>%
  as.data.frame()


#>    group    stringvar   newvar
#> 1      A stringvalue4    Match
#> 2      A stringvalue2    Match
#> 3      A stringvalue1    Match
#> 4      A stringvalue1    Match
#> 5      A stringvalue1    Match
#> 6      B stringvalue1 No Match
#> 7      B stringvalue1 No Match
#> 8      B stringvalue1 No Match
#> 9      B stringvalue1 No Match
#> 10     B stringvalue4 No Match
#> 11     C stringvalue4    Match
#> 12     C stringvalue2    Match
#> 13     C stringvalue3    Match
#> 14     C stringvalue3    Match
#> 15     C stringvalue1    Match
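
Note that the result above labels every row of a group the same way. If you would rather flag only the "stringvalue1" rows that have one of the other values somewhere earlier in their group (one possible reading of the question, so treat this as an assumption), a minimal base R sketch could look like the following. The names `seen_before` and `match_rowwise` are hypothetical:

```r
# Hypothetical helper: TRUE at position i if any element strictly before i
# is one of preceded_by.
seen_before <- function(column, preceded_by) {
  hit <- column %in% preceded_by
  # Shift the running "seen so far" flag down by one position
  c(FALSE, cumsum(hit) > 0)[seq_along(column)]
}

# Hypothetical row-wise variant: only rows equal to this_string that have a
# member of preceded_by earlier in the vector are labelled "Match".
match_rowwise <- function(column, this_string, preceded_by) {
  ifelse(column == this_string & seen_before(column, preceded_by),
         "Match", "No Match")
}

group_a <- c("stringvalue4", "stringvalue2", "stringvalue1",
             "stringvalue1", "stringvalue1")
match_rowwise(group_a, "stringvalue1", c("stringvalue2", "stringvalue3"))
# rows 1-2 give "No Match", rows 3-5 give "Match"
```

Because it takes and returns a vector of the same length, it can be used inside `group_by()` and `mutate()` in the same way as the function above.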
