r - Trace back - find if a string value occurs before another specific string value - dplyr/R
Problem description
I have a general problem in R. I wonder if there's a way to identify whether a specific string value occurs after another specific string value within a group. The dataset is a time series, and each group consists of 10 years.
I want something like the code below, but instead of lag, which only looks at the previous year, I wish to look at every year before "stringvalue1" within the group.
mutate(new_var = if_else(stringvar == "stringvalue1" & lag(stringvar) %in% c("stringvalue2", "stringvalue3"), "Match", "Not match"))
Help would be much appreciated!
library(dplyr)

match_if_precedes <- function(column, this_string, preceded_by)
{
  # Positions where this_string occurs
  matches <- which(column == this_string)
  if (length(matches) == 0) return(rep("No Match", length(column)))
  # Number of elements before the last occurrence of this_string
  last_match <- matches[length(matches)] - 1
  if (last_match == 0) return(rep("No Match", length(column)))
  # Is any member of preceded_by found before that last occurrence?
  any_matches <- preceded_by %in% column[1:last_match]
  any_matches <- any_matches[!is.na(any_matches)]
  if (length(any_matches) == 0) return(rep("No Match", length(column)))
  if (any(any_matches)) return(rep("Match", length(column)))
  return(rep("No Match", length(column)))
}

df1 <- structure(list(group = c("A", "A", "A", "A", "A",
                                "B", "B", "B", "B", "B",
                                "C", "C", "C", "C", "C"),
    stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", "stringvalue1", "stringvalue1",
                  "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1", "stringvalue4",
                  "stringvalue4", "stringvalue2", "stringvalue3", "stringvalue3", "stringvalue4")),
    row.names = c(NA, -15L), class = "data.frame")

df1 %>%
  group_by(group) %>%
  mutate(newvar = match_if_precedes(stringvar, "stringvalue1",
                                    c("stringvalue2", "stringvalue3")))

   group stringvar    newvar
   <chr> <chr>        <chr>
 1 A     stringvalue4 Match
 2 A     stringvalue2 Match
 3 A     stringvalue1 Match
 4 A     stringvalue1 Match
 5 A     stringvalue1 Match
 6 B     stringvalue1 No Match
 7 B     stringvalue1 No Match
 8 B     stringvalue1 No Match
 9 B     stringvalue1 No Match
10 B     stringvalue4 No Match
11 C     stringvalue4 No Match
12 C     stringvalue2 No Match
13 C     stringvalue3 No Match
14 C     stringvalue3 No Match
15 C     stringvalue4 No Match
Solution
You can define a function that will return a vector of "Match" if the criteria are met, and a vector of "No Match" if the criteria are not met. These will be the same length as the input column.
I have added extensive comments to show how the function works:
# This function takes a vector of strings called `column`. It looks for any instances of the
# single string `this_string` and any of the vector of strings `preceded_by` within
# `column`. If it finds any member of `preceded_by` in the vector before the last instance
# of `this_string` it returns a vector of the string "Match" with the same length as
# the original `column` vector. In all other cases it returns a vector of "No Match"
match_if_precedes <- function(column, this_string, preceded_by)
{
  # Find instances of this_string. If there are no instances of this_string then
  # we want to return a vector of "No Match"
  matches <- which(column == this_string)
  if (length(matches) == 0) return(rep("No Match", length(column)))

  # If there is more than one instance of this_string, we want to choose the last one.
  # Subtracting 1 gives the number of elements that come before it
  last_match <- matches[length(matches)] - 1

  # If the only instance of this_string is at position 1, there can't be any
  # instances of preceded_by before it, so return a vector of "No Match"
  if (last_match == 0) return(rep("No Match", length(column)))

  # Now check which members of preceded_by occur in the part of the column before
  # the last instance of this_string. `%in%` never returns NA, so the NA filter
  # is purely defensive
  any_matches <- preceded_by %in% column[1:last_match]
  any_matches <- any_matches[!is.na(any_matches)]

  # If there is nothing to check (for example, preceded_by is empty), return "No Match"
  if (length(any_matches) == 0) return(rep("No Match", length(column)))

  # If any of our matches are TRUE, we return a vector of "Match"
  if (any(any_matches)) return(rep("Match", length(column)))

  # The only remaining possibility is that we had no matches, so return "No Match"
  return(rep("No Match", length(column)))
}
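As a cross-check on the logic described in the comments, the same rule can be written more compactly. This is only a sketch: the name match_if_precedes_compact and the toy vectors are made up for illustration and are not part of the original answer. It relies on head() returning an empty vector when asked for zero elements, and on any() of an empty logical vector being FALSE.

```r
# Hypothetical compact variant of match_if_precedes (illustration only)
match_if_precedes_compact <- function(column, this_string, preceded_by) {
  # Positions of this_string; which() drops NA comparisons
  hits <- which(column == this_string)
  if (length(hits) == 0) return(rep("No Match", length(column)))

  # Everything strictly before the last occurrence of this_string
  # (an empty vector if that occurrence is at position 1)
  earlier <- head(column, max(hits) - 1)

  # One label for the whole group, repeated to the input length
  label <- if (any(earlier %in% preceded_by)) "Match" else "No Match"
  rep(label, length(column))
}

# Toy inputs covering the three branches
match_if_precedes_compact(c("x", "b", "a"), "a", c("b", "c"))  # "b" precedes the last "a"
match_if_precedes_compact(c("a", "a", "x"), "a", c("b", "c"))  # nothing relevant before "a"
match_if_precedes_compact(c("b", "x"), "a", c("b", "c"))       # "a" never occurs
```

The first call should label every element "Match" and the other two "No Match", which mirrors the branches of the longer function above.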
We can test match_if_precedes using the data from your question, as modified by your comments:
df <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
stringvar = c("stringvalue4", "stringvalue2", "stringvalue1",
"stringvalue1", "stringvalue1", "stringvalue1", "stringvalue1",
"stringvalue1", "stringvalue1", "stringvalue4", "stringvalue4",
"stringvalue2", "stringvalue3", "stringvalue3", "stringvalue1"
)), row.names = c(NA, -15L), class = "data.frame")
find_these <- c("stringvalue2", "stringvalue3")
before_this <- "stringvalue1"
Now I can use group_by and mutate to apply this function to each of the groups in the data frame:
library(dplyr)
df %>%
group_by(group) %>%
mutate(newvar = match_if_precedes(stringvar, before_this, find_these)) %>%
as.data.frame()
Result:
#>    group    stringvar   newvar
#> 1      A stringvalue4    Match
#> 2      A stringvalue2    Match
#> 3      A stringvalue1    Match
#> 4      A stringvalue1    Match
#> 5      A stringvalue1    Match
#> 6      B stringvalue1 No Match
#> 7      B stringvalue1 No Match
#> 8      B stringvalue1 No Match
#> 9      B stringvalue1 No Match
#> 10     B stringvalue4 No Match
#> 11     C stringvalue4    Match
#> 12     C stringvalue2    Match
#> 13     C stringvalue3    Match
#> 14     C stringvalue3    Match
#> 15     C stringvalue1    Match
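The function above labels a whole group at once. If a per-row label is wanted instead, closer to the if_else attempt in the question, one possible sketch uses dplyr's cumany() together with lag(). The data frame df_demo and the helper columns seen and seen_before are hypothetical names used only for this illustration, and the sketch assumes the rows are already sorted by year within each group.

```r
library(dplyr)

# Small made-up data set: two groups of four "years" each
df_demo <- data.frame(
  group = rep(c("A", "B"), each = 4),
  stringvar = c("stringvalue4", "stringvalue2", "stringvalue1", "stringvalue1",
                "stringvalue1", "stringvalue3", "stringvalue4", "stringvalue1")
)

row_result <- df_demo %>%
  group_by(group) %>%
  mutate(
    # TRUE from the first row holding stringvalue2 or stringvalue3 onwards
    seen = cumany(stringvar %in% c("stringvalue2", "stringvalue3")),
    # Shift by one row so that only strictly earlier rows count
    seen_before = lag(seen, default = FALSE),
    new_var = if_else(stringvar == "stringvalue1" & seen_before,
                      "Match", "Not match")
  ) %>%
  ungroup() %>%
  select(group, stringvar, new_var)

row_result
```

Here only rows that themselves hold "stringvalue1" and have one of the other two values somewhere earlier in their group are labelled "Match". The first row of group B therefore stays "Not match" even though a later "stringvalue1" in the same group does match.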