Matching strings of two data frames using a sliding window

Problem description

I have two data frames.

df1
       col1
    1    a
    2    b
    3    c
    4    c
df2
   setID col1
1      1    a
2      1    b
3      1    b
4      1    a
5      2    w
6      2    v
7      2    c
8      2    b
9      3    a
10     3    a
11     3    b
12     3    a
13     4    a
14     4    b
15     4    c
16     4    a

I am using the following code to match them.

scorematch <- function () 
{
      require("dplyr")
      #to make sure every element is preceded by the one before that element
      combm <- rev(sapply(rev(seq_along(df1$col1)), function(i) paste0(df1$col1[i-1], df1$col1[i])));
      tempdf <- df2
      #group the history by their ID
      tempdf <- group_by(tempdf, setID)
      #collapse strings in history
      tempdf <- summarise(tempdf, ss = paste(col1, collapse = ""))
      tempdf <- rowwise(tempdf)
      #add score based on how it matches compared to path
      tempdf <- mutate(tempdf, score = sum(sapply(combm, function(x) sum(grepl(x, ss)))))
      tempdf <- ungroup(tempdf)
      #filter so that only IDs with scores more than 0 are available
      tempdf <- filter(tempdf, score != 0)
      tempdf <- pull(tempdf, setID)
      #filter original history to reflect new history
      tempdf2 <- filter(df2, setID %in% tempdf)
      tempdf2
    }

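This code works well, but I want to take it a step further. I want to apply a sliding window function to obtain the df1 values that I want to match against df2. So far, I am using this function as my sliding window.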

slidingwindow <- function(data, window, step)
{
  #data is dataframe with colname
  total <- length(data)
  #spots are start of each window
  spots <- seq(from=1, to=(total-step), by=step)
  result <- vector(length = length(spots))
  for(i in 1:length(spots)){
    ...
  }
  return(result)
}

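The scorematch function will be nested inside the slidingwindow function, but I'm not sure how to proceed from there. Ideally, df1 would be split into windows. Starting with the first window, it would be matched against df2 using scorematch to get a filtered df2. Then the second window of df1 would be matched against the newly filtered df2, and so on. The loop should end once df2 has been filtered down to a single distinct setID value. The final output can be either the whole filtered df2 or just the remaining setID. The ideal output would be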

  setID col1
1     4    a
2     4    b
3     4    c
4     4    a

or

[1] "4"

Tags: r, dataframe, dplyr, sliding-window

Solution


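Here is a solution that does not use a for-loop. I use stringr because of its nice, consistent syntax, purrr for map (although lapply would be enough in this case), and dplyr to group_by setID and collapse the strings of each group.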

library(dplyr)
library(purrr)
library(stringr)

First, I collapse the strings of each group. This makes it easier to use pattern matching with str_detect later:

df2_collapse <- df2 %>% 
  group_by(setID) %>% 
  summarise(string = str_c(col1, collapse = ""))

df2_collapse
# A tibble: 4 x 2
#   setID string
#   <int> <chr> 
# 1     1 abba  
# 2     2 wvcb  
# 3     3 aaba  
# 4     4 abca  

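The "lookup" string is collapsed as well, and then the substrings (i.e. the sliding windows) are extracted with str_sub. Here I work along the length of the string, obtained with str_length, and extract every possible group starting at each letter of the string.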

string <- str_c(df1$col1, collapse = "")

string
# [1] "abcc"

substrings <- 
  unlist(map(1:str_length(string), ~ str_sub(string, start = .x, end = .x:str_length(string))))

The resulting substrings are then stored in a tibble, with their length as the score.

substrings
# [1] "a"    "ab"   "abc"  "abcc" "b"    "bc"   "bcc"  "c"    "cc"   "c" 

substrings <- tibble(substring = substrings, 
                     score = str_length(substrings))

substrings
# A tibble: 10 x 2
#    substring score
#    <chr>     <int>
#  1 a             1
#  2 ab            2
#  3 abc           3
#  4 abcc          4
#  5 b             1
#  6 bc            2
#  7 bcc           3
#  8 c             1
#  9 cc            2
# 10 c             1
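
As a rough illustration of what the substring enumeration above does, the same windows can be produced with base R's substring(). The helper names all_substrings and fixed_windows are hypothetical and not part of the original answer; the second one is an assumed variant for a fixed window width, which the answer itself does not use.

```r
# Hypothetical helper (not from the original answer): enumerate every
# contiguous substring of `s`, in the same order as the str_sub() call above.
all_substrings <- function(s) {
  n <- nchar(s)
  if (n == 0) return(character(0))
  # Start positions: 1 repeated n times, 2 repeated n - 1 times, ...
  starts <- rep(seq_len(n), times = rev(seq_len(n)))
  # End positions: for each start i, every position from i to n
  ends <- unlist(lapply(seq_len(n), function(i) i:n))
  substring(s, starts, ends)
}

# Hypothetical variant: only windows of a fixed width, moved by `step`.
fixed_windows <- function(s, width, step = 1) {
  n <- nchar(s)
  if (width > n) return(character(0))
  starts <- seq(1, n - width + 1, by = step)
  substring(s, starts, starts + width - 1)
}

subs <- all_substrings("abcc")
wins <- fixed_windows("abcc", width = 2)
print(subs)
print(wins)
```

Enumerating all substrings grows quadratically with the length of the lookup string, so the fixed-width variant may be the better fit when df1 is long.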

For each setID, extract the highest score among the substrings it matches, then keep only the rows with the highest score across all setIDs.

df2_collapse %>%
  mutate(score = map_dbl(string, 
                          ~ max(substrings$score[str_detect(.x, substrings$substring)]))) %>% 
  filter(score == max(score))

# A tibble: 1 x 3
#     setID string score
#     <int> <chr>  <dbl>
#   1     4 abca       3
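
To get the output format the question asks for (the filtered df2 or just the remaining setID), the steps above could be wrapped in a single function. This is a minimal sketch under some assumptions: the function name best_match is hypothetical, the substrings are matched literally with fixed() rather than as regular expressions, and a setID that matches nothing gets a score of 0 (with the code above, max() of an empty vector would return -Inf with a warning).

```r
library(dplyr)
library(purrr)
library(stringr)

# Data copied from the Data section of this answer
df1 <- data.frame(col1 = c("a", "b", "c", "c"))
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("a", "b", "b", "a", "w", "v", "c", "b",
           "a", "a", "b", "a", "a", "b", "c", "a")
)

# Hypothetical wrapper: name and structure are assumptions for illustration
best_match <- function(df1, df2) {
  # Collapse the lookup column and enumerate its distinct substrings
  string <- str_c(df1$col1, collapse = "")
  n <- str_length(string)
  subs <- unique(unlist(map(seq_len(n), function(i) {
    str_sub(string, start = i, end = i:n)
  })))

  # One collapsed string per setID
  collapsed <- df2 %>%
    group_by(setID) %>%
    summarise(string = str_c(col1, collapse = ""), .groups = "drop")

  # Score = length of the longest substring found in the collapsed string,
  # or 0 when nothing matches
  scored <- collapsed %>%
    mutate(score = map_dbl(string, function(s) {
      hit <- str_detect(s, fixed(subs))
      if (any(hit)) as.numeric(max(str_length(subs[hit]))) else 0
    }))

  # Keep the setID(s) with the highest non-zero score
  best_ids <- scored %>%
    filter(score > 0, score == max(score)) %>%
    pull(setID)

  filter(df2, setID %in% best_ids)
}

result <- best_match(df1, df2)
print(result)
print(unique(result$setID))
```

If several setIDs tie for the highest score, all of them are returned, so the caller may still need a tie-breaking rule.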

Data

df1 <- structure(list(col1 = c("a", "b", "c", "c")), 
                 class = "data.frame", row.names = c("1", "2", "3", "4"))
df2 <- 
  structure(list(setID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), 
                 col1 = c("a", "b", "b", "a", "w", "v", "c", "b", "a", "a", "b", "a", "a", "b", "c", "a")), 
            class = "data.frame", 
            row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"))
