Get the id of pattern matches

Question

I want to extract the collocates of the lemma GO.

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go", 
         "i 'm gon na go", "she 's going berserk"))

I can extract the collocates like this:

# lemma forms:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na") 

# alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# extraction:
library(stringr)
df_GO <- data.frame(
  left = unlist(str_extract_all(df$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
  node = unlist(str_extract_all(df$go, pattern_GO)),
  right = unlist(str_extract_all(df$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
)

The result is fine, but it does not show the id values, i.e. I cannot tell which "sentence" a match was extracted from:

df_GO
  left   node   right
1          go   after
2   we     go        
3   he   went    bust
4          go     get
5   it     go        
6   'm gon na      go
7   na     go        
8   's  going berserk

How can I get the id values so that the result looks like this:

df_GO
  left   node   right    id
1          go   after     1
2   we     go             2   
3   he   went    bust     3
4          go     get     4
5   it     go             4   
6   'm gon na      go     5
7   na     go             5  
8   's  going berserk     6

Tags: r, regex, extract

Answer


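You are almost there. What you need to do is iterate over your data frame and perform the extraction on each row separately. That also lets you pick up and store the id of the row being processed.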
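
To see why the id is lost in the first place, here is a minimal sketch (not part of the original answer) that inspects the intermediate result. `str_extract_all()` returns a list with one element per sentence, and `unlist()` flattens that list, so the link to the row is gone. The names `matches` and `n_matches` are illustrative only, and `stringsAsFactors = FALSE` is added so the sketch behaves the same on older R versions.

```r
library(stringr)

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"),
  stringsAsFactors = FALSE)

lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# One list element per sentence, each holding that sentence's matches
matches <- str_extract_all(df$go, pattern_GO)

# Number of matches per sentence: rows 4 and 5 contribute two matches each
n_matches <- lengths(matches)
n_matches   # 1 1 1 2 2 1
```

Because some sentences yield more than one match, the flattened vector has eight entries for six rows, which is why the id has to be attached per row.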

To do this, we wrap your steps in a function and add the id to the data frame it builds.

The following uses the tidyverse packages, in particular {purrr} for the iteration.
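
As a quick orientation, this small sketch shows what `group_split()` from {dplyr} does with the example data: it cuts the data frame into a list with one piece per id. The name `pieces` is hypothetical and only used here for illustration.

```r
library(dplyr)

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"),
  stringsAsFactors = FALSE)

# Split into a list of one-row tibbles, one per distinct id
pieces <- df %>% group_split(id)

length(pieces)    # 6
pieces[[4]]$id    # 4
pieces[[4]]$go    # "go get it go"
```

Each piece still carries its own id column, which is what the function in the solution reads.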

library(tidyverse)

# Wrap the extraction in a function that is applied to one row at a time
extract_GO <- function(df_row) {
    data.frame(
        id = df_row$id,    # store the id of the row being processed (recycled for every match)
        # The three extractions below are your original code; only df$go became df_row$go
        left = unlist(str_extract_all(df_row$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
        node = unlist(str_extract_all(df_row$go, pattern_GO)),
        right = unlist(str_extract_all(df_row$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
    )
}

# Next, iterate over the rows with purrr
## Run df %>% group_split(id) on its own to see what group_split() returns

df %>% 
   group_split(id) %>%          # split the data frame into a list of pieces, one per id
   purrr::map_dfr(extract_GO)   # apply the function to each piece and row-bind the results

This yields:

  id left   node   right
1  1          go   after
2  2   we     go        
3  3   he   went    bust
4  4          go     get
5  4   it     go        
6  5   'm gon na      go
7  5   na     go        
8  6   's  going berserk
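
For comparison, a possible alternative that avoids splitting the data frame (this is not the approach of the answer above, just a sketch): keep the list returned by `str_extract_all()` and repeat each id as many times as its sentence has matches, using base R's `rep()` and `lengths()`. Only the node column is shown here; adding left and right in the same way assumes that they produce the same number of matches per sentence as node. The names `node_list` and `df_nodes` are illustrative.

```r
library(stringr)

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"),
  stringsAsFactors = FALSE)

lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# Keep the per-sentence list instead of flattening it right away
node_list <- str_extract_all(df$go, pattern_GO)

df_nodes <- data.frame(
  id = rep(df$id, times = lengths(node_list)),  # one id per match
  node = unlist(node_list),
  stringsAsFactors = FALSE)

df_nodes$id   # 1 2 3 4 4 5 5 6
```

A side effect of this variant is that sentences without any match simply contribute no rows, since their id is repeated zero times.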
