Extracting multiple substrings that follow certain characters in a string using stringi in R

Question

I have a large data frame in R with a column that looks like this, where each sentence is one row:

data <- data.frame(
   datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
   "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
   "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
   "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
   stringsAsFactors=FALSE)

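I want to extract all the words that follow "wiki/" and put them in another column.

So the first row should be "political_philosophy self-governance", the second row "hierarchy free_association_(communism_and_anarchism)", the third row "state_(polity)", and the fourth row "anti-statism".

I definitely want to use stringi because the data frame is huge. Thanks in advance for any help.

I have tried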

stri_extract_all_fixed(data$datalist, "wiki")[[1]]

but this only extracts the word "wiki" itself.

Tags: r, regex, string

Answer


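You can do this with a regular expression. By using stri_match_ instead of stri_extract_, we can use parentheses to create capture groups, which lets us pull out just part of the regex match. In the result below, each row of data yields one list item containing a character matrix, with the whole match in the first column and each capture group in the following columns: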

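library(stringi)
# The question's data frame is named `data`; the parentheses capture the text after "wiki/"
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match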

[[1]]
     [,1]                        [,2]                  
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance"      "self-governance"     

[[2]]
     [,1]                                              [,2]                                        
[1,] "wiki/stateless_society"                          "stateless_society"                         
[2,] "wiki/hierarchy"                                  "hierarchy"                                 
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"

[[3]]
     [,1]                  [,2]            
[1,] "wiki/state_(polity)" "state_(polity)"

[[4]]
     [,1]                [,2]          
[1,] "wiki/anti-statism" "anti-statism"

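You can then use sapply to reshape the result into whatever form you need, here by joining the second column (the capture group) of each matrix with spaces: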

match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))

[1] "political_philosophy self-governance"                                  
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                        
[4] "anti-statism"  
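
Since the question asks for the result in another column, here is a minimal sketch of one way to do that. It is a variant, not the method above: it swaps the capture group for a lookbehind with stri_extract_all_regex, so that only the text after "wiki/" is matched. The column name wiki_terms and the shortened three-row sample are assumptions made for illustration, and the hyphen is escaped only to make it unambiguously literal inside the character class.

```r
library(stringi)

# Shortened sample based on the question's data, plus one row without any
# "wiki/" link (hypothetical, added to show the no-match case).
data <- data.frame(
  datalist = c(
    "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
    "anarchism holds the wiki/state_(polity) to be undesirable",
    "a row with no links at all"
  ),
  stringsAsFactors = FALSE
)

# The lookbehind (?<=wiki/) requires "wiki/" before the match without
# including it, so each match is just the term itself.
links <- stri_extract_all_regex(data$datalist, "(?<=wiki/)[\\w\\-()]+")

# Rows with no match come back as NA; store an empty string for those,
# otherwise join the terms with spaces. wiki_terms is a hypothetical name.
data$wiki_terms <- vapply(
  links,
  function(x) if (all(is.na(x))) "" else paste(x, collapse = " "),
  character(1)
)

data$wiki_terms
```

Using + rather than * also avoids matching a bare "wiki/" that has nothing after it. Whether the lookbehind or the capture group is faster on a very large data frame is worth measuring on your own data.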
