How to select the longest ngram among repeated strings in R?

Problem description

I have a dataset that looks like the following (only with many more rows):

x = c("abov level", "abov level consist", "abov level consist price", 
"abov level consist price stabil", "abov level consist price stabil protract", 
"abov level consist price stabil protract period", "abov level consist price stabil protract period time", 
"abov level consist price stabil sinc", "abov level consist price stabil sinc last", 
"abov level consist price stabil sinc last autumn", "abov level consist price stabil some", 
"abov level consist price stabil some time", "abov over", "abov over come", 
"abov over come month", "abov precis", "abov precis level", "abov precis level depend", 
"abov precis level depend futur", "abov precis level depend futur energi", 
"abov precis level depend futur energi price", "abov precis level depend futur energi price develop"
)

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"
 [8] "abov level consist price stabil sinc"                
 [9] "abov level consist price stabil sinc last"           
[10] "abov level consist price stabil sinc last autumn"    
[11] "abov level consist price stabil some"                
[12] "abov level consist price stabil some time"           
[13] "abov over"                                           
[14] "abov over come"                                      
[15] "abov over come month"                                
[16] "abov precis"                                         
[17] "abov precis level"                                   
[18] "abov precis level depend"                            
[19] "abov precis level depend futur"                      
[20] "abov precis level depend futur energi"               
[21] "abov precis level depend futur energi price"         
[22] "abov precis level depend futur energi price develop"

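As you can see, there is a clear pattern: one word at a time is appended to the previous ngram, until the base changes and the process starts over. Take the first "block" as an example: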

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"

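For each "block" like the one above, I want to keep only the longest sentence/ngram. In the case above, I would keep only line 7. Doing this for every block, I would get: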

    
 [1] "abov level consist price stabil protract period time"           
 [2] "abov level consist price stabil sinc last autumn"    
 [3] "abov level consist price stabil some time"                                              
 [4] "abov over come month"                                      
 [5] "abov precis level depend futur energi price develop"

Can anyone help me do this?

Thanks!

Tags: r, string, dataframe, substring, gsub

Solution


We can use filter from dplyr together with lead, keeping each string whose successor is not longer than it:

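library(dplyr)
# Keep a row only when the next string is not longer than the current one,
# which marks the current string as the last (longest) entry of its block.
# default = last(x) makes the final row compare against itself, so it is kept.
tibble(x) %>%
  filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0)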
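
The length comparison above relies on every new block starting with a string that is no longer than the last string of the previous block, which holds for the sample data. As a possible alternative, here is a minimal base R sketch that instead checks whether the next string extends the current one by at least one word. The function name keep_longest and the small ngrams vector are hypothetical, introduced only for illustration; startsWith requires R 3.3.0 or later.

```r
# Hypothetical helper: keep each string that is NOT a word-level prefix
# of the string that immediately follows it.
keep_longest <- function(x) {
  # Successor of each element; the last element has none.
  nxt <- c(x[-1], NA_character_)
  # TRUE when the next string is the current string plus a space and more text.
  extended <- !is.na(nxt) & startsWith(nxt, paste0(x, " "))
  x[!extended]
}

# Small illustrative input (a shortened stand-in for the x in the question).
ngrams <- c("abov level", "abov level consist", "abov level consist price",
            "abov over", "abov over come", "abov precis")

keep_longest(ngrams)
```

Applied to the full x from the question, the same call would be keep_longest(x). Because the check is based on prefixes rather than character counts, a block is still closed correctly when the next block happens to start with a longer string.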
