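r - How to select the longest ngram among repeated strings in R?
Problem description
I have a dataset that looks like the following (only with more rows):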
x = c("abov level", "abov level consist", "abov level consist price",
"abov level consist price stabil", "abov level consist price stabil protract",
"abov level consist price stabil protract period", "abov level consist price stabil protract period time",
"abov level consist price stabil sinc", "abov level consist price stabil sinc last",
"abov level consist price stabil sinc last autumn", "abov level consist price stabil some",
"abov level consist price stabil some time", "abov over", "abov over come",
"abov over come month", "abov precis", "abov precis level", "abov precis level depend",
"abov precis level depend futur", "abov precis level depend futur energi",
"abov precis level depend futur energi price", "abov precis level depend futur energi price develop"
)
[1] "abov level"
[2] "abov level consist"
[3] "abov level consist price"
[4] "abov level consist price stabil"
[5] "abov level consist price stabil protract"
[6] "abov level consist price stabil protract period"
[7] "abov level consist price stabil protract period time"
[8] "abov level consist price stabil sinc"
[9] "abov level consist price stabil sinc last"
[10] "abov level consist price stabil sinc last autumn"
[11] "abov level consist price stabil some"
[12] "abov level consist price stabil some time"
[13] "abov over"
[14] "abov over come"
[15] "abov over come month"
[16] "abov precis"
[17] "abov precis level"
[18] "abov precis level depend"
[19] "abov precis level depend futur"
[20] "abov precis level depend futur energi"
[21] "abov precis level depend futur energi price"
[22] "abov precis level depend futur energi price develop"
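As you can see, there is a clear pattern: one word at a time is added to the previous ngram, until the base changes and the process starts over. Let me take the first "block" as an example: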
[1] "abov level"
[2] "abov level consist"
[3] "abov level consist price"
[4] "abov level consist price stabil"
[5] "abov level consist price stabil protract"
[6] "abov level consist price stabil protract period"
[7] "abov level consist price stabil protract period time"
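For each "block" like the one above, I want to keep only the longest sentence/ngram. In the case above, I would keep only line 7. Doing this for every block, I would get: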
[1] "abov level consist price stabil protract period time"
[2] "abov level consist price stabil sinc last autumn"
[3] "abov level consist price stabil some time"
[4] "abov over come month"
[5] "abov precis level depend futur energi price develop"
Can anyone help me do this?
Thanks!
Solution
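We can use filter from dplyr together with lead: a row is kept only when the string that follows it is not longer than the row itself, which marks the end of a block.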
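library(dplyr)

# Keep a row only if the following string is not longer than it;
# default = last(x) makes the final row compare with itself, so it is kept.
tibble(x) %>%
  filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0)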
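The length comparison above works when every new block starts with a string that is no longer than the last string of the previous block, as is the case in the sample data. If that cannot be guaranteed, one possible variation is to test whether the next string actually extends the current one by another word. The following is only a sketch in base R: the helper name keep_longest and the small sample vector x_small are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): keep x[i] unless the
# next element is x[i] followed by a space and at least one more word.
keep_longest <- function(x) {
  if (length(x) == 0) return(x)
  # Next element for each position; the last position has no successor.
  nxt <- c(x[-1], NA_character_)
  # TRUE where the next string extends the current one word-wise.
  extended <- !is.na(nxt) & startsWith(nxt, paste0(x, " "))
  x[!extended]
}

# Small made-up sample with the same shape as the data in the question.
x_small <- c("abov level", "abov level consist",
             "abov over", "abov over come",
             "abov precis")

result <- keep_longest(x_small)
print(result)
```

Like the dplyr version, this assumes the strings are ordered so that each block grows one word at a time; it only changes how the end of a block is detected.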