Is there an R function to split sentences?

Problem description

I have several unstructured sentences like the ones below. Description is the column name.

Description

Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only

I want to split each sentence into columns Col1 through Col5 and count the occurrences, like this:

Col1             Col2          Col3             Col4
Automatic_lever  lever_for     for_a            a_machine
Vaccum_chamber   chamber_with  with_additional  additional_spare
Glove_box        box_for       for_R&D          R&D
The_Mini         Mini_Guage    Guage_5          5_sets
Vacuum_chamber   chamber_only  only
Automatic_lever  lever_only    only

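Also, from the columns above, can I get the number of occurrences of these words? For example, Vaccum_chamber and Automatic_lever are repeated twice here. Likewise, how often do the other words occur?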

Tags: r

Solution


Here is a tidyverse option:

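library(tidyverse)

# Use the same number of sliding windows for every sentence:
# one fewer than the word count of the longest sentence (4 for the sample data).
n_win <- max(lengths(str_split(df$Description, " "))) - 1

df %>%
    rowid_to_column("row") %>%
    mutate(words = map(str_split(Description, " "), function(x) {
        # Indexing past the end of x gives NA, so a short sentence yields
        # "word_NA" for its last word and "NA_NA" for any window after that.
        map_chr(seq_len(n_win), function(i) paste0(x[i:(i + 1)], collapse = "_"))
    })) %>%
    unnest(words) %>%
    group_by(row) %>%
    mutate(
        words = str_replace(words, "_NA$", ""),  # "only_NA" -> "only", "NA_NA" -> "NA"
        col = paste0("Col", 1:n())) %>%
    filter(words != "NA") %>%                    # drop windows entirely past the end
    spread(col, words, fill = "")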
## A tibble: 6 x 6
## Groups:   row [6]
#    row Description                Col1        Col2       Col3       Col4
#  <int> <fct>                      <chr>       <chr>      <chr>      <chr>
#1     1 Automatic lever for a mac… Automatic_… lever_for  for_a      a_machine
#2     2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3     3 Glove box for R&D          Glove_box   box_for    for_R&D    R&D
#4     4 The Mini Guage 5 sets      The_Mini    Mini_Guage Guage_5    5_sets
#5     5 Vacuum chamber only        Vacuum_cha… chamber_o… only       ""
#6     6 Automatic lever only       Automatic_… lever_only only       ""

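Explanation: we split each Description on a space " ", then use a sliding-window approach to join every two adjacent words with "_". A window that runs past the end of a sentence picks up NA, which is why the code strips "_NA" and drops the "NA" entries; this is how a trailing word such as only ends up in a column of its own. The rest is just a long-to-wide reshape.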
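
To make the sliding window concrete, here is a minimal base R sketch for a single sentence from the sample data. The window count n_win = 4 is an assumption (one fewer than the longest sample sentence), and vapply stands in for the purrr calls used in the pipeline.

```r
# One sentence from the sample data, split into words
x <- strsplit("Vacuum chamber only", " ", fixed = TRUE)[[1]]

# Assumed window count: one fewer than the longest sample sentence
n_win <- 4

# Window i covers words i and i + 1; positions past the end of x are NA,
# and paste0() turns NA into the string "NA"
windows <- vapply(
    seq_len(n_win),
    function(i) paste0(x[i:(i + 1)], collapse = "_"),
    character(1)
)
print(windows)

# Strip the "_NA" tail, then drop windows that lay entirely past the end
cleaned <- sub("_NA$", "", windows)
cleaned <- cleaned[cleaned != "NA"]
print(cleaned)
```

The third window is where the lone trailing word comes from: it starts on the last word and has nothing to pair it with.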

It's not pretty, but it reproduces your expected output. Instead of the manual sliding-window approach, you could also use zoo::rollapply.
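
The question also asks how often each pair occurs, which the pipeline above does not compute. One possible way to get the counts is sketched below in base R. The helper name bigrams is hypothetical, and the sketch counts only true two-word pairs, so a lone trailing word such as only is not included.

```r
# Sample sentences from the question (spelling kept as posted)
sentences <- c(
    "Automatic lever for a machine",
    "Vaccum chamber with additional spare",
    "Glove box for R&D",
    "The Mini Guage 5 sets",
    "Vacuum chamber only",
    "Automatic lever only"
)

# Hypothetical helper: join each pair of adjacent words with "_"
bigrams <- function(sentence) {
    x <- strsplit(sentence, " ", fixed = TRUE)[[1]]
    if (length(x) < 2) return(character(0))
    paste(head(x, -1), tail(x, -1), sep = "_")
}

# Pool the pairs from all sentences into one character vector
all_bigrams <- unlist(lapply(sentences, bigrams))

# Frequency table, most common pair first
counts <- sort(table(all_bigrams), decreasing = TRUE)
print(head(counts))
```

With the data as posted, Automatic_lever is counted twice, while Vaccum_chamber and Vacuum_chamber are counted once each because they are spelled differently.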


Sample data

df <- read.table(text =
    "Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = TRUE)
