Column `token` must be length 2 (the number of rows) or one, not 3

Problem description

I am trying to split long texts into sentences:

library(dplyr)

dat <- data.frame(text = c("hi i am Apple, not an orange. that is an orange","hello i am banana, not an pineapple. that is an pineapple"),
                  received = c(1, 0))

dat <- dat %>%
  mutate(token = sent_detect(text, language = "en"))

But I get this error:

Error: Column `token` must be length 2 (the number of rows) or one, not 3

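This happens because the sentences returned by the sent_detect function do not map back onto the rows of the original data frame: the function is called once on the whole text column and returns 3 sentences, while mutate expects 2 values (one per row) or 1.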

library(openNLP)
library(NLP)

sent_detect <- function(text, language) {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}
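
The length mismatch can be reproduced without openNLP. The sketch below uses a hypothetical stand-in, split_collapsed, which is not part of any package: it assumes that as.String() joins a character vector into a single string before the text is split, and it splits on periods with a simple regex instead of the Maxent model. Its sentence count will therefore differ from the 3 reported above; the point is only that the count is unrelated to the number of rows.

```r
# Hypothetical stand-in for sent_detect() (assumption: the whole input vector
# is collapsed into one string first, then split into sentences).
split_collapsed <- function(text) {
  joined <- paste(text, collapse = "\n")
  trimws(unlist(strsplit(joined, "(?<=\\.)\\s+|\n", perl = TRUE)))
}

text <- c("hi i am Apple, not an orange. that is an orange",
          "hello i am banana, not an pineapple. that is an pineapple")

# Whole column at once: one flat vector whose length need not match the row count
whole <- split_collapsed(text)
length(whole)

# One call per element: a list with exactly one entry per row
per_row <- lapply(text, split_collapsed)
length(per_row)
lengths(per_row)
```

Calling the splitter once per element always yields a list as long as the input, which is the shape mutate can store as a list column.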

I have been looking at purrr::map, but I am not sure how to apply it in this case.

I am expecting a result that looks like this:

text                                                    received    token
"hi i am Apple, not an orange. that is an orange"           1       "hi i am Apple, not an orange."
"hi i am Apple, not an orange. that is an orange"           1       "that is an orange"
"hello i am banana, not an pineapple. that is an pineapple" 0       "hello i am banana, not an pineapple."
"hello i am banana, not an pineapple. that is an pineapple" 0       "that is an pineapple"

Tags: r

Solution


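tidyr + purrr will get you there. map creates a nested output (a list column with one vector of sentences per row), which you can lift back to the top level with unnest.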

library(dplyr)
library(tidyr)

dat %>% 
  mutate(sentences = purrr::map(text, sent_detect, "en")) %>% 
  unnest(sentences)


# A tibble: 4 x 3
  text                                                      received sentences                           
  <chr>                                                        <dbl> <chr>                               
1 hi i am Apple, not an orange. that is an orange                  1 hi i am Apple, not an orange.       
2 hi i am Apple, not an orange. that is an orange                  1 that is an orange                   
3 hello i am banana, not an pineapple. that is an pineapple        0 hello i am banana, not an pineapple.
4 hello i am banana, not an pineapple. that is an pineapple        0 that is an pineapple   
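
To see what the map + unnest combination is doing, here is a rough base R sketch of the same two steps. The helpers split_sentences and expand_rows are hypothetical names introduced only for this illustration, and split_sentences is a regex stand-in for sent_detect so the example runs without openNLP or Java.

```r
# Hypothetical stand-in for sent_detect(): split after a period followed by whitespace.
split_sentences <- function(text) {
  strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]]
}

dat <- data.frame(
  text = c("hi i am Apple, not an orange. that is an orange",
           "hello i am banana, not an pineapple. that is an pineapple"),
  received = c(1, 0),
  stringsAsFactors = FALSE
)

# Step 1 (the role of purrr::map): one character vector of sentences per row
pieces <- lapply(dat$text, split_sentences)

# Step 2 (the role of unnest): repeat each row once per sentence, then attach them
expand_rows <- function(df, pieces, name = "token") {
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), , drop = FALSE]
  out[[name]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

result <- expand_rows(dat, pieces)
print(result)
```

The tidyverse version above is shorter and keeps the other columns aligned for you; the base R form is mainly useful for understanding why the row count grows from 2 to 4.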
