How to do named entity recognition (NER) using quanteda?

Question

I have a data frame with text:

df = data.frame(id=c(1,2), text = c("My best friend John works and Google", "However he would like to work at Amazon as he likes to use python and stay at Canada"))

Without any preprocessing, how can I extract the named entities from each text?

Example result:

dfresults = data.frame(id=c(1,2), ner_words = c("John, Google", "Amazon, python, Canada"))

Tags: r, quanteda

Answer


You can do this without quanteda, using the spacyr package, a wrapper around the spaCy library mentioned in the linked article.

Here, I have edited your input data.frame slightly.

df <- data.frame(id = c(1, 2), 
                 text = c("My best friend John works at Google.", 
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)

Then:

library("spacyr")
library("dplyr")

# -- need to do these before the next function will work:
# spacy_install()
# spacy_download_langmodel(model = "en_core_web_lg")

spacy_initialize(model = "en_core_web_lg")
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 2.0.10, language model: en_core_web_lg)
#> (python options: type = "condaenv", value = "spacy_condaenv")

txt <- df$text
names(txt) <- df$id

spacy_parse(txt, lemma = FALSE, entity = TRUE) %>%
    entity_extract() %>%
    group_by(doc_id) %>%
    summarize(ner_words = paste(entity, collapse = ", "))
#> # A tibble: 2 x 2
#>   doc_id ner_words             
#>   <chr>  <chr>                 
#> 1 1      John, Google          
#> 2 2      Amazon, Python, Canada
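
If you want the result attached to the original data.frame, as in the `dfresults` from the question, the collapse-and-join step can be sketched in base R. This is only a minimal sketch: `ents` is a hypothetical hand-written stand-in for what `entity_extract()` returns (I am assuming it has at least the columns `doc_id` and `entity`), so the snippet runs without spaCy installed. In real use you would replace it with the actual output of the pipeline.

```r
# Hypothetical stand-in for the output of entity_extract(); the column
# names doc_id and entity are an assumption about what spacyr returns.
ents <- data.frame(
  doc_id = c("1", "1", "2", "2", "2"),
  entity = c("John", "Google", "Amazon", "Python", "Canada"),
  stringsAsFactors = FALSE
)

# The edited input data.frame from the answer.
df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works at Google.",
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)

# Collapse the entities of each document into one comma-separated string.
ner <- aggregate(entity ~ doc_id, data = ents, FUN = paste, collapse = ", ")
names(ner) <- c("id", "ner_words")

# doc_id comes back as character, so convert it to match df$id.
ner$id <- as.numeric(ner$id)

# Join on id; all.x = TRUE keeps rows in which no entity was found (as NA).
dfresults <- merge(df, ner, by = "id", all.x = TRUE)
dfresults[, c("id", "ner_words")]
```

Using `all.x = TRUE` is a design choice: texts without any recognized entity stay in the result with `NA` in `ner_words` instead of being dropped silently.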
