ngram referencing docnames in quanteda

Problem description

I am trying to create a data.table similar to the output of `quanteda::textstat_frequency`, but with one extra column, `docnames`, which is a comma-separated string of the names of the documents that contain each feature. For example:

a_corpus <- quanteda::corpus(c("some corpus text of no consequence that in practice is going to be very large",
                                   "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
                                   "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"))

# note: in quanteda >= 3, ngrams are formed at the tokens stage, not inside dfm()
ngrams_dfm <- quanteda::dfm(quanteda::tokens_ngrams(quanteda::tokens(a_corpus), n = 2), tolower = TRUE)
freq <- quanteda.textstats::textstat_frequency(ngrams_dfm)
# freq's columns are feature, frequency, rank, docfreq, group

data.table::data.table(feature = quanteda::featnames(ngrams_dfm)[1:50],
                       frequency = colSums(ngrams_dfm)[1:50],
                       doc_names = paste(docnames, collapse = ','), # what should be here?
                       keep.rownames = FALSE,
                       stringsAsFactors = FALSE)
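
One way the missing `doc_names` expression could be filled in directly from the dfm is to collapse `quanteda::docnames()` over the nonzero entries of each feature column. This is only a sketch against the `ngrams_dfm` object built above, and it materializes the dfm as a dense matrix, so it is suited to a small number of top features rather than a very large dfm:

```r
library(quanteda)

# for each feature, paste the names of the documents where its count is > 0
m <- as.matrix(ngrams_dfm)
doc_names_per_feature <- apply(m, 2, function(counts) {
  paste(docnames(ngrams_dfm)[counts > 0], collapse = ",")
})
# doc_names_per_feature is a named character vector, indexed by feature,
# so doc_names_per_feature[featnames(ngrams_dfm)[1:50]] fills the column above
```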

Tags: r, quanteda, dfm

Solution


An alternative (opinionated) approach could be to use the udpipe R package. Example below - its advantage is that it makes selection based on part-of-speech tags easy, and you could also use it to pick out specific dependency-parse results, which are far better than plain bigrams (but that is another question).

library(udpipe)
library(data.table)
txt <- c("some corpus text of no consequence that in practice is going to be very large",
       "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
       "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working")
x <- udpipe(txt, "english", trace = TRUE) ## rich output, but takes a while for large volumes of text
setDT(x)
## bigrams of consecutive lemmas within each sentence
x[, bigram_lemma := txt_nextgram(lemma, n = 2, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
## part-of-speech tag of the next token, for filtering bigrams by POS pattern
x[, upos_next := txt_next(upos, n = 1), by = list(doc_id, paragraph_id, sentence_id)]
## e.g. keep only adjective-noun bigrams
x_adj_noun <- subset(x, upos %in% c("ADJ") & upos_next %in% c("NOUN"))
View(x)
freqs <- document_term_frequencies(x, document = "doc_id", term = c("bigram_lemma", "lemma"))
dtm <- document_term_matrix(freqs)
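
To get the comma-separated `doc_names` column the question asks for from this output, the annotation table can be aggregated per bigram. A sketch, assuming the `x` data.table built above (`frequency` here counts bigram occurrences across all documents):

```r
library(data.table)

doc_names_dt <- x[!is.na(bigram_lemma),
                  .(frequency = .N,
                    doc_names = paste(unique(doc_id), collapse = ",")),
                  by = .(feature = bigram_lemma)]
setorder(doc_names_dt, -frequency)
head(doc_names_dt, 10)  # top-ten bigrams with the documents they occur in
```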
