r - ngram reference to docname in quanteda
Problem description
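I am trying to create a data table similar to the output of quanteda::textstat_frequency, but with one additional column, docnames, which is a string of the names of the documents that contain the specific token. For example: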
a_corpus <- quanteda::corpus(c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"))
ngrams_dfm <- quanteda::dfm(a_corpus, tolower = T, stem = F, ngrams = 2)
freq = textstat_frequency(ngrams_dfm)
# freq's header has feature, frequency, rank, docfreq, group
data.table(feature = featnames(ngrams_dfm )[1:50],
frequency = colSums(ngrams_dfm)[1:50],
doc_names = paste(docnames, collapse = ',')?, # what should be here?
keep.rownames = F,
stringsAsFactors = F)
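Solution
An alternative (opinionated) approach is to use the udpipe R package. The example below shows it. Its advantage is that you can easily select terms based on part-of-speech tags, or you can use it to select specific dependency parsing results, which work much better than bigrams (but that is another question).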
library(udpipe)
library(data.table)
txt <- c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working")
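## Annotate the text: rich output (lemma, upos, ...), but takes a while for large volumes of text
x <- udpipe(txt, "english", trace = TRUE)
x <- setDT(x)
## Build lemma bigrams within each sentence, joined with "-"
x <- x[, bigram_lemma := txt_nextgram(lemma, n = 2, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
## Part-of-speech tag of the following token within the same sentence
x <- x[, upos_next := txt_next(upos, n = 1), by = list(doc_id, paragraph_id, sentence_id)]
## Keep only adjectives that are directly followed by a noun
x_nouns <- subset(x, upos %in% c("ADJ") & upos_next %in% c("NOUN"))
View(x)
## Per-document frequencies of bigrams and single lemmas, then a document-term matrix
freqs <- document_term_frequencies(x, document = "doc_id", term = c("bigram_lemma", "lemma"))
dtm <- document_term_matrix(freqs)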
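To connect this back to the doc_names column asked about in the question, here is a minimal sketch of the aggregation step using data.table only. It is not the answerer's code. It assumes the documents are named doc1 to doc3 (hypothetical names) and uses plain whitespace splitting as a stand-in for the quanteda or udpipe tokenizer, so the counts may differ from what those packages produce.

```r
library(data.table)

# Same example texts as above; the names doc1..doc3 are assumed for illustration
txt <- c(
  doc1 = "some corpus text of no consequence that in practice is going to be very large",
  doc2 = "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
  doc3 = "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"
)

# One row per (document, bigram); whitespace tokenization is a simplification
bigrams <- rbindlist(lapply(names(txt), function(d) {
  tokens <- strsplit(tolower(txt[[d]]), "\\s+")[[1]]
  if (length(tokens) < 2) return(NULL)
  data.table(
    doc_id  = d,
    feature = paste(head(tokens, -1), tail(tokens, -1), sep = "_")
  )
}))

# Per bigram: total count, number of documents, and the document names collapsed into one string
freq <- bigrams[, .(
  frequency = .N,
  docfreq   = uniqueN(doc_id),
  doc_names = paste(unique(doc_id), collapse = ",")
), by = feature][order(-frequency, feature)]

head(freq, 10)
```

The same grouped aggregation should carry over to the udpipe output by grouping on bigram_lemma instead of feature, after dropping rows where the bigram is NA.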