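r - How to summarize email text using LDA in R
Question
I am working on complaint data analysis and am adapting a text summarization technique to cut out unnecessary text and show only the useful parts.
I am using LDA (Latent Dirichlet Allocation) in R for text summarization, but I cannot get it to work to its full potential.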
library(textmineR) # provides CreateTcm, CreateDtm, FitLdaModel, CalcHellingerDist
library(igraph)
library(iterators)
# create a TCM using skip-grams, with a 10-word window
tcm <- CreateTcm(doc_vec = datacopy$Text, skipgram_window = 10,
                 verbose = FALSE, cpus = 2)
# LDA to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
                          burnin = 180, alpha = 0.1, beta = 0.05,
                          optimize_alpha = TRUE, calc_likelihood = FALSE,
                          calc_coherence = FALSE, calc_r2 = FALSE, cpus = 2)
# Summarizer function
summarizer <- function(doc, gamma) {
  # handle multiple docs at once
  if (length(doc) > 1)
    return(sapply(doc, function(d) try(summarizer(d, gamma))))
  # parse it into sentences
  sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[1]]
  names(sent) <- seq_along(sent) # so we know index and order
  # embed the sentences in the model
  e <- CreateDtm(sent, ngram_window = c(1, 1), verbose = FALSE, cpus = 2)
  # remove any documents with 2 or fewer words
  # e <- e[rowSums(e) > 2, ]
  vocab <- intersect(colnames(e), colnames(gamma))
  e <- e / rowSums(e)
  e <- e[, vocab] %*% t(gamma[, vocab])
  e <- as.matrix(e)
  # get the pairwise distances between each embedded sentence
  e_dist <- CalcHellingerDist(e)
  # turn into a similarity matrix
  g <- (1 - e_dist) * 100
  # we don't need sentences connected to themselves
  diag(g) <- 0
  # turn into a nearest-neighbor graph
  g <- apply(g, 1, function(x) {
    x[x < sort(x, decreasing = TRUE)[3]] <- 0
    x
  })
  # by taking pointwise max, we'll make the matrix symmetric again
  g <- pmax(g, t(g))
  g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
  # calculate eigenvector centrality
  ev <- evcent(g)
  # format the result
  result <- sent[names(ev$vector)[order(ev$vector, decreasing = TRUE)[1:3]]]
  result <- result[order(as.numeric(names(result)))]
  paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs,gamma = embeddings$gamma)
sums
Errors:
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
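Actual text: Dealing with loose manhole covers is the council's responsibility. Could you please provide an update on the next steps the council is taking. ** Trail mail text follows, about 50 lines of text **
Summary text: Dealing with loose manhole covers is the council's responsibility. I have read the email subject, please get in touch on the phone number provided by ABC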
Solution
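One possible cause, offered as a sketch rather than a confirmed diagnosis: the "argument is of length zero" error is consistent with an email that splits into a single sentence. In that case the similarity matrix is 1 x 1, `apply()` returns a plain vector instead of a matrix, and a later `nrow()` call on it yields `NULL`. The base R snippet below reproduces that dimension drop and defines `can_rank()`, a hypothetical helper (not part of textmineR or igraph) that checks whether a document has more non-empty sentences than the number to be selected.

```r
# Base R only; textmineR and igraph are not needed to see the failure mode.

# A one-sentence document leads to a 1 x 1 similarity matrix.
g <- matrix(0, nrow = 1, ncol = 1)

# apply() over the rows of a 1 x 1 matrix returns a plain vector,
# so the result no longer has dimensions.
g_knn <- apply(g, 1, function(x) x)
is.matrix(g_knn)       # FALSE
is.null(nrow(g_knn))   # TRUE

# Hypothetical guard (an assumption, not from the original code):
# TRUE only if there are more non-empty sentences than we want to keep.
can_rank <- function(sent, top_n = 3) {
  sent <- sent[nzchar(trimws(sent))]  # drop empty or whitespace-only pieces
  length(sent) > top_n
}

can_rank(c("Only one sentence."))
can_rank(c("One.", "Two.", "Three.", "Four."))
```

Inside `summarizer()`, a line such as `if (!can_rank(sent)) return(paste(sent, collapse = " "))` placed right after the sentence split would return short emails unchanged instead of failing. Whether this also removes the `rowSums` error depends on what those particular emails contain, which the question does not show.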