r - How to keep the text IDs of removed texts in LDA
Problem description
I have a data frame like this:
dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)
I use this to clean the text for LDA:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
stopwords("en"),
stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2,3)) %>%
dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
dtm <- convert(myDfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 1234))
However, I noticed that when a document's text ends up containing nothing after cleaning, it is dropped from the dtm.
gammaDF <- as.data.frame(lda@gamma)
toptopics <- as.data.frame(cbind(document = row.names(gammaDF),
topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))])))
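So when I try to get the topics together with the matching ids from the first data frame, I run into a problem. What can I do to get the correct result? The output I am after looks like this:
id, topic
2 1
3 2
4 1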
Solution
Before converting to a dtm, you can get the IDs of any texts that contain 0 words using apply and which:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
stopwords("en"),
stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2,3)) %>%
dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
removed <- which(apply(myDfm, 1, sum) == 0)
Result:
> removed
text1
1
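To get from removed to the id/topic table asked for in the question, one option is to drop the removed rows from the original ids and pair what is left with the rows of the gamma matrix. The following is a minimal sketch in base R, not part of the original answer. It assumes the rows of lda@gamma come in the same order as the documents that survived the conversion. The gamma values here are made up so the snippet runs on its own; with the real objects you would use lda@gamma and dtext$id instead.

```r
# Stand-ins for the objects built above (values are illustrative assumptions)
ids     <- c(1, 2, 3, 4)          # would be dtext$id
removed <- c(text1 = 1L)          # as returned by which(apply(myDfm, 1, sum) == 0)
gamma   <- matrix(c(0.9, 0.1,     # made-up posterior weights standing in for lda@gamma:
                    0.2, 0.8,     # one row per surviving document,
                    0.7, 0.3),    # one column per topic
                  ncol = 2, byrow = TRUE)

# Row positions that survived; setdiff() also behaves when nothing was removed,
# whereas ids[-removed] would return an empty vector in that case
kept <- setdiff(seq_along(ids), removed)

# Most likely topic per surviving document, paired with its original id
result <- data.frame(id    = ids[kept],
                     topic = max.col(gamma, ties.method = "first"))

# Optional: keep every original id, with NA for documents that were dropped
all_docs <- merge(data.frame(id = ids), result, by = "id", all.x = TRUE)

print(result)
print(all_docs)
```

With a fitted model, topics(lda) from topicmodels returns the most likely topic per document and could replace the max.col() call.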