r - 在数据帧的子集上运行聚类函数并为数据帧中的每个子集写入结果
问题描述
数据
我在 R 中有一个具有以下结构的数据框:
ID group text
100 1 An apple is a sweet, edible fruit produced by an apple tree.
103 1 An apple is a sweet, edible fruit produced by an apple tree.
105 1 Some dog breeds show more variation in size than other dog breeds.
106 1 An apple is a sweet, edible fruit produced by an apple tree.
107 1 An apple is a sweet, edible fruit produced by an apple tree.
209 1 Some dog breeds show more variation in size than other dog breeds.
300 1 Some dog breeds show more variation in size than other dog breeds.
501 1 An apple is a sweet, edible fruit produced by an apple tree.
503 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711 2 Pizza is a savory dish of Italian origin.
799 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
可以使用以下代码复制:
test_df <- data.frame(
"ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
"group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
"text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)
实际上,每个主题的文本都略有不同,有几十万篇,分布在几十个群体中。
我正在尝试做的事情
我正在尝试编写一个执行以下操作的函数:
- 对于数据框中的每个组,比较该组中的所有文本,并确定主要词汇主题。
- 然后,将每个文本的相关主题作为数据框中的新列输入。
以下是分析后数据框中两行的示例:
ID group topic text
100 1 apple An apple is a sweet, edible fruit produced by an apple tree.
105 1 dog Some dog breeds show more variation in size than other dog breeds.
到目前为止我所拥有的
我通常可以使用以下代码在完整的数据帧上运行此类函数(不按组进行子集化):
# Preparing the texts
library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# Identifying topics
library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
# Inputting the topic classification into the dataframe
test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
我试图把它变成一个函数,然后使用以下代码按子集在数据帧上运行该函数:
library(tm)
library(topicmodels)
topic_identifier <- function(text) {
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
}
by(test_df$text, test_df$group, topic_identifier)
但这不允许我为原始 df 中的每个子集保存相关输出。
解决方案
该by
函数在接收数据框作为输入而不是列向量text时效果最佳。然后,您可以操作此数据框以返回非原始test_df。本质上,整个数据帧上的完全相同的过程被保留以在子数据帧上运行。
此外,您需要分配结果<-
以构建一个对象列表,这些对象甚至可以在末尾用do.call
+绑定在一起rbind
(假设每个数据框都保持相同的列数和名称):
topic_identifier <- function(sub_df) {
corpus <- Corpus(VectorSource(sub_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)])
return(sub_df)
}
# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)
# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))
推荐阅读
- python - Pandas get_dummies 不断在列名后使用数字创建额外的列
- javascript - keyDown 处理程序在自定义 React 钩子中被触发两次而不是一次
- python - 如何使用元数据将音频文件放在目录中?
- subquery - 如何使用不同的标准多次拉取 MaxDate
- javascript - 我怎样才能让discord js随机选择两张图片?
- java - Spring.schemas 在提取的 jar 中不起作用
- java - 关于效率: .filter(Optional::isPresent).map(Optional::get) 不是比 .flatmap(Optional::stream) 好吗?
- python - Pandas:解析损坏的 .xls 文件
- java - 如何在 Java 16 中将 JAR 动态加载到运行时?
- sql - SQL 查询 - 只显示非空行