首页 > 解决方案 > 在数据帧的子集上运行聚类函数并为数据帧中的每个子集写入结果

问题描述

数据

我在 R 中有一个具有以下结构的数据框:

ID   group           text
100    1    An apple is a sweet, edible fruit produced by an apple tree.
103    1    An apple is a sweet, edible fruit produced by an apple tree.
105    1    Some dog breeds show more variation in size than other dog breeds.
106    1    An apple is a sweet, edible fruit produced by an apple tree.
107    1    An apple is a sweet, edible fruit produced by an apple tree.
209    1    Some dog breeds show more variation in size than other dog breeds.
300    1    Some dog breeds show more variation in size than other dog breeds.
501    1    An apple is a sweet, edible fruit produced by an apple tree.
503    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711    2    Pizza is a savory dish of Italian origin.
799    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.

可以使用以下代码复制:

test_df <- data.frame(
  "ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
  "group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  "text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)

实际上,每个主题的文本都略有不同,有几十万篇,分布在几十个群体中。

我正在尝试做的事情

我正在尝试编写一个执行以下操作的函数:

以下是分析后数据框中两行的示例:

ID   group    topic           text
100    1      apple    An apple is a sweet, edible fruit produced by an apple tree.
105    1       dog     Some dog breeds show more variation in size than other dog breeds.

到目前为止我所拥有的

我通常可以使用以下代码在完整的数据帧上运行此类函数(不按组进行子集化):

# Preparing the texts

library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Identifying topics

library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')

# Inputting the topic classification into the dataframe

test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])

我试图把它变成一个函数,然后使用以下代码按子集在数据帧上运行该函数:

library(tm)
library(topicmodels)

topic_identifier <- function(text) {
  corpus <- Corpus(VectorSource(text))
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    corpus <- tm_map(corpus, stemDocument, language = 'english')
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
  TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
      lda.output <- LDA(TF, k=2, method = 'Gibbs')
  test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
    }

by(test_df$text, test_df$group, topic_identifier)

但这不允许我为原始 df 中的每个子集保存相关输出。

标签: rfunctiondataframesubsetcluster-analysis

解决方案


by函数在接收数据框作为输入而不是列向量text时效果最佳。然后,您可以操作此数据框以返回非原始test_df。本质上,整个数据帧上的完全相同的过程被保留以在子数据帧上运行。

此外,您需要分配结果<-以构建一个对象列表,这些对象甚至可以在末尾用do.call+绑定在一起rbind(假设每个数据框都保持相同的列数和名称):

topic_identifier <- function(sub_df) { 
   corpus <- Corpus(VectorSource(sub_df$text)) 
   corpus <- tm_map(corpus, removeWords, stopwords('english')) 
   corpus <- tm_map(corpus, stemDocument, language = 'english') 
   corpus <- tm_map(corpus, removePunctuation) 
   corpus <- tm_map(corpus, stripWhitespace) 

   TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
   lda.output <- LDA(TF, k=2, method = 'Gibbs') 

   sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)]) 
   return(sub_df)
} 

# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)

# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))

推荐阅读